Logistic regression on data including categorical variables
library("dplyr")
library("ProjectTemplate")
load.project()
Realistic data analysis!
Let’s consider a logistic regression problem.
We are given a medical dataset from a health insurance company and want
to explain the withdrawal of the insured from the insurance using the
following variables: gender, region, age, last year’s costs.
First we have to produce a data set where those variables are well defined.
The data set
Reading the preparatory data sets
library("readr")
cache("Vstichtag20090201", {
  require(tcltk)
  msgBox <- tkmessageBox(title = "User action required",
                         message = "Please select the data 'Vstichtag20090201'.",
                         icon = "info", type = "ok")
  Vstichtag20090201 <- read.table(file.choose(), header = TRUE, sep = "\t")
})
## Loading required package: formatR
## Skipping cache update for Vstichtag20090201: up to date
classes <- sapply(Vstichtag20090201, class)
head(Vstichtag20090201, 3)
## versnr txfanr VEGDAT VEGESC VESPCD VEEDAT VEECOD VEADAT VEAGR VERSFA
## 1 1000004 1000004 19281205 M F 19700701 11 0 NA 1000004
## 2 1000152 1000152 19140830 M F 19830101 11 0 NA 1000152
## 3 1000276 1000276 19170319 W D 19721201 11 0 NA 1000276
## VEKANT VEVERTR VENBSV VEPFNR X_18 X_82 X_46 X_47 X_35 X_20 X_86 X_84 X_85
## 1 GE NA 122 1 0 0 0 0 0 0 0 0
## 2 VD NA 110 1 0 0 0 0 0 0 0 0
## 3 BE NA 157 1 0 0 0 0 0 0 0 0
## X_83 X_25 X_05 X_03 Online Jahrgang Alter ZielKKnr Stichtag Standdatum A1 AP
## 1 0 0 0 0 0 1928 81 NA 20090201 20090201 1 1
## 2 0 0 0 0 0 1914 95 NA 20090201 20090201 1 1
## 3 0 0 0 0 0 1917 92 NA 20090201 20090201 1 1
## U Z H3 H2 K H1 F H4 A5 A6 N A4 HT B J X APC S P AE KTI C XL KUTI A3 G T A H
## 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## 2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## 3 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## ZV nurOKP nurZV beide nurOKP_nurZV_beide okp
## 1 1 0 0 1 1 1
## 2 1 0 0 1 1 1
## 3 1 0 0 1 1 1
#=======================
cache("Vstichtag20100201", {
  require(tcltk)
  msgBox <- tkmessageBox(title = "User action required",
                         message = "Please select the data 'Vstichtag20100201'.",
                         icon = "info", type = "ok")
  Vstichtag20100201 <- read.table(file.choose(), header = TRUE, sep = "\t")
})
## Skipping cache update for Vstichtag20100201: up to date
classes <- sapply(Vstichtag20100201, class)
#=======================
cache("Altersgruppen2009", {
  require(tcltk)
  msgBox <- tkmessageBox(title = "User action required",
                         message = "Please select the data 'Altersgruppen2009'.",
                         icon = "info", type = "ok")
  Altersgruppen2009 <- read.table(file.choose(), header = TRUE, sep = "\t")
})
## Skipping cache update for Altersgruppen2009: up to date
head(Altersgruppen2009,3)
## Jahrgang Altersgruppe Alter Kat AGvon
## 1 2009 00-18 0 Kinder 0
## 2 2008 00-18 1 Kinder 0
## 3 2007 00-18 2 Kinder 0
#=======================
cache("lexj2009", {
  require(tcltk)
  msgBox <- tkmessageBox(title = "User action required",
                         message = "Please select the data 'lexj2009'.",
                         icon = "info", type = "ok")
  lexj2009 <- read_csv2(file.choose(), col_names = TRUE, na = "NA", col_types = NULL, n_max = Inf)
})
## Skipping cache update for lexj2009: up to date
head(lexj2009,3)
## # A tibble: 3 × 80
## ATPGID KJ GJ MANDAT GSAG VERS DEBINR ADRSQN TUPNR WKT PLZ
## <lgl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 NA 2009 2009 0102 0027 3203956 1115469 99 996 GE 121300
## 2 NA 2004 2009 0102 0027 3203956 1115469 99 996 GE 121300
## 3 NA 2009 2009 0102 0027 2650088 1086816 99 996 GE 122200
## # ℹ 69 more variables: CODESEX <chr>, KAT <chr>, ALTGR <chr>, GEBJAHR <dbl>,
## # EFFALT <chr>, ABT <chr>, DA <dbl>, KLASSE <chr>, KARENZ <chr>, PEGRP <chr>,
## # REGION <chr>, CODETARIF <chr>, CODESELB <chr>, VERTNR <chr>, VTYP <chr>,
## # RSTYP <dbl>, RSNR <chr>, PAR <chr>, VZNR <chr>, VATYP <chr>, VADAT <dbl>,
## # VASTAT <dbl>, STORNO <chr>, STDAT <dbl>, BVONDAT <dbl>, BBISDAT <dbl>,
## # AZCODE <dbl>, KOA <chr>, SC <chr>, RVKEY <lgl>, ERFNR <chr>, EID <chr>,
## # DIABEZ <chr>, FALLID <dbl>, POSNR <dbl>, REBLK <chr>, UNFALL <chr>, …
#=======================
cache("KOA", {
  require(tcltk)
  msgBox <- tkmessageBox(title = "User action required",
                         message = "Please select the data 'KOA'.",
                         icon = "info", type = "ok")
  KOA <- read_csv2(file.choose(), col_names = TRUE, na = "NA", col_types = NULL, n_max = Inf)
})
## Skipping cache update for KOA: up to date
classes_koa <- sapply(KOA, class)
head(KOA, 10)
## # A tibble: 10 × 1
## `Beschreibung,KOA,OKP,Rechnungsart`
## <chr>
## 1 "A 101 Akupunktur,101 ,A,"
## 2 "A 107 Arztkosten,107 ,A,"
## 3 "A 110 Medikamente OKP,110 ,A,"
## 4 "A 114 Einnahmekontrolle Textanpassung 01.01.2007,114 ,A,"
## 5 "A 115 Medikamenten-Check Textanpassung 01.01.2007,115 ,A,"
## 6 "A 116 Bezugs-Check Textanpassung 01.01.2007,116 ,A,"
## 7 "A 117 Substitution/Generika Textanpassung 01.01.2007,117 ,A,"
## 8 "A 118 Notfalldienst Textanpassung 01.01.2007,118 ,A,"
## 9 "A 120 Chiropraktoren,120 ,A,chiro"
## 10 "A 121 Logop\xe4die,121 ,A,logo"
The withdrawals
A row of the data set Vstichtag corresponds to a single insured person. We determine the withdrawals:
names <- names(Vstichtag20090201)
leftJoin <- merge(x = Vstichtag20090201, y = Vstichtag20100201[, c("versnr", "VEGESC")], by = "versnr", all.x = TRUE)
We collect the relevant variables (columns) on the far left of the data set:
names <- names(leftJoin)
names <- as.character(names)
names <- as.vector(names)
leftJoin <- select(leftJoin, VEGESC.x, VEGESC.y, names)
We then define a new variable that identifies an insured person as a withdrawal:
leftJoin <- mutate(leftJoin, aus = ifelse(is.na(VEGESC.y), 1, 0))
The withdrawals:
Abgaenge <- filter(leftJoin, aus == 1)
head(Abgaenge, 3)
## VEGESC.x VEGESC.y versnr txfanr VEGDAT VESPCD VEEDAT VEECOD VEADAT
## 1 M <NA> 1000152 1000152 19140830 F 19830101 11 0
## 2 W <NA> 1000756 1000756 19310825 D 19560601 11 0
## 3 W <NA> 1000829 1000829 19321120 D 19470101 11 0
## VEAGR VERSFA VEKANT VEVERTR VENBSV VEPFNR X_18 X_82 X_46 X_47 X_35 X_20 X_86
## 1 NA 1000152 VD NA 110 1 0 0 0 0 0 0
## 2 NA 1000756 BE NA 183 0 0 0 0 0 0 0
## 3 NA 1000829 BE NA 160 1 0 0 0 0 0 0
## X_84 X_85 X_83 X_25 X_05 X_03 Online Jahrgang Alter ZielKKnr Stichtag
## 1 0 0 0 0 0 0 0 1914 95 NA 20090201
## 2 0 0 0 0 0 0 0 1931 78 NA 20090201
## 3 0 0 0 0 0 0 0 1932 77 NA 20090201
## Standdatum A1 AP U Z H3 H2 K H1 F H4 A5 A6 N A4 HT B J X APC S P AE KTI C XL
## 1 20090201 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 20090201 1 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 20090201 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## KUTI A3 G T A H ZV nurOKP nurZV beide nurOKP_nurZV_beide okp aus
## 1 0 0 0 0 1 1 1 0 0 1 1 1 1
## 2 0 0 0 0 1 1 1 0 0 1 1 1 1
## 3 0 0 0 0 1 1 1 0 0 1 1 1 1
It should be noted that the death of an insured person also counts as a withdrawal. One way to remove these “withdrawals” from our considerations would be to use the information on the reason for withdrawal, which is supposed to be contained in the variable VEAGR (AustrittsGRund). The problem is that no description of the possible values of VEAGR is available. For the sake of simplicity, a withdrawal is therefore defined as an insured person who had at least one insurance coverage (see the binary variables A (= basic insurance), H1 (= semi-private hospital), H2, H3 (= private hospital), Z (= dental coverage) and so on) with the insurance company in the first year (2009), but no coverage in the following year.
gender, region, age, last year’s costs
Abgaenge <- select(Abgaenge, versnr, VEGESC.x, VEKANT, Alter, aus)
# an observation ~ a person
Abgaenge2 <- merge(x = Abgaenge, y = lexj2009[, c("VERS", "BETRBRUT", "KOA")], by.x = "versnr", by.y = "VERS", all.x = TRUE)
head(Abgaenge2, 3)
## versnr VEGESC.x VEKANT Alter aus BETRBRUT KOA
## 1 1000152 M VD 95 1 4.30 115
## 2 1000152 M VD 95 1 27.00 130
## 3 1000152 M VD 95 1 3.25 116
# an observation ~ a person's cost
We create a bar plot to visualize the costs (BETRBRUT) per cost category (KOA) and then select a subset of categories:
unique(Abgaenge2[,"KOA"])
## [1] "115" "130" "116" "107" "110" "111" "117" "207" "230" "950" "210" "118"
## [13] "429" "570" "430" "232" "413" "211" "412" "132" "406" "301" "421" "300"
## [25] "434" "432" "442" "407" "441" "426" "381" "478" "122" "335" "440" NA
## [37] "128" "473" "140" "336" "246" "957" "856" "245" "113" "958" "471" "133"
## [49] "414" "160" "420" "120" "460" "621" "475" "127" "849" "145" "415" "146"
## [61] "150" "231" "840" "842" "427" "436" "853" "612" "610" "619" "634" "571"
## [73] "625" "620" "623" "618" "626" "615" "951" "121" "551" "428" "141" "561"
## [85] "843" "443" "461" "450" "847" "124" "340" "391" "126" "125" "002" "260"
## [97] "437" "262" "101" "435" "263" "998" "472" "341" "447" "575" "445" "850"
## [109] "495" "496" "213" "952" "433" "317" "114" "390" "311" "453" "312" "221"
## [121] "846" "261" "479" "361" "371" "200" "474" "170" "633" "857" "171" "860"
## [133] "365" "658" "201" "624" "854" "550" "313" "500" "691" "851" "572" "351"
## [145] "152" "151" "702" "485" "438" "858" "484" "477" "617" "628" "493" "611"
## [157] "616" "494" "690"
Abgaenge3 <- group_by(Abgaenge2, KOA)
# an observation ~ a cost category
# Some variables have the wrong format: they cannot be summed because they are not numeric. We correct this:
Abgaenge3$BETRBRUT <- as.numeric(Abgaenge3$BETRBRUT)
Abgaenge4 <- summarise(Abgaenge3, count = n(), sum = sum(BETRBRUT))
Abgaenge4 <- arrange(Abgaenge4, desc(sum))
# an observation ~ a cost category
Abgaenge4 <- Abgaenge4[1:10, ]
head(Abgaenge4, 3)
## # A tibble: 3 × 3
## KOA count sum
## <chr> <int> <dbl>
## 1 300 4348 22008061.
## 2 301 1959 11562999.
## 3 140 4912 9299465.
barplot(Abgaenge4$sum, names.arg = Abgaenge4$KOA, ylab = "Cost (BETRBRUT)", xlab = "category of cost (KOA)", col = "blue", main = "Cost per category of cost")
We decide to consider solely the costs corresponding to the categories KOA = 300, 301, 140 and 110.
Abgaenge2$BETRBRUT <- as.numeric(Abgaenge2$BETRBRUT)
Abgaenge5 <- filter(Abgaenge2, KOA %in% c(300, 301, 140, 110))
# an observation ~ a person's cost
Abgaenge6 <- group_by(Abgaenge5, versnr)
Abgaenge6 <- summarise(Abgaenge6, BETRBRUT = sum(BETRBRUT))
# an observation ~ a person
Abgaenge <- merge(x = Abgaenge, y = Abgaenge6, by = "versnr", all.x = TRUE)
# an observation ~ a person
# warning: new variable created (BETRBRUT) with N/A values
print(names(Abgaenge))
## [1] "versnr" "VEGESC.x" "VEKANT" "Alter" "aus" "BETRBRUT"
names(Abgaenge) <- c("persId","gender","canton","age","out","cost")
head(Abgaenge,3)
## persId gender canton age out cost
## 1 1000152 M VD 95 1 2165.7
## 2 1000756 W BE 78 1 1063.5
## 3 1000829 W BE 77 1 52524.4
To perform a regression, we need to add to the Abgaenge data set observations on insured persons who did not leave the health insurance company (the remainder). We produce the needed data set and “row bind” it to Abgaenge. We obtain the data set insured2009.
leftJoin2 <- select(leftJoin, versnr, VEGESC.x, VEKANT, Alter)
# an observation ~ a person
leftJoin3 <- merge(x = leftJoin2, y = lexj2009[, c("VERS", "BETRBRUT", "KOA")], by.x = "versnr", by.y = "VERS", all.x = TRUE)
# an observation ~ a person's cost
leftJoin3$BETRBRUT <- as.numeric(leftJoin3$BETRBRUT)
leftJoin4 <- filter(leftJoin3, KOA %in% c(300, 301, 140, 110))
leftJoin5 <- group_by(leftJoin4, versnr)
leftJoin5 <- summarise(leftJoin5, BETRBRUT = sum(BETRBRUT))
# an observation ~ a person
leftJoin6 <- merge(x = leftJoin2, y = leftJoin5, by = "versnr", all.x = TRUE)
# an observation ~ a person
# warning: new variable created (BETRBRUT) with N/A values
print(names(leftJoin6))
## [1] "versnr" "VEGESC.x" "VEKANT" "Alter" "BETRBRUT"
names(leftJoin6) <- c("persId","gender","canton","age","cost")
leftJoin6 <- mutate(leftJoin6, out = 0)

remainder <- anti_join(leftJoin6, Abgaenge, by = "persId")
# an observation ~ a person
# warning: variable (cost) with N/A values
insured2009 <- rbind(Abgaenge, remainder)
insured2009 <- mutate(insured2009, cost = ifelse(is.na(cost), 0, cost))
# an observation ~ a person
# no variable with N/A values
Quality control:
- the number of observations in insured2009 must equal the number of observations in Vstichtag20090201;
- the produced tables are checked for duplicate policyholders:
data <- Vstichtag20090201
doublons <- data[duplicated(data[, "versnr"]), ]

data <- Abgaenge
doublons <- data[duplicated(data[, "persId"]), ]

data <- insured2009
doublons <- data[duplicated(data[, "persId"]), ]
The categorical variables gender and canton have to be formatted as factors in order to perform a regression:
insured2009$gender <- as.factor(insured2009$gender)
insured2009$canton <- as.factor(insured2009$canton)
head(insured2009, 10)
## persId gender canton age out cost
## 1 1000152 M VD 95 1 2165.70
## 2 1000756 W BE 78 1 1063.50
## 3 1000829 W BE 77 1 52524.40
## 4 1000977 M TI 73 1 16185.45
## 5 1001191 W BE 68 1 11163.65
## 6 1001345 W BE 66 1 0.00
## 7 1003062 W GE 49 1 0.00
## 8 1003070 W AG 49 1 2.55
## 9 1003682 W VS 45 1 0.00
## 10 1003720 M VS 44 1 0.00
That’s it! We have obtained the basic data table needed for further statistical analysis: the insured2009 table with all the variables specified at the beginning.
The statistical model
We want to explain the withdrawal of the insured from the insurance using the following variables: gender, region, age, last year’s costs.
We are given the following explanatory variables:
\(GENDER\): a binary categorical variable with values \(M\) for “male” and \(W\) for “female”
\(AGE\): a metric variable with values in \(\{0,1,2,\ldots,120\}\)
\(COST\): a metric variable with values in \(\mathbb{R}_+\)
\(CANTON\): a non-binary categorical variable
These explanatory variables are used to explain whether or not insured persons leave the health insurance company, i.e. to explain the following response variable:
\(OUT \sim \mathcal{B}ern(\pi)\), with values 0 for “no withdrawal” and 1 for “withdrawal”
where the parameter \(\pi\) depends on the explanatory variables according to the statistical model presented below:
\[ \pi = p(OUT=1 \,|\, X) = \frac{1}{1+\exp(-(X\cdot\beta^T))} \]
where \(X\) -called the design matrix- is a \(n\times (p+1)\)-matrix, \(n\) being the number of observations and \(p\) the number of explanatory variables. The first column of \(X\) contains solely 1’s; each other column corresponds to one of the \(p\) explanatory variables. The linear predictor \(\eta :=X\cdot\beta^T\) then becomes:
\[ \eta= \begin{pmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \eta_4 \\ \eta_5 \\ \eta_6 \\ \vdots \end{pmatrix} := \begin{pmatrix} 1 & W & c_2 & 67 & 1493\\ 1 & W & c_3 & 73 & 1577\\ 1 & M & c_1 & 26 & 0\\ 1 & M & c_3 & 12 & 142\\ 1 & W & c_1 & 27 & 840\\ 1 & M & c_2 & 54 & 2314\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ \end{pmatrix} \cdot \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} \]
There’s just a problem here: the second and third columns are not numerical! The solution is to recode them, as explained with an example in the following section.
Full disjunctive coding for categorical variables
The numerical columns of the design matrix \(X\) don’t need to be changed, whereas the categorical columns all have to be recoded in the following manner.
Consider a column \(v\) of \(X\) admitting three categories \(c_1,c_2,c_3\). We proceed to the following “dummy”-recoding:
\[ \begin{pmatrix} c_2\\c_1\\c_1\\c_3\\c_1\\c_2\\ \vdots \end{pmatrix} \longrightarrow \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ \vdots & \vdots & \vdots \end{pmatrix} \]
Motivation for such a recoding: It is in fact a question of carrying out as many different regressions as there are categories - we speak of levels. For the \(c_1\) level this gives the simple regression
\[ \eta_{i,c_1} = \beta_{0,c_1}+\beta_{1,c_1}\cdot x_{i,c_1}+\varepsilon_{i,c_1}\,, \] and it is the same for the other categories \(c_2\) and \(c_3\). By using an index \(j\) running through the categories: \(j\in\{c_1,c_2,c_3\}\) we come to consider:
\[ \eta_{i,j} = \beta_{0,j}+\beta_{1,j}\cdot x_{i,j}+\varepsilon_{i,j}\quad, j\in\{c_1,c_2,c_3\} \]
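In R, this dummy recoding is produced automatically by `model.matrix()`. A minimal sketch on a toy factor with the three levels used above (the data and variable names are purely illustrative):

```r
# Toy factor with the three levels from the text (illustrative data)
v <- factor(c("c2", "c1", "c1", "c3", "c1", "c2"))

# Full disjunctive ("one-hot") coding: one indicator column per level
B <- model.matrix(~ v - 1)   # drop the intercept to keep all k columns
print(B)

# Default treatment coding: an intercept plus k - 1 indicator columns
X <- model.matrix(~ v)
print(X)
```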
Modeling the problem of taking the categorical variables into account
The proposed recoding replaces the column \(v\) of \(X\) with as many columns \(v_i\) as there are levels of \(v\). The problem is then that the first column \(w\) of \(X\), the one filled with 1’s, is a linear combination of the \(v_i\), because they always sum to 1:
\[ w=v_1+v_2+v_3 \]
A design matrix cannot contain a linearly dependent column. One solution would be to delete one of the \(k=3\) columns \(v_i\), say the last column \(v_k\).
Let us write \(B_v\) for the \(n\times k\)-matrix \([v_1|v_2|\cdots|v_k]\). A more general solution consists in multiplying \(B_v\) by a so-called contrast matrix \(C_v\) of dimensions \(k\times (k-1)\). The resulting matrix \(B_v\cdot C_v=:\tilde{B}_v\) is then a matrix of dimensions \(n\times(k-1)\):
\[\begin{eqnarray*} \begin{pmatrix} c_2\\c_1\\c_1\\c_3\\c_1\\c_2\\ \vdots \end{pmatrix} &\longrightarrow& \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ \vdots & \vdots & \vdots \end{pmatrix} \\ &\longrightarrow& \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0\\ \vdots & \vdots & \vdots \end{pmatrix} \cdot \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 1 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \end{pmatrix} =\tilde{B}_v \end{eqnarray*}\]
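This matrix product is easy to check numerically; a minimal sketch in R, using the \(3\times 2\) contrast matrix that drops the last level \(c_3\):

```r
# Indicator matrix B_v for levels c1, c2, c3 (rows: c2, c1, c1, c3, c1, c2)
B <- rbind(c(0, 1, 0),
           c(1, 0, 0),
           c(1, 0, 0),
           c(0, 0, 1),
           c(1, 0, 0),
           c(0, 1, 0))

# Contrast matrix dropping the last level c3 (dimensions k x (k-1) = 3 x 2)
C <- rbind(c(1, 0),
           c(0, 1),
           c(0, 0))

B_tilde <- B %*% C
print(B_tilde)   # n x (k-1) matrix; rows with level c3 become (0, 0)
```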
We then do the same for each categorical column in our \(X\) defined above, which gives us:
\[\begin{eqnarray*} X &=& \begin{pmatrix} 1 & W & ZH & 67 & 1493\\ 1 & W & GE & 73 & 1577\\ 1 & M & BE & 26 & 0\\ 1 & M & ZH & 12 & 142\\ 1 & W & BE & 27 & 840\\ 1 & M & BE & 54 & 2314\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ \end{pmatrix} \\ &=& \left( \mathbf{1} \;|\: GENDER \;|\: CANTON \;|\: AGE \;|\: COST \right) \\ &\longrightarrow& \left( \mathbf{1} \;|\: B_\mbox{GENDER}\cdot C_\mbox{GENDER} \;|\: B_\mbox{CANTON}\cdot C_\mbox{CANTON} \;|\: AGE \;|\: COST \right) \\ &=& \begin{pmatrix} \begin{array}{c|c|cc|cc} 1 & 0 & 0 & 0 & 67 & 1493\\ 1 & 0 & 0 & 1 & 73 & 1577\\ 1 & 1 & 1 & 0 & 26 & 0\\ 1 & 1 & 0 & 0 & 12 & 142\\ 1 & 0 & 1 & 0 & 27 & 840\\ 1 & 1 & 1 & 0 & 54 & 2314\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ \end{array} \end{pmatrix} \end{eqnarray*}\]
where, to simplify the presentation, we consider that the variable \(CANTON\) has only 3 levels instead of 26.
The contrast matrix for the variable \(CANTON\) (with not 3 but all 30 levels actually present in the data):
options(width=10000) # to avoid the line-breaking of a table with many columns when printed in the RMarkdown output document
contrasts(insured2009$canton)
## AI AR AU BE BL BS DE FL FR GE GL GR IT JU LU NE NW OW SG SH SO SZ TG TI UR VD VS ZG ZH
## AG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## AI 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## AR 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## AU 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## BE 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## BL 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## BS 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## DE 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## FL 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## FR 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## GE 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## GL 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## GR 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## IT 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## JU 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## LU 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## NE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## NW 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## OW 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## SG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## SH 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## SO 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## SZ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## TG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## TI 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## UR 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## VD 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## VS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## ZG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## ZH 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
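R stores exactly such a \(k\times(k-1)\) contrast matrix for every factor; the default shown above is treatment coding with the first level (here AG) as reference, and other schemes can be swapped in. A short illustration on a 3-level factor:

```r
# Treatment contrasts for a 3-level factor: the first level is the reference
contr.treatment(3)

# Sum ("deviation") contrasts: each level is compared with the grand mean
contr.sum(3)
```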
With our adapted design matrix -which we will also call \(X\)- we can finally write the linear predictor \(\eta=X\cdot\beta^T\):
\[ \eta= \begin{pmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \eta_4 \\ \eta_5 \\ \eta_6 \\ \vdots \end{pmatrix} := \begin{pmatrix} \begin{array}{c|c|cc|cc} 1 & 0 & 0 & 0 & 67 & 1493\\ 1 & 0 & 0 & 1 & 73 & 1577\\ 1 & 1 & 1 & 0 & 26 & 0\\ 1 & 1 & 0 & 0 & 12 & 142\\ 1 & 0 & 1 & 0 & 27 & 840\\ 1 & 1 & 1 & 0 & 54 & 2314\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ \end{array} \end{pmatrix} \cdot \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \end{pmatrix} \]
Model fitting
Keep in mind that the linear predictor \(\eta=X\cdot\beta^T\) is an \(n\times 1\)-vector \((\eta_1,\ldots,\eta_n)^T\); the mean parameter \(\mu=\pi\) is obtained from it through the inverse link, \(\pi_i = \frac{1}{1+\exp(-\eta_i)}\).
The likelihood is:
\[ \begin{align} \large L = \displaystyle\prod_{i=1}^N \pi_i^{OUT_i}(1-\pi_i)^{1-OUT_i} \end{align} \]
where \(\pi_i=\frac{1}{1+\exp(-(x_i\cdot\beta^T))}\), \(x_i\) being the \(i\)-th row of the design matrix \(X\).
The negative log-likelihood for logistic regression is then given by
\[ NLL(\beta) = -\sum_{i=1}^N \left( OUT_i\cdot\log(\pi_i) + (1-OUT_i)\cdot\log(1-\pi_i) \right) \]
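This NLL can be written down directly in R; a minimal sketch (the function name `nll` and the epsilon clamp guarding the logarithms are ours):

```r
# Negative log-likelihood of the logistic model; beta is a numeric vector,
# X a numeric design matrix, out the 0/1 response
nll <- function(beta, X, out, eps = 1e-12) {
  eta <- as.vector(X %*% beta)         # linear predictor
  p   <- 1 / (1 + exp(-eta))           # pi_i
  p   <- pmin(pmax(p, eps), 1 - eps)   # keep log() finite
  -sum(out * log(p) + (1 - out) * log(1 - p))
}

# Tiny sanity check: with beta = 0 every pi_i equals 1/2,
# so the NLL is N * log(2) regardless of the response
X_toy   <- matrix(rnorm(20), nrow = 10)
out_toy <- rbinom(10, 1, 0.5)
nll(rep(0, 2), X_toy, out_toy)   # equals 10 * log(2)
```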
Unlike linear regression, we cannot write down the MLE in closed form. Instead, we need to use an optimization algorithm to compute it. For this purpose, we need to derive the gradient and Hessian matrix.
\[ \nabla NLL(\beta) = \begin{pmatrix} \begin{array}{l} \partial_{\beta_0} NLL \\ \partial_{\beta_1} NLL \\ \quad\vdots \\ \partial_{\beta_p} NLL \\ \end{array} \end{pmatrix} \]
with one component per coefficient: the intercept and one \(\beta_i\) for each column of the design matrix (the \(GENDER\) and \(CANTON\) dummies, \(AGE\) and \(COST\)).
Using the chain rule and recalling that \(\pi_n=\frac{1}{1+\exp(-\eta_n)}\) and that \(\eta = X\cdot\beta^T\): \[ \begin{align} \frac{\partial NLL}{\partial \beta_i} = \displaystyle \sum_{n=1}^N \frac{\partial NLL}{\partial \pi_n}\frac{\partial \pi_n}{\partial \eta_n}\frac{\partial \eta_n}{\partial \beta_i} \end{align} \]
Thus, we are looking to obtain three different derivatives.
First we calculate
\[ \begin{align} \frac{\partial NLL}{\partial \pi_n} = -\left( OUT_n \frac{1}{\pi_n} + (1-OUT_n) \frac{1}{1-\pi_n}(-1) \right) = \frac{1-OUT_n}{1-\pi_n} - \frac{OUT_n}{\pi_n} \end{align} \]
Next, we solve for the derivative of \(\pi\) with respect to the linear predictor function \(\eta\), this for each observation \(n\):
\[ \begin{align} \large \pi_n(\eta_n) = \frac{1}{1+e^{-\eta_n}} \end{align} \]
\[ \begin{align} \frac{\partial \pi_n}{\partial \eta_n} = \frac{e^{-\eta_n}}{(1+e^{-\eta_n})^2} = \frac{1}{1+e^{-\eta_n}} \frac{e^{-\eta_n}}{1+e^{-\eta_n}} = \pi_n-\pi_n^2 = \pi_n(1-\pi_n) \end{align} \]
because of the algebraic relationship
\[ \frac{1}{1+z}-\left(\frac{1}{1+z}\right)^2=\frac{z}{(1+z)^2} \]
Lastly, we solve for the derivative of the linear predictor function with respect to the weights:
\[ \begin{eqnarray*} \eta_n &=& x_n\cdot\beta^T \\ &=& \beta_0x_{n0}+\beta_1x_{n1}+\beta_2x_{n2}+\cdots+\beta_px_{np} \end{eqnarray*} \]
\[ \frac{\partial \eta_n}{\partial \beta_i} = x_{ni} \]
We now put it all together:
\[ \begin{eqnarray*} \frac{\partial \,NLL}{\partial \beta_i} &=& \sum_{n=1}^N\left(\frac{1-OUT_n}{1-\pi_n}-\frac{OUT_n}{\pi_n}\right)\pi_n(1-\pi_n)\,x_{ni} \\ &=& \sum_{n=1}^N\left[(1-OUT_n)\,\pi_n-OUT_n(1-\pi_n)\right]x_{ni} \\ &=& \sum_{n=1}^N\left[\pi_n-OUT_n\,\pi_n-OUT_n+OUT_n\,\pi_n\right]x_{ni} \\ &=& \sum_{n=1}^N(\pi_n-OUT_n)\,x_{ni} \end{eqnarray*} \]
Collecting the components into one column vector, with \(x_n\) the \(n\)-th row of \(X\):
\[ \frac{\partial\, NLL}{\partial \beta} = \sum_{n=1}^{N}(\pi_n-OUT_n)\,x_n^T = X^T\cdot(\pi-OUT) \]
Therefore, the gradient with respect to \(\beta\) is the column vector
\[ \mbox{grad } NLL = \frac{\partial \,NLL}{\partial \beta}=X^T\cdot(\pi-OUT). \]
Analogously we obtain the Hessian matrix:
\[ \begin{eqnarray*} H &=& \frac{\partial \,(\mbox{grad } NLL)^T}{\partial \beta} = \sum_{n=1}^N (\nabla_\beta \pi_n)\cdot x_n^T = \sum_{n=1}^N \pi_n(1-\pi_n)\,x_n\,x_n^T \\ &=& X^T\cdot S\cdot X \end{eqnarray*} \]
where \(S:=\mbox{diag}(\pi_n\cdot(1-\pi_n))\). One has to verify that \(H\) is positive definite. When this is the case, the \(NLL\) is convex and has a unique global minimum.
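A sketch of the gradient \(X^T(\pi-OUT)\) and the Hessian \(X^T S X\) in R, with the positive-definiteness check done via the eigenvalues (function names and toy data are ours):

```r
nll_grad <- function(beta, X, out) {
  p <- 1 / (1 + exp(-as.vector(X %*% beta)))
  as.vector(t(X) %*% (p - out))   # X^T (pi - OUT)
}

nll_hessian <- function(beta, X) {
  p <- 1 / (1 + exp(-as.vector(X %*% beta)))
  S <- diag(p * (1 - p))          # S = diag(pi_n (1 - pi_n))
  t(X) %*% S %*% X                # X^T S X
}

# With a full-rank design matrix the Hessian is positive definite,
# hence the NLL is convex
set.seed(1)
X_toy <- cbind(1, rnorm(50))
H <- nll_hessian(c(0.1, -0.2), X_toy)
all(eigen(H, symmetric = TRUE)$values > 0)
```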
The gradient descent algorithm
Consider a start value \(\theta_0\) for the parameter vector \(\theta\), which we want to drive towards a global minimum of a given real-valued function \(f(\theta)\). The method consists in modifying the parameter vector step-wise: \[ \theta_{k+1}=\theta_k-\lambda_k\cdot\mbox{grad }f(\theta_k) \]
where \(\lambda_k\) is called, depending on the context, the step size or the learning rate. The gradient points in the direction of steepest ascent of \(f\); since we want to minimize the \(NLL\), we move against it, hence the minus sign in front of \(\mbox{grad }f(\theta_k)\). To ensure that the minimum found is global, one also has to check the convexity of \(f\) via its Hessian.
Motivation of the method
By Taylor’s theorem, we have \[ f(\theta_k+\lambda_k\cdot d_k) \approx f(\theta_k)+\lambda_k\cdot(\mbox{grad }f(\theta_k))^T\cdot d_k\,, \]
where \(d_k\) is the descent direction at step \(k\). So if \(d_k\) is a descent direction, i.e. \((\mbox{grad }f(\theta_k))^T\cdot d_k<0\), and the step size \(\lambda_k\) is chosen small enough, then \(f(\theta_k+\lambda_k\cdot d_k) < f(\theta_k)\). Let us now determine the step size \(\lambda_k\) that minimizes
\[ \phi(\lambda_k):=f(\theta_k+\lambda_k\cdot d_k) \]
A necessary condition for the minimum is \(\phi'(\lambda_k)=0\). By the chain rule,
\[ \phi'(\lambda_k) = (d_k)^T\cdot\mbox{grad }f(\theta_k+\lambda_k\cdot d_k) = 0 \]
So we either have \(\mbox{grad }f(\theta_k+\lambda_k\cdot d_k)=0\), which means we have found a stationary point, or
\[ \mbox{grad }f(\theta_k+\lambda_k\cdot d_k) \perp d_k\,, \]
which means that exact line search stops at a point where the local gradient (i.e. the gradient at the end of the step) is perpendicular to the search direction. Hence consecutive directions are orthogonal. This motivates choosing the descent direction \(d_k=-\mbox{grad }f(\theta_k)\): whatever other \(d_k\) we choose, after at most one more step the new direction is (minus) the local gradient anyway.
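The update rule can be sketched on a simple convex function before applying it to the NLL; a minimal example minimizing \(f(\theta)=\|\theta\|^2\) (names are ours):

```r
# Generic gradient descent with a fixed step size lambda:
# theta_{k+1} = theta_k - lambda * grad f(theta_k)
grad_descent <- function(grad_f, theta0, lambda = 0.1, n_iter = 100) {
  theta <- theta0
  for (k in 1:n_iter) {
    theta <- theta - lambda * grad_f(theta)
  }
  theta
}

# f(theta) = sum(theta^2) has gradient 2 * theta and its minimum at the origin
theta_hat <- grad_descent(function(th) 2 * th, theta0 = c(4, -3),
                          lambda = 0.1, n_iter = 100)
theta_hat   # close to c(0, 0)
```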
Implementation
The design matrix has to be dummy-recoded.
head(insured2009,10)
## persId gender canton age out cost
## 1 1000152 M VD 95 1 2165.70
## 2 1000756 W BE 78 1 1063.50
## 3 1000829 W BE 77 1 52524.40
## 4 1000977 M TI 73 1 16185.45
## 5 1001191 W BE 68 1 11163.65
## 6 1001345 W BE 66 1 0.00
## 7 1003062 W GE 49 1 0.00
## 8 1003070 W AG 49 1 2.55
## 9 1003682 W VS 45 1 0.00
## 10 1003720 M VS 44 1 0.00
options(width=10000) # to avoid the line-breaking of a table with many columns when printed in the RMarkdown output document
library(fastDummies)
X <- dummy_cols(insured2009, select_columns = c("gender", "canton"),
                remove_selected_columns = TRUE, remove_first_dummy = TRUE)
X <- as.matrix(X)
X <- X[, -c(1, 3)]  # drop the persId and out columns
head(X, 10)
## age cost gender_W canton_AI canton_AR canton_AU canton_BE canton_BL canton_BS canton_DE canton_FL canton_FR canton_GE canton_GL canton_GR canton_IT canton_JU canton_LU canton_NE canton_NW canton_OW canton_SG canton_SH canton_SO canton_SZ canton_TG canton_TI canton_UR canton_VD canton_VS canton_ZG canton_ZH
## [1,] 95 2165.70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [2,] 78 1063.50 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 77 52524.40 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 73 16185.45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [5,] 68 11163.65 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [6,] 66 0.00 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [7,] 49 0.00 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [8,] 49 2.55 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [9,] 45 0.00 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [10,] 44 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
# Calculate the rank of the matrix X
rank_X <- qr(X)$rank
if (rank_X == ncol(X)) {
  print("The columns of X are linearly independent.")
} else {
  print("The columns of X are not linearly independent.")
}
## [1] "The columns of X are linearly independent."
cat("dim(X)=",dim(X),"\n")
## dim(X)= 328253 32
Before dummy-recoding, the full model has 4 parameters \(\beta_1,\ldots,\beta_4\) plus the intercept \(\beta_0\), so that the dimension of the parameter vector \(\theta:=\beta\) is 5. After dummy-recoding, the model has 32 parameters.
dim(X)
## [1] 328253 32
We recall that
\[ \theta_{k+1}=\theta_k-\lambda_k\cdot\mbox{grad }f(\theta_k) \] and that
\[
\mbox{grad } NLL = \frac{\partial \,NLL}{\partial
\beta}=X^T\cdot(\pi-OUT).
\] Choose a start value for the 32-dimensional \(\theta_0\).
Choose a step size (assumed constant for the sake of simplicity).
theta_0 <- vector("numeric", length = 32)  # start value
lambda  <- 0.05                            # step size

theta <- t(theta_0)
erg   <- theta
out   <- insured2009$out  # the 0/1 response (it is not a column of X)
for (j in 1:1000) {
  eta   <- X %*% t(theta)
  pi    <- 1/(1 + exp(-eta))
  grad  <- t(pi - out) %*% X
  theta <- theta - lambda * grad
  if (j %% 50 == 0) {
    erg <- rbind(erg, theta)
  }
}
print(erg)
## age cost gender_W canton_AI canton_AR canton_AU canton_BE canton_BL canton_BS canton_DE canton_FL canton_FR canton_GE canton_GL canton_GR canton_IT canton_JU canton_LU canton_NE canton_NW canton_OW canton_SG canton_SH canton_SO canton_SZ canton_TG canton_TI canton_UR canton_VD canton_VS canton_ZG canton_ZH
## [1,] 0.00 0 0.0 0.00 0.00 0.00 0.000 0.00 0.00 0.00 0.000 0.000 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.00 0.000 0.00000 0.00 0.00 0.00 0.000 0.00 0.000 0.00 0.00
## [2,] 42621.80 5108308 202439.3 -41.85 32.40 155.30 -1375.025 -170.45 252.45 0.10 -10.975 -134.425 591.125 52.15 -61.975 -1.375 -22.725 -266.025 752.325 -287.90 -45.975 -486.60 20.125 77.75000 -112.15 -52.20 529.25 -167.175 2445.00 108.975 -152.95 -455.20
## [3,] -289061.75 -2743883 399641.5 -94.05 16.70 284.45 -5263.375 -575.70 417.45 0.10 -23.925 -676.725 755.925 87.90 -384.575 -2.775 -83.775 -822.875 1380.675 -656.45 -108.275 -1486.70 -2.325 -88.10000 -293.10 -239.40 210.85 -380.475 4005.30 -103.125 -376.60 -2093.55
## [4,] -338099.10 -7171532 600846.1 -142.95 9.60 395.90 -7679.825 -814.40 601.05 0.15 -35.675 -922.575 1243.025 129.40 -534.125 -4.175 -134.875 -1193.075 1928.125 -985.75 -162.125 -2148.85 -6.075 -124.55000 -430.70 -384.20 607.55 -555.925 5992.90 -58.425 -576.05 -3049.95
## [5,] 133856.35 10989246 807095.4 -179.25 84.30 572.85 -6670.375 -752.05 921.30 0.40 -44.425 -667.275 2241.525 198.10 -328.075 -5.525 -120.575 -1170.275 2769.675 -1203.15 -195.125 -2129.00 56.325 173.45000 -472.20 -324.10 2009.60 -671.225 9219.70 370.275 -675.90 -2371.45
## [6,] 90858.70 3137222 1004669.1 -224.30 116.40 737.80 -8589.575 -981.35 1154.95 0.55 -55.475 -920.575 2665.725 247.80 -443.775 -6.925 -142.625 -1489.775 3567.925 -1502.25 -244.875 -2702.20 75.375 197.70000 -590.55 -387.45 2250.70 -845.175 11453.50 386.575 -845.80 -3037.15
## [7,] -267788.35 -4725050 1200501.0 -278.70 94.10 862.05 -12969.025 -1426.85 1296.95 0.55 -68.475 -1544.975 2780.175 281.80 -802.775 -8.325 -206.125 -2102.725 4188.825 -1892.45 -312.525 -3802.05 48.675 -18.05000 -781.80 -605.95 1806.75 -1067.175 12876.35 127.075 -1101.45 -4920.20
## [8,] 92320.39 13387265 1403315.8 -319.00 150.80 1020.50 -12982.725 -1444.45 1577.05 0.75 -77.925 -1445.575 3646.125 344.00 -682.625 -9.675 -207.575 -2190.225 4952.775 -2142.50 -354.025 -3971.75 94.825 178.02013 -847.20 -604.95 2929.35 -1198.075 15745.95 447.475 -1243.40 -4739.30
## [9,] 53363.04 5535892 1599606.6 -365.60 181.05 1185.75 -15245.725 -1696.80 1799.70 0.90 -88.775 -1772.625 4028.975 393.50 -816.575 -11.075 -234.175 -2540.925 5742.125 -2455.55 -409.225 -4609.55 110.625 165.12013 -972.20 -692.00 3084.70 -1375.825 17842.70 429.025 -1439.10 -5592.25
## [10,] -187206.36 -2324116 1793887.6 -418.25 176.60 1322.15 -19021.575 -2085.85 1963.15 0.95 -101.075 -2336.225 4208.525 432.80 -1109.775 -12.475 -287.425 -3076.475 6409.975 -2824.75 -475.625 -5585.55 96.075 -0.42987 -1142.15 -875.95 2800.45 -1583.875 19398.00 232.675 -1684.65 -7215.60
## [11,] -304238.66 -10105196 1989502.7 -468.60 185.95 1470.95 -22246.725 -2409.40 2152.55 1.10 -112.425 -2812.525 4484.475 477.45 -1330.575 -13.875 -332.425 -3520.575 7116.375 -3173.95 -539.225 -6426.55 91.275 -105.97987 -1290.50 -1028.30 2736.95 -1775.275 21136.80 122.975 -1912.70 -8525.90
## [12,] 125501.34 7967075 2190160.0 -508.45 255.50 1640.35 -22062.325 -2406.40 2448.20 1.30 -121.375 -2721.425 5381.575 541.85 -1190.525 -15.225 -328.775 -3574.925 7926.575 -3417.30 -582.375 -6580.45 141.975 108.82013 -1346.00 -1013.40 3881.00 -1903.575 24045.10 462.525 -2054.30 -8263.30
## [13,] -181664.66 113401 2381509.7 -563.85 234.55 1757.55 -26636.825 -2848.80 2586.50 1.30 -133.975 -3413.975 5475.625 575.35 -1556.175 -16.625 -397.275 -4176.375 8512.475 -3811.30 -656.575 -7717.25 110.525 -134.62987 -1530.55 -1248.75 3396.55 -2121.825 25241.65 191.825 -2323.70 -10227.40
## [14,] -228067.79 -7704596 2574992.5 -612.80 251.70 1912.35 -29570.725 -3141.65 2792.65 1.50 -144.825 -3875.125 5791.425 622.70 -1755.825 -18.025 -435.875 -4577.725 9242.375 -4148.85 -720.525 -8513.75 108.825 -219.37987 -1667.30 -1386.20 3399.10 -2307.525 27010.80 109.175 -2539.10 -11380.01
## [15,] 41858.56 10368937 2773144.9 -657.90 283.15 2057.05 -30984.275 -3273.40 3035.45 1.65 -154.625 -4055.775 6496.875 677.25 -1779.825 -19.375 -458.825 -4808.275 9927.225 -4451.40 -778.625 -9032.20 125.425 -171.17987 -1766.05 -1480.45 4102.90 -2465.775 29287.40 267.375 -2730.35 -11885.96
## [16,] -188419.74 2597151 2963387.8 -712.45 256.85 2180.95 -35603.575 -3706.20 3178.50 1.75 -166.675 -4774.775 6605.975 712.10 -2141.775 -20.775 -527.475 -5390.725 10494.325 -4846.60 -856.275 -10178.25 89.775 -430.12987 -1947.30 -1732.90 3665.80 -2678.825 30431.70 1.025 -2999.05 -13837.11
## [17,] -323068.64 -5262336 3154217.7 -764.30 252.25 2321.10 -39421.125 -4076.90 3356.80 1.85 -177.875 -5399.775 6815.525 753.50 -2436.675 -22.175 -577.125 -5890.525 11162.225 -5214.60 -929.025 -11185.10 70.825 -606.32987 -2108.90 -1933.25 3399.35 -2883.875 31864.25 -195.425 -3241.50 -15412.21
## [18,] 169603.21 12809139 3349775.4 -804.85 304.30 2488.60 -39713.625 -4095.35 3650.85 2.10 -186.025 -5441.775 7671.925 817.50 -2345.225 -23.525 -577.375 -5964.125 11922.225 -5469.60 -980.675 -11464.50 107.725 -452.92987 -2171.05 -1969.60 4442.70 -3019.325 34415.05 79.625 -3388.90 -15287.16
## [19,] -14822.05 4992252 3539599.9 -859.50 270.60 2611.60 -43550.375 -4537.15 3793.35 2.20 -197.825 -6203.475 7742.025 852.45 -2724.425 -24.925 -645.075 -6551.575 12461.575 -5869.35 -1061.825 -12612.25 66.925 -728.82987 -2359.65 -2241.95 3952.10 -3237.175 35165.30 -221.075 -3659.90 -17091.06
## [20,] -249474.85 -2860290 3726017.6 -915.25 231.30 2730.40 -48485.975 -4998.05 3931.65 2.25 -209.775 -7012.225 7805.375 884.80 -3129.375 -26.325 -715.925 -7164.725 12997.225 -6270.20 -1143.325 -13845.50 23.525 -1025.42987 -2552.60 -2529.65 3403.70 -3460.325 36120.70 -554.625 -3934.55 -19140.66
## [21,] -400232.10 -10722034 3915273.4 -968.25 213.70 2868.40 -52738.775 -5401.45 4099.90 2.35 -220.975 -7718.375 7967.025 924.00 -3464.325 -27.725 -770.975 -7703.825 13632.425 -6652.00 -1220.575 -14949.90 -4.075 -1248.77987 -2725.30 -2765.00 3034.45 -3672.925 37382.10 -802.875 -4188.95 -20896.76
Choice of a model
One should first ask oneself:
- Is at least one of the predictors useful in predicting the response?
- Do all the predictors help to explain the response, or is only a subset of them useful?
- How well does the model fit the data?
- Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
We address each of these questions in turn.
The first question
We test the hypothesis \[ \begin{array}{l} H_0:\beta_1=\beta_2=\cdots=\beta_p=0 \\ \qquad \mbox{against the hypothesis} \\ H_1:\mbox{at least one }\beta_j\neq 0 \end{array} \]
This test is performed by computing the \(F\)-statistic.
About the \(F\)-statistic
Consider two nested linear models. First model: \(Y=X\cdot\beta+\varepsilon\), where \(\varepsilon\sim\mathcal{N}(0,\sigma^2I)\). This means that \(E[Y]\) lies in the space \(Space(X)\) generated by the columns of \(X\). Second model: the same as the first, but with the last \(q\) \(\beta\)-coefficients set to zero, where \(q\leq p\) and \(p\) is the number of variables of the first model. So the second model is \(Y=X_0\cdot\beta_0+\varepsilon_0\), where \(\varepsilon_0\sim\mathcal{N}(0,\sigma^2I)\) (the same as for the first model). Here \(X_0\) is the matrix composed of the first \(p-q\) columns of \(X\). When the second model holds, \(E[Y]\) lies in the subspace \(Space(X_0)\) of \(Space(X)\), generated by the columns of \(X_0\).
In order to decide whether to keep the last \(q\) variables, one wants to test: \[ \begin{array}{l} H_0:\beta_{p-q+1}=\cdots=\beta_p=0 \\ \qquad \mbox{against the hypothesis} \\ H_1:\mbox{at least one of }\beta_{p-q+1},\ldots,\beta_p\neq 0 \end{array} \]
What could then be an adequate test statistic? Under the \(H_0\)-hypothesis, we have that \(E[Y]\in Space(X_0)\). In that case the least square method consists in projecting the vector \(Y\) not on \(Space(X)\) but on \(Space(X_0)\) in order to obtain \(\hat{Y}_0\).
The idea of the test is to keep \(H_0\) when \(\hat{Y}_0\) is close to the projection of \(Y\) on \(Space(X)\), which we denote by \(\hat{Y}\). Indeed, if the information brought by both models is almost the same, it is better to keep the smaller model.
To quantify this closeness between \(\hat{Y}_0\) and \(\hat{Y}\), note that a raw euclidean distance depends on the measurement units of the data. That is why the distance has to be standardized, for instance by dividing by the squared euclidean norm of the residual vector \(\hat{\varepsilon}=Y-\hat{Y}\). However, the vectors \(\hat{Y}_0-\hat{Y}\) and \(\hat{\varepsilon}=Y-\hat{Y}\) do not belong to spaces of the same dimension, and therefore each term has to be divided by its degrees of freedom. The following test statistic meets our expectations:
\[ F:=\frac{||\hat{Y}_0-\hat{Y}||^2\big/q}{||Y-\hat{Y}||^2\big/(n-p)} \]
where \(n\) is the number of observations and \(q\) is the number of variables tested, i.e. set to zero in the smaller model.
One can show that the numerator and the denominator are independent, each one (scaled by \(\sigma^2\)) following a \(\chi^2\) distribution divided by its degrees of freedom. The quotient thus follows a Fisher distribution.
When there is no relationship between the response and the set of predictors, one would expect the \(F\)-statistic to take a value close to 1. On the other hand, if the alternative hypothesis \(H_1\) is true, we expect \(F\) to be greater than 1.
How large does the \(F\)-statistic need to be before we can reject \(H_0\) and conclude that there actually is a relationship? It turns out that the answer depends on the values of \(n\) and \(p\). When \(n\) is large, and that is definitely the case here, an \(F\)-statistic just a little larger than 1 might already provide evidence against \(H_0\). In our case, concluding on the basis of the p-value of the \(F\)-test, one has to reject \(H_0\) and consider that there is a relationship.
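Before fitting the real model, the construction above can be checked on simulated data (the names `x1`, `x2`, `x3` are illustrative, not taken from the insurance data): the \(F\)-statistic computed by hand from the two projections coincides with what `anova()` reports for the comparison of nested models.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)          # x2 and x3 are truly irrelevant

fit_small <- lm(y ~ x1)              # model under H0: last q = 2 coefficients zero
fit_full  <- lm(y ~ x1 + x2 + x3)    # full model

# Manual F: ||Y0_hat - Y_hat||^2 / q over ||Y - Y_hat||^2 / residual df
q   <- 2
num <- sum((fitted(fit_small) - fitted(fit_full))^2) / q
den <- sum(residuals(fit_full)^2) / df.residual(fit_full)
F_manual <- num / den

anova(fit_small, fit_full)           # reports the same F and its p-value
```

By Pythagoras, \(||\hat{Y}_0-\hat{Y}||^2 = RSS_0 - RSS\), which is exactly the sum-of-squares difference `anova()` uses.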
model_full <- lm(out ~ gender + canton + age + cost, data = insured2009)
summary(model_full)
##
## Call:
## lm(formula = out ~ gender + canton + age + cost, data = insured2009)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.45924 -0.06193 -0.04744 -0.03389 1.00238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.428e-02 2.316e-03 32.080 < 2e-16 ***
## genderW -5.921e-03 7.675e-04 -7.715 1.22e-14 ***
## cantonAI 1.623e-02 1.295e-02 1.254 0.209955
## cantonAR -3.572e-03 5.792e-03 -0.617 0.537381
## cantonAU 2.872e-02 7.465e-03 3.848 0.000119 ***
## cantonBE 1.990e-03 2.287e-03 0.870 0.384089
## cantonBL -1.117e-02 3.221e-03 -3.467 0.000526 ***
## cantonBS 1.150e-02 4.558e-03 2.524 0.011595 *
## cantonDE -2.286e-02 1.098e-01 -0.208 0.835126
## cantonFL 5.188e-02 2.580e-02 2.011 0.044304 *
## cantonFR -5.887e-03 2.828e-03 -2.081 0.037393 *
## cantonGE 2.356e-02 2.798e-03 8.422 < 2e-16 ***
## cantonGL -1.477e-02 9.298e-03 -1.589 0.112052
## cantonGR -4.665e-03 3.130e-03 -1.490 0.136096
## cantonIT -5.390e-02 2.196e-01 -0.245 0.806140
## cantonJU 1.333e-02 6.361e-03 2.095 0.036139 *
## cantonLU 4.176e-02 3.054e-03 13.672 < 2e-16 ***
## cantonNE 7.920e-03 4.100e-03 1.932 0.053376 .
## cantonNW 3.072e-02 4.677e-03 6.568 5.09e-11 ***
## cantonOW 2.096e-01 9.571e-03 21.900 < 2e-16 ***
## cantonSG 4.770e-04 2.708e-03 0.176 0.860209
## cantonSH -4.001e-03 6.061e-03 -0.660 0.509171
## cantonSO 1.688e-02 3.216e-03 5.248 1.54e-07 ***
## cantonSZ -1.031e-02 4.936e-03 -2.090 0.036646 *
## cantonTG 5.365e-03 3.891e-03 1.379 0.167999
## cantonTI -9.246e-03 2.500e-03 -3.698 0.000217 ***
## cantonUR 3.459e-02 5.781e-03 5.983 2.19e-09 ***
## cantonVD 5.211e-03 2.500e-03 2.085 0.037113 *
## cantonVS 8.687e-03 2.978e-03 2.917 0.003538 **
## cantonZG 4.127e-02 4.955e-03 8.329 < 2e-16 ***
## cantonZH 1.564e-02 2.416e-03 6.473 9.62e-11 ***
## age -6.752e-04 1.800e-05 -37.512 < 2e-16 ***
## cost 2.262e-06 6.001e-08 37.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2196 on 328220 degrees of freedom
## Multiple R-squared: 0.01164, Adjusted R-squared: 0.01154
## F-statistic: 120.8 on 32 and 328220 DF, p-value: < 2.2e-16
The second question
How to define a subset of the predictors that is sufficient to predict the response?
The above developments bring a first answer to that question: use the \(F\)-statistic of the test defined above, in which not all coefficients \(\beta\) are set to zero. If necessary, one reorders the variables so that the \(p-q\) kept variables come first and the \(q\) last ones are set to zero.
Notice that in the summary table (above), for each individual predictor a \(t\)-statistic and a p-value are reported. These provide information about whether each individual predictor is related to the response, after adjusting for the other predictors. It turns out that each of these is exactly equivalent to the \(F\)-test that omits that single variable from the model, leaving all the others in, equivalent in the sense that the square of each \(t\)-statistic is the corresponding \(F\)-statistic. These \(t\)-tests report the partial effect of adding that variable to the model.
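This equivalence is easy to verify on simulated data (illustrative variable names, not the insurance data): the square of the \(t\)-statistic of a coefficient equals the \(F\)-statistic of the nested-model comparison that drops that one variable.

```r
set.seed(2)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)

fit_full    <- lm(y ~ x1 + x2)
fit_without <- lm(y ~ x1)            # same model with x2 omitted

t_x2 <- summary(fit_full)$coefficients["x2", "t value"]
F_x2 <- anova(fit_without, fit_full)$F[2]

all.equal(t_x2^2, F_x2)              # the two statistics agree
```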
So we want to determine which variables are related to the response. We could look at each individual p-value of the corresponding \(t\)-test, but if \(p\) is large, we are likely to make some false discoveries: the expected number of p-values falling below a threshold of, say, 5% is more than one in our case (we have 32 predictor coefficients: \(32\cdot 5\% >1\)), so we expect at least one false discovery even if no predictor matters.
To solve this difficulty, one may use the BIC criterion for variable selection. In our case however, there is no real necessity for that.
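For reference, a minimal sketch of BIC-based selection in R (again on simulated data; in the analysis above one would start from `model_full` instead): `step()` performs stepwise selection, and setting its penalty `k` to \(\log(n)\) turns the default AIC into BIC.

```r
set.seed(3)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)          # only x1 truly matters

fit_full <- lm(y ~ x1 + x2 + x3)
fit_bic  <- step(fit_full, k = log(n), trace = 0)  # k = log(n) gives the BIC penalty
formula(fit_bic)                     # the informative predictor x1 is retained
```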
The third question
Two of the most common measures of model fit are the RSE and the \(R^2\).
Recall that for simple regression, \(R^2\) is the squared correlation between the response and the variable. In multiple linear regression, it turns out that \(R^2=Cor(Y,\hat{Y})^2\). Geometrically, \(R^2\) can be interpreted as \(\cos^2(\theta)\), where \(\theta\) is the angle between the centered response and the centered fitted values. The fitted model maximizes this correlation among all possible linear models. An \(R^2\) close to 1 indicates that the model explains a large portion of the variance in the response. But when more variables are added to the model, \(R^2\) always increases. That makes \(R^2\) an inappropriate measure for comparing models with different numbers of variables.
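The identity \(R^2=Cor(Y,\hat{Y})^2\) can be checked directly (simulated data, illustrative names): the Multiple R-squared reported by `summary()` equals the squared correlation between the response and the fitted values.

```r
set.seed(4)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
r2_summary <- summary(fit)$r.squared
r2_cor     <- cor(y, fitted(fit))^2
all.equal(r2_summary, r2_cor)        # the two quantities agree
```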
The fourth question
Predictions: consider the prediction interval for a new observation.
Two-sided interval of level \(1-\alpha\) for a new \(y_{n+1}\): \[ \left[ x_{n+1}^T\cdot\hat{\beta}\,\pm\, t_{n-p}(1-\alpha/2)\cdot \hat{\sigma}\cdot \sqrt{x_{n+1}^T\cdot(X^TX)^{-1}\cdot x_{n+1}+1}\right]. \]
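This formula can be verified term by term against `predict(..., interval = "prediction")` on simulated data (illustrative names; the real analysis would use the fitted insurance model):

```r
set.seed(5)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

new  <- data.frame(x = 0.7)
pint <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

# Manual version of the formula above
X      <- model.matrix(fit)
x_new  <- c(1, new$x)                           # includes the intercept column
sigma  <- summary(fit)$sigma                    # hat(sigma)
center <- sum(x_new * coef(fit))                # x_{n+1}^T beta_hat
quad   <- drop(t(x_new) %*% solve(t(X) %*% X) %*% x_new)
half   <- qt(0.975, df.residual(fit)) * sigma * sqrt(quad + 1)

c(center - half, center + half)                 # matches lwr and upr in pint
```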