Logistic Regression

  • Introduction to logistic regression

 

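Logistic regression applies when the response variable is binary (0/1). The model assumes that Y follows a binomial distribution, and the fitted linear model has the form:

    log(π / (1 − π)) = β0 + β1·X1 + β2·X2 + ... + βp·Xp

where π = μY is the conditional mean of Y (that is, the probability that Y = 1 given a set of X values), π/(1 − π) is the odds that Y = 1, and log(π/(1 − π)) is the log odds, or logit. Here log(π/(1 − π)) is the link function and the probability distribution is binomial. A logistic regression model can be fitted with the following code: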

glm(Y~X1+X2+X3,family = binomial(link ="logit"),data =mydata)
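
To make the link function concrete, the following minimal sketch computes the odds and the logit for one probability and then maps the logit back to a probability. The value 0.25 is an assumed illustration, not taken from any data set; qlogis() and plogis() are the logit and inverse-logit functions in base R.

```r
# Illustrative probability (an assumed value, not from real data)
p <- 0.25

odds  <- p / (1 - p)   # odds that Y = 1
logit <- log(odds)     # log odds, i.e. the logit link

# qlogis() is the logit function; plogis() is its inverse
stopifnot(isTRUE(all.equal(logit, qlogis(p))))
p_back <- plogis(logit)

print(c(odds = odds, logit = logit, p_back = p_back))
```

Because the linear predictor lives on the logit scale, plogis() is what turns a fitted value of β0 + β1·X1 + ... back into a probability.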


Logistic regression is a very useful tool for predicting a binary outcome from a set of continuous and/or categorical predictor variables.

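# Using the Affairs data frame from the AER package as an example, model extramarital affairs
> data(Affairs, package = "AER")   # load the data set from the package; require(<package>) is the usual way to load a package inside a function
> summary(Affairs)                 # start with descriptive statistics to get an overview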
    affairs          gender         age         yearsmarried    children  religiousness     education       occupation   
 Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000   Min.   : 9.00   Min.   :1.000  
 1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000  
 Median : 0.000                Median :32.00   Median : 7.000             Median :3.000   Median :16.00   Median :5.000  
 Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116   Mean   :16.17   Mean   :4.195  
 3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000  
 Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000   Max.   :20.00   Max.   :7.000  
     rating     
 Min.   :1.000  
 1st Qu.:3.000  
 Median :4.000  
 Mean   :3.932  
 3rd Qu.:5.000  
 Max.   :5.000  
> table(Affairs$affairs)   # frequency table: counts how often each value occurs

  0   1   2   3   7  12 
451  34  17  19  42  38 

# Logistic regression models a binary outcome, so first convert the response to a two-level factor

> Affairs$affairs[Affairs$affairs > 0] <- 1    # wherever affairs > 0, set the value to 1
> Affairs$affairs[Affairs$affairs == 0] <- 0
> Affairs$ynaffair <- factor(Affairs$affairs, levels = c(0, 1), labels = c("No", "Yes"))   # convert to a factor, one label per level
> table(Affairs$ynaffair)   # check the result with table() again

 No Yes 
451 150 
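
The dichotomisation step can also be written without overwriting the original count column. The sketch below uses a small made-up vector (not the Affairs data) and a hypothetical variable name yn. It also shows the level order, which matters because glm() with family = binomial() treats the first factor level as "failure" and models the probability of the other level.

```r
# Toy counts (made-up values, not the Affairs data)
affairs <- c(0, 3, 0, 12, 1, 0)

# Dichotomise into a new object; labels needs one string per level
yn <- factor(as.integer(affairs > 0),
             levels = c(0, 1),
             labels = c("No", "Yes"))

print(levels(yn))   # the first level is the reference ("failure") in glm()
print(table(yn))
```

Keeping the raw counts intact makes it possible to rerun the conversion or to model the counts directly later.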

# Fit the logistic model
> fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children + 
+                   religiousness + education + occupation +rating,
+                 data=Affairs,family=binomial())
> summary(fit.full)

Call:
glm(formula = ynaffair ~ gender + age + yearsmarried + children + 
    religiousness + education + occupation + rating, family = binomial(), 
    data = Affairs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5713  -0.7499  -0.5690  -0.2539   2.5191  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.37726    0.88776   1.551 0.120807    
gendermale     0.28029    0.23909   1.172 0.241083       # no "*" means not significant, i.e. p > 0.05
age           -0.04426    0.01825  -2.425 0.015301 *     # the more "*" marks, the stronger the significance
yearsmarried   0.09477    0.03221   2.942 0.003262 ** 
childrenyes    0.39767    0.29151   1.364 0.172508    
religiousness -0.32472    0.08975  -3.618 0.000297 ***
education      0.02105    0.05051   0.417 0.676851    
occupation     0.03092    0.07178   0.431 0.666630    
rating        -0.46845    0.09091  -5.153 2.56e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 609.51  on 592  degrees of freedom
AIC: 627.51

Number of Fisher Scoring iterations: 4

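The output shows that gender, children, education, and occupation do not contribute significantly to the equation. We can drop them, fit a simpler model, and then compare the two models to check whether the simpler one is adequate.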

# Drop the non-significant variables and refit
> fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + 
+                      rating, data=Affairs, family=binomial())
> summary(fit.reduced)

Call:
glm(formula = ynaffair ~ age + yearsmarried + religiousness + 
    rating, family = binomial(), data = Affairs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6278  -0.7550  -0.5701  -0.2624   2.3998  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.93083    0.61032   3.164 0.001558 ** 
age           -0.03527    0.01736  -2.032 0.042127 *  
yearsmarried   0.10062    0.02921   3.445 0.000571 ***
religiousness -0.32902    0.08945  -3.678 0.000235 ***
rating        -0.46136    0.08884  -5.193 2.06e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 615.36  on 596  degrees of freedom
AIC: 625.36    # the simpler model's AIC is lower than the full model's (627.51), which supports it; the two fits can also be compared with anova()

Number of Fisher Scoring iterations: 4
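
The two AIC values can be checked by hand from the numbers printed by summary(). This is a small sketch under the assumption of an ungrouped 0/1 response, where the saturated model has log-likelihood zero, so the log-likelihood equals minus half the residual deviance and AIC = deviance + 2k. The names dev, k and aic are introduced here for illustration only.

```r
# Residual deviances copied from the two summary() outputs
dev <- c(full = 609.51, reduced = 615.36)

# Number of estimated coefficients, including the intercept
k <- c(full = 9, reduced = 5)

# For a 0/1 response: logLik = -deviance / 2, hence AIC = deviance + 2k
aic <- dev + 2 * k
print(aic)
```

The reduced model gives up a little fit (higher deviance) but pays a smaller penalty for its four fewer coefficients, which is why its AIC comes out lower.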

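Because the two models are nested (the predictors in fit.reduced are a subset of those in fit.full), anova() can be used to compare them. For generalized linear models, use a chi-square test.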

## Compare the two nested models with anova(); for generalized linear models use test = "Chisq" (chi-square test)
> anova(fit.full,fit.reduced,test="Chisq")
Analysis of Deviance Table

Model 1: ynaffair ~ gender + age + yearsmarried + children + religiousness + 
    education + occupation + rating
Model 2: ynaffair ~ age + yearsmarried + religiousness + rating
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       592     609.51                     
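2       596     615.36 -4  -5.8474   0.2108   # the chi-square value is not significant (p = 0.21), so the reduced model with four predictors fits as well as the full model with eight predictors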
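
The test that anova() reports can be reproduced from the residual deviances and degrees of freedom alone. The sketch below uses the rounded values printed in the summaries, so the p-value should come out close to, but not necessarily identical to, the 0.2108 in the table. The variable names are hypothetical.

```r
# Values copied from the two model summaries (rounded as printed)
dev_full    <- 609.51; df_full    <- 592
dev_reduced <- 615.36; df_reduced <- 596

lr_stat <- dev_reduced - dev_full   # loss of fit from dropping four terms
lr_df   <- df_reduced - df_full     # number of coefficients removed

# Upper-tail probability of a chi-square distribution with lr_df degrees of freedom
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(round(c(statistic = lr_stat, df = lr_df, p = p_value), 4))
```

A large p-value here means the four dropped terms do not improve the fit by more than chance would explain, which is the basis for preferring the reduced model.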