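Logistic regression is appropriate when the response variable is binary (0/1). The model assumes that Y follows a binomial distribution, and the fitted linear model has the form:
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$$
where π = μ_Y is the conditional mean of Y (that is, the probability that Y = 1 given a set of X values), π/(1 − π) is the odds that Y = 1, and log(π/(1 − π)) is the log odds, or logit. Here log(π/(1 − π)) is the link function and the probability distribution is binomial. A logistic regression model can be fitted with the following code: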
glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = mydata)
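To make the role of the logit link concrete, here is a minimal sketch on simulated data. The variable names (x1, x2, y) and the coefficients used to generate the data are assumptions made up for illustration; they are not part of the original example. The check at the end compares the logit of the fitted probabilities with the fitted linear predictor.

```r
# Simulated data: names and coefficients are illustrative only
set.seed(42)
n   <- 500
x1  <- rnorm(n)
x2  <- rnorm(n)
eta <- -0.5 + 1.2 * x1 - 0.8 * x2     # linear predictor on the log-odds scale
p   <- plogis(eta)                    # inverse logit: 1 / (1 + exp(-eta))
y   <- rbinom(n, size = 1, prob = p)  # binary response drawn with probability p
mydata <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = mydata)

pi_hat  <- predict(fit, type = "response")  # fitted P(Y = 1 | X)
eta_hat <- predict(fit, type = "link")      # fitted linear predictor

# Applying the logit to the fitted probabilities should recover the linear predictor
all.equal(log(pi_hat / (1 - pi_hat)), eta_hat)
```

predict(type = "response") returns values on the probability scale, while predict(type = "link") returns them on the log-odds scale.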
Example
Logistic regression is a very useful tool for predicting a binary outcome from a set of continuous and/or categorical predictor variables.
# Use the Affairs data frame from the AER package to walk through a regression on extramarital affairs
> data(Affairs, package = "AER")  # load the dataset from the package; require(<package name>) can also be used, e.g. inside a function
> summary(Affairs)  # start with descriptive statistics to get an overall picture
    affairs          gender         age         yearsmarried    children  religiousness     education       occupation
 Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000   Min.   : 9.00   Min.   :1.000
 1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000
 Median : 0.000                Median :32.00   Median : 7.000             Median :3.000   Median :16.00   Median :5.000
 Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116   Mean   :16.17   Mean   :4.195
 3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000
 Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000   Max.   :20.00   Max.   :7.000
     rating
 Min.   :1.000
 1st Qu.:3.000
 Median :4.000
 Mean   :3.932
 3rd Qu.:5.000
 Max.   :5.000
> table(Affairs$affairs)  # frequency table; counts the occurrences of each value
  0   1   2   3   7  12
451  34  17  19  42  38
# Logistic regression models a binary outcome, so first recode the data and convert it to a factor
> Affairs$affairs[Affairs$affairs > 0] <- 1   # where Affairs$affairs > 0 is TRUE, assign 1
> Affairs$affairs[Affairs$affairs == 0] <- 0
> Affairs$ynaffair <- factor(Affairs$affairs, levels = c(0, 1), labels = c("No", "Yes"))  # convert to a factor; one label per level
> table(Affairs$ynaffair)  # check the result with table() again
 No Yes
451 150
# Fit the logistic model
> fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
+                 religiousness + education + occupation + rating,
+                 data = Affairs, family = binomial())
> summary(fit.full)
Call:
glm(formula = ynaffair ~ gender + age + yearsmarried + children +
    religiousness + education + occupation + rating, family = binomial(),
    data = Affairs)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5713  -0.7499  -0.5690  -0.2539   2.5191

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.37726    0.88776   1.551 0.120807
gendermale     0.28029    0.23909   1.172 0.241083      # no "*" means not significant, i.e. p > 0.05
age           -0.04426    0.01825  -2.425 0.015301 *    # the more "*", the more significant
yearsmarried   0.09477    0.03221   2.942 0.003262 **
childrenyes    0.39767    0.29151   1.364 0.172508
religiousness -0.32472    0.08975  -3.618 0.000297 ***
education      0.02105    0.05051   0.417 0.676851
occupation     0.03092    0.07178   0.431 0.666630
rating        -0.46845    0.09091  -5.153 2.56e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 609.51  on 592  degrees of freedom
AIC: 627.51

Number of Fisher Scoring iterations: 4
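The recoding step in the transcript can also be written as a single factor() call on a logical condition. This is a minimal sketch that rebuilds the affair counts from the table(Affairs$affairs) output instead of loading AER, so the standalone vectors affairs and ynaffair are illustrative stand-ins for the data frame columns.

```r
# Counts copied from the table(Affairs$affairs) output in the transcript
affairs <- rep(c(0, 1, 2, 3, 7, 12), times = c(451, 34, 17, 19, 42, 38))

# Dichotomize in one step; labels needs one string per level
ynaffair <- factor(affairs > 0, levels = c(FALSE, TRUE), labels = c("No", "Yes"))

counts <- table(ynaffair)
print(counts)  # 451 "No" and 150 "Yes"
```

Building the factor from a logical condition leaves the original count column untouched, which may be preferable if the raw counts are needed later.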
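The summary of fit.full shows that gender, children, education, and occupation do not contribute significantly to the equation. We can drop them, fit a simpler model, and then compare the two models to see whether the simpler one is adequate.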
# Drop the non-significant variables and refit
> fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness +
+                    rating, data = Affairs, family = binomial())
> summary(fit.reduced)
Call:
glm(formula = ynaffair ~ age + yearsmarried + religiousness +
    rating, family = binomial(), data = Affairs)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6278  -0.7550  -0.5701  -0.2624   2.3998

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.93083    0.61032   3.164 0.001558 **
age           -0.03527    0.01736  -2.032 0.042127 *
yearsmarried   0.10062    0.02921   3.445 0.000571 ***
religiousness -0.32902    0.08945  -3.678 0.000235 ***
rating        -0.46136    0.08884  -5.193 2.06e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 615.36  on 596  degrees of freedom
AIC: 625.36

Number of Fisher Scoring iterations: 4
# The simpler model's AIC (625.36) is lower than that of the full model (627.51), which suggests the simpler model is viable; the two fits can also be compared with anova()
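The two AIC values can be reproduced from the numbers printed in the summaries. This sketch relies on the fact that, for a 0/1 response, the residual deviance equals −2 × log-likelihood, so AIC = residual deviance + 2k, where k is the number of estimated coefficients. The variable names are illustrative.

```r
# Residual deviances and coefficient counts read off the two summaries
dev_full    <- 609.51; k_full    <- 9  # intercept + 8 predictors
dev_reduced <- 615.36; k_reduced <- 5  # intercept + 4 predictors

# AIC = deviance + 2 * (number of estimated coefficients) for a 0/1 response
aic_full    <- dev_full    + 2 * k_full     # 627.51
aic_reduced <- dev_reduced + 2 * k_reduced  # 625.36

aic_reduced < aic_full  # TRUE: the reduced model has the lower AIC
```

The reduced model gives up about 5.85 units of deviance but saves four parameters (a penalty of 8), which is why its AIC comes out lower.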
Because the two models are nested (fit.reduced is a subset of fit.full), anova() can be used to compare them. For generalized linear models, use a chi-square test.
## Compare the two nested models with anova(); for generalized linear models use test = "Chisq" (chi-square test)
> anova(fit.full, fit.reduced, test = "Chisq")
Analysis of Deviance Table

Model 1: ynaffair ~ gender + age + yearsmarried + children + religiousness +
    education + occupation + rating
Model 2: ynaffair ~ age + yearsmarried + religiousness + rating
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       592     609.51
2       596     615.36 -4  -5.8474   0.2108
# The chi-square value is not significant (p = 0.21), which indicates that the reduced model with four predictors fits as well as the full model with eight predictors
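The p-value in the table can be checked by hand: the drop in deviance between nested models is compared with a chi-square distribution whose degrees of freedom equal the number of removed coefficients. A minimal sketch using only the figures printed in the table (variable names are illustrative):

```r
# Figures copied from the Analysis of Deviance Table
df_full    <- 592
df_reduced <- 596
lr_stat    <- 5.8474                # absolute value of the Deviance column

df_diff <- df_reduced - df_full     # 4 coefficients were dropped

# Upper-tail probability of a chi-square with df_diff degrees of freedom
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
round(p_value, 4)                   # 0.2108, matching Pr(>Chi)
```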