age
預測wage
。使用交叉驗證爲多項式選擇最佳次數。選擇了什麼程度,這與使用ANOVA
進行假設檢驗的結果相好比何?對所得多項式擬合數據進行繪圖。加載工資數據集。保留全部交叉驗證偏差的數組。咱們執行K=10 K倍交叉驗證。python
1. rm(list = ls()) 2. set.seed(1) 5. # 測試偏差 6. cv.MSE <- NA 8. # 循環遍歷年齡 9. for (i in 1:15) { 10. glm.fit <- glm(wage ~ poly(age, i), data = Wage) 11. # 咱們使用cv.glm的交叉驗證並保留cv偏差 12. cv.MSE[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1] 13. } 14. # 檢查結果對象 15. cv.MSE
1. ## [1] 1675.837 1601.012 1598.801 1594.217 1594.625 1594.888 1595.500 2. ## [8] 1595.436 1596.335 1595.835 1595.970 1597.971 1598.713 1599.253 3. ## [15] 1595.332
咱們經過繪製type = "b"
點與線之間的關係圖來講明結果。數組
1. # 用線圖說明結果 2. plot( x = 1:15, y = cv.MSE, xlab = "power of age", ylab = "CV error", 3. type = "b", pch = 19, lwd = 2, bty = "n", 4. ylim = c( min(cv.MSE) - sd(cv.MSE), max(cv.MSE) + sd(cv.MSE) ) ) 6. # 水平線 7. abline(h = min(cv.MSE) + sd(cv.MSE) , lty = "dotted") 9. # 最小值 10. points( x = which.min(cv.MSE), y = min(cv.MSE), col = "red", pch = "X", cex = 1.5 )
咱們再次以較高的年齡權重對模型進行擬合以進行方差分析。dom
1. ## Analysis of Deviance Table 2. ## 3. ## Model 1: wage ~ poly(age, a) 4. ## Model 2: wage ~ poly(age, a) 5. ## Model 3: wage ~ poly(age, a) 6. ## Model 4: wage ~ poly(age, a) 7. ## Model 5: wage ~ poly(age, a) 8. ## Model 6: wage ~ poly(age, a) 9. ## Model 7: wage ~ poly(age, a) 10. ## Model 8: wage ~ poly(age, a) 11. ## Model 9: wage ~ poly(age, a) 12. ## Model 10: wage ~ poly(age, a) 13. ## Model 11: wage ~ poly(age, a) 14. ## Model 12: wage ~ poly(age, a) 15. ## Model 13: wage ~ poly(age, a) 16. ## Model 14: wage ~ poly(age, a) 17. ## Model 15: wage ~ poly(age, a) 18. ## Resid. Df Resid. Dev Df Deviance F Pr(>F) 19. ## 1 2998 5022216 20. ## 2 2997 4793430 1 228786 143.5637 < 2.2e-16 *** 21. ## 3 2996 4777674 1 15756 9.8867 0.001681 ** 22. ## 4 2995 4771604 1 6070 3.8090 0.051070 . 23. ## 5 2994 4770322 1 1283 0.8048 0.369731 24. ## 6 2993 4766389 1 3932 2.4675 0.116329 25. ## 7 2992 4763834 1 2555 1.6034 0.205515 26. ## 8 2991 4763707 1 127 0.0795 0.778016 27. ## 9 2990 4756703 1 7004 4.3952 0.036124 * 28. ## 10 2989 4756701 1 3 0.0017 0.967552 29. ## 11 2988 4756597 1 103 0.0648 0.799144 30. ## 12 2987 4756591 1 7 0.0043 0.947923 31. ## 13 2986 4756401 1 190 0.1189 0.730224 32. ## 14 2985 4756158 1 243 0.1522 0.696488 33. ## 15 2984 4755364 1 795 0.4986 0.480151 34. ## --- 35. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
根據F檢驗,咱們應該選擇年齡提升到3次的模型,經過交叉驗證 。函數
如今,咱們繪製多項式擬合的結果。測試
1. plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n") 2. ...
wage
使用進行預測age
。繪製得到的擬合圖。1. cv.error <- NA 2. ... 4. # 突出顯示最小值 5. points( x = which.min(cv.error), y = min(cv.error, na.rm = TRUE),
交叉驗證代表,在k = 8的狀況下,測試偏差最小。最小值的1sd以內的最簡約模型具備k = 4,所以將數據分爲5個不一樣的區域。編碼
44spa
1. lm.fit <- glm(wage ~ cut(age, 4), data = Wage) 3. matlines(age.grid, cbind( lm.pred$fit + 2* lm.pred$se.fit, 4. lm.pred$fit - 2* lm.pred$se.fit), 5. col = "red", lty ="dashed")
該Wage
數據集包含了一些其餘的變量,如婚姻情況(maritl
),工做級別(jobclass
),等等。探索其中一些其餘預測變量與的關係wage
,並使用非線性擬合技術將模型擬合到數據中。3d
1. ... 2. summary(Wage[, c("maritl", "jobclass")] )
1. ## maritl jobclass 2. ## 1. Never Married: 648 1. Industrial :1544 3. ## 2. Married :2074 2. Information:1456 4. ## 3. Widowed : 19 5. ## 4. Divorced : 204 6. ## 5. Separated : 55
1. # 關係箱線圖 2. par(mfrow=c(1,2)) 3. plot(Wage$maritl, Wage$wage, frame.plot = "FALSE")
看來一對已婚夫婦平均比其餘羣體掙更多的錢。信息類工做的工資平均高於工業類工做。code
1. m1 <- lm(wage ~ maritl, data = Wage) 2. deviance(m1) # fit (RSS in linear; -2*logLik)
## [1] 4858941
orm
1. m2 <- lm(wage ~ jobclass, data = Wage) 2. deviance(m2)
## [1] 4998547
1. m3 <- lm(wage ~ maritl + jobclass, data = Wage) 2. deviance(m3)
## [1] 4654752
正如預期的那樣,使用最複雜的模型可使樣本內數據擬合最小化。
咱們不能使樣條曲線擬合分類變量。
咱們不能將樣條曲線擬合到因子,但可使用一個樣條曲線擬合一個連續變量並添加其餘預測變量的模型。
1. library(gam) 2. m4 <- gam(...) 3. deviance(m4)
## [1] 4476501
anova(m1, m2, m3, m4)
1. ## Analysis of Variance Table 2. ## 3. ## Model 1: wage ~ maritl 4. ## Model 2: wage ~ jobclass 5. ## Model 3: wage ~ maritl + jobclass 6. ## Model 4: wage ~ maritl + jobclass + s(age, 4) 7. ## Res.Df RSS Df Sum of Sq F Pr(>F) 8. ## 1 2995 4858941 9. ## 2 2998 4998547 -3.0000 -139606 31.082 < 2.2e-16 *** 10. ## 3 2994 4654752 4.0000 343795 57.408 < 2.2e-16 *** 11. ## 4 2990 4476501 4.0002 178252 29.764 < 2.2e-16 *** 12. ## --- 13. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F檢驗代表,咱們從模型四到模型一統計顯著改善的變量有年齡,wage
,maritl
,和jobclass
。
Boston數據迴歸
這個問題使用的變量dis(到五個波士頓就業中心的距離的加權平均值)和nox(每百萬人口中一氧化氮的濃度,單位爲百萬)。咱們將dis做爲預測變量,將nox做爲因變量。
1. rm(list = ls()) 2. set.seed(1) 3. library(MASS) 4. attach(Boston)
1. ## The following objects are masked from Boston (pos = 14): 2. ## 3. ## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, 4. ## rad, rm, tax, zn
poly()
函數擬合三次多項式迴歸來預測nox
使用dis
。報告迴歸輸出,並繪製結果數據和多項式擬合。1. m1 <- lm(nox ~ poly(dis, 3)) 2. summary(m1)
1. ## 2. ## Call: 3. ## lm(formula = nox ~ poly(dis, 3)) 4. ## 5. ## Residuals: 6. ## Min 1Q Median 3Q Max 7. ## -0.121130 -0.040619 -0.009738 0.023385 0.194904 8. ## 9. ## Coefficients: 10. ## Estimate Std. Error t value Pr(>|t|) 11. ## (Intercept) 0.554695 0.002759 201.021 < 2e-16 *** 12. ## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 *** 13. ## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 *** 14. ## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 *** 15. ## --- 16. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 17. ## 18. ## Residual standard error: 0.06207 on 502 degrees of freedom 19. ## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131 20. ## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
1. dislim <- range(dis) 2. ... 3. lines(x = dis.grid, y = lm.pred$fit, col = "red", lwd = 2) 4. matlines(x = dis.grid, y = cbind(lm.pred$fit + 2* lm.pred$se.fit, 5. lm.pred$fit - 2* lm.pred$se.fit) 6. , col = "red", lwd = 1.5, lty = "dashed")
摘要顯示,在nox
使用進行預測時,全部多項式項都是有效的dis
。該圖顯示了一條平滑的曲線,很好地擬合了數據。
咱們繪製1到10度的多項式並保存RSS。
1. # 2. train.rss <- NA 3. ... 4. # 在訓練集中顯示模型擬合 5. train.rss
1. ## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484 2. ## [8] 1.835630 1.833331 1.832171
正如預期的那樣,RSS隨多項式次數單調遞減。
咱們執行LLOCV並手工編碼:
1. cv.error <- matrix(NA, nrow = nrow(Boston), ncol = 10) 3. ... 4. names(result) <- paste( "^", 1:10, sep= "" ) 5. result
1. ## ^1 ^2 ^3 ^4 ^5 ^6 2. ## 0.005471468 0.004022257 0.003822345 0.003820121 0.003785158 0.003711971 3. ## ^7 ^8 ^9 ^10 4. ## 0.003655106 0.003627727 0.003623183 0.003620892
1. plot(result ~ seq(1,10), type = "b", pch = 19, bty = "n", xlab = "powers of dist to empl. center", 2. ylab = "cv error") 3. abline(h = min(cv.error) + sd(cv.error), col = "red", lty = "dashed")
基於交叉驗證,咱們將選擇dis
平方。
bs()
函數擬合迴歸樣條曲線使用nox
進行預測dis
。咱們以[3,6,9]大體相等的4個區間劃分此範圍
1. library(splines) 2. m4 <- lm(nox ~ bs(dis, knots = c(3, 6, 9))) 3. summary(m4)
1. ## 2. ## Call: 3. ## lm(formula = nox ~ bs(dis, knots = c(3, 6, 9))) 4. ## 5. ## Residuals: 6. ## Min 1Q Median 3Q Max 7. ## -0.132134 -0.039466 -0.009042 0.025344 0.187258 8. ## 9. ## Coefficients: 10. ## Estimate Std. Error t value Pr(>|t|) 11. ## (Intercept) 0.709144 0.016099 44.049 < 2e-16 *** 12. ## bs(dis, knots = c(3, 6, 9))1 0.006631 0.025467 0.260 0.795 13. ## bs(dis, knots = c(3, 6, 9))2 -0.258296 0.017759 -14.544 < 2e-16 *** 14. ## bs(dis, knots = c(3, 6, 9))3 -0.233326 0.027248 -8.563 < 2e-16 *** 15. ## bs(dis, knots = c(3, 6, 9))4 -0.336530 0.032140 -10.471 < 2e-16 *** 16. ## bs(dis, knots = c(3, 6, 9))5 -0.269575 0.058799 -4.585 5.75e-06 *** 17. ## bs(dis, knots = c(3, 6, 9))6 -0.303386 0.062631 -4.844 1.70e-06 *** 18. ## --- 19. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 20. ## 21. ## Residual standard error: 0.0612 on 499 degrees of freedom 22. ## Multiple R-squared: 0.7244, Adjusted R-squared: 0.7211 23. ## F-statistic: 218.6 on 6 and 499 DF, p-value: < 2.2e-16
1. # 繪圖結果 2. ... 3. # 4. matlines( dis.grid, 5. ... 6. col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))
咱們使用3到16之間的dfs擬合迴歸樣條曲線。
1. box <- NA 3. for (i in 3:16) { 4. ... 5. } 7. box[-c(1, 2)]
1. ## [1] 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653 2. ## [8] 1.792535 1.796992 1.788999 1.782350 1.781838 1.782798 1.783546
ISLR
包中的College
數據集。
1. rm(list = ls()) 2. set.seed(1) 3. library(leaps) 4. attach(College)
1. ## The following objects are masked from College (pos = 14): 2. ## 3. ## Accept, Apps, Books, Enroll, Expend, F.Undergrad, Grad.Rate, 4. ## Outstate, P.Undergrad, perc.alumni, Personal, PhD, Private, 5. ## Room.Board, S.F.Ratio, Terminal, Top10perc, Top25perc
1. # 訓練/測試拆分行的索引號 2. train <- sample( length(Outstate), length(Outstate)/2) 3. test <- -train 4. ... 5. abline(h=max.adjr2 - std.adjr2, col="red", lty=2)
全部cp,BIC和adjr2得分均顯示6是該子集的最小大小。可是,根據1個標準偏差規則,咱們將選擇具備4個預測變量的模型。
1. ... 2. coefi <- coef(m5, id = 4) 3. names(coefi)
## [1] "(Intercept)" "PrivateYes" "Room.Board" "perc.alumni" "Expend"
1. library(gam) 2. ... 3. plot(gam.fit, se=TRUE, col="blue")
1. gam.pred <- predict(gam.fit, College.test) 2. gam.err <- mean((College.test$Outstate - gam.pred)^2) 3. gam.err
## [1] 3745460
1. gam.tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2) 2. test.rss <- 1 - gam.err / gam.tss 3. test.rss
## [1] 0.7696916
summary(gam.fit)
1. ## 2. ## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, 3. ## df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, 4. ## df = 2), data = College.train) 5. ## Deviance Residuals: 6. ## Min 1Q Median 3Q Max 7. ## -4977.74 -1184.52 58.33 1220.04 7688.30 8. ## 9. ## (Dispersion Parameter for gaussian family taken to be 3300711) 10. ## 11. ## Null Deviance: 6221998532 on 387 degrees of freedom 12. ## Residual Deviance: 1231165118 on 373 degrees of freedom 13. ## AIC: 6941.542 14. ## 15. ## Number of Local Scoring Iterations: 2 16. ## 17. ## Anova for Parametric Effects 18. ## Df Sum Sq Mean Sq F value Pr(>F) 19. ## Private 1 1779433688 1779433688 539.106 < 2.2e-16 *** 20. ## s(Room.Board, df = 2) 1 1221825562 1221825562 370.171 < 2.2e-16 *** 21. ## s(PhD, df = 2) 1 382472137 382472137 115.876 < 2.2e-16 *** 22. ## s(perc.alumni, df = 2) 1 328493313 328493313 99.522 < 2.2e-16 *** 23. ## s(Expend, df = 5) 1 416585875 416585875 126.211 < 2.2e-16 *** 24. ## s(Grad.Rate, df = 2) 1 55284580 55284580 16.749 5.232e-05 *** 25. ## Residuals 373 1231165118 3300711 26. ## --- 27. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 28. ## 29. ## Anova for Nonparametric Effects 30. ## Npar Df Npar F Pr(F) 31. ## (Intercept) 32. ## Private 33. ## s(Room.Board, df = 2) 1 3.5562 0.06010 . 34. ## s(PhD, df = 2) 1 4.3421 0.03786 * 35. ## s(perc.alumni, df = 2) 1 1.9158 0.16715 36. ## s(Expend, df = 5) 4 16.8636 1.016e-12 *** 37. ## s(Grad.Rate, df = 2) 1 3.7208 0.05450 . 38. ## --- 39. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
非參數Anova檢驗顯示了因變量與支出之間存在非線性關係的有力證據,以及因變量與Grad.Rate或PhD之間具備中等強度的非線性關係(使用p值爲0.05)。
最受歡迎的看法
3.matlab中的偏最小二乘迴歸(PLSR)和主成分迴歸(PCR)
5.R語言迴歸中的Hosmer-Lemeshow擬合優度檢驗