電信公司但願對有流失傾向的客戶進行挽留。數據分析人員從公司提早了客戶基本信息(出生年月等)、社區信息(社區平均收入等)和6個以內的業務信息(通話時間長等),附件爲數據及字段說明。git
導入數據和數據清洗dom
> setwd('C:\\Users\\Xu\\Desktop\\data') > telecom_churn<-read.csv("telecom_churn.csv") > telecom_churn<-na.omit(telecom_churn) > attach(telecom_churn)
隨機抽樣,創建訓練集與測試集函數
> set.seed(100) > select<-sample(1:nrow(telecom_churn),nrow(telecom_churn)*0.7) > train=telecom_churn[select,] #訓練集 > test=telecom_churn[-select,] #樣本集
擬合模型,而且根據逐步迴歸選擇更優的模型,AIC越小模型越優測試
> lg<-glm(churn~AGE+edu_class+incomeCode+duration+feton+peakMinAv+peakMinDiff,family=binomial(link='logit')) > summary(lg) Call: glm(formula = churn ~ AGE + edu_class + incomeCode + duration + feton + peakMinAv + peakMinDiff, family = binomial(link = "logit")) Deviance Residuals: Min 1Q Median 3Q Max -3.07235 -0.58919 -0.03107 0.62704 2.73329 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.3456176 0.2522880 13.261 < 2e-16 *** AGE -0.0229400 0.0042145 -5.443 5.24e-08 *** edu_class 0.4476978 0.0719659 6.221 4.94e-10 *** incomeCode 0.0071214 0.0033891 2.101 0.0356 * duration -0.2641532 0.0126729 -20.844 < 2e-16 *** feton -1.0583303 0.1153652 -9.174 < 2e-16 *** peakMinAv 0.0001973 0.0004272 0.462 0.6441 #並不顯著 peakMinDiff -0.0028613 0.0003654 -7.830 4.87e-15 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 3336.6 on 2423 degrees of freedom Residual deviance: 1898.2 on 2416 degrees of freedom AIC: 1914.2 Number of Fisher Scoring iterations: 6 > lg_ms<-step(lg,direction = "both") #尋找最好的迴歸模型,both向前,向後迴歸 Start: AIC=1914.21 churn ~ AGE + edu_class + incomeCode + duration + feton + peakMinAv + peakMinDiff Df Deviance AIC - peakMinAv 1 1898.4 1912.4 <none> 1898.2 1914.2 - incomeCode 1 1902.6 1916.6 - AGE 1 1928.4 1942.4 - edu_class 1 1938.5 1952.5 - peakMinDiff 1 1967.8 1981.8 - feton 1 1985.8 1999.8 - duration 1 2954.6 2968.6 Step: AIC=1912.43 churn ~ AGE + edu_class + incomeCode + duration + feton + peakMinDiff Df Deviance AIC <none> 1898.4 1912.4 + peakMinAv 1 1898.2 1914.2 - incomeCode 1 1902.8 1914.8 - AGE 1 1930.8 1942.8 - edu_class 1 1938.7 1950.7 - peakMinDiff 1 1968.0 1980.0 - feton 1 1986.1 1998.1 - duration 1 2954.8 2966.8 > summary(lg_ms) Call: glm(formula = churn ~ AGE + edu_class + incomeCode + duration + feton + peakMinDiff, family = binomial(link = "logit")) Deviance Residuals: Min 1Q Median 3Q Max -3.01069 -0.58881 -0.03095 0.62815 2.72788 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.3989882 0.2247215 15.125 < 2e-16 *** AGE -0.0233168 0.0041359 -5.638 1.72e-08 *** edu_class 0.4478730 0.0719857 6.222 4.92e-10 *** incomeCode 0.0070731 0.0033873 2.088 0.0368 * duration -0.2641731 0.0126728 -20.846 < 2e-16 *** feton -1.0589440 0.1153516 -9.180 < 2e-16 *** peakMinDiff -0.0028390 0.0003616 -7.851 4.13e-15 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 3336.6 on 2423 degrees of freedom Residual deviance: 1898.4 on 2417 degrees of freedom AIC: 1912.4 #這個AIC更小,因此這個模型更合適 Number of Fisher Scoring iterations: 6
選擇模型後對訓練集進行預測spa
> train$lg_p<-predict(lg_ms, train) #訓練集中進行預測 > summary(train$lg_p) Min. 1st Qu. Median Mean 3rd Qu. Max. -15.8300 -2.6670 -0.1845 -0.9950 1.3380 5.3550
邏輯迴歸算出來的是將事件發生的機率logit,因此咱們把他轉化回來,這樣更易於咱們理解它code
> train$p<-1/(1+exp(-1*train$lg_p)) #exp():天然對數e爲底指數函數 > summary(train$p) #機率值的概述 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0000001 0.0649300 0.4540000 0.4505000 0.7922000 0.9953000
對預測的模型進行評估orm
> test$lg_p<-predict(lg_ms, test)
咱們能夠直接利用pROC包,直接繪製ROC曲線和AUC值事件
> library(pROC) > modelroc <- roc(test$churn,test$lg_p) #test$churn爲預測的變量1爲流失,0爲未流失, test%lg_p則爲擬合的模型 > plot(modelroc,print.auc=T,auc.ploygon=T,grid=c(0.1,0.2), + grid.col=c('green','red'),max.auc.polygon=T, + auc.ploygon.col='skyblue', + print.thres=T) Call: roc.default(response = test$churn, predictor = test$lg_p) Data: test$lg_p in 597 controls (test$churn 0) < 442 cases (test$churn 1). Area under the curve: 0.9131
AUC值爲0.913,說明模型仍是相關能夠的ci
繪製ROC曲線,也能夠這樣,數據分析
> library(ROCR) There were 50 or more warnings (use warnings() to see the first 50) > pred_Te <- prediction(test$p, test$churn) > perf_Te <- performance(pred_Te,"tpr","fpr") > pred_Tr <- prediction(train$p, train$churn) > perf_Tr <- performance(pred_Tr,"tpr","fpr") > plot(perf_Te, col='blue',lty=1); > plot(perf_Tr, col='black',lty=2,add=TRUE); > abline(0,1,lty=2,col='red') > lr_m_auc<-round(as.numeric(performance(pred_Tr,'auc')@y.values),3) > lr_m_str<-paste("Mode_Train-AUC:",lr_m_auc,sep="") > legend(0.3,0.4,c(lr_m_str),2:8) > lr_m_auc<-round(as.numeric(performance(pred_Te,'auc')@y.values),3) > lr_m_ste<-paste("Mode_Test-AUC:",lr_m_auc,sep="") > legend(0.3,0.2,c(lr_m_ste),2:8)
固然這種繪製方法比較麻煩,推薦用第一種那種,這樣若是想要進行電信客戶流失的預警,就能夠用這個模型試試了,算出客戶流失的可能性有多大,針對可能流失的客戶針對性採起挽留方法。