客戶流失預警

時間 2019-12-20

原文原文鏈接

電信公司但願對有流失傾向的客戶進行挽留。數據分析人員從公司提早了客戶基本信息（出生年月等）、社區信息（社區平均收入等）和6個以內的業務信息（通話時間長等），附件爲數據及字段說明。git

導入數據和數據清洗dom

> setwd('C:\\Users\\Xu\\Desktop\\data')
> telecom_churn<-read.csv("telecom_churn.csv")
> telecom_churn<-na.omit(telecom_churn)
> attach(telecom_churn)

隨機抽樣，創建訓練集與測試集函數

> set.seed(100)
> select<-sample(1:nrow(telecom_churn),nrow(telecom_churn)*0.7)
> train=telecom_churn[select,]  #訓練集
> test=telecom_churn[-select,]  #樣本集

擬合模型，而且根據逐步迴歸選擇更優的模型，AIC越小模型越優測試

> lg<-glm(churn~AGE+edu_class+incomeCode+duration+feton+peakMinAv+peakMinDiff,family=binomial(link='logit'))
> summary(lg)

Call:
glm(formula = churn ~ AGE + edu_class + incomeCode + duration + 
    feton + peakMinAv + peakMinDiff, family = binomial(link = "logit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-3.07235  -0.58919  -0.03107   0.62704   2.73329  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.3456176  0.2522880  13.261  < 2e-16 ***
AGE         -0.0229400  0.0042145  -5.443 5.24e-08 ***
edu_class    0.4476978  0.0719659   6.221 4.94e-10 ***
incomeCode   0.0071214  0.0033891   2.101   0.0356 *  
duration    -0.2641532  0.0126729 -20.844  < 2e-16 ***
feton       -1.0583303  0.1153652  -9.174  < 2e-16 ***
peakMinAv    0.0001973  0.0004272   0.462   0.6441      #並不顯著
peakMinDiff -0.0028613  0.0003654  -7.830 4.87e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3336.6  on 2423  degrees of freedom
Residual deviance: 1898.2  on 2416  degrees of freedom
AIC: 1914.2

Number of Fisher Scoring iterations: 6

> lg_ms<-step(lg,direction = "both") #尋找最好的迴歸模型，both向前，向後迴歸
Start:  AIC=1914.21
churn ~ AGE + edu_class + incomeCode + duration + feton + peakMinAv + 
    peakMinDiff

              Df Deviance    AIC
- peakMinAv    1   1898.4 1912.4
<none>             1898.2 1914.2
- incomeCode   1   1902.6 1916.6
- AGE          1   1928.4 1942.4
- edu_class    1   1938.5 1952.5
- peakMinDiff  1   1967.8 1981.8
- feton        1   1985.8 1999.8
- duration     1   2954.6 2968.6

Step:  AIC=1912.43
churn ~ AGE + edu_class + incomeCode + duration + feton + peakMinDiff

              Df Deviance    AIC
<none>             1898.4 1912.4
+ peakMinAv    1   1898.2 1914.2
- incomeCode   1   1902.8 1914.8
- AGE          1   1930.8 1942.8
- edu_class    1   1938.7 1950.7
- peakMinDiff  1   1968.0 1980.0
- feton        1   1986.1 1998.1
- duration     1   2954.8 2966.8
> summary(lg_ms)

Call:
glm(formula = churn ~ AGE + edu_class + incomeCode + duration + 
    feton + peakMinDiff, family = binomial(link = "logit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-3.01069  -0.58881  -0.03095   0.62815   2.72788  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.3989882  0.2247215  15.125  < 2e-16 ***
AGE         -0.0233168  0.0041359  -5.638 1.72e-08 ***
edu_class    0.4478730  0.0719857   6.222 4.92e-10 ***
incomeCode   0.0070731  0.0033873   2.088   0.0368 *  
duration    -0.2641731  0.0126728 -20.846  < 2e-16 ***
feton       -1.0589440  0.1153516  -9.180  < 2e-16 ***
peakMinDiff -0.0028390  0.0003616  -7.851 4.13e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3336.6  on 2423  degrees of freedom
Residual deviance: 1898.4  on 2417  degrees of freedom
AIC: 1912.4   #這個AIC更小，因此這個模型更合適

Number of Fisher Scoring iterations: 6

選擇模型後對訓練集進行預測spa

> train$lg_p<-predict(lg_ms, train)  #訓練集中進行預測
> summary(train$lg_p)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-15.8300  -2.6670  -0.1845  -0.9950   1.3380   5.3550

邏輯迴歸算出來的是將事件發生的機率logit，因此咱們把他轉化回來，這樣更易於咱們理解它code

> train$p<-1/(1+exp(-1*train$lg_p)) #exp()：天然對數e爲底指數函數
> summary(train$p)  #機率值的概述
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000001 0.0649300 0.4540000 0.4505000 0.7922000 0.9953000

對預測的模型進行評估orm

> test$lg_p<-predict(lg_ms, test)

咱們能夠直接利用pROC包，直接繪製ROC曲線和AUC值事件

> library(pROC)
> modelroc <- roc(test$churn,test$lg_p)   #test$churn爲預測的變量1爲流失，0爲未流失， test%lg_p則爲擬合的模型
> plot(modelroc,print.auc=T,auc.ploygon=T,grid=c(0.1,0.2),
+      grid.col=c('green','red'),max.auc.polygon=T,
+      auc.ploygon.col='skyblue',
+      print.thres=T)

Call:
roc.default(response = test$churn, predictor = test$lg_p)

Data: test$lg_p in 597 controls (test$churn 0) < 442 cases (test$churn 1).
Area under the curve: 0.9131

AUC值爲0.913，說明模型仍是相關能夠的ci

繪製ROC曲線，也能夠這樣，數據分析

> library(ROCR)
There were 50 or more warnings (use warnings() to see the first 50)
> pred_Te <- prediction(test$p, test$churn)
> perf_Te <- performance(pred_Te,"tpr","fpr")
> pred_Tr <- prediction(train$p, train$churn)
> perf_Tr <- performance(pred_Tr,"tpr","fpr")
> plot(perf_Te, col='blue',lty=1);
> plot(perf_Tr, col='black',lty=2,add=TRUE);
> abline(0,1,lty=2,col='red')
> lr_m_auc<-round(as.numeric(performance(pred_Tr,'auc')@y.values),3)
> lr_m_str<-paste("Mode_Train-AUC:",lr_m_auc,sep="")
> legend(0.3,0.4,c(lr_m_str),2:8)
> lr_m_auc<-round(as.numeric(performance(pred_Te,'auc')@y.values),3)
> lr_m_ste<-paste("Mode_Test-AUC:",lr_m_auc,sep="")
> legend(0.3,0.2,c(lr_m_ste),2:8)

固然這種繪製方法比較麻煩，推薦用第一種那種，這樣若是想要進行電信客戶流失的預警，就能夠用這個模型試試了，算出客戶流失的可能性有多大，針對可能流失的客戶針對性採起挽留方法。

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。