Predicting Telecom Customer Churn in R with a k-Nearest Neighbors (kNN) Model

Date: 2020-08-06

Original article: http://tecdat.cn/?p=5521

Data background

A telephone company is interested in determining which customer characteristics are useful for predicting churn, that is, which customers will leave its service.

The data set is Churn. The fields are as follows:

state: discrete
account length: continuous
area code: continuous
phone number: discrete
international plan: discrete
voice mail plan: discrete
number vmail messages: continuous
total day minutes: continuous
total day calls: continuous
total day charge: continuous
total eve minutes: continuous
total eve calls: continuous
total eve charge: continuous
total night minutes: continuous
total night calls: continuous
total night charge: continuous
total intl minutes: continuous
total intl calls: continuous
total intl charge: continuous
number customer service calls: continuous
churn: discrete

Data Preparation and Exploration

Inspect a summary of the data:
## state account.length area.code phone.number
## WV : 158 Min. : 1.0 Min. :408.0 327-1058: 1
## MN : 125 1st Qu.: 73.0 1st Qu.:408.0 327-1319: 1
## AL : 124 Median :100.0 Median :415.0 327-2040: 1
## ID : 119 Mean :100.3 Mean :436.9 327-2475: 1
## VA : 118 3rd Qu.:127.0 3rd Qu.:415.0 327-3053: 1
## OH : 116 Max. :243.0 Max. :510.0 327-3587: 1
## (Other):4240 (Other) :4994
## international.plan voice.mail.plan number.vmail.messages
## no :4527 no :3677 Min. : 0.000
## yes: 473 yes:1323 1st Qu.: 0.000
## Median : 0.000
## Mean : 7.755
## 3rd Qu.:17.000
## Max. :52.000
## total.day.minutes total.day.calls total.day.charge total.eve.minutes
## Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0
## 1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4
## Median :180.1 Median :100 Median :30.62 Median :201.0
## Mean :180.3 Mean :100 Mean :30.65 Mean :200.6
## 3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1
## Max. :351.5 Max. :165 Max. :59.76 Max. :363.7
## total.eve.calls total.eve.charge total.night.minutes total.night.calls
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 87.0 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00
## Median :100.0 Median :17.09 Median :200.4 Median :100.00
## Mean :100.2 Mean :17.05 Mean :200.4 Mean : 99.92
## 3rd Qu.:114.0 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00
## Max. :170.0 Max. :30.91 Max. :395.0 Max. :175.00
## total.night.charge total.intl.minutes total.intl.calls total.intl.charge
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 7.510 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median : 9.020 Median :10.30 Median : 4.000 Median :2.780
## Mean : 9.018 Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:10.560 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :17.770 Max. :20.00 Max. :20.000 Max. :5.400
## number.customer.service.calls churn
## Min. :0.00 False.:4293
## 1st Qu.:1.00 True. : 707
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00

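The summary shows that there are no missing values. It also shows that the phone number and the area code carry no predictive value, so both can be dropped.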
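
The inspection and clean-up step described above can be sketched in base R as follows. The original article does not show its code, so this is only an illustration: `churn_df` is a small synthetic stand-in with made-up values, and the object names are assumptions rather than the author's.

```r
# Small synthetic stand-in for the Churn data (values are made up for illustration).
set.seed(1)
n <- 20
churn_df <- data.frame(
  state = factor(sample(c("WV", "MN", "AL"), n, replace = TRUE)),
  account.length = sample(1:243, n, replace = TRUE),
  area.code = sample(c(408, 415, 510), n, replace = TRUE),
  phone.number = sprintf("327-%04d", sample(1000:9999, n)),
  total.day.minutes = round(runif(n, 0, 350), 1),
  churn = factor(sample(c("False.", "True."), n, replace = TRUE,
                        prob = c(0.86, 0.14)))
)

# Per-column overview, as in the output shown above.
summary(churn_df)

# Count missing values in every column.
na_counts <- colSums(is.na(churn_df))
print(na_counts)

# Drop the identifier-like columns that carry no predictive value.
drop_cols <- c("phone.number", "area.code")
churn_clean <- churn_df[, !(names(churn_df) %in% drop_cols), drop = FALSE]
print(names(churn_clean))
```

With the real data, `churn_df` would be replaced by the loaded Churn data frame; the rest of the steps stay the same.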

Examine the variables graphically

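The results above show that samples with churn = no far outnumber samples with churn = yes, so non-churners make up the large majority of the data (4293 of 5000 customers, about 86%).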
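
As a quick check of the class balance, the shares can be computed directly from the churn counts printed in the data summary (4293 and 707). This is a minimal sketch; the variable names are illustrative and not taken from the original article.

```r
# Class counts copied from the churn column of the data summary.
churn_counts <- c("False." = 4293, "True." = 707)

# Share of each class among all customers.
churn_share <- churn_counts / sum(churn_counts)
print(round(churn_share, 3))

# Bar chart of the class counts (goes to Rplots.pdf when run non-interactively).
barplot(churn_counts, main = "Churn class counts", ylab = "Customers")
```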

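The results above also show that, apart from emailcode (presumably number.vmail.messages) and area.code, the remaining numeric variables are approximately normally distributed.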

## account.length area.code number.vmail.messages total.day.minutes
## Min. : 1.0 Min. :408.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 73.0 1st Qu.:408.0 1st Qu.: 0.000 1st Qu.:143.7
## Median :100.0 Median :415.0 Median : 0.000 Median :180.1
## Mean :100.3 Mean :436.9 Mean : 7.755 Mean :180.3
## 3rd Qu.:127.0 3rd Qu.:415.0 3rd Qu.:17.000 3rd Qu.:216.2
## Max. :243.0 Max. :510.0 Max. :52.000 Max. :351.5
## total.day.calls total.day.charge total.eve.minutes total.eve.calls
## Min. : 0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0
## Median :100 Median :30.62 Median :201.0 Median :100.0
## Mean :100 Mean :30.65 Mean :200.6 Mean :100.2
## 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0
## Max. :165 Max. :59.76 Max. :363.7 Max. :170.0
## total.eve.charge total.night.minutes total.night.calls total.night.charge
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510
## Median :17.09 Median :200.4 Median :100.00 Median : 9.020
## Mean :17.05 Mean :200.4 Mean : 99.92 Mean : 9.018
## 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560
## Max. :30.91 Max. :395.0 Max. :175.00 Max. :17.770
## total.intl.minutes total.intl.calls total.intl.charge
## Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median :10.30 Median : 4.000 Median :2.780
## Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :20.00 Max. :20.000 Max. :5.400
## number.customer.service.calls
## Min. :0.00
## 1st Qu.:1.00
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00

Relationships between variables

The result shows a significant positive linear relationship between the two variables.

Using the statistics node, report

## account.length area.code
## account.length 1.0000000000 -0.018054187
## area.code -0.0180541874 1.000000000
## number.vmail.messages -0.0145746663 -0.003398983
## total.day.minutes -0.0010174908 -0.019118245
## total.day.calls 0.0282402279 -0.019313854
## total.day.charge -0.0010191980 -0.019119256
## total.eve.minutes -0.0095913331 0.007097877
## total.eve.calls 0.0091425790 -0.012299947
## total.eve.charge -0.0095873958 0.007114130
## total.night.minutes 0.0006679112 0.002083626
## total.night.calls -0.0078254785 0.014656846
## total.night.charge 0.0006558937 0.002070264
## total.intl.minutes 0.0012908394 -0.004153729
## total.intl.calls 0.0142772733 -0.013623309
## total.intl.charge 0.0012918112 -0.004219099
## number.customer.service.calls -0.0014447918 0.020920513
## number.vmail.messages total.day.minutes
## account.length -0.0145746663 -0.001017491
## area.code -0.0033989831 -0.019118245
## number.vmail.messages 1.0000000000 0.005381376
## total.day.minutes 0.0053813760 1.000000000
## total.day.calls 0.0008831280 0.001935149
## total.day.charge 0.0053767959 0.999999951
## total.eve.minutes 0.0194901208 -0.010750427
## total.eve.calls -0.0039543728 0.008128130
## total.eve.charge 0.0194959757 -0.010760022
## total.night.minutes 0.0055413838 0.011798660
## total.night.calls 0.0026762202 0.004236100
## total.night.charge 0.0055349281 0.011782533
## total.intl.minutes 0.0024627018 -0.019485746
## total.intl.calls 0.0001243302 -0.001303123
## total.intl.charge 0.0025051773 -0.019414797
## number.customer.service.calls -0.0070856427 0.002732576
## total.day.calls total.day.charge
## account.length 0.0282402279 -0.001019198
## area.code -0.0193138545 -0.019119256
## number.vmail.messages 0.0008831280 0.005376796
## total.day.minutes 0.0019351487 0.999999951
## total.day.calls 1.0000000000 0.001935884
## total.day.charge 0.0019358844 1.000000000
## total.eve.minutes -0.0006994115 -0.010747297
## total.eve.calls 0.0037541787 0.008129319
## total.eve.charge -0.0006952217 -0.010756893
## total.night.minutes 0.0028044650 0.011801434
## total.night.calls -0.0083083467 0.004234934
## total.night.charge 0.0028018169 0.011785301
## total.intl.minutes 0.0130972198 -0.019489700
## total.intl.calls 0.0108928533 -0.001306635
## total.intl.charge 0.0131613976 -0.019418755
## number.customer.service.calls -0.0107394951 0.002726370
## total.eve.minutes total.eve.calls
## account.length -0.0095913331 0.009142579
## area.code 0.0070978766 -0.012299947
## number.vmail.messages 0.0194901208 -0.003954373
## total.day.minutes -0.0107504274 0.008128130
## total.day.calls -0.0006994115 0.003754179
## total.day.charge -0.0107472968 0.008129319
## total.eve.minutes 1.0000000000 0.002763019
## total.eve.calls 0.0027630194 1.000000000
## total.eve.charge 0.9999997749 0.002778097
## total.night.minutes -0.0166391160 0.001781411
## total.night.calls 0.0134202163 -0.013682341
## total.night.charge -0.0166420421 0.001799380
## total.intl.minutes 0.0001365487 -0.007458458
## total.intl.calls 0.0083881559 0.005574500
## total.intl.charge 0.0001593155 -0.007507151
## number.customer.service.calls -0.0138234228 0.006234831
## total.eve.charge total.night.minutes
## account.length -0.0095873958 0.0006679112
## area.code 0.0071141298 0.0020836263
## number.vmail.messages 0.0194959757 0.0055413838
## total.day.minutes -0.0107600217 0.0117986600
## total.day.calls -0.0006952217 0.0028044650
## total.day.charge -0.0107568931 0.0118014339
## total.eve.minutes 0.9999997749 -0.0166391160
## total.eve.calls 0.0027780971 0.0017814106
## total.eve.charge 1.0000000000 -0.0166489191
## total.night.minutes -0.0166489191 1.0000000000
## total.night.calls 0.0134220174 0.0269718182
## total.night.charge -0.0166518367 0.9999992072
## total.intl.minutes 0.0001320238 -0.0067209669
## total.intl.calls 0.0083930603 -0.0172140162
## total.intl.charge 0.0001547783 -0.0066545873
## number.customer.service.calls -0.0138363623 -0.0085325365

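Keeping highly correlated variables in the model may cause multicollinearity, so one variable of each highly correlated pair should be removed. In the matrix above, for example, total.day.minutes and total.day.charge have a correlation of 0.99999995.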
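
One way to find and drop such pairs is sketched below in base R. The data are synthetic, the per-minute rate of 0.17 and the threshold of 0.9 are assumptions chosen for illustration, and dropping the second variable of each pair is just one possible rule, not necessarily what the original analysis did.

```r
set.seed(42)
n <- 200
minutes <- runif(n, 0, 350)

# Synthetic numeric columns; the charge is an (assumed) fixed rate times minutes.
num_df <- data.frame(
  total.day.minutes = minutes,
  total.day.charge  = round(0.17 * minutes, 2),
  total.day.calls   = rpois(n, 100)
)

cor_mat <- cor(num_df)

# Locate pairs above the threshold, looking only at the upper triangle
# so that each pair is reported once and the diagonal is ignored.
threshold <- 0.9
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)

# Keep the first variable of each pair and drop the second.
reduced <- num_df[, !(names(num_df) %in% high_pairs$var2), drop = FALSE]
print(names(reduced))
```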

Data Manipulation

The result shows some relationship between total.day.calls and total.day.charge.

In particular, the relationship is negative among customers with voice.mail.plan = no.

Discretize (make categorical) a relevant numeric variable

Discretize the variable.
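
A numeric variable can be discretized with `cut()`. The sketch below bins total.day.minutes into short, medium, and long; the name `total.day.minutes1` and the labels follow the coefficient names that appear in the regression output of this article, while the cut points (120 and 240) are hypothetical, since the original thresholds are not given.

```r
# A few day-minute values taken from the range shown in the data summary.
day_minutes <- c(0, 95.5, 143.7, 180.1, 216.2, 351.5)

# Hypothetical cut points; the original article does not state its thresholds.
breaks <- c(-Inf, 120, 240, Inf)

# Intervals are right-closed by default: (-Inf,120], (120,240], (240,Inf].
total.day.minutes1 <- cut(day_minutes, breaks = breaks,
                          labels = c("short", "medium", "long"))

print(table(total.day.minutes1))
```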

construct a distribution of the variable with a churn overlay

construct a histogram of the variable with a churn overlay

Find a pair of numeric variables which are interesting with respect to churn.

The result shows some relationship between total.day.calls and total.day.charge.

Model Building

In particular, the relationship appears among customers with churn = no.

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3082150 0.0735760 4.189 2.85e-05 ***
## stateAL 0.0151188 0.0462343 0.327 0.743680
## stateAR 0.0894792 0.0490897 1.823 0.068399 .
## stateAZ 0.0329566 0.0494195 0.667 0.504883
## stateCA 0.1951511 0.0567439 3.439 0.000588 ***
## international.plan yes 0.3059341 0.0151677 20.170 < 2e-16 ***
## voice.mail.plan yes -0.1375056 0.0337533 -4.074 4.70e-05 ***
## number.vmail.messages 0.0017068 0.0010988 1.553 0.120402
## total.day.minutes 0.3796323 0.2629027 1.444 0.148802
## total.day.calls 0.0002191 0.0002235 0.981 0.326781
## total.day.charge -2.2207671 1.5464583 -1.436 0.151056
## total.eve.minutes 0.0288233 0.1307496 0.220 0.825533
## total.eve.calls -0.0001585 0.0002238 -0.708 0.478915
## total.eve.charge -0.3316041 1.5382391 -0.216 0.829329
## total.night.minutes 0.0083224 0.0695916 0.120 0.904814
## total.night.calls -0.0001824 0.0002225 -0.820 0.412290
## total.night.charge -0.1760782 1.5464674 -0.114 0.909355
## total.intl.minutes -0.0104679 0.4192270 -0.025 0.980080
## total.intl.calls -0.0063448 0.0018062 -3.513 0.000447 ***
## total.intl.charge 0.0676460 1.5528267 0.044 0.965254
## number.customer.service.calls 0.0566474 0.0033945 16.688 < 2e-16 ***
## total.day.minutes1medium 0.0502681 0.0160228 3.137 0.001715 **
## total.day.minutes1short 0.2404020 0.0322293 7.459 1.02e-13 ***

The results show that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have a significant effect.
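
The column headers of the coefficient table (Estimate, Std. Error, t value, Pr(>|t|)) match what `summary()` prints for a linear model. One plausible reading, and only an assumption here, is that the table came from `lm()` fitted to a 0/1 churn indicator. The sketch below shows that pattern on synthetic data with two predictors; it is not the author's model.

```r
set.seed(7)
n <- 300

# Synthetic predictors (made-up values for illustration).
toy <- data.frame(
  international.plan = factor(sample(c("no", "yes"), n, replace = TRUE,
                                     prob = c(0.9, 0.1))),
  number.customer.service.calls = rpois(n, 1.5)
)

# Synthetic churn indicator whose probability rises with both predictors.
p <- 0.05 + 0.30 * (toy$international.plan == "yes") +
  0.05 * toy$number.customer.service.calls
toy$churn01 <- rbinom(n, 1, pmin(p, 0.95))

# Linear model on the 0/1 outcome; summary() reports t values and p-values.
fit <- lm(churn01 ~ international.plan + number.customer.service.calls,
          data = toy)
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))
```

For a binary outcome, `glm(..., family = binomial)` is the more common choice; it reports z values instead of t values.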

Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn

## Direction.2005
## knn.pred 1 2
## 1 760 97
## 2 100 43
[1] 0.803

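A confusion matrix is a visualization tool used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. In the tables shown here, each row corresponds to a predicted class (knn.pred) and each column to an actual class.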
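
The accuracy reported under the first table (0.803) is the share of cases on the diagonal of the confusion matrix. The short sketch below rebuilds that table from its four printed counts and recomputes the value; the object names are illustrative.

```r
# Counts copied from the first confusion matrix (rows: knn.pred, columns: actual).
# matrix() fills column by column, so the first column is (760, 100).
conf_mat <- matrix(c(760, 100, 97, 43), nrow = 2,
                   dimnames = list(knn.pred = c("1", "2"),
                                   actual = c("1", "2")))
print(conf_mat)

# Accuracy = correctly classified cases (diagonal) / all cases.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```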

## Direction.2005
## knn.pred 1 2
## 1 827 104
## 2 33 36
[1] 0.863

On the test set, the accuracy reaches 86%.
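
The original article shows only the kNN output, not the code. The sketch below illustrates one common way to produce such a table with `knn()` from the `class` package, a recommended package that ships with most R installations. The synthetic data, the 70/30 split, the two predictors, and k = 5 are all assumptions for illustration.

```r
library(class)  # provides knn()

set.seed(123)
n <- 400

# Synthetic two-class data (made-up values; roughly 15% churners).
churn <- factor(sample(c("False.", "True."), n, replace = TRUE,
                       prob = c(0.85, 0.15)))
is_churn <- churn == "True."
x <- data.frame(
  total.day.minutes = rnorm(n, mean = ifelse(is_churn, 220, 175), sd = 50),
  number.customer.service.calls = rpois(n, lambda = ifelse(is_churn, 3, 1.5))
)

# Random 70/30 train/test split.
train_idx <- sample(n, size = round(0.7 * n))

# kNN is distance-based, so standardize the predictors; the test set is
# scaled with the training means and standard deviations.
train_x <- scale(x[train_idx, ])
test_x <- scale(x[-train_idx, ],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

# Classify each test case by majority vote among its 5 nearest neighbors.
knn.pred <- knn(train = train_x, test = test_x, cl = churn[train_idx], k = 5)

# Confusion matrix (rows: predicted, columns: actual) and accuracy.
conf_mat <- table(knn.pred, actual = churn[-train_idx])
print(conf_mat)
accuracy <- mean(knn.pred == churn[-train_idx])
print(accuracy)
```

Because the classes are imbalanced, it is worth comparing the accuracy with the share of the majority class, which is what a model that always predicts "no churn" would achieve.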

Findings

We find some relationship between total.day.calls and total.day.charge, particularly among customers with churn = no. The variables state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have a significant effect. Finally, the kNN model reaches an accuracy of about 80% on the training set and 86% on the test set, which suggests that the model predicts well.
