Predicting Telecom Customer Churn in R with a k-Nearest-Neighbors (KNN) Model

Original article: http://tecdat.cn/?p=5521

 

Data background

A telephone company is interested in determining which customer characteristics are useful for predicting churn, that is, which customers will leave its service.

The data set is Churn. The fields are as follows:

 

state: discrete.
account length: continuous.
area code: continuous.
phone number: discrete.
international plan: discrete.
voice mail plan: discrete.
number vmail messages: continuous.
total day minutes: continuous.
total day calls: continuous.
total day charge: continuous.
total eve minutes: continuous.
total eve calls: continuous.
total eve charge: continuous.
total night minutes: continuous.
total night calls: continuous.
total night charge: continuous.
total intl minutes: continuous.
total intl calls: continuous.
total intl charge: continuous.
number customer service calls: continuous.
churn: discrete.

Data Preparation and Exploration 

 

Summary of the data:

## state account.length area.code phone.number
## WV : 158 Min. : 1.0 Min. :408.0 327-1058: 1
## MN : 125 1st Qu.: 73.0 1st Qu.:408.0 327-1319: 1
## AL : 124 Median :100.0 Median :415.0 327-2040: 1
## ID : 119 Mean :100.3 Mean :436.9 327-2475: 1
## VA : 118 3rd Qu.:127.0 3rd Qu.:415.0 327-3053: 1
## OH : 116 Max. :243.0 Max. :510.0 327-3587: 1
## (Other):4240 (Other) :4994
## international.plan voice.mail.plan number.vmail.messages
## no :4527 no :3677 Min. : 0.000
## yes: 473 yes:1323 1st Qu.: 0.000
## Median : 0.000
## Mean : 7.755
## 3rd Qu.:17.000
## Max. :52.000
## total.day.minutes total.day.calls total.day.charge total.eve.minutes
## Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0
## 1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4
## Median :180.1 Median :100 Median :30.62 Median :201.0
## Mean :180.3 Mean :100 Mean :30.65 Mean :200.6
## 3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1
## Max. :351.5 Max. :165 Max. :59.76 Max. :363.7
## total.eve.calls total.eve.charge total.night.minutes total.night.calls
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 87.0 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00
## Median :100.0 Median :17.09 Median :200.4 Median :100.00
## Mean :100.2 Mean :17.05 Mean :200.4 Mean : 99.92
## 3rd Qu.:114.0 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00
## Max. :170.0 Max. :30.91 Max. :395.0 Max. :175.00
## total.night.charge total.intl.minutes total.intl.calls total.intl.charge
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 7.510 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median : 9.020 Median :10.30 Median : 4.000 Median :2.780
## Mean : 9.018 Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:10.560 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :17.770 Max. :20.00 Max. :20.000 Max. :5.400
## number.customer.service.calls churn
## Min. :0.00 False.:4293
## 1st Qu.:1.00 True. : 707
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00

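The summary shows that the data contain no missing values. It also shows that the phone number and the area code carry no predictive value, so both can be dropped.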
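The check and the column removal described above could be sketched in base R as follows. The data frame `churn_df` is a hypothetical three-row stand-in with made-up values, not the real Churn data; only the column names follow the summary output.

```r
# Toy stand-in for the Churn data; the values are made up for illustration.
churn_df <- data.frame(
  state = c("WV", "MN", "AL"),
  account.length = c(128, 107, 137),
  area.code = c(415, 415, 408),
  phone.number = c("000-0001", "000-0002", "000-0003"),
  churn = c("False.", "False.", "True."),
  stringsAsFactors = TRUE
)

# Count missing values per column; all zeros means no missing data.
missing_per_column <- colSums(is.na(churn_df))
print(missing_per_column)

# Drop the identifier-like columns that carry no predictive value.
drop_cols <- c("phone.number", "area.code")
churn_clean <- churn_df[, !(names(churn_df) %in% drop_cols)]
print(names(churn_clean))
```

Dropping by name rather than by position keeps the step robust if the column order changes.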

 

Examine the variables graphically

 

   

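The results above show that samples with churn = no far outnumber those with churn = yes (4293 versus 707 in the summary), so non-churners make up the large majority of the sample.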
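As a small worked example, the class shares can be computed directly from the two counts in the summary output. The variable names here are illustrative only.

```r
# Class counts taken from the summary output (False.: 4293, True.: 707).
churn_counts <- c("False." = 4293, "True." = 707)

# Share of each class in the full sample.
churn_share <- churn_counts / sum(churn_counts)
print(round(churn_share, 3))
```

Roughly 14% of customers churn, which is worth keeping in mind when reading accuracy figures later on.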

 

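The results above show that, apart from emailcode and areacode (presumably number.vmail.messages and area.code), the numeric variables are approximately normally distributed.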

## account.length area.code number.vmail.messages total.day.minutes
## Min. : 1.0 Min. :408.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 73.0 1st Qu.:408.0 1st Qu.: 0.000 1st Qu.:143.7
## Median :100.0 Median :415.0 Median : 0.000 Median :180.1
## Mean :100.3 Mean :436.9 Mean : 7.755 Mean :180.3
## 3rd Qu.:127.0 3rd Qu.:415.0 3rd Qu.:17.000 3rd Qu.:216.2
## Max. :243.0 Max. :510.0 Max. :52.000 Max. :351.5
## total.day.calls total.day.charge total.eve.minutes total.eve.calls
## Min. : 0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0
## Median :100 Median :30.62 Median :201.0 Median :100.0
## Mean :100 Mean :30.65 Mean :200.6 Mean :100.2
## 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0
## Max. :165 Max. :59.76 Max. :363.7 Max. :170.0
## total.eve.charge total.night.minutes total.night.calls total.night.charge
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510
## Median :17.09 Median :200.4 Median :100.00 Median : 9.020
## Mean :17.05 Mean :200.4 Mean : 99.92 Mean : 9.018
## 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560
## Max. :30.91 Max. :395.0 Max. :175.00 Max. :17.770
## total.intl.minutes total.intl.calls total.intl.charge
## Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median :10.30 Median : 4.000 Median :2.780
## Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :20.00 Max. :20.000 Max. :5.400
## number.customer.service.calls
## Min. :0.00
## 1st Qu.:1.00
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00

Relationships between variables

The results show a significant positive linear relationship between the two variables.


 

Using the statistics node, report

## account.length area.code
## account.length 1.0000000000 -0.018054187
## area.code -0.0180541874 1.000000000
## number.vmail.messages -0.0145746663 -0.003398983
## total.day.minutes -0.0010174908 -0.019118245
## total.day.calls 0.0282402279 -0.019313854
## total.day.charge -0.0010191980 -0.019119256
## total.eve.minutes -0.0095913331 0.007097877
## total.eve.calls 0.0091425790 -0.012299947
## total.eve.charge -0.0095873958 0.007114130
## total.night.minutes 0.0006679112 0.002083626
## total.night.calls -0.0078254785 0.014656846
## total.night.charge 0.0006558937 0.002070264
## total.intl.minutes 0.0012908394 -0.004153729
## total.intl.calls 0.0142772733 -0.013623309
## total.intl.charge 0.0012918112 -0.004219099
## number.customer.service.calls -0.0014447918 0.020920513
## number.vmail.messages total.day.minutes
## account.length -0.0145746663 -0.001017491
## area.code -0.0033989831 -0.019118245
## number.vmail.messages 1.0000000000 0.005381376
## total.day.minutes 0.0053813760 1.000000000
## total.day.calls 0.0008831280 0.001935149
## total.day.charge 0.0053767959 0.999999951
## total.eve.minutes 0.0194901208 -0.010750427
## total.eve.calls -0.0039543728 0.008128130
## total.eve.charge 0.0194959757 -0.010760022
## total.night.minutes 0.0055413838 0.011798660
## total.night.calls 0.0026762202 0.004236100
## total.night.charge 0.0055349281 0.011782533
## total.intl.minutes 0.0024627018 -0.019485746
## total.intl.calls 0.0001243302 -0.001303123
## total.intl.charge 0.0025051773 -0.019414797
## number.customer.service.calls -0.0070856427 0.002732576
## total.day.calls total.day.charge
## account.length 0.0282402279 -0.001019198
## area.code -0.0193138545 -0.019119256
## number.vmail.messages 0.0008831280 0.005376796
## total.day.minutes 0.0019351487 0.999999951
## total.day.calls 1.0000000000 0.001935884
## total.day.charge 0.0019358844 1.000000000
## total.eve.minutes -0.0006994115 -0.010747297
## total.eve.calls 0.0037541787 0.008129319
## total.eve.charge -0.0006952217 -0.010756893
## total.night.minutes 0.0028044650 0.011801434
## total.night.calls -0.0083083467 0.004234934
## total.night.charge 0.0028018169 0.011785301
## total.intl.minutes 0.0130972198 -0.019489700
## total.intl.calls 0.0108928533 -0.001306635
## total.intl.charge 0.0131613976 -0.019418755
## number.customer.service.calls -0.0107394951 0.002726370
## total.eve.minutes total.eve.calls
## account.length -0.0095913331 0.009142579
## area.code 0.0070978766 -0.012299947
## number.vmail.messages 0.0194901208 -0.003954373
## total.day.minutes -0.0107504274 0.008128130
## total.day.calls -0.0006994115 0.003754179
## total.day.charge -0.0107472968 0.008129319
## total.eve.minutes 1.0000000000 0.002763019
## total.eve.calls 0.0027630194 1.000000000
## total.eve.charge 0.9999997749 0.002778097
## total.night.minutes -0.0166391160 0.001781411
## total.night.calls 0.0134202163 -0.013682341
## total.night.charge -0.0166420421 0.001799380
## total.intl.minutes 0.0001365487 -0.007458458
## total.intl.calls 0.0083881559 0.005574500
## total.intl.charge 0.0001593155 -0.007507151
## number.customer.service.calls -0.0138234228 0.006234831
## total.eve.charge total.night.minutes
## account.length -0.0095873958 0.0006679112
## area.code 0.0071141298 0.0020836263
## number.vmail.messages 0.0194959757 0.0055413838
## total.day.minutes -0.0107600217 0.0117986600
## total.day.calls -0.0006952217 0.0028044650
## total.day.charge -0.0107568931 0.0118014339
## total.eve.minutes 0.9999997749 -0.0166391160
## total.eve.calls 0.0027780971 0.0017814106
## total.eve.charge 1.0000000000 -0.0166489191
## total.night.minutes -0.0166489191 1.0000000000
## total.night.calls 0.0134220174 0.0269718182
## total.night.charge -0.0166518367 0.9999992072
## total.intl.minutes 0.0001320238 -0.0067209669
## total.intl.calls 0.0083930603 -0.0172140162
## total.intl.charge 0.0001547783 -0.0066545873
## number.customer.service.calls -0.0138363623 -0.0085325365
 
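Keeping highly correlated variables in the model may cause multicollinearity, so one variable from each highly correlated pair needs to be removed. In the matrix above, for example, total.day.minutes and total.day.charge have a correlation of 0.999999951, so the two carry essentially the same information.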
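One way to automate this removal is sketched below in base R. It rests on stated assumptions: the data are synthetic, the flat rate of 0.17 per minute is only a guess based on the ratio of the summary maxima (59.76 / 351.5), and the 0.9 threshold is an arbitrary illustrative choice rather than the article's.

```r
set.seed(1)
n <- 200
minutes <- runif(n, 0, 350)

# Synthetic data: charge is an exact multiple of minutes (assumed flat rate).
toy <- data.frame(
  total.day.minutes = minutes,
  total.day.charge = 0.17 * minutes,
  total.day.calls = rpois(n, 100)
)

cor_mat <- cor(toy)

# Zero out the lower triangle and diagonal so each pair is checked once.
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0
high <- which(abs(cor_mat) > 0.9, arr.ind = TRUE)

# For each highly correlated pair, drop the second (column) variable.
to_drop <- unique(colnames(cor_mat)[high[, "col"]])
toy_reduced <- toy[, !(names(toy) %in% to_drop), drop = FALSE]

print(to_drop)
print(names(toy_reduced))
```

Which member of a pair to keep is a modelling decision; this sketch simply keeps the one that appears first in the column order.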

Data Manipulation

 
The results show a certain degree of correlation between total.day.calls and total.day.charge.
In particular, the correlation is negative among customers with voice.mail.plan = no.

 

Discretize (make categorical) a relevant numeric variable

 

 

 

Discretize the variable.
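A minimal sketch of the discretization with `cut()` follows. The article does not state its cut points, so the breaks below (the quartile values 143.7 and 216.2 from the summary output) are an assumption. The labels "short" and "medium" appear in the model output later in the article; "long" is inferred as the third level.

```r
# A few illustrative values of total.day.minutes.
total.day.minutes <- c(0, 95.5, 143.7, 180.1, 216.2, 351.5)

# Assumed cut points: the article does not give them, so the quartile
# values from the summary output are used purely for illustration.
total.day.minutes1 <- cut(
  total.day.minutes,
  breaks = c(-Inf, 143.7, 216.2, Inf),
  labels = c("short", "medium", "long")
)

print(table(total.day.minutes1))
```

By default `cut()` uses right-closed intervals, so a value equal to a break falls into the lower bin.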

 

construct a distribution of the variable with a churn overlay

construct a histogram of the variable with a churn overlay

 

 

Find a pair of numeric variables which are interesting with respect to churn.

 
The results show a certain degree of correlation between total.day.calls and total.day.charge. In particular, the relationship is visible among customers with churn = no.

Model Building
 

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3082150 0.0735760 4.189 2.85e-05 ***
## stateAL 0.0151188 0.0462343 0.327 0.743680
## stateAR 0.0894792 0.0490897 1.823 0.068399 .
## stateAZ 0.0329566 0.0494195 0.667 0.504883
## stateCA 0.1951511 0.0567439 3.439 0.000588 ***
## international.plan yes 0.3059341 0.0151677 20.170 < 2e-16 ***
## voice.mail.plan yes -0.1375056 0.0337533 -4.074 4.70e-05 ***
## number.vmail.messages 0.0017068 0.0010988 1.553 0.120402
## total.day.minutes 0.3796323 0.2629027 1.444 0.148802
## total.day.calls 0.0002191 0.0002235 0.981 0.326781
## total.day.charge -2.2207671 1.5464583 -1.436 0.151056
## total.eve.minutes 0.0288233 0.1307496 0.220 0.825533
## total.eve.calls -0.0001585 0.0002238 -0.708 0.478915
## total.eve.charge -0.3316041 1.5382391 -0.216 0.829329
## total.night.minutes 0.0083224 0.0695916 0.120 0.904814
## total.night.calls -0.0001824 0.0002225 -0.820 0.412290
## total.night.charge -0.1760782 1.5464674 -0.114 0.909355
## total.intl.minutes -0.0104679 0.4192270 -0.025 0.980080
## total.intl.calls -0.0063448 0.0018062 -3.513 0.000447 ***
## total.intl.charge 0.0676460 1.5528267 0.044 0.965254
## number.customer.service.calls 0.0566474 0.0033945 16.688 < 2e-16 ***
## total.day.minutes1medium 0.0502681 0.0160228 3.137 0.001715 **
## total.day.minutes1short 0.2404020 0.0322293 7.459 1.02e-13 ***

 

The results show that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have a significant effect.
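The coefficient table reports t values, which is the form `lm()` produces, so the sketch below assumes a linear model fitted to a 0/1 churn indicator. The article does not show the actual call, and the data and effect sizes here are simulated for illustration only.

```r
set.seed(42)
n <- 500

# Simulated predictors; the distributions are made up for illustration.
toy <- data.frame(
  international.plan = factor(
    sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1))
  ),
  number.customer.service.calls = rpois(n, 1.5)
)

# Simulated churn probability rising with both predictors (made-up effects).
p <- pmin(
  0.05 + 0.30 * (toy$international.plan == "yes") +
    0.06 * toy$number.customer.service.calls,
  1
)
toy$churn <- rbinom(n, 1, p)

# Linear model on the 0/1 outcome; summary() gives the coefficient table.
fit <- lm(churn ~ international.plan + number.customer.service.calls, data = toy)
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))
```

For a binary outcome, a logistic regression via `glm(..., family = binomial)` is the more common choice; the linear form is used here only because it matches the shape of the output shown.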

Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn

## Direction.2005
## knn.pred 1 2
## 1 760 97
## 2 100 43
[1] 0.803
 
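A confusion matrix is a tabular summary used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. In the usual description, each column of the matrix represents the instances of a predicted class and each row represents the instances of an actual class. In the tables shown here the layout is transposed: the rows are the predicted classes (knn.pred) and the columns are the actual classes.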
## Direction.2005
## knn.pred 1 2
## 1 827 104
## 2 33 36
[1] 0.863
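A table like the ones above could be produced along the following lines. This is a minimal sketch using `knn()` from the `class` package on synthetic data. The feature names, the class separation, the train/test split, and `k = 5` are all assumptions; the article does not show its code or state the value of k.

```r
library(class)  # provides knn()

set.seed(123)
n <- 300

# Synthetic labels and two features; churners are simulated with more
# service calls and more day minutes (made-up effect sizes).
churn <- factor(rep(c("no", "yes"), times = c(240, 60)))
x <- data.frame(
  service.calls = rpois(n, ifelse(churn == "yes", 4, 1)),
  day.minutes = rnorm(n, ifelse(churn == "yes", 230, 175), 40)
)

# KNN is distance based, so put the features on a common scale first
# (scaled on all rows here for brevity).
x_scaled <- scale(x)

# Random split: 200 training rows, 100 test rows.
train_idx <- sample(n, 200)
pred <- knn(
  train = x_scaled[train_idx, ],
  test = x_scaled[-train_idx, ],
  cl = churn[train_idx],
  k = 5
)

# Rows are predictions, columns are actual classes, as in the output above.
conf_mat <- table(knn.pred = pred, actual = churn[-train_idx])
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

In practice the scaling parameters would normally be estimated on the training rows only and then applied to the test rows, and k would be chosen by comparing several values.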

 

On the test set, the accuracy reaches 86%.
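The reported accuracies can be re-derived from the two printed confusion matrices. The sketch below also computes a simple majority-class baseline for comparison. The helper names `accuracy` and `majority_baseline` are illustrative, and the baseline is an added reference point rather than something the article reports.

```r
# Confusion matrices copied from the printed output
# (rows: knn.pred, columns: actual class). matrix() fills column by column.
labels <- list(knn.pred = c("1", "2"), actual = c("1", "2"))
first <- matrix(c(760, 100, 97, 43), nrow = 2, dimnames = labels)
second <- matrix(c(827, 33, 104, 36), nrow = 2, dimnames = labels)

# Accuracy: share of cases on the diagonal.
accuracy <- function(m) sum(diag(m)) / sum(m)

# Baseline: accuracy of always predicting the most frequent actual class.
majority_baseline <- function(m) max(colSums(m)) / sum(m)

print(c(accuracy(first), accuracy(second)))
print(c(majority_baseline(first), majority_baseline(second)))
```

Both tables have the same column totals (860 and 140), so always predicting class 1 would already score 0.86. Read against that baseline, 0.863 is only a marginal improvement, which is worth weighing when judging how well the model separates churners.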

 

Findings

 

There is a certain degree of correlation between total.day.calls and total.day.charge, particularly among customers with churn = no. The variables state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium, and total.day.minutes1short have a significant effect. Finally, the KNN model reaches an accuracy of about 80% on the training set and 86% on the test set, which suggests that the model predicts well.
 
