How to Reach the Top 2% in a Kaggle Machine Learning Competition

This is an original article, first published on the author's personal blog. If you repost it, please credit the source prominently at the top.

Abstract

This article details how to reach the top 2% of the leaderboard in Kaggle's Titanic survival prediction competition, a classification problem, through data preview, exploratory data analysis, missing-value imputation, removal of correlated features, and derivation of new features.

Competition Overview

Titanic survival prediction is one of the most entered competitions on Kaggle. It asks participants to analyze, from the training set, what kinds of people were more likely to survive, and to predict whether each passenger in the test set survived.

This is a binary classification problem.

How to Reach the Top 2%

Loading the Data

Before loading the data, load all the R libraries that will be used later:

library(readr) # File read / write
library(ggplot2) # Data visualization
library(ggthemes) # Data visualization
library(scales) # Data visualization
library(plyr)
library(stringr) # String manipulation
library(InformationValue) # IV / WOE calculation
library(MLmetrics) # Machine learning metrics, e.g. Recall, Precision, Accuracy, AUC
library(rpart) # Decision tree utils
library(randomForest) # Random Forest
library(dplyr) # Data manipulation
library(e1071) # SVM
library(Amelia) # Missing value utils
library(party) # Conditional inference trees
library(gbm) # AdaBoost
library(class) # KNN

Load the training and test data into data.frames named train and test:

train <- read_csv("train.csv")
test <- read_csv("test.csv")

Because the same transformations must later be applied to both the training and test data, it is advisable to combine the two datasets and operate on them together. This avoids duplicated work and inconsistencies, and also prevents the problem of new categorical levels appearing only in the test set.

data <- bind_rows(train, test)
train.row <- 1:nrow(train)
test.row <- (1 + nrow(train)):(nrow(train) + nrow(test))

Data Preview

First, inspect the data:

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  NA "C85" NA "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

As shown above, the dataset contains 12 variables and 1309 rows, of which 891 are training data and 418 are test data:

  • PassengerId: integer; the passenger ID, a monotonically increasing index with no predictive value
  • Survived: integer; whether the passenger survived (0 = perished, 1 = survived); easier to handle once converted to a factor
  • Pclass: integer; the passenger's socio-economic class (1 = Upper, 2 = Middle, 3 = Lower)
  • Name: character; besides the surname and given name, it contains Western honorifics such as Mr., Mrs., and Dr.
  • Sex: character; the passenger's sex, suitable for conversion to a factor
  • Age: numeric; the passenger's age, with missing values
  • SibSp: integer; the number of siblings and spouses aboard (Sib = sibling, Sp = spouse)
  • Parch: integer; the number of parents and children aboard (Par = parent, Ch = child)
  • Ticket: character; the passenger's ticket number
  • Fare: numeric; the ticket fare
  • Cabin: character; the passenger's cabin, with missing values
  • Embarked: character; the port of embarkation, suitable for conversion to a factor

Exploratory Data Analysis

Higher social class, higher survival rate

Before exploring Pclass, first convert the target variable Survived to a factor so that the plots and WOE tables below treat it as categorical.

data$Survived <- factor(data$Survived)

The survivor and victim counts for each Pclass can be plotted as follows:

ggplot(data = data[1:nrow(train),], mapping = aes(x = Pclass, y = ..count.., fill=Survived)) + 
  geom_bar(stat = "count", position='dodge') + 
  xlab('Pclass') + 
  ylab('Count') + 
  ggtitle('How Pclass impacts survival') + 
  scale_fill_manual(values=c("#FF0000", "#00FF00")) +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

As the chart shows, most passengers in Pclass 1 survived, roughly half of those in Pclass 2 survived, and fewer than 25% of those in Pclass 3 survived.

To quantify Pclass's predictive value, compute its WOE and IV. The result shows an IV of 0.5 for Pclass, rated "Highly Predictive", so Pclass can tentatively be kept as a feature of the prediction model.
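
For reference, the WOE (Weight of Evidence) and IV (Information Value) used throughout this article are computed per category i as follows, matching the PCT_G (share of survivors, the "goods") and PCT_B (share of victims, the "bads") columns in the tables below:

WOE_i = ln(PCT_G_i / PCT_B_i)
IV = Σ_i (PCT_G_i − PCT_B_i) × WOE_i

The larger the IV, the more predictive the variable; the InformationValue package attaches a qualitative "howgood" rating such as "Highly Predictive" to the result.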

WOETable(X=factor(data$Pclass[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL     PCT_G     PCT_B        WOE         IV
## 1   1   136   80   216 0.3976608 0.1457195  1.0039160 0.25292792
## 2   2    87   97   184 0.2543860 0.1766849  0.3644848 0.02832087
## 3   3   119  372   491 0.3479532 0.6775956 -0.6664827 0.21970095
IV(X=factor(data$Pclass[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.5009497
## attr(,"howgood")
## [1] "Highly Predictive"

Survival rate varies by Title

Passenger names are almost all distinct, so Name cannot be used directly. However, names embed honorifics such as Mr., Mrs., and Dr. that carry social and cultural information, and these can be extracted.

This article extracts each passenger's Title from the name as follows:

data$Title <- sapply(data$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
data$Title <- sub(' ', '', data$Title)
data$Title[data$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
data$Title[data$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
data$Title[data$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
data$Title <- factor(data$Title)

With the titles extracted, tally the survivors and victims for each Title:

ggplot(data = data[1:nrow(train),], mapping = aes(x = Title, y = ..count.., fill=Survived)) + 
  geom_bar(stat = "count", position='stack') + 
  xlab('Title') + 
  ylab('Count') + 
  ggtitle('How Title impacts survival') + 
  scale_fill_discrete(name="Survived", breaks=c(0, 1), labels=c("Perish", "Survived")) + 
  geom_text(stat = "count", aes(label = ..count..), position=position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The chart shows that passengers titled Mr survived at a very low rate, while those titled Mrs or Miss survived at a very high rate. WOE and IV quantify Title's predictive value: the IV is 1.487853 and rated "Highly Predictive", so Title can tentatively be used as a feature of the model.

WOETable(X=data$Title[1:nrow(train)], Y=data$Survived[1:nrow(train)])
##       CAT GOODS BADS TOTAL       PCT_G       PCT_B         WOE            IV
## 1     Col     1    1     2 0.002873563 0.001808318  0.46315552  4.933741e-04
## 2      Dr     3    4     7 0.008620690 0.007233273  0.17547345  2.434548e-04
## 3    Lady     2    1     3 0.005747126 0.001808318  1.15630270  4.554455e-03
## 4  Master    23   17    40 0.066091954 0.030741410  0.76543639  2.705859e-02
## 5    Miss   127   55   182 0.364942529 0.099457505  1.30000942  3.451330e-01
## 6    Mlle     3    3     3 0.008620690 0.005424955  0.46315552  1.480122e-03
## 7      Mr    81  436   517 0.232758621 0.788426763 -1.22003757  6.779360e-01
## 8     Mrs    99   26   125 0.284482759 0.047016275  1.80017883  4.274821e-01
## 9      Ms     1    1     1 0.002873563 0.001808318  0.46315552  4.933741e-04
## 10    Rev     6    6     6 0.017241379 0.010849910  0.46315552  2.960244e-03
## 11    Sir     2    3     5 0.005747126 0.005424955  0.05769041  1.858622e-05
IV(X=data$Title[1:nrow(train)], Y=data$Survived[1:nrow(train)])
## [1] 1.487853
## attr(,"howgood")
## [1] "Highly Predictive"

Women survived at a far higher rate than men

For the Sex variable, the Titanic's sinking followed the "women and children first" rule during evacuation, which suggests that Sex should help predict survival.

The data below confirm this: most women survived (233/(233+81) = 74.20%), while only a small fraction of men did (109/(109+468) = 18.89%).

data$Sex <- as.factor(data$Sex)
ggplot(data = data[1:nrow(train),], mapping = aes(x = Sex, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  xlab('Sex') + 
  ylab('Count') + 
  ggtitle('How Sex impacts survival') + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

WOE and IV show that Sex has an IV of 1.34, rated "Highly Predictive", so Sex can tentatively be used as a feature.

WOETable(X=data$Sex[1:nrow(train)], Y=data$Survived[1:nrow(train)])
##      CAT GOODS BADS TOTAL     PCT_G    PCT_B        WOE        IV
## 1 female   233   81   314 0.6812865 0.147541  1.5298770 0.8165651
## 2   male   109  468   577 0.3187135 0.852459 -0.9838327 0.5251163
IV(X=data$Sex[1:nrow(train)], Y=data$Survived[1:nrow(train)])
## [1] 1.341681
## attr(,"howgood")
## [1] "Highly Predictive"

Minors survived at a higher rate than adults

By the "women and children first" rule, minors should have been more likely to survive. As the chart below shows, survivors indeed outnumber victims among passengers with Age < 18, while victims far outnumber survivors among young and middle-aged adults.

ggplot(data = data[!is.na(data$Age) & seq_len(nrow(data)) <= nrow(train), ], aes(x = Age, color = Survived)) + 
  geom_line(stat = 'bin', binwidth = 5) + 
  labs(title = "How Age impacts survival", x = "Age", y = "Count", color = "Survived")

Passengers with a moderate number of siblings/spouses aboard were more likely to survive

For the SibSp variable, tally the survivors and victims:

ggplot(data = data[1:nrow(train),], mapping = aes(x = SibSp, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  labs(title = "How SibSp impact survivor", x = "Sibsp", y = "Count", fill = "Survived") + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The chart shows that passengers with SibSp = 0 survived at a rate of about one third; those with SibSp of 1 or 2 survived at roughly 50% or more; and those with SibSp ≥ 3 rarely survived. WOE and IV quantify SibSp's contribution to prediction: its IV is 0.1448994, rated "Highly Predictive".

WOETable(X=as.factor(data$SibSp[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
## 1   0   210  398   608 0.593220339 0.724954463 -0.2005429 0.026418349
## 2   1   112   97   209 0.316384181 0.176684882  0.5825894 0.081387334
## 3   2    13   15    28 0.036723164 0.027322404  0.2957007 0.002779811
## 4   3     4   12    16 0.011299435 0.021857923 -0.6598108 0.006966604
## 5   4     3   15    18 0.008474576 0.027322404 -1.1706364 0.022063953
## 6   5     5    5     5 0.014124294 0.009107468  0.4388015 0.002201391
## 7   8     7    7     7 0.019774011 0.012750455  0.4388015 0.003081947
IV(X=as.factor(data$SibSp[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.1448994
## attr(,"howgood")
## [1] "Highly Predictive"

Passengers with 1 to 3 parents/children aboard were more likely to survive

For the Parch variable, tally the survivors and victims:

ggplot(data = data[1:nrow(train),], mapping = aes(x = Parch, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  labs(title = "How Parch impact survivor", x = "Parch", y = "Count", fill = "Survived") + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), , vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The chart shows that passengers with Parch = 0 survived at a rate of about one third; those with Parch of 1 to 3 survived at roughly 50% or more; and those with Parch ≥ 4 rarely survived. WOE and IV quantify Parch's contribution: its IV is 0.1166611, rated "Highly Predictive".

WOETable(X=as.factor(data$Parch[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL       PCT_G       PCT_B        WOE          IV
## 1   0   233  445   678 0.671469741 0.810564663 -0.1882622 0.026186312
## 2   1    65   53   118 0.187319885 0.096539162  0.6628690 0.060175728
## 3   2    40   40    80 0.115273775 0.072859745  0.4587737 0.019458440
## 4   3     3    2     5 0.008645533 0.003642987  0.8642388 0.004323394
## 5   4     4    4     4 0.011527378 0.007285974  0.4587737 0.001945844
## 6   5     1    4     5 0.002881844 0.007285974 -0.9275207 0.004084922
## 7   6     1    1     1 0.002881844 0.001821494  0.4587737 0.000486461
IV(X=as.factor(data$Parch[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.1166611
## attr(,"howgood")
## [1] "Highly Predictive"

Passengers with a FamilySize of 2 to 4 were more likely to survive

Both SibSp and Parch show that passengers with no relatives aboard survived at a low rate, those with a few relatives survived at roughly 50% or more, and those with many relatives again survived at a low rate. This suggests deriving a new variable, FamilySize, as SibSp + Parch + 1, where the +1 counts the passenger themself.

data$FamilySize <- data$SibSp + data$Parch + 1
ggplot(data = data[1:nrow(train),], mapping = aes(x = FamilySize, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  xlab('FamilySize') + 
  ylab('Count') + 
  ggtitle('How FamilySize impacts survival') + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

FamilySize has an IV of 0.3497672, rated "Highly Predictive". Since this derived variable's IV exceeds both SibSp's and Parch's, FamilySize can be used as a feature of the model.

WOETable(X=as.factor(data$FamilySize[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL       PCT_G      PCT_B        WOE           IV
## 1   1   163  374   537 0.459154930 0.68123862 -0.3945249 0.0876175539
## 2   2    89   72   161 0.250704225 0.13114754  0.6479509 0.0774668616
## 3   3    59   43   102 0.166197183 0.07832423  0.7523180 0.0661084057
## 4   4    21    8    29 0.059154930 0.01457195  1.4010615 0.0624634998
## 5   5     3   12    15 0.008450704 0.02185792 -0.9503137 0.0127410643
## 6   6     3   19    22 0.008450704 0.03460838 -1.4098460 0.0368782940
## 7   7     4    8    12 0.011267606 0.01457195 -0.2571665 0.0008497665
## 8   8     6    6     6 0.016901408 0.01092896  0.4359807 0.0026038712
## 9  11     7    7     7 0.019718310 0.01275046  0.4359807 0.0030378497
IV(X=as.factor(data$FamilySize[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.3497672
## attr(,"howgood")
## [1] "Highly Predictive"

Passengers sharing a ticket number survived at a higher rate

The Ticket variable has far too many distinct values to use directly. First count the number of passengers holding each ticket:

ticket.count <- aggregate(data$Ticket, by = list(data$Ticket), function(x) sum(!is.na(x)))

A plausible guess is that passengers sharing a ticket number were family members, who were likely to survive or perish together. Split the passengers into two groups, those holding a unique ticket number and those sharing one, and tally survivors and victims in each:

data$TicketCount <- apply(data, 1, function(x) ticket.count[which(ticket.count[, 1] == x['Ticket']), 2])
data$TicketCount <- factor(sapply(data$TicketCount, function(x) ifelse(x > 1, 'Share', 'Unique')))
ggplot(data = data[1:nrow(train),], mapping = aes(x = TicketCount, y = ..count.., fill=Survived)) + 
  geom_bar(stat = 'count', position='dodge') + 
  xlab('TicketCount') + 
  ylab('Count') + 
  ggtitle('How TicketCount impacts survival') + 
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The chart shows that only 130/(130+351) = 27% of passengers holding a unique ticket number survived, versus 212/(212+198) = 51.7% of those sharing one. TicketCount's WOE and IV are computed below: its IV is 0.2751882, rated "Highly Predictive".

WOETable(X=data$TicketCount[1:nrow(train)], Y=data$Survived[1:nrow(train)])
##      CAT GOODS BADS TOTAL    PCT_G     PCT_B        WOE        IV
## 1  Share   212  198   410 0.619883 0.3606557  0.5416069 0.1403993
## 2 Unique   130  351   481 0.380117 0.6393443 -0.5199641 0.1347889
IV(X=data$TicketCount[1:nrow(train)], Y=data$Survived[1:nrow(train)])
## [1] 0.2751882
## attr(,"howgood")
## [1] "Highly Predictive"

Higher fares, higher survival rates

For the Fare variable, the chart below shows that the survival rate rises as Fare increases.

ggplot(data = data[!is.na(data$Fare) & seq_len(nrow(data)) <= nrow(train), ], aes(x = Fare, color = Survived)) + 
  geom_line(stat = 'bin', binwidth = 10) + 
  labs(title = "How Fare impacts survival", x = "Fare", y = "Count", color = "Survived")

Survival rate varies by cabin

Cabin values start with a letter followed by digits. A plausible guess is that the letter denotes a deck or area and the digits a position within it, much as a train ticket carries both a carriage number and a seat number. So extract the first letter of Cabin and compare survival rates across the resulting groups:

ggplot(data[1:nrow(train), ], mapping = aes(x = as.factor(sapply(data$Cabin[1:nrow(train)], function(x) str_sub(x, start = 1, end = 1))), y = ..count.., fill = Survived)) +
  geom_bar(stat = 'count', position='dodge') + 
  xlab('Cabin') +
  ylab('Count') +
  ggtitle('How Cabin impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The chart shows that passengers whose cabin letter is B, C, D, E, or F survived at rates above 50%, while passengers in the remaining groups survived at rates well below 50%. Cabin's WOE and IV are computed below: its IV is 0.1866526, rated "Highly Predictive".

data$Cabin <- sapply(data$Cabin, function(x) str_sub(x, start = 1, end = 1))
WOETable(X=as.factor(data$Cabin[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL      PCT_G      PCT_B        WOE          IV
## 1   A     7    8    15 0.05109489 0.11764706 -0.8340046 0.055504815
## 2   B    35   12    47 0.25547445 0.17647059  0.3699682 0.029228917
## 3   C    35   24    59 0.25547445 0.35294118 -0.3231790 0.031499197
## 4   D    25    8    33 0.18248175 0.11764706  0.4389611 0.028459906
## 5   E    24    8    32 0.17518248 0.11764706  0.3981391 0.022907100
## 6   F     8    5    13 0.05839416 0.07352941 -0.2304696 0.003488215
## 7   G     2    2     4 0.01459854 0.02941176 -0.7004732 0.010376267
## 8   T     1    1     1 0.00729927 0.01470588 -0.7004732 0.005188134
IV(X=as.factor(data$Cabin[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.1866526
## attr(,"howgood")
## [1] "Highly Predictive"

Passengers who embarked at S survived at a lower rate

Embarked is the port of embarkation. Compare survival rates across ports to judge whether Embarked helps predict survival:

ggplot(data[1:nrow(train), ], mapping = aes(x = Embarked, y = ..count.., fill = Survived)) +
  geom_bar(stat = 'count', position='dodge') + 
  xlab('Embarked') +
  ylab('Count') +
  ggtitle('How Embarked impacts survival') +
  geom_text(stat = "count", aes(label = ..count..), position=position_dodge(width=1), vjust=-0.5) + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="bottom")

The chart shows that only 217/(217+427) = 33.7% of passengers with Embarked = S survived, while passengers with Embarked = C or with Embarked missing survived at rates above 50%. Embarked therefore looks usable for prediction; its WOE and IV are computed below.

WOETable(X=as.factor(data$Embarked[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
##   CAT GOODS BADS TOTAL      PCT_G     PCT_B        WOE           IV
## 1   C    93   75   168 0.27352941 0.1366120  0.6942642 9.505684e-02
## 2   Q    30   47    77 0.08823529 0.0856102  0.0302026 7.928467e-05
## 3   S   217  427   644 0.63823529 0.7777778 -0.1977338 2.759227e-02
IV(X=as.factor(data$Embarked[1:nrow(train)]), Y=data$Survived[1:nrow(train)])
## [1] 0.1227284
## attr(,"howgood")
## [1] "Highly Predictive"

The computation gives an IV of 0.1227284, rated "Highly Predictive".

Imputing Missing Values

List all the missing data:

# Count the missing values in each variable of interest
attach(data)
  missing <- list(Pclass=nrow(data[is.na(Pclass), ]))
  missing$Name <- nrow(data[is.na(Name), ])
  missing$Sex <- nrow(data[is.na(Sex), ])
  missing$Age <- nrow(data[is.na(Age), ])
  missing$SibSp <- nrow(data[is.na(SibSp), ])
  missing$Parch <- nrow(data[is.na(Parch), ])
  missing$Ticket <- nrow(data[is.na(Ticket), ])
  missing$Fare <- nrow(data[is.na(Fare), ])
  missing$Cabin <- nrow(data[is.na(Cabin), ])
  missing$Embarked <- nrow(data[is.na(Embarked), ])
  for (name in names(missing)) {
    if (missing[[name]][1] > 0) {
      print(paste('', name, ' miss ', missing[[name]][1], ' values', sep = ''))
    }
  }
detach(data)
## [1] "Age miss 263 values"
## [1] "Fare miss 1 values"
## [1] "Cabin miss 1014 values"
## [1] "Embarked miss 2 values"

Predicting Passenger Ages

263 passengers are missing Age, too many to fill with the median or the mean. The usual options are to predict the missing values from other variables or to set them to a default; here the missing ages are predicted from the other variables.

age.model <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize, data=data[!is.na(data$Age), ], method='anova')
data$Age[is.na(data$Age)] <- predict(age.model, data[is.na(data$Age), ])
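
As a quick sanity check (not part of the original analysis), confirm that no Age values remain missing and that the imputed ages fall in a plausible range:

# After imputation, Age should contain no NA values
summary(data$Age)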

Inferring the Missing Embarked Values from the Fare Median

As the output below shows, both passengers with a missing Embarked have Pclass = 1 and Fare = 80.

data[is.na(data$Embarked), c('PassengerId', 'Pclass', 'Fare', 'Embarked')]
## # A tibble: 2 × 4
##   PassengerId Pclass  Fare Embarked
##         <int>  <int> <dbl>    <chr>
## 1          62      1    80     <NA>
## 2         830      1    80     <NA>

And the boxplot below shows that the median Fare of passengers with Embarked = C and Pclass = 1 is 80.

ggplot(data[!is.na(data$Embarked),], aes(x=Embarked, y=Fare, fill=factor(Pclass))) +
  geom_boxplot() +
  geom_hline(aes(yintercept=80), color='red', linetype='dashed', lwd=2) +
  scale_y_continuous(labels=dollar_format()) + theme_few()

(Figure: Fare median for each Embarked and Pclass)

The missing Embarked values can therefore be set to 'C'.

data$Embarked[is.na(data$Embarked)] <- 'C'
data$Embarked <- as.factor(data$Embarked)

Filling the Single Missing Fare with the Median

Only one record is missing Fare, so the mean or the median can be used directly; here the median Fare fills the gap.

data$Fare[is.na(data$Fare)] <- median(data$Fare, na.rm=TRUE)

Setting Missing Cabin Values to a Default

Many records are missing Cabin, so median or mean imputation does not apply; the usual options are to predict from other variables or to use a default value. Cabin is hard to predict from the other variables, and the earlier analysis showed that treating NA as its own level already yields a fairly high IV, so the missing Cabin values are simply set to a default value, 'X'.

data$Cabin <- as.factor(sapply(data$Cabin, function(x) ifelse(is.na(x), 'X', str_sub(x, start = 1, end = 1))))

Training the Model

# Train a conditional inference forest (party::cforest) on the training rows
set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + SibSp + Parch + FamilySize + TicketCount + Fare + Cabin + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
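
Before validating, it can be instructive to see which features the forest actually relies on. A minimal sketch using party's permutation-based variable importance (this can be slow with ntree = 2000):

# Permutation variable importance for the conditional inference forest
imp <- varimp(model)
sort(imp, decreasing = TRUE)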

Cross-Validation

Normally the training data should be split into two parts, one for training and one for validation, or k-fold cross-validation should be used (a k-fold sketch follows the validation output below). This article trains on all the training data and then randomly samples 30% of it for validation.

cv.summarize <- function(data.true, data.predict) {
  print(paste('Recall:', Recall(data.true, data.predict)))
  print(paste('Precision:', Precision(data.true, data.predict)))
  print(paste('Accuracy:', Accuracy(data.predict, data.true)))
  print(paste('AUC:', AUC(data.predict, data.true)))
}
set.seed(415)
cv.test.sample <- sample(1:nrow(train), as.integer(0.3 * nrow(train)), replace = TRUE)
cv.test <- data[cv.test.sample,]
cv.prediction <- predict(model, cv.test, OOB=TRUE, type = "response")
cv.summarize(cv.test$Survived, cv.prediction)
## [1] "Recall: 0.947976878612717"
## [1] "Precision: 0.841025641025641"
## [1] "Accuracy: 0.850187265917603"
## [1] "AUC: 0.809094822285082"

Prediction

predict.result <- predict(model, data[(1+nrow(train)):(nrow(data)), ], OOB=TRUE, type = "response")
output <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(output, file = "cit1.csv", row.names = FALSE)

This model scores 0.80383 on Kaggle, ranking 992nd, i.e. in the top 992/6292 = 15.8%.

Tuning

Removing Correlated Features

Since FamilySize already combines the information in SibSp and Parch, try removing SibSp and Parch from the feature set.

set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + FamilySize + TicketCount + Fare + Cabin + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit2.csv", row.names = FALSE)

This model also scores 0.80383 on Kaggle.

Removing Cabin, Which Has a Low IV

Since Cabin's IV is relatively low, consider removing it from the model.

set.seed(415)
model <- cforest(Survived ~ Pclass + Title + Sex + Age + FamilySize + TicketCount + Fare + Embarked, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit3.csv", row.names = FALSE)

This model again scores 0.80383 on Kaggle.

增長派生特徵

The Title variable was derived from Name above. For the following reasons, a passenger's surname may also carry some predictive value:

  • In some Western countries, given names repeat often while surnames repeat rarely, so a surname is fairly distinctive
  • In some countries, surnames carry a degree of status information
  • Passengers sharing a surname may well be one family (again because surnames repeat rarely), and families tended to survive or perish together

Since a surname that occurs only once cannot appear in both the training and test sets, it has no discriminative or predictive value. Such rare IDs are therefore collapsed into 'Small' (the code below also treats families of size ≤ 2, and family IDs occurring at most twice, as 'Small'):

data$Surname <- sapply(data$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
data$FamilyID <- paste(as.character(data$FamilySize), data$Surname, sep="")
data$FamilyID[data$FamilySize <= 2] <- 'Small'
# Delete erroneous family IDs
famIDs <- data.frame(table(data$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2,]
data$FamilyID[data$FamilyID %in% famIDs$Var1] <- 'Small'
# Convert to a factor
data$FamilyID <- factor(data$FamilyID)
set.seed(415)
model <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + FamilySize + FamilyID + TicketCount, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit4.csv", row.names = FALSE)

This model scores 0.82297 on Kaggle, ranking 207th, i.e. in the top 207/6292 = 3.3%.

Miscellaneous

Experiments show that filling the missing Embarked values with the most frequent value, S, rather than C improves the score. This choice has little theoretical grounding, however, and the score is only the Public leaderboard score rather than the final one, so it does not prove this method better than the alternatives. It is offered only as a possible idea for reference.

data$Embarked[c(62,830)] = "S"  # rows 62 and 830 are the two records that were missing Embarked
data$Embarked <- factor(data$Embarked)
set.seed(415)
model <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + FamilySize + FamilyID + TicketCount, data = data[train.row, ], controls=cforest_unbiased(ntree=2000, mtry=3))
predict.result <- predict(model, data[test.row, ], OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = predict.result)
write.csv(submit, file = "cit5.csv", row.names = FALSE)

This model scores 0.82775 on Kaggle, ranking 114th, i.e. in the top 114/6292 = 1.8%.
(Figure: Kaggle leaderboard, top 2%)

Summary

This article detailed how to reach the top 2% in Kaggle's Titanic survival prediction competition, a classification problem, through data preview, exploratory data analysis, missing-value imputation, removal of correlated features, and derivation of new features.

Coming Next

The next article will focus on general approaches and methods for solving engineering problems with machine learning.
