This article uses the Kaggle Titanic competition as a starting point. It introduces several feature-engineering methods, then trains three models (RF, GBDT, SVM) and combines them with an ensemble method (Voting Classifier) to make predictions.
The complete code and data are available at ReMachineLearning(titanic) - Github.
Below is Kaggle's introduction to the competition.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
In short, this is a binary classification problem: given various attributes of a passenger, predict whether that passenger survived.
This article first applies some feature-engineering methods, then uses Random Forests, GBDT (Gradient Boosting Decision Tree), and SVM (Support Vector Machine) as training models, and finally combines them with a Voting Classifier ensemble.
This article is based on Python 3, sklearn, and Pandas (installing Anaconda is strongly recommended). The data comes from Kaggle, and both the code and the data can be obtained on Github.
First, a quick look at the structure of the training-data CSV:
```
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
```
Looking at the data, the Sex column is an enumerated string, either `male` or `female`. To make this column easier for the computer to work with, we convert it to numbers:
```python
# API docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
# map() converts the strings to numbers; astype() fixes the pandas column dtype
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)
```
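A quick self-contained check of what this mapping does, on a tiny made-up frame (not the Kaggle file):

```python
import pandas as pd

# Tiny stand-in for the Titanic frame, just to show the mapping.
df = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# map() replaces each string with its code; astype(int) fixes the dtype.
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)

print(df['Sex'].tolist())  # [0, 1, 1]
```

Note that `map` returns NaN for any value missing from the dictionary, which would make `astype(int)` fail, so the mapping must cover every value present in the column.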
Using `df.info()` we can quickly see which columns are incomplete:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null int64
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(6), object(4)
memory usage: 83.6+ KB
```
We can see that Age, Cabin, and Embarked are incomplete, so we need to fill them in. For the theory of how machine learning handles missing data, see the linked material.
My approach in this example is:
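As a sketch of common fill strategies for these three columns (median age, most frequent embarkation port, and a has-cabin indicator; these are typical choices, not necessarily exactly what the repo does), on a tiny made-up frame:

```python
import pandas as pd

# Tiny stand-in with the same kinds of gaps as the Titanic data.
df = pd.DataFrame({
    'Age': [22.0, None, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Cabin': [None, 'C85', None, 'C123'],
})

# Age: fill with the median, which is robust to outliers.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Embarked: only a couple of rows are missing, so use the mode.
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Cabin: mostly missing, so reduce it to a has-cabin indicator.
df['HasCabin'] = df['Cabin'].notna().astype(int)

print(df['Age'].tolist())       # [22.0, 26.0, 26.0, 35.0]
print(df['Embarked'].tolist())  # ['S', 'C', 'S', 'S']
print(df['HasCabin'].tolist())  # [0, 1, 0, 1]
```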
`SibSp` is the number of siblings or spouses travelling with the passenger; `Parch` is the number of parents or children travelling with them.
I add a FamilySize column as the sum of SibSp and Parch, representing the total number of family members aboard. Mining hidden attributes or attribute combinations out of the raw data like this is called creating derived attributes. These derived features do have a few problems, though:
To be honest, I don't fully understand why SibSp and Parch should be summed into FamilySize; I can't justify the choice.
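The derivation itself is a one-liner in pandas. A minimal sketch on a tiny made-up frame (the values are illustrative, not the Kaggle data; some variants also add 1 to count the passenger themselves, but here we follow the text and just sum the two columns):

```python
import pandas as pd

df = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 2, 1]})

# Derived attribute: total number of relatives aboard.
df['FamilySize'] = df['SibSp'] + df['Parch']

print(df['FamilySize'].tolist())  # [1, 2, 4]
```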
`Fare` is the ticket price.
After getting the data, the first thing to do is analyze it and its features.
pandas can compute correlation coefficients via the DataFrame `corr` method (three methods are supported: pearson, kendall, spearman). I won't expand on them here; in any case, they measure the correlation between columns.
```python
# API docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html
import matplotlib.pyplot as plt
import numpy as np

def plot_corr(df, size=10):
    """Plot a graphical correlation matrix for each pair of columns.

    Input:
        df:   pandas DataFrame
        size: vertical and horizontal size of the plot
    """
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    # Annotate each cell with its correlation coefficient.
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:.2f}'.format(z), ha='center', va='center')
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)

# Feature correlation chart
plot_corr(df)
```
As mentioned earlier, we use three models and one ensemble method; there isn't much to elaborate on here.
Random Forests, GBDT (Gradient Boosting Decision Tree), and SVM (Support Vector Machine) are used as the training models, and finally a Voting Classifier ensemble combines them.
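A minimal sketch of this setup with sklearn, run on synthetic data in place of the engineered Titanic features; the hyperparameters (`n_estimators`, soft voting, `probability=True` on the SVC) are illustrative choices, not necessarily those used in the repo:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.svm import SVC

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
gbdt = GradientBoostingClassifier(random_state=0)
svm = SVC(probability=True, random_state=0)  # probability=True enables soft voting

# Soft voting averages the predicted class probabilities of the three models.
voting = VotingClassifier(
    estimators=[('rf', rf), ('gbdt', gbdt), ('svm', svm)],
    voting='soft',
)
voting.fit(X, y)
print(voting.score(X, y))  # training accuracy, just as a sanity check
```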
The scores for the Voting ensemble and for the individual models are as follows:
I plan to write a separate post later on each of the algorithms covered here (consider the pit dug). This post took about two months from start to finish, with one long break in the middle.