Kaggle Beginner Competition: Titanic Survivor Prediction (Data Mining Part)

Date: 2019-11-09

Tags: kaggle, beginner, survivor prediction, data mining


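The project shared here comes from a classic Kaggle competition: Titanic survivor prediction. It is presented in two parts, data analysis and data mining. The previous article covered data analysis; this one covers data mining.

Data Mining

This article covers the following:

Clean abnormal and missing data.
Transform features. For example, the categorical Sex feature (female/male) must be converted to 0/1 values that the model can accept; this is also called quantification.
Beyond the provided variables, try to construct "derived variables" that are believed to be highly influential, and add them to the data.
Tidy the data, build a model, and output the predictions.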

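Feature Engineering

Handling Missing Values

In the previous article's analysis we counted the missing values for each feature:

Next we handle the missing values of four features in turn: Fare, Embarked, Cabin, and Age.

Handling missing values in Fare

Inspect the missing values in Fare: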

df[df['Fare'].isnull()]

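There is only one missing value. It belongs to a male passenger over 60 years old travelling in passenger class 3.

We choose not to drop this row. Instead, we fill the missing value from samples with similar features. The other samples that share these features with the incomplete row are: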

df.loc[(df['Pclass']==3)&(df['Sex']=='male')&(df['Age']>60)]

We fill the missing value with the mean Fare of the samples above:

df['surname'] = df["Name"].apply(lambda x: x.split(',')[0].lower())
fare_mean_estimated = df.loc[(df['Pclass']==3)&(df['Age']>60)&(df['Sex']=='male')].Fare.mean()
df.loc[df['surname']=='storey','Fare'] = fare_mean_estimated
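
The same idea can be sketched without relying on the passenger's surname, by selecting the row through its missing Fare instead. The example below is only an illustration on a made-up toy table (`toy` and its values are assumptions, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic table (values are invented for illustration).
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 1, 3],
    "Sex": ["male", "male", "male", "male", "female"],
    "Age": [61.0, 65.0, 70.5, 62.0, 63.0],
    "Fare": [6.0, 8.0, np.nan, 80.0, 9.0],
})

# Rows that resemble the passenger with the missing fare.
similar = (toy["Pclass"] == 3) & (toy["Sex"] == "male") & (toy["Age"] > 60)

# mean() skips NaN, so the incomplete row does not affect the estimate.
fare_estimate = toy.loc[similar, "Fare"].mean()

# Fill only the missing entries inside the similar group.
toy.loc[similar & toy["Fare"].isnull(), "Fare"] = fare_estimate
```

Selecting by `isnull()` keeps the fill independent of any particular passenger name, which may be easier to reuse on other columns.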

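Handling missing values in Embarked

Inspect the missing values in Embarked:

df[df['Embarked'].isnull()]

From the previous article's analysis we know that port S had the most boarding passengers, so we fill the missing values with S: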

df['Embarked'] = df['Embarked'].fillna('S')
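
Rather than hard-coding 'S', the most frequent port could also be computed from the column itself. A minimal sketch on a toy column (the data here is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy Embarked column with two missing entries (illustrative values only).
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

# mode() ignores NaN by default and returns the most frequent value(s).
most_common_port = toy["Embarked"].mode()[0]

toy["Embarked"] = toy["Embarked"].fillna(most_common_port)
```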

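Handling missing values in Cabin

Cabin is missing for a large share of the passengers, so we derive a new feature from whether Cabin information is present at all: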

data_train['Has_Cabin'] = data_train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
data_test['Has_Cabin'] = data_test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
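
The `type(x) == float` check works because a missing Cabin is stored as the float NaN. An equivalent and arguably more explicit formulation uses `notnull()`; the sketch below shows it on a toy column (invented values):

```python
import numpy as np
import pandas as pd

# Toy Cabin column: two known cabins, two missing (illustrative values only).
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# 1 where a cabin is recorded, 0 where it is missing.
toy["Has_Cabin"] = toy["Cabin"].notnull().astype(int)
```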

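The processed dataset has the following form:

Handling missing values in Age

Age has a number of missing values and also takes many distinct values. After filling the missing values in this step, we will also bin Age into intervals.

First, fill the missing values: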

import numpy as np

full_data = [data_train, data_test]
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # Draw random integer ages within one standard deviation of the mean
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # Assign through .loc to avoid chained-assignment problems
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)

Group the many Age values into bins:

# pd.cut splits Age into 5 equal-width intervals; the cut points used below follow them
data_train['CategoricalAge'] = pd.cut(data_train['Age'], 5)
for dataset in full_data:
    # Mapping Age
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
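
The chain of `.loc` assignments above can also be expressed as a single `pd.cut` call with explicit bin edges. This is a sketch on a few sample ages, assuming the same cut points (16, 32, 48, 64) and right-inclusive intervals:

```python
import numpy as np
import pandas as pd

# Sample ages chosen to hit each boundary (illustrative values only).
ages = pd.Series([2, 16, 17, 32, 40, 48, 60, 64, 80])

# Intervals are right-inclusive by default: (-inf, 16], (16, 32], ..., (64, inf).
bins = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer bin index instead of an interval label.
age_group = pd.cut(ages, bins=bins, labels=False)
```

One possible advantage of this form is that the original Age column is left untouched, so the grouped values can be stored in a separate column.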

Drop the redundant features:

drop_elements = ['PassengerId']
data_train = data_train.drop(drop_elements, axis = 1)
data_train = data_train.drop(['CategoricalAge'], axis = 1)

Finally, we get a dataset of the following form:

Derived Variables

In this part we process the Name feature, deriving a new feature variable from it.

import re

# Define a function that extracts the title from a name
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Create the new feature Title
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

# Replace uncommon titles with "Rare"
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    # Merge spelling variants into the common titles
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
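
To see what the regular expression matches, the sketch below applies the same pattern to a few names written in the dataset's "Surname, Title. Given names" format. The helper name `extract_title` is hypothetical and only used for this illustration:

```python
import re

def extract_title(name):
    # A space, then letters, then a literal dot: captures the word before the dot.
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
]
titles = [extract_title(n) for n in samples]

# A string without the "word followed by a dot" pattern yields an empty title.
no_title = extract_title("no title here")
```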

Convert the data to numeric values:

for dataset in full_data:
    # Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    
    # titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    
    # Embarked
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    # Fare
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare']                                 = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare']                                     = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

Remove the redundant features:

drop_elements = ['Name', 'Ticket']
data_train = data_train.drop(drop_elements, axis = 1)
data_test = data_test.drop(drop_elements, axis = 1)

This gives a dataset of the following form:

Feature Selection

Feature Weights

The dataset now has the following features:

Index(['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Family',
       'Has_Cabin', 'Title'],
      dtype='object')

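We score each feature variable with the F value from an ANOVA (analysis of variance). The score indicates how much weight each feature carries in explaining the target variable. The code is as follows: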

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

target = data_train["Survived"].values
# Candidate predictors only: the target 'Survived' must not be scored against itself,
# and 'Name_length' is left out because it is not among the columns listed above
features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Family',
            'Has_Cabin', 'Title']

train = data_train.copy()
test = data_test.copy()

selector = SelectKBest(f_classif, k=len(features))
selector.fit(train[features], target)
# A larger -log10(p) means stronger evidence that the feature is related to the target
scores = -np.log10(selector.pvalues_)
indices = np.argsort(scores)[::-1]
print("Features importance :")
for f in range(len(scores)):
    print("%0.2f %s" % (scores[indices[f]],features[indices[f]]))
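
To get a feel for how the ANOVA score behaves, here is a small sketch on synthetic data (not the Titanic data): one feature is constructed to depend on the class label and one is pure noise. All names and values in it are assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)

# Binary target: 50 samples of each class.
y = np.array([0] * 50 + [1] * 50)

# One feature whose mean shifts with the class, and one unrelated to it.
informative = y * 2.0 + rng.normal(size=100)
noise = rng.normal(size=100)
X = np.column_stack([informative, noise])

# f_classif returns the F statistic and p-value for each column.
F, p = f_classif(X, y)
scores = -np.log10(p)
```

The informative column should receive a much higher score than the noise column, which is the ordering the ranking loop above relies on.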

Result:

Feature Correlation Analysis

Run a correlation analysis across the features and view the heatmap:

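import matplotlib.pyplot as plt
import seaborn as sns

# Include the target so that its correlation with each feature is visible
features_selected = ['Survived'] + [f for f in features if f != 'Survived']
df_corr = data_train[features_selected].copy()

colormap = plt.cm.RdBu
plt.figure(figsize=(20,20))
sns.heatmap(df_corr.corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)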

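Strongly correlated features are largely redundant and can contribute to overfitting, so they should be removed. The ideal case is one where all features have low correlation with one another while each carries high variance, that is, a lot of information.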
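
Besides reading the heatmap by eye, strongly correlated pairs can be listed programmatically. The sketch below uses synthetic columns and a cut-off of 0.9; both the data and the threshold are assumptions chosen for illustration, not values from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),             # independent of "a"
})

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "strongly correlated"
pairs = [(r, c) for r in upper.index for c in upper.columns
         if upper.loc[r, c] > threshold]
```

One feature from each listed pair is then a candidate for removal.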

Model Training

Building the Model

Split the dataset:

from sklearn.model_selection import train_test_split

X_all = data_train.drop(['Survived'], axis=1)
y_all = data_train['Survived']

num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)

Here we use a random forest (RandomForest) model. Build the model:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier()

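# Parameter grid to search
# ('auto' is omitted: for classifiers it was equivalent to 'sqrt', and newer scikit-learn releases no longer accept it)
parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1,5,8]
             }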

acc_scorer = make_scorer(accuracy_score)

grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

clf = grid_obj.best_estimator_

clf.fit(X_train, y_train)
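
After a grid search it is often useful to check which parameter combination won and what its cross-validated score was. The sketch below does this on synthetic data from `make_classification` with a deliberately small grid; the data and grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

small_grid = {"n_estimators": [4, 9], "max_depth": [2, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), small_grid,
                    scoring="accuracy", cv=5)
grid.fit(X, y)

# Winning parameters and their mean cross-validated accuracy.
print(grid.best_params_)
print(grid.best_score_)
```

With the default `refit=True`, `best_estimator_` has already been refit on all the data passed to `fit`.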

The resulting model:

Model Prediction

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

The accuracy on the held-out split is 0.8435754189944135. Submitting to Kaggle gives a score of 0.77990, so further improvement is needed.

K-Fold Cross-Validation

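from sklearn.model_selection import KFold

def run_kfold(clf):
    # 10 consecutive folds over all rows (sklearn.cross_validation was removed in scikit-learn 0.20)
    kf = KFold(n_splits=10)
    outcomes = []
    fold = 0
    for train_index, test_index in kf.split(X_all):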
        fold += 1
        X_train, X_test = X_all.values[train_index], X_all.values[test_index]
        y_train, y_test = y_all.values[train_index], y_all.values[test_index]
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))     
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy: {0}".format(mean_outcome)) 

run_kfold(clf)
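
The manual loop above can be written more compactly with `cross_val_score`. The following is a sketch on synthetic data; the data and the model settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for X_all / y_all.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=9, random_state=0)

# One accuracy value per fold; the model is refit from scratch on each split.
fold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=10), scoring="accuracy")
mean_accuracy = fold_scores.mean()
```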

Result:

Output

ids = data_test['PassengerId']
predictions = clf.predict(data_test.drop('PassengerId', axis=1))

output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('titanic-predictions.csv', index = False)