Kaggle Getting Started, Titanic: Model Building

Date: 2019-11-19

Tags: kaggle, getting started, titanic, model building


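0. Introduction

After the feature analysis in the previous part, we have the training set and test set we wanted, so we can now train models on the training set and use them to predict the test set. The structure of the two sets is shown below.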

print(train.head(5))
print(test.head(5))

   Survived  Pclass  Sex  Age  Fare  Embarked  FamilySize  IsAlone  Title
0         0       3    1    1     0         0           2        0      1
1         1       1    0    2     3         1           2        0      3
2         1       3    0    1     1         0           1        1      2
3         1       1    0    2     3         0           2        0      3
4         0       3    1    2     1         0           1        1      1
   Pclass  Sex  Age  Fare  Embarked  FamilySize  IsAlone  Title
0       3    1    2     0         2           1        1      1
1       3    0    2     0         0           2        0      3
2       2    1    3     1         2           1        1      1
3       3    1    1     1         0           1        1      1
4       3    0    1     1         0           3        0      3

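1. Several basic models

We use the sklearn library and xgboost here. To keep the accuracy estimate from being inflated by overfitting, we split the training set: each time, 90% of the training set is used to fit the model, the remaining 10% is held out and predicted, and the prediction accuracy is computed. This is repeated ten times, and the ten results are averaged to give the final accuracy.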

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

## All classifier models, with default parameters (except k=3 for KNN and probability=True for SVC)
classifiers = [
    KNeighborsClassifier(3),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    SVC(probability=True),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    LogisticRegression(),
    xgb.XGBClassifier()
    ]

# Generate ten train/validation splits; each validation part is 1/10 of the training data
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
# np.asarray works whether train is still a DataFrame or already an array
x = np.asarray(train)[:, 1:]
y = np.asarray(train)[:, 0]
accuracy = np.zeros(len(classifiers))  # accumulated accuracy of each model
for train_index, test_index in sss.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for clf_num, clf in enumerate(classifiers):
        clf.fit(x_train, y_train)
        # validation accuracy of this model on this split
        accuracy[clf_num] += (y_test == clf.predict(x_test)).mean()

accuracy = accuracy / 10  # average over the ten splits
plt.bar(np.arange(len(classifiers)), accuracy, width=0.5, color='b')
plt.xlabel('Algorithm')
plt.ylabel('Accuracy')
# bars are centered on their x positions, so the ticks need no offset
plt.xticks(np.arange(len(classifiers)),
           ('KNN', 'DT', 'RF', 'SVC', 'AdaB', 'GBC', 'GNB',
            'LDA', 'QDA', 'LR', 'xgb'))

The results are shown below. Every model reaches an accuracy of around 0.8, with GradientBoostingClassifier, SVC and XGBoost performing best.

array([ 0.79444444,  0.82666667,  0.82555556,  0.82888889,  0.82222222,
        0.83444444,  0.79222222,  0.79666667,  0.81222222,  0.80777778,
        0.82888889])
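
The repeated 90/10 evaluation described above can also be written more compactly. The following is a minimal sketch, not the code used for the numbers above: it assumes scikit-learn's `cross_val_score` accepts the same `StratifiedShuffleSplit` object as its `cv` argument, and it uses a synthetic dataset (`X_demo`, `y_demo`, both hypothetical names) in place of the encoded Titanic features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in for the encoded Titanic features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Ten random stratified 90/10 splits, matching the loop in the article.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# cross_val_score refits a fresh copy of the model on each split and
# returns one held-out accuracy per split.
scores = cross_val_score(
    GradientBoostingClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=splitter,
    scoring="accuracy",
)
mean_accuracy = scores.mean()
print(scores.shape, round(float(mean_accuracy), 3))
```

Looking at the spread of `scores` as well as the mean gives a rough idea of how much a difference between two models should be trusted.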

We use the GradientBoostingClassifier model to predict the test data and submit the result; the score is 0.77033.

gbc = GradientBoostingClassifier().fit(x, y)
test_predictions = gbc.predict(test)
StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,
                            'Survived': test_predictions.astype(int) })
StackingSubmission.to_csv("gbc.csv", index=False)

2. Model ensembling

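Beyond the basic models, combining several of them can give better results.

(1) Once several basic models have produced their results, we can combine them with a simple weighted vote to get the final result. Here we use the simplest form, an equally weighted average.

(2) We can put a second-layer model on top of the basic models to fit their outputs nonlinearly and produce the final result. Here we use xgboost as the second-layer model.

(3) We can process the training set so that the same model is trained on a different part of it each time, which also yields different models, and then combine those with a weighted vote. The previous step already used this method.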
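
Method (3) is close to what scikit-learn calls bagging. As a rough sketch under stated assumptions: `BaggingClassifier` with `max_samples=0.9` and `bootstrap=False` trains each copy of the base model on a random 90% of the rows, which mirrors the 90/10 idea used in this article but is not the article's own code. The dataset is synthetic and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded Titanic features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Ten decision trees, each fitted on a different random 90% of the rows
# (drawn without replacement because bootstrap=False).
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=10,
    max_samples=0.9,
    bootstrap=False,
    random_state=0,
)
bag.fit(X_fit, y_fit)

# For base models with predict_proba, the ensemble averages the
# predicted class probabilities of its members.
bag_accuracy = bag.score(X_hold, y_hold)
print(len(bag.estimators_), round(bag_accuracy, 3))
```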

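Equally weighted averaging. We run the models trained above on the test set and collect their predictions, from which we can plot the correlation coefficients between the models, as shown in the figure. For ensembling, one generally picks models with low correlation: this increases the diversity of the ensemble, lets the combined model learn more information, and improves its accuracy.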

import matplotlib.pylab as pyl
# Generate ten train/validation splits; each validation part is 1/10 of the training data
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
# np.asarray works whether train is still a DataFrame or already an array
x = np.asarray(train)[:, 1:]
y = np.asarray(train)[:, 0]
x1_test = np.zeros((test.shape[0], len(classifiers)))  # first-layer predictions on the test set
accuracy = np.zeros(len(classifiers))  # accumulated accuracy of each model
for train_index, test_index in sss.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for clf_num, clf in enumerate(classifiers):
        clf.fit(x_train, y_train)
        # predict the real test set in every round; the ten predictions are averaged below
        x1_test[:, clf_num] += clf.predict(test)
        # validation accuracy of this model on this split
        accuracy[clf_num] += (y_test == clf.predict(x_test)).mean()

x1_test = x1_test / 10
accuracy = accuracy / 10
plt.bar(np.arange(len(classifiers)), accuracy, width=0.5, color='b')
plt.xlabel('Algorithm')
plt.ylabel('Accuracy')
# bars are centered on their x positions, so the ticks need no offset
plt.xticks(np.arange(len(classifiers)),
           ['KNN', 'DT', 'RF', 'SVC', 'AdaB', 'GBC', 'GNB',
            'LDA', 'QDA', 'LR', 'xgb'])
plt.show()  # finish the bar chart so the heat map below gets its own figure

pyl.pcolor(np.corrcoef(x1_test.T), cmap = 'Blues')
pyl.colorbar() 
pyl.xticks(np.arange(0.5, 11.5),
           ['KNN', 'DT', 'RF', 'SVC', 'AdaB', 'GBC', 'GNB','LDA', 'QDA', 'LR', 'xgb'])
pyl.yticks(np.arange(0.5, 11.5),
           ['KNN', 'DT', 'RF', 'SVC', 'AdaB', 'GBC', 'GNB','LDA', 'QDA', 'LR', 'xgb'])
pyl.show()
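
Why low correlation helps can be seen with a small calculation. This is an idealized sketch, not a statement about the Titanic models: it assumes every model is correct with the same probability and that their errors are fully independent, which real models never quite are. The helper `majority_vote_accuracy` is a hypothetical name, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that more than half of n independent models are correct."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Five independent models, each 70% accurate, combined by majority vote.
independent = majority_vote_accuracy(5, 0.7)
# Five perfectly correlated models behave like a single model.
identical = majority_vote_accuracy(1, 0.7)

print(round(independent, 4))  # 0.8369
print(round(identical, 4))    # 0.7
```

Under these assumptions the vote of five independent models is right about 84% of the time, while five copies of the same model stay at 70%. Real ensembles fall somewhere in between, which is why the correlation plot is a useful guide when choosing members.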

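Based on the correlation plot above, we pick five models (KNN, DT, RF, LR and GBC) and average their predictions with equal weights. The submission scores 0.78947.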

index = [0, 1, 2, 5, 9]
linear_prediction = x1_test[:, index].mean(axis=1)
linear_prediction[linear_prediction >= 0.5] = 1
linear_prediction[linear_prediction < 0.5] = 0
StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,
                            'Survived': linear_prediction.astype(int) })
StackingSubmission.to_csv("linear_prediction.csv", index=False)
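
For comparison, scikit-learn ships a `VotingClassifier` that performs a similar equal-weight vote in one object. The sketch below is an alternative, not the method used for the score above: it takes a majority vote over one prediction per model, whereas the article averages ten predictions per model across splits. The data is synthetic and the names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded Titanic features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# The same five model families the article selects, each with one equal vote.
voter = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(3)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("gbc", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # majority vote on predicted labels
)
voter.fit(X_fit, y_fit)

vote_predictions = voter.predict(X_hold)
vote_accuracy = (vote_predictions == y_hold).mean()
print(round(float(vote_accuracy), 3))
```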

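Nonlinear ensembling. In the second layer, xgboost is fitted to the outputs of the first-layer models; the submission scores 0.76555. For the first-layer outputs to serve as the second layer's training set, the training set has to be split into blocks of size 0.1: each model is trained on 0.9 of the training set and then predicts the remaining 0.1, so that the first-layer outputs can be used to train the second layer. The result is not very good, however. A likely reason is that the second layer has too few input features for a model to fit well.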

import xgboost as xgb
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

## All classifier models, with default parameters (except k=3 for KNN)
classifiers = [
    KNeighborsClassifier(3),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    LogisticRegression()
    ]

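# Generate ten train/validation splits; each validation part is 1/10 of the training data
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
# Original data; it is split below to measure offline accuracy, repeated ten times to limit overfitting
x = np.asarray(train)[:, 1:]
y = np.asarray(train)[:, 0]
x1_train = np.zeros((x.shape[0], len(classifiers)))  # first-layer outputs on the training set
x1_test = np.zeros((test.shape[0], len(classifiers)))  # first-layer outputs on the test set
accuracy = np.zeros(len(classifiers))  # accumulated accuracy of each model
# For stacking, each round trains on part of the training set and predicts the rest of the
# training set plus the test set; these outputs become the inputs of the second layer.
# Note: shuffle splits are drawn independently, so some rows may never be held out
# (their x1_train entries stay 0) and others may be overwritten by a later round.
for train_index, test_index in sss.split(x, y):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for clf_num, clf in enumerate(classifiers):
        clf.fit(x_train, y_train)
        # the second layer's training input is the first layer's prediction on the held-out rows
        x1_train[test_index, clf_num] = clf.predict(x_test)
        # predict the real test set in every round; averaged after the loop
        x1_test[:, clf_num] += clf.predict(test)
        accuracy[clf_num] += (y_test == x1_train[test_index, clf_num]).mean()

x1_test = x1_test / 10  # average the ten test-set predictions so they lie in [0, 1] like x1_train
accuracy = accuracy / 10
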
x2_train, x2_test, y2_train, y2_test = train_test_split(x1_train, y, test_size=0.1, random_state=0)

gbm = xgb.XGBClassifier().fit(x2_train, y2_train)
predictions = gbm.predict(x2_test)
print((y2_test == predictions).mean())

test_predictions = gbm.predict(x1_test)
StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,
                            'Survived': test_predictions.astype(int) })
StackingSubmission.to_csv("xgboost_Stacking.csv", index=False)
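
One way to make sure every training row gets exactly one out-of-fold prediction is to use a K-fold partition instead of random shuffle splits. The sketch below is a variant under stated assumptions, not the code behind the 0.76555 score: it uses `StratifiedKFold` with `cross_val_predict`, feeds predicted probabilities rather than hard labels to the second layer, and swaps in `LogisticRegression` as the second-layer model instead of xgboost. The data is synthetic and the names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded Titanic features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_fit, X_new, y_fit, y_new = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

base_models = [
    KNeighborsClassifier(3),
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
]
# Ten disjoint folds: every row is held out exactly once.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold probabilities: each row is predicted by a model that never saw it.
meta_fit = np.column_stack([
    cross_val_predict(m, X_fit, y_fit, cv=folds, method="predict_proba")[:, 1]
    for m in base_models
])
# For unseen data, refit each base model on all training rows.
meta_new = np.column_stack([
    m.fit(X_fit, y_fit).predict_proba(X_new)[:, 1] for m in base_models
])

# Second layer: a simple model trained on the first-layer outputs.
meta_model = LogisticRegression().fit(meta_fit, y_fit)
stack_accuracy = meta_model.score(meta_new, y_new)
print(meta_fit.shape, round(stack_accuracy, 3))
```

With only a handful of first-layer columns, a simple second-layer model is often enough, which fits the article's observation that the second layer has very few input features.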