The survival predictions from the earlier logistic regression model scored 0.75119 on Kaggle, emmm, which is a fairly poor score, so let's improve it below.
-----------------------------------------------------------------------------------------------
一、Judging the fit status
Since overfitting and underfitting call for different treatments of the dataset, we first need to determine whether the current model is overfitting or underfitting.
Baidu Baike: underfitting
Baidu Baike: overfitting
We can judge this by plotting a learning curve (number of training samples on the x-axis, accuracy on the y-axis).
learning_curve official documentation | learning_curve official example code
First, define the plotting function for the learning curve:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Define the learning-curve plotting function
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.05, 1., 20),
                        verbose=0, plot=True):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shaded bands show +/- one standard deviation across the CV folds
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt
Set the function's parameters based on the existing model, then call it:
# Concrete parameters
estimator = lrModel
title = 'Learning Curves (LogisticRegression)'
X = data_train[inputcolumns]
y = data_train[outpucolumns].values.ravel()

# 100 random 80/20 train/validation splits of the 891 training rows
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

# Call the plotting function
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)
plt.show()
This produces the plot below, which indicates underfitting: the training and cross-validation curves converge, but both level off at a fairly low score (a persistent gap between a high training score and a low cross-validation score would instead suggest overfitting). So more feature engineering is needed:
二、Feature engineering
Looking at how the dataset has been processed so far, we can dig deeper in the following directions:
1) Can the so-far-unused Name and Ticket columns be put to use? (a sketch for Name follows the list below)
2) Parch and SibSp count, respectively, the number of parents/children and siblings/spouses aboard; can their sum represent the size of the family travelling together (head count)?
3) Imputing the missing ages with a random forest alone was not ideal; is there a better approach?
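For direction 1), this post ultimately leaves Name unused, but as a minimal sketch (not part of the original pipeline), the title embedded in each name can be extracted and one-hot encoded; the regex and the rarity threshold below are illustrative assumptions:

# Illustrative sketch only: extract the title from Name,
# e.g. "Braund, Mr. Owen Harris" -> "Mr" (standard Kaggle Titanic schema)
data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Collapse infrequent titles into one bucket before one-hot encoding
counts = data['Title'].value_counts()
data['Title'] = data['Title'].replace(counts[counts < 10].index, 'Rare')

dummies_Title = pd.get_dummies(data['Title'], prefix='Title')
data = pd.concat([data, dummies_Title], axis=1)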
After closer inspection, the data is first processed as follows (from easiest to hardest):
1) Sum Parch and SibSp to get the family size
# Sum the Parch and SibSp variables to get the family size
data['family_size'] = data['Parch'] + data['SibSp']
2) Group by Ticket to get the per-person fare, then discretize it into fare bands
# Group by Ticket to get the per-person fare (several passengers can share
# one ticket, in which case Fare is the total for the group),
# then discretize it into fare bands
data['Fare'] = data['Fare'] / data.groupby(by=['Ticket'])['Fare'].transform('count')
data['Fare'].describe()

def fare_level(s):
    if s <= 5:                # budget ticket
        m = 0
    elif s > 5 and s <= 20:   # standard ticket
        m = 1
    elif s > 20 and s <= 40:  # first-class ticket
        m = 2
    else:                     # premium ticket
        m = 3
    return m

data['Fare_level'] = data['Fare'].apply(fare_level)
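The thresholds 5/20/40 above are hand-picked; a quick, purely illustrative way to check that the bands are reasonably populated is:

# Illustrative check: number of passengers in each fare band
print(data['Fare_level'].value_counts().sort_index())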
3) Combine the two factors with the biggest influence on the outcome, Sex and Pclass, into a new variable
# Concatenate Sex and Pclass into one categorical feature, e.g. "female_1"
data['Sex_Pclass'] = data.Sex + "_" + data.Pclass.map(str)
dummies_Sex_Pclass = pd.get_dummies(data['Sex_Pclass'], prefix='Sex_Pclass')
data = pd.concat([data, dummies_Sex_Pclass], axis=1)
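With the 'Sex_Pclass' prefix, get_dummies produces the six columns Sex_Pclass_female_1 through Sex_Pclass_male_3 that the age-imputation step below relies on.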
4) Fill in the missing ages, here using the mean of a linear regression and a random forest prediction
age_data = data[['Age', 'Fare_level', 'family_size', 'Pclass',
                 'Sex_Pclass_female_1', 'Sex_Pclass_female_2', 'Sex_Pclass_female_3',
                 'Sex_Pclass_male_1', 'Sex_Pclass_male_2', 'Sex_Pclass_male_3',
                 'embarked_C', 'embarked_Q', 'embarked_S']]
fcolumns = ['Fare_level', 'family_size', 'Pclass',
            'Sex_Pclass_female_1', 'Sex_Pclass_female_2', 'Sex_Pclass_female_3',
            'Sex_Pclass_male_1', 'Sex_Pclass_male_2', 'Sex_Pclass_male_3',
            'embarked_C', 'embarked_Q', 'embarked_S']
tcolumns = ['Age']

# Rows with a known age form the training set; rows with a missing age get predicted
age_data_known = age_data[age_data.Age.notnull()]
age_data_unknown = age_data[age_data.Age.isnull()]
x = age_data_known[fcolumns]                 # feature variables
y = age_data_known[tcolumns].values.ravel()  # target variable as a 1-D array

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Linear regression (the deprecated `normalize` option is dropped here,
# since it was removed from LinearRegression in newer scikit-learn)
lr = LinearRegression()
lr_grid_pattern = {'fit_intercept': [True]}
lr_grid = GridSearchCV(lr, lr_grid_pattern, cv=10, n_jobs=1, verbose=1,
                       scoring='neg_mean_squared_error')
lr_grid.fit(x, y)
print('Age feature Best LR Params:' + str(lr_grid.best_params_))
print('Age feature Best LR Score:' + str(lr_grid.best_score_))
lr_pred = lr_grid.predict(age_data_unknown[fcolumns])

# Random forest regression
rfr = RandomForestRegressor()
rfr_grid_pattern = {'max_depth': [3], 'max_features': [3]}
rfr_grid = GridSearchCV(rfr, rfr_grid_pattern, cv=10, n_jobs=1, verbose=1,
                        scoring='neg_mean_squared_error')
rfr_grid.fit(x, y)
print('Age feature Best RFR Params:' + str(rfr_grid.best_params_))
print('Age feature Best RFR Score:' + str(rfr_grid.best_score_))
rfr_pred = rfr_grid.predict(age_data_unknown[fcolumns])

# Take the mean of the two predictions and write it back to the missing rows
# (plain arrays are used so pandas does not realign on the index)
data.loc[data['Age'].isnull(), 'Age'] = (lr_pred + rfr_pred) / 2
5) Discretize age into age bands
def age_level(s):
    if s <= 14:               # child
        m = 0
    elif s > 14 and s <= 35:  # young adult
        m = 1
    elif s > 35 and s <= 60:  # middle-aged
        m = 2
    else:                     # elderly
        m = 3
    return m

data['age_level'] = data['Age'].apply(age_level)
三、Fitting a single model
Still using a logistic regression model, fit the dataset processed above:
data_train = data.drop(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
                        'Fare', 'Cabin', 'Embarked', 'Sex_Pclass'],
                       axis=1, inplace=False)
#data_train.columns

# Fit a single model again (liblinear is needed for the L1 penalty in newer sklearn)
from sklearn import linear_model
lrModel = linear_model.LogisticRegression(penalty='l1', solver='liblinear')
inputcolumns = ['family_size', 'Fare_level',
                'Sex_Pclass_female_1', 'Sex_Pclass_female_2', 'Sex_Pclass_female_3',
                'Sex_Pclass_male_1', 'Sex_Pclass_male_2', 'Sex_Pclass_male_3',
                'embarked_C', 'embarked_Q', 'embarked_S', 'age_level']
outpucolumns = ['Survived']
lrModel.fit(data_train[inputcolumns], data_train[outpucolumns].values.ravel())
lrModel.score(data_train[inputcolumns], data_train[outpucolumns].values.ravel())
-----------------------------------------------------------------------
After applying the same processing to the test set, the predictions were uploaded to Kaggle and scored 0.77, an improvement of 0.02. >_<
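For reference, producing the submission file might look like the sketch below; data_test (the test set after the same feature engineering, with PassengerId preserved) is an assumed name, not from the original post:

# data_test is an assumed variable: the test set after the same feature
# engineering as above, with the original PassengerId column kept
predictions = lrModel.predict(data_test[inputcolumns])
submission = pd.DataFrame({
    'PassengerId': data_test['PassengerId'],
    'Survived': predictions.astype(int),
})
submission.to_csv('logistic_regression_predictions.csv', index=False)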
Next up: further optimization with cross-validation and model ensembling.