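The project in this post comes from a classic Kaggle competition: predicting survivors of the Titanic. The write-up has two parts, data analysis and data mining. The previous post covered data analysis; this post covers data mining.
This post is organized into the following parts: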
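Categorical values such as female/male must be converted into 0/1 values that a model can accept, a step also called quantization. In the previous post's analysis we counted the missing values of each feature:
Next we handle the missing values of four features in turn: Fare, Embarked, Cabin, and Age.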
Check the missing values of the Fare feature:
df[df['Fare'].isnull()]
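There is only one missing value. It belongs to a male passenger older than 60 travelling in class 3.
Rather than deleting this row, we fill the gap using samples with similar features. The other samples that share these features are: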
df.loc[(df['Pclass']==3)&(df['Sex']=='male')&(df['Age']>60)]
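We fill the missing value with the mean Fare of the samples above:
# Extract the lowercase surname so the passenger can be located by name
df['surname'] = df['Name'].apply(lambda x: x.split(',')[0].lower())
# Mean fare of third-class male passengers older than 60 (mean() skips NaN)
fare_mean_estimated = df.loc[(df['Pclass'] == 3) & (df['Age'] > 60) & (df['Sex'] == 'male')].Fare.mean()
# Fill the single missing fare, which belongs to the passenger named Storey
df.loc[df['surname'] == 'storey', 'Fare'] = fare_mean_estimated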
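The same idea can be tried on a small table. The sketch below uses made-up rows (not the real Titanic data) and selects the row to fill by its missing value rather than by surname, which is an alternative to the approach above, not the original method.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, not the real Titanic rows.
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 1],
    "Sex": ["male", "male", "male", "male"],
    "Age": [61.0, 70.0, 65.0, 62.0],
    "Fare": [6.0, 8.0, np.nan, 80.0],
})

# Rows that resemble the passenger whose fare is missing.
similar = (toy["Pclass"] == 3) & (toy["Sex"] == "male") & (toy["Age"] > 60)

# mean() skips NaN, so the missing row does not distort the estimate.
fare_estimate = toy.loc[similar, "Fare"].mean()

# Fill only the missing entries, selected by a null mask.
toy.loc[toy["Fare"].isnull(), "Fare"] = fare_estimate
```

Selecting by null mask avoids depending on a particular passenger's name, at the cost of filling every missing fare with the same estimate.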
Check the missing values of the Embarked feature:
df[df['Embarked'].isnull()]
The analysis in the previous post showed that most passengers embarked at port S, so we fill the missing values with S:
df['Embarked'] = df['Embarked'].fillna('S')
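Instead of hard-coding 'S', the most frequent port can be computed from the column itself. This is a minimal sketch on a made-up series; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of embarkation ports with one missing entry.
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

# mode() ignores NaN by default and returns the most frequent value(s).
most_common_port = ports.mode()[0]

# Replace missing entries with the most frequent port.
filled = ports.fillna(most_common_port)
```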
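The Cabin feature is missing for most passengers, so we derive a new feature that records whether Cabin information is present:
# A missing Cabin is stored as NaN (a float), so a float value means no cabin was recorded
data_train['Has_Cabin'] = data_train['Cabin'].apply(lambda x: 0 if type(x) == float else 1)
data_test['Has_Cabin'] = data_test['Cabin'].apply(lambda x: 0 if type(x) == float else 1)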
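The type check above relies on NaN being a float. An equivalent and arguably more explicit formulation uses `notnull()`; the sketch below shows it on a made-up series.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values; NaN marks a passenger with no recorded cabin.
cabin = pd.Series(["C85", np.nan, "E46", np.nan], name="Cabin")

# notnull() gives True where a cabin is recorded; cast to 0/1 integers.
has_cabin = cabin.notnull().astype(int)
```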
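After this processing the dataset has the following form:
The Age feature has some missing values and takes many distinct values. In this step we fill the missing values and then bin Age into intervals.
First, fill the missing values: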
import numpy as np

full_data = [data_train, data_test]
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # Draw random integer ages within one standard deviation of the mean
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # Assign through .loc so the write reaches the DataFrame itself (no chained assignment)
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
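Because the fill values are random, results change from run to run unless the generator is seeded. The sketch below repeats the same imputation on a made-up series with a seeded NumPy generator; the seed and the sample ages are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the fill is reproducible

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0], name="Age")

# Bounds: one standard deviation either side of the mean, truncated to int.
low = int(ages.mean() - ages.std())
high = int(ages.mean() + ages.std())

missing = ages.isnull()
ages_filled = ages.copy()
# integers() samples from [low, high), one value per missing entry.
ages_filled[missing] = rng.integers(low, high, size=int(missing.sum()))
ages_filled = ages_filled.astype(int)
```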
Group the many Age values into bins:
# Five equal-width age intervals, kept only to inspect the bin boundaries
data_train['CategoricalAge'] = pd.cut(data_train['Age'], 5)
for dataset in full_data:
    # Mapping Age
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
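The chain of `.loc` assignments can also be written as a single `pd.cut` call with explicit bin edges and integer labels. This is a sketch on made-up ages that reproduces the same boundaries (16, 32, 48, 64).

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin boundaries.
ages = pd.Series([4, 16, 17, 32, 40, 60, 64, 80])

# Right-closed intervals: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bins = [-np.inf, 16, 32, 48, 64, np.inf]
age_group = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
```

Because `pd.cut` returns a new series, it also avoids overwriting Age values while later conditions are still being evaluated against them.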
Drop the features that are no longer needed:
drop_elements = ['PassengerId']
data_train = data_train.drop(drop_elements, axis=1)
data_train = data_train.drop(['CategoricalAge'], axis=1)
Finally, the dataset has the following form:
In this part we process the Name feature, deriving a new feature variable from it.
import re

# Extract the title from a passenger name
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Create the new feature Title
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

# Replace uncommon titles with "Rare" and merge equivalent spellings
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
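To see what the regular expression matches, the sketch below applies the same pattern to a few sample strings written in the "Surname, Title. Given names" layout. The helper name `extract_title` and the sample list are illustrative.

```python
import re

# A space, then letters, then a literal dot: captures "Mr" in ", Mr. Owen".
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the title found in a name, or an empty string if none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "NoTitleHere",
]
titles = [extract_title(n) for n in names]
```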
Convert the data to numeric values:
for dataset in full_data:
    # Sex
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
    # Titles: anything outside the mapping becomes NaN and is then set to 0
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    # Embarked
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
    # Fare
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
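One detail worth checking in the mapping step is what happens to values that are not in the dictionary: `Series.map` turns them into NaN, which is why the Title column needs `fillna(0)` before it can be cast to integers. A minimal sketch with a made-up series:

```python
import pandas as pd

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

# "Unknown" is a hypothetical value that is absent from the mapping.
titles = pd.Series(["Mr", "Miss", "Rare", "Unknown"])

mapped = titles.map(title_mapping)      # unmapped values become NaN
encoded = mapped.fillna(0).astype(int)  # NaN -> 0, then safe to cast
```

The same caution applies to Sex and Embarked: if either column still contained a missing or unexpected value, the direct `.astype(int)` call would fail.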
Remove the features that are no longer needed:
drop_elements = ['Name', 'Ticket']
data_train = data_train.drop(drop_elements, axis=1)
data_test = data_test.drop(drop_elements, axis=1)
This gives a dataset of the following form:
It contains the following features:
Index(['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Family', 'Has_Cabin', 'Title'], dtype='object')
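We score each feature variable with the ANOVA F-test. The score indicates how strongly each feature is associated with the target variable. The code is as follows: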
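import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

target = data_train["Survived"].values
# Columns present after preprocessing ('Name_length' is not among them)
features = ['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Family', 'Has_Cabin', 'Title']
# Score the predictors only: scoring the target against itself is meaningless
predictors = [f for f in features if f != 'Survived']
train = data_train.copy()
test = data_test.copy()
selector = SelectKBest(f_classif, k=len(predictors))
selector.fit(train[predictors], target)
# A larger -log10(p) means stronger evidence of association with the target
scores = -np.log10(selector.pvalues_)
indices = np.argsort(scores)[::-1]
print("Features importance :")
for f in range(len(scores)):
    print("%0.2f %s" % (scores[indices[f]], predictors[indices[f]]))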
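To get a feel for what the F-test score measures, the sketch below runs `f_classif` on synthetic data with one feature that depends on the label and one that is pure noise. The data and seed are assumptions for illustration, not Titanic values.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 200

# Binary target, analogous to Survived.
y = rng.integers(0, 2, size=n)

# One feature shifted by the label, one unrelated to it.
informative = y + rng.normal(0, 0.5, size=n)
noise = rng.normal(0, 1, size=n)
X = np.column_stack([informative, noise])

# f_classif returns one F statistic and one p-value per column.
f_scores, p_values = f_classif(X, y)
```

The informative column receives a much larger F statistic and a much smaller p-value than the noise column, which is the ordering the `-log10(p)` ranking above relies on.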
The result:
Next, analyze the correlation between the features and view it as a heatmap:
import matplotlib.pyplot as plt
import seaborn as sns

features_selected = features
df_corr = data_train[features_selected].copy()
colormap = plt.cm.RdBu
plt.figure(figsize=(20, 20))
sns.heatmap(df_corr.corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
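Strongly correlated features carry largely redundant information and tend to cause overfitting, so they should be removed. The ideal case is that all features are only weakly correlated with one another while each has high variance, that is, high information content.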
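One common way to act on the heatmap is to scan the upper triangle of the absolute correlation matrix and drop one feature from each highly correlated pair. The sketch below does this on synthetic columns; the 0.9 threshold and the column names are assumptions, not values taken from this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
frame = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=100),  # near-duplicate of a
    "b": rng.normal(size=100),                           # independent column
})

corr = frame.corr().abs()
# Keep only entries above the diagonal so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = frame.drop(columns=to_drop)
```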
Split the dataset:
from sklearn.model_selection import train_test_split

X_all = data_train.drop(['Survived'], axis=1)
y_all = data_train['Survived']
# Hold out 20% of the training data for validation
num_test = 0.20
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=23)
Here we use a random forest (RandomForest) model. Build the model:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier()
# Parameter grid to search
# ('auto' is omitted from max_features: for classifiers it meant 'sqrt' and newer scikit-learn releases reject it)
parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt'],
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
             }
acc_scorer = make_scorer(accuracy_score)
grid_obj = GridSearchCV(clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)
# Take the best parameter combination and fit it on the training split
clf = grid_obj.best_estimator_
clf.fit(X_train, y_train)
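The sketch below runs the same kind of grid search on a small synthetic dataset so the moving parts are easy to inspect. The grid, the data, and `cv=3` are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {"n_estimators": [5, 10], "max_depth": [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_   # winning combination from the grid
best_score = search.best_score_     # mean cross-validated accuracy for it
best_model = search.best_estimator_ # already refit on all of X, y (refit=True)
```

Since `refit=True` is the default, `best_estimator_` has already been trained on the full data passed to `fit`, so a further call to `fit` on the same data is optional.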
This gives the fitted model. Evaluate it on the held-out split:
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
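The accuracy on the held-out split is 0.8435754189944135, while the submission to Kaggle scores 0.77990, so further improvement is needed. K-fold cross-validation gives a steadier estimate than a single split: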
from sklearn.model_selection import KFold

def run_kfold(clf):
    # 10 folds over the 891 training rows
    kf = KFold(n_splits=10)
    outcomes = []
    fold = 0
    for train_index, test_index in kf.split(X_all):
        fold += 1
        X_train, X_test = X_all.values[train_index], X_all.values[test_index]
        y_train, y_test = y_all.values[train_index], y_all.values[test_index]
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        outcomes.append(accuracy)
        print("Fold {0} accuracy: {1}".format(fold, accuracy))
    mean_outcome = np.mean(outcomes)
    print("Mean Accuracy: {0}".format(mean_outcome))

run_kfold(clf)
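The manual loop can be condensed with `cross_val_score`, which handles the splitting, fitting, and scoring. This is a sketch on synthetic data; the shuffling and the seeds are assumptions added for reproducibility and are not part of the loop above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared Titanic features and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0)

# One accuracy value per fold; each fold fits a fresh clone of the model.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
```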
This gives:
# Predict on the test set and write the submission file
ids = data_test['PassengerId']
predictions = clf.predict(data_test.drop('PassengerId', axis=1))
output = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})
output.to_csv('titanic-predictions.csv', index=False)
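With a score submitted, this beginner competition is finished for now, and the current rank is 4986. Learning from it has only just begun, though: there is still a lot to improve and a lot worth studying, such as the following points: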
References:
[Kaggle beginner competition: lessons from a top 5% ranking] Modeling part
[Machine learning] Cross-Validation explained in detail
Corrections are welcome.