Having studied the ID3, C4.5, and CART decision-tree algorithms, I wanted somewhere to try them out. Kaggle's Titanic getting-started competition is a good fit, so here are my notes on the Python workflow.
First register an account, then search for Titanic under Competitions in the top menu bar to find the Titanic competition. It is a practice contest meant to help newcomers get started, and the competition page links to plenty of introductory tutorials that are well worth a look.
The datasets are under Data in the competition's menu bar; there are three files.
After downloading the data, we need to look at what it contains and how complete it is. The site documents the columns as follows:
The Python code used:
```python
import pandas as pd

train_data = pd.read_csv("../docs/train.csv")
test_data = pd.read_csv("../docs/test.csv")

# Basic shape of the table: row/column counts, dtypes, completeness
print(train_data.info())
print("_" * 30)

# Summary statistics: count, mean, std, min, max
print(train_data.describe())
print("_" * 30)

# Overview of the string (non-numeric) columns
print(train_data.describe(include=['O']))
print("_" * 30)

# First five rows
print(train_data.head())
print("_" * 30)

# Last five rows
print(train_data.tail())
print("_" * 30)
```
運行結果大體以下:測試
```
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
______________________________
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]
______________________________
                            Name   Sex  Ticket    Cabin Embarked
count                        891   891     891      204      889
unique                       891     2     681      147        3
top     Allen, Mr. William Henry  male  347082  B96 B98        S
freq                           1   577       7        4      644
______________________________
   PassengerId  Survived  Pclass  ...     Fare Cabin Embarked
0            1         0       3  ...   7.2500   NaN        S
1            2         1       1  ...  71.2833   C85        C
2            3         1       3  ...   7.9250   NaN        S
3            4         1       1  ...  53.1000  C123        S
4            5         0       3  ...   8.0500   NaN        S

[5 rows x 12 columns]
______________________________
     PassengerId  Survived  Pclass  ...   Fare Cabin Embarked
886          887         0       2  ...  13.00   NaN        S
887          888         1       1  ...  30.00   B42        S
888          889         0       3  ...  23.45   NaN        S
889          890         1       1  ...  30.00  C148        C
890          891         0       3  ...   7.75   NaN        Q

[5 rows x 12 columns]
```
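To quantify the missingness at a glance before choosing a fill strategy, `isnull().sum()` gives per-column missing counts in one call. A minimal sketch on a toy frame (the values here are made up, not rows from the real CSV):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking a few of the Titanic columns (hypothetical values)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# One missing count per column, in column order
missing = df.isnull().sum()
print(missing)
```

Running this on the real `train_data` reproduces the gaps visible in the `info()` output above (e.g. 891 - 714 = 177 missing Age values).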
Exploration shows that Age, Cabin, and Embarked have missing values in the training set (and Fare in the test set). Age and Fare are numeric, so we simply fill them with the mean. Embarked is a string column, and S is by far its most common value, so we fill its gaps with S. (Cabin is also mostly missing, but we will drop it entirely in the next step.)
```python
train_data["Age"].fillna(train_data["Age"].mean(), inplace=True)
test_data["Age"].fillna(test_data["Age"].mean(), inplace=True)
train_data["Fare"].fillna(train_data["Fare"].mean(), inplace=True)
test_data["Fare"].fillna(test_data["Fare"].mean(), inplace=True)
train_data["Embarked"].fillna("S", inplace=True)
test_data["Embarked"].fillna("S", inplace=True)
```
From the exploration: PassengerId is just a sequence number, useless for classification; Name is likewise useless; Cabin has too many missing values, so we set it aside; Ticket numbers look arbitrary and patternless, so they go too. That leaves Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked. We take these columns as the training features and convert the string-valued ones into numeric representations.
```python
from sklearn.feature_extraction import DictVectorizer

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]

# DictVectorizer one-hot encodes string values and passes numbers through
dvec = DictVectorizer(sparse=False)
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))
```
Train a model with the decision tree from Python's machine-learning library scikit-learn:
```python
from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" uses information gain, as in ID3/C4.5-style trees
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(train_features, train_labels)
```
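Once fitted, the tree's `feature_importances_` attribute shows which columns drive its splits. A self-contained sketch on a tiny made-up dataset where the second feature fully determines the label:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: feature 1 perfectly predicts the label, feature 0 is noise
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

# All of the information gain comes from feature 1
print(clf.feature_importances_)
```

On the real Titanic features the same attribute (paired with `dvec.get_feature_names_out()`) tells you which inputs, such as the Sex one-hot columns, matter most to the model.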
Run the model on the test set and write the predictions to a CSV file for submission to Kaggle:
```python
import csv

# Reuse the fitted vectorizer so test columns line up with training columns
test_features = dvec.transform(test_features.to_dict(orient="records"))
pred_labels = clf.predict(test_features)
print(test_features)
print(pred_labels)
print("_" * 30)

with open("submission.csv", encoding="utf-8", mode="w", newline="") as f:
    writer = csv.writer(f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["PassengerId", "Survived"])
    for passenger_id, label in zip(test_data["PassengerId"], pred_labels):
        writer.writerow([passenger_id, label])
```
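As a shorter alternative to the manual `csv.writer` loop, `pandas.DataFrame.to_csv` builds the same two-column file in one line. A sketch with hypothetical ids and predictions standing in for the real test set:

```python
import pandas as pd

# Hypothetical stand-ins for test_data["PassengerId"] and pred_labels
passenger_ids = [892, 893, 894]
preds = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": preds})

# index=False drops pandas' row index so only the two required columns appear;
# pass a filename instead of nothing to write submission.csv directly
csv_text = submission.to_csv(index=False)
print(csv_text)
```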
We can validate the model in two ways: the first simply scores it on the training data, the second uses 10-fold cross-validation.
```python
import numpy as np
from sklearn.model_selection import cross_val_score

# Accuracy on the training data itself (optimistic)
acc_decision_tree = round(clf.score(train_features, train_labels), 6)
print(acc_decision_tree)
print("_" * 30)

# 10-fold cross-validation gives a more honest estimate
print(np.mean(cross_val_score(clf, train_features, train_labels, cv=10)))
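The gap between those two numbers is the point: an unpruned decision tree can memorise its training data, so the first score is optimistic, while cross-validation scores on held-out folds. A self-contained sketch on synthetic data (not the Titanic features) showing the gap:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 samples, 7 features, fixed seed
X, y = make_classification(n_samples=300, n_features=7, random_state=42)

clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X, y)

train_acc = clf.score(X, y)                           # scored on memorised data
cv_acc = np.mean(cross_val_score(clf, X, y, cv=10))   # held-out estimate
print(train_acc, cv_acc)
```

The training accuracy comes out at a perfect 1.0 while the cross-validated accuracy is noticeably lower; the latter is the number worth reporting.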
Click Submit Predictions on the Kaggle Titanic competition page and upload the submission.csv generated above. My ranking landed somewhere past 10,000th place. But the rank isn't the point: this attempt was good fun and gave me a basic feel for the whole prediction workflow and for how Kaggle operates.