Kaggle's description of the data: https://www.kaggle.com/c/titanic/data
Read the data and display its basic info:
import pandas as pd
import matplotlib.pyplot as plt

data_train = pd.read_csv("./data/train.csv")
data_train.info()
The output is as follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
PassengerId => passenger ID
Survived => whether the passenger survived (present only in the training set, not in the test set)
Pclass => passenger class (1st/2nd/3rd)
Name => passenger name
Sex => sex
Age => age
SibSp => number of siblings/spouses aboard
Parch => number of parents/children aboard
Ticket => ticket number
Fare => fare
Cabin => cabin
Embarked => port of embarkation
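As a quick cross-check of the non-null counts above, pandas can tally the missing values per column directly; a minimal one-liner on the same data_train:

# Count missing values per column: Age (177), Cabin (687) and Embarked (2)
print(data_train.isnull().sum())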
# Count the number of survivors and deaths
def sur_die_analysis(data_train):
    fig = plt.figure()
    fig.set(alpha=0.2)  # set the figure alpha
    data_train.Survived.value_counts().plot(kind='bar')  # bar chart
    plt.title(u"Survival counts (1 = survived)")
    plt.ylabel(u"Count")
    plt.show()
# Pclass
def pclass_analysis(data_train):
    fig = plt.figure()
    fig.set(alpha=0.2)  # set the figure alpha
    sur_data = data_train.Pclass[data_train.Survived == 1].value_counts()
    die_data = data_train.Pclass[data_train.Survived == 0].value_counts()
    pd.DataFrame({'Survived': sur_data, 'Died': die_data}).plot(kind='bar')
    plt.ylabel(u"Count")
    plt.title(u"Passenger class distribution")
    plt.show()
From the distribution it is obvious that passengers in Pclass 1 and 2 survived at a much higher rate than those in Pclass 3.
# Sex
def sex_analysis(data_train):
    no_survived_g = data_train.Sex[data_train.Survived == 0].value_counts()
    no_survived_g.to_csv("no_survived_g.csv")
    survived_g = data_train.Sex[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by sex')
    plt.xlabel('Sex')
    plt.ylabel('Count')
    plt.show()
Women survived at a noticeably higher rate than men.
# Age: split ages into ten-year bins and count survivors and deaths per bin
def age_analysis(data_train):
    data_series = pd.DataFrame(columns=['Survived', 'Died'])
    cloms = []
    for num in range(0, 10):
        clo = str(num * 10) + "-" + str((num + 1) * 10)
        cloms.append(clo)
        sur_df = data_train.Age[(10 * (num + 1) > data_train.Age) &
                                (10 * num < data_train.Age) &
                                (data_train.Survived == 1)].shape[0]
        die_df = data_train.Age[(10 * (num + 1) > data_train.Age) &
                                (10 * num < data_train.Age) &
                                (data_train.Survived == 0)].shape[0]
        data_series.loc[num] = [sur_df, die_df]
    data_series.index = cloms
    data_series.plot(kind='bar', stacked=True)
    plt.ylabel(u"Count")  # stacked survivor/death counts per age bin
    plt.grid(True, which='major', axis='y')
    plt.title(u"Survival by age group")
    plt.show()
The younger age groups clearly account for a larger share of the survivors.
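As an aside, pandas' cut can express the same ten-year binning more compactly; a minimal sketch (the crosstab layout is an illustrative alternative to the plot, not code from the original):

# Sketch: equivalent age binning via pd.cut; rows with missing Age are excluded automatically
age_group = pd.cut(data_train['Age'], bins=list(range(0, 101, 10)),
                   labels=[str(b) + "-" + str(b + 10) for b in range(0, 100, 10)])
print(pd.crosstab(age_group, data_train['Survived']))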
Define a Family feature for the number of family members aboard, discretized into three levels (matching the thresholds in the code below):
0: no family members
1: one to three members
2: four or more members
# Family: SibSp + Parch, the number of family members aboard
def family_analysis(data_train):
    data_train['Family'] = data_train['SibSp'] + data_train['Parch']
    data_train.loc[data_train.Family == 0, 'Family'] = 0
    data_train.loc[(data_train.Family > 0) & (data_train.Family < 4), 'Family'] = 1
    data_train.loc[data_train.Family >= 4, 'Family'] = 2
    no_survived_g = data_train.Family[data_train.Survived == 0].value_counts()
    survived_g = data_train.Family[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by family size')
    plt.xlabel('Level: 0 - none, 1 - (1~3), 2 - (>=4)')
    plt.ylabel('Count')
    plt.show()
Since the group sizes are very unbalanced, to judge whether SibSp really relates to survival each column can be divided by that column's total, turning counts into rates; a quick sketch of this follows below.
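A minimal sketch of that normalization, assuming df_g is the Survived/Died counts DataFrame built in family_analysis above; dividing each row by its total yields rates that compare fairly across unbalanced groups:

# Sketch: normalize each level's counts by its total to get survival/death rates
rates = df_g.div(df_g.sum(axis=1), axis=0)
rates.plot(kind='bar', stacked=True)
plt.ylabel('Rate')
plt.show()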
Fare statistics:
Once the fare rises past a certain point, the number of survivors exceeds the number of deaths.
# Fare
def fare_analysis(data_train):
    # data_train.Fare[data_train.Survived == 1].plot(kind='kde')
    # data_train.Fare[data_train.Survived == 0].plot(kind='kde')
    # data_train["Fare"].plot(kind='kde')
    # plt.legend(('survived', 'died', 'all'), loc='best')
    # plt.show()
    data_train['NewFare'] = data_train['Fare']
    data_train.loc[data_train.Fare < 50, 'NewFare'] = 0
    data_train.loc[(data_train.Fare >= 50) & (data_train.Fare < 100), 'NewFare'] = 1
    data_train.loc[(data_train.Fare >= 100) & (data_train.Fare < 150), 'NewFare'] = 2
    data_train.loc[(data_train.Fare >= 150) & (data_train.Fare < 200), 'NewFare'] = 3
    data_train.loc[data_train.Fare >= 200, 'NewFare'] = 4
    no_survived_g = data_train.NewFare[data_train.Survived == 0].value_counts()
    survived_g = data_train.NewFare[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Fare vs. survival')
    plt.xlabel('Fare level')
    plt.ylabel('Count')
    plt.show()
It is clear that passengers in the higher fare brackets survived at a much higher rate.
The five fare brackets above were picked arbitrarily; how many classes would actually fit the data best?
Clustering can be used to find the best number of classes, after which each fare value is mapped to one of them:
from sklearn.cluster import KMeans

def fare_kmeans(data_train):
    for i in range(2, 10):
        clusters = KMeans(n_clusters=i)
        clusters.fit(data_train['Fare'].values.reshape(-1, 1))
        # inertia_ measures clustering quality: the larger it is, the worse the fit
        print(str(i) + " " + str(clusters.inertia_))
Output:
2 846932.9762272763
3 399906.26606199215
4 195618.50643749788
5 104945.73652631264
6 52749.474696547695
7 35141.316334118805
8 26030.553497795216
9 19501.242236941747
Inertia always falls as the number of clusters grows, but the drop levels off noticeably around five clusters, so five classes are used here and every fare is mapped to one of them.
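Rather than eyeballing the printed numbers, the elbow can be made visible by plotting inertia against k; a minimal sketch reusing the loop above:

# Sketch: plot inertia against the number of clusters and look for the elbow
inertias = []
for i in range(2, 10):
    km = KMeans(n_clusters=i)
    km.fit(data_train['Fare'].values.reshape(-1, 1))
    inertias.append(km.inertia_)
plt.plot(range(2, 10), inertias, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.show()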
# Cluster the fares into the five classes found above
def fare_kmeans(data_train):
    clusters = KMeans(n_clusters=5)
    clusters.fit(data_train['Fare'].values.reshape(-1, 1))
    predict = clusters.predict(data_train['Fare'].values.reshape(-1, 1))
    print(predict)
    data_train['NewFare'] = predict
    print(data_train[['NewFare', 'Survived']].groupby(['NewFare'], as_index=False).mean())
    print(clusters.inertia_)
The survival rate of each mapped level is shown below; the separation is clearly better than with the arbitrary brackets above:
   NewFare  Survived
0        0  0.319832
1        1  0.647059
2        2  0.606557
3        3  1.000000
4        4  0.757576
# Embarked: port of embarkation
def embarked_analysis(data_train):
    no_survived_g = data_train.Embarked[data_train.Survived == 0].value_counts()
    survived_g = data_train.Embarked[data_train.Survived == 1].value_counts()
    df_g = pd.DataFrame({'Survived': survived_g, 'Died': no_survived_g})
    df_g.plot(kind='bar', stacked=True)
    plt.title('Survival by port of embarkation')
    plt.xlabel('Embarked')
    plt.ylabel('Count')
    plt.show()
As for the port of embarkation, the three ports show no striking difference; the survival rate at C is slightly higher than at S and Q.
As the data info at the beginning shows, several columns are partially missing: Age / Cabin / Embarked.
Missing values are handled here by simple imputation (the median, mean, or mode could all serve as fill values).
The fare is binned in the same pass:
def dataPreprocess(df):
    # Encode Sex as 0 (male) / 1 (female)
    df.loc[df['Sex'] == 'male', 'Sex'] = 0
    df.loc[df['Sex'] == 'female', 'Sex'] = 1
    # Embarked has two missing entries; fill them first
    df['Embarked'] = df['Embarked'].fillna('S')
    # Some ages are missing; fill them with the median
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
    df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
    df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2
    df['FamilySize'] = df['SibSp'] + df['Parch']
    df['IsAlone'] = 0
    df.loc[df['FamilySize'] == 0, 'IsAlone'] = 1
    # drop() returns a new DataFrame, so assign the result back
    df = df.drop(['FamilySize', 'Parch', 'SibSp'], axis=1)
    return fare_kmeans(df)

def fare_kmeans(data_train):
    clusters = KMeans(n_clusters=5)
    clusters.fit(data_train['Fare'].values.reshape(-1, 1))
    predict = clusters.predict(data_train['Fare'].values.reshape(-1, 1))
    data_train['NewFare'] = predict
    data_train = data_train.drop('Fare', axis=1)
    # print(data_train[['NewFare', 'Survived']].groupby(['NewFare'], as_index=False).mean())
    # print(clusters.inertia_)
    return data_train
Categorical features are encoded here with plain ordinal codes; one-hot encoding could be used instead so that the distance between any two categories is the same.
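A minimal sketch of that one-hot alternative with pandas' get_dummies, applied to the raw S/C/Q values before any numeric mapping (the Embarked_S-style column names are get_dummies defaults, not names from the original code):

# Sketch: one-hot encode Embarked instead of mapping it to 0/1/2
dummies = pd.get_dummies(data_train['Embarked'], prefix='Embarked')  # Embarked_S, Embarked_C, Embarked_Q
data_train = pd.concat([data_train, dummies], axis=1).drop('Embarked', axis=1)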
The above gives an intuitive picture of how each feature relates to survival. sklearn also provides feature-scoring functions that make the importance of each feature easy to read off:
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

predictors = ["Pclass", "Sex", "Age", "NewFare", "Embarked", 'IsAlone']
# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(data_train[predictors], data_train["Survived"])
# Turn the raw p-values for each feature into scores
scores = -np.log10(selector.pvalues_)
# Plot the scores to see which features matter most
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
The plot shows which of the six input features are the most important.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def linearRegression(df):
    predictors = ['Pclass', 'Sex', 'Age', 'IsAlone', 'NewFare', 'Embarked']
    # predictors = ['Pclass', 'Sex', 'Age', 'IsAlone', 'NewFare', 'EmbarkedS', 'EmbarkedC', 'EmbarkedQ']
    alg = LinearRegression()
    X = df[predictors]
    Y = df['Survived']
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    # Print the sizes of the training and test sets
    print(X_train.shape)
    print(Y_train.shape)
    print(X_test.shape)
    print(Y_test.shape)
    # Fit the model
    alg.fit(X_train, Y_train)
    print(alg.intercept_)
    print(alg.coef_)
    Y_predict = alg.predict(X_test)
    # Threshold the regression output at 0.5 to get 0/1 labels
    Y_predict[Y_predict >= 0.5] = 1
    Y_predict[Y_predict < 0.5] = 0
    acc = sum(Y_predict == Y_test) / len(Y_predict)
    return acc
Test-set prediction accuracy of this model: 0.79
Next, pick the five most valuable features, train a model on them, and check its performance:
from sklearn.ensemble import RandomForestClassifier

def randomForest(data_train):
    # Pick only the five best features
    predictors = ["Pclass", "Sex", "NewFare", "Embarked", 'IsAlone']
    X_train, X_test, Y_train, Y_test = train_test_split(
        data_train[predictors], data_train['Survived'], test_size=0.2)
    alg = RandomForestClassifier(random_state=1, n_estimators=50,
                                 min_samples_split=8, min_samples_leaf=4)
    alg.fit(X_train, Y_train)
    Y_predict = alg.predict(X_test)
    acc = sum(Y_predict == Y_test) / len(Y_predict)
    return acc
On the test split this model reaches an accuracy of 0.811.
Preliminary analysis: Age is not among the five selected features, and because a large share of its values are missing, Age may have been hurting prediction accuracy.
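Since a single 80/20 split gives a noisy accuracy estimate, cross-validation is a steadier check; a minimal sketch with sklearn's cross_val_score (the 0.79 and 0.811 figures above come from single splits and are not reproduced here):

from sklearn.model_selection import cross_val_score

# Sketch: 5-fold cross-validated accuracy for the same model and features
def randomForest_cv(data_train):
    predictors = ["Pclass", "Sex", "NewFare", "Embarked", 'IsAlone']
    alg = RandomForestClassifier(random_state=1, n_estimators=50,
                                 min_samples_split=8, min_samples_leaf=4)
    scores = cross_val_score(alg, data_train[predictors], data_train['Survived'], cv=5)
    return scores.mean()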
The code has been pushed to GitHub: https://github.com/lsfzlj/kaggle
Corrections and discussion are welcome.
References:
https://blog.csdn.net/han_xiaoyang/article/details/49797143
https://blog.csdn.net/CSDN_Black/article/details/80309542
https://www.kaggle.com/sinakhorami/titanic-best-working-classifier