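1. Introduction
Titanic: Machine Learning from Disaster is Kaggle's introductory competition. See the competition page for details. The data can be downloaded from the official site, which requires registering and logging in. The training set is in train.csv and the test set is in test.csv. The feature processing here mostly follows Sina's Titanic best working Classifier.
First, get an overview of the training set. It has 891 samples and 10 features: passenger name, ticket class, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, fare, cabin number, and port of embarkation (the 12 columns reported by train.info() also include PassengerId and the Survived label). Some of these features have missing values, and the data mixes discrete, continuous, and string types, so it needs processing. Below we analyze each feature and then extract the ones we want.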
import numpy as np
import pandas as pd
import re

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB
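2. Feature analysis
After looking at the data as a whole, we process each feature in turn and check how it affects the survival rate.
(1) Ticket class: Pclass
There are three ticket classes. The feature is discrete, so no special processing is applied, and the survival rate clearly differs a lot across classes.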
print(train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())

   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
(2) Sex: Sex
Sex has a strong influence on the outcome, so it serves as an important feature. The values are strings, so they are mapped to 0 and 1 in later processing.
print(train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean())

      Sex  Survived
0  female  0.742038
1    male  0.188908
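(3) Family size: FamilySize
Here SibSp and Parch are merged into a single feature, the family size, and a second feature is derived from it that flags whether the passenger is travelling alone.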
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
print(train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean())

train['IsAlone'] = 0
train.loc[train['FamilySize'] == 1, 'IsAlone'] = 1
print(train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean())

   FamilySize  Survived
0           1  0.303538
1           2  0.552795
2           3  0.578431
3           4  0.724138
4           5  0.200000
5           6  0.136364
6           7  0.333333
7           8  0.000000
8          11  0.000000

   IsAlone  Survived
0        0  0.505650
1        1  0.303538
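The two derived columns can be checked on a few hand-made rows. This is only a sketch: the `toy` frame and its values are invented for illustration and are not taken from train.csv, and the boolean-cast form of IsAlone is an equivalent alternative to the .loc assignment, not the original author's code.

```python
import pandas as pd

# Toy rows (made up for illustration), not taken from train.csv.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size counts the passenger plus siblings/spouses and parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# A boolean comparison cast to int gives the same 0/1 flag as the .loc assignment.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```

Only the second row has a family size of 1, so it is the only one flagged as alone.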
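(4) Port of embarkation: Embarked
Embarked records the port where the passenger boarded. It takes three values, C, Q, and S, and has missing values, which are filled with S. It also needs to be mapped to 0, 1, 2.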
train['Embarked'] = train['Embarked'].fillna('S')
print(train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean())

  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
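Filling a categorical gap with a fixed value is usually justified by that value being the most frequent one. The sketch below shows how such a fill value could be derived rather than hard-coded. The `embarked` series is invented for illustration, and choosing the mode is an assumption about the motivation, not something the original text states.

```python
import pandas as pd

# Toy column with one missing entry (made up for illustration).
embarked = pd.Series(["S", "C", None, "S", "Q", "S"])

# mode() ignores missing values and returns the most frequent value(s).
most_common = embarked.mode()[0]

# Fill the gap with that value.
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```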
(5) Fare: Fare
Fare is continuous. We split it into four quantile bins (quartiles), which are later mapped to 0, 1, 2, 3.
train['Fare'] = train['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
print(train[['CategoricalFare', 'Survived']].groupby(['CategoricalFare'], as_index=False).mean())

   CategoricalFare  Survived
0        [0, 7.91]  0.197309
1   (7.91, 14.454]  0.303571
2     (14.454, 31]  0.454955
3    (31, 512.329]  0.581081
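To see what quantile binning does, here is a small sketch on invented fares (the `fares` values are illustrative, not rows from train.csv). Passing `labels=False` makes pd.qcut return the integer bin codes directly instead of interval objects.

```python
import pandas as pd

# Eight made-up fares, for illustration only.
fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08])

# qcut picks bin edges at the quartiles, so each bin holds about the same
# number of samples; labels=False returns the integer codes 0..3.
codes = pd.qcut(fares, 4, labels=False)

print(codes.tolist())
print(codes.value_counts().sort_index().tolist())
```

Unlike equal-width bins, the quartile bins each receive two of the eight samples here, even though the fares are heavily skewed.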
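(6) Age: Age
Age is also continuous and has many missing values. We can treat the missing values as a category of their own and cut the remaining ages into five equal-width bins. This is done later in the data cleaning step and gives six categories in total.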
age_null_count = train['Age'].isnull().sum()
print(train['Survived'][train['Age'].isnull()].mean())
train['CategoricalAge'] = pd.cut(train['Age'], 5)
print(train[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean())

0.293785310734

     CategoricalAge  Survived
0    (0.34, 16.336]  0.550000
1  (16.336, 32.252]  0.369942
2  (32.252, 48.168]  0.404255
3  (48.168, 64.084]  0.434783
4      (64.084, 80]  0.090909
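The "five equal-width bins plus one class for missing values" idea can be sketched as follows. The `ages` values are invented for illustration, chosen only so that they span 0.42 to 80 like the bin edges printed above. Using code 5 for missing ages follows the description in this section.

```python
import numpy as np
import pandas as pd

# Made-up ages spanning 0.42..80, with one missing value.
ages = pd.Series([22.0, 38.0, np.nan, 4.0, 80.0, 54.0, 0.42])

# cut() splits the observed range into 5 equal-width bins;
# labels=False returns codes 0..4 and leaves missing ages as NaN.
codes = pd.cut(ages, 5, labels=False)

# Missing ages become a sixth category, 5.
codes = codes.fillna(5).astype(int)

print(codes.tolist())
```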
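(7) Name: Name
Name is string data, and features are hard to mine from it. Here we extract the title from each name and group the rare ones, such as 'Lady', 'Countess', 'Capt', and 'Col', into a single class, which gives five classes in total.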
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If a title is present, return it
    if title_search:
        return title_search.group(1)
    return ""

train['Title'] = train['Name'].apply(get_title)
print(pd.crosstab(train['Title'], train['Sex']))

train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
    'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
print(train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())

Sex       female  male
Title
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
Miss         182     0
Mlle           2     0
Mme            1     0
Mr             0   517
Mrs          125     0
Ms             1     0
Rev            0     6
Sir            0     1

    Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4    Rare  0.347826
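The title extraction relies on the names following the pattern "Surname, Title. Given names". A small standalone sketch makes the regular expression and the grouping easier to follow. The sample names are invented for illustration, and `group_title`, `RARE`, and `ALIASES` are hypothetical helpers that mirror the replace() calls above rather than the author's own code.

```python
import re

def get_title(name):
    # A title is a run of letters preceded by a space and followed by a period.
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

# Hypothetical grouping tables mirroring the replace() calls in the text.
RARE = {"Lady", "Countess", "Capt", "Col", "Don", "Dr",
        "Major", "Rev", "Sir", "Jonkheer", "Dona"}
ALIASES = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}

def group_title(title):
    if title in RARE:
        return "Rare"
    return ALIASES.get(title, title)

# Made-up sample names, for illustration only.
names = [
    "Smith, Mr. John",
    "Jones, Miss. Anna",
    "Example, the Countess. of Somewhere",
    "Doe, Mlle. Jane",
    "No Title Here",
]

titles = [get_title(n) for n in names]
grouped = [group_title(t) for t in titles]
print(titles)
print(grouped)
```

A name without the "Title." pattern yields an empty string, which the cleaning step later encodes as 0.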
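(8) Others
Ticket and Cabin are not mined. Ticket is largely specific to each passenger, so extracting information from it is difficult, and Cabin has too many missing values, so it is left unprocessed.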
train['Ticket'].head(5)

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object
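3. Data cleaning
Put train and test into one list so that the training and test sets are processed together, following the analysis above. The main steps are mapping the string features to integers and binning the continuous features into integer categories.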
full_data = [train, test]

for dataset in full_data:
    # Map sex to 0/1
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)

    # Merge SibSp and Parch into the family size, and derive the IsAlone flag
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

    # Titles are coded 1-5 (Mr, Miss, Mrs, Master, Rare); 0 means no title was found
    dataset['Title'] = dataset['Name'].apply(get_title)
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    # Port of embarkation: fill missing values with S, then map the three values
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

    # Fare: four categories 0, 1, 2, 3
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

    # Age: categories 0-4, with missing values as category 5
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
    dataset.loc[dataset['Age'].isnull(), 'Age'] = 5
    dataset['Age'] = dataset['Age'].astype(int)
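The chained .loc assignments for Fare depend on being executed in order, because each one overwrites the column that the next one tests. A more compact alternative is pd.cut with explicit bin edges. This is a sketch under the assumption that the same edges (7.91, 14.454, 31) are wanted. `fare_to_code` is a hypothetical helper, not part of the original code, and the sample fares are invented.

```python
import numpy as np
import pandas as pd

def fare_to_code(fare):
    # Intervals are right-closed by default: (-inf, 7.91], (7.91, 14.454],
    # (14.454, 31], (31, inf). labels=False returns the codes 0..3.
    edges = [-np.inf, 7.91, 14.454, 31, np.inf]
    return pd.cut(fare, bins=edges, labels=False)

# Made-up fares, including values exactly on the bin edges.
fares = pd.Series([7.25, 7.91, 10.0, 14.454, 31.0, 31.01, 512.33])
codes = fare_to_code(fares)
print(codes.tolist())
```

Values that fall exactly on an edge land in the lower bin, matching the `<=` comparisons in the loop.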
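4. Feature selection
Remove the redundant columns and keep the features we want. The final features are Pclass, Sex, Age, Fare, Embarked, FamilySize, IsAlone, and Title, eight in total. For training, the Survived column must be split off and used as the target.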
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch']
train = train.drop(drop_elements, axis=1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis=1)
test = test.drop(drop_elements, axis=1)

print(train.head(5))
train = train.values
test = test.values

   Survived  Pclass  Sex  Age  Fare  Embarked  FamilySize  IsAlone  Title
0         0       3    1    1     0         0           2        0      1
1         1       1    0    2     3         1           2        0      3
2         1       3    0    1     1         0           1        1      2
3         1       1    0    2     3         0           2        0      3
4         0       3    1    2     1         0           1        1      1
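Since Survived is the first column of the resulting array, the target and the feature matrix can be separated by slicing. A minimal sketch, using the five rows printed above as a stand-in for the full array. The names `X` and `y` are illustrative and do not appear in the original.

```python
import numpy as np

# The five rows from the printed head(), standing in for train.values.
# Columns: Survived, Pclass, Sex, Age, Fare, Embarked, FamilySize, IsAlone, Title
train = np.array([
    [0, 3, 1, 1, 0, 0, 2, 0, 1],
    [1, 1, 0, 2, 3, 1, 2, 0, 3],
    [1, 3, 0, 1, 1, 0, 1, 1, 2],
    [1, 1, 0, 2, 3, 0, 2, 0, 3],
    [0, 3, 1, 2, 1, 0, 1, 1, 1],
])

y = train[:, 0]    # target: the Survived column
X = train[:, 1:]   # the eight remaining feature columns

print(X.shape, y.tolist())
```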
5. Summary
This completes the data processing. The resulting features can now be used to train a model.