Introductory Machine Learning Exercises

Date: 2019-12-08

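1. Write a function that converts a multi-category variable into several binary dummy variables, without using the sklearn library.

Converting a multi-category variable into several binary dummy variables is a common data preprocessing step. For example:

Take Sex. It is a single variable whose values can be ['male', 'female'], and it is expanded into two variables, Sex_male and Sex_female.

A row where Sex was male gets 1 in the new variable Sex_male and 0 in the new variable Sex_female.
A row where Sex was female gets 0 in the new variable Sex_male and 1 in the new variable Sex_female.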

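Some data mining algorithms, certain classification algorithms in particular, require attributes to be categorical, and algorithms that discover association patterns further require binary attributes. Continuous attributes therefore often have to be converted to categorical ones (discretization), and both continuous and discrete attributes may need to be converted into one or more binary attributes (binarization). If a categorical attribute has too many categories and some values occur infrequently, those values can be merged according to the mining task to reduce the number of categories.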
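The two ideas in this paragraph, discretizing a continuous attribute and merging infrequent categories, can be sketched with pandas. The data, the helper name merge_rare, the bin edges and the 'Other' label are all made up for illustration; this is one possible approach, not a prescribed one.

```python
import pandas as pd

# Toy data, invented for this sketch.
ages = [5, 17, 42, 70]
ports = pd.Series(['S', 'S', 'C', 'S', 'Q', 'C', 'S', 'X'])

# Discretization: cut a continuous attribute into labelled intervals
# (0, 14], (14, 60], (60, 120].
age_groups = pd.cut(ages, bins=[0, 14, 60, 120], labels=['child', 'adult', 'senior'])

def merge_rare(values, min_count=2, other='Other'):
    """Hypothetical helper: replace categories seen fewer than min_count times."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    return values.where(~values.isin(rare), other)

merged = merge_rare(ports)
print(list(age_groups))   # ['child', 'adult', 'adult', 'senior']
print(merged.tolist())    # ['S', 'S', 'C', 'S', 'Other', 'C', 'S', 'Other']
```

After merging, the attribute has three categories instead of four, so binarizing it produces fewer and denser columns.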

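In Python this usually means turning one column of a DataFrame into several: a single column (a Series) is converted into a DataFrame with several columns, which is then appended to the original DataFrame.

pandas.get_dummies provides this functionality. Below is a demonstration using Sex.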

import pandas as pd
import numpy as np

data = {
    'Name': ['Richard Dawkins', 'Eileen Chang', 'Steven Pinker', 'Madonna Ciccone', 'Herbert A. Simon'],
    'Year': [1941, 1920, 1954, 1958, 1916 ],
    'Sex': ['male', 'female', 'male', 'female', 'male']
}
famous_df = pd.DataFrame(data)
famous_df

	Name	Sex	Year
0	Richard Dawkins	male	1941
1	Eileen Chang	female	1920
2	Steven Pinker	male	1954
3	Madonna Ciccone	female	1958
4	Herbert A. Simon	male	1916

print(type(famous_df['Sex']))  # the input is a Series
dummies_Sex = pd.get_dummies(famous_df['Sex'], prefix='Sex')  # binarize Sex; prefix is prepended to the new column names (a Series input gets no prefix by default)
print(type(dummies_Sex))  # the output is a DataFrame
print(dummies_Sex.head())

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
   Sex_female  Sex_male
0         0.0       1.0
1         1.0       0.0
2         0.0       1.0
3         1.0       0.0
4         0.0       1.0

famous_df = pd.concat([famous_df, dummies_Sex], axis=1)  # use concat to append the generated DataFrame to the original one
print(famous_df)

               Name     Sex  Year  Sex_female  Sex_male
0   Richard Dawkins    male  1941         0.0       1.0
1      Eileen Chang  female  1920         1.0       0.0
2     Steven Pinker    male  1954         0.0       1.0
3   Madonna Ciccone  female  1958         1.0       0.0
4  Herbert A. Simon    male  1916         0.0       1.0

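The Sex variable has become two binary variables, Sex_female and Sex_male.

Here we write a custom function that provides the same functionality as pandas.get_dummies.

Function name: OneVarToMany

Parameters:

data : Series, a multi-category variable

prefix : string, prefix for the new variable names, default None

prefix_sep : string, separator between the prefix and the variable name when a prefix is given, default '_'

dummy_na : bool, whether to add a column for NaN values; if False, NaNs are ignored. Default False

Returns:

dfmany : DataFrame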

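def OneVarToMany(data, prefix=None, prefix_sep='_', dummy_na=False):  # binarization function
    dfmany = pd.DataFrame(index=data.index)  # start with an empty DataFrame on the same index as the input
    n = len(data)  # number of samples (count() would skip NaN rows)
    colNames = data.unique()  # the distinct category values become the new variable names
    if not dummy_na:
        colNames = colNames[~pd.isnull(colNames)]  # ignore NaN so that it gets no column

    for i in range(colNames.shape[0]):  # for each name, build a Series and append it to dfmany
        colName = colNames[i]
        seriesobj = pd.Series(np.zeros(n, dtype='int8'), index=data.index)  # vector of n zeros
        mask = data.isnull() if pd.isnull(colName) else (data == colName)
        seriesobj[mask] = 1  # set the matching rows to 1
        colName = str(colName)
        if prefix is not None:
            colName = prefix + prefix_sep + colName
        seriesobj.name = colName  # set the column name
        dfmany = pd.concat([dfmany, seriesobj], axis=1)  # append the new Series to the DataFrame
    return dfmany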

Test the function on the dataset.

sdata = {
    'Name': ['Richard Dawkins', 'Eileen Chang', 'Steven Pinker', 'Madonna Ciccone', 'Herbert A. Simon'],
    'Year': [1941, 1920, 1954, 1958, 1916 ],
    'Sex': ['male', 'female', 'male', 'female', 'male']
}
df = pd.DataFrame(sdata)
dummies_Sex = OneVarToMany(df['Sex'], prefix= 'Sex')
df = pd.concat([df, dummies_Sex], axis=1)
print(df)

               Name     Sex  Year  Sex_male  Sex_female
0   Richard Dawkins    male  1941         1           0
1      Eileen Chang  female  1920         0           1
2     Steven Pinker    male  1954         1           0
3   Madonna Ciccone  female  1958         0           1
4  Herbert A. Simon    male  1916         1           0

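The result matches pandas.get_dummies, apart from the column order (here the columns follow the order in which the values first appear) and the dtype.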
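As a cross-check, the same encoding can be written more compactly with a boolean comparison per category. The helper name one_hot and the toy Series are hypothetical; the comparison against pandas.get_dummies selects the columns in the order pandas uses and casts both sides to int8, since the dtype pandas returns varies between versions.

```python
import pandas as pd

def one_hot(series, prefix=None, prefix_sep='_'):
    """Hypothetical helper: one 0/1 column per distinct non-null value."""
    columns = {}
    for value in series.dropna().unique():
        name = str(value) if prefix is None else prefix + prefix_sep + str(value)
        columns[name] = (series == value).astype('int8')  # 1 where the row matches
    return pd.DataFrame(columns, index=series.index)

# Toy input with a missing value and a non-default index.
sex = pd.Series(['male', 'female', 'male', None], index=[10, 11, 12, 13])
encoded = one_hot(sex, prefix='Sex')

# Compare with pandas, ignoring column order and dtype differences.
expected = pd.get_dummies(sex, prefix='Sex').astype('int8')
same = (encoded[expected.columns].to_numpy() == expected.to_numpy()).all()
print(encoded)
print(same)
```

The row with the missing value gets 0 in every column, which is also what pandas.get_dummies does when dummy_na is False.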

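2. Write a function that implements cross-validation, without using the sklearn library.

Cross-validation here means splitting the data into two parts, one used for training and one for validation. Strictly speaking, a single split like this is a hold-out split; k-fold cross-validation repeats the split so that each part is used once for validation.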
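For comparison with the single split used in this exercise, k-fold cross-validation can be sketched in a few lines of NumPy. The function name k_fold_indices and its parameters are assumptions made for this illustration, not part of the exercise.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)   # shuffle the row indices once
    folds = np.array_split(shuffled, k)     # k nearly equal parts (k must be at least 2)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(sorted(test_idx.tolist()), len(train_idx))
```

Each sample lands in exactly one test fold, so a model's scores over the k folds can be averaged to get an estimate that depends less on one lucky or unlucky split.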

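sklearn.cross_validation.train_test_split implements this kind of split (from scikit-learn 0.18 onward it lives in sklearn.model_selection). For example:

import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in versions before 0.18

X, y = np.arange(10).reshape((5, 2)), list(range(5))  # generate X and y
print(X)
print(y)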

[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
[0, 1, 2, 3, 4]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)  # train_size sets the proportion used for training
print(X_train)
print(X_test)
print(y_train)
print(y_test)

[[4 5]
 [0 1]
 [6 7]]
[[2 3]
 [8 9]]
[2, 0, 3]
[1, 4]

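Below, a custom function implements similar functionality. It is a simplified version of train_test_split, and the input type is restricted to numpy.array.

Function name: data_split

Parameters:

x_arrays : numpy.array, the independent-variable data to split

y_arrays : numpy.array, the dependent-variable data to split

train_size : float or int (default float). If float, it should be between 0 and 1 and gives the proportion of the data to put in the training set; if int, it is the number of training samples.

random_state : int, random seed

Returns:

X_train, X_test, y_train, y_test : four arrays, the training and test sets after the split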

import random
def data_split(x_arrays, y_arrays, train_size=0.75, random_state=12345):
    n = x_arrays.shape[0]  # total number of samples
    sample_count = 0

    # rule out invalid input first
    if y_arrays.shape[0] != n:  # X and y have different numbers of samples
        raise Exception("The length of independent variable and dependent variable are not equal.")

    if type(train_size) == int:  # work out the number of training samples
        if train_size > n:
            raise Exception("The train_size cannot be bigger than the length of the whole datasets.")
        sample_count = train_size
    elif type(train_size) == float:  # a float train_size is a proportion and should be between 0 and 1
        if train_size < 0 or train_size > 1:
            raise Exception("The train_size must be between 0 and 1.")
        sample_count = int(train_size * n)
    else:
        raise Exception("The train_size must be int or float.")

    # now the actual work
    random.seed(random_state)  # set the random seed
    listrange = range(0, n)
    train_index = random.sample(listrange, sample_count)  # randomly pick sample_count of the n samples for training
    test_index = [i for i in listrange if i not in train_index]  # the remaining samples form the test set
    X_train = x_arrays[train_index, :]  # training-set independent variables
    X_test = x_arrays[test_index, :]  # test-set independent variables
    y_train = y_arrays[train_index]  # training-set dependent variable
    y_test = y_arrays[test_index]  # test-set dependent variable

    return X_train, X_test, y_train, y_test

Test it.

X, y = np.arange(10).reshape((5, 2)), np.arange(5)  # y was a list in the train_test_split example above; here it is a numpy.array for convenience
X_train, X_test, y_train, y_test = data_split(X, y, train_size=0.75, random_state=42)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

[[6 7]
 [0 1]
 [8 9]]
[[2 3]
 [4 5]]
[3 0 4]
[1 2]

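The function behaves like sklearn.cross_validation.train_test_split: the split sizes are the same (3 training rows, 2 test rows), although the rows selected differ because the two use different random number generators.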
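An alternative way to write the same kind of split is to shuffle the row indices once with NumPy and cut the shuffled order in two. This is only a sketch; the name holdout_split and its defaults are assumptions for illustration, and it makes no attempt to reproduce the rows that sklearn or data_split would pick.

```python
import numpy as np

def holdout_split(X, y, train_size=0.75, seed=12345):
    """Sketch: shuffle row indices once, then cut them into train and test."""
    n = X.shape[0]
    if y.shape[0] != n:
        raise ValueError("X and y must have the same number of rows.")
    n_train = int(train_size * n)                      # e.g. int(0.75 * 5) == 3
    order = np.random.RandomState(seed).permutation(n)
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(10).reshape((5, 2))
y = np.arange(5)
X_train, X_test, y_train, y_test = holdout_split(X, y, train_size=0.75, seed=42)
print(X_train.shape, X_test.shape)   # (3, 2) (2, 2)
```

Slicing one permutation avoids the membership test on every index, which matters once the dataset has more than a handful of rows.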

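3. Use other classification methods in the sklearn library to predict survival on the Titanic.

On the night of April 14, 1912, the luxury liner Titanic, carrying 1,316 passengers and 891 crew, struck an iceberg and sank in the early hours of April 15. The disaster is regarded as one of the ten great catastrophes of the 20th century. There were 2,208 crew and passengers aboard, but the lifeboats could hold only 1,178 people, and in the end only 705 survived.

Many records of the disaster are available. I once came across an interesting Weibo post; I don't know whether it is true:

In 1898 an American writer published a novella about a luxury liner named Titan that set out from England on its maiden voyage across the Atlantic and sank after striking an iceberg. Because there were too few lifeboats, many people died. Fourteen years later the Titanic tragedy happened. The writer was Morgan Robertson, and the novella was "The Wreck of the Titan: Or, Futility". The fictional Titan and the real Titanic were strikingly similar in length, tonnage, passenger capacity, number of propellers and number of lifeboats. This rivals the story of the Hong Kong feng shui master who predicted the Japanese earthquake and tsunami. (posted by 子夏曰)

This dataset contains records for 1,309 passengers.

The data has the following fields:

PassengerID
Survived (whether the passenger survived)
Pclass (passenger class)
Name (name)
Sex (sex)
Age (age)
SibSp (number of siblings and spouses aboard)
Parch (number of parents and children aboard)
Ticket (ticket number)
Fare (fare)
Cabin (cabin location)
Embarked (port of embarkation code)

Reading the data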

passenger_train = pd.read_csv('./Data/Titanic/train.csv', header=0)
passenger_test = pd.read_csv('./Data/Titanic/test.csv', header=0)
passenger_train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Exploring the data

print(passenger_train.PassengerId.count())  # number of passengers in the training set
print(passenger_test.PassengerId.count())  # number of passengers in the test set

891
418

print(passenger_train.Survived.value_counts())

0    549
1    342
Name: Survived, dtype: int64

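The training set has 891 passengers in total: 342 survived and 549 died.

First, check whether children had a higher survival rate. Here a child is defined as anyone under 14.

passenger_train.info()  # show summary information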

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The data is mostly complete; only the Age, Cabin and Embarked fields have missing values. First fill in the Age field.

passenger_train.loc[passenger_train['Age'].isnull(), 'Age'] = passenger_train['Age'].median()  # fill missing ages with the median

%matplotlib inline
import matplotlib.pyplot as plt;

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
passenger_train.Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[0]); # sort_index() keeps the index order consistent across plots
passenger_train[passenger_train.Age<14].Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[1]); # children: deaths vs. survivals
passenger_train[passenger_train.Age>=14].Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[2]); # adults: deaths vs. survivals

print(passenger_train[passenger_train.Age<14].PassengerId.count())  # number of children
print(passenger_train[passenger_train.Age>=14].PassengerId.count())  # number of adults (the 643 below covers only passengers with a recorded age; with the median fill applied first it would be 820)

71
643

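As the plots show, although fewer passengers survived than died overall, more children survived than died, so the deaths must have fallen disproportionately on adults. Children were indeed given priority.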
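Bar charts of counts are easy to misread when the groups differ a lot in size, so it can help to compute the survival rate per group directly. The sketch below uses a tiny made-up table rather than train.csv; the column name Group is an assumption for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not taken from train.csv.
toy = pd.DataFrame({
    'Age':      [4, 9, 30, 45, 12, 60, 25, 2],
    'Survived': [1, 1, 0, 0, 0, 0, 1, 1],
})

# Label each row as child (under 14) or adult, then average Survived per group.
toy['Group'] = toy['Age'].lt(14).map({True: 'child', False: 'adult'})
rates = toy.groupby('Group')['Survived'].mean()
print(rates)
```

Because Survived is coded 0/1, its mean within a group is exactly that group's survival rate; in the toy data it is 0.75 for children and 0.25 for adults.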

Next, look at the effect of sex on survival.

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
passenger_train.Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[0], title='All');
passenger_train[passenger_train.Sex=='female'].Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[1], title='female');
passenger_train[passenger_train.Sex=='male'].Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[2], title='male');

print(passenger_train[passenger_train.Sex=='female'].PassengerId.count())  # number of women
print(passenger_train[passenger_train.Sex=='male'].PassengerId.count())  # number of men

314
577

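The result is moving: far more women survived than died, and far more men died than survived. From the bar charts one can imagine the admirable gentlemanly conduct the men showed in this disaster.

Next, look at the effect of passenger class on survival.

The Titanic's luxury and refinement were unprecedented. The ship had an indoor swimming pool, a gymnasium, Turkish baths, a library, elevators and a squash court. The first-class lounge was decorated with fine wood panelling and furnished with high-end furniture and other ornaments, and no effort was spared in providing a level of service never seen before. The sunlit Café Parisien served first-class passengers fine refreshments. Even the second- and third-class accommodation and lounges were of a high standard, comparable to first class on many other liners of the time. Three elevators served first-class passengers exclusively; as an innovation, second-class passengers also had one elevator, while third-class passengers still had to use the stairs.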

passenger_train.Pclass.value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
passenger_train.Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[0], title='All Class');
passenger_train[passenger_train.Pclass==1].Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[1], title='Class 1');
passenger_train[passenger_train.Pclass==2].Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[2], title='Class 2');
passenger_train[passenger_train.Pclass==3].Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[3], title='Class 3');

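The plots clearly show that first class had the highest survival rate, second class came next, and third class had the lowest. Passenger class did affect a passenger's chance of survival.

Let's also check whether children's chances of survival differed by class.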

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
passenger_train[passenger_train.Age<14].Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[0], title='All Children');
passenger_train[(passenger_train.Pclass==1) & (passenger_train.Age<14)]    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[1], title='Class 1 Children'); # first-class children
passenger_train[(passenger_train.Pclass==2) & (passenger_train.Age<14)]    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[2], title='Class 2 Children'); # second-class children
passenger_train[(passenger_train.Pclass==3) & (passenger_train.Age<14)]    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[3], title='Class 3 Children'); # third-class children

print(passenger_train[passenger_train.Age<14].PassengerId.count())  # all children
print(passenger_train[(passenger_train.Age<14) & (passenger_train.Survived==0)].PassengerId.count())  # children who died
print(passenger_train[(passenger_train.Age<14) & (passenger_train.Survived==1)].PassengerId.count())  # children who survived
print(passenger_train[(passenger_train.Age<14) & (passenger_train.Survived==0) & (passenger_train.Pclass==3)].PassengerId.count())  # third-class children who died
print(passenger_train[(passenger_train.Age<14) & (passenger_train.Survived==1) & (passenger_train.Pclass==3)].PassengerId.count())  # third-class children who survived

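There were 71 children under 14 in total: 29 died and 42 survived. First class had 4 children, of whom 1 died and 3 survived; all 18 second-class children survived; third class had 49 children, of whom 28 died and 21 survived. The bar charts also show that children in the higher classes survived at a higher rate.

Now compare the survival counts of first-class men and third-class children.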

fig, axes = plt.subplots(1, 2, figsize=(12, 3))
passenger_train[(passenger_train.Pclass==1) & (passenger_train.Age>=14) & (passenger_train.Sex=='male')]    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[0], title='Class 1 Men'); # first-class men
passenger_train[(passenger_train.Pclass==3) & (passenger_train.Age<14)]    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[1], title='Class 3 Children'); # third-class children

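By eye, third-class children had a somewhat higher chance of survival.

Since we have come this far, let's also compare the survival counts of first-class men and third-class women.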

fig, axes = plt.subplots(1, 2, figsize=(12, 3))
passenger_train[(passenger_train.Pclass==1) & (passenger_train.Age>=14) & (passenger_train.Sex=='male')]    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[0], title='Class 1 Men'); # first-class men
passenger_train[(passenger_train.Pclass==3) & (passenger_train.Age>=14) & (passenger_train.Sex=='female')]    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[1], title='Class 3 Women'); # third-class women

First-class men did not have a higher chance of survival than third-class women; sex took priority over class.
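Comparisons across two attributes at once, such as class and sex, can also be read off a single table of survival rates rather than pairs of bar charts. The sketch below uses a small invented table, not train.csv, and pivot_table is just one convenient way to build it.

```python
import pandas as pd

# Invented rows for illustration; not taken from train.csv.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female'],
    'Survived': [1, 0, 1, 1, 0, 0, 1, 0],
})

# Mean of the 0/1 Survived column per (Pclass, Sex) cell = survival rate of that cell.
rate_table = toy.pivot_table(index='Pclass', columns='Sex', values='Survived', aggfunc='mean')
print(rate_table)
```

Each cell is directly comparable with every other cell, so a question like "first-class men versus third-class women" becomes a lookup of two numbers.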

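So far we have looked at sex, age and class, and all three clearly affect survival. Next, look at SibSp (number of siblings and spouses aboard) and Parch (number of parents and children aboard).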

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
passenger_train.Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[0], title='All');
passenger_train[(passenger_train.SibSp>0)]    \
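    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[1], title='SibSp'); # passengers with siblings or a spouse aboard
passenger_train[passenger_train.Parch>0]    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[2], title='Parch'); # passengers with parents or children aboard

Judging from the plots, passengers with siblings or a spouse aboard, or with parents or children aboard, had a somewhat higher survival rate.

Next, look at Embarked (port of embarkation code).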

passenger_train.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
passenger_train.Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[0], title='All');
passenger_train[passenger_train.Embarked == 'S']    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[1], title='S'); # S
passenger_train[passenger_train.Embarked == 'C']    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[2], title='C'); # C
passenger_train[passenger_train.Embarked == 'Q']    \
    .Survived.value_counts().sort_index().plot(kind = 'bar', ax =axes[3], title='Q'); # Q

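Passengers who embarked at port C survived at a higher rate than the others, but I think this is just chance: random does not mean uniform, and the port is not a cause of survival. If this counted, there would be no end of such "causes" to be found on the ship.

Next, look at the ticket price, Fare, and plot the fare frequencies for each class.

fig, axes = plt.subplots(4, 1, figsize=(12, 8))
passenger_train.Fare.hist(bins = 100, ax=axes[0]); # fare frequencies, all passengers
passenger_train[passenger_train.Pclass==1].Fare.hist(bins = 100, ax=axes[1]); # fare frequencies, first class
passenger_train[passenger_train.Pclass==2].Fare.hist(bins = 100, ax=axes[2]); # fare frequencies, second class
passenger_train[passenger_train.Pclass==3].Fare.hist(bins = 100, ax=axes[3]); # fare frequencies, third class

The higher the class, the higher the fare. Fare and passenger class carry the same kind of information, and class is already a very good discretization of price, so Fare is left out of the prediction here.

Cabin has too many missing values, so it is dropped; Ticket has too many distinct values to be meaningful, so it is dropped; Name carries no useful information, so it is dropped. Embarked and PassengerId are dropped as well.

Removing irrelevant attributes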

passenger_train = passenger_train.drop('Name', axis=1)
passenger_train = passenger_train.drop('Ticket', axis=1)
passenger_train = passenger_train.drop('Embarked', axis=1)
passenger_train = passenger_train.drop('Cabin', axis=1)
passenger_train = passenger_train.drop('Fare', axis=1)
passenger_train = passenger_train.drop('PassengerId', axis=1)

passenger_train.head()

	Survived	Pclass	Sex	Age	SibSp
0	0	3	male	22.0	1
1	1	1	female	38.0	1
2	1	3	female	26.0	0
3	1	1	female	35.0	1
4	0	3	male	35.0	0

passenger_train.Sex = passenger_train.Sex.map({"male": 1, "female": 0})  # map the labels to numeric codes

passenger_train.loc[passenger_train.Age<15,'Age'] = 0
passenger_train.loc[(passenger_train.Age>=15) & (passenger_train.Age<30),'Age'] = 1
passenger_train.loc[(passenger_train.Age>=30) & (passenger_train.Age<45),'Age'] = 2
passenger_train.loc[(passenger_train.Age>=45) & (passenger_train.Age<60),'Age'] = 3
passenger_train.loc[passenger_train.Age>=60,'Age'] = 4

passenger_train.head()

	Survived	Pclass	Sex	Age	SibSp
0	0	3	1	1.0	1
1	1	1	0	2.0	1
2	1	3	0	1.0	0
3	1	1	0	2.0	1
4	0	3	1	2.0	0
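The chain of range assignments above works only because every code (0 to 4) is itself below 15 and the under-15 rule runs first; reordering the rules would re-bin rows that had already been coded. A less order-sensitive way to express the same bins is pandas.cut. The sketch below uses a handful of example ages; the variable names are assumptions for illustration.

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, 4.0, 71.0])

# Left-closed bins: [0, 15), [15, 30), [30, 45), [45, 60), [60, inf)
edges = [0, 15, 30, 45, 60, float('inf')]
age_group = pd.cut(ages, bins=edges, right=False, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_group.tolist())   # [1, 2, 1, 2, 2, 0, 4]
```

The first five values are the ages of the first five passengers shown earlier, and they receive the same codes as in the table above (1, 2, 1, 2, 2).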

passenger_train.info()  # show summary information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null int64
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
dtypes: float64(1), int64(5)
memory usage: 41.8 KB

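With the Age field binned into five groups, split the data into a training set (80%) and a test set.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in versions before 0.18

X_train, X_test, y_train, y_test = train_test_split(passenger_train.drop('Survived', axis=1), passenger_train.Survived, train_size=0.8)

Now try various classification algorithms.

from sklearn import datasets
from sklearn import model_selection as cross_validation  # the sklearn.cross_validation module was removed in scikit-learn 0.20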
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn import neighbors
from sklearn import svm
from sklearn import ensemble
from sklearn import cluster

First try logistic regression.

classifier = linear_model.LogisticRegression()
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

y_test_pred = classifier.predict(X_test)

print(metrics.classification_report(y_test, y_test_pred))  # true y versus predicted y

             precision    recall  f1-score   support

          0       0.84      0.87      0.85       112
          1       0.77      0.73      0.75        67

avg / total       0.81      0.82      0.81       179

precision is the precision, recall is the recall, and f1-score is the F1 value. The metrics are reasonably good. Next, look at the confusion matrix.

metrics.confusion_matrix(y_test, y_test_pred)

array([[97, 15],
       [18, 49]])

146 passengers were predicted correctly and 33 incorrectly. Not bad.
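The figures in the classification report follow directly from this confusion matrix, where rows are the true classes and columns the predicted ones. The short NumPy sketch below recomputes them; the variable names are chosen for illustration.

```python
import numpy as np

cm = np.array([[97, 15],
               [18, 49]])   # rows: true class, columns: predicted class

correct = np.trace(cm)                      # 97 + 49 = 146
accuracy = correct / cm.sum()               # 146 / 179
precision = np.diag(cm) / cm.sum(axis=0)    # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)       # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)

print(correct, cm.sum() - correct)          # 146 33
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

Rounded to two decimals this gives precision 0.84 and 0.77, recall 0.87 and 0.73, and F1 0.85 and 0.75 for classes 0 and 1, matching the report.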

Now try other classification methods. Decision tree:

classifier = tree.DecisionTreeClassifier() # decision tree
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[101,  13],
       [ 18,  47]])

classifier = neighbors.KNeighborsClassifier() # K-nearest neighbors
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[97, 17],
       [15, 50]])

classifier = svm.SVC() # support vector machine
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[101,  13],
       [ 15,  50]])

classifier = ensemble.RandomForestClassifier() # random forest
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[101,  13],
       [ 18,  47]])

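Judging by the confusion matrices, the support vector machine predicted best, with only 28 errors. Note that the logistic regression matrix has class totals of 112 and 67 while the others have 114 and 65, so it came from a different random split and the comparison with it is only approximate.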
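To make the comparison concrete, accuracy and error counts can be computed from the five confusion matrices reported above. This is a small NumPy sketch; the dictionary keys are just labels chosen here.

```python
import numpy as np

# Confusion matrices copied from the outputs above.
matrices = {
    'logistic regression': [[97, 15], [18, 49]],
    'decision tree':       [[101, 13], [18, 47]],
    'k-nearest neighbors': [[97, 17], [15, 50]],
    'SVM':                 [[101, 13], [15, 50]],
    'random forest':       [[101, 13], [18, 47]],
}

accuracy = {name: np.trace(np.array(m)) / np.sum(m) for name, m in matrices.items()}
errors = {name: int(np.sum(m) - np.trace(np.array(m))) for name, m in matrices.items()}
best = max(accuracy, key=accuracy.get)

for name in matrices:
    print('%-20s errors=%d accuracy=%.3f' % (name, errors[name], accuracy[name]))
print(best)
```

The error counts are 33, 31, 32, 28 and 31 respectively, so the gaps between the methods are a handful of passengers out of 179; with a different random split the ranking could easily change.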

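4. Study the Digit Recognizer data on Kaggle, try some feature engineering to extract features from the digits, feed them into a classifier, and see whether prediction accuracy improves over using the raw variables directly.

train.csv and test.csv contain grayscale images of the handwritten digits 0 to 9. Each image is 28 pixels high and 28 pixels wide, 28*28 = 784 pixels in total, and each pixel value is between 0 and 255.

train.csv has 785 columns: the first column is the true digit label and the remaining 784 columns are pixel values. Excluding the header row, it has 42,000 rows.

test.csv is the same as train.csv except that it has no label column. Excluding the header row, it has 28,000 rows.

First, see what the grayscale images in train.csv look like.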

digitTrain = pd.read_csv('./Data/Digit-Recognizer/train.csv')

img = digitTrain.values[0:11,1:]

fig = plt.figure() 

for i in range(0,9,1):
    print("\ncurrent num is: %d" % i)
    px = img[i,:]
    pix = []
    for j in range(28):
        pix.append([])
        for k in range(28):
            pix[j].append(px[j*28+k])
    ax = fig.add_subplot(330+i+1)
    ax.imshow(pix)
plt.show()

current num is: 0

current num is: 1

current num is: 2

current num is: 3

current num is: 4

current num is: 5

current num is: 6

current num is: 7

current num is: 8

First convert each image's pixel values to binary: every pixel value greater than 0 becomes 1.

digitdata = digitTrain.iloc[:, 1:]  # pixel data
digitdata = (digitdata > 0).astype(int)  # binarize: every pixel value greater than 0 becomes 1

digittest = digitTrain.iloc[:, 0]  # label data

X_train, X_test, y_train, y_test = train_test_split(digitdata, digittest, train_size=0.7)  # 70% for training, 30% for testing
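The binarization step described above (every pixel greater than 0 becomes 1) can be checked on a tiny array before applying it to 42,000 images. The array below is made up for illustration.

```python
import numpy as np

# A made-up 2x3 patch of grayscale values in the 0..255 range.
pixels = np.array([[0, 1, 128],
                   [255, 0, 37]])

binary = (pixels > 0).astype(np.uint8)   # threshold at 0: any ink becomes 1
print(binary.tolist())                   # [[0, 1, 1], [1, 0, 1]]
```

A comparison covers the whole 1 to 255 range in one step, whereas replacing a list of specific values only changes the values that are listed.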

After some analysis, the K-nearest neighbors algorithm looks like a suitable choice for this dataset.

classifier = neighbors.KNeighborsClassifier() # K-nearest neighbors
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

array([[1220,    5,    1,    0,    1,    1,    1,    0,    0,    1],
       [   1, 1385,    3,    1,    1,    0,    1,    2,    0,    1],
       [  31,   39, 1156,    5,    1,    0,    2,   23,    1,    2],
       [  20,   57,    9, 1153,    0,   11,    1,    8,    6,    3],
       [  27,   46,    4,    0, 1118,    0,    4,    1,    0,   30],
       [  31,   30,    1,   20,    0, 1007,    7,    1,    0,    8],
       [  42,   23,    0,    0,    1,    1, 1216,    0,    0,    0],
       [  26,   46,    6,    0,    4,    0,    0, 1232,    0,    7],
       [  24,   61,    7,   18,    6,   25,    7,    3, 1052,   13],
       [  22,   40,    4,    8,   10,    5,    2,   27,    1, 1174]])

Judging by the confusion matrix, the predictions are reasonably good.
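The exercise also asks for hand-crafted features beyond binarization. One simple candidate, offered here only as a sketch and not evaluated in this post, is the amount of ink in each row and each column of the image, which shrinks 784 raw pixels to 56 features. The function name profile_features and the demo image are hypothetical.

```python
import numpy as np

def profile_features(flat_pixels, side=28):
    """Hypothetical feature set: ink count per row followed by ink count per column."""
    image = (np.asarray(flat_pixels).reshape(side, side) > 0).astype(int)
    return np.concatenate([image.sum(axis=1), image.sum(axis=0)])

# Demo image: blank except for a short horizontal stroke on row 14, columns 3..9.
demo = np.zeros(28 * 28, dtype=int)
demo[14 * 28 + 3: 14 * 28 + 10] = 200
features = profile_features(demo)
print(features.shape, features[14])   # (56,) 7
```

Whether such features beat the raw pixels would have to be checked by training the same classifier on both representations with the same train/test split.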