A Complete Machine Learning Project, from Start to Finish

Date: 2019-11-06


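Preparing the Data

Training and test data can come from many places: a database, CSV files, or some other form of storage. To keep things simple, you can write small scripts that download and parse the data. In this article we start with one such script: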

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/ageron/handson-ml/master/'
HOUSING_PATH = 'chapter02/datasets/housing'
HOUSING_URL = DOWNLOAD_ROOT + 'datasets/housing' + '/housing.tgz'


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the local directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, 'housing.tgz')
    # Download the archive, then unpack housing.csv next to it
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


fetch_housing_data()

After running the code above, the data is on local disk. Next, load it with pandas:

import pandas as pd


def load_housing_data(housing_path=HOUSING_PATH):
    # Read the extracted CSV into a DataFrame
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


# The rest of the article works on this DataFrame
housing = load_housing_data()

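Previewing the Data

pandas parses the data into a DataFrame. Calling head() on it returns the first 5 rows by default.

There are 10 attributes in total. In these 5 rows every value is present and nothing is missing. info() gives an overview of the whole dataset:

There are 20,640 rows, which is small by ML standards, and only total_bedrooms has missing values.

Every attribute except ocean_proximity is numeric, and numeric values are easy to feed to ML algorithms. Looking again at the ocean_proximity values in the 5 rows above, we can infer that it takes one of a few fixed values, much like an enum. value_counts() shows how many rows have each value:

In addition, describe() shows summary statistics for each numeric column: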

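Glossary:

Name	Meaning
count	number of non-null values
mean	mean
min	minimum
max	maximum
std	standard deviation
25%/50%/75%	percentiles: the value below which that share of the observations falls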
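To make the glossary concrete, here is a minimal plain-Python sketch that computes the same statistics for a single column. The function name `describe` and the sample values are made up for illustration. The choice of sample standard deviation and linearly interpolated percentiles is intended to mirror the pandas defaults, which is an assumption rather than something the article states.

```python
import statistics


def describe(values):
    """Return the summary statistics listed in the glossary for one column."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between data points,
    # treating the smallest and largest value as the 0% and 100% points
    q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation (divides by n - 1)
        "min": data[0],
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": data[-1],
    }


# Made-up sample values, not taken from the housing data
summary = describe([1, 2, 3, 4, 5, 6, 7, 8])
for name, value in summary.items():
    print(name, value)
```

Reading the result: 25% of the values lie below the "25%" entry, half of them below the "50%" entry (the median), and so on.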

For a more detailed look at each attribute, call hist() to plot a histogram of every numeric attribute:

%matplotlib inline 
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
plt.show()

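A histogram makes the distribution of values easy to see: the x-axis is the value and the y-axis is the number of rows. For this dataset we notice the following:

- median_income is not the real income but a computed value, ranging from about 0.5 to 15. It is important to understand how such a value was computed.
- Some data is capped. housing_median_age and median_house_value are clearly limited: the tall bar at the right edge of their histograms shows that values were cut off at a maximum. This affects ML algorithms. For example, should predictions be capped as well? If not, the capped rows need further handling:
  - collect the true values for the capped rows, or
  - remove the capped rows.
- The attributes have very different ranges. This is dealt with later in the article.
- Some histograms are tail-heavy. This is also dealt with later.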

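Creating a Test Set

Creating the test set is also a chance to understand the data better, which helps when choosing a machine learning algorithm. The simplest approach is to pick about 20% of the rows at random as the test set.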
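As a rough illustration of the random approach, the sketch below shuffles row indices and cuts off the first 20% as the test set. The helper name `split_train_test` and its `seed` parameter are hypothetical and used only for this example.

```python
import random


def split_train_test(rows, test_ratio, seed=None):
    """Shuffle the row indices and use the first test_ratio share as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test_size = int(len(rows) * test_ratio)
    test_idx = indices[:test_size]
    train_idx = indices[test_size:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(100))  # stand-in for 100 data rows
train, test = split_train_test(rows, 0.2, seed=42)
print(len(train), "train +", len(test), "test")
```

Fixing the seed makes a single run repeatable, but the split still changes as soon as the dataset itself changes.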

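The drawback of random selection is that every run of the program produces a different split. To avoid this, give each row a unique identifier, hash the identifier, and put the row in the test set when the last byte of the hash is less than or equal to 51 (about 20% of 256).

The original data has no such identifier, so we call reset_index() to add a row index and use that as the identifier.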

import hashlib
import numpy as np


def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio


def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Add an index column to housing
housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
print(len(train_set), 'train +', len(test_set), "test")

# The id can also be built like this
# housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
# train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

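Using the index as the identifier has a drawback: new data must always be appended at the end of the dataset, and no existing row may ever be deleted. The alternative is to compute the identifier from a combination of other attributes.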
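The sketch below illustrates why an attribute-based identifier is stable. It is a simplified stand-in, not the article's exact code: it hashes the identifier's string form instead of an np.int64, and the names `in_test_set` and `make_id` as well as the coordinates are chosen only for illustration.

```python
import hashlib


def in_test_set(identifier, test_ratio):
    """True when the last byte of the MD5 digest falls in the lowest test_ratio share of 0..255."""
    digest = hashlib.md5(str(identifier).encode("utf-8")).digest()
    return digest[-1] < 256 * test_ratio


def make_id(longitude, latitude):
    # Combine two attributes into one identifier
    return longitude * 1000 + latitude


# Illustrative coordinates
districts = [(-122.23, 37.88), (-122.22, 37.86), (-118.30, 34.05), (-117.20, 32.77)]
before = {d: in_test_set(make_id(*d), 0.2) for d in districts}

# Delete the first row and append a new one
changed = districts[1:] + [(-121.89, 37.29)]
after = {d: in_test_set(make_id(*d), 0.2) for d in changed}

# Rows that exist in both versions keep their assignment
print(all(before[d] == after[d] for d in changed if d in before))
```

Because the assignment depends only on the row's own attributes, deleting or inserting rows elsewhere never moves a row between the training set and the test set.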

sklearn also provides a function for creating a test set:

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

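Random sampling works well when the dataset is large. If it is not large enough, random sampling can introduce significant sampling bias, so for this dataset we use stratified sampling. Suppose an expert tells you that median_income is the most important attribute; then it makes sense to stratify on median_income.

Looking at the median_income histogram, most values fall into a small number of income ranges. With so few strata, a purely random sample can easily misrepresent them.

We add a new attribute that records which stratum each row belongs to. Every category above 5 is merged into category 5.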
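The bucketing rule can be sketched for a single value in plain Python. The function name `income_category` is made up for this example; the divisor 1.5 and the cap at 5 follow the article.

```python
import math


def income_category(median_income):
    """Divide by 1.5, round up, and merge everything above 5 into category 5."""
    category = math.ceil(median_income / 1.5)
    return float(min(category, 5))


for income in (0.5, 3.0, 4.6, 8.3, 15.0):
    print(income, "->", income_category(income))
```

Dividing by 1.5 limits the number of categories, and the cap keeps the sparsely populated high-income tail from forming strata that are too small to sample from.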

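# Random sampling can be biased in some cases, so consider stratified sampling.
# Each stratum needs enough instances, and there should not be too many strata.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Merge every category above 5 into category 5 (assign back rather than use inplace=True on a column)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
print(housing.head(10))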

Next, use sklearn to do stratified sampling based on income_cat.

# Stratified sampling with sklearn's StratifiedShuffleSplit class
from sklearn.model_selection import StratifiedShuffleSplit


split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
    
print(housing["income_cat"].value_counts() / len(housing))

# Drop income_cat once the training and test sets have been created
for s in (strat_train_set, strat_test_set):
    s.drop(["income_cat"], axis=1, inplace=True)
    
print(strat_train_set.head(10))

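Once the split is done, the code above drops the income_cat attribute.

Comparing the income_cat proportions in each test set with those in the full dataset shows that the error of stratified sampling is clearly smaller than that of random sampling.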
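One way such a comparison could be computed is sketched below in plain Python. The stratum sizes are made up (they are not the real income_cat counts), and the helper names `proportions`, `random_split`, and `stratified_split` are hypothetical.

```python
import random
from collections import Counter


def proportions(labels):
    """Share of each stratum among the given labels."""
    counts = Counter(labels)
    return {k: counts[k] / len(labels) for k in sorted(counts)}


def random_split(labels, test_ratio, rng):
    """Purely random test set: shuffle, then take the first test_ratio share."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return shuffled[: int(len(labels) * test_ratio)]


def stratified_split(labels, test_ratio, rng):
    """Stratified test set: sample test_ratio of every stratum separately."""
    test = []
    for stratum in sorted(set(labels)):
        members = [x for x in labels if x == stratum]
        test.extend(rng.sample(members, round(len(members) * test_ratio)))
    return test


# Made-up stratum sizes for categories 1..5
labels = [1] * 50 + [2] * 300 + [3] * 350 + [4] * 200 + [5] * 100
rng = random.Random(42)
overall = proportions(labels)
rand_test = proportions(random_split(labels, 0.2, rng))
strat_test = proportions(stratified_split(labels, 0.2, rng))

print("cat  overall  random  stratified")
for cat in overall:
    print(cat, overall[cat], rand_test.get(cat, 0.0), strat_test[cat])
```

The stratified column reproduces the overall proportions by construction, while the random column drifts by an amount that depends on the draw.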

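Discovering More About the Data

Visualization is the way to find information hidden in the data. Our housing data contains latitude and longitude, so geography is a good place to start.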

housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(20, 12))

Plotted like this, it is hard to see any pattern. Let's try adjusting the transparency of the points.

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1, figsize=(20, 12))

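Now the eye readily picks out the high-density areas. Are they related to price? Going a step further, we let the radius of each point represent the district's population and the color represent the price, then plot again: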

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, 
             s=housing["population"]/100, label="population", 
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True, figsize=(20, 12))
plt.legend()

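The plot shows that price is strongly related to location and population density, and also to ocean_proximity. Intuitively, this suggests that a clustering algorithm could be worth trying.

Combining Attributes

A single attribute may not be very useful on its own, but recombining attributes can yield useful information.

In this example, total_rooms and total_bedrooms mean little by themselves, but combining them with population and households produces new, meaningful attributes.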

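# Some attributes are not useful as they stand; the total number of bedrooms, for example, is not what we care about,
# so we derive new combined attributes from the existing ones
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
# numeric_only=True skips the text column ocean_proximity (requires pandas 1.5 or later)
corr_matrix = housing.corr(numeric_only=True)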
corr_matrix["median_house_value"].sort_values(ascending=False)

bedrooms_per_room is more strongly correlated with median_house_value than either total_rooms or total_bedrooms, which shows that combining attributes paid off.
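For reference, the quantity being compared here is a correlation coefficient. The sketch below computes Pearson's r by hand on toy numbers, on the assumption that the correlation matrix uses Pearson's coefficient; the function name `pearson` and the values are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # y grows with x
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # y shrinks as x grows
print(rising, falling)
```

The coefficient ranges from -1 to 1, and its absolute value measures the strength of the linear relationship, so a clearly negative value also counts as a strong correlation.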

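Working with data is an iterative, step-by-step process.

Data Cleaning

Before cleaning the data, we start again from a fresh copy of the training set and set the labels aside.

# Separate the labels from the features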
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

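Earlier in this article we noted that total_bedrooms has some missing values. There are generally three ways to handle this:

1. Drop the rows that have a missing value
2. Drop the whole attribute
3. Fill in the missing values

The third option is the usual choice: replace each missing value with a new value, for example the mean or the median.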
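The third option can be sketched in plain Python as a two-step "learn, then apply" object. This is a simplified illustration: the class name `MedianImputer` is made up, missing values are represented as None rather than NaN, and the numbers are invented.

```python
import statistics


class MedianImputer:
    """Learn the median of the known values, then use it to fill the gaps."""

    def fit(self, column):
        # Only the values that are present contribute to the median
        self.median_ = statistics.median(v for v in column if v is not None)
        return self

    def transform(self, column):
        return [self.median_ if v is None else v for v in column]


bedrooms = [129.0, None, 190.0, 235.0, None, 280.0]  # made-up values
imputer = MedianImputer().fit(bedrooms)
filled = imputer.transform(bedrooms)
print(imputer.median_, filled)
```

Keeping the learned median on the object means the same value can later be applied to the test set and to new data.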

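sklearn provides SimpleImputer for exactly this purpose:

# ML algorithms cannot work with missing values, so they have to be handled first:
# 1. drop the row  2. drop the whole attribute  3. fill in the missing values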
from sklearn.impute import SimpleImputer


# Use the median as the fill strategy
imputer = SimpleImputer(strategy="median")
# Drop the non-numeric attribute
housing_num = housing.drop("ocean_proximity", axis=1)
# fit() only computes the fill value (here the median) of each attribute
imputer.fit(housing_num)
print(imputer.statistics_)
# transform() fills in the missing values
X = imputer.transform(housing_num)

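The imputer's fit() only computes the median of each attribute and does nothing else. You can think of it as 'training' the imputer; transform() then converts the data.

The computed medians are stored in imputer.statistics_.

Handling Text Attributes

In this example ocean_proximity is a text attribute and has to be converted to numbers. sklearn provides LabelEncoder to convert such text values into integers.

# sklearn also provides a way to convert non-numeric attribute values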
from sklearn.preprocessing import LabelEncoder


encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
print(housing_cat_encoded)
print(encoder.classes_)

'''
[3 3 3 ... 1 1 1]
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']
'''

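The problem with this encoding is that ML algorithms tend to assume that two nearby numbers are more similar than two distant ones, which is not true of these category codes. To solve this, sklearn provides OneHotEncoder, which maps each integer to a vector of 0s and 1s in which only the position of that category is 1 and all others are 0:

# The example above has a serious problem: an ML algorithm will treat 0 and 1 as close,
# although <1H OCEAN (0) and NEAR OCEAN (4) are the more similar categories.
# One-hot encoding solves this by setting only the category's own position to 1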
from sklearn.preprocessing import OneHotEncoder


encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
print(housing_cat_1hot.toarray())

'''
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 ...
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]]
 '''
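What the encoder produces can be reproduced in a few lines of plain Python. This is a sketch of the idea rather than of sklearn's implementation; the function name `one_hot` and the sample values are chosen for illustration.

```python
def one_hot(values):
    """Map each distinct category to its own column; exactly one 1 per row."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"])
print(categories)
for row in encoded:
    print(row)
```

Every pair of distinct categories ends up equally far apart, so the encoding no longer suggests that some categories are closer to each other than others.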

sklearn also provides LabelBinarizer, which combines the two steps above into one:

# The label-encoding and one-hot steps can also be done in one go
from sklearn.preprocessing import LabelBinarizer


encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
print(housing_cat_1hot)

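Custom Transformers

sklearn ships many useful transformers, but we still want to be able to write our own, and ideally they should be as easy to use as the built-in ones. The code below implements a simple data transformation.

Before:

# Some attributes are not useful as they stand; the total number of bedrooms, for example, is not what we care about,
# so we derive new combined attributes from the existing ones
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
# numeric_only=True skips the text column ocean_proximity (requires pandas 1.5 or later)
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)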

Now:

# A custom transformer
from sklearn.base import BaseEstimator, TransformerMixin


# Column positions of the attributes that get combined
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]


attr_adder = CombinedAttributesAdder()
housing_extra_attribs = attr_adder.transform(housing.values)
print(len(housing_extra_attribs[0]))  # three extra values are appended to every row
print(housing_extra_attribs)

'''
[[-121.89 37.29 38.0 ... 4.625368731563422 2.094395280235988
  0.22385204081632654]
 [-121.93 37.05 14.0 ... 6.008849557522124 2.7079646017699117
  0.15905743740795286]
 [-117.2 32.77 31.0 ... 4.225108225108225 2.0259740259740258
  0.24129098360655737]
 ...
 [-116.4 34.09 9.0 ... 6.34640522875817 2.742483660130719
  0.1796086508753862]
 [-118.01 33.82 31.0 ... 5.50561797752809 3.808988764044944
  0.19387755102040816]
 [-122.45 37.77 52.0 ... 4.843505477308295 1.9859154929577465
  0.22035541195476574]]
  '''

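Another advantage of such a transformer is that it can easily be added to a pipeline, as shown later in this article.

Feature Scaling

Feature scaling matters a great deal in machine learning: features on different scales lead to different results, and our data has exactly this problem. There are two common ways to fix it:

- Min-max scaling, also called normalization, squeezes values into the range 0 to 1: subtract the minimum, then divide by the difference between the maximum and the minimum.
- Standardization subtracts the mean and then divides by the standard deviation. Unlike normalization, the result is not bounded to 0 to 1, but it is much less affected by extreme values in the data.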
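Both formulas are short enough to write out by hand. The sketch below uses made-up values that include one outlier; the function names are hypothetical, and dividing by the population standard deviation is meant to match what StandardScaler does.

```python
import statistics


def min_max_scale(values):
    """(x - min) / (max - min): the result always lies in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """(x - mean) / std: the result has mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


values = [1.0, 2.0, 3.0, 4.0, 10.0]  # 10.0 plays the role of an outlier
scaled = min_max_scale(values)
standardized = standardize(values)
print(scaled)
print(standardized)
```

With min-max scaling the single outlier squeezes the other four values into the lower third of the range, which is the sensitivity to extreme values mentioned above.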

sklearn provides StandardScaler for feature scaling; we use the second approach, standardization.

Transformation Pipelines

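All the steps above (data cleaning, attribute combination, feature scaling, and conversion of text attributes) can be chained into a single process with sklearn's Pipeline. FeatureUnion additionally applies several pipelines to the same input and concatenates their outputs.

# Combine the pipelines with FeatureUnion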
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.attribute_names].values
    

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
        
    def fit(self, x, y=None):
        self.encoder.fit(x)
        return self
    
    def transform(self, x, y=None):
        return self.encoder.transform(x)
        

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]


num_pipeline = Pipeline([("selector", DataFrameSelector(num_attribs)), 
                         ("imputer", SimpleImputer(strategy="median")), 
                         ("attribs_adder", CombinedAttributesAdder()), 
                          ("std_scaler", StandardScaler())])

cat_pipeline = Pipeline([("selector", DataFrameSelector(cat_attribs)), 
                        ("label_binarizer", CustomLabelBinarizer())])


full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline), 
                                               ("cat_pipeline", cat_pipeline)])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared[0])

The code above covers the whole process from data cleaning to feature scaling.
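The idea behind a pipeline is simply to call fit and transform on each step in order, feeding each step's output to the next. The toy classes below (`FillMissing`, `MinMax`, `MiniPipeline`) are hypothetical and operate on a single column; they sketch the mechanism and are not sklearn's implementation.

```python
import statistics


class FillMissing:
    """Step 1: replace None with the median learned during fit."""

    def fit(self, column):
        self.median_ = statistics.median(v for v in column if v is not None)
        return self

    def transform(self, column):
        return [self.median_ if v is None else v for v in column]


class MinMax:
    """Step 2: scale into [0, 1] using the min and max learned during fit."""

    def fit(self, column):
        self.lo_, self.hi_ = min(column), max(column)
        return self

    def transform(self, column):
        return [(v - self.lo_) / (self.hi_ - self.lo_) for v in column]


class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, column):
        # Each step learns from the output of the previous step
        for step in self.steps:
            column = step.fit(column).transform(column)
        return column

    def transform(self, column):
        # New data reuses what was learned; nothing is refitted
        for step in self.steps:
            column = step.transform(column)
        return column


pipe = MiniPipeline([FillMissing(), MinMax()])
train_out = pipe.fit_transform([0.0, None, 10.0, 20.0])
new_out = pipe.transform([None, 5.0])
print(train_out, new_out)
```

The split between fit_transform and transform is what later allows the test set to be processed with statistics learned from the training set only.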

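Selecting and Training a Model

With data preparation complete, we should have a clear understanding of the data. The next step is to choose a model to train, which is itself a process of trial and comparison.

Let's first try a linear regression model:

# Try a linear regression model first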
from sklearn.linear_model import LinearRegression


lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# Take a few training rows to try the model on
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print(some_data_prepared)
print("Predictions:\t", lin_reg.predict(some_data_prepared))
print("Labels:\t\t", list(some_labels))

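Writing a model with sklearn is simple. The printout shows that the predictions are still some way off the observed values, so we need an error metric to measure how far.

mean_squared_error computes the mean squared error (MSE) over the m samples, where \hat{y}_i is the prediction and y_i the label:

MSE = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2

RMSE, its square root, is the usual choice for evaluation (it is the most commonly used metric for regression models):

RMSE = \sqrt{MSE}

In code:

# Measure the error with RMSE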
from sklearn.metrics import mean_squared_error


housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse  # an error this large means the features do not carry enough information or the model is not powerful enough

'''
68628.19819848923
'''
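The same metric can be computed by hand, which makes clear what the number means. The function name `rmse` and the values below are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between predictions and labels."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [100.0, 200.0, 300.0]
predictions = [110.0, 190.0, 330.0]
print(rmse(labels, predictions))
```

RMSE is expressed in the same unit as the label, so for the housing data it reads as a typical prediction error in terms of house value.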

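Given the distributions seen earlier in this article, it is no surprise that linear regression has a large error. As a next step, let's try training a decision tree model.

# Train on the data with a decision tree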
from sklearn.tree import DecisionTreeRegressor


tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

tree_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, tree_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

'''
0.0
'''

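An error of 0 means the model has overfit the training data. Overfitting is not a good thing. To get a more honest estimate, we can evaluate the model with cross-validation on the training data. In essence, it splits the data into n parts and produces n error values.

Here we use K-fold cross-validation. Unlike LOOCV (leave-one-out cross-validation), each test fold contains several samples rather than one; how many depends on the choice of K. For example, with K=5, 5-fold cross-validation works as follows:

1. Split the whole dataset into 5 parts.
2. Take each part in turn, without repetition, as the test set, train the model on the other four parts, and compute the model's error MSE_i on that test set.
3. Average the 5 values of MSE_i to get the final MSE.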
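The index bookkeeping behind these steps can be sketched as follows. The generator name `k_fold_indices` is hypothetical, and the sketch splits the rows in their original order without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row lands in a test fold exactly once, so every MSE_i is measured on data the corresponding model never saw during training.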

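# An error of 0 above means the model overfit; sklearn's cross-validation gives a better estimate.
# It splits the training data into a number of folds that validate each other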
from sklearn.model_selection import cross_val_score


scores = cross_val_score(tree_reg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)


def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
    
    
display_scores(tree_rmse_scores)

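The decision tree's error turns out to be high as well. Let's cross-validate the linear regression model too:

# Use cross-validation to check the error of the linear regression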
line_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
line_rmse_scores = np.sqrt(-line_scores)


display_scores(line_rmse_scores)

Finally, we train a random forest:

# Random forest
from sklearn.ensemble import RandomForestRegressor


random_forest = RandomForestRegressor()
random_forest.fit(housing_prepared, housing_labels)

forest_predictions = random_forest.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

'''
22100.915917968654
'''

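The error is clearly much smaller this time, so this is the most promising model so far. Keep in mind that this figure is measured on the training set.

After trying several models, we usually end up with a shortlist and simply pick the best one.

Fine-Tuning the Model

Machine learning algorithms generally have hyperparameters that influence the result, so optimizing a model includes finding the best hyperparameter values.

sklearn's GridSearchCV makes it easy to try combinations of parameters, for example:

# Once we have a shortlist of usable models, the chosen model needs fine-tuning.
# Grid search: sklearn trains on every combination of parameters to find the best one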
from sklearn.model_selection import GridSearchCV


param_grid = [{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
              {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_

'''
{'max_features': 8, 'n_estimators': 30}
'''

The code above tries 3×4 + 2×3 = 18 combinations in total.
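The count can be checked by expanding the grid by hand. The helper `expand_grid` is a hypothetical name; it takes the Cartesian product within each dictionary and concatenates the results across dictionaries.

```python
from itertools import product

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]


def expand_grid(grid):
    """List every parameter combination described by a list of grids."""
    combos = []
    for block in grid:
        names = list(block)
        for values in product(*(block[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos


combinations = expand_grid(param_grid)
print(len(combinations))      # 3*4 + 2*3 combinations
print(len(combinations) * 5)  # model fits needed with cv=5
```

Because every combination is trained once per fold, the cost of a grid search grows quickly as more values are added to each list.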

# Get the best estimator
grid_search.best_estimator_

'''
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False,
           random_state=None, verbose=0, warm_start=False)
'''

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

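This shows the error for each parameter combination at a glance.

Evaluating on the Test Set

Finally, once we have a usable model, we can evaluate it on the test set. First, the test set has to be transformed with the pipeline from above:

# Evaluate the final model on the test data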
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

'''
47732.7520382174
'''