[Study Notes] Hands-On Machine Learning - Chap 2. End-to-End Machine Learning Project

As the title suggests, this chapter takes a high-level view and walks through the typical steps of a machine learning project. Although it is introductory material, it contains plenty of valuable content. Below is my personal summary:

Data preview:

  1. Preview the first 5 rows to get an intuitive feel for the data
  2. Check the total row count, the column types, and the number of non-null rows per column
  3. Check the distribution of the categorical columns
  4. Check each numeric column's mean, standard deviation, min, max, and 25/50/75 percentiles
  5. Check the distribution of each column (ideally with plots)

Creating the test set:

  1. When the dataset is large, a random split is good enough
  2. When the dataset is small, use stratified sampling to make sure the sample has the same strata distribution as the real data, avoiding sampling bias

Data analysis

  • Run a correlation analysis between the attributes and the target column
  • Attribute combinations: after combining attributes, rerun the correlation analysis to see whether the combined attributes correlate more strongly
  • Apply a log transform to long-tailed data

Data cleaning

  • Fill in missing values
  • Handle text attributes
    • label encoding for attributes with an ordinal relationship
    • one-hot encoding for attributes without an ordinal relationship

Feature engineering

  • Min-max scaling (normalization) handles outliers worse than standardization does

Selecting and training a model

  • Remedies for underfitting

    • Choose a more powerful model
    • Add other, better features
    • Loosen the model's constraints, e.g. remove regularization
  • Remedies for overfitting

    • Simplify the model
    • Use regularization
    • Get more training data
  • Evaluate models with cross-validation to check how well they generalize

  • Use Grid Search to find a good combination of hyperparameters

  • After training, run a feature-importance analysis and drop the irrelevant features; you can then add new features and retrain, repeating until the model is satisfactory

Wrapping up before launch

  • What you learned from the experiments
  • What worked and what did not
  • What assumptions the experiments made
  • What the experiments' limitations are

What to watch out for in production

  • Monitor continuously to catch model degradation as the incoming data keeps changing
  • Sample the system's predictions and evaluate them to track model quality
  • Retrain the model regularly, e.g. every 6 months
  • Periodically rerun predictions on the full dataset

The detailed notes follow:

Getting the dataset

import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
        
# Download the data
fetch_housing_data()

Data overview

import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH): 
    csv_path = os.path.join(housing_path, "housing.csv") 
    return pd.read_csv(csv_path)

# Look at the first 5 rows
housing = load_housing_data()
housing.head()

# Check the row count, column types, and non-null counts per column; note that total_bedrooms has missing values
housing.info()
----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
  
# Check the distribution of the categorical column
housing["ocean_proximity"].value_counts()
----
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
    
# Summary statistics of the numeric columns
housing.describe()

#This tells Jupyter to set up Matplotlib so it uses Jupyter’s own backend.
%matplotlib inline 
# Plot a histogram for every numeric attribute
import matplotlib.pyplot as plt 
housing.hist(bins=50, figsize=(20,15))

Creating the test set

The test set must be representative of the real-world data, otherwise you introduce sampling bias. This part is easy to overlook.

# When the dataset is large (relative to the number of features), a purely random split works; otherwise it risks sampling bias
# Stratified sampling: the proportions of each stratum in the sample must match reality; e.g. if the true male:female ratio is 6:4, the sample's ratio should also be 6:4
# To keep the test set representative: if income is an important feature for predicting house prices, the test set must have the same income distribution as the full dataset
# Build an income-category attribute
import numpy as np

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

# Stratified sampling with sklearn
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
housing["income_cat"].value_counts() / len(housing)
----
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
    
# remove the income_cat attribute (set_ avoids shadowing the built-in set)
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)
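
For comparison, the purely random split mentioned earlier is a one-liner in scikit-learn. A minimal sketch (train_test_split is sklearn's standard helper; the fixed random_state only makes the split reproducible):

from sklearn.model_selection import train_test_split

# Randomly hold out 20% of the rows as the test set
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)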

Exploring and visualizing the data

# Work on a copy of the training set for exploration
housing = strat_train_set.copy()
# Scatter plot of the geographic coordinates
housing.plot(kind="scatter", x="longitude", y="latitude")

# Lower the alpha to make high-density areas stand out
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

# Add population (marker size) and house value (color) to the plot
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
         s=housing["population"]/100, label="population",
         c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
     )
plt.legend()

Correlation analysis

Compute the correlation between attributes using the standard correlation coefficient (Pearson's r); the book's figure illustrates what different correlation values look like.

Note: the bottom row of that figure shows nonlinear relationships, which Pearson's r fails to capture (it measures only linear correlation).

corr_matrix = housing.corr()
# Correlation of each attribute with the median house value
corr_matrix["median_house_value"].sort_values(ascending=False)
----
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

# Use pandas' scatter_matrix to inspect pairwise correlations
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
                  "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))


Attribute combinations

  • Long-tailed attributes can be log-transformed (see the sketch below)
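
A minimal sketch of that log transform, assuming we pick the long-tailed population attribute (np.log1p computes log(1 + x), which stays defined at zero):

import numpy as np

# Compress the long right tail; the resulting histogram is far less skewed
housing["log_population"] = np.log1p(housing["population"])
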
# Add combined attributes
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
# Check the correlations again: bedrooms_per_room is more strongly correlated with the house value than total_bedrooms or total_rooms
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
----
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

Preparing the data for machine learning

# Separate the labels; drop() returns a modified copy and leaves strat_train_set untouched
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

Data cleaning

  1. Filling in missing values
# Option 1: drop the rows with missing values
#housing.dropna(subset=["total_bedrooms"])
# Option 2: drop the whole attribute
#housing.drop("total_bedrooms", axis=1)
# Option 3: fill the missing values with some value (0, the median, the mean, ...)
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)

# You can also use sklearn's Imputer (renamed to SimpleImputer in sklearn.impute in modern versions)
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1) # the imputer only works on numeric attributes
imputer.fit(housing_num)

imputer.statistics_
----
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

# Apply the trained imputer to the data
X = imputer.transform(housing_num)
# Put the resulting NumPy array back into a pandas DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns)

Scikit-Learn's design principles

  • Consistency: all objects share a consistent and simple interface
    • Estimators: objects that can learn parameters from a dataset are estimators; e.g. imputer is an estimator, and the learning is done by fit(). Hyperparameters are all the parameters other than the data and the labels
    • Transformers: estimators that can also transform a dataset are transformers; the API is transform(), and fit_transform() is equivalent to fit() followed by transform()
    • Predictors: estimators that can make predictions are predictors, e.g. the LinearRegression model; predictors expose predict(), plus score() to measure the quality of the predictions
  • Inspection: every estimator's hyperparameters are accessible via public attributes, e.g. imputer.strategy, and learned parameters via public attributes with an underscore suffix, e.g. imputer.statistics_
  • Type friendliness: datasets are represented as NumPy arrays or SciPy sparse matrices
  • Composition: it is easy to build Pipeline estimators from existing building blocks
  • Sensible defaults: most parameters have reasonable default values, so it is easy to get a baseline version running quickly
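
A quick sketch that exercises each of these interfaces on toy data (LinearRegression, fit(), predict(), score() and the trailing-underscore attributes are all standard scikit-learn; only the tiny dataset is made up):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])    # toy features
y = np.array([2.0, 4.0, 6.0])          # toy labels

reg = LinearRegression()               # an estimator; hyperparameters are set in the constructor
reg.fit(X, y)                          # learning happens in fit()
print(reg.coef_, reg.intercept_)       # learned parameters end with an underscore
print(reg.predict(np.array([[4.0]])))  # predictors expose predict() ...
print(reg.score(X, y))                 # ... and score() (R^2 for regressors)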

Handling text and categorical attributes

# labelEncoder
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
----
array([0, 0, 4, ..., 1, 0, 3])

The problem with LabelEncoder is that algorithms will treat two nearby numeric codes as more similar than two distant ones, while in most cases the assignment of codes is essentially arbitrary and carries no ordinal meaning. The fix is one-hot encoding:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot # the result is a SciPy sparse matrix, which stores only the nonzero entries to save memory

# Convert to a dense array
housing_cat_1hot.toarray()
----
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

# LabelBinarizer does both steps (text to integer, integer to one-hot) at once and returns a dense array by default;
# pass sparse_output=True to the LabelBinarizer constructor if you want a sparse matrix
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot
----
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])

Custom transformers

  1. Implement the fit(), transform(), and fit_transform() interfaces
  2. Inheriting from TransformerMixin gives you fit_transform() for free
  3. Inheriting from BaseEstimator additionally gives you get_params() and set_params()

For example:

# This example exposes a hyperparameter so you can check whether adding a feature actually helps the model
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
    
    
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

Feature Scaling

  • min-max scaling: values end up ranging from 0 to 1; SK-Learn: MinMaxScaler; its feature_range hyperparameter lets you change the range if you don't want 0-1 for some reason.
  • standardization: much less affected by outliers. For example, given an erroneous value of 100, min-max scaling would crush all the other values from 0-15 down to 0-0.15, whereas standardization would barely change. SK-Learn: StandardScaler.

Scalers should be fit on the training set only; the fitted scaler is then used to transform the training set, the test set, and any new data.
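
A minimal sketch of that rule (StandardScaler is standard sklearn; the toy arrays stand in for the real training and test matrices):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])       # stand-in for the real training features
X_test = np.array([[4.0]])                      # stand-in for the real test features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit (learn mean/std) on the training set only
X_test_scaled = scaler.transform(X_test)        # transform the test set with the training statistics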

Transformation Pipelines

A Pipeline feeds each transformer's output into the next transformer's input. Below is a pipeline for the numeric attributes:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
            # (name, estimator) pairs; the names can be anything
            # all estimators except the last one must be transformers (i.e. implement fit_transform())
            ('imputer', Imputer(strategy="median")),
            ('attribs_adder', CombinedAttributesAdder()),
            ('std_scaler', StandardScaler()), 
        ])
# the pipeline exposes the same methods as its final estimator
housing_num_tr = num_pipeline.fit_transform(housing_num)

FeatureUnion

When you have to handle numeric and text features at the same time, use FeatureUnion: it runs multiple pipelines in parallel and, once they all finish, concatenates their outputs and returns the result.

from sklearn.pipeline import FeatureUnion
from sklearn_features.transformers import DataFrameSelector

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
             ('selector', DataFrameSelector(num_attribs)),
             ('imputer', Imputer(strategy="median")),
             ('attribs_adder', CombinedAttributesAdder()),
             ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
             ('selector', DataFrameSelector(cat_attribs)),
             ('label_binarizer', LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
             ("num_pipeline", num_pipeline),
 # ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared.shape

Selecting and training a model

Training and evaluating on the training set

Start with a linear model. Its RMSE turns out to be large, which means the model underfits: either the features do not provide enough information for good predictions, or the model is not powerful enough.

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# Compare a few predictions with their labels
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:\t", lin_reg.predict(some_data_prepared))
print("Labels:\t\t", list(some_labels))
----
Predictions:	 [206563.06068576 318589.03841011 206073.20582883  71351.11544056
 185692.95569414]
Labels:		 [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

# Compute the RMSE
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
----
69422.88161769879

Remedies for underfitting:

  1. Choose a more powerful model
  2. Add other, better features
  3. Loosen the model's constraints; this model uses no regularization, so this option does not apply here

Next, try a decision tree regressor (DecisionTreeRegressor):

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
----
0.0

A training RMSE of 0 is a clear sign that the model has badly overfit the training set.

Evaluating the model with cross-validation

from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
    
display_scores(rmse_scores)
----
Scores: [74010.28770706 74680.64882796 74773.57241916 71768.12641187
 75927.45258799 74781.87802591 73151.93148335 72730.44601226
 72628.73907481 74100.34761688]
Mean: 73855.343016726
Standard deviation: 1199.2342001940942

# Cross-validate the linear model as well
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
    scoring="neg_mean_squared_error", cv=10) 
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
----
Scores: [67383.78417581 67985.10139708 72048.46844728 74992.50810742
 68535.66280489 71602.89821633 66059.1201932  69302.44278968
 72437.02688935 68368.6996472 ]
Mean: 69871.57126682388
Standard deviation: 2630.4324574585044

Using a random forest

Ensemble Learning: Building a model on top of many other models

from sklearn.ensemble import RandomForestRegressor 
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
    scoring="neg_mean_squared_error", cv=10) 
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
----
Scores: [52716.39575252 51077.36847995 53916.75005202 55501.91073001
 52624.70886263 56367.33336096 52139.5370373  53443.45594517
 55513.29552081 54751.65501867]
Mean: 53805.24107600411
Standard deviation: 1618.473853712107

Remedies for overfitting (see the sketch after this list):

  1. Simplify the model
  2. Use regularization
  3. Get more training data
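
As a minimal sketch of options 1 and 2, a decision tree can be regularized by constraining its shape; max_depth and min_samples_leaf are standard DecisionTreeRegressor hyperparameters, and the values below are arbitrary choices for illustration:

from sklearn.tree import DecisionTreeRegressor

# A constrained tree can no longer memorize every training sample,
# so its training RMSE will not collapse to a meaningless 0
tree_reg = DecisionTreeRegressor(max_depth=8, min_samples_leaf=10, random_state=42)
tree_reg.fit(housing_prepared, housing_labels)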

Fine-tuning the model

Grid Search

Try different hyperparameter values until you find a good combination. With GridSearchCV you only declare which values you want to experiment with, and it uses cross-validation to evaluate every possible combination.

For example:

from sklearn.model_selection import GridSearchCV
# 3*4 + 2*3 = 18 combinations; each model is trained 5 times (cv=5)
param_grid = [
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
# The best combination of hyperparameters
grid_search.best_params_
----
{'max_features': 6, 'n_estimators': 30}

Feature-importance analysis

# Based on the importances, you can drop some useless features
feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"] 
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
----
[(0.4085511709593715, 'median_income'),
 (0.1274639391269915, 'pop_per_hhold'),
 (0.10153999652040019, 'bedrooms_per_room'),
 (0.09974644399457142, 'longitude'),
 (0.09803482684236019, 'latitude'),
 (0.055005428384214745, 'housing_median_age'),
 (0.047782933377019284, 'rooms_per_hhold'),
 (0.0165562182216361, 'population'),
 (0.01549536838937868, 'total_rooms'),
 (0.014996404177845452, 'total_bedrooms'),
 (0.014827270006210978, 'households')]
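
To act on these importances as the summary suggests (drop the irrelevant features, then retrain), here is a minimal sketch in plain NumPy; k and the column-selection approach are illustrative assumptions, not the book's code:

import numpy as np

k = 5  # arbitrary: keep only the five most important features
top_k_idx = np.argsort(feature_importances)[-k:]
housing_prepared_top_k = housing_prepared[:, top_k_idx]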

Evaluating the model on the test set

final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
----
49746.60716972495

Before going to pre-production, present your solution:

  1. What you learned
  2. What worked and what did not
  3. What assumptions you made
  4. What the system's limitations are

What to watch out for in production

  1. Monitor continuously to catch model degradation as the incoming data keeps changing
  2. Sample the system's predictions and evaluate them to track model quality
  3. Retrain the model regularly, e.g. every 6 months
  4. Periodically rerun predictions on the full dataset

These are my study notes for Chapter 2 of the book; you can also download the accompanying Jupyter Notebook and work through it yourself.
