Chapter 1: Hands-On Data Preprocessing for AI with NumPy, Pandas, Matplotlib, and Scikit-Learn

Date: 2019-12-13


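Topics in This Lesson

Independent and dependent variables in data
Three essential Python tools for data preprocessing: NumPy, Pandas, Matplotlib
Machine learning in practice with Scikit-Learn
- Handling missing or incomplete data, with code
- Dummy encoding of categorical data, with code
- Fit and Transform summary
- Splitting data into training and testing sets
- Feature scaling in practice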

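Introduction

Data preprocessing is an important step in machine learning, because good data is the foundation for training an accurate model. Real-world data is imperfect, however, so it has to be preprocessed to turn messy data into something more reasonable before training. This article is a starting point. It walks through the following steps of the machine learning process and aims to give readers a more intuitive feel for machine learning:

Data preprocessing;
The relationship and difference between independent and dependent variables;
Categorical data and dummy encoding in code;
Ways to handle incomplete data;
The importance of feature scaling;
Hands-on use of the Python libraries for machine learning (NumPy, Pandas, Matplotlib, Scikit-Learn).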

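Dependent and Independent Variables

What is an independent variable, and what is a dependent variable? The goal of machine learning is to find the relationship between the dependent variable and the independent variables; once you have it, you can use historical data to predict future behavior. In this example there are four columns: Country, Age, Salary, and Purchased. Country, Age, and Salary are the independent variables, also called features, while Purchased is the dependent variable: the buy or no-buy outcome that is determined from the other three features.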

Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No

data.csv

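Python has many libraries dedicated to data preprocessing. NumPy is one of the most popular and powerful data-processing libraries in Python, with especially thorough support for matrices; Pandas works with data in tabular form through a structure called a DataFrame; Matplotlib is one of the most developer-friendly data visualization tools. The code below calls pandas.read_csv to read the data source: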

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
 
dataset = pd.read_csv("data.csv")
 
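# X: every column except the last (the features Country, Age, Salary)
X = dataset.iloc[:,:-1].values
# y: the column at index 3 (the target, Purchased)
y = dataset.iloc[:,3].values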

[Figure: the contents of data.csv]
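As a minimal sketch of the same feature/target split, the snippet below reads the sample rows from an in-memory string instead of data.csv and counts the missing cells per column. The `io.StringIO` wrapper, the selection by column name, and the names `CSV_TEXT` and `missing_per_column` are illustrative choices, not part of the original code.

```python
import io

import pandas as pd

# The same rows as data.csv, inlined so the sketch runs without the file.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
"""

dataset = pd.read_csv(io.StringIO(CSV_TEXT))

# Independent variables (features): every column except Purchased.
X = dataset.drop(columns="Purchased").values
# Dependent variable (target): the Purchased column.
y = dataset["Purchased"].values

# Count missing cells per column; empty CSV fields are read as NaN.
missing_per_column = dataset.isna().sum()

print(X.shape, y.shape)
print(missing_per_column)
```

Selecting the target by name rather than by position keeps the split correct even if the column order of the file changes.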

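Some of these values are missing and show up as null (NaN). In that case you can use the Imputer class from sklearn.preprocessing (replaced by sklearn.impute.SimpleImputer since scikit-learn 0.20) to fill the missing entries with the column mean.

Handling Missing or Incomplete Data, with Code

Missing values are common in real data, so before training a machine learning model you must preprocess the data to fill the gaps. One way to do this is to compute the mean of the whole column and fill each gap with that mean.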

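import numpy as np
from sklearn.impute import SimpleImputer
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Learn the column means of Age and Salary (columns 1 and 2), then fill the gaps
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])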

[Figure: the data after filling the gaps with the column means]
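To make the mean-fill step concrete, here is a small sketch that works only on the Age and Salary values from data.csv. It assumes scikit-learn 0.20 or later, where `SimpleImputer` is available in `sklearn.impute`; the array name `age_salary` is made up for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from data.csv; np.nan marks the two empty cells.
age_salary = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000],
], dtype=float)

# Column means computed while ignoring the missing cells.
column_means = np.nanmean(age_salary, axis=0)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(age_salary)

print(column_means)          # mean Age and mean Salary of the known values
print(filled[4], filled[6])  # the two rows that had gaps
```

With these rows the known ages average 39 and the known salaries average 63375, so those are the values written into the two empty cells.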

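Computers only work with numbers; they cannot interpret string data directly. After the missing values are filled in, the next step is to convert the string columns to numbers, for example the values of Country (France, Spain, Germany) and Purchased (Yes, No).

Categorical Data and Dummy Encoding, with Code

You can call the fit_transform function of the LabelEncoder class in sklearn.preprocessing to convert string data into integer data: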

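from sklearn.preprocessing import LabelEncoder
# Encode the target: No -> 0, Yes -> 1
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# Encode Country (column 0): France -> 0, Germany -> 1, Spain -> 2
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])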

[Figure: the result after running the code. Country (France, Spain, Germany) becomes [0, 2, 1] and Purchased (Yes, No) becomes [1, 0]]
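The mapping from labels to integers can be inspected directly. The sketch below uses a short, made-up list of country names to show that LabelEncoder assigns codes in sorted label order and that the codes can be mapped back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain", "Germany"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ lists the labels in sorted order; a label's index is its code.
print(list(encoder.classes_))
print(list(codes))
# inverse_transform maps codes back to the original labels.
print(list(encoder.inverse_transform([0, 2, 1])))
```

Because the order is alphabetical, Germany receives code 1 and Spain code 2, regardless of the order in which they appear in the data.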

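The codes [0, 1, 2] for Country imply an ordering that the countries do not actually have. A model may treat a larger code as a larger quantity, which can distort the training result, especially on large datasets. To avoid this, use dummy encoding (one-hot encoding), which represents each category as an array of 0s and 1s; for example, the three countries are encoded as [0,0,1], [0,1,0], and [1,0,0].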

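from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# categorical_features was removed in scikit-learn 0.22; ColumnTransformer selects column 0 (Country) instead
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
# Age and Salary pass through unchanged; cast the mixed-type result to float
X = ct.fit_transform(X).astype(float)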

[Figure: the result after running the code. There are now 5 columns: the first 3 describe Country, the 4th is Age, and the 5th is Salary]
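As a small illustration of what dummy encoding produces, the sketch below one-hot encodes a single made-up Country column on its own. The variable names are hypothetical; the three output columns follow the sorted category order reported by `categories_`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column, shaped (n_samples, 1) as the encoder expects.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one 1, so no country is numerically "larger" than another.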

Training and Testing Datasets

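# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)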

[Figure: the result after running the code. The data is split into testing and training data]
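The effect of `test_size` and `random_state` is easy to check on toy data. The sketch below uses ten made-up samples rather than data.csv, and shows that the split sizes follow `test_size` and that repeating the call with the same `random_state` gives the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each; the target is simply the row index.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same split.
_, _, _, y_test_again = train_test_split(X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))
print(np.array_equal(y_test, y_test_again))
```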

Feature Scaling

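In data.csv the Salary values are numerically far larger than the Age values, which can hurt the accuracy of a machine learning model. Feature scaling brings the columns onto a comparable scale; mathematically, this can be done with Standardization or Normalization.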

Standardization: ( x - mean(x) ) / standard deviation(x)
Normalization: ( x - min(x) ) / ( max(x) - min(x) )
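Both formulas can be worked through by hand in NumPy. The sketch below applies them to the eight known Salary values from data.csv, taking Normalization to mean min-max scaling, (x - min(x)) / (max(x) - min(x)); the variable names are illustrative.

```python
import numpy as np

# The eight known Salary values from data.csv.
salary = np.array([72000, 48000, 54000, 61000, 58000, 52000, 79000, 83000],
                  dtype=float)

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (salary - salary.mean()) / salary.std()

# Normalization (min-max): map the smallest value to 0 and the largest to 1.
normalized = (salary - salary.min()) / (salary.max() - salary.min())

print(standardized.round(3))
print(normalized.round(3))
```

After standardization the values have mean 0 and standard deviation 1; after min-max normalization they all lie between 0 and 1.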

In code, call the fit_transform function of the StandardScaler class in sklearn.preprocessing:

from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
# Fit on the training set only, then apply the same scaling to the test set
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)

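After feature scaling, the gap between the columns is far less extreme.

[Figure: the result after running the code. Scaling greatly reduces the gap between the columns]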
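The difference between `fit_transform` and `transform` is that the first learns the mean and standard deviation from the data it is given, while the second only reuses statistics that were already learned. The sketch below uses a few hand-picked Age and Salary rows (an arbitrary split, chosen only for illustration) to check that the scaled test rows match the Standardization formula applied with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy Age and Salary values, split by hand into training and test rows.
X_train = np.array([[44, 72000], [27, 48000], [30, 54000], [38, 61000]],
                   dtype=float)
X_test = np.array([[35, 58000], [50, 83000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# The same result by hand, using the training mean and standard deviation.
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_test_scaled, manual_test))
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Scaling the test set with the training statistics keeps information about the test rows out of the fitted scaler.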

The complete code for this example follows:

# -*- coding: utf-8 -*-
# Three essential tools for data preprocessing: NumPy, Pandas, Matplotlib

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("data.csv")
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,3].values

# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
 
 
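from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

# categorical_features was removed in scikit-learn 0.22; ColumnTransformer selects column 0 (Country) instead
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X).astype(float)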

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)


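Summary

1. No missing data may remain.

2. Convert all string-typed categorical features to numeric types (Categorical Data).

3. Split the data into a training data set and a testing data set.

4. Apply feature scaling so that very large or very small values do not distort the model's results.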
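The four points above can also be combined into a single preprocessing object. The following is only one possible sketch, assuming scikit-learn 0.20 or later: it uses `Pipeline` and `ColumnTransformer`, which the original code does not use, and unlike the original code it scales only Age and Salary, leaving the one-hot Country columns as 0/1. The names `CSV_TEXT`, `numeric_steps`, and `preprocess` are made up for this illustration.

```python
import io

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# The same rows as data.csv, inlined so the sketch runs without the file.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
"""

dataset = pd.read_csv(io.StringIO(CSV_TEXT))
X = dataset[["Country", "Age", "Salary"]]
y = LabelEncoder().fit_transform(dataset["Purchased"])  # No -> 0, Yes -> 1

# Point 3: split first, so the test rows never influence the fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Points 1 and 4 for the numeric columns: fill gaps with the mean, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Point 2 for Country: one-hot encode; sparse_threshold=0 forces a dense array.
preprocess = ColumnTransformer([
    ("country", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
    ("numeric", numeric_steps, ["Age", "Salary"]),
], sparse_threshold=0)

X_train_ready = preprocess.fit_transform(X_train)  # fit on training rows only
X_test_ready = preprocess.transform(X_test)        # reuse the fitted steps

print(X_train_ready.shape, X_test_ready.shape)
print(np.isnan(X_train_ready).any(), np.isnan(X_test_ready).any())
```

Fitting everything on the training rows and only calling `transform` on the test rows follows the same fit/transform pattern as the StandardScaler step in the complete code.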

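References

The material in this article is drawn from:

[1] DT大數據夢工廠, "Learning AI from the Code of 30 Real Business Cases: 10 Machine Learning Cases, 13 Deep Learning Cases, 7 Reinforcement Learning Cases"

Lesson 2: The three steps of AI data preprocessing, step 1: importing and initially processing data with NumPy, Pandas, and Matplotlib
Lesson 3: The three steps of AI data preprocessing, step 2: using Scikit-Learn to handle missing and categorical data quickly
Lesson 4: The three steps of AI data preprocessing, step 3: using Scikit-Learn to split data into training and testing sets, plus feature scaling in practice

[2] Python Data Science Series: An Introduction to NumPy, Series, and DataFrame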