Data preprocessing is one of the most fundamental, and most tedious, parts of machine learning.
Before we throw ourselves into deriving all the different algorithms, the first thing to get right is data preprocessing.
This step is indispensable in every algorithm implementation and hands-on exercise that follows.
So don't let the tedium put you off; roll up your sleeves and get to work.
Readers who already have a solid foundation can treat this as a refresher and run through it once more.
Without further ado, here is data preprocessing in six steps.
Actually, I feel one step is missing here: observing the data.
[Dataset preview image][1]
This is a set of ten records with nationality, age, income, and whether the person has made a purchase.
There is categorical data, numerical data, and a few missing values.
It looks like a classification problem:
predicting whether a purchase will be made based on nationality, age, and income.
OK, now that we have a rough picture, let the show begin.
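Before jumping into Step 1, it is worth actually looking at the file for a minute. Here is a minimal sketch of how one might do that with pandas, assuming the same Data.csv used below (the exact column names are an assumption based on the output shown later):

import pandas as pd

# Quick look at the raw data before any preprocessing
dataset = pd.read_csv('Data.csv')

print(dataset.head(10))        # the ten rows: country, age, salary, purchased
dataset.info()                 # dtypes reveal which columns are categorical vs. numeric
print(dataset.isnull().sum())  # how many values are missing in each column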
Step 1: Import the libraries
import numpy as np
import pandas as pd
Step 2: Import the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values
print("X")
print(X)
print("Y")
print(Y)
The goal of this step is to split the data into a matrix of independent variables X and a vector of the dependent variable Y.
The output looks like this:
X
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
Y
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
Step 3: Handle the missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
For the detailed usage of the Imputer class, see
http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
In this example we fill the missing values with the column mean (mean imputation).
The result looks like this:
X
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
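One caveat: Imputer comes from older scikit-learn releases and was later removed (scikit-learn 0.22 and up). If the import above fails, a roughly equivalent sketch using the current sklearn.impute.SimpleImputer, with the same mean strategy on the same two columns, would be:

import numpy as np
from sklearn.impute import SimpleImputer

# Mean imputation for the Age and Salary columns (columns 1 and 2)
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])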
Step 4: Encode the categorical data as numbers
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print("X")
print(X)
print("Y")
print(Y)
For the usage of LabelEncoder, see
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
X
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01 5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01 6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01 5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01 8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01 6.70000000e+04]]
Y
[0 1 0 0 1 1 0 1 0 1]
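Similarly, the categorical_features argument of OneHotEncoder was removed in later scikit-learn versions. A sketch of an equivalent encoding with ColumnTransformer: the modern OneHotEncoder can encode the string column directly, so the LabelEncoder pass over X is no longer needed, and the one-hot country columns still come first because the remaining columns are simply passed through.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# One-hot encode column 0 (the country), pass Age and Salary through unchanged
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough")
X = ct.fit_transform(X)

# The target is still mapped to 0/1 with LabelEncoder
Y = LabelEncoder().fit_transform(Y)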
Step 5: Split the dataset into a training set and a test set
# train_test_split lives in sklearn.model_selection (older releases had it in sklearn.cross_validation)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
X_train
[[0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01 6.70000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01 6.10000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01 5.80000000e+04]]
X_test
[[0.0e+00 1.0e+00 0.0e+00 3.0e+01 5.4e+04]
 [0.0e+00 1.0e+00 0.0e+00 5.0e+01 8.3e+04]]
Y_train
[1 1 1 0 1 0 0 1]
Y_test
[0 0]
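With test_size = 0.2 and ten samples, this gives 8 training rows and 2 test rows. For a classification target it can also be worth keeping the class ratio similar in both sets; a hedged variant using the standard stratify parameter of train_test_split (not used in the original run, so the split shown above would change):

# Keep the Yes/No proportions roughly equal in the training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y)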
Step 6: Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Many machine learning algorithms use the Euclidean distance between two data points in their computations.
When features vary widely in magnitude, unit, and range, this causes a problem:
features with large numeric values get far more weight in the distance calculation than features with small values.
Feature scaling fixes this through standardization, also called Z-score normalization (see the short sketch after the usage link below).
Here we use StandardScaler from sklearn.preprocessing.
Usage: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
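For intuition, StandardScaler applies the Z-score transform z = (x - mean) / std column by column, learning the mean and standard deviation from the training set only and reusing them on the test set. A minimal manual sketch with NumPy, assuming X_train and X_test are still the unscaled arrays from Step 5 (illustrative only, not part of the original pipeline):

import numpy as np

# Z-score normalization by hand, matching what StandardScaler does
mean = X_train.mean(axis=0)   # per-column mean learned from the training set
std = X_train.std(axis=0)     # per-column standard deviation from the training set

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # the test set reuses the training statistics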
X_train
[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
X_test
[[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]