Data preprocessing typically includes the following steps (a combined sketch of all seven appears right after this list):
(1) Data standardization
This is the most commonly used preprocessing step: all samples of a feature are transformed to have mean 0 and variance 1.
The transformation toward a standard normal distribution is applied to each feature dimension independently:
x' = (x - μ) / σ
where μ is the feature's mean and σ is its standard deviation.
StandardScaler() from sklearn.preprocessing can be called to standardize the data.
(2) Data normalization (min-max scaling)
All samples of a feature are restricted to a prescribed range (usually [-1, 1] or [0, 1]).
For the [0, 1] case the rescaling is:
x' = (x - min) / (max - min)
MinMaxScaler() from sklearn.preprocessing can be called to restrict data to [0, 1], and MaxAbsScaler() to restrict it to [-1, 1].
(3) Data normalization to unit length
Each sample vector is rescaled so that its norm is 1:
x' = x / ||x||
Normalizer() from sklearn.preprocessing implements this.
(4) Data binarization
Feature values are converted to 0 or 1 according to a threshold.
(5) Missing-value handling
Missing feature values are filled in; common strategies are mean, median, and mode imputation.
(6) Outlier handling
Outlying data points are removed.
(7) Data type conversion
If a feature is not numeric, it must be converted to a numeric type.
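To make the seven steps above concrete, here is a minimal sketch that runs the corresponding sklearn/pandas tools on a tiny toy matrix. The toy data and the 0.5 threshold are made up for illustration; note that SimpleImputer lives in sklearn.impute in sklearn >= 0.20 (older versions used sklearn.preprocessing.Imputer):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, Normalizer, Binarizer
from sklearn.impute import SimpleImputer

X = np.array([[1.0, -2.0],
              [3.0,  0.0],
              [5.0,  2.0]])                       # toy data: 3 samples x 2 features

print(StandardScaler().fit_transform(X))          # (1) per-feature mean 0, variance 1
print(MinMaxScaler().fit_transform(X))            # (2) per-feature range [0, 1]
print(MaxAbsScaler().fit_transform(X))            # (2) per-feature range [-1, 1]
print(Normalizer().fit_transform(X))              # (3) each sample (row) scaled to unit norm
print(Binarizer(threshold=0.5).fit_transform(X))  # (4) 0/1 according to the threshold

# (5) missing values: mean imputation (strategy='median' / 'most_frequent' also work)
X_missing = np.array([[1.0, np.nan],
                      [3.0, 0.0],
                      [np.nan, 2.0]])
print(SimpleImputer(strategy='mean').fit_transform(X_missing))

# (6) outlier handling: e.g. drop rows where any feature's |z-score| >= 3
z = StandardScaler().fit_transform(X)
print(X[(np.abs(z) < 3).all(axis=1)])

# (7) type conversion: map a non-numeric (categorical) feature to numeric codes
colors = pd.Series(['red', 'green', 'red'])
print(colors.astype('category').cat.codes)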
Data-processing toolkits: NumPy, SciPy, and pandas, where SciPy and pandas are further layers built on top of NumPy.
Data-visualization toolkits: Matplotlib and Seaborn, where Seaborn is a further layer built on top of Matplotlib.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
%matplotlib inline
dpath = './data/'
data = pd.read_csv(dpath + "boston_housing.csv")
data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null int64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null int64
PTRATIO    506 non-null int64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(9), int64(5)
memory usage: 55.4 KB
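Since seaborn and matplotlib are already imported above, a quick sanity check at this point could be to plot the distribution of the target column. A minimal sketch (histplot assumes seaborn >= 0.11; older versions used distplot):

sns.histplot(data['MEDV'], kde=True)  # distribution of the target: median house value
plt.xlabel('MEDV')
plt.show()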
Parameters of DataFrame.drop, which is used below to split off the target column:
labels : single label or list-like
axis : int or axis name
level : int or level name, default None. For MultiIndex.
inplace : bool, default False. If True, do operation inplace and return None.
errors : {'ignore', 'raise'}, default 'raise'. If 'ignore', suppress error and only existing labels are dropped.
Returns: dropped : type of caller
y = data['MEDV']               # take the column named 'MEDV' as the target
X = data.drop('MEDV', axis=1)  # drop the 'MEDV' column along axis=1 (the columns)
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null int64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null int64
PTRATIO    506 non-null int64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(8), int64(5)
memory usage: 51.5 KB
Parameters of train_test_split:
*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state : int or RandomState. Pseudo-random number generator state used for random sampling.
stratify : array-like or None (default is None)
X: input features
y: input labels (targets)
random_state: random seed
test_size: fraction of samples used for testing, 0.25 by default
[X_train, y_train] and [X_test, y_test] are the matching pairs produced by the split: the training data with the training labels, and the test data with the test labels.
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer sklearn versions

# Randomly sample 25% of the data as the test set; the rest becomes the training set.
# X: input features; y: targets; random_state=27 is the random seed;
# test_size: fraction of test samples (when train_size is None, the default is 0.25).
# The outputs are the training and test samples as DataFrame/Series data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27, test_size=0.25)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(379, 13)
(379,)
(127, 13)
(127,)
Data standardization:
Parameters of StandardScaler:
with_mean : boolean, True by default. If True, center the data before scaling.
with_std : boolean, True by default. If True, scale the data to unit variance (or equivalently, unit standard deviation).
copy : boolean, optional, default True. If False, try to avoid a copy and do inplace scaling instead.

fit_transform:
X : numpy array of shape [n_samples, n_features]. Training set.
y : numpy array of shape [n_samples]. Target values.
Returns X_new : numpy array of shape [n_samples, n_features_new]. Transformed array.

transform:
X : array-like, shape [n_samples, n_features]. The data used to scale along the features axis.
Returns X_new : numpy array of shape [n_samples, n_features_new]. Transformed array.
# Data standardization
from sklearn.preprocessing import StandardScaler

# Separate scalers for the features and for the target values
ss_X = StandardScaler()
ss_y = StandardScaler()

# Standardize the features and targets of both the training and the test data
X_train = ss_X.fit_transform(X_train)  # compute mean and variance, then transform
X_test = ss_X.transform(X_test)        # transform using the mean and variance computed above
y_train = ss_y.fit_transform(y_train.values.reshape(-1, 1))  # StandardScaler expects 2-D input
y_test = ss_y.transform(y_test.values.reshape(-1, 1))
print(X_train)
[[-0.37683627 -0.50304409  2.48277286 ...  0.86555269 -0.13431739  1.60921499]
 [ 5.13573477 -0.50304409  1.0607873  ...  0.86555269 -2.93693892  3.44576006]
 [-0.37346431  0.01751212 -0.44822848 ... -1.31269744  0.33223834  2.45055308]
 ...
 [-0.39101613 -0.50304409 -1.13119458 ... -0.87704742  0.28632785 -0.36708256]
 [-0.38897021 -0.50304409 -1.2462515  ... -0.44139739  0.38012111  0.19898553]
 [-0.31120842 -0.50304409 -0.40840109 ...  1.30120272  0.37957325 -0.18215757]]
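Because the target was standardized with its own scaler ss_y, a model trained on this data will also predict in the standardized space; ss_y.inverse_transform maps such values back to the original MEDV units before reporting predictions or metrics such as the r2_score imported earlier. A minimal sketch (the regressor `model` below is hypothetical, not fitted by the code above):

# recover the original-scale target values from the standardized ones
y_train_orig = ss_y.inverse_transform(y_train)
print(y_train_orig[:5])

# hypothetical usage with some fitted regressor `model` (not defined above):
# y_pred = ss_y.inverse_transform(model.predict(X_test).reshape(-1, 1))
# print(r2_score(ss_y.inverse_transform(y_test), y_pred))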