In the earlier post on data analysis with sklearn we summarized the methods commonly used for data analysis; this post summarizes data preprocessing.
After obtaining a dataset, we generally go through the following steps: handle missing values, scale the numerical features, encode the categorical features and labels, try out attribute combinations, and finally wrap everything into a preprocessing pipeline.
As before, we use the housing-price dataset as the example and walk through these steps in order.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
housing = pd.read_csv('./datasets/housing/housing.csv')
print(housing.shape)
(20640, 10)
print(housing.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
info() shows that apart from total_bedrooms, which has only 20433 non-null values (207 rows are missing it), and ocean_proximity, which is of type object, every attribute is a complete float64 column. There are three common ways to handle the missing values:
(1) Drop the rows that contain missing values
(2) Drop the attribute (i.e. the column) that contains missing values
(3) Fill the missing values with some value (0, the mean, the median, or the most frequent value)
print(housing[housing.isnull().any(axis=1)][:5])  # print the first 5 rows that contain NaN
     longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
290    -122.16     37.77                47.0       1256.0             NaN
341    -122.17     37.75                38.0        992.0             NaN
538    -122.28     37.78                29.0       5154.0             NaN
563    -122.24     37.75                45.0        891.0             NaN
696    -122.10     37.69                41.0        746.0             NaN

     population  households  median_income  median_house_value ocean_proximity
290       570.0       218.0         4.3750            161900.0        NEAR BAY
341       732.0       259.0         1.6196             85100.0        NEAR BAY
538      3741.0      1273.0         2.5762            173400.0        NEAR BAY
563       384.0       146.0         4.9489            247100.0        NEAR BAY
696       387.0       161.0         3.9063            178400.0        NEAR BAY
# drop the rows whose total_bedrooms is missing
housing1 = housing.dropna(subset=['total_bedrooms'])
print(housing1.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20433 non-null float64
latitude              20433 non-null float64
housing_median_age    20433 non-null float64
total_rooms           20433 non-null float64
total_bedrooms        20433 non-null float64
population            20433 non-null float64
households            20433 non-null float64
median_income         20433 non-null float64
median_house_value    20433 non-null float64
ocean_proximity       20433 non-null object
dtypes: float64(9), object(1)
memory usage: 1.7+ MB
None
# drop the total_bedrooms column
housing2 = housing.drop(['total_bedrooms'], axis=1)
print(housing2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(8), object(1)
memory usage: 1.4+ MB
None
# fill the missing values with the column mean
mean = housing['total_bedrooms'].mean()
print('mean:', mean)
housing3 = housing.fillna({'total_bedrooms': mean})
print(housing3[290:291])
mean: 537.8705525375618
     longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
290    -122.16     37.77                47.0       1256.0      537.870553

     population  households  median_income  median_house_value ocean_proximity
290       570.0       218.0          4.375            161900.0        NEAR BAY
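Option (3) can also be done with sklearn instead of pandas, which is handy when the same statistic has to be reused later on a test set. A minimal sketch, assuming sklearn >= 0.20 (SimpleImputer is the replacement for the older Imputer class that appears further down); housing_num is just an illustrative name for the numerical part of the frame:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')               # learn the per-column mean
housing_num = housing.drop('ocean_proximity', axis=1)  # imputing only makes sense for numerical columns
housing_num_filled = imputer.fit_transform(housing_num)
print(imputer.statistics_)                             # the learned means, reusable on new data via imputer.transform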
When the numerical attributes of a dataset differ greatly in scale, machine learning algorithms tend to perform poorly, although there are a few exceptions. In practice, models fit by gradient descent usually need feature scaling, including linear regression, logistic regression, support vector machines and neural networks. Decision trees do not: taking C4.5 as an example, a tree splits a node according to the information gain ratio of the dataset D with respect to a feature X, and the information gain ratio does not change when the feature is rescaled.
The commonly used methods of feature scaling are min-max scaling (normalization) and z-score standardization; sklearn provides MinMaxScaler and StandardScaler for them.
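A minimal sketch of the two scalers on the numerical attributes, reusing housing3 from above so that total_bedrooms no longer contains NaN; num_attrs is just an illustrative list of the numerical columns shown by info():

from sklearn.preprocessing import MinMaxScaler, StandardScaler

num_attrs = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
             'total_bedrooms', 'population', 'households', 'median_income']

# min-max scaling: (x - min) / (max - min), squeezes every column into [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(housing3[num_attrs])
print(minmax_scaled.min(axis=0), minmax_scaled.max(axis=0))

# standardization: (x - mean) / std, gives each column zero mean and unit variance
std_scaled = StandardScaler().fit_transform(housing3[num_attrs])
print(std_scaled.mean(axis=0).round(3), std_scaled.std(axis=0).round(3))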
In supervised learning, apart from a few models such as decision trees, the predicted values have to be compared with the actual values (that is, the labels) and a loss function is then optimized, so the labels must be converted into numerical form before they can be used in the computation.
The commonly used encodings are ordinal encoding, one-hot encoding and binary encoding.
Ordinal encoding is usually used for categories that have a natural order. Grades, for example, can be split into low, medium and high with the ordering 'high > medium > low'; ordinal encoding assigns each category a numeric ID that preserves this order, e.g. high = 3, medium = 2, low = 1.
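A minimal sketch of ordinal encoding for the grade example, using a plain pandas mapping; the small DataFrame here is made up purely for illustration:

import pandas as pd

grades = pd.DataFrame({'grade': ['low', 'high', 'mid', 'high', 'low']})
# assign each category a numeric ID that preserves the order high > mid > low
grades['grade_encoded'] = grades['grade'].map({'low': 1, 'mid': 2, 'high': 3})
print(grades)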
One-hot encoding is usually used for categorical features without an inherent order. Blood type, for example, has four values (A, B, AB and O); one-hot encoding turns it into a 4-dimensional sparse vector: A is (1,0,0,0), B is (0,1,0,0), AB is (0,0,1,0) and O is (0,0,0,1).
When a categorical feature has many distinct values, keep the following in mind when using one-hot encoding:
(1) Use sparse vectors to save space
Under one-hot encoding only one dimension of the feature vector is 1 and every other position is 0, so the sparsity of the representation can be exploited to save space, and most algorithms today accept sparse input (see the sketch after this list).
(2) Combine one-hot encoding with feature selection to reduce the dimensionality
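A minimal sketch of one-hot encoding the blood-type example with sklearn, assuming a version (>= 0.20) whose OneHotEncoder accepts string categories directly; note that the output is a sparse matrix by default, which is exactly point (1) above:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

blood_types = np.array([['A'], ['B'], ['AB'], ['O'], ['A']])
encoder = OneHotEncoder()                 # sparse output by default
encoded = encoder.fit_transform(blood_types)
print(encoder.categories_)                # the column order of the one-hot vector
print(encoded.toarray())                  # dense 4-dimensional 0/1 vectors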
Binary encoding essentially hashes the category ID into its binary representation, yielding a 0/1 feature vector whose dimensionality is lower than that of one-hot encoding, which saves storage space.
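A minimal hand-rolled sketch of binary encoding for the same four blood types: each category first gets an ordinal ID, and the ID is then written out in binary, so four categories need only 3 binary digits instead of a 4-dimensional one-hot vector (the helper function is made up for illustration):

categories = ['A', 'B', 'AB', 'O']
ids = {cat: i + 1 for i, cat in enumerate(categories)}   # A->1, B->2, AB->3, O->4

def binary_encode(cat, n_bits=3):
    # write the ordinal ID as a fixed-width 0/1 vector
    return [int(b) for b in format(ids[cat], '0{}b'.format(n_bits))]

for cat in categories:
    print(cat, binary_encode(cat))   # A [0,0,1], B [0,1,0], AB [0,1,1], O [1,0,0]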
Once we have analyzed the dataset to some extent, we may notice interesting relationships between attributes, especially relationships with the target attribute. Before feeding the data to a machine learning algorithm, it is worth trying various combinations of attributes.
Taking the housing dataset as an example, the total number of rooms in a district is not very useful if you do not know how many households there are; what you really want is the number of rooms per household. Likewise, the total number of bedrooms on its own is not very meaningful; you probably want to compare it with the total number of rooms, or look at it together with the number of people per household.
corr_martrix = housing.corr()
print(corr_martrix['median_house_value'].sort_values(ascending=False))
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
housing4 = housing.copy()
# combined attributes
housing4['rooms_per_household'] = housing4['total_rooms'] / housing4['households']
housing4['bedrooms_per_room'] = housing4['total_bedrooms'] / housing4['total_rooms']
housing4['population_per_household'] = housing4['population'] / housing4['households']
corr_martrix1 = housing4.corr()   # correlate the augmented frame, not the original housing
print(corr_martrix1['median_house_value'].sort_values(ascending=False))
median_house_value          1.000000
median_income               0.688075
rooms_per_household         0.151948
total_rooms                 0.134153
housing_median_age          0.105623
households                  0.065843
total_bedrooms              0.049686
population_per_household   -0.023737
population                 -0.024650
longitude                  -0.045967
latitude                   -0.144160
bedrooms_per_room          -0.255880
Name: median_house_value, dtype: float64
We can see that bedrooms_per_room is much more strongly correlated with the median house value (a larger correlation in absolute value, although negative) than either the total number of rooms or the total number of bedrooms, so it pays to experiment with different attribute combinations.
from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
# custom transformer that selects the given columns from a DataFrame
# and returns the underlying NumPy array
class DaraFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attr_name):
        self.attr_name = attr_name
    def fit(self, X, Y=None):
        return self
    def transform(self, X, Y=None):
        return X[self.attr_name].values
features_attr = list(housing.columns[:-1])   # the 9 numerical columns
labels_attr = [housing.columns[-1]]          # the categorical column ocean_proximity
feature_pipeline = Pipeline([('selector', DaraFrameSelector(features_attr)),
                             ('imputer', Imputer(strategy='mean')),
                             ('scaler', StandardScaler()),])
label_pipeline = Pipeline([('selector', DaraFrameSelector(labels_attr)),
                           ('encoder', OneHotEncoder()),])
full_pipeline = FeatureUnion(transformer_list=[('feature_pipeline', feature_pipeline),
                                               ('label_pipeline', label_pipeline),])
C:\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:58: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
  warnings.warn(msg, category=DeprecationWarning)
housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)
(20640, 14)
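As the deprecation warning above suggests, on sklearn 0.20 and later the same preprocessing can be written without the custom selector and FeatureUnion by combining SimpleImputer with ColumnTransformer. A rough equivalent sketch, not a drop-in replacement, using the same column split as above:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attrs = list(housing.columns[:-1])   # the 9 numerical columns
cat_attrs = [housing.columns[-1]]        # ocean_proximity

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                         ('scaler', StandardScaler()),])

full_pipeline2 = ColumnTransformer([('num', num_pipeline, num_attrs),
                                    ('cat', OneHotEncoder(), cat_attrs),])

housing_prepared2 = full_pipeline2.fit_transform(housing)
print(housing_prepared2.shape)           # (20640, 14), same as the FeatureUnion version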