These notes deserve careful study. The main material is lectures L1-L3: first a quick review of the earlier content, then a close look at the code in L3 - Preprocessing.
Ref: https://github.com/DBWangGroupUNSW/COMP9318/blob/master/L3%20-%20Preprocessing.ipynb
Scrape the HTML page, then parse it.
import urllib.request
import urllib.error

from bs4 import BeautifulSoup

def get_page(url):
    try:
        web_page = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(web_page, 'html.parser')
        return soup
    except urllib.error.HTTPError:   # urllib2 does not exist in Python 3
        print("HTTPERROR!")
    except urllib.error.URLError:
        print("URLERROR!")

def get_titles(sp):
    papers = sp.find_all('div', {'class': 'data'})
    # enumerate fixes the counter, which was never incremented before
    for i, paper in enumerate(papers, start=1):
        title = paper.find('span', {'class': 'title'})
        print("Paper {}:\t{}".format(i, title.get_text()))

sp = get_page('http://dblp.uni-trier.de/pers/hd/m/Manning:Christopher_D=')
get_titles(sp)
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
df['A'].plot(ax=axes[0, 0])
axes[0, 0].set_title('A')
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
bins = df.shape[0]
df['A'].hist(ax=axes[0, 0], bins=bins)
axes[0, 0].set_title('A')
If the data is skewed, a log transform can help. Estimated Attendance: a few missing values, right-skewed.
>>> df_train.attendance.isnull().sum()
3
>>> x = df_train.attendance
>>> x.plot(kind='hist',
...        title='Histogram of Attendance')

>>> np.log(x).plot(kind='hist',
...                title='Histogram of Log Attendance')
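The effect of the log transform can be checked numerically on toy data (a sketch with synthetic log-normal values standing in for the attendance column, which is not available here): the skewness should drop to near zero after the transform.

```python
import numpy as np
import pandas as pd

# Toy right-skewed data (log-normal), standing in for df_train.attendance
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=8, sigma=1, size=1000))

print(x.skew())          # strongly positive: right-skewed
print(np.log(x).skew())  # near zero: roughly symmetric after the transform
```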
As shown, the distribution is close to normal after the transform.
To remove extreme values, e.g. in the subplot at row 2, column 2:
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
bins = 50
df_f = df['F']
df_f = df_f[df_f < 10]   # keep only the non-extreme values
df_f.hist(ax=axes[1, 1], bins=bins)
axes[1, 1].set_title('F')   # title the subplot actually drawn, not axes[0, 0]
df[pd.isnull(df['Price'])]
index_with_null = df[pd.isnull(df['Price'])].index
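A minimal self-contained sketch of this null-index trick, using a hypothetical toy frame (the real housing data is not shown in these notes):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; row 2 has a missing Price
df = pd.DataFrame({'City': ['A', 'B', 'A', 'B'],
                   'Price': [100.0, 200.0, np.nan, 400.0]})

# Save the index of the rows with a null Price so we can inspect them later
index_with_null = df[pd.isnull(df['Price'])].index
print(list(index_with_null))  # [2]
```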
Count the null values in the application_date column of df_train.
# application_date
appdate_null_ct = df_train.application_date.isnull().sum()
print(appdate_null_ct)  # 3
Ref: Handling Missing Values in Machine Learning: Part 1
# axis=0 drops rows; axis=1 would drop columns
df2 = df.dropna(axis=0)   # drop all rows that have any missing value
dataframe.dropna(inplace=True)   # same, but in place
dataframe.dropna(how='all', inplace=True)   # drop a row only if ALL of its values are missing
dataframe.dropna(thresh=4, inplace=True)   # keep only rows with at least 4 non-NA values
Several less-than-ideal options:
df2 = df.fillna(0)   # the Price of row 3 is set to 0.0
df2 = df.fillna(method='pad', axis=0)   # the Price of row 3 copies that of row 2
# Back-fill or forward-fill propagates the next or previous value respectively
dataframe.fillna(method='bfill', inplace=True)   # back-fill
dataframe.fillna(method='ffill', inplace=True)   # forward-fill
A better option: use the mean of the same group (here, the same City) as the replacement.
df["Price"] = df.groupby("City")["Price"].transform(lambda x: x.fillna(x.mean()))
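A runnable sketch of group-mean imputation on the same hypothetical toy data (selecting the Price column before `transform` keeps the operation to a single column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['A', 'A', 'B', 'B'],
                   'Price': [100.0, np.nan, 300.0, 500.0]})

# Replace each missing Price with the mean Price of the same City
df['Price'] = df.groupby('City')['Price'].transform(lambda x: x.fillna(x.mean()))
print(df['Price'].tolist())  # [100.0, 100.0, 300.0, 500.0]
```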
df.loc[index_with_null]   # .ix is deprecated; the index saved earlier comes in handy here
Cut the numeric data into intervals, then label each bin.
# We could label the bins and add new column df['Bin'] = pd.cut(df['Price'],5,labels=["Very Low","Low","Medium","High","Very High"])
df.head()
Its counterpart, equal-depth partitioning, makes every bin hold the same number of rows, so the intervals on the x-axis have unequal widths.
# Let's check the depth of each bin df['Bin'] = pd.qcut(df['Price'],5,labels=["Very Low","Low","Medium","High","Very High"])
df.groupby('Bin').size()
df['Price-Smoothing-mean'] = df.groupby('Bin')['Price'].transform('mean') df['Price-Smoothing-max'] = df.groupby('Bin')['Price'].transform('max')
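The equal-width vs. equal-depth contrast above can be demonstrated end-to-end on synthetic skewed data (a sketch; the real Price column is not shown in these notes):

```python
import numpy as np
import pandas as pd

# Skewed toy prices: squares of 1..100
prices = pd.Series(np.arange(1, 101) ** 2, dtype=float)

labels = ["Very Low", "Low", "Medium", "High", "Very High"]
equal_width = pd.cut(prices, 5, labels=labels)   # equal interval widths
equal_depth = pd.qcut(prices, 5, labels=labels)  # equal counts per bin

# Uneven counts: small values crowd the lower bins
print(equal_width.value_counts(sort=False).tolist())
# Exactly 20 rows per bin
print(equal_depth.value_counts(sort=False).tolist())  # [20, 20, 20, 20, 20]
```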
Standardize every column, then put the index back in front to form a new table; the original values are hidden, which helps preserve privacy.
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()   # z-score standardization, not min-max scaling
# we need to remove the first (City) column before scaling
x_scaled = scaler.fit_transform(df[df.columns[1:5]])
df_standard = pd.DataFrame(x_scaled)
df_standard.insert(0, 'City', df.City)
df_standard
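A quick self-contained check of what `StandardScaler` does, on a hypothetical numeric frame: every standardized column ends up with mean 0 and unit (population) standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Toy numeric frame standing in for the numeric columns of df
df_num = pd.DataFrame({'Q1': [1.0, 2.0, 3.0, 4.0],
                       'Q2': [10.0, 20.0, 30.0, 40.0]})

scaler = preprocessing.StandardScaler()
x_scaled = scaler.fit_transform(df_num)

print(np.allclose(x_scaled.mean(axis=0), 0))  # True
print(np.allclose(x_scaled.std(axis=0), 1))   # True
```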
A linear relationship is visible.
df.plot.scatter(x='Q1', y='Q3');
from pandas.plotting import scatter_matrix

# set the diagonal panels to be histograms
scatter_matrix(df, alpha=0.9, figsize=(12, 12), diagonal='hist')
Since the selection should be "objective", it is built on scoring criteria, as follows:
3.1 Filter
3.1.1 Variance threshold
3.1.2 Correlation coefficient
3.1.3 Chi-squared test
3.1.4 Mutual information
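The filter criteria above can be sketched with scikit-learn on the iris dataset (an illustrative example; the original notes name the criteria but show no code, and the threshold and k values here are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (VarianceThreshold, SelectKBest,
                                       chi2, mutual_info_classif)

X, y = load_iris(return_X_y=True)

# 3.1.1 Variance threshold: drop near-constant features
vt = VarianceThreshold(threshold=0.5)
X_vt = vt.fit_transform(X)

# 3.1.3 Chi-squared test: keep the k features most dependent on the label
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# 3.1.4 Mutual information: one relevance score per feature
mi = mutual_info_classif(X, y, random_state=0)

print(X.shape, X_vt.shape, X_chi2.shape, mi.shape)
```

The correlation-coefficient criterion (3.1.2) works the same way: score each feature by its correlation with the target and keep the top-scoring ones.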
End.