[Feature] Preprocessing tutorial

Weige's notes, to be studied carefully. The main content is L1-L3: first a quick review of the earlier material, then a focused look at the L3 - Preprocessing code.

Ref: https://github.com/DBWangGroupUNSW/COMP9318/blob/master/L3%20-%20Preprocessing.ipynb

 

 

L0 - python3 and jupyter.ipynb


Scrape the HTML page, then parse it.

import urllib.request
import urllib.error
from bs4 import BeautifulSoup

def get_page(url):
    try:
        web_page = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(web_page, 'html.parser')
        return soup
    except urllib.error.HTTPError:
        print("HTTPERROR!")
    except urllib.error.URLError:
        print("URLERROR!")

def get_titles(sp):
    i = 1
    papers = sp.find_all('div', {'class': 'data'})
    for paper in papers:
        title = paper.find('span', {'class': 'title'})
        print("Paper {}:\t{}".format(i, title.get_text()))
        i += 1

sp = get_page('http://dblp.uni-trier.de/pers/hd/m/Manning:Christopher_D=')
get_titles(sp)

 

2. First Look at the Data

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))

df['A'].plot(ax=axes[0, 0]);
axes[0, 0].set_title('A');

 

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
bins = df.shape[0]
df['A'].hist(ax=axes[0, 0], bins=bins);
axes[0, 0].set_title('A');

 

 

If the data is skewed, it can be log-transformed. Estimated Attendance: few missing values, right-skewed.

>>> df_train.attendance.isnull().sum()
3
>>> x = df_train.attendance
>>> x.plot(kind='hist', title='Histogram of Attendance')
>>> np.log(x).plot(kind='hist', title='Histogram of Log Attendance')

As shown, the distribution is close to normal after the transform.
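The effect of the log transform can be checked numerically as well: for a right-skewed sample, the skewness statistic should drop towards zero after taking logs. A minimal sketch on synthetic data (a log-normal sample stands in for the attendance figures, which are not reproduced here):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data: log-normal values mimic attendance counts
rng = np.random.default_rng(0)
x = pd.Series(np.exp(rng.normal(loc=8, scale=1, size=1000)))

print(x.skew())          # strongly positive: right-skewed
print(np.log(x).skew())  # close to 0: roughly symmetric after the transform
```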

 

3. Cleaning the Data

To remove extreme values, as in the panel at row 2, column 2.

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
bins = 50

df_f = df['F']
df_f = df_f[df_f < 10]        # drop the extreme values
df_f.hist(ax=axes[1, 1], bins=bins);
axes[1, 1].set_title('F');

 

 

 

 

L3 - Preprocessing.ipynb


 

2. Missing Data

Selecting missing data

df[pd.isnull(df['Price'])]
index_with_null = df[pd.isnull(df['Price'])].index

 

Counting missing data

Count the null values in the application_date column of the df_train table.

# application_date
appdate_null_ct = df_train.application_date.isnull().sum()
print(appdate_null_ct)  # 3

 

Dropping missing data

Ref: Handling Missing Values in Machine Learning: Part 1

# axis=0 drops rows that contain missing values (axis=1 would drop columns instead)
df2 = df.dropna(axis=0)

# Will drop all rows that have any missing values.
dataframe.dropna(inplace=True)

# Drop the rows only if all of the values in the row are missing.
dataframe.dropna(how='all',inplace=True)

# Keep only the rows with at least 4 non-na values
dataframe.dropna(thresh=4,inplace=True)
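The four variants above behave quite differently on the same frame; a small illustration (the column names and values are invented for the demo):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, np.nan, np.nan],
    'c': [1.0, 2.0, 3.0, np.nan],
})

print(df.dropna().shape[0])           # 1 -> only row 0 is fully populated
print(df.dropna(how='all').shape[0])  # 3 -> only row 3 is all-NaN
print(df.dropna(thresh=2).shape[0])   # 3 -> rows with at least 2 non-NaN values
print(df.dropna(axis=1).shape[1])     # 0 -> every column has some NaN
```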

 

Filling missing data

Several less-than-ideal options.

df2 = df.fillna(0)                       # price value of row 3 is set to 0.0
df2 = df.fillna(method='pad', axis=0)    # The price of row 3 is the same as that of row 2

# Back-fill or forward-fill to propagate next or previous values respectively
#for back fill
dataframe.fillna(method='bfill',inplace=True)
#for forward-fill
dataframe.fillna(method='ffill',inplace=True)

A better option: use the mean of the value's own group (here, the same City) as the replacement.

df["Price"] = df.groupby("City")["Price"].transform(lambda x: x.fillna(x.mean()))
df.loc[index_with_null]  # the index saved earlier comes in handy here
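The group-wise fill can be verified on a toy frame: each missing Price should become the mean of the non-missing prices in the same City (the cities and prices below are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'City':  ['Sydney', 'Sydney', 'Melbourne', 'Melbourne', 'Sydney'],
    'Price': [100.0, np.nan, 200.0, 300.0, 140.0],
})

# Fill each NaN with the mean Price of its own City
df['Price'] = df.groupby('City')['Price'].transform(lambda x: x.fillna(x.mean()))
print(df['Price'].tolist())  # the NaN becomes (100 + 140) / 2 = 120
```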

 

3. Labeling (Discretization)

Binning

Split the numeric data into intervals, then attach labels.

# We could label the bins and add new column
df['Bin'] = pd.cut(df['Price'],5,labels=["Very Low","Low","Medium","High","Very High"])

df.head()

The counterpart is Equal-depth Partitioning: to make the bin counts equal, the intervals on the x-axis are allowed to be unequal.

# Let's check the depth of each bin
df['Bin'] = pd.qcut(df['Price'],5,labels=["Very Low","Low","Medium","High","Very High"])

df.groupby('Bin').size()
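The difference between `pd.cut` (equal-width) and `pd.qcut` (equal-depth) shows up in the bin sizes: on skewed data, equal-width bins get very uneven counts, while equal-depth bins stay balanced. A sketch on synthetic right-skewed prices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = pd.Series(np.exp(rng.normal(size=1000)))  # right-skewed "prices"

labels = ["Very Low", "Low", "Medium", "High", "Very High"]
eq_width = pd.cut(price, 5, labels=labels)   # equal-width intervals
eq_depth = pd.qcut(price, 5, labels=labels)  # equal-depth (equal-count) bins

print(eq_width.value_counts().max())      # most points pile into one bin
print(eq_depth.value_counts().tolist())   # 200 in every bin
```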

 

Aggregating after labeling

df['Price-Smoothing-mean'] = df.groupby('Bin')['Price'].transform('mean')

df['Price-Smoothing-max']  = df.groupby('Bin')['Price'].transform('max')
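This is smoothing by bin means / bin boundaries: every value in a bin is replaced by the bin's mean (or maximum). A small worked example, assuming a 'Bin' column already exists:

```python
import pandas as pd

df = pd.DataFrame({
    'Bin':   ['Low', 'Low', 'High', 'High'],
    'Price': [10.0, 20.0, 100.0, 200.0],
})

# Replace each value by its bin's mean / max
df['Price-Smoothing-mean'] = df.groupby('Bin')['Price'].transform('mean')
df['Price-Smoothing-max']  = df.groupby('Bin')['Price'].transform('max')

print(df['Price-Smoothing-mean'].tolist())  # [15.0, 15.0, 150.0, 150.0]
print(df['Price-Smoothing-max'].tolist())   # [20.0, 20.0, 200.0, 200.0]
```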

 

4. Hiding Data (Standardization)

Standardize every column, then re-attach the index to form a new table; this hides the raw (private) values.

from sklearn import preprocessing

scaler = preprocessing.StandardScaler()  # z-score standardization, not min-max scaling
x_scaled = scaler.fit_transform(df[df.columns[1:5]])  # skip the first column (City)


df_standard = pd.DataFrame(x_scaled)
df_standard.insert(0, 'City', df.City)
df_standard
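StandardScaler fits z = (x - mean) / std per column, using the population standard deviation (ddof=0). An equivalent pandas one-liner makes the effect easy to check without sklearn (the Q1/Q2 values are invented):

```python
import pandas as pd

df = pd.DataFrame({'Q1': [1.0, 2.0, 3.0, 4.0], 'Q2': [10.0, 20.0, 30.0, 40.0]})

# Same transform as preprocessing.StandardScaler().fit_transform(df)
z = (df - df.mean()) / df.std(ddof=0)

print(z.mean().tolist())                 # each column centered at (numerically) zero
print(z.std(ddof=0).round(12).tolist())  # [1.0, 1.0] -- unit variance
```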

 

 

Feature Selection

1. Subjective Judgment

 

Single plot

A linear relationship can be seen.

df.plot.scatter(x='Q1', y='Q3'); 

 

Scatter matrix

from pandas.plotting import scatter_matrix

scatter_matrix(df, alpha=0.9, figsize=(12, 12), diagonal='hist') # set the diagonal figures to be histograms

 

 

2. Objective Analysis

Since this is "objective", it must rest on quantitative criteria, such as:

[Feature] Feature selection

3.1 Filter

3.1.1 Variance threshold

3.1.2 Correlation coefficient

3.1.3 Chi-squared test

3.1.4 Mutual information
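Of these filters, the variance threshold is the simplest: drop any feature whose variance falls below a cutoff, since a near-constant column carries little information. A numpy sketch of the idea (sklearn provides the same thing as `feature_selection.VarianceThreshold`; the matrix and cutoff here are made up):

```python
import numpy as np

X = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 3.0, 30.0],
    [0.0, 4.0, 40.0],
])

threshold = 0.1
variances = X.var(axis=0)      # per-feature (per-column) variance
keep = variances > threshold   # feature 0 is constant -> dropped
X_selected = X[:, keep]

print(keep.tolist())     # [False, True, True]
print(X_selected.shape)  # (4, 2)
```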

 

 End.
