These notes deserve careful study. The main material is lectures L1-L3: first a quick review of the earlier content, then a close look at the code in L3 - Preprocessing.
Ref: https://github.com/DBWangGroupUNSW/COMP9318/blob/master/L3%20-%20Preprocessing.ipynb
Scrape the HTML page, then parse it.
import urllib.request
import urllib.error

from bs4 import BeautifulSoup

def get_page(url):
    try:
        web_page = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(web_page, 'html.parser')
        return soup
    except urllib.error.HTTPError:   # urllib2 does not exist in Python 3
        print("HTTPERROR!")
    except urllib.error.URLError:
        print("URLERROR!")

def get_titles(sp):
    papers = sp.find_all('div', {'class': 'data'})
    # enumerate fixes the counter, which was never incremented before
    for i, paper in enumerate(papers, start=1):
        title = paper.find('span', {'class': 'title'})
        print("Paper {}:\t{}".format(i, title.get_text()))

sp = get_page('http://dblp.uni-trier.de/pers/hd/m/Manning:Christopher_D=')
get_titles(sp)
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
df['A'].plot(ax=axes[0, 0])
axes[0, 0].set_title('A')
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
bins = df.shape[0]
df['A'].hist(ax=axes[0, 0], bins=bins)
axes[0, 0].set_title('A')
If the data is skewed, a log transform can help. Estimated Attendance: a few missing values, right-skewed.
>>> df_train.attendance.isnull().sum()
3
>>> x = df_train.attendance
>>> x.plot(kind='hist',
...        title='Histogram of Attendance')

>>> np.log(x).plot(kind='hist',
...                title='Histogram of Log Attendance')
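The effect of the log transform can be checked numerically on toy data (a sketch with synthetic log-normal values standing in for the attendance column, which is not available here): the skewness should drop to near zero after the transform.

```python
import numpy as np
import pandas as pd

# Toy right-skewed data (log-normal), standing in for df_train.attendance
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=8, sigma=1, size=1000))

print(x.skew())          # strongly positive: right-skewed
print(np.log(x).skew())  # near zero: roughly symmetric after the transform
```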
As shown, the distribution is close to normal after the transform.
To remove extreme values, e.g. in the subplot at row 2, column 2:
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))
bins = 50
df_f = df['F']
df_f = df_f[df_f < 10]   # keep only the non-extreme values
df_f.hist(ax=axes[1, 1], bins=bins)
axes[1, 1].set_title('F')   # title the subplot actually drawn, not axes[0, 0]
df[pd.isnull(df['Price'])]
index_with_null = df[pd.isnull(df['Price'])].index
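A minimal self-contained sketch of this null-index trick, using a hypothetical toy frame (the real housing data is not shown in these notes):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; row 2 has a missing Price
df = pd.DataFrame({'City': ['A', 'B', 'A', 'B'],
                   'Price': [100.0, 200.0, np.nan, 400.0]})

# Save the index of the rows with a null Price so we can inspect them later
index_with_null = df[pd.isnull(df['Price'])].index
print(list(index_with_null))  # [2]
```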
Count the null values in the application_date column of df_train.
# application_date
appdate_null_ct = df_train.application_date.isnull().sum()
print(appdate_null_ct)  # 3
Ref: Handling Missing Values in Machine Learning: Part 1
# axis=0 drops rows; axis=1 would drop columns
df2 = df.dropna(axis=0)   # drop all rows that have any missing value
dataframe.dropna(inplace=True)   # same, but in place
dataframe.dropna(how='all', inplace=True)   # drop a row only if ALL of its values are missing
dataframe.dropna(thresh=4, inplace=True)   # keep only rows with at least 4 non-NA values
Several less-than-ideal options:
df2 = df.fillna(0)   # the Price of row 3 is set to 0.0
df2 = df.fillna(method='pad', axis=0)   # the Price of row 3 copies that of row 2
# Back-fill or forward-fill propagates the next or previous value respectively
dataframe.fillna(method='bfill', inplace=True)   # back-fill
dataframe.fillna(method='ffill', inplace=True)   # forward-fill
A better option: use the mean of the same group (here, the same City) as the replacement.
df["Price"] = df.groupby("City")["Price"].transform(lambda x: x.fillna(x.mean()))
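A runnable sketch of group-mean imputation on the same hypothetical toy data (selecting the Price column before `transform` keeps the operation to a single column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['A', 'A', 'B', 'B'],
                   'Price': [100.0, np.nan, 300.0, 500.0]})

# Replace each missing Price with the mean Price of the same City
df['Price'] = df.groupby('City')['Price'].transform(lambda x: x.fillna(x.mean()))
print(df['Price'].tolist())  # [100.0, 100.0, 300.0, 500.0]
```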
df.loc[index_with_null]   # .ix is deprecated; the index saved earlier comes in handy here
Cut the numeric data into intervals, then label each bin.
# We could label the bins and add new column df['Bin'] = pd.cut(df['Price'],5,labels=["Very Low","Low","Medium","High","Very High"])
df.head()
Its counterpart, equal-depth partitioning, makes every bin hold the same number of rows, so the intervals on the x-axis have unequal widths.
# Let's check the depth of each bin df['Bin'] = pd.qcut(df['Price'],5,labels=["Very Low","Low","Medium","High","Very High"])
df.groupby('Bin').size()
df['Price-Smoothing-mean'] = df.groupby('Bin')['Price'].transform('mean') df['Price-Smoothing-max'] = df.groupby('Bin')['Price'].transform('max')
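The equal-width vs. equal-depth contrast above can be demonstrated end-to-end on synthetic skewed data (a sketch; the real Price column is not shown in these notes):

```python
import numpy as np
import pandas as pd

# Skewed toy prices: squares of 1..100
prices = pd.Series(np.arange(1, 101) ** 2, dtype=float)

labels = ["Very Low", "Low", "Medium", "High", "Very High"]
equal_width = pd.cut(prices, 5, labels=labels)   # equal interval widths
equal_depth = pd.qcut(prices, 5, labels=labels)  # equal counts per bin

# Uneven counts: small values crowd the lower bins
print(equal_width.value_counts(sort=False).tolist())
# Exactly 20 rows per bin
print(equal_depth.value_counts(sort=False).tolist())  # [20, 20, 20, 20, 20]
```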
Standardize every column, then put the index back in front to form a new table; the original values are hidden, which helps preserve privacy.
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()   # z-score standardization, not min-max scaling
# we need to remove the first (City) column before scaling
x_scaled = scaler.fit_transform(df[df.columns[1:5]])
df_standard = pd.DataFrame(x_scaled)
df_standard.insert(0, 'City', df.City)
df_standard
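A quick self-contained check of what `StandardScaler` does, on a hypothetical numeric frame: every standardized column ends up with mean 0 and unit (population) standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Toy numeric frame standing in for the numeric columns of df
df_num = pd.DataFrame({'Q1': [1.0, 2.0, 3.0, 4.0],
                       'Q2': [10.0, 20.0, 30.0, 40.0]})

scaler = preprocessing.StandardScaler()
x_scaled = scaler.fit_transform(df_num)

print(np.allclose(x_scaled.mean(axis=0), 0))  # True
print(np.allclose(x_scaled.std(axis=0), 1))   # True
```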
A linear relationship is visible.
df.plot.scatter(x='Q1', y='Q3');
from pandas.plotting import scatter_matrix

# set the diagonal panels to be histograms
scatter_matrix(df, alpha=0.9, figsize=(12, 12), diagonal='hist')
Since the selection should be "objective", it is built on scoring criteria, as follows:
3.1 Filter
3.1.1 Variance threshold
3.1.2 Correlation coefficient
3.1.3 Chi-squared test
3.1.4 Mutual information
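The filter criteria above can be sketched with scikit-learn on the iris dataset (an illustrative example; the original notes name the criteria but show no code, and the threshold and k values here are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (VarianceThreshold, SelectKBest,
                                       chi2, mutual_info_classif)

X, y = load_iris(return_X_y=True)

# 3.1.1 Variance threshold: drop near-constant features
vt = VarianceThreshold(threshold=0.5)
X_vt = vt.fit_transform(X)

# 3.1.3 Chi-squared test: keep the k features most dependent on the label
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# 3.1.4 Mutual information: one relevance score per feature
mi = mutual_info_classif(X, y, random_state=0)

print(X.shape, X_vt.shape, X_chi2.shape, mi.shape)
```

The correlation-coefficient criterion (3.1.2) works the same way: score each feature by its correlation with the target and keep the top-scoring ones.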
End.