Reading Notes on "Python Data Analysis and Machine Learning in Practice" (Tang Yudi), Chapter 10: Feature Engineering

Date: 2020-04-14

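Chapter 10: Feature Engineering

Feature engineering is a very important part of machine learning as a whole: how features are extracted from the data has a large impact on the final result. During modeling, people usually think about algorithms and parameters first, but it is the data features that set the upper bound on overall performance, while algorithms and parameters only determine how closely that bound is approached. Feature engineering amounts to finding the most valuable information in the raw data and converting it into a form a computer can read. This chapter uses numerical data and text data to explain how numerical features and text features are extracted.

10.1 Numerical Features

Numerical features are the most common kind in real data. This section introduces several commonly used methods and functions for extracting them. As usual, we start by reading in a dataset and taking a subset of its features for the experiments; the specific meaning of each feature does not matter here, only the feature operations do.

10.1.1 String Encoding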

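import pandas as pd
import numpy as np

# Load the video game sales data and preview a few of its string-valued columns
vg_df = pd.read_csv('datasets/vgsales.csv', encoding="ISO-8859-1")
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

Many of the features in the data produced above are strings. Suppose the Genre column is the final classification label. A computer cannot interpret these strings directly, so they need to be converted into numbers.

# List the distinct values in the Genre column
genres = np.unique(vg_df['Genre'])
genres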

array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

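With freshly loaded data, the most common situation is that many features are not numeric but described by strings. The printed result shows that the Genre column has 12 distinct values, which just need to be converted to numbers. The simplest way is to map each one to an integer:

from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
# Integer code -> original string, in the order LabelEncoder assigned them
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings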

{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}

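The LabelEncoder class in sklearn does this mapping quickly. Codes start from 0 by default, and fit_transform() is the call that actually performs the work, automatically mapping the attribute values. Once the transformation is done, the new result can be added to the original DataFrame for comparison:

vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]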
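To make the mapping concrete, here is a minimal pure-Python sketch of the same idea: collect the distinct values, sort them, and number them from 0. The helper name fit_label_mapping and the sample list are made up for illustration; this is not how LabelEncoder is implemented internally, only a sketch that should give the same codes on simple string data.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer code, in sorted order, starting at 0."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

# Made-up sample, not taken from vgsales.csv
genres = ['Sports', 'Action', 'Puzzle', 'Action', 'Sports']
mapping = fit_label_mapping(genres)
codes = [mapping[g] for g in genres]
print(mapping)
print(codes)
```

Sorting before numbering is what makes the codes reproducible regardless of the order in which the rows arrive.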

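At this point the string values in Genre have been converted to the corresponding numbers. A custom mapping can also be defined.

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()

# Shuffle the rows (frac=1 keeps every row), then list the distinct generations
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)

np.unique(poke_df['Generation'])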

array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)

This dataset also has several attribute values that need mapping. You can write the map yourself; here the codes start from 1:

gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]

For a simple mapping, both doing it by hand and using a ready-made library call are easy. More often, though, one-hot encoding is the better choice for this kind of categorical feature. It takes a little more work, but the result is easier to read:

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Step 1: LabelEncoder
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels

poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary']]

# Step 2: OneHotEncoder
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)

# Combine the encoded features with the original columns
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)
poke_df_ohe = pd.concat([poke_df_sub, gen_features], axis=1)
poke_df_ohe.head()

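The code above imports OneHotEncoder, first maps the data to integer labels, and then one-hot encodes them. As the output shows, one-hot encoding expands all possible values into separate columns and then marks the actual value with 0s and 1s: 0 means the row does not have that column's value, 1 means it does. For example, when Gen_Label=3, the one-hot encoding has Gen 4 set to 1 and every other position set to 0 (the index starts at 0, so Gen_Label=3 corresponds to the fourth position).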
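The Gen_Label=3 example can be reproduced by hand. The sketch below is a plain-Python illustration, not the OneHotEncoder implementation; the function name one_hot is hypothetical, and it assumes the labels are integers in the range 0 to n_classes - 1.

```python
def one_hot(labels, n_classes):
    """Return one row per label with a single 1 at the label's index."""
    rows = []
    for label in labels:
        row = [0] * n_classes
        row[label] = 1
        rows.append(row)
    return rows

classes = ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']
encoded = one_hot([3, 0], n_classes=len(classes))
for row in encoded:
    # Pair each column name with its 0/1 flag
    print(dict(zip(classes, row)))
```

Every row sums to exactly 1, which is the defining property of a one-hot vector.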

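The code above looks a bit cumbersome. Is there a simpler way? Using Pandas directly is more convenient:

# drop_first=True drops the first category's column (here 'Gen 1'), leaving k-1 columns
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True)
pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1).iloc[4:10]

The get_dummies() function does the one-hot encoding. When there are many features, naming them one by one is tedious, so a prefix can be specified to identify them:

gen_onehot_features = pd.get_dummies(poke_df['Generation'], prefix='one-hot')
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[4:10]

Now every one-hot encoded feature carries the "one-hot" prefix. By comparison, get_dummies() is the easier tool: one line of code does the job.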

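10.1.2 Binary and Polynomial Features

Next, open a music dataset:

popsong_df = pd.read_csv('datasets/song_views.csv', encoding='utf-8')
popsong_df.head(10)

The data contains the number of times each user has played each song. Many of the play counts are 0, meaning the user has not played that song yet. In this case a binary feature can be set up to indicate whether the user has listened to the song:

# Copy the counts, then set every count of 1 or more to 1
watched = np.array(popsong_df['listen_count'])
watched[watched >= 1] = 1
popsong_df['watched'] = watched
popsong_df.head(10)

The new watched feature indicates whether the song has been played. The same binary feature can also be produced with Binarizer from sklearn:

from sklearn.preprocessing import Binarizer

# Values above the threshold become 1, the rest become 0
bn = Binarizer(threshold=0.9)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(10)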

There are many more ways to transform features, and features can also be combined in various ways. Next come polynomial features. For example, given two features a and b, the degree-2 polynomial features are (1, a, b, a², ab, b²). The transformation is done with sklearn below:

poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()

from sklearn.preprocessing import PolynomialFeatures

# include_bias=False leaves out the constant column of 1s
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
res[:5]

   Attack  Defense
0      49       49
1      62       63
2      82       83
3     100      123
4      52       43

array([[   49.,    49.,  2401.,  2401.,  2401.],
       [   62.,    63.,  3844.,  3906.,  3969.],
       [   82.,    83.,  6724.,  6806.,  6889.],
       [  100.,   123., 10000., 12300., 15129.],
       [   52.,    43.,  2704.,  2236.,  1849.]])

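PolynomialFeatures() takes the following three parameters.

degree: the degree of the polynomial. The larger the value, the more features are produced.
interaction_only: defaults to False. If set to True, terms that combine a feature with itself are left out, for example a² and b² in the degree-2 terms above.
include_bias: defaults to True. If True, an extra column of 1s is added.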
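The effect of the three parameters can be seen in a small hand-written expansion. This is a sketch under the assumption that terms are ordered by increasing degree, as in the (1, a, b, a², ab, b²) example above; the function poly_features is hypothetical and is not the sklearn implementation.

```python
from itertools import combinations, combinations_with_replacement
from math import prod

def poly_features(row, degree=2, interaction_only=False, include_bias=True):
    """Expand one sample into polynomial terms, ordered by increasing degree."""
    # Without replacement, a feature is never multiplied by itself
    pick = combinations if interaction_only else combinations_with_replacement
    terms = [1] if include_bias else []
    for d in range(1, degree + 1):
        for idx in pick(range(len(row)), d):
            terms.append(prod(row[i] for i in idx))
    return terms

sample = [62, 63]  # Attack and Defense of row 1 in the table above
print(poly_features(sample, include_bias=False))
print(poly_features(sample, interaction_only=True))
```

The first call reproduces the second row of the array printed earlier; the second call keeps the constant 1 and the cross term but drops the squares.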

To make the result clearer, column names can be added:

intr_features = pd.DataFrame(res, columns=['Attack', 'Defense', 'Attack^2', 'Attack x Defense', 'Defense^2'])
intr_features.head(5)

10.1.3 Discretizing Continuous Values

Discretizing continuous values is very practical, and continuous features often need this treatment. How well it works still has to be checked on a test set, but in the early stage of feature engineering, the more workable options there are, the better.

fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()

The code above reads a dataset that includes age information. Next, the age feature is discretized, that is, divided into intervals. Before doing so, it helps to look at its distribution:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import scipy.stats as spstats

# Jupyter magic: render plots inline in the notebook
%matplotlib inline
mpl.style.reload_library()
mpl.style.use('classic')
mpl.rcParams['figure.facecolor'] = (1, 1, 1, 0)
mpl.rcParams['figure.figsize'] = [6.0, 4.0]
mpl.rcParams['figure.dpi'] = 100

# Histogram of respondent ages
fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

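The output above shows that age values range roughly from 10 to 90. Discretization means mapping the data in an interval to a single group; for example, ages could be grouped into children, young adults, middle-aged, and elderly. For simplicity, equal-width intervals are used here:

# Integer part of age / 10, so ages 20 to 29 all fall into bin 2
fcc_survey_df['Age_bin_round'] = np.array(np.floor(np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]

In the code above, np.floor rounds down; for example, 3.3 becomes 3. This completes the discretization: every value is assigned to its interval.

Binning can also be based on quantiles. Let's try a different feature and first look at income:

#fcc_survey_df[['ID.x', 'Age', 'Income']].iloc[4:9]
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

Quantiles split the data by proportion, and you can choose whatever proportions suit the problem:

quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles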

0.00      6000.0
0.25     20000.0
0.50     37000.0
0.75     60000.0
1.00    200000.0
Name: Income, dtype: float64

fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')

# Draw a red vertical line at each quantile value
for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')
ax.legend([qvl], ['Quantiles'], fontsize=10)

ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

The quantile function returns the cut points for the chosen proportions, which are then applied to the data to discretize it:

quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
# Bin by quantile: once showing each bin's value range, once with readable labels
fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
                                                 q=quantile_list)
fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'],
                                                 q=quantile_list, labels=quantile_labels)
fcc_survey_df[['ID.x', 'Age', 'Income',
               'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]

All the data has now been binned. How to choose the proportions for real data depends on the specific problem. There is no fixed rule, and judging by the actual business context is the most sound approach.
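As a rough illustration of what quantile binning does, the sketch below computes the interior cut points with the standard library and assigns each value to the bin whose upper edge it does not exceed. The income list is made up, the helper quantile_bins is hypothetical, and the result is only expected to agree with pd.qcut on simple data without ties at the cut points.

```python
from bisect import bisect_left
from statistics import quantiles

def quantile_bins(values, labels):
    """Assign each value to a quantile bin; each bin includes its upper edge."""
    # Interior cut points only, e.g. the 25th/50th/75th percentiles for 4 labels
    cuts = quantiles(values, n=len(labels), method='inclusive')
    return [labels[bisect_left(cuts, v)] for v in values]

# Made-up incomes, not taken from the survey data
incomes = [6000, 20000, 37000, 60000, 200000, 15000, 45000, 80000]
labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
binned = quantile_bins(incomes, labels)
for income, label in zip(incomes, binned):
    print(income, label)
```

With eight values and four bins, each bin ends up with two values, which is the point of quantile binning: equal counts per bin rather than equal widths.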

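10.1.4 Log and Time Transformations

A feature column can follow all kinds of distributions, but many machine learning algorithms prefer the values being predicted to follow a Gaussian distribution. This calls for a further transformation, and the most direct one is the log transform:

# Add 1 before taking the log so that an income of 0 stays defined
fcc_survey_df['Income_log'] = np.log((1+ fcc_survey_df['Income']))
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]

income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)

# Histogram of log income with the mean marked in red
fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3')
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log Transform', fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'$\mu$='+str(income_log_mean), fontsize=10)

After the log transform the feature distribution is closer to Gaussian. It is still not perfect, but it is an improvement. Interested readers can also look into the Box-Cox transform, which has the same goal and differs only slightly in its formula.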
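For reference, the Box-Cox transform of a positive value x with parameter λ is (x^λ - 1) / λ when λ is not 0, and ln(x) when λ is 0, so the log transform is its limiting case. The sketch below is a minimal hand-written version for a single value; the function name box_cox and the sample income are illustrative, and in practice λ is usually estimated from the data rather than chosen by hand.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

income = 37000.0  # illustrative value
for lam in (1, 0.5, 0.001, 0):
    # As lam approaches 0 the result approaches log(income)
    print(lam, round(box_cox(income, lam), 4))
```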

Time-related data also yields many features, such as year, month, day, and hour. Even the early, middle, or late part of the month, working hours, and off-work hours can serve as input features for an algorithm.

import datetime
import numpy as np
import pandas as pd
from dateutil.parser import parse
import pytz

time_stamps = ['2015-03-08 10:30:00.360000+00:00', '2017-07-13 15:45:05.755000-07:00',
               '2012-01-20 22:30:00.254000+05:30', '2016-12-25 00:30:00.000000+10:00']
df = pd.DataFrame(time_stamps, columns=['Time'])
df

                               Time
0  2015-03-08 10:30:00.360000+00:00
1  2017-07-13 15:45:05.755000-07:00
2  2012-01-20 22:30:00.254000+05:30
3  2016-12-25 00:30:00.000000+10:00

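Next, various fine-grained time features are extracted. If the data is in a standard format, the attributes can be accessed directly, which is more convenient:

# Parse each string into a timezone-aware pandas Timestamp
ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
df['TS_obj'] = ts_objs
ts_objs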

array([Timestamp('2015-03-08 10:30:00.360000+0000', tz='UTC'),
       Timestamp('2017-07-13 15:45:05.755000-0700', tz='pytz.FixedOffset(-420)'),
       Timestamp('2012-01-20 22:30:00.254000+0530', tz='pytz.FixedOffset(330)'),
       Timestamp('2016-12-25 00:30:00+1000', tz='pytz.FixedOffset(600)')],
      dtype=object)

df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
# df['DayName'] = df['TS_obj'].apply(lambda d: d.weekday_name)
# AttributeError: 'Timestamp' object has no attribute 'weekday_name'
# (newer pandas versions provide d.day_name() instead)
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)

# df[['Time', 'Year', 'Month', 'Day', 'Quarter',
#     'DayOfWeek', 'DayName', 'DayOfYear', 'WeekOfYear']]
df[['Time', 'Year', 'Month', 'Day', 'Quarter',
    'DayOfWeek', 'DayOfYear', 'WeekOfYear']]

# The Hour column used below is not created in the original notes; add it here
df['Hour'] = df['TS_obj'].apply(lambda d: d.hour)
# Each bin is right-inclusive: (-1, 5], (5, 11], (11, 16], (16, 21], (21, 23]
hour_bins = [-1, 5, 11, 16, 21, 23]
bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']
df['TimeOfDayBin'] = pd.cut(df['Hour'],
                            bins=hour_bins, labels=bin_names)
df[['Time', 'Hour', 'TimeOfDayBin']]

                               Time  Hour TimeOfDayBin
0  2015-03-08 10:30:00.360000+00:00    10      Morning
1  2017-07-13 15:45:05.755000-07:00    15    Afternoon
2  2012-01-20 22:30:00.254000+05:30    22        Night
3  2016-12-25 00:30:00.000000+10:00     0   Late Night

Once the raw timestamp is settled, it splits into a surprising number of smaller features. With concrete time data in hand, related information can also be merged in, such as weather: weather-station data is easy to obtain, which gives the corresponding temperature, rainfall, and other indicators.
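The same kind of extraction can be sketched with the standard library alone, which makes the binning rule explicit. The function time_features and the constant names are hypothetical; the bin edges and names are the ones used above, and the sketch assumes timestamps in the ISO-style format shown in the Time column.

```python
from bisect import bisect_left
from datetime import datetime

UPPER_EDGES = [5, 11, 16, 21, 23]  # inclusive upper hour of each bin
BIN_NAMES = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']

def time_features(stamp):
    """Extract a few calendar features from an ISO-style timestamp string."""
    ts = datetime.fromisoformat(stamp)
    return {
        'Year': ts.year,
        'Month': ts.month,
        'DayOfWeek': ts.weekday(),            # Monday = 0
        'Quarter': (ts.month - 1) // 3 + 1,
        'Hour': ts.hour,                      # local hour as written in the string
        'TimeOfDayBin': BIN_NAMES[bisect_left(UPPER_EDGES, ts.hour)],
    }

print(time_features('2015-03-08 10:30:00.360000+00:00'))
print(time_features('2012-01-20 22:30:00.254000+05:30'))
```

Note that the hour is taken as written in each string, without converting to a common time zone, which matches the Hour values in the table above.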

10.2 Text Features

Text features appear in data all the time; a sentence or an article is a text feature. The problem is the same as before: the computer still cannot interpret them, so they first have to be converted into numbers, that is, vectors. Text feature extraction is only introduced briefly here; the news classification task in the next chapter explains it in more detail.

10.2.1 Bag-of-Words Model

First, build a dataset. For simplicity it is in English. Chinese data would need word segmentation first, whereas English text is already split into words:

import pandas as pd
import numpy as np
import re
import nltk  # pip install nltk
# jieba (for Chinese word segmentation)

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus,
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

                                            Document Category
0                     The sky is blue and beautiful.  weather
1                  Love this blue and beautiful sky!  weather
2       The quick brown fox jumps over the lazy dog.  animals
3   The brown fox is quick and the blue dog is lazy!  animals
4  The sky is very blue and the sky is very beaut...  weather
5        The dog is lazy but the brown fox is quick!  animals

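Natural language processing has a very practical toolkit called NLTK, which must be installed before use. Once installed, however, it is more or less an empty shell with no actual functionality, and the components you need have to be installed selectively (see Figure 10-1).

Figure 10-1 The NLTK toolkit

Running nltk.download() opens an installer window where you select the components to install. In addition, NLTK provides many datasets to practice with, so it is quite powerful.

For NLTK installation, see: Reading Notes on "Data Analysis in Practice" (Tomasz Drabas), Chapter 9: Natural Language Processing with NLTK (text analysis, part-of-speech tagging, topic extraction, text classification)

nltk.download()
# nltk.download('wordnet')
# Then move the files from the default path C:\Users\tony zhang\AppData\Roaming\nltk_data\ to D:\download\nltk_data\

from nltk import data
data.path.append(r'D:\download\nltk_data')  # replace with the path where you downloaded the data files

For text data, the first step is always preprocessing. The usual routine is to remove special characters as well as stop words that carry little value.

Stop words are words that have little effect on the final result, for example "we", "today", and "but".

import re
import nltk
from nltk import data
data.path.append(r'D:\download\nltk_data')  # replace with the path where you downloaded the data files
# Load the tokenizer and the stop word list
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # Remove special characters (flags must be passed by keyword;
    # the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    # Convert to lowercase
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize
    tokens = wpt.tokenize(doc)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Join the tokens back into a document
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
norm_corpus
# e.g. 'The sky is blue and beautiful.' becomes 'sky blue beautiful'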

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
       'sky blue sky beautiful today', 'dog lazy brown fox quick'],
      dtype='<U30')

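Words such as "the" and "this", which contribute nothing to the topic of a sentence, have all been removed. The next step is feature extraction, which turns every sentence into a numerical vector.

from sklearn.feature_extraction.text import CountVectorizer
print(norm_corpus)
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
# On scikit-learn 1.0 and later, use cv.get_feature_names_out();
# get_feature_names() was removed in 1.2
print(cv.get_feature_names())
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix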

['sky blue beautiful' 'love blue beautiful sky'
 'quick brown fox jumps lazy dog' 'brown fox quick blue dog lazy'
 'sky blue sky beautiful today' 'dog lazy brown fox quick']
['beautiful', 'blue', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'love', 'quick', 'sky', 'today']

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
       [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]], dtype=int64)

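vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix, columns=vocab)

The dimension of each vector equals the number of distinct words in the corpus, and the vector is built from how many times each word occurs and where it sits in the vocabulary. The code above only considers single words, but combinations of adjacent words can also be taken into account. The principle is the same. Next, word combinations are considered, and the result makes the idea clearer:

bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix, columns=vocab)

The code above sets the ngram_range parameter, which amounts to taking a word's context into account. Here only two-word combinations are considered. You can also set ngram_range to (1,2), which includes both single words and two-word combinations.

The bag-of-words model is very simple in both principle and practice, but the vectors it produces have no soul. A sentence or an article has a word order, yet the bag-of-words model only considers term frequency, and the importance of each word depends entirely on how many times it occurs. In addition, the document vectors usually form a very large sparse matrix, which is not good for computation.

The bag-of-words model seems to have many problems, but it also has an advantage: it is simple and convenient. In a real modeling task it is not known in advance which feature extraction method works better, so the different methods all need to be tried.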
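To show what the count vectors contain, here is a minimal sketch that builds them by hand with collections.Counter, for single words and for two-word combinations. The helper names ngrams and count_matrix are made up, and the sketch assumes the documents are already normalized and split on spaces; it is not the CountVectorizer implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_matrix(docs, n=1):
    """Build a sorted vocabulary and one count vector per document."""
    counts = [Counter(ngrams(doc.split(), n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for terms the document does not contain
    return vocab, [[c[term] for term in vocab] for c in counts]

docs = ['sky blue beautiful', 'love blue beautiful sky',
        'sky blue sky beautiful today']
vocab, matrix = count_matrix(docs)
print(vocab)
print(matrix)
bigram_vocab, bigram_matrix = count_matrix(docs, n=2)
print(bigram_vocab)
```

In the third document "sky" occurs twice, so its count is 2, while word order only shows up once bigrams such as "blue sky" are counted.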

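10.2.2 Common Methods for Building Text Features

There are many more ways to extract text features. Some commonly used ones are introduced below. In real tasks you can follow the standard routines and also mix in some unconventional tricks.

(1) TF-IDF features. The bag-of-words model only considers term frequency and not the meaning of the word itself, whereas TF-IDF takes the importance of each word into account. TF-IDF keyword extraction is explained in detail later; first look at the result it produces:

from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

# On scikit-learn 1.0 and later, use tv.get_feature_names_out() instead
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

The output above shows that every word gets a decimal value, and the values differ in size, indicating how important the word is in that document. The news classification task in the next chapter discusses this in more detail.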
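A hand-written version makes the weighting visible. The sketch below assumes a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is intended to mirror the default TfidfVectorizer settings; the function name tfidf and the four-document corpus are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm for w in row])
    return vocab, rows

docs = ['sky blue beautiful', 'love blue beautiful sky',
        'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy']
vocab, rows = tfidf(docs)
print(dict(zip(vocab, (round(w, 2) for w in rows[0]))))
```

In the first document, "blue" gets a lower weight than "sky" or "beautiful" because it appears in more documents and is therefore less distinctive.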

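(2) Similarity features. Once the features are fixed and everything has been converted to numerical data, the similarity between samples can be computed. There are many ways to do this; cosine similarity is used as the example here. sklearn already implements it, so the TF-IDF features from the previous example can be passed in directly:

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise similarity between all documents
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df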
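Cosine similarity itself is just the dot product of two vectors divided by the product of their lengths. A minimal sketch on small made-up count vectors (the function name cosine and the vectors are illustrative, and the vectors are assumed to be non-zero):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Count vectors over the vocabulary ['beautiful', 'blue', 'dog', 'sky']
doc_a = [1, 1, 0, 1]   # 'sky blue beautiful'
doc_b = [1, 1, 0, 2]   # 'sky blue sky beautiful'
doc_c = [0, 0, 1, 0]   # 'dog'
print(round(cosine(doc_a, doc_b), 3))
print(cosine(doc_a, doc_c))
```

Documents that share most of their words score close to 1, and documents with no words in common score 0.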

(3) Cluster features. Clustering splits the data into groups and assigns each group a label. The data first has to be converted into numerical features, and then the clustering is computed; the result can also be used as a discrete feature (clustering algorithms are covered in Chapter 16).

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)

(4) Topic models. A topic model is an unsupervised method. Its input is the preprocessed corpus, and it yields the topics together with the weight of each word within them:

from sklearn.decomposition import LatentDirichletAllocation

# help(LatentDirichletAllocation)
# Older scikit-learn versions named this parameter n_topics:
# lda = LatentDirichletAllocation(n_topics=2, max_iter=100, random_state=42)
#  n_components : int, optional (default=10)
#  |      Number of topics.

lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features

# For each topic, print the words whose weight exceeds 0.6, heaviest first
tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]
    print(topic)
    print()

     T1              T2
0     0.190548     0.809452
1     0.176804     0.823196
2     0.846184     0.153816
3     0.814863     0.185137
4     0.180516     0.819484
5     0.839172     0.160828

[('brown', 1.7273638692668465), ('dog', 1.7273638692668465), ('fox', 1.7273638692668465), ('lazy', 1.7273638692668465), ('quick', 1.7273638692668465), 
('jumps', 1.0328325272484777), ('blue', 0.7731573162915626)]

[('sky', 2.264386643135622), ('beautiful', 1.9068269319456903), ('blue', 1.7996282104933266), ('love', 1.148127242397004), 
('today', 1.0068251160429935)]

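The code above sets n_components=2 (n_topics in older versions), which asks for two topics. The final result is the weight of the keywords in each topic. This looks like a decent outcome: even an unsupervised method yields this many key indicators. In the author's view, however, the LDA topic model is not very practical and its results are usually mediocre, so using it for feature processing or modeling is not recommended; being familiar with it is enough.

(5) Word vector models. The feature extraction methods introduced so far are fairly easy to understand. Now consider word vector models, commonly known as word2vec, whose basic principle is based on neural networks. In plain terms, each word is first initialized, for example as a random vector of length 10. The model then makes predictions from each word and its context: for example, the input is the vector for "go home" and the output is "have a meal". All the inputs and output labels are contexts taken from the corpus, so labels do not need to be specified separately. At this stage the optimization algorithm, for example gradient descent, not only selects suitable weight parameters; the input vectors change as well. That is, the vector for "go home" starts out random and keeps changing in every iteration until a suitable result is reached.

Word vector models are currently the most commonly used method in natural language processing, and they give every word an actual meaning in space. To recap, the vectors produced by the feature extraction methods described earlier have no real meaning; they are just numbers. In a word vector model, every word has a meaning in space: for example, "like" and "love" are close together because they express similar meanings, while both are far from "phone" because they have little to do with it. After neural networks have been covered, the movie review classification task in Chapter 20 shows a practical application. To use it, the vector for every word in the text must be built first. The most commonly used toolkit is Gensim, which also includes corpora:

import nltk
from gensim.models import word2vec
from nltk import data
data.path.append(r'D:\download\nltk_data')  # replace with the path where you downloaded the data files
wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]

# Parameters to set
feature_size = 10    # word vector dimension
window_context = 10  # sliding window size
min_word_count = 1   # minimum word frequency

# On gensim 4.0 and later, the size argument is named vector_size
w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size,
                          window=window_context, min_count=min_word_count)

w2v_model.wv['sky']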

array([-0.02571594, -0.02806569, -0.01904523, -0.03620922,  0.01884929,
       -0.04410132,  0.02005241, -0.00504071,  0.01696092,  0.01301065],
      dtype=float32)

def average_word_vectors(words, model, vocabulary, num_features):
    # Average the vectors of the words that are in the vocabulary
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.

    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            # Look vectors up through model.wv; indexing the model directly is deprecated
            feature_vector = np.add(feature_vector, model.wv[word])

    if nwords:
        feature_vector = np.divide(feature_vector, nwords)

    return feature_vector


def averaged_word_vectorizer(corpus, model, num_features):
    # On gensim 4.0 and later, index2word is named index_to_key
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
                                             num_features=feature_size)
pd.DataFrame(w2v_feature_array)  # lstm

The output turns the words of the input corpus into vectors, here averaged into one vector per document. Word vectors are used very widely, and at present they are usually combined with neural networks (later case studies show how powerful this is).

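10.3 Papers and Benchmarks

In data mining tasks, feature engineering is especially important. The fields of a dataset can contain all kinds of information, so how do you extract the most valuable features? The first thing that comes to mind is probably experience: recalling how other data was handled before, or falling back on general-purpose routines. But there is no certainty that the approach is appropriate, and trying out every idea is not realistic. Here is a recommended routine: look for solutions by combining papers with benchmarks, which should get much more done with much less effort.

The best approach is to start from papers. A paper can be treated as a solution to a real task. You may not have studied a complex task in depth, but others have already explored methods for it, and a paper is their summary of good ideas, experimental results, and the problems they ran into. Studying and improving their methods and then applying them to your own task sounds like a good plan.

But how do you find suitable papers to refer to? If you do not work in a particular field, you may not be familiar with these resources. The recommendation here is the benchmark. It is essentially a database that holds datasets for a particular field and also lists many papers from that field along with their test results.

Figure 10-2 shows a benchmark that the author, Di Ge, has run experiments on; its home page gives an overall introduction. For example, for the image recognition task of human keypoint detection, it not only provides a human pose dataset but also lists many related papers. Papers listed on a benchmark have generally been shown to perform very well.

Figure 10-2 MPII human pose estimation benchmark

Figure 10-3 shows part of the listed papers; the classic pose estimation papers from 2013 to 2018 can all be found here. Readers familiar with computer vision will recognize that these papers were published at very high-level venues. On the right are their experimental results, including recognition accuracy for the head, shoulders, and individual joints. The results improve steadily year by year, and the task is now quite mature.

Figure 10-3 Results of the listed papers

If you do not know how to choose suitable papers, read the classics. Papers found through a direct search may be of mediocre value, whereas the papers recommended by a benchmark are classics worth learning from.

Another feature of benchmarks is that many of the listed papers have public code. Figures 10-4 and 10-5 show a paper's home page, which offers not only the source code of the experiments but also trained models. This is a great help both for completing a real task and for learning. Suppose you need to do a human pose estimation task: you now have not only the best-performing recognition code available but also the model trained by the original authors. Deploy it directly to a server, and within a day you can say that the task is basically done and that nothing currently performs better (this gives our work a shortcut).

▲ Figure 10-4 Public source code of a paper (1)

▲ Figure 10-5 Public source code of a paper (2)

When starting out, it is best to combine theory with practice. The paper is the guiding idea that tells you what to do step by step, and the code it provides is the practical method. In the author's view, learning without source code is very painful, because papers simplify away many details. Many readers probably feel the same: reading the code is often a more direct way to understand the ideas of a paper.

How should the source code be used? The projects you get are usually fairly complex, and reading the code line by line is hard going. The best approach is to debug step by step, see what each step does, and then relate it to the paper, which makes it much easier to understand.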

10.4 Image Features

pip install scikit-image

import skimage
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skimage import io
# opencv, tensorflow
%matplotlib inline

# Read the two images as arrays of shape (height, width, channels)
cat = io.imread('./datasets/cat.png')
dog = io.imread('./datasets/dog.png')
df = pd.DataFrame(['Cat', 'Dog'], columns=['Image'])

print(cat.shape, dog.shape)

(168, 300, 3) (168, 300, 3)

cat  # values range from 0 to 255; smaller values are darker, larger values are brighter

array([[[114, 105,  90],
        [113, 104,  89],
        [112, 103,  88],
        ...,
        [127, 130, 121],
        [130, 133, 124],
        [133, 136, 127]],

       [[113, 104,  89],
        [112, 103,  88],
        [111, 102,  87],
        ...,
        [129, 132, 125],
        [132, 135, 128],
        [135, 138, 131]],

       [[111, 102,  87],
        [111, 102,  87],
        [110, 101,  86],
        ...,
        [132, 134, 133],
        [136, 138, 137],
        [139, 141, 140]],

       ...,

       [[ 32,  26,  28],
        [ 32,  26,  28],
        [ 30,  24,  26],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]],

       [[ 33,  27,  29],
        [ 32,  26,  28],
        [ 31,  25,  27],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]],

       [[ 33,  27,  29],
        [ 32,  26,  28],
        [ 31,  25,  27],
        ...,
        [131, 131, 131],
        [131, 131, 131],
        [130, 130, 130]]], dtype=uint8)

# coffee = skimage.transform.resize(coffee, (300, 451), mode='reflect')
# Show the two images side by side
fig = plt.figure(figsize = (8,4))
ax1 = fig.add_subplot(1,2, 1)
ax1.imshow(cat)
ax2 = fig.add_subplot(1,2, 2)
ax2.imshow(dog)

<matplotlib.image.AxesImage at 0x233c9b53988>

dog_r = dog.copy() # Red Channel
dog_r[:,:,1] = dog_r[:,:,2] = 0 # set G,B pixels = 0
dog_g = dog.copy() # Green Channel
dog_g[:,:,0] = dog_g[:,:,2] = 0 # set R,B pixels = 0
dog_b = dog.copy() # Blue Channel
dog_b[:,:,0] = dog_b[:,:,1] = 0 # set R,G pixels = 0

# Place the three single-channel images side by side
plot_image = np.concatenate((dog_r, dog_g, dog_b), axis=1)
plt.figure(figsize = (10,4))
plt.imshow(plot_image)

dog_r

array([[[160,   0,   0],
        [160,   0,   0],
        [160,   0,   0],
        ..., 
        [113,   0,   0],
        [113,   0,   0],
        [112,   0,   0]],

       [[160,   0,   0],
        [160,   0,   0],
        [160,   0,   0],
        ..., 
        [113,   0,   0],
        [113,   0,   0],
        [112,   0,   0]],

       [[160,   0,   0],
        [160,   0,   0],
        [160,   0,   0],
        ..., 
        [113,   0,   0],
        [113,   0,   0],
        [112,   0,   0]],

       ..., 
       [[165,   0,   0],
        [165,   0,   0],
        [165,   0,   0],
        ..., 
        [212,   0,   0],
        [211,   0,   0],
        [210,   0,   0]],

       [[165,   0,   0],
        [165,   0,   0],
        [165,   0,   0],
        ..., 
        [210,   0,   0],
        [210,   0,   0],
        [209,   0,   0]],

       [[164,   0,   0],
        [164,   0,   0],
        [164,   0,   0],
        ..., 
        [209,   0,   0],
        [209,   0,   0],
        [209,   0,   0]]], dtype=uint8)

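Grayscale images:

from skimage.color import rgb2gray
# Assumed definitions: cgs and dgs are missing from the original notes
cgs = rgb2gray(cat)
dgs = rgb2gray(dog)

fig = plt.figure(figsize = (8,4))
ax1 = fig.add_subplot(2,2, 1)
ax1.imshow(cgs, cmap="gray")
ax2 = fig.add_subplot(2,2, 2)
ax2.imshow(dgs, cmap='gray')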

<matplotlib.image.AxesImage at 0x1fca2353358>
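A grayscale image replaces the three color channels of each pixel with a single brightness value, computed as a weighted sum of R, G, and B. The sketch below applies such a weighting to individual pixels; the weights are assumed to match the luminance coefficients used by skimage.color.rgb2gray, and the function name to_gray is made up for illustration.

```python
# Luminance weights for R, G, B (assumed to match skimage.color.rgb2gray)
WEIGHTS = (0.2125, 0.7154, 0.0721)

def to_gray(pixel):
    """Convert one (R, G, B) pixel with 0-255 channels to a 0-1 gray level."""
    return sum(w * c for w, c in zip(WEIGHTS, pixel)) / 255

print(round(to_gray((114, 105, 90)), 4))   # first pixel of the cat array above
print(to_gray((0, 0, 0)))                  # black
print(round(to_gray((255, 255, 255)), 4))  # white
```

Green carries the largest weight because the eye is most sensitive to it, so a pure green pixel comes out brighter than a pure red or pure blue one.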

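Chapter summary:

This chapter introduced common feature extraction methods, mainly for numerical and text features. Each method has its own strengths and weaknesses. At the start of a task, try as many extraction methods as possible. Having many features is not a problem, since they can be filtered through experiments during modeling, but with too few there is nothing you can do. So the feature engineering stage deserves careful thought, and the modeling plan should be considered in advance. Once massive data is involved, feature extraction is a long job, and taking it one step at a time without a plan greatly reduces efficiency.

When working on a task, always consult papers and try the various solutions. The best approach is to first learn how others have done it and then apply that to your own task.

End of Chapter 10.

Index of my personal Python data analysis reading notes

Book resources can be downloaded from the Epubit community (異步社區): https://www.epubit.com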