4 Ways to Implement Feature Selection for Machine Learning in Python

Date: 2019-12-05


Source | 願碼 (ChainDesk.CN) content editors


Website | http://chaindesk.cn


Reading time: 13 min

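In this article, we look at different ways to select features from a dataset, and we discuss the types of feature selection algorithms and their Python implementations using the Scikit-learn (sklearn) library: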

Univariate feature selection
Recursive feature elimination (RFE)
Principal component analysis (PCA)
Feature importance

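Univariate feature selection

Statistical tests can be used to select the features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.

The following example uses the chi-squared (chi^2) statistical test for non-negative features to select the four best features from the Pima Indians diabetes dataset: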

# Feature extraction with univariate statistical tests (chi-squared for classification)

# Import the required packages
# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's feature selection algorithm
from sklearn.feature_selection import SelectKBest
# Import chi2 to perform the chi-squared test
from sklearn.feature_selection import chi2

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create a pandas data frame by loading the data from the URL
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

# We will select the features using the chi-squared test
test = SelectKBest(score_func=chi2, k=4)
# Fit the selector so that it scores every feature
fit = test.fit(X, Y)
# Summarize the scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
# Apply the transformation to the dataset
features = fit.transform(X)
# Summarize the selected features
print(features[0:5,:])

The output shows the score for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age.

Scores for each feature:

[ 111.52  1411.887  17.605  53.108  2175.565  127.669  5.393  181.304]

Selected features:

[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
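To make the scoring step less of a black box, here is a minimal pure-Python sketch of a chi-squared score for non-negative features. It assumes the score compares each feature's per-class sum with the sum we would expect if the feature were independent of the class. The helper names chi2_scores and select_k_best and the toy data are hypothetical, and this is an illustration rather than scikit-learn's own code.

```python
# Minimal sketch of chi-squared scoring for non-negative features.
# Toy data only; this is NOT the Pima Indians dataset.

def chi2_scores(X, y):
    """Return one chi-squared score per column of X (a list of rows)."""
    classes = sorted(set(y))
    n_rows = len(X)
    scores = []
    for j in range(len(X[0])):
        col_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Observed: sum of the feature over the rows of class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected: the class's share of the rows times the column total
            class_share = sum(1 for label in y if label == c) / n_rows
            expected = col_total * class_share
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def select_k_best(scores, k):
    """Return the indices of the k highest scores, in column order."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)

X = [[1, 5, 0], [2, 5, 1], [9, 5, 0], [8, 5, 1]]
y = [0, 0, 1, 1]
scores = chi2_scores(X, y)
selected = select_k_best(scores, k=1)
print(scores)
print(selected)
```

Only the first toy column differs between the two classes, so it is the one that gets selected; the constant column and the column with identical per-class sums both score zero.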

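Recursive feature elimination (RFE)

RFE works by recursively removing attributes and building a model on those that remain. It uses the fitted model to identify which attributes (and combinations of attributes) contribute most to predicting the target attribute; in scikit-learn's RFE, the ranking comes from the estimator's coefficients or feature importances. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter much, as long as it is skillful and consistent: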

# Import the required packages
# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
# Import LogisticRegression as the estimator that RFE will fit
from sklearn.linear_model import LogisticRegression

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create a pandas data frame by loading the data from the URL
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

# Feature extraction
model = LogisticRegression()
# Pass the number of features by keyword; recent sklearn versions require it
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

After running it, we get:

Num Features: 3

Selected Features: [ True False False False False   True  True False]

Feature Ranking: [1 2 3 5 6 1 1 4]

You can see that RFE chose preg, mass, and pedi as the top three features. These are marked True in the support_ array and given rank 1 in the ranking_ array.
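The elimination loop itself can be sketched in a few lines of NumPy. This is only an illustration of the idea: it swaps in ordinary least squares for logistic regression, treats the absolute coefficient as the importance (which assumes the features are on comparable scales), and uses a hypothetical helper name rfe_ranking with synthetic data.

```python
import numpy as np

def rfe_ranking(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep remain."""
    remaining = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)
    while len(remaining) > n_keep:
        # Fit a least-squares model on the features that are still in play
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = remaining[int(np.argmin(np.abs(coef)))]
        # Features dropped earlier receive a larger (worse) rank
        ranking[weakest] = len(remaining) - n_keep + 1
        remaining.remove(weakest)
    support = np.zeros(X.shape[1], dtype=bool)
    support[remaining] = True
    return support, ranking

# Synthetic data: only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=200)

support, ranking = rfe_ranking(X, y, n_keep=2)
print(support)
print(ranking)
```

As in the output above, the surviving features are the ones marked True and ranked 1, while the eliminated features carry higher rank numbers.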

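Principal component analysis (PCA)

PCA uses linear algebra to transform the dataset into a compressed form. It is generally regarded as a data reduction technique. One property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In the following example, we use PCA and select three principal components: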

# Import the required packages
# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's PCA algorithm
from sklearn.decomposition import PCA

# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:,0:8]
Y = array[:,8]

# Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# Summarize the components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

You can see that the transformed dataset (three principal components) bears little resemblance to the source data:

Explained Variance: [ 0.88854663   0.06159078  0.02579012]

[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [ -2.26488861e-02  -9.72210040e-01  -1.41909330e-01   5.78614699e-02
    9.46266913e-02  -4.69729766e-02  -8.16804621e-04  -1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
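The linear algebra behind these numbers can be sketched with NumPy alone. The version below is a simplified illustration that centers the data and takes a singular value decomposition; the helper name pca_fit and the synthetic data are hypothetical, and it is not meant to reproduce scikit-learn's implementation detail for detail.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the top principal axes and their explained-variance ratios."""
    Xc = X - X.mean(axis=0)                      # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return Vt[:n_components], ratio[:n_components]

# Synthetic data: two columns driven by one latent factor, plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
X = np.hstack([5.0 * latent, 2.0 * latent, 0.1 * rng.normal(size=(100, 1))])

components, ratio = pca_fit(X, n_components=2)
projected = (X - X.mean(axis=0)) @ components.T  # the compressed dataset
print(ratio)
print(projected.shape)
```

Each row of components is a unit-length direction in the original feature space, which is why the printed matrix looks nothing like the raw data: it describes axes, not observations.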

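Feature importance

Feature importance is a technique for selecting features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute in order to create splits; we can use this measure as a feature selector. Let's look at it in detail.

Random forests are among the most popular machine learning methods because of their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of many decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure on which the (locally) optimal condition is chosen is called impurity. For classification, it is typically Gini impurity or information gain/entropy; for regression trees, it is variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged, and the features ranked according to this measure.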
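As a small worked example of weighted impurity decrease, the sketch below computes Gini impurity for a parent node and for the two children produced by a split. The function names gini and impurity_decrease and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = ["a", "a", "b", "b"]
perfect = impurity_decrease(parent, ["a", "a"], ["b", "b"])  # classes fully separated
useless = impurity_decrease(parent, ["a", "b"], ["a", "b"])  # children as mixed as the parent
print(perfect, useless)
```

A feature whose splits separate the classes cleanly accumulates a large decrease, while a feature whose splits leave the children as mixed as the parent contributes nothing.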

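Let's see how to do feature selection using a random forest classifier and evaluate the classifier's accuracy before and after feature selection. We will use the Otto dataset.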

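This dataset describes 93 obfuscated details of more than 61,000 products grouped into nine product categories (for example, fashion, electronics, and so on). The input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities, one for each of the nine categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).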
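For readers who have not met the metric before, here is a minimal sketch of multiclass logarithmic loss. It assumes the true classes are given as integer column indices and clips probabilities to avoid log(0); the function name multiclass_log_loss, the clipping constant, and the toy inputs are assumptions made for illustration.

```python
import math

def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Average negative log of the probability assigned to the true class."""
    total = 0.0
    for true_class, row in zip(y_true, probabilities):
        # Clip so that a probability of exactly 0 or 1 does not break the log
        p = min(max(row[true_class], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Two samples, two classes, and a model that is completely unsure
uncertain = multiclass_log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]])
# One sample predicted with full confidence in the right class
confident = multiclass_log_loss([0], [[1.0, 0.0]])
print(uncertain, confident)
```

The loss rewards confident correct predictions and penalizes confident wrong ones heavily, which is why it is a stricter measure than the plain accuracy used in the rest of this section.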

We will start by importing all the libraries:

# Import the supporting libraries
# Import pandas to load the dataset from the CSV file
from pandas import read_csv
# Import numpy for array-based operations and calculations
import numpy as np
# Import the random forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
# Import the SelectFromModel feature selector class from sklearn
from sklearn.feature_selection import SelectFromModel

np.random.seed(1)

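Let's define a function to split the dataset into training and testing data; we will train our model on the training part, and the testing part will be used to evaluate the trained model: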

# Function to create the train and test sets from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    # Shuffle the rows in place before splitting
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    # Use a plain int so that large datasets cannot overflow a 16-bit counter
    trainlength = int(np.floor(split * shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing
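The same split can also be written without explicit loops by permuting row indices. This is just one possible alternative, not the article's helper: the name split_rows and the seed handling are hypothetical, and unlike the function above it leaves the input array unshuffled.

```python
import numpy as np

def split_rows(dataset, train_fraction, seed=0):
    """Return (train, test) row subsets of a 2-D array without modifying it."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(dataset))        # shuffled row indices
    n_train = int(np.floor(train_fraction * len(dataset)))
    return dataset[order[:n_train]], dataset[order[n_train:]]

toy = np.arange(40).reshape(20, 2)               # 20 rows, 2 columns
train_part, test_part = split_rows(toy, 0.75)
print(train_part.shape, test_part.shape)
```

Every row lands in exactly one of the two parts, so the training and testing sets never overlap.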

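We also need a function to evaluate the model's accuracy; it takes the predicted and actual outputs as input and returns the fraction of correct predictions, which we later multiply by 100 to get a percentage: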

# Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        # Count the predictions that match the actual label
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count) / len(ytest)
    return acc
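For comparison, the same calculation can be expressed as a single vectorized NumPy expression. The function name accuracy and the toy labels below are hypothetical; the result is the same fraction that getAccuracy returns.

```python
import numpy as np

def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the actual label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))

toy_accuracy = accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"])
print("%.2f" % (100 * toy_accuracy))
```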

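Now it is time to load the dataset. We will load the train.csv file, which contains more than 61,000 training instances. We will use 50,000 instances in our example, of which 35,000 are used to train the classifier and 15,000 to test its performance: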

# Load the dataset as a pandas data frame
data = read_csv('train.csv')
# Extract the attribute names from the data frame
feat = data.keys()
# Index.get_values() was removed in pandas 1.0; .values works across versions
feat_labels = feat.values
# Extract the data values from the data frame
dataset = data.values
# Shuffle the dataset
np.random.shuffle(dataset)
# We will use 50000 instances for training and testing
inst = 50000
# Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]
# Create training and testing data for performance evaluation
train,test = getTrainTestData(dataset, 0.7)
# Split the data into input and output variables
Xtrain = train[:,0:94]
ytrain = train[:,94]
shape = np.shape(Xtrain)
print("Shape of the dataset ",shape)
# Print the size of the data in MB
print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6))

Note the data size here: because our dataset contains about 35,000 training instances with 94 attributes, it is fairly large. Let's take a look:

Shape of the dataset (35000, 94)

Size of Data set before feature selection: 26.32 MB

As you can see, our dataset has 35,000 rows and 94 columns, which comes to more than 26 MB of data.
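The reported size can be checked with simple arithmetic. The sketch below assumes that each array element occupies 8 bytes, which is typical for a float64 value or an object pointer on a 64-bit build; that byte width is our assumption, not something the output states.

```python
rows, columns = 35000, 94
bytes_per_value = 8  # assumption: 8 bytes per array element

size_mb = rows * columns * bytes_per_value / 1e6
print("Size of Data set before feature selection: %.2f MB" % size_mb)
```

Under that assumption the result agrees with the 26.32 MB printed above.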

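In the next code block, we configure the random forest classifier; we will use 250 trees with a maximum depth of 30 and 7 random features per split. The other hyperparameters keep sklearn's default values: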

# Let's select the test data for model evaluation
Xtest = test[:,0:94]
ytest = test[:,94]

# Create a random forest classifier with the following parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees,
                             max_features=max_feat,
                             max_depth=max_depth,
                             min_samples_split=min_sample,
                             random_state=0,
                             n_jobs=-1)

# Train the classifier and measure the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
# Note down the model training time
print("Execution time for building the Tree is: %f"%(float(end)- float(start)))
pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

Execution time for building the Tree is: 2.913641

#Evaluate the model performance for the test data

acc = getAccuracy(pre, ytest)

print("Accuracy of model before feature selection is %.2f"%(100*acc))

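The accuracy of our model is:

Accuracy of model before feature selection is 98.82

As you can see, we are getting very good accuracy: nearly 99% of the test data is classified into the correct categories. This means we correctly classify 14,823 of the 15,000 test instances.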

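So, now my question is: should we try to improve further? Well, why not? If we can, we should certainly go for more improvement; here, we will use feature importance to select features. As you know, during tree building we use an impurity measure to select nodes. The attribute value with the lowest impurity is chosen as the node in the tree. We can use a similar criterion for feature selection. We can give more importance to features that have less impurity, and this can be done using the feature_importances_ attribute of the fitted sklearn classifier. Let's find out the importance of each feature: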

# Once we have trained the model, we will rank all the features
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

('id', 0.33346650420175183)

('feat_1', 0.0036186958628801214)

('feat_2', 0.0037243050888530957)

('feat_3', 0.011579217472062748)

('feat_4', 0.010297382675187445)

('feat_5', 0.0010359139416194116)

('feat_6', 0.00038171336038056165)

('feat_7', 0.0024867672489765021)

('feat_8', 0.0096689721610546085)

('feat_9', 0.007906150362995093)

('feat_10', 0.0022342480802130366)

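As you can see here, each feature has a different importance based on its contribution to the final prediction. Note that id, which is a row identifier rather than a product attribute, receives by far the largest score in this listing, so it may be acting as a proxy for the class label.

We will use these importance scores to rank our features; in the following section, we select the features whose importance is greater than 0.01 for model training: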

# Select the features that contribute most to the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain,ytrain)
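What the threshold does can be illustrated directly on the scores printed earlier. The sketch below copies only the eleven importances shown in the listing above, so it is a partial picture of the 94 columns; the variable names are hypothetical, and keeping scores greater than or equal to the threshold is our reading of how the selector behaves.

```python
# Importances copied from the partial listing above (first 11 columns only)
importances = {
    'id': 0.33346650420175183,
    'feat_1': 0.0036186958628801214,
    'feat_2': 0.0037243050888530957,
    'feat_3': 0.011579217472062748,
    'feat_4': 0.010297382675187445,
    'feat_5': 0.0010359139416194116,
    'feat_6': 0.00038171336038056165,
    'feat_7': 0.0024867672489765021,
    'feat_8': 0.0096689721610546085,
    'feat_9': 0.007906150362995093,
    'feat_10': 0.0022342480802130366,
}
threshold = 0.01

# Keep every feature whose importance reaches the threshold
selected = [name for name, score in importances.items() if score >= threshold]
# Rank all listed features from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

print(selected)
print(ranked[:3])
```

Among the listed columns, feat_8 narrowly misses the cut at roughly 0.0097, which shows how sensitive the selected set can be to the chosen threshold.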

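Here, we transform the input dataset according to the selected features. In the next code block, we transform the dataset and then check the size and shape of the new dataset: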

# Transform the input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
# Let's see the size and shape of the new dataset
print("Size of Data set after feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ",shape)

Size of Data set after feature selection: 5.60 MB
Shape of the dataset (35000, 20)

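Do you see the shape of the dataset? After the feature selection process we are left with only 20 features, which reduces the size of the dataset from 26 MB to 5.60 MB. That is a reduction of about 80% compared with the original dataset.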
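The size of the reduced dataset and the percentage saved follow from the column counts. As a rough check, again assuming 8 bytes per array element (our assumption), the arithmetic looks like this:

```python
rows = 35000
columns_before, columns_after = 94, 20
bytes_per_value = 8  # assumption: 8 bytes per array element

after_mb = rows * columns_after * bytes_per_value / 1e6
# With the row count unchanged, the saving depends only on the column counts
reduction = 1 - columns_after / columns_before

print("%.2f MB after selection, %.1f%% smaller" % (after_mb, 100 * reduction))
```

The exact figure is a little under 79%, which the text rounds to about 80%.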

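In the next code block, we train a new random forest classifier with the same hyperparameters as before and test it on the testing dataset. Let's see what accuracy we get after modifying the training set: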

# Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f"%(float(end)- float(start)))
# Let's evaluate the model on the test data
pre = clf.predict(Xtest_1)
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f"%(100*acc2))

Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97

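There you have it! We obtained 99.97% accuracy with the modified dataset, which means we classified 14,996 instances correctly, whereas previously we classified only 14,823 instances correctly.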
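To relate the instance counts to the reported percentages, the short sketch below converts each count of correct predictions into an accuracy and an error count. The counts and the test-set size of 15,000 are taken from the text; the helper name summarize is hypothetical.

```python
def summarize(correct, total):
    """Return the accuracy as a formatted percentage and the number of errors."""
    return "%.2f" % (100 * correct / total), total - correct

total = 15000
pct_before, errors_before = summarize(14823, total)
pct_after, errors_after = summarize(14996, total)

print("Before: %s%% accurate, %d misclassified" % (pct_before, errors_before))
print("After:  %s%% accurate, %d misclassified" % (pct_after, errors_after))
```

Seen as errors rather than accuracy, the change is from 177 misclassified test instances to 4.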

This is a big improvement from the feature selection process; we can summarize all the results in the following table: