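A brief look at machine learning methods (decision tree, SVM, kNN, random forest, naive Bayes, logistic regression) and at implementing text classification and regression with the sklearn toolkit

Date: 2019-11-20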

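1. Decision tree

Choose an initial root node, then start from it and branch: each split divides the data by a feature value. (Because the initial split point may fall on a boundary value, overfitting can occur.)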
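
To make the idea of "branching from a node" concrete, here is a minimal sketch of how a single split could be chosen on one numeric feature. It assumes Gini impurity as the split criterion; the function names `gini` and `best_split` and the toy data are made up for illustration and are not part of scikit-learn.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(xs, ys):
    """Return (threshold, weighted_gini) of the best split 'x <= threshold'."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best


# Toy data (hypothetical): small x values are class 0, large ones class 1.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)  # 6.5 0.0
```

A full tree would apply the same search recursively to the left and right subsets. Limiting how deep that recursion goes is one common way to keep the tree from overfitting.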

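2. SVM

Before deep learning took off, SVM was widely regarded as one of the best classification algorithms. Its characteristics are as follows:

(1) It can be applied to linear classification (and to regression problems), and also to nonlinear classification;

(2) By choosing a kernel function and tuning its parameters, the dataset can be mapped into a higher-dimensional space at a finer granularity, so that its features go from two dimensions to many. A problem that is not linearly separable in two dimensions can thus become linearly separable in the higher-dimensional space, where SVM then looks for an optimal separating hyperplane (comparable to searching for an optimal solution on top of what a decision tree does). This is why SVM often classifies better than many other machine learning classification methods.

(3) Through other parameter settings (for example the regularization parameter C), SVM can also guard against overfitting.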
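
The kernel idea in point (2) can be illustrated without any library. The sketch below uses an explicit feature map x -> (x, x**2) on hypothetical 1-D data; a real kernel SVM performs such a mapping implicitly, so this shows the principle only, not how scikit-learn implements it.

```python
# 1-D points: the class-1 points sit between the class-0 points, so no single
# threshold on x separates the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]


def separable_by_threshold(values, labels):
    """True if one threshold on `values` puts each class on its own side."""
    pairs = sorted(zip(values, labels))
    ordered = [label for _, label in pairs]
    # Separable iff the label changes at most once along the sorted axis.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


def feature_map(x):
    """Explicit map x -> (x, x**2); a kernel would do this implicitly."""
    return (x, x * x)


before = separable_by_threshold(xs, ys)
squared = [feature_map(x)[1] for x in xs]
after = separable_by_threshold(squared, ys)
print(before, after)  # False True
```

In the mapped space the horizontal line x**2 = 2.5 separates the two classes, which is the "linearly separable in higher dimensions" situation described in point (2).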

Recommended reading (strongly recommended by my senior labmate Dada): 支持向量機通俗導論（理解SVM的三層境界）

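3. Random forest

A random forest is essentially a collection of many decision trees whose predictions are combined, which helps prevent overfitting.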
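
Two ingredients of "many trees combined" can be sketched in a few lines: each tree is trained on a bootstrap sample of the data, and the trees' class predictions are merged by majority vote. The helper names below are hypothetical, and the sketch leaves out the random feature selection that a real random forest also uses.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one tree's training set)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
# Same size as the original data, but typically with repeated rows.
print(len(sample), len(set(sample)))

# Pretend three trees predicted these labels for one test point.
vote = majority_vote([1, 0, 1])
print(vote)
```

Because every tree sees a slightly different dataset, their individual errors tend to differ, and voting averages part of that variance away.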

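4. kNN (k-nearest neighbors)

Because kNN has to scan all the remaining points every time it looks for the nearest point, its computational cost is very high.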
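
The cost mentioned here is easy to see in a brute-force sketch: every query computes one distance per training point. The function name `knn_predict` and the toy points are illustrative assumptions. scikit-learn's own implementation can also use tree-based indexes to avoid the full scan.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Brute force: one distance computation per training point, per query.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_x, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_x, train_y, (0.3, 0.3)))  # 0
print(knn_predict(train_x, train_y, (4.9, 5.0)))  # 1
```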

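5. Naive Bayes

To find the probability of B given that event A has occurred (where A and B may each be decomposed into several events), compute the probability of A given B and then apply Bayes' theorem to obtain the result.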
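
As a small worked example of that calculation, the sketch below computes P(B|A) from P(A|B), P(B) and P(A|not B). The function name and all the probabilities are made up for illustration. Naive Bayes adds the assumption that the individual features are conditionally independent given the class, which this single-feature example does not need.

```python
def posterior(p_a_given_b, p_b, p_a_given_not_b):
    """P(B|A) from P(A|B), P(B) and P(A|not B) via Bayes' theorem."""
    # Total probability: P(A) = P(A|B)P(B) + P(A|not B)P(not B)
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)
    return p_a_given_b * p_b / p_a


# Hypothetical numbers: B = "message is spam", A = "message contains 'free'".
p = posterior(p_a_given_b=0.6, p_b=0.2, p_a_given_not_b=0.05)
print(round(p, 3))  # 0.75
```

Here P(A) = 0.6 * 0.2 + 0.05 * 0.8 = 0.16, so P(B|A) = 0.12 / 0.16 = 0.75.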

6. Logistic regression

(Discrete target variable; a binary classification problem with only two values, 0 and 1.)
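
A minimal sketch of how logistic regression turns a linear score into one of the two values 0 and 1: the score is passed through the logistic (sigmoid) function and the resulting probability is compared with a threshold. The weights below are hand-picked for illustration and are not fitted to any data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted label 0 or 1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, int(p >= threshold)


# Hypothetical, hand-picked parameters (not fitted to any data).
weights, bias = [2.0, -1.0], 0.0
print(predict(weights, bias, [1.0, 0.5]))   # z = 1.5  -> label 1
print(predict(weights, bias, [0.0, 2.0]))   # z = -2.0 -> label 0
```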

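This article mainly draws on the official scikit-learn website.

Basic classification methods (decision tree, SVM, KNN) and ensemble methods (random forest, Adaboost and GBRT) with scikit-learn

1. Data preparation

For classification we use the Iris dataset, which ships with scikit-learn.
Iris is a commonly used dataset for classification experiments, collected by Fisher in 1936. Also known as the iris flower dataset, it is a dataset for multivariate analysis. It contains 150 samples in 3 classes, 50 samples per class, and each sample has 4 attributes. The four attributes (sepal length, sepal width, petal length, petal width) are used to predict which of three species (Setosa, Versicolour, Virginica) an iris belongs to.

Note that the Iris dataset lists the three species in order: the first 50 samples are class 0, samples 51-100 are class 1, and samples 101-150 are class 2. So the order has to be shuffled before splitting into training and test sets.
Here we introduce a function that shuffles two arrays together: it takes two arguments, x and y, and shuffles them in unison.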

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    import numpy
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b
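
The same effect can be obtained more compactly by indexing both arrays with one shared permutation. This is an alternative sketch, not the article's function: the name `shuffle_in_unison_indexed` and the `seed` parameter are assumptions added for illustration.

```python
import numpy as np


def shuffle_in_unison_indexed(a, b, seed=None):
    """Return shuffled copies of a and b using one shared permutation."""
    assert len(a) == len(b)
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(len(a))
    # Fancy indexing applies the same row order to both arrays.
    return a[permutation], b[permutation]


x = np.arange(10).reshape(5, 2)   # row i is [2*i, 2*i + 1]
y = np.arange(5)
x_shuffled, y_shuffled = shuffle_in_unison_indexed(x, y, seed=0)
print(x_shuffled[:, 0] // 2)      # equals y_shuffled: rows and labels still match
print(y_shuffled)
```

Passing a fixed `seed` makes the shuffle repeatable, which helps when comparing classifiers across runs.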

Next we load the Iris data, shuffle it, and split it into 100 training samples and 50 test samples.

from sklearn.datasets import load_iris

iris = load_iris()
def load_data():
    iris.data, iris.target = shuffle_in_unison(iris.data, iris.target)
    x_train ,x_test = iris.data[:100],iris.data[100:]
    y_train, y_test = iris.target[:100].reshape(-1,1),iris.target[100:].reshape(-1,1)
    return x_train, y_train, x_test, y_test
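
scikit-learn also provides `train_test_split`, which shuffles and splits in one call. The sketch below is an alternative to the hand-written `load_data()`, not what the article uses. The `random_state` value and the use of `stratify` are assumptions chosen for reproducibility and balanced classes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# stratify keeps the class proportions similar in both parts;
# random_state fixes the shuffle so the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=0, stratify=iris.target
)
print(x_train.shape, x_test.shape)  # (100, 4) (50, 4)
```

Note that the return order (x_train, x_test, y_train, y_test) differs from the order returned by the article's `load_data()`.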

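2. Trying the different methods

Commonly used classification methods include decision trees, SVM, kNN and naive Bayes; ensemble methods include random forest, Adaboost and GBDT.
The complete code is as follows: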

from sklearn.datasets import load_iris
from sklearn import tree, svm, naive_bayes, neighbors
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier

iris = load_iris()

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    import numpy
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

def load_data():
    iris.data, iris.target = shuffle_in_unison(iris.data, iris.target)
    x_train, x_test = iris.data[:100], iris.data[100:]
    y_train, y_test = iris.target[:100].reshape(-1, 1), iris.target[100:].reshape(-1, 1)
    return x_train, y_train, x_test, y_test


x_train, y_train, x_test, y_test = load_data()

clfs = {'svm': svm.SVC(),
        'decision_tree': tree.DecisionTreeClassifier(),
        'naive_gaussian': naive_bayes.GaussianNB(),
        'naive_mul': naive_bayes.MultinomialNB(),
        'K_neighbor': neighbors.KNeighborsClassifier(),
        'bagging_knn': BaggingClassifier(neighbors.KNeighborsClassifier(), max_samples=0.5, max_features=0.5),
        'bagging_tree': BaggingClassifier(tree.DecisionTreeClassifier(), max_samples=0.5, max_features=0.5),
        'random_forest': RandomForestClassifier(n_estimators=50),
        'adaboost': AdaBoostClassifier(n_estimators=50),
        'gradient_boost': GradientBoostingClassifier(n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0)
        }

def try_different_method(clf):
    # ravel() turns the (n, 1) label column back into the 1-D array fit() expects
    clf.fit(x_train, y_train.ravel())
    score = clf.score(x_test, y_test.ravel())
    print('the score is :', score)

for clf_key in clfs.keys():
    print('the classifier is :', clf_key)
    clf = clfs[clf_key]
    try_different_method(clf)

The results are as follows:

the classifier is : svm
the score is : 0.94
the classifier is : decision_tree
the score is : 0.88
the classifier is : naive_gaussian
the score is : 0.96
the classifier is : naive_mul
the score is : 0.8
the classifier is : K_neighbor
the score is : 0.94
the classifier is : gradient_boost
the score is : 0.88
the classifier is : adaboost
the score is : 0.62
the classifier is : bagging_tree
the score is : 0.94
the classifier is : bagging_knn
the score is : 0.94
the classifier is : random_forest
the score is : 0.92
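
Scores from a single random 100/50 split can vary noticeably from run to run. One common way to get a steadier estimate is k-fold cross-validation. The sketch below is an addition, not part of the original experiment: it evaluates GaussianNB with 5 folds, and both the classifier and the fold count are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
# For classifiers, an integer cv uses stratified folds, so the class-ordered
# Iris rows do not need to be shuffled by hand here.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)
print(scores)
print("mean accuracy:", scores.mean())
```

The same call can be repeated for each entry of the `clfs` dictionary to compare the classifiers on equal footing.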

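Basic regression methods (linear, decision tree, SVM, KNN) and ensemble methods (random forest, Adaboost and GBRT) with scikit-learn

Preface: this tutorial uses only the most basic features of numpy, to generate data; matplotlib for plotting; and scikit-learn for the machine learning methods. If you are not familiar with them (neither am I), don't worry: the simplest numpy and matplotlib tutorials are enough. The program in this tutorial is under 50 lines.

1. Data preparation

For the experiment I wrote a function of two variables, y = 0.5*np.sin(x1) + 0.5*np.cos(x2) + 0.1*x1 + 3, where x1 ranges from 0 to 50 and, as the code below shows, x2 ranges from -10 to 10.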

import numpy as np

def f(x1, x2):
    y = 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 0.1 * x1 + 3
    return y

def load_data():
    x1_train = np.linspace(0, 50, 500)
    x2_train = np.linspace(-10, 10, 500)
    # np.random.random() returns a scalar, so each row holds three plain floats
    data_train = np.array([[x1, x2, f(x1, x2) + (np.random.random() - 0.5)] for x1, x2 in zip(x1_train, x2_train)])
    x1_test = np.linspace(0, 50, 100) + 0.5 * np.random.random(100)
    x2_test = np.linspace(-10, 10, 100) + 0.02 * np.random.random(100)
    data_test = np.array([[x1, x2, f(x1, x2)] for x1, x2 in zip(x1_test, x2_test)])
    return data_train, data_test

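The training set (with random noise between -0.5 and 0.5 added to y) and the test set (no noise) look as follows:

[Figure: training set and test set]

2. The simplest possible introduction to scikit-learn

scikit-learn is very simple: instantiate an algorithm object and call fit(). After fitting, use predict() to make predictions, and use score() to evaluate the difference between the predicted and true values; it returns a score. For example, a decision tree is used as follows: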

In [6]: from sklearn.tree import DecisionTreeRegressor

In [7]: clf = DecisionTreeRegressor()

In [8]: clf.fit(x_train,y_train)
Out[11]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
In [15]: result = clf.predict(x_test)

In [16]: clf.score(x_test,y_test)
Out[16]: 0.96352052312508396

In [17]: result
Out[17]:
array([ 2.44996735,  2.79065744,  3.21866981,  3.20188779,  3.04219101,
        2.60239551,  3.35783805,  2.40556647,  3.12082094,  2.79870458,
        2.79049667,  3.62826131,  3.66788213,  4.07241195,  4.27444808,
        4.75036169,  4.3854911 ,  4.52663074,  4.19299748,  4.42235821,
        4.48263415,  4.16192621,  4.40477767,  3.76067775,  4.35353213,
        4.6554961 ,  4.99228199,  4.29504731,  4.55211437,  5.08229167,
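
For scikit-learn regressors, score() returns the coefficient of determination R^2: 1.0 means a perfect fit, and 0.0 means no better than always predicting the mean of the true values. A minimal sketch of that formula with numpy, where the function name `r2_score_manual` is made up for illustration:

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_score_manual(y_true, y_true))                 # 1.0, perfect prediction
print(r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5]))   # 0.0, always predicting the mean
```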

Next, we can plot the predicted values against the true values. The plotting code is as follows:

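import numpy as np
import matplotlib.pyplot as plt

score = clf.score(x_test, y_test)  # the score shown in the plot title
plt.figure()
plt.plot(np.arange(len(result)), y_test, 'go-', label='true value')
plt.plot(np.arange(len(result)), result, 'ro-', label='predict value')
plt.title('score: %f' % score)
plt.legend()
plt.show()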

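The figure then looks like this:

[Figure: decision tree regression, predicted vs. true values]

3. Trying the different regression methods

To speed up testing, I wrote a function that takes an object of any regression class, plots the result, and reports the score.
The function is basically as follows: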

def try_different_method(clf):
    clf.fit(x_train,y_train)
    score = clf.score(x_test, y_test)
    result = clf.predict(x_test)
    plt.figure()
    plt.plot(np.arange(len(result)), y_test,'go-',label='true value')
    plt.plot(np.arange(len(result)),result,'ro-',label='predict value')
    plt.title('score: %f'%score)
    plt.legend()
    plt.show()

train, test = load_data()
x_train, y_train = train[:,:2], train[:,2]  # the first two columns are x1 and x2, the third is y; y has random noise here
x_test ,y_test = test[:,:2], test[:,2]  # same as above, but this y has no noise

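3.1 Standard regression methods

The standard regression methods are linear regression, decision tree regression, SVM and k-nearest neighbors (KNN).

3.1.1 Linear regression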

In [4]: from sklearn import linear_model

In [5]: linear_reg = linear_model.LinearRegression()

In [6]: try_different_method(linear_reg)

3.1.2 Decision tree regression

from sklearn import tree
tree_reg = tree.DecisionTreeRegressor()
try_different_method(tree_reg)

The plot for decision tree regression is then displayed:

[Figure: decision tree regression result]

3.1.3 SVM regression

In [7]: from sklearn import svm

In [8]: svr = svm.SVR()

In [9]: try_different_method(svr)

The resulting plot is as follows:

[Figure: SVM regression result]

3.1.4 KNN

In [11]: from sklearn import neighbors

In [12]: knn = neighbors.KNeighborsRegressor()

In [13]: try_different_method(knn)

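Surprisingly, KNN, the least computationally efficient of these algorithms, gives the best result here.

3.2 Ensemble methods (random forest, Adaboost, GBRT)

3.2.1 Random forest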

In [14]: from sklearn import ensemble

In [16]: rf = ensemble.RandomForestRegressor(n_estimators=20)  # 20 decision trees are used here

In [17]: try_different_method(rf)

3.2.2 Adaboost

In [18]: ada = ensemble.AdaBoostRegressor(n_estimators=50)

In [19]: try_different_method(ada)

The plot is as follows:

[Figure: Adaboost regression result]

3.2.3 GBRT

In [20]: gbrt = ensemble.GradientBoostingRegressor(n_estimators=100)

In [21]: try_different_method(gbrt)

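The plot is as follows:

[Figure: GBRT regression result]

4. scikit-learn has many other methods as well; consult the user guide and try them yourself.

5. Complete code

I wrote this code in PyCharm, but PyCharm did not display the figures, so you can copy the code into IPython instead, using the %paste magic to paste the snippet.
Then import the algorithms as shown in the sections above and call try_different_method() to draw the plots.
The complete code is as follows: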

import numpy as np
import matplotlib.pyplot as plt

def f(x1, x2):
    y = 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 3 + 0.1 * x1
    return y

def load_data():
    x1_train = np.linspace(0, 50, 500)
    x2_train = np.linspace(-10, 10, 500)
    # np.random.random() returns a scalar, so each row holds three plain floats
    data_train = np.array([[x1, x2, f(x1, x2) + (np.random.random() - 0.5)] for x1, x2 in zip(x1_train, x2_train)])
    x1_test = np.linspace(0, 50, 100) + 0.5 * np.random.random(100)
    x2_test = np.linspace(-10, 10, 100) + 0.02 * np.random.random(100)
    data_test = np.array([[x1, x2, f(x1, x2)] for x1, x2 in zip(x1_test, x2_test)])
    return data_train, data_test

train, test = load_data()
x_train, y_train = train[:,:2], train[:,2]  # the first two columns are x1 and x2, the third is y; y has random noise here
x_test, y_test = test[:,:2], test[:,2]  # same as above, but this y has no noise

def try_different_method(clf):
    clf.fit(x_train, y_train)
    score = clf.score(x_test, y_test)
    result = clf.predict(x_test)
    plt.figure()
    plt.plot(np.arange(len(result)), y_test, 'go-', label='true value')
    plt.plot(np.arange(len(result)), result, 'ro-', label='predict value')
    plt.title('score: %f' % score)
    plt.legend()
    plt.show()
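
If the figures cannot be displayed, the regressors can still be compared by their scores alone. The following self-contained sketch is a variation on the code above, not the original script: it regenerates the same kind of data in vectorized form, fixes the random seeds (an assumption, for repeatability), and prints the R^2 score of each regressor instead of plotting.

```python
import numpy as np
from sklearn import ensemble, linear_model, neighbors, svm, tree


def f(x1, x2):
    return 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 3 + 0.1 * x1


rng = np.random.default_rng(0)   # fixed seed: an assumption for repeatability
x1_train = np.linspace(0, 50, 500)
x2_train = np.linspace(-10, 10, 500)
x_train = np.column_stack([x1_train, x2_train])
y_train = f(x1_train, x2_train) + (rng.random(500) - 0.5)   # noise in [-0.5, 0.5)

x1_test = np.linspace(0, 50, 100) + 0.5 * rng.random(100)
x2_test = np.linspace(-10, 10, 100) + 0.02 * rng.random(100)
x_test = np.column_stack([x1_test, x2_test])
y_test = f(x1_test, x2_test)                                # no noise

regressors = {
    "linear": linear_model.LinearRegression(),
    "decision_tree": tree.DecisionTreeRegressor(random_state=0),
    "svr": svm.SVR(),
    "knn": neighbors.KNeighborsRegressor(),
    "random_forest": ensemble.RandomForestRegressor(n_estimators=20, random_state=0),
    "adaboost": ensemble.AdaBoostRegressor(n_estimators=50, random_state=0),
    "gbrt": ensemble.GradientBoostingRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, reg in regressors.items():
    reg.fit(x_train, y_train)
    scores[name] = reg.score(x_test, y_test)   # R^2 on the noise-free test set
    print(f"{name}: {scores[name]:.3f}")
```

The exact numbers depend on the seed and on the scikit-learn version, so they are not listed here.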