Toby, project collaboration QQ: 231469242
A random forest is a voting mechanism built from many decision trees.
To understand random forests, you first need to understand decision trees.
Random forest is an ensemble machine learning algorithm.
Algorithms used to build the individual trees:
Information gain
Gini index
Other decision tree algorithms
Advantages of random forests:
1. Can be used for both classification and regression
2. Can handle missing values without much loss of accuracy
3. Not prone to overfitting
4. Works with high-dimensional data and large datasets
5. Tongdun (同盾), for example, uses random forests
Disadvantages of random forests:
1. Works well for classification, but regression performance is weaker
2. The algorithm is something of a black box; beyond tuning its parameters there is little you can change
3. Performs worse on high-dimensional data with few samples
4. Cannot be visualized the way a single tree can
5. Training is time-consuming and CPU-intensive
Bagging is an ensemble meta-algorithm in machine learning used to improve stability and accuracy and to reduce variance.
Boosting is an ensemble meta-algorithm in machine learning used mainly to reduce bias, and also variance, in supervised learning.
Bagging is a method for improving the accuracy of a learning algorithm: it constructs a series of predictors and then combines them into a single predictor in a fixed way. Bagging requires an "unstable" base learner (unstable meaning that small changes in the data set can cause significant changes in the resulting classifier), for example decision trees or neural networks.
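As a minimal sketch of this idea (not part of the original post's code), scikit-learn's BaggingClassifier can wrap an unstable base learner such as a decision tree; the breast cancer data used later in this post serves as the example here:

# A minimal sketch of bagging with an unstable base learner (a decision tree),
# using scikit-learn's BaggingClassifier on the breast cancer data used later in this post.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 50 decision trees, each trained on a bootstrap resample of the training data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(x_train, y_train)
print("bagging accuracy on the test subset:{:.3f}".format(bagging.score(x_test, y_test)))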
Applications of random forests
1. Credit scoring of customers by lending companies
2. Assessing drug efficacy
3. Shopping cart (market basket) analysis
4. Stock price analysis
Random forest algorithm principles
Tree splitting function
Some random forest parameters:
Maximum depth
Advantages of random forests:
Effective and widely used; the default parameters perform well; the data do not need to be normalized or regularized; the Monte Carlo-style randomization makes the ensemble perform better than a single tree.
In the Python tests below, the random forest is not only more accurate but also identifies the strong factors more reliably.
Random Forest
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 31 09:30:24 2018
@author: Toby, project collaboration QQ: 231469242
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
# n_estimators=100 sets the number of trees in the forest
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)
print("random forest:")
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))
print('Feature importances:{}'.format(forest.feature_importances_))

n_features = cancer.data.shape[1]
plt.barh(range(n_features), forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.title("Random Forest:")
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()
Decision Tree
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 27 22:59:44 2018
@author: Toby, project collaboration QQ: 231469242

radius: radius
texture: standard deviation of gray-scale values
symmetry: symmetry

Strong factors found by the decision tree:
worst radius
worst symmetry
worst texture
texture error
"""
import csv, pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pydotplus
from IPython.display import Image
import graphviz
from sklearn.tree import export_graphviz
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
featureNames = cancer.feature_names
# random_state acts as the random seed
X_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

list_average_accuracy = []
depth = range(1, 30)
for i in depth:
    # limiting max_depth reduces model complexity and can give better generalization
    tree = DecisionTreeClassifier(max_depth=i, random_state=0)
    tree.fit(X_train, y_train)
    accuracy_training = tree.score(X_train, y_train)
    accuracy_test = tree.score(x_test, y_test)
    average_accuracy = (accuracy_training + accuracy_test) / 2.0
    # print("average_accuracy:", average_accuracy)
    list_average_accuracy.append(average_accuracy)

max_value = max(list_average_accuracy)
# indices start at 0, so add 1 to get the depth
best_depth = list_average_accuracy.index(max_value) + 1
print("best_depth:", best_depth)

best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
best_tree.fit(X_train, y_train)
accuracy_training = best_tree.score(X_train, y_train)
accuracy_test = best_tree.score(x_test, y_test)
print("decision tree:")
print("accuracy on the training subset:{:.3f}".format(best_tree.score(X_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(best_tree.score(x_test, y_test)))
print('Feature importances:{}'.format(best_tree.feature_importances_))

n_features = cancer.data.shape[1]
plt.barh(range(n_features), best_tree.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.title("Decision Tree:")
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

'''
print(cancer.DESCR)
print(cancer.feature_names)
print(cancer.target_names)
print(cancer.data)
print(type(cancer.data))
print(cancer.data.shape)

# export the tree for visualization
dot_data = export_graphviz(tree, out_file="cancertree.dot", class_names=['malignant', 'benign'],
                           feature_names=cancer.feature_names, impurity=False, filled=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
# ![](cancertree.png)
'''
Testing the effect of the number of trees in the random forest
Without feature selection, the test accuracy is 0.972.
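A rough sketch of such a test (assumed setup, reusing the same breast cancer split as the script above) simply loops over several values of n_estimators and compares test accuracy:

# A minimal sketch: vary the number of trees and compare test accuracy
# on the same breast cancer split as in the script above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

for trees in [1, 5, 10, 50, 100, 500]:
    forest = RandomForestClassifier(n_estimators=trees, random_state=0)
    forest.fit(x_train, y_train)
    print("%4d trees, accuracy on the test subset: %.3f" % (trees, forest.score(x_test, y_test)))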
Feature selection with random forests
Breast cancer data after feature selection
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 31 09:30:24 2018
@author: Administrator

Random forests do not require data preprocessing.
"""
import pandas as pd
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# file to read
readFileName = "breast_cancer_變量篩選.xlsx"
trees = 10000

# read the Excel file
df = pd.read_excel(readFileName)
# features: all columns except the last
data = df[df.columns[:-1]]
# label: the last column
target = df[df.columns[-1:]]
# feature names
feature_names = list(df.columns[:-1])

x_train, x_test, y_train, y_test = train_test_split(data, target, random_state=0)
# n_estimators is the number of trees; in testing, 100 trees are already enough
forest = RandomForestClassifier(n_estimators=trees, random_state=0)
forest.fit(x_train, y_train)
print("random forest with %d trees:" % trees)
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))
print('Feature importances:{}'.format(forest.feature_importances_))

n_features = data.shape[1]
plt.barh(range(n_features), forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features), feature_names)
plt.title("random forest with %d trees:" % trees)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

'''
accuracy on the training subset:1.000
accuracy on the test subset:0.972
'''
After selecting the 10 best variables and rebuilding the model, the accuracy is actually lower than without any feature selection, which shows the strength of the forest's Monte Carlo-style randomization. The advice is not to do feature selection: keep the original variables as they are and let the random forest's own randomization handle them.
Feature selection vs. no feature selection:
With feature selection: training accuracy 1.0, test accuracy 0.958
Without feature selection: training accuracy 1.0, test accuracy 0.972
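A minimal sketch of this comparison (using scikit-learn's built-in breast cancer data rather than the author's Excel file, so the exact numbers may differ): rank the features by importance, keep the top 10, and refit.

# A minimal sketch: rank features by importance, keep the top 10, and refit.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# rank features with a forest trained on all variables
ranker = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
top10 = np.argsort(ranker.feature_importances_)[::-1][:10]
print("selected features:", cancer.feature_names[top10])

# refit using only the 10 selected columns and compare test accuracy
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train[:, top10], y_train)
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test[:, top10], y_test)))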
Tuning random forest parameters
https://www.cnblogs.com/pinard/p/6160412.html
In the earlier post on the principles of Bagging and random forests, we summarized how random forests (RF for short) work. This section looks at RF from a practical angle, focusing on the points to watch when tuning RF in scikit-learn and on how RF tuning differs from GBDT tuning.
In scikit-learn, the RF classifier is RandomForestClassifier and the regressor is RandomForestRegressor. The Extra Trees variant is also available, as ExtraTreesClassifier and ExtraTreesRegressor. Since RF and Extra Trees differ only slightly and are tuned in essentially the same way, this section focuses on tuning RF.
As with GBDT, the parameters of RF fall into two groups: the parameters of the bagging framework and the parameters of the CART decision trees. They are introduced below.
First, the parameters of RF's bagging framework. It helps to compare them with GBDT. In the post on tuning scikit-learn's gradient boosted trees (GBDT) we covered GBDT's framework parameters, which are fairly numerous: the most important are the maximum number of iterations, the learning rate, and the subsampling ratio, and tuning them takes some effort. RF is simpler, because the weak learners in a bagging framework have no dependence on each other, which reduces the difficulty of tuning. In other words, reaching a comparable tuning result takes less time with RF than with GBDT.
Now let's look at RF's important bagging framework parameters. Since RandomForestClassifier and RandomForestRegressor share almost all of their parameters, they are discussed together, with differences pointed out where they exist.
1) n_estimators: the maximum number of weak learners, i.e. the number of trees. If n_estimators is too small the model tends to underfit; if it is too large the computation cost grows, and beyond a certain point adding more trees brings only a marginal improvement, so a moderate value is usually chosen. The default is 100.
2) oob_score: whether to use out-of-bag samples to evaluate the model. The default is False. Setting it to True is recommended, because the out-of-bag score reflects the generalization ability of the fitted model.
3) criterion: the impurity criterion the CART trees use when splitting on a feature. The loss functions differ between classification and regression. For classification RF the default is the Gini index ("gini"), with information gain (entropy) as the alternative. For regression RF the default is mean squared error ("mse"), with mean absolute error ("mae") as the alternative. The defaults are usually fine.
As you can see, RF has few important framework parameters; the main one to focus on is n_estimators, the maximum number of decision trees.
Next, the decision tree parameters of RF, which are essentially the same as those of GBDT:
1) max_features, the maximum number of features considered per split: it accepts several kinds of values. The default is "auto", which considers at most √N features per split; "log2" considers at most log2(N) features; "sqrt" or "auto" considers at most √N features. An integer means an absolute number of features; a float means a fraction of the features, i.e. round(fraction × N) features, where N is the total number of features. The default "auto" is usually fine; if there are very many features, the other options can be used to limit the number of features considered per split and thus control the tree-building time.
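A small worked example (values assumed, not from the source): with N = 30 features, "sqrt"/"auto" considers about 5 features per split and "log2" about 4.

import numpy as np

N = 30                   # total number of features
print(int(np.sqrt(N)))   # "sqrt" / "auto": 5 features considered per split
print(int(np.log2(N)))   # "log2": 4 features considered per split
print(int(0.2 * N))      # a float such as 0.2 means 20% of N, i.e. 6 features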
2) max_depth, the maximum tree depth: by default it is unset, in which case the depth of the subtrees is not limited. With little data or few features this value can usually be left alone. With many samples and many features it is recommended to limit the depth; the best value depends on the data distribution, with common values between 10 and 100.
3) min_samples_split, the minimum number of samples required to split an internal node: if a node has fewer samples than min_samples_split, it will not try to find the best feature to split on. The default is 2. With small datasets this value can be left alone; with very large datasets it is recommended to increase it.
4) min_samples_leaf, the minimum number of samples in a leaf node: if a leaf ends up with fewer samples than this, it is pruned together with its sibling. The default is 1; it can be given as an integer (minimum number of samples) or as a fraction of the total sample count. With small datasets this value can be left alone; with very large datasets it is recommended to increase it.
5) min_weight_fraction_leaf, the minimum weighted fraction of samples in a leaf node: if the total sample weight in a leaf falls below this value, the leaf is pruned together with its sibling. The default is 0, i.e. weights are ignored. In general, if many samples have missing values, or the class distribution of the classification data is heavily skewed, sample weights are introduced and this value then matters.
6) max_leaf_nodes, the maximum number of leaf nodes: limiting it helps prevent overfitting. The default is "None", i.e. unlimited. If a limit is set, the algorithm builds the best tree within that number of leaves. With few features this value can be ignored; with many features it can be restricted, and a specific value can be found by cross-validation.
7) min_impurity_split, the minimum impurity required to split a node: this limits the growth of the tree; if a node's impurity (Gini index or MSE) falls below this threshold, the node is not split further and becomes a leaf. Changing the default of 1e-7 is generally not recommended.
Of the tree parameters above, the most important are max_features, max_depth, min_samples_split, and min_samples_leaf.
Here we use the same dataset as in the GBDT tuning example to illustrate RF tuning (the download link for the data is given in the original post). In this example the out-of-bag score is used to evaluate the model.
The complete code is on my GitHub: https://github.com/ljpzzz/machinelearning/blob/master/ensemble-learning/random_forest_classifier.ipynb
First, we load the required libraries:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation, metrics
import matplotlib.pylab as plt
%matplotlib inline
Next, we load the unzipped data with the code below and take a look at the class distribution.
train = pd.read_csv('train_modified.csv')
target = 'Disbursed'  # Disbursed is the binary classification target
IDcol = 'ID'
train['Disbursed'].value_counts()
The class counts are shown below; class 0 makes up the large majority.
0 19680
1 320
Name: Disbursed, dtype: int64
Next we pick the feature columns and the target column.
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train['Disbursed']
With every parameter left at its default, let's fit the data and see:
rf0 = RandomForestClassifier(oob_score=True, random_state=10)
rf0.fit(X, y)
print(rf0.oob_score_)
y_predprob = rf0.predict_proba(X)[:, 1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
The output is below. The out-of-bag score is already high, and so is the AUC. Compared with GBDT's default-parameter output, RF's defaults fit this example somewhat better.
0.98005
AUC Score (Train): 0.999833
We first grid-search over n_estimators:
param_test1 = {'n_estimators': range(10, 71, 10)}
gsearch1 = GridSearchCV(estimator=RandomForestClassifier(min_samples_split=100,
                                                         min_samples_leaf=20, max_depth=8,
                                                         max_features='sqrt', random_state=10),
                        param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(X, y)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
The output is as follows:
([mean: 0.80681, std: 0.02236, params: {'n_estimators': 10},
mean: 0.81600, std: 0.03275, params: {'n_estimators': 20},
mean: 0.81818, std: 0.03136, params: {'n_estimators': 30},
mean: 0.81838, std: 0.03118, params: {'n_estimators': 40},
mean: 0.82034, std: 0.03001, params: {'n_estimators': 50},
mean: 0.82113, std: 0.02966, params: {'n_estimators': 60},
mean: 0.81992, std: 0.02836, params: {'n_estimators': 70}],
{'n_estimators': 60},
0.8211334476626017)
This gives us the best number of weak learner iterations. Next we grid-search over the maximum tree depth max_depth and the minimum number of samples required to split an internal node, min_samples_split.
param_test2 = {'max_depth': range(3, 14, 2), 'min_samples_split': range(50, 201, 20)}
gsearch2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60,
                                                         min_samples_leaf=20, max_features='sqrt',
                                                         oob_score=True, random_state=10),
                        param_grid=param_test2, scoring='roc_auc', iid=False, cv=5)
gsearch2.fit(X, y)
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
The output is:
([mean: 0.79379, std: 0.02347, params: {'min_samples_split': 50, 'max_depth': 3},
mean: 0.79339, std: 0.02410, params: {'min_samples_split': 70, 'max_depth': 3},
mean: 0.79350, std: 0.02462, params: {'min_samples_split': 90, 'max_depth': 3},
mean: 0.79367, std: 0.02493, params: {'min_samples_split': 110, 'max_depth': 3},
mean: 0.79387, std: 0.02521, params: {'min_samples_split': 130, 'max_depth': 3},
mean: 0.79373, std: 0.02524, params: {'min_samples_split': 150, 'max_depth': 3},
mean: 0.79378, std: 0.02532, params: {'min_samples_split': 170, 'max_depth': 3},
mean: 0.79349, std: 0.02542, params: {'min_samples_split': 190, 'max_depth': 3},
mean: 0.80960, std: 0.02602, params: {'min_samples_split': 50, 'max_depth': 5},
mean: 0.80920, std: 0.02629, params: {'min_samples_split': 70, 'max_depth': 5},
mean: 0.80888, std: 0.02522, params: {'min_samples_split': 90, 'max_depth': 5},
mean: 0.80923, std: 0.02777, params: {'min_samples_split': 110, 'max_depth': 5},
mean: 0.80823, std: 0.02634, params: {'min_samples_split': 130, 'max_depth': 5},
mean: 0.80801, std: 0.02637, params: {'min_samples_split': 150, 'max_depth': 5},
mean: 0.80792, std: 0.02685, params: {'min_samples_split': 170, 'max_depth': 5},
mean: 0.80771, std: 0.02587, params: {'min_samples_split': 190, 'max_depth': 5},
mean: 0.81688, std: 0.02996, params: {'min_samples_split': 50, 'max_depth': 7},
mean: 0.81872, std: 0.02584, params: {'min_samples_split': 70, 'max_depth': 7},
mean: 0.81501, std: 0.02857, params: {'min_samples_split': 90, 'max_depth': 7},
mean: 0.81476, std: 0.02552, params: {'min_samples_split': 110, 'max_depth': 7},
mean: 0.81557, std: 0.02791, params: {'min_samples_split': 130, 'max_depth': 7},
mean: 0.81459, std: 0.02905, params: {'min_samples_split': 150, 'max_depth': 7},
mean: 0.81601, std: 0.02808, params: {'min_samples_split': 170, 'max_depth': 7},
mean: 0.81704, std: 0.02757, params: {'min_samples_split': 190, 'max_depth': 7},
mean: 0.82090, std: 0.02665, params: {'min_samples_split': 50, 'max_depth': 9},
mean: 0.81908, std: 0.02527, params: {'min_samples_split': 70, 'max_depth': 9},
mean: 0.82036, std: 0.02422, params: {'min_samples_split': 90, 'max_depth': 9},
mean: 0.81889, std: 0.02927, params: {'min_samples_split': 110, 'max_depth': 9},
mean: 0.81991, std: 0.02868, params: {'min_samples_split': 130, 'max_depth': 9},
mean: 0.81788, std: 0.02436, params: {'min_samples_split': 150, 'max_depth': 9},
mean: 0.81898, std: 0.02588, params: {'min_samples_split': 170, 'max_depth': 9},
mean: 0.81746, std: 0.02716, params: {'min_samples_split': 190, 'max_depth': 9},
mean: 0.82395, std: 0.02454, params: {'min_samples_split': 50, 'max_depth': 11},
mean: 0.82380, std: 0.02258, params: {'min_samples_split': 70, 'max_depth': 11},
mean: 0.81953, std: 0.02552, params: {'min_samples_split': 90, 'max_depth': 11},
mean: 0.82254, std: 0.02366, params: {'min_samples_split': 110, 'max_depth': 11},
mean: 0.81950, std: 0.02768, params: {'min_samples_split': 130, 'max_depth': 11},
mean: 0.81887, std: 0.02636, params: {'min_samples_split': 150, 'max_depth': 11},
mean: 0.81910, std: 0.02734, params: {'min_samples_split': 170, 'max_depth': 11},
mean: 0.81564, std: 0.02622, params: {'min_samples_split': 190, 'max_depth': 11},
mean: 0.82291, std: 0.02092, params: {'min_samples_split': 50, 'max_depth': 13},
mean: 0.82177, std: 0.02513, params: {'min_samples_split': 70, 'max_depth': 13},
mean: 0.82415, std: 0.02480, params: {'min_samples_split': 90, 'max_depth': 13},
mean: 0.82420, std: 0.02417, params: {'min_samples_split': 110, 'max_depth': 13},
mean: 0.82209, std: 0.02481, params: {'min_samples_split': 130, 'max_depth': 13},
mean: 0.81852, std: 0.02227, params: {'min_samples_split': 150, 'max_depth': 13},
mean: 0.81955, std: 0.02885, params: {'min_samples_split': 170, 'max_depth': 13},
mean: 0.82092, std: 0.02600, params: {'min_samples_split': 190, 'max_depth': 13}],
{'max_depth': 13, 'min_samples_split': 110},
0.8242016800050813)
Let's check the current model's out-of-bag score:
rf1 = RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=110,
                             min_samples_leaf=20, max_features='sqrt',
                             oob_score=True, random_state=10)
rf1.fit(X, y)
print(rf1.oob_score_)
The output is:
0.984
The out-of-bag score has improved somewhat; that is, the model's generalization ability has increased.
We cannot fix min_samples_split yet, because it interacts with the other tree parameters. Next we tune min_samples_split together with the minimum number of samples per leaf, min_samples_leaf.
param_test3 = {'min_samples_split': range(80, 150, 20), 'min_samples_leaf': range(10, 60, 10)}
gsearch3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60, max_depth=13,
                                                         max_features='sqrt',
                                                         oob_score=True, random_state=10),
                        param_grid=param_test3, scoring='roc_auc', iid=False, cv=5)
gsearch3.fit(X, y)
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
The output is:
([mean: 0.82093, std: 0.02287, params: {'min_samples_split': 80, 'min_samples_leaf': 10},
mean: 0.81913, std: 0.02141, params: {'min_samples_split': 100, 'min_samples_leaf': 10},
mean: 0.82048, std: 0.02328, params: {'min_samples_split': 120, 'min_samples_leaf': 10},
mean: 0.81798, std: 0.02099, params: {'min_samples_split': 140, 'min_samples_leaf': 10},
mean: 0.82094, std: 0.02535, params: {'min_samples_split': 80, 'min_samples_leaf': 20},
mean: 0.82097, std: 0.02327, params: {'min_samples_split': 100, 'min_samples_leaf': 20},
mean: 0.82487, std: 0.02110, params: {'min_samples_split': 120, 'min_samples_leaf': 20},
mean: 0.82169, std: 0.02406, params: {'min_samples_split': 140, 'min_samples_leaf': 20},
mean: 0.82352, std: 0.02271, params: {'min_samples_split': 80, 'min_samples_leaf': 30},
mean: 0.82164, std: 0.02381, params: {'min_samples_split': 100, 'min_samples_leaf': 30},
mean: 0.82070, std: 0.02528, params: {'min_samples_split': 120, 'min_samples_leaf': 30},
mean: 0.82141, std: 0.02508, params: {'min_samples_split': 140, 'min_samples_leaf': 30},
mean: 0.82278, std: 0.02294, params: {'min_samples_split': 80, 'min_samples_leaf': 40},
mean: 0.82141, std: 0.02547, params: {'min_samples_split': 100, 'min_samples_leaf': 40},
mean: 0.82043, std: 0.02724, params: {'min_samples_split': 120, 'min_samples_leaf': 40},
mean: 0.82162, std: 0.02348, params: {'min_samples_split': 140, 'min_samples_leaf': 40},
mean: 0.82225, std: 0.02431, params: {'min_samples_split': 80, 'min_samples_leaf': 50},
mean: 0.82225, std: 0.02431, params: {'min_samples_split': 100, 'min_samples_leaf': 50},
mean: 0.81890, std: 0.02458, params: {'min_samples_split': 120, 'min_samples_leaf': 50},
mean: 0.81917, std: 0.02528, params: {'min_samples_split': 140, 'min_samples_leaf': 50}],
{'min_samples_leaf': 20, 'min_samples_split': 120},
0.8248650279471544)
Finally, we tune the maximum number of features, max_features:
param_test4 = {'max_features': range(3, 11, 2)}
gsearch4 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60, max_depth=13,
                                                         min_samples_split=120, min_samples_leaf=20,
                                                         oob_score=True, random_state=10),
                        param_grid=param_test4, scoring='roc_auc', iid=False, cv=5)
gsearch4.fit(X, y)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_
The output is:
([mean: 0.81981, std: 0.02586, params: {'max_features': 3},
mean: 0.81639, std: 0.02533, params: {'max_features': 5},
mean: 0.82487, std: 0.02110, params: {'max_features': 7},
mean: 0.81704, std: 0.02209, params: {'max_features': 9}],
{'max_features': 7},
0.8248650279471544)
Using the best parameters found by the search, let's look at the final model fit:
rf2 = RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=120,
                             min_samples_leaf=20, max_features=7,
                             oob_score=True, random_state=10)
rf2.fit(X, y)
print(rf2.oob_score_)
This time the output is:
0.984
The out-of-bag score has essentially not improved, mainly because 0.984 is already a very high out-of-bag score; to improve the model's generalization further we would need more data.
That concludes this summary of RF tuning; I hope it helps.
Application: a credit scoring system
http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
account balance
duration of credit
Data Set Information:
Two datasets are provided. the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".
For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric". This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the form used by StatLog.
This dataset requires use of a cost matrix (see below)
        1    2
  1     0    1
  2     5    0
(1 = Good, 2 = Bad)
The rows represent the actual classification and the columns the predicted classification.
It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
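As a minimal sketch (the helper below is hypothetical, not part of the UCI files), the cost matrix can be turned into an evaluation function that charges 5 for classifying a bad customer as good and 1 for classifying a good customer as bad:

import numpy as np
from sklearn.metrics import confusion_matrix

def average_cost(y_true, y_pred):
    # rows = actual class, columns = predicted class, with labels 1 = good, 2 = bad
    cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
    cost = np.array([[0, 1],    # good classified as bad costs 1
                     [5, 0]])   # bad classified as good costs 5
    return (cm * cost).sum() / float(cm.sum())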
Attribute Information:
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : .. >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customers name
Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no
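A minimal sketch of fitting a random forest to this data (assuming german.data-numeric has been downloaded from the UCI page above; it is whitespace-separated with 24 numeric columns followed by the 1/2 label):

# A minimal sketch, assuming german.data-numeric is in the working directory.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

german = pd.read_csv("german.data-numeric", delim_whitespace=True, header=None)
X = german.iloc[:, :-1]
y = german.iloc[:, -1]          # 1 = good, 2 = bad
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))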
Random forests are insensitive to multicollinearity, are fairly robust to missing and unbalanced data, and can model the effect of up to several thousand explanatory variables well (Breiman 2001b); they have been called one of the best algorithms currently available (Iverson et al. 2008).
In experiments you will find that when two features are linearly correlated, using only one of them gives essentially the same result as using both, i.e. random forests are insensitive to multicollinearity.
In regression analysis, multicollinearity among the predictors often severely distorts the parameter estimates, inflates the model error, and undermines the model's robustness, so removing multicollinearity is an important step in parameter estimation. Common regression models for handling multicollinearity in multiple linear regression include ridge regression, principal component regression (PCR), and partial least squares regression (PLS).
Logistic regression is very sensitive to multicollinearity among the predictors and expects them to be mutually independent; random forests need no such assumption.
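A minimal synthetic check of this insensitivity (not from the source): append an exact copy of one breast cancer feature and compare test accuracy with and without the duplicated column.

# A minimal synthetic check: duplicate one feature and compare test accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_dup = np.hstack([cancer.data, cancer.data[:, [0]]])   # append an exact copy of the first column

for name, data in [("original", cancer.data), ("with duplicated column", X_dup)]:
    x_train, x_test, y_train, y_test = train_test_split(data, cancer.target, random_state=0)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
    print("%s: accuracy on the test subset %.3f" % (name, forest.score(x_test, y_test)))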
1. Advantages of random forests:
Random forests estimate the model error with the out-of-bag (OOB) error. For classification the error is the misclassification rate; for regression it is the variance of the residuals. Each classification tree in the forest is grown on a bootstrap resample (drawn with replacement) of the original records, and each resample leaves out roughly 1/3 of the records (Liaw, 2012); the left-out records naturally form a hold-out set. A random forest therefore does not need a separate validation set for cross-validation: the algorithm itself behaves like cross-validation, and the OOB error is an unbiased estimate of the prediction error (Breiman, 2001).
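A minimal sketch of using the out-of-bag score in scikit-learn (no separate validation split is needed):

# A minimal sketch: the out-of-bag score needs no separate validation split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(cancer.data, cancer.target)
print("out-of-bag accuracy: %.3f" % forest.oob_score_)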
2. Disadvantages of random forests:
Random forests are a relatively new machine learning model. The classic model is the neural network, with more than half a century of history; neural networks predict accurately but are computationally heavy. In the 1980s Breiman and colleagues invented the classification tree algorithm (Breiman et al. 1984), which classifies or regresses by repeatedly splitting the data in two, greatly reducing the computation. In 2001 Breiman combined classification trees into the random forest (Breiman 2001a): the use of variables (columns) and of records (rows) is randomized to generate many classification trees, whose results are then aggregated. Random forests improve prediction accuracy without a significant increase in computation; they are insensitive to multicollinearity, fairly robust to missing and unbalanced data, and can model the effect of up to several thousand explanatory variables well (Breiman 2001b), and have been called one of the best algorithms currently available (Iverson et al. 2008).
As the name suggests, a random forest builds a forest in a random way. The forest is made up of many decision trees, and the individual trees are independent of one another. Once the forest is built, when a new sample arrives, every tree in the forest classifies it; the class chosen by the most trees becomes the prediction (for a classification task).
1.2 Advantages of random forests
Random forests are a popular algorithm at the moment, with many advantages:
a. They perform well on many datasets; the two sources of randomness make random forests resistant to overfitting.
b. On many current datasets they have a clear advantage over other algorithms; the two sources of randomness give random forests good noise tolerance.
c. They can handle very high-dimensional data (many features) without feature selection and adapt well to different datasets: they handle both discrete and continuous features, and the data need not be normalized.
d. They can produce a proximity matrix P = (p_ij) measuring the similarity between samples: p_ij = a_ij / N, where a_ij is the number of times samples i and j land in the same leaf node and N is the number of trees in the forest (see the sketch after this list).
e. While the forest is built, an unbiased estimate of the generalization error is obtained.
f. Training is fast, and a variable importance ranking is produced (two kinds: the increase in OOB misclassification rate, and the decrease in Gini impurity at splits).
g. Interactions between features can be detected during training.
h. They are easy to parallelize.
i. They are relatively simple to implement.
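A minimal sketch of the proximity matrix from point d (computed here with forest.apply, which returns the leaf index of every sample in every tree):

# A minimal sketch of the proximity matrix: p_ij is the fraction of trees in which
# samples i and j fall into the same leaf.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(cancer.data, cancer.target)

leaves = forest.apply(cancer.data)   # shape (n_samples, n_trees): leaf index of each sample in each tree
# proximity among the first 5 samples
prox = np.array([[np.mean(leaves[i] == leaves[j]) for j in range(5)] for i in range(5)])
print(prox)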
1.3 Application scope of random forests
Random forests are used mainly for regression and classification; this section focuses on classification. A random forest is similar to bagging with decision trees as the base classifier. Tree-based bagging grows one decision tree per bootstrap resample, one tree for each resample, with no further intervention while the trees are grown. A random forest also uses bootstrap resampling, but differs from bagging in that, when each tree is grown, the splitting variable at every node is chosen only from a small random subset of variables. So not only are the samples random; the candidate variables (features) at every node are random as well.
Many studies have shown that combined classifiers outperform a single classifier. A random forest uses many classification trees to discriminate and classify the data, and while classifying it also produces an importance score for each variable (for example, each gene), assessing the role that variable plays in the classification.
2. Theory of random forests
2.1 Basic principles of random forests
Random forests were proposed by Leo Breiman (2001). Using bootstrap resampling, k samples are repeatedly drawn with replacement from the original training set N to create new training sets; from these bootstrap sets, k classification trees are grown to form the forest, and a new observation is classified according to the votes of the classification trees. In essence it is an improvement on the decision tree algorithm: many decision trees are combined, each tree is grown from an independently drawn sample, all trees in the forest have the same distribution, and the classification error depends on the strength of each individual tree and on the correlation between trees. Feature selection at each node is done by splitting on a random subset of features and comparing the resulting errors; the internal error estimate, the classification strength, and the correlation between trees determine how many features to select. A single tree's classification ability may be small, but after a large number of trees has been generated at random, a test sample can be classified by tallying the results of every tree and choosing the most likely class.
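A minimal from-scratch sketch of this principle (illustration only, not the author's code): draw bootstrap samples, grow one tree per sample with a random feature subset at each split, and classify by majority vote.

# A minimal from-scratch sketch: bootstrap samples + random feature subsets + majority vote.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

rng = np.random.RandomState(0)
k = 50                                                # number of trees in the "forest"
trees = []
for _ in range(k):
    idx = rng.randint(0, len(x_train), len(x_train))  # bootstrap sample, drawn with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    tree.fit(x_train[idx], y_train[idx])
    trees.append(tree)

# majority vote over the k trees (labels are 0/1, so rounding the mean vote works here)
votes = np.array([t.predict(x_test) for t in trees])
y_pred = np.round(votes.mean(axis=0)).astype(int)
print("hand-rolled forest accuracy on the test subset: %.3f" % (y_pred == y_test).mean())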
Random forest on medical-cosmetology installment loan data
Data
The script includes data preprocessing.
# -*- coding: utf-8 -*-
"""
Created on Sun Apr 15 13:30:16 2018
@author: Administrator
"""
from sklearn.model_selection import train_test_split
from sklearn.learning_curve import learning_curve
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# preprocessing utilities (standardization / normalization)
from sklearn import preprocessing
# cross-validated scoring for more reliable evaluation
from sklearn.cross_validation import cross_val_score
from sklearn import datasets
# classifiers
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
# data preprocessing (missing value imputation)
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
# train/test split
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from matplotlib.font_manager import FontProperties
font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)

trees = 100
# Excel file name
fileName = "data1.xlsx"
# fileName = "GermanData_total.xlsx"
# read the Excel file
df = pd.read_excel(fileName)
# features: all columns except the last
x1 = df[df.columns[:-1]]
# label: the last column
y1 = df[df.columns[-1:]]
# convert the dataframes to arrays
x1 = np.array(x1)
y1 = np.array(y1)
# flatten the label array, otherwise fitting fails
y1 = [i[0] for i in y1]
y1 = np.array(y1)
# impute missing values with the column mean
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(x1)
x1 = imp.transform(x1)

forest = RandomForestClassifier(n_estimators=trees, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x1, y1, random_state=0)
forest.fit(x_train, y_train)
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))
print('Feature importances:{}'.format(forest.feature_importances_))

feature_names = list(df.columns[:-1])
n_features = x1.shape[1]
plt.barh(range(n_features), forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features), feature_names, fontproperties=font)
plt.title("random forest with %d trees:" % trees)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()