Toby, project collaboration QQ: 231469242
A random forest is a voting mechanism built from many decision trees.
To understand random forests, you first need to understand decision trees.
Random forest is an ensemble machine learning algorithm.
Algorithms used to build the individual trees:
Information gain
Gini index
Other decision tree algorithms
Advantages of random forests:
1. Can be used for both classification and regression
2. Can handle missing values without much loss of accuracy
3. Not prone to overfitting
4. Works with high-dimensional data and large datasets
5. Tongdun (同盾), for example, uses random forests
Disadvantages of random forests:
1. Works well for classification, but regression performance is weaker
2. The algorithm is something of a black box; beyond tuning its parameters there is little you can change
3. Performs worse on high-dimensional data with few samples
4. Cannot be visualized the way a single tree can
5. Training is time-consuming and CPU-intensive
Bagging is an ensemble meta-algorithm in machine learning used to improve stability and accuracy and to reduce variance.
Boosting is an ensemble meta-algorithm in machine learning used mainly to reduce bias, and also variance, in supervised learning.
Bagging is a method for improving the accuracy of a learning algorithm: it constructs a series of predictors and then combines them into a single predictor in a fixed way. Bagging requires an "unstable" base learner (unstable meaning that small changes in the data set can cause significant changes in the resulting classifier), for example decision trees or neural networks.
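As a minimal sketch of this idea (not part of the original post's code), scikit-learn's BaggingClassifier can wrap an unstable base learner such as a decision tree; the breast cancer data used later in this post serves as the example here:

# A minimal sketch of bagging with an unstable base learner (a decision tree),
# using scikit-learn's BaggingClassifier on the breast cancer data used later in this post.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 50 decision trees, each trained on a bootstrap resample of the training data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(x_train, y_train)
print("bagging accuracy on the test subset:{:.3f}".format(bagging.score(x_test, y_test)))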
Applications of random forests
1. Credit scoring of customers by lending companies
2. Assessing drug efficacy
3. Shopping cart (market basket) analysis
4. Stock price analysis
Random forest algorithm principles
Tree splitting function
Some random forest parameters:
Maximum depth
Advantages of random forests:
Effective and widely used; the default parameters perform well; the data do not need to be normalized or regularized; the Monte Carlo-style randomization makes the ensemble perform better than a single tree.
In the Python tests below, the random forest is not only more accurate but also identifies the strong factors more reliably.
Random Forest
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 31 09:30:24 2018
@author: Toby, project collaboration QQ: 231469242
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
# n_estimators=100 sets the number of trees in the forest
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)
print("random forest:")
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))
print('Feature importances:{}'.format(forest.feature_importances_))

n_features = cancer.data.shape[1]
plt.barh(range(n_features), forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.title("Random Forest:")
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()
Decision Tree
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 27 22:59:44 2018
@author: Toby, project collaboration QQ: 231469242

radius: radius
texture: standard deviation of gray-scale values
symmetry: symmetry

Strong factors found by the decision tree:
worst radius
worst symmetry
worst texture
texture error
"""
import csv, pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pydotplus
from IPython.display import Image
import graphviz
from sklearn.tree import export_graphviz
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
featureNames = cancer.feature_names
# random_state acts as the random seed
X_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

list_average_accuracy = []
depth = range(1, 30)
for i in depth:
    # limiting max_depth reduces model complexity and can give better generalization
    tree = DecisionTreeClassifier(max_depth=i, random_state=0)
    tree.fit(X_train, y_train)
    accuracy_training = tree.score(X_train, y_train)
    accuracy_test = tree.score(x_test, y_test)
    average_accuracy = (accuracy_training + accuracy_test) / 2.0
    # print("average_accuracy:", average_accuracy)
    list_average_accuracy.append(average_accuracy)

max_value = max(list_average_accuracy)
# indices start at 0, so add 1 to get the depth
best_depth = list_average_accuracy.index(max_value) + 1
print("best_depth:", best_depth)

best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
best_tree.fit(X_train, y_train)
accuracy_training = best_tree.score(X_train, y_train)
accuracy_test = best_tree.score(x_test, y_test)
print("decision tree:")
print("accuracy on the training subset:{:.3f}".format(best_tree.score(X_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(best_tree.score(x_test, y_test)))
print('Feature importances:{}'.format(best_tree.feature_importances_))

n_features = cancer.data.shape[1]
plt.barh(range(n_features), best_tree.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.title("Decision Tree:")
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

'''
print(cancer.DESCR)
print(cancer.feature_names)
print(cancer.target_names)
print(cancer.data)
print(type(cancer.data))
print(cancer.data.shape)

# export the tree for visualization
dot_data = export_graphviz(tree, out_file="cancertree.dot", class_names=['malignant', 'benign'],
                           feature_names=cancer.feature_names, impurity=False, filled=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
# ![](cancertree.png)
'''
Testing the effect of the number of trees in the random forest
Without feature selection, the test accuracy is 0.972.
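A rough sketch of such a test (assumed setup, reusing the same breast cancer split as the script above) simply loops over several values of n_estimators and compares test accuracy:

# A minimal sketch: vary the number of trees and compare test accuracy
# on the same breast cancer split as in the script above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

for trees in [1, 5, 10, 50, 100, 500]:
    forest = RandomForestClassifier(n_estimators=trees, random_state=0)
    forest.fit(x_train, y_train)
    print("%4d trees, accuracy on the test subset: %.3f" % (trees, forest.score(x_test, y_test)))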
Feature selection with random forests
Breast cancer data after feature selection
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 31 09:30:24 2018
@author: Administrator

Random forests do not require data preprocessing.
"""
import pandas as pd
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# file to read
readFileName = "breast_cancer_變量篩選.xlsx"
trees = 10000

# read the Excel file
df = pd.read_excel(readFileName)
# features: all columns except the last
data = df[df.columns[:-1]]
# label: the last column
target = df[df.columns[-1:]]
# feature names
feature_names = list(df.columns[:-1])

x_train, x_test, y_train, y_test = train_test_split(data, target, random_state=0)
# n_estimators is the number of trees; in testing, 100 trees are already enough
forest = RandomForestClassifier(n_estimators=trees, random_state=0)
forest.fit(x_train, y_train)
print("random forest with %d trees:" % trees)
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))
print('Feature importances:{}'.format(forest.feature_importances_))

n_features = data.shape[1]
plt.barh(range(n_features), forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features), feature_names)
plt.title("random forest with %d trees:" % trees)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

'''
accuracy on the training subset:1.000
accuracy on the test subset:0.972
'''
After selecting the 10 best variables and rebuilding the model, the accuracy is actually lower than without any feature selection, which shows the strength of the forest's Monte Carlo-style randomization. The advice is not to do feature selection: keep the original variables as they are and let the random forest's own randomization handle them.
Feature selection vs. no feature selection:
With feature selection: training accuracy 1.0, test accuracy 0.958
Without feature selection: training accuracy 1.0, test accuracy 0.972
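A minimal sketch of this comparison (using scikit-learn's built-in breast cancer data rather than the author's Excel file, so the exact numbers may differ): rank the features by importance, keep the top 10, and refit.

# A minimal sketch: rank features by importance, keep the top 10, and refit.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# rank features with a forest trained on all variables
ranker = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
top10 = np.argsort(ranker.feature_importances_)[::-1][:10]
print("selected features:", cancer.feature_names[top10])

# refit using only the 10 selected columns and compare test accuracy
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train[:, top10], y_train)
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test[:, top10], y_test)))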
Tuning random forest parameters
https://www.cnblogs.com/pinard/p/6160412.html
In the earlier post on the principles of Bagging and random forests, we summarized how random forests (RF for short) work. This section looks at RF from a practical angle, focusing on the points to watch when tuning RF in scikit-learn and on how RF tuning differs from GBDT tuning.
In scikit-learn, the RF classifier is RandomForestClassifier and the regressor is RandomForestRegressor. The Extra Trees variant is also available, as ExtraTreesClassifier and ExtraTreesRegressor. Since RF and Extra Trees differ only slightly and are tuned in essentially the same way, this section focuses on tuning RF.
As with GBDT, the parameters of RF fall into two groups: the parameters of the bagging framework and the parameters of the CART decision trees. They are introduced below.
First, the parameters of RF's bagging framework. It helps to compare them with GBDT. In the post on tuning scikit-learn's gradient boosted trees (GBDT) we covered GBDT's framework parameters, which are fairly numerous: the most important are the maximum number of iterations, the learning rate, and the subsampling ratio, and tuning them takes some effort. RF is simpler, because the weak learners in a bagging framework have no dependence on each other, which reduces the difficulty of tuning. In other words, reaching a comparable tuning result takes less time with RF than with GBDT.
Now let's look at RF's important bagging framework parameters. Since RandomForestClassifier and RandomForestRegressor share almost all of their parameters, they are discussed together, with differences pointed out where they exist.
1) n_estimators: the maximum number of weak learners, i.e. the number of trees. If n_estimators is too small the model tends to underfit; if it is too large the computation cost grows, and beyond a certain point adding more trees brings only a marginal improvement, so a moderate value is usually chosen. The default is 100.
2) oob_score: whether to use out-of-bag samples to evaluate the model. The default is False. Setting it to True is recommended, because the out-of-bag score reflects the generalization ability of the fitted model.
3) criterion: the impurity criterion the CART trees use when splitting on a feature. The loss functions differ between classification and regression. For classification RF the default is the Gini index ("gini"), with information gain (entropy) as the alternative. For regression RF the default is mean squared error ("mse"), with mean absolute error ("mae") as the alternative. The defaults are usually fine.
As you can see, RF has few important framework parameters; the main one to focus on is n_estimators, the maximum number of decision trees.
Next, the decision tree parameters of RF, which are essentially the same as those of GBDT:
1) max_features, the maximum number of features considered per split: it accepts several kinds of values. The default is "auto", which considers at most √N features per split; "log2" considers at most log2(N) features; "sqrt" or "auto" considers at most √N features. An integer means an absolute number of features; a float means a fraction of the features, i.e. round(fraction × N) features, where N is the total number of features. The default "auto" is usually fine; if there are very many features, the other options can be used to limit the number of features considered per split and thus control the tree-building time.
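A small worked example (values assumed, not from the source): with N = 30 features, "sqrt"/"auto" considers about 5 features per split and "log2" about 4.

import numpy as np

N = 30                   # total number of features
print(int(np.sqrt(N)))   # "sqrt" / "auto": 5 features considered per split
print(int(np.log2(N)))   # "log2": 4 features considered per split
print(int(0.2 * N))      # a float such as 0.2 means 20% of N, i.e. 6 features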
2) max_depth, the maximum tree depth: by default it is unset, in which case the depth of the subtrees is not limited. With little data or few features this value can usually be left alone. With many samples and many features it is recommended to limit the depth; the best value depends on the data distribution, with common values between 10 and 100.
3) min_samples_split, the minimum number of samples required to split an internal node: if a node has fewer samples than min_samples_split, it will not try to find the best feature to split on. The default is 2. With small datasets this value can be left alone; with very large datasets it is recommended to increase it.
4) min_samples_leaf, the minimum number of samples in a leaf node: if a leaf ends up with fewer samples than this, it is pruned together with its sibling. The default is 1; it can be given as an integer (minimum number of samples) or as a fraction of the total sample count. With small datasets this value can be left alone; with very large datasets it is recommended to increase it.
5) min_weight_fraction_leaf, the minimum weighted fraction of samples in a leaf node: if the total sample weight in a leaf falls below this value, the leaf is pruned together with its sibling. The default is 0, i.e. weights are ignored. In general, if many samples have missing values, or the class distribution of the classification data is heavily skewed, sample weights are introduced and this value then matters.
6) max_leaf_nodes, the maximum number of leaf nodes: limiting it helps prevent overfitting. The default is "None", i.e. unlimited. If a limit is set, the algorithm builds the best tree within that number of leaves. With few features this value can be ignored; with many features it can be restricted, and a specific value can be found by cross-validation.
7) min_impurity_split, the minimum impurity required to split a node: this limits the growth of the tree; if a node's impurity (Gini index or MSE) falls below this threshold, the node is not split further and becomes a leaf. Changing the default of 1e-7 is generally not recommended.
Of the tree parameters above, the most important are max_features, max_depth, min_samples_split, and min_samples_leaf.
Here we use the same dataset as in the GBDT tuning example to illustrate RF tuning (the download link for the data is given in the original post). In this example the out-of-bag score is used to evaluate the model.
The complete code is on my GitHub: https://github.com/ljpzzz/machinelearning/blob/master/ensemble-learning/random_forest_classifier.ipynb
First, we load the required libraries:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation, metrics
import matplotlib.pylab as plt
%matplotlib inline
Next, we load the unzipped data with the code below and take a look at the class distribution.
train = pd.read_csv('train_modified.csv')
target = 'Disbursed'  # Disbursed is the binary classification target
IDcol = 'ID'
train['Disbursed'].value_counts()
The class counts are shown below; class 0 makes up the large majority.
0 19680
1 320
Name: Disbursed, dtype: int64
Next we pick the feature columns and the target column.
x_columns = [x for x in train.columns if x not in [target, IDcol]]
X = train[x_columns]
y = train['Disbursed']
With every parameter left at its default, let's fit the data and see:
rf0 = RandomForestClassifier(oob_score=True, random_state=10)
rf0.fit(X, y)
print(rf0.oob_score_)
y_predprob = rf0.predict_proba(X)[:, 1]
print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
The output is below. The out-of-bag score is already high, and so is the AUC. Compared with GBDT's default-parameter output, RF's defaults fit this example somewhat better.
0.98005
AUC Score (Train): 0.999833
We first grid-search over n_estimators:
param_test1 = {'n_estimators': range(10, 71, 10)}
gsearch1 = GridSearchCV(estimator=RandomForestClassifier(min_samples_split=100,
                                                         min_samples_leaf=20, max_depth=8,
                                                         max_features='sqrt', random_state=10),
                        param_grid=param_test1, scoring='roc_auc', cv=5)
gsearch1.fit(X, y)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
The output is as follows:
([mean: 0.80681, std: 0.02236, params: {'n_estimators': 10},
mean: 0.81600, std: 0.03275, params: {'n_estimators': 20},
mean: 0.81818, std: 0.03136, params: {'n_estimators': 30},
mean: 0.81838, std: 0.03118, params: {'n_estimators': 40},
mean: 0.82034, std: 0.03001, params: {'n_estimators': 50},
mean: 0.82113, std: 0.02966, params: {'n_estimators': 60},
mean: 0.81992, std: 0.02836, params: {'n_estimators': 70}],
{'n_estimators': 60},
0.8211334476626017)
This gives us the best number of weak learner iterations. Next we grid-search over the maximum tree depth max_depth and the minimum number of samples required to split an internal node, min_samples_split.
param_test2 = {'max_depth': range(3, 14, 2), 'min_samples_split': range(50, 201, 20)}
gsearch2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60,
                                                         min_samples_leaf=20, max_features='sqrt',
                                                         oob_score=True, random_state=10),
                        param_grid=param_test2, scoring='roc_auc', iid=False, cv=5)
gsearch2.fit(X, y)
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
The output is:
([mean: 0.79379, std: 0.02347, params: {'min_samples_split': 50, 'max_depth': 3},
mean: 0.79339, std: 0.02410, params: {'min_samples_split': 70, 'max_depth': 3},
mean: 0.79350, std: 0.02462, params: {'min_samples_split': 90, 'max_depth': 3},
mean: 0.79367, std: 0.02493, params: {'min_samples_split': 110, 'max_depth': 3},
mean: 0.79387, std: 0.02521, params: {'min_samples_split': 130, 'max_depth': 3},
mean: 0.79373, std: 0.02524, params: {'min_samples_split': 150, 'max_depth': 3},
mean: 0.79378, std: 0.02532, params: {'min_samples_split': 170, 'max_depth': 3},
mean: 0.79349, std: 0.02542, params: {'min_samples_split': 190, 'max_depth': 3},
mean: 0.80960, std: 0.02602, params: {'min_samples_split': 50, 'max_depth': 5},
mean: 0.80920, std: 0.02629, params: {'min_samples_split': 70, 'max_depth': 5},
mean: 0.80888, std: 0.02522, params: {'min_samples_split': 90, 'max_depth': 5},
mean: 0.80923, std: 0.02777, params: {'min_samples_split': 110, 'max_depth': 5},
mean: 0.80823, std: 0.02634, params: {'min_samples_split': 130, 'max_depth': 5},
mean: 0.80801, std: 0.02637, params: {'min_samples_split': 150, 'max_depth': 5},
mean: 0.80792, std: 0.02685, params: {'min_samples_split': 170, 'max_depth': 5},
mean: 0.80771, std: 0.02587, params: {'min_samples_split': 190, 'max_depth': 5},
mean: 0.81688, std: 0.02996, params: {'min_samples_split': 50, 'max_depth': 7},
mean: 0.81872, std: 0.02584, params: {'min_samples_split': 70, 'max_depth': 7},
mean: 0.81501, std: 0.02857, params: {'min_samples_split': 90, 'max_depth': 7},
mean: 0.81476, std: 0.02552, params: {'min_samples_split': 110, 'max_depth': 7},
mean: 0.81557, std: 0.02791, params: {'min_samples_split': 130, 'max_depth': 7},
mean: 0.81459, std: 0.02905, params: {'min_samples_split': 150, 'max_depth': 7},
mean: 0.81601, std: 0.02808, params: {'min_samples_split': 170, 'max_depth': 7},
mean: 0.81704, std: 0.02757, params: {'min_samples_split': 190, 'max_depth': 7},
mean: 0.82090, std: 0.02665, params: {'min_samples_split': 50, 'max_depth': 9},
mean: 0.81908, std: 0.02527, params: {'min_samples_split': 70, 'max_depth': 9},
mean: 0.82036, std: 0.02422, params: {'min_samples_split': 90, 'max_depth': 9},
mean: 0.81889, std: 0.02927, params: {'min_samples_split': 110, 'max_depth': 9},
mean: 0.81991, std: 0.02868, params: {'min_samples_split': 130, 'max_depth': 9},
mean: 0.81788, std: 0.02436, params: {'min_samples_split': 150, 'max_depth': 9},
mean: 0.81898, std: 0.02588, params: {'min_samples_split': 170, 'max_depth': 9},
mean: 0.81746, std: 0.02716, params: {'min_samples_split': 190, 'max_depth': 9},
mean: 0.82395, std: 0.02454, params: {'min_samples_split': 50, 'max_depth': 11},
mean: 0.82380, std: 0.02258, params: {'min_samples_split': 70, 'max_depth': 11},
mean: 0.81953, std: 0.02552, params: {'min_samples_split': 90, 'max_depth': 11},
mean: 0.82254, std: 0.02366, params: {'min_samples_split': 110, 'max_depth': 11},
mean: 0.81950, std: 0.02768, params: {'min_samples_split': 130, 'max_depth': 11},
mean: 0.81887, std: 0.02636, params: {'min_samples_split': 150, 'max_depth': 11},
mean: 0.81910, std: 0.02734, params: {'min_samples_split': 170, 'max_depth': 11},
mean: 0.81564, std: 0.02622, params: {'min_samples_split': 190, 'max_depth': 11},
mean: 0.82291, std: 0.02092, params: {'min_samples_split': 50, 'max_depth': 13},
mean: 0.82177, std: 0.02513, params: {'min_samples_split': 70, 'max_depth': 13},
mean: 0.82415, std: 0.02480, params: {'min_samples_split': 90, 'max_depth': 13},
mean: 0.82420, std: 0.02417, params: {'min_samples_split': 110, 'max_depth': 13},
mean: 0.82209, std: 0.02481, params: {'min_samples_split': 130, 'max_depth': 13},
mean: 0.81852, std: 0.02227, params: {'min_samples_split': 150, 'max_depth': 13},
mean: 0.81955, std: 0.02885, params: {'min_samples_split': 170, 'max_depth': 13},
mean: 0.82092, std: 0.02600, params: {'min_samples_split': 190, 'max_depth': 13}],
{'max_depth': 13, 'min_samples_split': 110},
0.8242016800050813)
Let's check the current model's out-of-bag score:
rf1 = RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=110,
                             min_samples_leaf=20, max_features='sqrt',
                             oob_score=True, random_state=10)
rf1.fit(X, y)
print(rf1.oob_score_)
The output is:
0.984
The out-of-bag score has improved somewhat; that is, the model's generalization ability has increased.
We cannot fix min_samples_split yet, because it interacts with the other tree parameters. Next we tune min_samples_split together with the minimum number of samples per leaf, min_samples_leaf.
param_test3 = {'min_samples_split': range(80, 150, 20), 'min_samples_leaf': range(10, 60, 10)}
gsearch3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60, max_depth=13,
                                                         max_features='sqrt',
                                                         oob_score=True, random_state=10),
                        param_grid=param_test3, scoring='roc_auc', iid=False, cv=5)
gsearch3.fit(X, y)
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_
The output is:
([mean: 0.82093, std: 0.02287, params: {'min_samples_split': 80, 'min_samples_leaf': 10},
mean: 0.81913, std: 0.02141, params: {'min_samples_split': 100, 'min_samples_leaf': 10},
mean: 0.82048, std: 0.02328, params: {'min_samples_split': 120, 'min_samples_leaf': 10},
mean: 0.81798, std: 0.02099, params: {'min_samples_split': 140, 'min_samples_leaf': 10},
mean: 0.82094, std: 0.02535, params: {'min_samples_split': 80, 'min_samples_leaf': 20},
mean: 0.82097, std: 0.02327, params: {'min_samples_split': 100, 'min_samples_leaf': 20},
mean: 0.82487, std: 0.02110, params: {'min_samples_split': 120, 'min_samples_leaf': 20},
mean: 0.82169, std: 0.02406, params: {'min_samples_split': 140, 'min_samples_leaf': 20},
mean: 0.82352, std: 0.02271, params: {'min_samples_split': 80, 'min_samples_leaf': 30},
mean: 0.82164, std: 0.02381, params: {'min_samples_split': 100, 'min_samples_leaf': 30},
mean: 0.82070, std: 0.02528, params: {'min_samples_split': 120, 'min_samples_leaf': 30},
mean: 0.82141, std: 0.02508, params: {'min_samples_split': 140, 'min_samples_leaf': 30},
mean: 0.82278, std: 0.02294, params: {'min_samples_split': 80, 'min_samples_leaf': 40},
mean: 0.82141, std: 0.02547, params: {'min_samples_split': 100, 'min_samples_leaf': 40},
mean: 0.82043, std: 0.02724, params: {'min_samples_split': 120, 'min_samples_leaf': 40},
mean: 0.82162, std: 0.02348, params: {'min_samples_split': 140, 'min_samples_leaf': 40},
mean: 0.82225, std: 0.02431, params: {'min_samples_split': 80, 'min_samples_leaf': 50},
mean: 0.82225, std: 0.02431, params: {'min_samples_split': 100, 'min_samples_leaf': 50},
mean: 0.81890, std: 0.02458, params: {'min_samples_split': 120, 'min_samples_leaf': 50},
mean: 0.81917, std: 0.02528, params: {'min_samples_split': 140, 'min_samples_leaf': 50}],
{'min_samples_leaf': 20, 'min_samples_split': 120},
0.8248650279471544)
Finally, we tune the maximum number of features, max_features:
param_test4 = {'max_features': range(3, 11, 2)}
gsearch4 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=60, max_depth=13,
                                                         min_samples_split=120, min_samples_leaf=20,
                                                         oob_score=True, random_state=10),
                        param_grid=param_test4, scoring='roc_auc', iid=False, cv=5)
gsearch4.fit(X, y)
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_
The output is:
([mean: 0.81981, std: 0.02586, params: {'max_features': 3},
mean: 0.81639, std: 0.02533, params: {'max_features': 5},
mean: 0.82487, std: 0.02110, params: {'max_features': 7},
mean: 0.81704, std: 0.02209, params: {'max_features': 9}],
{'max_features': 7},
0.8248650279471544)
Using the best parameters found by the search, let's look at the final model fit:
rf2 = RandomForestClassifier(n_estimators=60, max_depth=13, min_samples_split=120,
                             min_samples_leaf=20, max_features=7,
                             oob_score=True, random_state=10)
rf2.fit(X, y)
print(rf2.oob_score_)
This time the output is:
0.984
The out-of-bag score has essentially not improved, mainly because 0.984 is already a very high out-of-bag score; to improve the model's generalization further we would need more data.
That concludes this summary of RF tuning; I hope it helps.
Application: a credit scoring system
http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
account balance
duration of credit
Data Set Information:
Two datasets are provided. the original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".
For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric". This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the form used by StatLog.
This dataset requires use of a cost matrix (see below)
        1    2
  1     0    1
  2     5    0
(1 = Good, 2 = Bad)
The rows represent the actual classification and the columns the predicted classification.
It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
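As a minimal sketch (the helper below is hypothetical, not part of the UCI files), the cost matrix can be turned into an evaluation function that charges 5 for classifying a bad customer as good and 1 for classifying a good customer as bad:

import numpy as np
from sklearn.metrics import confusion_matrix

def average_cost(y_true, y_pred):
    # rows = actual class, columns = predicted class, with labels 1 = good, 2 = bad
    cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
    cost = np.array([[0, 1],    # good classified as bad costs 1
                     [5, 0]])   # bad classified as good costs 5
    return (cm * cost).sum() / float(cm.sum())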
Attribute Information:
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : .. >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customers name
Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no
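A minimal sketch of fitting a random forest to this data (assuming german.data-numeric has been downloaded from the UCI page above; it is whitespace-separated with 24 numeric columns followed by the 1/2 label):

# A minimal sketch, assuming german.data-numeric is in the working directory.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

german = pd.read_csv("german.data-numeric", delim_whitespace=True, header=None)
X = german.iloc[:, :-1]
y = german.iloc[:, -1]          # 1 = good, 2 = bad
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x_train, y_train)
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))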
Random forests are insensitive to multicollinearity, are fairly robust to missing and unbalanced data, and can model the effect of up to several thousand explanatory variables well (Breiman 2001b); they have been called one of the best algorithms currently available (Iverson et al. 2008).
In experiments you will find that when two features are linearly correlated, using only one of them gives essentially the same result as using both, i.e. random forests are insensitive to multicollinearity.
In regression analysis, multicollinearity among the predictors often severely distorts the parameter estimates, inflates the model error, and undermines the model's robustness, so removing multicollinearity is an important step in parameter estimation. Common regression models for handling multicollinearity in multiple linear regression include ridge regression, principal component regression (PCR), and partial least squares regression (PLS).
Logistic regression is very sensitive to multicollinearity among the predictors and expects them to be mutually independent; random forests need no such assumption.
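A minimal synthetic check of this insensitivity (not from the source): append an exact copy of one breast cancer feature and compare test accuracy with and without the duplicated column.

# A minimal synthetic check: duplicate one feature and compare test accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_dup = np.hstack([cancer.data, cancer.data[:, [0]]])   # append an exact copy of the first column

for name, data in [("original", cancer.data), ("with duplicated column", X_dup)]:
    x_train, x_test, y_train, y_test = train_test_split(data, cancer.target, random_state=0)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
    print("%s: accuracy on the test subset %.3f" % (name, forest.score(x_test, y_test)))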
1. Advantages of random forests:
Random forests estimate the model error with the out-of-bag (OOB) error. For classification the error is the misclassification rate; for regression it is the variance of the residuals. Each classification tree in the forest is grown on a bootstrap resample (drawn with replacement) of the original records, and each resample leaves out roughly 1/3 of the records (Liaw, 2012); the left-out records naturally form a hold-out set. A random forest therefore does not need a separate validation set for cross-validation: the algorithm itself behaves like cross-validation, and the OOB error is an unbiased estimate of the prediction error (Breiman, 2001).
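A minimal sketch of using the out-of-bag score in scikit-learn (no separate validation split is needed):

# A minimal sketch: the out-of-bag score needs no separate validation split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(cancer.data, cancer.target)
print("out-of-bag accuracy: %.3f" % forest.oob_score_)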
2. Disadvantages of random forests:
Random forests are a relatively new machine learning model. The classic model is the neural network, with more than half a century of history; neural networks predict accurately but are computationally heavy. In the 1980s Breiman and colleagues invented the classification tree algorithm (Breiman et al. 1984), which classifies or regresses by repeatedly splitting the data in two, greatly reducing the computation. In 2001 Breiman combined classification trees into the random forest (Breiman 2001a): the use of variables (columns) and of records (rows) is randomized to generate many classification trees, whose results are then aggregated. Random forests improve prediction accuracy without a significant increase in computation; they are insensitive to multicollinearity, fairly robust to missing and unbalanced data, and can model the effect of up to several thousand explanatory variables well (Breiman 2001b), and have been called one of the best algorithms currently available (Iverson et al. 2008).
As the name suggests, a random forest builds a forest in a random way. The forest is made up of many decision trees, and the individual trees are independent of one another. Once the forest is built, when a new sample arrives, every tree in the forest classifies it; the class chosen by the most trees becomes the prediction (for a classification task).
1.2 Advantages of random forests
Random forests are a popular algorithm at the moment, with many advantages:
a. They perform well on many datasets; the two sources of randomness make random forests resistant to overfitting.
b. On many current datasets they have a clear advantage over other algorithms; the two sources of randomness give random forests good noise tolerance.
c. They can handle very high-dimensional data (many features) without feature selection and adapt well to different datasets: they handle both discrete and continuous features, and the data need not be normalized.
d. They can produce a proximity matrix P = (p_ij) measuring the similarity between samples: p_ij = a_ij / N, where a_ij is the number of times samples i and j land in the same leaf node and N is the number of trees in the forest (see the sketch after this list).
e. While the forest is built, an unbiased estimate of the generalization error is obtained.
f. Training is fast, and a variable importance ranking is produced (two kinds: the increase in OOB misclassification rate, and the decrease in Gini impurity at splits).
g. Interactions between features can be detected during training.
h. They are easy to parallelize.
i. They are relatively simple to implement.
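A minimal sketch of the proximity matrix from point d (computed here with forest.apply, which returns the leaf index of every sample in every tree):

# A minimal sketch of the proximity matrix: p_ij is the fraction of trees in which
# samples i and j fall into the same leaf.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(cancer.data, cancer.target)

leaves = forest.apply(cancer.data)   # shape (n_samples, n_trees): leaf index of each sample in each tree
# proximity among the first 5 samples
prox = np.array([[np.mean(leaves[i] == leaves[j]) for j in range(5)] for i in range(5)])
print(prox)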
1.3 Application scope of random forests
Random forests are used mainly for regression and classification; this section focuses on classification. A random forest is similar to bagging with decision trees as the base classifier. Tree-based bagging grows one decision tree per bootstrap resample, one tree for each resample, with no further intervention while the trees are grown. A random forest also uses bootstrap resampling, but differs from bagging in that, when each tree is grown, the splitting variable at every node is chosen only from a small random subset of variables. So not only are the samples random; the candidate variables (features) at every node are random as well.
Many studies have shown that combined classifiers outperform a single classifier. A random forest uses many classification trees to discriminate and classify the data, and while classifying it also produces an importance score for each variable (for example, each gene), assessing the role that variable plays in the classification.
2. Theory of random forests
2.1 Basic principles of random forests
Random forests were proposed by Leo Breiman (2001). Using bootstrap resampling, k samples are repeatedly drawn with replacement from the original training set N to create new training sets; from these bootstrap sets, k classification trees are grown to form the forest, and a new observation is classified according to the votes of the classification trees. In essence it is an improvement on the decision tree algorithm: many decision trees are combined, each tree is grown from an independently drawn sample, all trees in the forest have the same distribution, and the classification error depends on the strength of each individual tree and on the correlation between trees. Feature selection at each node is done by splitting on a random subset of features and comparing the resulting errors; the internal error estimate, the classification strength, and the correlation between trees determine how many features to select. A single tree's classification ability may be small, but after a large number of trees has been generated at random, a test sample can be classified by tallying the results of every tree and choosing the most likely class.
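A minimal from-scratch sketch of this principle (illustration only, not the author's code): draw bootstrap samples, grow one tree per sample with a random feature subset at each split, and classify by majority vote.

# A minimal from-scratch sketch: bootstrap samples + random feature subsets + majority vote.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

rng = np.random.RandomState(0)
k = 50                                                # number of trees in the "forest"
trees = []
for _ in range(k):
    idx = rng.randint(0, len(x_train), len(x_train))  # bootstrap sample, drawn with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    tree.fit(x_train[idx], y_train[idx])
    trees.append(tree)

# majority vote over the k trees (labels are 0/1, so rounding the mean vote works here)
votes = np.array([t.predict(x_test) for t in trees])
y_pred = np.round(votes.mean(axis=0)).astype(int)
print("hand-rolled forest accuracy on the test subset: %.3f" % (y_pred == y_test).mean())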
Random forest on medical-cosmetology installment loan data
Data
The script includes data preprocessing.
# -*- coding: utf-8 -*-
"""
Created on Sun Apr 15 13:30:16 2018
@author: Administrator
"""
from sklearn.model_selection import train_test_split
from sklearn.learning_curve import learning_curve
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# preprocessing utilities (standardization / normalization)
from sklearn import preprocessing
# cross-validated scoring for more reliable evaluation
from sklearn.cross_validation import cross_val_score
from sklearn import datasets
# classifiers
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
# data preprocessing (missing value imputation)
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
# train/test split
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from matplotlib.font_manager import FontProperties
font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)

trees = 100
# Excel file name
fileName = "data1.xlsx"
# fileName = "GermanData_total.xlsx"
# read the Excel file
df = pd.read_excel(fileName)
# features: all columns except the last
x1 = df[df.columns[:-1]]
# label: the last column
y1 = df[df.columns[-1:]]
# convert the dataframes to arrays
x1 = np.array(x1)
y1 = np.array(y1)
# flatten the label array, otherwise fitting fails
y1 = [i[0] for i in y1]
y1 = np.array(y1)
# impute missing values with the column mean
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(x1)
x1 = imp.transform(x1)

forest = RandomForestClassifier(n_estimators=trees, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x1, y1, random_state=0)
forest.fit(x_train, y_train)
print("accuracy on the training subset:{:.3f}".format(forest.score(x_train, y_train)))
print("accuracy on the test subset:{:.3f}".format(forest.score(x_test, y_test)))
print('Feature importances:{}'.format(forest.feature_importances_))

feature_names = list(df.columns[:-1])
n_features = x1.shape[1]
plt.barh(range(n_features), forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features), feature_names, fontproperties=font)
plt.title("random forest with %d trees:" % trees)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()