Machine Learning Algorithms --- Pruning (decision trees) & Random Forest Algorithm

Date: 2019-11-10

Tags: machine learning algorithms, pruning, decision trees, random forest algorithm


1、Table of Contents

　　In previous articles we introduced decision tree algorithms. They have one major drawback, however: they overfit very easily. To address this, people came up with the approach of pruning decision trees.

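　　To improve the accuracy of the decision tree algorithm and avoid overfitting, people also tried ensembling it: several trees make the decision together, voting on the final result for classification problems and averaging the results for regression problems. The main topics of this article are listed below.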

Pruning (decision trees)
What is Random forest algorithm?
Why Random Forest algorithm?
How Random Forest algorithm works?
Advantages of Random Forest algorithm.
Random Forest algorithm real life example.

　This article draws mainly on the following sources; interested readers can go and read the originals:

　　1. Pruning (decision trees) and Random Forest algorithm on Wikipedia.

　　2. "HOW THE RANDOM FOREST ALGORITHM WORKS IN MACHINE LEARNING" on Dataaspirant.

　　3. "How Random Forest Algorithm Works in Machine Learning" on Medium.

　Readers are also encouraged to read "The Random Forest Algorithm", because it explains the important, commonly used parameters of the Random Forest algorithm in scikit-learn.

2、Pruning(decision trees)

There are two approaches to avoiding overfitting in building decision trees:

Pre-pruning, which stops growing the tree early, before it perfectly classifies the training set.
Post-pruning, which allows the tree to perfectly classify the training set and then prunes it afterwards.

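Pre-pruning: while the decision tree is being built, once a node satisfies a given condition (for example, the tree depth reaches a preset value, or the number of samples at the node is less than or equal to some number), no further subtree is built under it. This is why it is also called early stopping.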
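The stopping conditions described above can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the names `build_tree`, `max_depth`, and `min_samples`, the one-feature toy data, and the fewest-errors split rule are all assumptions made for illustration.

```python
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def leaf_errors(rows):
    """Number of rows a majority-vote leaf would misclassify."""
    labels = [label for _, label in rows]
    return len(labels) - Counter(labels).most_common(1)[0][1]

def build_tree(rows, depth=0, max_depth=2, min_samples=2):
    """Grow a tree on (x, label) rows, stopping early (pre-pruning)."""
    labels = [label for _, label in rows]
    # Pre-pruning: stop when the node is pure, too deep, or too small.
    if len(set(labels)) == 1 or depth >= max_depth or len(rows) <= min_samples:
        return {"leaf": majority(labels)}

    # Try the midpoint between each pair of neighbouring x values
    # and keep the split that leaves the fewest training errors.
    xs = sorted({x for x, _ in rows})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for r in rows if r[0] <= t]
        right = [r for r in rows if r[0] > t]
        errors = leaf_errors(left) + leaf_errors(right)
        if best is None or errors < best[0]:
            best = (errors, t, left, right)
    if best is None:  # every x is identical, so there is nothing to split on
        return {"leaf": majority(labels)}

    _, t, left, right = best
    return {
        "threshold": t,
        "left": build_tree(left, depth + 1, max_depth, min_samples),
        "right": build_tree(right, depth + 1, max_depth, min_samples),
    }

def depth_of(tree):
    """Depth of the tree; a single leaf has depth 0."""
    if "leaf" in tree:
        return 0
    return 1 + max(depth_of(tree["left"]), depth_of(tree["right"]))

# Toy data with alternating labels, which an unrestricted tree must memorise.
rows = [(x, "a" if x % 2 else "b") for x in range(1, 9)]
shallow = build_tree(rows, max_depth=2)               # pre-pruned
deep = build_tree(rows, max_depth=10, min_samples=1)  # effectively unrestricted
print(depth_of(shallow), depth_of(deep))
```

With the depth limit in place the tree stays small and no longer fits every training point, which is the trade-off pre-pruning makes.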

Post-pruning: first build the complete decision tree, then use some algorithm to turn certain non-leaf nodes into leaf nodes (that is, discard the subtree under each such node).

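Because pre-pruning is relatively simple, it is not covered in detail here. Instead, this section introduces cost complexity pruning, a method for choosing which node to turn into a leaf node (the description comes from Wikipedia). There are many other methods as well; they are not covered one by one here, and readers can look them up on their own.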
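As a rough illustration of cost complexity pruning, the sketch below follows the criterion given in the Wikipedia article: at each step, collapse the subtree whose removal adds the least training error per leaf removed. The dictionary layout and the names `errors_as_leaf`, `cost_per_leaf`, and `prune_weakest` are assumptions for this example, and raw error counts are used instead of error rates, which does not change which node is chosen because the dataset size is a common factor.

```python
def is_leaf(node):
    return "children" not in node

def count_leaves(node):
    """Number of leaves in the subtree rooted at node."""
    return 1 if is_leaf(node) else sum(count_leaves(c) for c in node["children"])

def subtree_errors(node):
    """Training errors made by the subtree rooted at node."""
    if is_leaf(node):
        return node["errors_as_leaf"]
    return sum(subtree_errors(c) for c in node["children"])

def internal_nodes(node):
    """Yield every non-leaf node in the tree."""
    if not is_leaf(node):
        yield node
        for child in node["children"]:
            yield from internal_nodes(child)

def cost_per_leaf(node):
    """Extra training errors per leaf removed if node is collapsed into a leaf."""
    extra_errors = node["errors_as_leaf"] - subtree_errors(node)
    return extra_errors / (count_leaves(node) - 1)

def prune_weakest(tree):
    """Collapse the internal node with the smallest cost per removed leaf."""
    target = min(internal_nodes(tree), key=cost_per_leaf)
    del target["children"]  # the node is now a leaf
    return target["name"]

# Hypothetical tree: errors_as_leaf is the number of training mistakes the
# node would make if it predicted its majority class directly.
tree = {
    "name": "root", "errors_as_leaf": 10,
    "children": [
        {"name": "A", "errors_as_leaf": 4, "children": [
            {"name": "A1", "errors_as_leaf": 1},
            {"name": "A2", "errors_as_leaf": 2},
        ]},
        {"name": "B", "errors_as_leaf": 5, "children": [
            {"name": "B1", "errors_as_leaf": 0},
            {"name": "B2", "errors_as_leaf": 1},
        ]},
    ],
}

first = prune_weakest(tree)   # A costs (4 - 3) / 1 = 1 extra error per leaf
second = prune_weakest(tree)  # root now costs (10 - 5) / 2 = 2.5, B costs 4
print(first, second)          # prints: A root
```

Repeating this step produces a sequence of progressively smaller trees, from which one is then chosen, typically by its accuracy on held-out data. scikit-learn offers a related minimal cost-complexity pruning through the `ccp_alpha` parameter of its decision tree estimators.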

3、What is Random Forest algorithm?

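　　Most introductions to the Random Forest algorithm use much the same example, so rather than invent a new one, this section follows the introduction from one such article, translated and adapted with some local flavor:

　　Suppose a student named Xiao Ming wants to travel this summer but does not know where to go, so he asks his good friend Xiao Gang for advice. Xiao Gang asks him some questions, such as where he has been before and what weather he wants at the destination, and then uses the answers to give Xiao Ming a recommendation. A decision tree works on the same idea: it builds rules from the feature values of the sample data, then uses those rules to make decisions about new data.

　　But Xiao Ming feels that one person's advice may be one-sided, so he asks several other friends as well. They also ask him questions, some the same as Xiao Gang's and some different, and each gives a recommendation. Putting the recommendations together, Xiao Ming finds that 5 friends suggest Xi'an, 3 suggest Chongqing, and 2 suggest Chengdu, so he decides to visit Xi'an this summer. The Random Forest algorithm works the same way: many trees build different rules from random features of random samples, each tree produces its own result for new data, and the final result combines them (a vote for classification, an average for regression).

　　Wikipedia defines random forests as follows: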

　　Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
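The two ways of combining tree outputs named in the definition, the mode for classification and the mean for regression, can be illustrated with the friends' recommendations from the example above. The function names and the regression numbers are made up for this sketch.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(predictions):
    """Return the most frequent class among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Return the average of the individual predictions."""
    return mean(predictions)

# Ten friends (trees) each recommend one city.
votes = ["Xi'an"] * 5 + ["Chongqing"] * 3 + ["Chengdu"] * 2
print(aggregate_classification(votes))         # prints: Xi'an

# Three hypothetical trees each predict a number.
print(aggregate_regression([2.0, 3.0, 4.0]))   # prints: 3.0
```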

4、Why Random Forest algorithm?

　　The main reasons are as follows:

The same random forest algorithm can be used for both classification and regression tasks.
A random forest classifier can handle missing values.
Adding more trees to the forest does not by itself cause the random forest classifier to overfit.
A random forest classifier can also model categorical values.

5、How Random Forest algorithm works?

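　　The process of building a random forest is shown in the figure below:

　　To build a random forest of three trees from the Dataset in the figure on the left, proceed as follows: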

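　　　　step1: From the many features in the Dataset, randomly select 5 features, then randomly select j samples.

　　　　step2: Build a decision tree from this data.

　　　　step3: Repeat step1 and step2 until the forest contains the required number of trees.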
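The random selection in step1 can be sketched as follows. The function name `draw_training_subset` and the toy dataset are assumptions. Drawing rows with replacement (bootstrap sampling) is also an assumption, since the steps above do not say how the samples are drawn, and the toy data has only 4 features, so 2 are selected instead of 5.

```python
import random

def draw_training_subset(rows, feature_names, n_features, n_samples, rng):
    """Pick a random feature subset and a random sample of rows for one tree."""
    features = rng.sample(feature_names, n_features)  # distinct features
    sampled = rng.choices(rows, k=n_samples)          # with replacement (assumption)
    keep = features + ["label"]
    return features, [{key: row[key] for key in keep} for row in sampled]

# Hypothetical toy dataset: 4 features plus a label.
feature_names = ["age", "income", "height", "weight"]
rows = [
    {"age": 25, "income": 30, "height": 170, "weight": 60, "label": "no"},
    {"age": 40, "income": 80, "height": 165, "weight": 70, "label": "yes"},
    {"age": 35, "income": 55, "height": 180, "weight": 85, "label": "yes"},
    {"age": 22, "income": 20, "height": 175, "weight": 65, "label": "no"},
    {"age": 50, "income": 90, "height": 160, "weight": 75, "label": "yes"},
]

rng = random.Random(42)  # fixed seed so the run is repeatable
# One (features, sample) pair per tree in a three-tree forest.
subsets = [draw_training_subset(rows, feature_names, 2, 4, rng) for _ in range(3)]
for features, sample in subsets:
    print(features, len(sample))
```

Each of the three subsets would then be passed to a tree builder (step2), so every tree sees a different view of the data.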

　　So the general algorithm for building a Random Forest is as follows:

　　　　1. Randomly select "K" features from the total "m" features, where K << m, then randomly select "J" samples from the total "n" samples.

　　　　2. Among the "K" features of the "J" samples, calculate the node "d" using the best split point.

　　　　3. Split the node into daughter nodes using the best split.

　　　　4. Repeat steps 1 to 3 until "l" number of nodes has been reached.

　　　　5. Build the forest by repeating steps 1 to 4 "q" times to create "q" trees.
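The five steps above can be sketched in plain Python as shown below. This is one possible reading of the steps, not a definitive implementation: it draws the "J" samples once per tree (with replacement), picks "K" random features at every node, uses Gini impurity to find the best split, and treats "l" as a cap on the number of nodes per tree. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, features):
    """Return (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini([r[-1] for r in rows])
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:  # skip max so both sides are non-empty
            left = [r[-1] for r in rows if r[f] <= t]
            right = [r[-1] for r in rows if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, n_features, k, max_nodes, rng):
    """Steps 2 to 4: split nodes breadth-first until max_nodes is reached."""
    root = {"rows": rows}
    queue, n_nodes = [root], 1
    while queue:
        node = queue.pop(0)
        rows_here = node.pop("rows")
        node["label"] = majority([r[-1] for r in rows_here])
        split = None
        if n_nodes + 2 <= max_nodes:  # room left for two daughter nodes
            split = best_split(rows_here, rng.sample(range(n_features), k))
        if split is None:
            continue  # node stays a leaf
        f, t = split
        node["feature"], node["threshold"] = f, t
        node["left"] = {"rows": [r for r in rows_here if r[f] <= t]}
        node["right"] = {"rows": [r for r in rows_here if r[f] > t]}
        queue += [node["left"], node["right"]]
        n_nodes += 2
    return root

def build_forest(rows, k, j, max_nodes, q, seed=0):
    """Steps 1 and 5: draw j samples per tree and grow q trees."""
    rng = random.Random(seed)
    n_features = len(rows[0]) - 1  # last column is the label
    return [grow_tree(rng.choices(rows, k=j), n_features, k, max_nodes, rng)
            for _ in range(q)]

def count_nodes(node):
    if "feature" not in node:
        return 1
    return 1 + count_nodes(node["left"]) + count_nodes(node["right"])

def predict_tree(node, x):
    """Follow the split rules down to a leaf and return its label."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

# Hypothetical toy data: three numeric features, then the label.
rows = [
    (1.0, 5.0, 0.2, "a"), (1.5, 4.0, 0.9, "a"), (2.0, 6.0, 0.4, "a"), (2.5, 5.5, 0.7, "a"),
    (6.0, 1.0, 0.3, "b"), (6.5, 2.0, 0.8, "b"), (7.0, 1.5, 0.1, "b"), (7.5, 0.5, 0.6, "b"),
]
forest = build_forest(rows, k=2, j=8, max_nodes=7, q=5)
print([count_nodes(tree) for tree in forest])
```

Because each tree is trained on a different random sample and considers different random features at each split, the trees in the forest generally end up with different rules.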

　　The steps for using the Random Forest classifier are as follows:

　　　　1. Take the test features and use the rules of each randomly created decision tree to predict the outcome, storing each predicted outcome (target).

　　　　2. Calculate the votes for each predicted target.

　　　　3. Take the predicted target with the most votes as the final prediction of the random forest algorithm.
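The three prediction steps can be sketched as below. To keep the example short, the three "trees" are hand-written rule functions standing in for trained decision trees; their names, rules, and the test features are all hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins for three trained decision trees.
def tree_1(x):
    return "yes" if x["age"] > 30 else "no"

def tree_2(x):
    return "yes" if x["income"] > 50 else "no"

def tree_3(x):
    return "yes" if x["owns_home"] or x["age"] > 33 else "no"

def forest_predict(trees, x):
    predictions = [tree(x) for tree in trees]  # step 1: one prediction per tree
    votes = Counter(predictions)               # step 2: count votes per target
    winner, _ = votes.most_common(1)[0]        # step 3: highest-voted target
    return winner, votes

test_features = {"age": 35, "income": 20, "owns_home": False}
winner, votes = forest_predict([tree_1, tree_2, tree_3], test_features)
print(winner, dict(votes))  # prints: yes {'yes': 2, 'no': 1}
```

For a regression forest, step 3 would average the stored predictions instead of taking the most common one.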