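Spark 2.0 Machine Learning Series, Part 6: GBDT (Gradient Boosting Decision Tree), Differences Between GBDT and Random Forests, Parameter Tuning, and Scikit Code Analysis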

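Concept overview

Other names for GBDT

GBDT stands for Gradient Boosting Decision Tree.
The algorithm goes by several other names, such as MART (Multiple Additive Regression Trees), GBRT (Gradient Boosted Regression Trees), and TreeNet. They all refer to the same thing (see Wikipedia, "Gradient boosting"), and the inventor is Friedman.
Anyone studying GBDT should read Friedman's paper "Greedy Function Approximation: A Gradient Boosting Machine", which presents the discussion and the derivations more systematically.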

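What is gradient boosting?

GB (Gradient Boosting)
GB is really an algorithmic framework: an existing classification or regression algorithm can be plugged into it to obtain a much stronger model.
Many different algorithms can be plugged into the framework.
GB runs M iterations in total, and each iteration produces one model. Each new model should reduce the loss function on the training set. How do we make the loss keep shrinking? By gradient descent: at every iteration we move in the direction of the negative gradient of the loss function, so the loss decreases and the model becomes more and more accurate.

Gradient descent comes up often in machine learning, and a picture makes it easy to grasp.

Figure description: adjusting the parameter θ in the direction of gradient descent moves the cost function J(θ) toward lower values, and the algorithm stops when θ can no longer descend. The black line is the trajectory along which the cost (error) decreases. It always follows the negative gradient, which is the direction of steepest descent.
Image source:
http://www.cnblogs.com/LeftNotEasy/archive/2010/12/05/mathmatic_in_machine_learning_1_regression_and_gradient_descent.html
See the original blog post for more details.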
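As a minimal sketch of the gradient descent idea described above, the following Python snippet minimizes a one-dimensional cost function. The cost J(θ) = (θ - 3)^2, the learning rate, and the number of steps are illustrative assumptions, not values from the original post.

```python
def gradient_descent(grad, theta, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient and return the final parameter."""
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Hypothetical cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_min)  # very close to the minimizer 3.0
```

Each step shrinks the distance to the minimum by a constant factor here, which is why θ stops moving once it reaches the bottom of the curve.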

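Differences between the original Boosting algorithm and Gradient Boosting

Both are boosting algorithms, but the original Boosting algorithm and Gradient Boosting differ in an essential way.
The original Boosting algorithm starts by assigning a weight to every sample; initially all samples are equally important. The model obtained at each training step gets some data points right and some wrong. After each step we increase the weights of the misclassified points and decrease the weights of the correctly classified ones, so that points that keep being misclassified receive "serious attention", that is, a very high weight. After N iterations (N is specified by the user) we have N simple classifiers (base learners), which we combine (for example by weighting them or letting them vote) into the final model.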
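The reweighting step can be sketched as follows. This uses the AdaBoost update rule as one concrete instance of the scheme described above; the text does not name a specific rule, so the formula for alpha and the toy labels are assumptions for illustration.

```python
import math

def reweight(weights, y_true, y_pred):
    """One AdaBoost-style round: return (alpha, normalized new weights)."""
    # Weighted error of the current weak classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified points are scaled up, correctly classified points down.
    raw = [w * math.exp(alpha if t != p else -alpha)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

y_true = [1, 1, -1, -1, 1]   # hypothetical labels
y_pred = [1, -1, -1, -1, 1]  # the weak classifier gets the second point wrong
alpha, new_weights = reweight([0.2] * 5, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified point carries half of the total weight, so the next weak classifier is pushed to get it right.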

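Gradient Boosting differs from traditional Boosting in that each round aims to reduce the residual left by the previous round. To eliminate the residual, we build a new model in the gradient direction along which the residual decreases. So in Gradient Boosting each new model is built to push the residual of the previous models down along the gradient, which is very different from traditional Boosting's reweighting of correctly and incorrectly classified samples.
Plugging decision trees into the GB framework gives GBDT.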

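The two versions of GBDT

Reference: http://blog.csdn.net/kunlong0909/article/details/17587101

(1) Residual version
The residual is simply the difference between the true value and the predicted value. Training first learns one regression tree, computes the residual as "true value - predicted value", then takes that residual as the target for the next regression tree, and so on, until the residual falls below a threshold close to 0 or the number of trees reaches a limit. The core idea is that each round reduces the loss function by fitting the residual.
In short, the first tree is an ordinary tree, and every later tree is determined entirely by the residuals.
Here is a simple example first; the example figure is explained in detail at:
http://blog.csdn.net/w28971023/article/details/8240756
[Figure: worked example with two trees]
As the example shows, the input to the second tree is the residual between the first tree's predictions and the actual values. This makes some important properties of GBDT easy to see, and they offer guidance for setting (tuning) parameters when programming with Spark, as discussed in more detail later.
GBDT is an iterative process that keeps reducing the error: each tree predicts the residual of the trees before it. This is completely different from a random forest, where many trees predict in parallel. As a result, GBDT differs from random forests in tree structure (for example maxDepth), running time, prediction results, and generalization ability. (A detailed comparison follows in the Spark coding part.)

Algorithm:
[Figure: the residual-version algorithm]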
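The residual version can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the weak learner is a depth-1 regression tree (a stump) on a single feature, the input needs at least two distinct feature values, and the four data points are made up.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on one feature; returns a predict function."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_residual_boosting(xs, ys, n_trees):
    """The first tree fits y; every later tree fits the remaining residual."""
    trees = []
    residuals = list(ys)
    for _ in range(n_trees):
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        residuals = [r - tree(x) for x, r in zip(xs, residuals)]
    return trees

def predict(trees, x):
    """The final prediction is the sum of all tree outputs."""
    return sum(tree(x) for tree in trees)

def training_sse(trees, xs, ys):
    return sum((y - predict(trees, x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]      # hypothetical feature values
ys = [14.0, 16.0, 24.0, 26.0]  # hypothetical targets
trees = fit_residual_boosting(xs, ys, n_trees=3)
print(training_sse(trees[:1], xs, ys), training_sse(trees, xs, ys))
```

The training error with three trees is lower than with the first tree alone, because each added stump removes part of the residual that the earlier trees left behind.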
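(2) Gradient version
The residual version describes GBDT as a residual-iteration tree in which each regression tree learns the residual of the previous N-1 trees. The gradient version instead describes GBDT as a gradient-iteration tree solved by gradient descent, in which each regression tree learns the gradient-descent value of the previous N-1 trees. What the two have in common: both iterate over regression trees, both sum the outputs of all trees to get the final result (Multiple Additive Regression Trees), and in both each tree learns what the previous N-1 trees still get wrong. In overall flow and in inputs and outputs there is no difference between them.
They differ mainly in whether each iteration uses the gradient as the solution method. The former uses the residual rather than the gradient. The residual is the globally optimal value, while the gradient is a locally optimal direction times a step size. In other words, the former tries to make the result optimal at every step, whereas the latter tries to make the result a little better at every step.
Pros and cons. The former looks more principled: with an absolutely optimal direction available, why take the long way round and estimate a locally optimal direction? The reason is flexibility. The main problem with the former is that, because it relies on the residual, its cost function is generally fixed to the mean squared error that the residual reflects, so it is hard to apply to anything other than pure regression problems. The latter is solved by gradient descent, so any differentiable cost function can be used.
The algorithm is as follows:
See http://blog.csdn.net/starzhou/article/details/51648219
All of these algorithms come from Friedman's paper. To study the principles in depth, it is best to read the original and work through the derivations yourself.
[Figure: the gradient-version algorithm]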
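A small sketch may help connect the two versions. In the gradient version each tree is fitted to the negative gradient of the loss with respect to the current prediction (often called the pseudo-residual). For the squared loss L = 0.5 * (y - F)^2 the negative gradient is exactly the residual y - F, so the residual version is the special case of the gradient version with squared loss. The function names and toy numbers below are assumptions for illustration.

```python
def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def pseudo_residuals(ys, predictions, loss):
    """Negative gradient of the loss with respect to each current prediction."""
    if loss == "squared":    # L = 0.5 * (y - F)^2  ->  -dL/dF = y - F
        return [y - f for y, f in zip(ys, predictions)]
    if loss == "absolute":   # L = |y - F|          ->  -dL/dF = sign(y - F)
        return [sign(y - f) for y, f in zip(ys, predictions)]
    raise ValueError("unknown loss: " + loss)

ys = [14.0, 16.0, 24.0, 26.0]       # hypothetical targets
current = [15.0, 15.0, 25.0, 25.0]  # hypothetical predictions of the trees so far
squared_targets = pseudo_residuals(ys, current, "squared")    # the plain residuals
absolute_targets = pseudo_residuals(ys, current, "absolute")  # only the signs remain
print(squared_targets, absolute_targets)
```

Swapping the loss only changes the targets handed to the next tree, which is where the flexibility of the gradient version comes from.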

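Forward stagewise algorithm

GBDT is evidently a forward stagewise algorithm.
More generally, a forward stagewise algorithm can be written in two forms: one is the model update, the other is the additive model:
[Figure: the two forms]
Put simply: walk forward one step at a time and gradually approach the desired result. How fast we walk can be controlled with one more parameter, the learning rate (see the regularization section below).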
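The two forms mentioned above are commonly written as follows. The symbols (base learner b, its parameters \gamma_m, and coefficients \beta_m) are an assumed textbook notation, since the original formulas are not reproduced in the text.

```latex
% Additive model: a sum of M base learners b(x; \gamma_m) with coefficients \beta_m.
f(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m)

% Update form: each stage adds one new term to the previous model.
f_m(x) = f_{m-1}(x) + \beta_m \, b(x; \gamma_m), \qquad f_0(x) = 0
```

Unrolling the update form for m = 1, ..., M gives back the additive model, so the two forms describe the same procedure.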

Regularization (learning rate)

Shrinkage
Friedman proposed a simple regularization strategy that scales the contribution of each weak learner by a factor \nu:
F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)
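The parameter \nu is also called the learning rate because it scales the step length of the gradient descent procedure; it can be set via the learning_rate parameter.

Why do learning rate and regularization go together? Intuitively, approaching the result gradually with many small steps avoids overfitting more easily than approaching it quickly with a few big steps.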
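The effect of the learning rate on the step length can be seen in a deliberately tiny sketch. The "weak learner" here is just a constant equal to the mean residual, which is an assumption made to keep the example short; it shows how shrinkage slows the approach to the target, not the regularization effect itself.

```python
def boost_with_constant_learner(ys, learning_rate, n_rounds):
    """Boosting where every weak learner predicts the mean residual,
    scaled by the learning rate before it is added to the model."""
    prediction = 0.0
    history = []
    for _ in range(n_rounds):
        mean_residual = sum(y - prediction for y in ys) / len(ys)
        prediction += learning_rate * mean_residual
        history.append(prediction)
    return history

ys = [8.0, 10.0, 12.0]  # hypothetical targets with mean 10
fast = boost_with_constant_learner(ys, learning_rate=1.0, n_rounds=10)
slow = boost_with_constant_learner(ys, learning_rate=0.1, n_rounds=10)
print(fast[0], slow[-1])
```

With a learning rate of 1.0 the model reaches the target mean in one round, while with 0.1 it has covered only about 65% of the distance after ten rounds. This is why a small learning rate has to be paired with more iterations.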

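GBDT in Spark 2.0

Advantages of GBDT

Like random forests, GBDT inherits several advantages of decision trees:
(1) it handles both categorical and continuous features;
(2) it does not require feature standardization as a preprocessing step;
(3) it can capture interactions between features.
Note that GBDT in Spark cannot yet handle multiclass classification; it can only be used for binary classification and regression. (Spark's random forests can handle multiclass problems.)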

  Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, do not require feature scaling, and are able to capture non-linearities and feature interactions.

  spark.mllib supports GBTs for binary classification and for regression, using both continuous and categorical features. spark.mllib implements GBTs using the existing decision tree implementation. Please see the decision tree guide for more information on trees.

  Note: GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or Random Forests.

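GBDT versus random forests in practice

GBDT and random forests are both ensembles of decision trees, but their training processes are quite different.
GBDT trains one tree at a time, one after another (sequentially), so it takes longer to train than a random forest, which can train many trees in parallel.
Compared with random forests (whose trees may be left largely unpruned), GBDT usually uses shallower trees (smaller depth), which reduces the running time.
Random forests are less prone to overfitting, and the more trees the forest contains, the less likely overfitting seems to become. In statistical terms, adding more trees reduces the variance of the predictions (repeated predictions become more stable). GBDT is the opposite: the more trees it contains (that is, the more iterations), the more it tends to overfit. In statistical terms, increasing the number of GBDT iterations reduces the bias (the gap between the predictions and the labels of the training data). (Bias and variance are different concepts; see the figure further below.)
Random forests are somewhat easier to tune, because their predictive performance generally improves monotonically as the number of trees grows. GBDT is different: performance first improves as the number of trees grows, but beyond a certain point it gets worse as more trees are added.
In short, both algorithms are very effective, and the choice depends on the actual data.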

   Gradient-Boosted Trees vs. Random Forests 
  Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:

  GBTs train one tree at a time, so they can take longer to train than random forests. Random Forests can train multiple trees in parallel.

  On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and training smaller trees takes less time.

  Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting. (In statistical language, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)

  Random Forests can be easier to tune since performance improves monotonically with the number of trees (whereas performance can start to decrease for GBTs if the number of trees grows too large).

  In short, both algorithms can be effective, and the choice should be based on the particular dataset.

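The difference between bias and variance:
Bias describes the gap between the expected value of the predictions (estimates) and the true value. The larger the bias, the further the predictions are from the true data, as in the second row of the figure.
Variance describes the spread of the predictions, that is, how far they scatter around their expected value. The larger the variance, the more dispersed the predictions, as in the right column of the figure.
[Figure: bias and variance; the second row shows high bias, the right column shows high variance]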
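The two definitions can be turned into a short calculation. The sketch below assumes we have several predictions for the same true value (for example from models trained on different samples); the numbers are made up.

```python
from statistics import mean, pvariance

def bias_and_variance(predictions, true_value):
    """Bias: expected prediction minus the true value.
    Variance: spread of the predictions around their own mean."""
    expected = mean(predictions)
    return expected - true_value, pvariance(predictions)

# Hypothetical predictions for a true value of 10.
scattered = bias_and_variance([9.0, 10.0, 11.0], true_value=10.0)  # no bias, some variance
shifted = bias_and_variance([12.0, 12.0, 12.0], true_value=10.0)   # biased, no variance
print(scattered, shifted)
```

The first set of predictions is centered on the truth but spread out, while the second is perfectly stable but consistently off, which are the two failure modes the figure contrasts.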

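Key parameters

Three key parameters deserve careful analysis: loss, numIterations, and learningRate. In the spark.ml API the last two are set through setMaxIter and setStepSize, as shown below.

// Define the GBTClassifier. The output (prediction) columns have default names in Spark, so they need not be set explicitly.
GBTClassifier gbtClassifier=new GBTClassifier()
                            .setLabelCol("indexedLabel")// input label column
                            .setFeaturesCol("indexedFeatures")// input feature-vector column
                            .setMaxIter(MaxIter)// maximum number of iterations
                            .setImpurity("entropy")// or "gini"
                            .setMaxDepth(3)// maximum depth of each decision tree
                            .setStepSize(0.3)// range is (0, 1]
                            .setSeed(1234); // optional random seed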

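loss (type of loss function)

Spark implements the following three loss functions. Note that each one suits only one kind of problem, either regression or classification.
For classification only Log Loss is available; for regression you can choose squared error or absolute error, also known as L2 loss and L1 loss respectively. Absolute error (L1 loss) is more robust than L2 loss on data with outliers.
[Figure: table of the loss functions supported by Spark]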
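The three losses can be written out in plain Python as a rough guide. The formulas follow the Spark MLlib ensembles guide as far as I recall it (log loss with labels in {-1, +1}, summed over samples); normalization details of the actual implementation may differ, and the sample numbers are made up.

```python
import math

def log_loss(labels, scores):
    """Classification: labels are -1 or +1, scores are raw model outputs F(x)."""
    return 2.0 * sum(math.log1p(math.exp(-2.0 * y * f))
                     for y, f in zip(labels, scores))

def squared_error(labels, predictions):
    """Regression, L2 loss."""
    return sum((y - f) ** 2 for y, f in zip(labels, predictions))

def absolute_error(labels, predictions):
    """Regression, L1 loss."""
    return sum(abs(y - f) for y, f in zip(labels, predictions))

# One outlier (100.0) dominates the L2 loss far more than the L1 loss.
labels = [1.0, 2.0, 3.0, 100.0]
predictions = [1.0, 2.0, 3.0, 3.0]
l2 = squared_error(labels, predictions)
l1 = absolute_error(labels, predictions)
print(l2, l1, log_loss([1], [0.0]))
```

The single outlier contributes 9409 to the squared error but only 97 to the absolute error, which illustrates why L1 loss is considered more robust.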

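numIterations (number of iterations)

This is the number of GBDT iterations. Each iteration produces one tree, so numIterations is also the number of trees in the ensemble. Increasing numIterations improves prediction accuracy on the training set (note: on the training set), but it also increases training time. Choosing a value that avoids overfitting requires validation: split the data into two parts, a training set and a validation set.
As the number of iterations grows, the prediction error on the validation set first decreases and then, beyond a certain point, starts to increase. The most suitable numIterations can therefore be read off the curve of validation accuracy versus number of iterations.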
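Reading the best value off the validation curve amounts to taking the iteration with the lowest validation error. A minimal sketch, with a made-up error curve:

```python
def best_num_iterations(validation_errors):
    """Return the 1-based iteration count with the lowest validation error."""
    best_index = min(range(len(validation_errors)),
                     key=lambda i: validation_errors[i])
    return best_index + 1

# Hypothetical validation error after 1, 2, ..., 6 iterations.
validation_errors = [0.30, 0.26, 0.24, 0.23, 0.235, 0.25]
best = best_num_iterations(validation_errors)
print(best)
```

Here the error bottoms out at the fourth iteration and rises afterwards, so four trees would be the choice for this curve.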

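learningRate (learning rate)

This parameter usually does not need tuning. If the algorithm behaves very unstably on some dataset, try reducing the learning rate; this usually improves stability. A smaller learning rate (step size) does increase training time.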

 (1) loss: See the section above for information on losses and their applicability to tasks (classification vs. regression). Different losses can give significantly different results, depending on the dataset. 
  (2) numIterations: This sets the number of trees in the ensemble. Each iteration produces one tree. Increasing this number makes the model more expressive, improving training data accuracy. However, test-time accuracy may suffer if this is too large. 
Gradient boosting can overfit when trained with more trees. In order to prevent overfitting, it is useful to validate while training. The method runWithValidation has been provided to make use of this option. It takes a pair of RDD’s as arguments, the first one being the training dataset and the second being the validation dataset. 
  The training is stopped when the improvement in the validation error is not more than a certain tolerance (supplied by the validationTol argument in BoostingStrategy). In practice, the validation error decreases initially and later increases. There might be cases in which the validation error does not change monotonically, and the user is advised to set a large enough negative tolerance and examine the validation curve using evaluateEachIteration (which gives the error or loss per iteration) to tune the number of iterations. 
  (3) learningRate: This parameter should not need to be tuned. If the algorithm behavior seems unstable, decreasing this value may improve stability.
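The stopping rule quoted above can be sketched as follows. This is a simplified reading of the documentation text (stop when the improvement is not more than the tolerance); it is not Spark's exact implementation, and the error values are hypothetical.

```python
def iterations_before_stop(validation_errors, tolerance):
    """Number of iterations run when training stops as soon as the improvement
    over the previous iteration is not more than `tolerance`."""
    for i in range(1, len(validation_errors)):
        improvement = validation_errors[i - 1] - validation_errors[i]
        if improvement <= tolerance:
            return i  # stop here: the first i iterations are kept
    return len(validation_errors)

# Hypothetical, non-monotonic validation curve.
errors = [0.30, 0.26, 0.24, 0.235, 0.236, 0.20]
stopped_early = iterations_before_stop(errors, tolerance=0.01)
never_stopped = iterations_before_stop(errors, tolerance=-1.0)
print(stopped_early, never_stopped)
```

With a positive tolerance training stops after three iterations and misses the later drop to 0.20. A large negative tolerance disables early stopping, so the full curve can be inspected, which matches the advice in the quoted text.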

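Code based on the Spark 2.0 DataFrame and pipeline APIs needs some preprocessing steps. They are explained in detail in another article of mine:
Differences between the decision tree types in Spark 2.0
http://www.cnblogs.com/itboys/p/8312894.html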

// Complete Spark 2.0 GBDT code
package my.spark.ml.practice.classification;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.GBTClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class myGDBT {

    public static void main(String[] args) {
        SparkSession spark=SparkSession
                .builder()
                .appName("CoFilter")
                .master("local[4]")
                .config("spark.sql.warehouse.dir",
                        "file:///G:/Projects/Java/Spark/spark-warehouse" )
                .getOrCreate();         

        String path="C:/Users/user/Desktop/ml_dataset/classify/horseColicTraining2libsvm.txt";
        String path2="C:/Users/user/Desktop/ml_dataset/classify/horseColicTest2libsvm.txt";
        // Suppress log output
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR);//WARN
        Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF);   

        Dataset<Row> training=spark.read().format("libsvm").load(path);
        Dataset<Row> test=spark.read().format("libsvm").load(path2);        

        StringIndexerModel indexerModel=new StringIndexer()
                .setInputCol("label")
                .setOutputCol("indexedLabel")
                .fit(training);
        VectorIndexerModel vectorIndexerModel=new VectorIndexer()
                .setInputCol("features")
                .setOutputCol("indexedFeatures")
                .fit(training);
        IndexToString converter=new IndexToString()
                .setInputCol("prediction")
                .setOutputCol("convertedPrediction")
                .setLabels(indexerModel.labels());
        // Tune maxIter, maxDepth, and the step size (learning rate); the impurity setting is looped over as well
        for (int maxIter = 30; maxIter < 40; maxIter += 10)
          for (int maxDepth = 2; maxDepth < 3; maxDepth += 1)
            for (int impurityType = 1; impurityType < 2; impurityType += 1)
              for (int step = 1; step < 10; step += 1) {
                    long begin = System.currentTimeMillis(); // training start time
                    // Impurity type: 1 selects "gini", anything else selects "entropy"
                    String impurityName = (impurityType == 1) ? "gini" : "entropy";
                    double stepSize = 0.1 * step; // 0.1, 0.2, ..., 0.9
                    GBTClassifier gbtClassifier = new GBTClassifier()
                            .setLabelCol("indexedLabel")
                            .setFeaturesCol("indexedFeatures")
                            .setMaxIter(maxIter)
                            .setImpurity(impurityName)
                            .setMaxDepth(maxDepth)
                            .setStepSize(stepSize) // range is (0, 1]
                            .setSeed(1234);

                    // Fit the whole pipeline: label indexer, feature indexer, GBT, label converter
                    PipelineModel model = new Pipeline().setStages(
                            new PipelineStage[]
                                    {indexerModel, vectorIndexerModel, gbtClassifier, converter})
                            .fit(training);
                    long end = System.currentTimeMillis(); // training end time

                    // Always evaluate on the held-out test set
                    Dataset<Row> predictions = model.transform(test);

                    double accuracy = new MulticlassClassificationEvaluator()
                            .setLabelCol("indexedLabel")
                            .setPredictionCol("prediction")
                            .setMetricName("accuracy").evaluate(predictions);
                    // The labels below keep the spelling used in the recorded output
                    System.out.println(
                            String.format(" maxIter = %d ", maxIter)
                            + String.format(" maxDepth = %d ", maxDepth)
                            + " impurityType = " + impurityName
                            + String.format(" setpSize = %.2f ", stepSize)
                            + String.format(" accuracy = %.4f ", accuracy)
                            + String.format(" trainig time = %d ", (end - begin)));

                }// end of parameter loops
    }   
}

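/* The parameter analysis below applies only to this small dataset. Results can differ greatly on other data; this is just a very simple test. */
/** Effect of the number of iterations: as it increases, accuracy on the test set improves at first, and training time grows roughly linearly.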
maxIter = 1  maxDepth = 1  impurityType = entropy setpSize = 0.10  accuracy = 0.7313  trainig time = 1753 
 maxIter = 11  maxDepth = 1  impurityType = entropy setpSize = 0.10  accuracy = 0.7463  trainig time = 2820 
 maxIter = 21  maxDepth = 1  impurityType = entropy setpSize = 0.10  accuracy = 0.7612  trainig time = 5043 
 maxIter = 31  maxDepth = 1  impurityType = entropy setpSize = 0.10  accuracy = 0.7761  trainig time = 7217 
 maxIter = 41  maxDepth = 1  impurityType = entropy setpSize = 0.10  accuracy = 0.7761  trainig time = 9932 
 maxIter = 51  maxDepth = 1  impurityType = entropy setpSize = 0.10  accuracy = 0.7761  trainig time = 12337 
 maxIter = 61  maxDepth = 1  impurityType = entropy setpSize = 0.10  accuracy = 0.7761  trainig time = 15091 
 */
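/**
Effect of maxDepth: prediction accuracy is highest at maxDepth = 2 and then declines, which does suggest that the decision trees in GBDT should be kept shallow.
Training time grows with maxDepth, but not linearly.
[Figures: accuracy and training time versus maxDepth]
*/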

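/** Comparison of the two impurity measures: with this data and these parameters there is no difference in accuracy. (As far as I know, Spark's GBT models ignore the impurity setting and always build trees with variance, which would explain the identical results.)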
maxIter = 30 maxDepth = 2 impurityType = gini setpSize = 0.10 accuracy = 0.7910 trainig time = 10522 
maxIter = 30 maxDepth = 2 impurityType = entropy setpSize = 0.10 accuracy = 0.7910 trainig time = 8824 
*/

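Learning rate (step size): the learning rate also affects prediction accuracy; setting it too high reduces accuracy.
[Figure: accuracy versus step size]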

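Learning more about GBDT with Scikit

The machine learning library Scikit-learn generally has richer documentation and examples, so let us dig a little deeper there.
There it is called Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT). It is the same thing; the trees in GBDT are normally regression trees (not classification trees). This algorithm is used a lot in search ranking.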

 Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web search ranking and ecology.

It is even simpler to implement in Scikit:

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

# Load a demo dataset
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

# Set the parameters and train the classifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)

# Evaluate on the test set
clf.score(X_test, y_test)
Out[7]: 0.91300000000000003

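n_estimators is the number of weak learners; it is effectively the maximum number of iterations maxIter in Spark 2.0 (that is, the number of decision trees, since the weak learners here are decision trees).
learning_rate corresponds to stepSize in Spark 2.0.
Note that n_estimators and learning_rate interact: a smaller learning rate needs more weak learners to maintain the same training error.
Empirical evidence suggests that smaller learning rates give better prediction accuracy on the test set.
[HTF2009] recommends setting the learning rate to a small constant (for example, at most 0.1) and choosing n_estimators by early stopping. [R2007] discusses the interaction between the two parameters in more detail.

[HTF2009] T. Hastie, R. Tibshirani and J. Friedman, "The Elements of Statistical Learning", 2nd ed., Springer, 2009.
[R2007] G. Ridgeway, "Generalized Boosted Models: A guide to the gbm package", 2007.
I have not had time to read these two references yet and hope to study them later.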
The parameter learning_rate strongly interacts with the parameter n_estimators, the number of weak learners to fit. Smaller values of learning_rate require larger numbers of weak learners to maintain a constant training error. Empirical evidence suggests that small values of learning_rate favor better test error. [HTF2009] recommend to set the learning rate to a small constant (e.g. learning_rate <= 0.1) and choose n_estimators by early stopping. For a more detailed discussion of the interaction between learning_rate and n_estimators see [R2007].

A similar loop makes it easy to run all kinds of tests:

# GBDT parameter test code (Python)
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

# Alternative grid (swap it in to test several learning rates):
# n_estimators_list = [10, 100, 300, 500, 1000]
# learning_rate_list = [0.05, 0.10, 0.2, 0.5, 1.0]
n_estimators_list = [10, 100, 300, 500, 1000, 2000, 5000]
learning_rate_list = [0.05]

# Train one classifier per parameter combination and report its test score
for n_estimators in n_estimators_list:
    for learning_rate in learning_rate_list:
        clf = GradientBoostingClassifier(n_estimators=n_estimators,
                                         learning_rate=learning_rate,
                                         max_depth=1,
                                         random_state=0).fit(X_train, y_train)

        print("n_estimators = " + str(n_estimators)
              + "  learning_rate = " + str(learning_rate)
              + "  score = " + str(clf.score(X_test, y_test)))

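With a very small learning rate of 0.05, the number of weak learners is increased step by step.
With a small learning rate, many weak learners are indeed needed to get good results, but the prediction quality keeps improving.

[Figure: scores for learning_rate = 0.05]

With a large learning rate, fewer estimators are enough to reach similar results. (For the sake of model stability, a very large learning rate is still not recommended.)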

n_estimators = 10 learning_rate = 0.5 score = 0.6889 
n_estimators = 100 learning_rate = 0.5 score = 0.8987 
n_estimators = 300 learning_rate = 0.5 score = 0.9291 
n_estimators = 500 learning_rate = 0.5 score = 0.9378 
n_estimators = 1000 learning_rate = 0.5 score = 0.9444 
n_estimators = 2000 learning_rate = 0.5 score = 0.9475 
n_estimators = 5000 learning_rate = 0.5 score = 0.9469

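What happens with a huge number of trees? (Training is slow even on a toy dataset.)
The final prediction accuracy converges to a value (beyond roughly 2000 to 5000 iterations).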

n_estimators = 100 learning_rate = 0.1 score = 0.8189 
n_estimators = 500 learning_rate = 0.1 score = 0.8975 
n_estimators = 1000 learning_rate = 0.1 score = 0.9203 
n_estimators = 5000 learning_rate = 0.1 score = 0.9428 
n_estimators = 10000 learning_rate = 0.1 score = 0.9463 
n_estimators = 20000 learning_rate = 0.1 score = 0.9465 
n_estimators = 50000 learning_rate = 0.1 score = 0.9457

References:

(1) Spark documentation
http://spark.apache.org/docs/latest/mllib-ensembles.html
(2) Mathematics in Machine Learning (1): Regression and Gradient Descent
http://www.cnblogs.com/LeftNotEasy/archive/2010/12/05/mathmatic_in_machine_learning_1_regression_and_gradient_descent.html
(3) GBDT (MART) Iterative Decision Trees: Introductory Tutorial | Overview
http://blog.csdn.net/w28971023/article/details/8240756
