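Today we look at a regression problem: the Kaggle competition Bike Sharing Demand, where the goal is to predict bike rental counts from features such as date and time, weather, and temperature. The training and test sets look roughly like this: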
```
// train
datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0,8,32,40
// test
datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,
```
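Looking at the data above, we can see that the total rental count is the sum of rentals by unregistered (casual) users and rentals by registered users, i.e. `count = casual + registered`. The evaluation metric is the loss function RMSLE (Root Mean Squared Logarithmic Error):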
\[ \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\log(p_i + 1) - \log(a_i + 1)\right)^2} \]
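Here \(p_i\) is the predicted rental count, \(a_i\) is the actual rental count, and \(n\) is the number of samples. In effect, RMSLE is simply an error function: the root mean squared error computed on \(\log(x + 1)\) of the counts.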
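To make the metric concrete, here is a minimal sketch of the formula in NumPy. The function name `rmsle` is my own choice and is not taken from the competition code. The actual counts are the two training rows shown earlier, and the predictions are made up for illustration.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error for non-negative counts."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # np.log1p(x) computes log(x + 1), as in the formula
    log_diff = np.log1p(predicted) - np.log1p(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Actual counts 16 and 40 come from the sample training rows;
# the predicted values 20 and 30 are hypothetical.
print(rmsle([16, 40], [16, 40]))  # perfect prediction gives zero error
print(rmsle([20, 30], [16, 40]))
```

Because the error is measured on logarithms, a model trained on `log(count + 1)` with an ordinary squared-error loss is optimizing this metric directly.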
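The date and time are stored in a single string field, so we need to parse out the year, month, weekday, hour, and so on. We do not use the day of the month as a feature, because the weekday is more periodic and more representative.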
```python
import pandas as pd

# Parse the datetime column while loading the CSV
train = pd.read_csv("data/train.csv", parse_dates=["datetime"])

# Derive calendar features from the parsed timestamps
train['year'] = train['datetime'].dt.year
train['month'] = train['datetime'].dt.month
train['hour'] = train['datetime'].dt.hour
train['weekday'] = train['datetime'].dt.weekday  # Monday=0 ... Sunday=6
train['day'] = train['datetime'].dt.day
```
To simplify computation, we convert some of the features to categorical type:
```python
# Mark discrete-valued columns as categorical
train['weather'] = train['weather'].astype('category')
train['holiday'] = train['holiday'].astype('category')
train['workingday'] = train['workingday'].astype('category')
train['season'] = train['season'].astype('category')
train['hour'] = train['hour'].astype('category')
```
The selected features are as follows:
```python
# 'hour' replaces the original 'time' entry, which is never defined above
features = ['season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed',
            'hour', 'weekday', 'year']
```
The month feature was dropped because including it was found to cause overfitting.
We use GBM (gradient boosting) for the regression, with hyperparameters picked by grid search:
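```python
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

booster = ensemble.GradientBoostingRegressor(n_estimators=500)
param_grid = {
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [10, 15, 20],
    'min_samples_leaf': [3, 5, 10, 20],
}

# The 'log-count' column is not created in the snippets above;
# it is presumably log(count + 1), matching the RMSLE metric.
gs_cv = GridSearchCV(booster, param_grid, n_jobs=4).fit(train[features], train['log-count'])

# best hyperparameter setting
print(gs_cv.best_params_)
# {'learning_rate': 0.05, 'max_depth': 10, 'min_samples_leaf': 20}
```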
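This approach gives an RMSLE of 0.43789. As noted earlier, the rental count is the sum `casual + registered`, so we can treat the two as separate targets, predict each with its own GBM, and add the two predictions to get the final result. This did lower the RMSLE, to 0.41983.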
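The two-model idea can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the helper names `fit_log_gbm` and `predict_count` are hypothetical, and training each model on `log(y + 1)` is my assumption rather than something the article confirms for this step.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the real feature matrix and targets (hypothetical data)
X = rng.uniform(0, 1, size=(200, 3))
casual = rng.poisson(5 + 20 * X[:, 0])
registered = rng.poisson(30 + 100 * X[:, 1])

def fit_log_gbm(X, y):
    """Fit a GBM on log(y + 1), so squared error in log space tracks RMSLE."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X, np.log1p(y))
    return model

# One model per user group
casual_model = fit_log_gbm(X, casual)
registered_model = fit_log_gbm(X, registered)

def predict_count(X_new):
    # Undo the log transform for each group, then add the two parts
    casual_pred = np.expm1(casual_model.predict(X_new))
    registered_pred = np.expm1(registered_model.predict(X_new))
    return casual_pred + registered_pred

count_pred = predict_count(X[:5])
print(count_pred)
```

Splitting the target lets each model learn its own pattern, which plausibly helps because casual and registered riders tend to behave differently across hours and weekdays.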
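So far we have used only one regression method. Can the results of GBM and RF (random forest) be combined? Yes: combining the two predictions with a weight of 0.5 each (that is, an averaging ensemble) lowered the RMSLE to 0.37022. The metric results compare as follows: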
| Features | Regressor | RMSLE |
|---|---|---|
| + year, weekday, hour | GBM | 0.43789 |
| | GBM + GBM | 0.41983 |
| | GBM RF ensemble | 0.37022 |
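The equal-weight averaging of GBM and RF can be sketched as below. The data is synthetic and the model settings are placeholders, not the tuned values from the article. Averaging after converting back from log space is an assumption, since the article does not say in which space the predictions were combined.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical), split into train and test parts
X = rng.uniform(0, 1, size=(300, 3))
y = rng.poisson(20 + 100 * X[:, 0])
X_train, X_test = X[:200], X[200:]
y_train = y[:200]

# Both models are trained on log(count + 1)
gbm = GradientBoostingRegressor(n_estimators=50, random_state=0)
gbm.fit(X_train, np.log1p(y_train))
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, np.log1p(y_train))

# Convert each model's prediction back to a count
gbm_pred = np.expm1(gbm.predict(X_test))
rf_pred = np.expm1(rf.predict(X_test))

# Averaging ensemble: weight 0.5 for each model
ensemble_pred = 0.5 * gbm_pred + 0.5 * rf_pred
print(ensemble_pred[:5])
```

Averaging tends to help when the two models make partly independent errors, since those errors can cancel out in the mean.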
Conclusion: ensembling really is a good method. As the saying goes, three cobblers together outdo Zhuge Liang.