Author: 大樹
Updated: 01.20
Email: 59888745@qq.com
Data processing, machine learning
There are plenty of interesting projects on Kaggle worth trying when you have time. One of them is about predicting Hong Kong horse races; if you do it well and your predictions are accurate, you can make real money. A Hong Kong newspaper once reported that a university professor won HK$50 million on horse racing through statistical modeling, so machine learning and deep learning should certainly be able to improve betting accuracy. Good luck, and keep studying!
The Kaggle bike sharing demand competition is a continuous-value prediction problem, that is, a regression problem in machine-learning terms. Let's walk through it together.
The data comes from a city bike rental system: two years of hourly rental records from Washington, D.C. The training set consists of the first 19 days of each month, and the test set consists of day 20 onward (which we have to predict ourselves).
Kaggle bike sharing demand competition: https://www.kaggle.com/c/bike-sharing-demand
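As a quick illustration of that split rule (a toy sketch, not part of the competition materials; the date range below is made up), the day-of-month cut can be expressed in pandas like this:

```python
import pandas as pd

# Toy frame spanning one month of hourly timestamps
df = pd.DataFrame({'datetime': pd.date_range('2011-01-01', '2011-01-31 23:00', freq='h')})
day = pd.DatetimeIndex(df['datetime']).day

train = df[day <= 19]   # first 19 days of each month -> training set
test = df[day >= 20]    # day 20 onward -> test set (labels withheld on Kaggle)
```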
1. Load the data
2. Analyze the data
3. Extract feature columns
4. Prepare the training and test sets
5. Model selection: first run a reasonable algorithm to get a baseline model, then analyze and improve it step by step
6. Parameter tuning: use grid search to find the best parameters
7. Predict and score with the model
# Load the data and review the fields and data types
import pandas as pd
df_train = pd.read_csv('kaggle_bike_competition_train.csv',header=0)
df_train.head(5)
df_train.dtypes
# Check the number of rows and columns
df_train.shape
# Check for missing values: none are found
df_train.count()
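As a side note (not in the original notebook), isnull().sum() surfaces missing values more directly than count(), since it reports the number of missing entries per column rather than the number of present ones:

```python
import pandas as pd

# Tiny frame with one missing value in column 'a'
df = pd.DataFrame({'a': [1.0, 2.0, None], 'b': [4, 5, 6]})
print(df.isnull().sum())  # per-column count of missing entries
```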
# Process the timestamp: it packs a lot of information, since everything varies with time
df_train.head()
df_train['hour']=pd.DatetimeIndex(df_train.datetime).hour
df_train['day']=pd.DatetimeIndex(df_train.datetime).dayofweek
df_train['month']=pd.DatetimeIndex(df_train.datetime).month
# Alternative: derive the same features via to_datetime and apply
# df_train['dt']=pd.to_datetime(df_train['datetime'])
# df_train['day_of_week']=df_train['dt'].apply(lambda x:x.dayofweek)
# df_train['day_of_month']=df_train['dt'].apply(lambda x:x.day)
df_train.head()
# Select the relevant feature columns
# df_train.drop(['datetime','casual','registered'], axis=1, inplace=True)  # alternative: drop the unused columns in place
df_train = df_train[['season','holiday','workingday','weather','temp','atemp',
'humidity','windspeed','count','month','day','hour']]
df_train.head(5)
df_train.shape
Prepare the training data:
1. df_train_target: the target, i.e. the count column.
2. df_train_data: the data used to build the features.
df_train_target = df_train['count'].values
print(df_train_target.shape)
df_train_data = df_train.drop(['count'],axis =1).values
print(df_train_data.shape)
Models
We'll again use cross-validation (holding out roughly 20% of the data) to gauge each model's performance. We'll try Support Vector Regression, Ridge Regression, and a Random Forest Regressor. Each model runs over 3 splits and we look at the averaged results.
from sklearn import linear_model
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, GridSearchCV, train_test_split
from sklearn.metrics import explained_variance_score
# Split the data: 3 shuffled train/test splits, holding out 20% each time
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
# 各類模型來一圈
print("嶺迴歸")
for train, test in cv:
svc = linear_model.Ridge().fit(df_train_data[train], df_train_target[train])
print("train score: {0:.3f}, test score: {1:.3f}\n".format(
svc.score(df_train_data[train], df_train_target[train]), svc.score(df_train_data[test], df_train_target[test])))
print("支持向量迴歸/SVR(kernel='rbf',C=10,gamma=.001)")
for train, test in cv:
svc = svm.SVR(kernel ='rbf', C = 10, gamma = .001).fit(df_train_data[train], df_train_target[train])
print("train score: {0:.3f}, test score: {1:.3f}\n".format(
svc.score(df_train_data[train], df_train_target[train]), svc.score(df_train_data[test], df_train_target[test])))
print("隨機森林迴歸/Random Forest(n_estimators = 100)")
for train, test in cv:
svc = RandomForestRegressor(n_estimators = 100).fit(df_train_data[train], df_train_target[train])
print("train score: {0:.3f}, test score: {1:.3f}\n".format(
svc.score(df_train_data[train], df_train_target[train]), svc.score(df_train_data[test], df_train_target[test])))
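One caveat: a regressor's score method reports R², while the Kaggle leaderboard for this competition is scored with RMSLE (root mean squared logarithmic error). A minimal RMSLE sketch (this helper is not part of the original notebook, and it assumes non-negative targets and predictions):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle([1, 10, 100], [1, 10, 100]))  # perfect prediction -> 0.0
```

Evaluating locally with the competition metric gives a better sense of leaderboard standing than R² alone.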
The random forest regressor achieved the best result.
Still, we can use grid search to test whether these parameter settings are actually the best and to find better ones.
X = df_train_data
y = df_train_target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
tuned_parameters = [{'n_estimators':[10,100,500,550]}]
scores = ['r2']
for score in scores:
    print(score)
    clf = GridSearchCV(RandomForestRegressor(), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)
    print("Best parameters found:")
    print("")
    # best_estimator_ returns the best estimator chosen by the search
    print(clf.best_estimator_)
    print("")
    print("Scores for each setting:")
    print("")
    # cv_results_ holds, for each parameter setting, the mean test score
    # and the per-fold test scores over the cross-validation folds
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std / 2, params))
    print("")
Grid search makes parameter tuning quite convenient, and we should also check whether the model is overfitting or underfitting.
We find that the model fits best at n_estimators = 500 or 550.
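Finally, to close the loop on step 7 (predict and score), here is a sketch of refitting the tuned model and scoring its predictions. Since the competition CSV isn't bundled here, the features and target below are synthetic stand-ins; in practice you would use df_train_data and df_train_target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                                # stand-in for the rental features
y = 100 * X[:, 0] + 10 * X[:, 1] + rng.rand(200)    # synthetic "count" target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
print(explained_variance_score(y_test, pred))  # close to 1.0 on this easy target
```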