The project calls for a configurable, largely automated pipeline; below is the automated model-training part, based on my own thinking.
A boosted decision tree model has two key parameters: the tree depth and the number of trees (referred to during training as the number of iterations, hereafter "number of rounds").
Tree depth
How should the depth be chosen? My take: every tree should ideally be able to use all of the features, so the depth is tied to the feature count; take the base-2 logarithm of the number of features and round up.
# Compute the tree depth from the number of features
# Rationale: the minimum depth of a binary tree needed to place all the features on its leaf nodes
tree_depth = math.ceil(math.log(num_of_features, 2))
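For example, with num_of_features = 6 this gives ceil(log2(6)) = ceil(2.58) = 3, which is the depth used in the experiment below.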
Below is an experiment that checks that this choice of tree depth is reasonable.
Number of features: 6; positive samples: 2903; computed tree depth: 3; maximum number of rounds (see below): 47.
First, the scores for tree depths 2, 3 and 4 (in each plot, y is the score, t is train_fbeta, e is eval_fbeta):
Tree depth 2: (plot)
Tree depth 3: (plot)
Tree depth 4: (plot)
The experiment confirms that this way of choosing the depth is reasonable:
If the depth is too small, the model converges slowly and needs many more rounds;
if the depth is too large, the model overfits easily, possibly even after a single round.
Comparing depths 2, 3 and 4, depth 3 is the reasonable choice: it does not overfit from the start, yet does not need too many rounds to converge.
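If you want to automate this sanity check too, here is a minimal sketch that sweeps the depths around the computed one (it assumes the auto_train() helper and the globals from the full script at the end of this post; everything else is illustrative):
# illustrative depth sweep around the computed tree_depth
for depth in (tree_depth - 1, tree_depth, tree_depth + 1):
    del res_list[:]  # train() appends to the global res_list; reset it per depth
    best_round, best = auto_train(dtrain, depth, 1, int(max_num_rounds))
    print "depth", depth, "-> best round:", best_round, "score:", best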
Number of rounds
Compared with the tree depth, the number of rounds is hard to pin down; it is the focus of this post, and my approach is detailed below.
Lower and upper bounds
The lower bound is easy: at least one decision tree has to be built, so it can be set to 1; of course, if you know a tighter lower bound, all the better.
The upper bound can come from experience; see the comments below for details.
# Compute the maximum number of rounds from the sample count and the feature count (tree depth)
# Rationale: how many samples are needed to support each feature (this comes from the data itself
# and from practical experience)
# Here we count positive samples; from experience we assume each feature needs 10 supporting
# samples (tune this value for your own data)
# The more (and the denser) the samples, the more supporting samples each feature needs
sustain_num = 10
max_num_rounds = pos_num / (num_of_features * sustain_num)
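With the setup above (2903 positives, 6 features) this works out to 2903 / (6 * 10) ≈ 48, the same order as the upper bound of 47 quoted earlier (the exact figure depends on the data and rounding).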
Model quality
Quality of the model itself
Only binary classification is studied here. For a given sample set there are fairly mature criteria for judging a model, such as precision, recall and the fbeta score.
Precision and recall generally trade off against each other, so their weighted combination, fbeta, is usually used instead.
from sklearn import metrics
# fbeta for binary classification, used here as the measure of how good a trained model is,
# rather than precision and recall directly
def getFbeta(train_y, predict):
    # print set(train_y), set(predict)
    return metrics.fbeta_score(train_y, predict, 0.5, labels=[0, 1])
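Since beta = 0.5 < 1, this score weights precision more heavily than recall. A quick illustrative call (the labels are made up):
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]  # one false negative, one false positive
print getFbeta(y_true, y_pred)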
Model robustness
Generally, as the number of rounds grows, the model tends to overfit: it fits the training data too closely and may behave erratically on other sample sets, i.e. its robustness is poor.
A robust model performs consistently on the training set and the test set (also called the evaluation or validation set).
Below is a scoring function that combines the model's own quality (train_fbeta) with its robustness (train_fbeta - eval_fbeta).
Models can behave differently on different datasets, so adapt the scoring function to your own needs.
# fbeta measures how good a training run is; the training-set fbeta is usually higher than the
# validation-set one, because the model was fitted on the training set
# This function scores a model from its performance on the validation set versus the training set
# e.g. which is better, (74.5%, 74.6%) or (84%, 85%)?
# Design logic: a gap within 1% is acceptable; when the gap is <= 1%, the larger the mean the
# better; when the gap is > 1%, we lean towards a model whose gap is <= 1%,
# and when the gap is <= 1% we do not worry about the exact values
# In other words, models with a gap <= 1% tend to get a higher score; some examples:
# group 1           group 2           better (? intuition only)
# (74.5%, 74.6%)    (84.0%, 85.0%)    (84.0%, 85.0%)
# (64.5%, 64.9%)    (88.0%, 85.0%)    (64.5%, 64.9%)
# (64.5%, 64.6%)    (66.3%, 66.6%)    (66.3%, 66.6%)
# (63.5%, 64.6%)    (72.3%, 68.6%)    (63.5%, 64.6%)
def eval_model(train_fbeta, eval_fbeta):
    # train_fbeta and eval_fbeta are percentages in [0, 100];
    # fbeta_weights is a global (1.1 in the full script)
    difference = abs(train_fbeta - eval_fbeta)
    average = (train_fbeta + eval_fbeta) / 2.0
    if difference <= 1.0:
        return fbeta_weights * average  # small gap: reward with a weight > 1
    elif difference <= 2.0:
        return average
    else:
        return average / (1 + difference / 20.0)  # large gap: penalize
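Applied to the first example row above (and assuming fbeta_weights = 1.1, as in the full script):
print eval_model(74.5, 74.6)  # gap 0.1 <= 1: 1.1 * 74.55 = 82.005
print eval_model(84.0, 85.0)  # gap 1.0 <= 1: 1.1 * 84.5 = 92.95, so this pair wins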
Automated model training
In the worst case we do not know when the model peaks, and we must sweep every round count to find the optimal number of rounds and the best model.
Usually, as the number of rounds increases, the score first rises and then falls, so there is a maximum, and after the maximum the score keeps falling.
With enough runs and a well-chosen tree depth, the expected score curve is a steep climb followed by a gentle decline; in that case we can end training early.
# The core assumption behind automatic training: as rounds increase, the score first rises and
# then falls, has a maximum, and keeps falling after the maximum
# So once a local maximum is reached, if the score fails to beat it over an acceptable window,
# treat that local maximum as the global one (here: 20 rounds without improvement)
def auto_train(dtrain, tree_depth, min_num_rounds, max_num_rounds):
    max_score = -1
    max_score_num_round = 0
    label = 0  # rounds since the score last improved
    for i in xrange(min_num_rounds, max_num_rounds):
        score = train(dtrain, tree_depth, i, train_rounds)
        if score > max_score:
            max_score = score
            max_score_num_round = i
            label = 0
        label += 1
        if label >= 20:  # no improvement for 20 rounds: assume the maximum was found
            break
    return max_score_num_round, max_score
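A typical invocation, matching the __main__ block of the full script:
num_round, best_score = auto_train(dtrain, tree_depth, 1, int(max_num_rounds))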
Model reproduction
Because a single training run involves randomness, we do not use the auto-training result directly; instead we retrain with the optimal tree depth and number of rounds that auto-training reported.
If the best score can be reproduced, we consider the auto-trained model reliable; conversely, if retraining fails to reproduce it for a long time, we treat this auto-training result as a fluke and rerun the whole auto-training.
count = 0
while True:
    if abs(best_score - best_train(dtrain, tree_depth, num_round)) < 1.0:
        print "tree depth:", tree_depth
        print "num rounds:", num_round
        print "train_fbeta:", res_list[num_round - 1][1]
        print "eval_fbeta:", res_list[num_round - 1][2]
        print "best_score:", best_score
        break
    count += 1
    if count > 20:
        print "auto rerun!"
        os.system("rm model_depth{0}_{1}trees_{2}".format(int(tree_depth), num_round, datetime.datetime.now().strftime('%Y-%m-%d')))
        os.system("python auto_train.py")
        break
With that, the automated training of the decision tree model is essentially complete. The full code is attached below for reference.
# auto_train.py
# coding:utf8
import os, sys, time, math, traceback
import numpy as np
import xgboost as xgb
import datetime
from sklearn import metrics
from sklearn.metrics import classification_report
ISOTIMEFORMAT='%Y-%m-%d %X'
fbeta_difference=1.0
fbeta_weights=1.1
train_rounds=3
res_list=[]
def getFbeta(train_y, predict):
    return metrics.fbeta_score(train_y, predict, 0.5, labels=[0, 1])
def eval_model(train_fbeta, eval_fbeta):
    # train_fbeta and eval_fbeta are percentages in [0, 100]
    difference = abs(train_fbeta - eval_fbeta)
    average = (train_fbeta + eval_fbeta) / 2.0
    if difference <= 1.0:
        return fbeta_weights * average
    elif difference <= 2.0:
        return average
    else:
        return average / (1 + difference / 20.0)
def loadfile(filename):
    X, y = [], []
    count = 0
    fields = 0
    with open(filename) as f:
        fields = len(f.readline().strip("\n").split("\t"))
    with open(filename) as f:
        for line in f:
            try:
                line = line.strip("\n").split('\t')
                if len(line) != fields:
                    continue
                X.append([float(item) for item in line[1:]])
                y.append(int(line[0]))
                count += 1
            except:
                print traceback.format_exc()
    data = xgb.DMatrix(X, label=y)
    return data, count, fields - 1  # the first column is the label
def train(dtrain, max_depth, num_rounds, train_rounds):
    train_fbeta_list = []
    eval_fbeta_list = []
    for x in xrange(0, train_rounds):
        # random half/half split into a training part and an evaluation part
        shuffled_idx = range(dtrain.num_row())
        np.random.seed(int(time.time()))  # the seed must be an int
        np.random.shuffle(shuffled_idx)
        dsubtrain = dtrain.slice(shuffled_idx[:len(shuffled_idx) / 2 + 1])
        dsubeval = dtrain.slice(shuffled_idx[len(shuffled_idx) / 2 + 1:])
        param = {'max_depth': int(max_depth), 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
        watchlist = [(dsubeval, 'eval'), (dsubtrain, 'train')]
        num_rounds = int(num_rounds)
        bst = xgb.train(param, dsubtrain, num_rounds, watchlist)
        # with binary:logistic, predict() returns probabilities, so threshold at 0.5
        prob = bst.predict(dsubtrain)
        pred_res = [1 if p > 0.5 else 0 for p in prob]
        train_fbeta_list.append(getFbeta(dsubtrain.get_label(), pred_res))
        prob = bst.predict(dsubeval)
        pred_res = [1 if p > 0.5 else 0 for p in prob]
        eval_fbeta_list.append(getFbeta(dsubeval.get_label(), pred_res))
    train_fbeta = sum(train_fbeta_list) / len(train_fbeta_list) * 100
    eval_fbeta = sum(eval_fbeta_list) / len(eval_fbeta_list) * 100
    score = eval_model(train_fbeta, eval_fbeta)
    res_list.append([num_rounds, train_fbeta, eval_fbeta, score])
    return score
def best_train(dtrain, max_depth, num_rounds):
    # retrain once with the chosen depth and round count, and save the model
    shuffled_idx = range(dtrain.num_row())
    np.random.seed(int(time.time()))
    np.random.shuffle(shuffled_idx)
    dsubtrain = dtrain.slice(shuffled_idx[:len(shuffled_idx) / 2 + 1])
    dsubeval = dtrain.slice(shuffled_idx[len(shuffled_idx) / 2 + 1:])
    param = {'max_depth': int(max_depth), 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
    watchlist = [(dsubeval, 'eval'), (dsubtrain, 'train')]
    num_rounds = int(num_rounds)
    bst = xgb.train(param, dsubtrain, num_rounds, watchlist)
    bst.save_model(
        "model_depth{0}_{1}trees_{2}".format(int(max_depth), num_rounds, datetime.datetime.now().strftime('%Y-%m-%d')))
    prob = bst.predict(dsubtrain)
    pred_res = [1 if p > 0.5 else 0 for p in prob]
    train_fbeta = getFbeta(dsubtrain.get_label(), pred_res) * 100
    prob = bst.predict(dsubeval)
    pred_res = [1 if p > 0.5 else 0 for p in prob]
    eval_fbeta = getFbeta(dsubeval.get_label(), pred_res) * 100
    score = eval_model(train_fbeta, eval_fbeta)
    return score
def auto_train(dtrain, tree_depth, min_num_rounds, max_num_rounds):
    max_score = -1
    max_score_num_round = 0
    label = 0  # rounds since the score last improved
    for i in xrange(min_num_rounds, max_num_rounds):
        score = train(dtrain, tree_depth, i, train_rounds)
        if score > max_score:
            max_score = score
            max_score_num_round = i
            label = 0
        label += 1
        if label >= 20:
            break
    return max_score_num_round, max_score
if __name__ == '__main__':
    filename = "oldtrain10"
    dtrain, count, num_of_features = loadfile(filename)
    tree_depth = math.ceil(math.log(num_of_features, 2))
    pos_num = sum(dtrain.get_label())
    sustain_num = 10
    max_num_rounds = pos_num / (num_of_features * sustain_num)
    num_round, best_score = auto_train(dtrain, tree_depth, 1, int(max_num_rounds))
    count = 0
    while True:
        if abs(best_score - best_train(dtrain, tree_depth, num_round)) < 1.0:
            print "tree depth:", tree_depth
            print "num rounds:", num_round
            print "train_fbeta:", res_list[num_round - 1][1]
            print "eval_fbeta:", res_list[num_round - 1][2]
            print "best_score:", best_score
            break
        count += 1
        if count > 20:
            print "auto rerun!"
            os.system("rm model_depth{0}_{1}trees_{2}".format(int(tree_depth), num_round, datetime.datetime.now().strftime('%Y-%m-%d')))
            os.system("python auto_train.py")
            break