A Hands-On Guide to Writing a Practical XGBoost Program

Date: 2019-11-24


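Brief introduction:

This is a real competition. The problem comes from Tianchi Big Data: "Precisely locating the shop a user is in inside a shopping mall". The original dataset has 1.14 million rows, which makes it very expensive to compute on. To give beginners a better and more basic learning experience, I shrank the dataset and put it here (password: ndfd) for you to download.

In my version of the data, the columns look like this:

train.csv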

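user_id	user ID	time_stamp	timestamp
latitude	latitude	wifi_strong 1-10	signal strength of the ten WiFi access points
longitude	longitude	wifi_id 1-10	IDs of the ten WiFi access points
shop_id	shop ID	con_sta 1-10	connection status of the ten WiFi access points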

test.csv

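user_id	user ID	time_stamp	timestamp
latitude	latitude	wifi_id 1-10	IDs of the ten WiFi access points
longitude	longitude	con_sta 1-10	connection status of the ten WiFi access points
row_id	row label	wifi_strong 1-10	signal strength of the ten WiFi access points
shop_id	shop ID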

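The idea of the problem is this: inside a mall, because of the different floors and the limited accuracy of GPS, we cannot tell exactly which shop a user is in from latitude and longitude alone. Instead we use the phone's connection status with ten nearby WiFi access points to determine which shop the user is in, so that the company can push ads for the right shops based on the user's location.

Getting started

Before getting hands-on, you should of course have a basic understanding of XGBoost. If you are not familiar with the model, I suggest reading my earlier article "XGBoost".

The usual workflow is to preprocess the data first so that the model can handle it, which includes handling missing values, splitting fields, converting types and so on. Then we feed it to the model and run it, and finally tune the parameters based on the accuracy of the results, repeating until we reach the best result.

In hands-on machine learning we have to break one habit of thought: that we must think everything through before anything can run. It is an interesting habit. How to explain it? Take this problem: I come from a communications background, so when I saw ten WiFi signal strengths I wanted to work out the relationship between them and then write code to solve for the person's exact position. At heart that is still thinking at the level of explicit programming, the belief that a program only reaches its goal if we spell everything out. Large-scale data processing does not work that way. Whatever data a decision tree meets, be it time or geographic position, it builds the tree by the same rules and then runs new data through the tree to get a prediction. In other words, we do not need to spend much effort on the physical meaning of each column; we just put them into the model. (Tuning does require some simple thought about physical meaning in order to weight the features, but that is for later.)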

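Analyzing the data

The meaning of each column is in the tables above: user ID, latitude and longitude, timestamp, shop ID, and WiFi information. With a little thought we can see:

user_id has no real meaning; it is just an identifier
shop_id is the prediction target. The task asks us to predict the user's shop_id from the other information, so shop_id is our training target
Latitude and longitude relate to our position, so they are useful
wifi_id tells us which router it is. Different routers sit in different places, so it is useful
wifi_strong is the signal strength, which depends on our distance from the router, so it is useful
con_sta is the connection status, i.e. whether the phone is connected. Since almost every row in the data is unconnected, I first thought it was useless. Later someone more experienced pointed out that if a phone connects automatically to a shop's WiFi, the user is probably a regular there, so it does help a little in identifying customers.
test.csv is mostly the same, with an extra row_id; we just need to make sure our output lines up with it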

Preparing the Python libraries

import pandas as pd
import xgboost as xgb
from sklearn import preprocessing

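Our XGBoost program is fairly simple, so it uses only the three essential libraries: pandas for data handling, xgboost, and the preprocessing module imported from the well-known machine learning library sklearn. pandas wraps many basic data-handling operations in convenient functions and is pleasant to use. For examples, follow this link to my write-up on basic ways of splitting data with pandas.DataFrame.

Preprocessing the data first

First we need to load the data: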

train = pd.read_csv(r'D:\XGBoost_learn\mall_location\train2.csv')
tests = pd.read_csv(r'D:\XGBoost_learn\mall_location\test_pre.csv')

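We use pandas' read_csv function to read the csv files directly. CSV stands for Comma-Separated Values: the values are separated by commas, which is compact, and it is the format most data competitions use. Watch out for paths: the separator is \ on Windows and / on Linux. Backslashes in a normal Python string also start escape sequences (\t, \n and so on), so a Windows path can be silently mangled or raise an error. Putting an r in front of the string disables that; r means raw, and the string is taken exactly as written.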
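As a small illustration of the raw-string point, the sketch below compares three spellings of the same Windows path. The folder name new_folder is made up for the example, chosen so that the path contains \n and \t.

```python
# Hypothetical Windows path; "new_folder" and "train.csv" are chosen so the
# path contains the sequences \n and \t.
raw_path = r'D:\new_folder\train.csv'        # raw string: backslashes kept as-is
escaped_path = 'D:\\new_folder\\train.csv'   # same path with doubled backslashes
plain_path = 'D:\new_folder\train.csv'       # \n and \t become newline and tab

print(raw_path == escaped_path)   # the two safe spellings are identical
print(repr(plain_path))           # shows the newline and tab that crept in
```

Forward slashes also generally work in file paths on Windows, which is another way to sidestep the problem.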

Our time_stamp is originally str data. The computer has no idea what it represents; it is just a string of characters. So we convert it to datetime:

train['time_stamp'] = pd.to_datetime(pd.Series(train['time_stamp']))
tests['time_stamp'] = pd.to_datetime(pd.Series(tests['time_stamp']))

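Both train and tests need this step, which also shows how powerful pandas is. Next, look at what time_stamp looks like: 2017/8/6 21:20. From the dataset we can see that the granularity is ten minutes. A single value like this packs in too much information, and keeping it as one column is wasteful (in fact it invites overfitting, because a node gets split very finely), so let's break it apart: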

train['Year'] = train['time_stamp'].apply(lambda x: x.year)
train['Month'] = train['time_stamp'].apply(lambda x: x.month)
train['weekday'] = train['time_stamp'].dt.dayofweek
train['time'] = train['time_stamp'].dt.time
tests['Year'] = tests['time_stamp'].apply(lambda x: x.year)
tests['Month'] = tests['time_stamp'].apply(lambda x: x.month)
tests['weekday'] = tests['time_stamp'].dt.dayofweek
tests['time'] = tests['time_stamp'].dt.time

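Attentive readers may notice two styles here. One is .apply(lambda x: x.year). What does it mean? It uses an anonymous function: we want a small function but do not want to bother naming it, so we write a lambda to do the small job. .apply(lambda x: x.year) calls apply, which runs the function on every value of the column; assigning the result to train['Year'] adds a new column whose content is x.year. If that feels unintuitive, it can also be written like this: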

def YearApply(x):
    # Return the year of a single Timestamp
    return x.year

train['Year'] = train['time_stamp'].apply(YearApply)

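The two forms mean the same thing. For weekday and time we use the .dt accessor, which belongs to pandas (not numpy), as shown in the code. weekday could also be written train['weekday'] = train['time_stamp'].apply(lambda x: x.weekday()). Note the extra parentheses: weekday is a method that has to be called, whereas year is an attribute. Why weekday? Because for shopping, the day of the week is more telling than the day of the month. Next we drop time_stamp, since we already have year, month and the rest: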

train = train.drop('time_stamp', axis=1)
tests = tests.drop('time_stamp', axis=1)
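The timestamp steps above can be tried on a tiny frame. This is a minimal sketch with two made-up timestamps in the same format as the data; the frame name demo and the column weekday_apply are only for illustration. It shows that the apply style and the .dt style give the same weekday.

```python
import pandas as pd

# Two made-up timestamps in the same format as the competition data.
demo = pd.DataFrame({'time_stamp': ['2017/8/6 21:20', '2017/8/7 09:10']})
demo['time_stamp'] = pd.to_datetime(demo['time_stamp'])

# Style 1: apply a function to each Timestamp.
demo['Year'] = demo['time_stamp'].apply(lambda x: x.year)
demo['weekday_apply'] = demo['time_stamp'].apply(lambda x: x.weekday())

# Style 2: the vectorised .dt accessor.
demo['Month'] = demo['time_stamp'].dt.month
demo['weekday'] = demo['time_stamp'].dt.dayofweek  # Monday=0 ... Sunday=6
demo['time'] = demo['time_stamp'].dt.time

# The original column is no longer needed once it has been split up.
demo = demo.drop('time_stamp', axis=1)
print(demo)
```

On large frames the .dt accessor is usually the faster of the two styles, since it avoids calling a Python function per row.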

Then drop the missing values, or fill them in.

train = train.dropna(axis=0)
tests = tests.ffill()  # forward fill; replaces the deprecated fillna(method='pad')

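You can see that I handled the training and test sets differently. The training set has plenty of rows and few missing values, so I simply drop every row with a missing value using dropna. The tests file is the test set, and we cannot lose a single row even though it is large and has few gaps, so we fill them by some method. Here I fill each gap with the previous non-NaN value (method="pad", i.e. forward fill). There are other options, such as filling with the most frequent value in the column: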

import numpy as np
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def fit(self, X, y=None):
        # One fill value per column: the most frequent value for object
        # columns, the median for numeric ones
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].median() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

train = DataFrameImputer().fit_transform(train)

This code is a bit of a mouthful. It means: for every column c in X, if the dtype of X[c] is object ('O' stands for object), fill the gaps with X[c].value_counts().index[0], which is the value that occurs most often; otherwise fill with X[c].median(), the median of the values.
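The three ways of handling gaps discussed above can be compared side by side on a toy frame. This is a minimal sketch with made-up values; the names demo, dropped, padded and imputed are only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with one incomplete row (index 1).
demo = pd.DataFrame({
    'wifi_id1': ['b_1', None, 'b_1', 'b_2'],
    'wifi_strong1': [-50.0, np.nan, -70.0, -60.0],
})

missing_per_column = demo.isna().sum()   # count the gaps in each column

dropped = demo.dropna(axis=0)            # option 1: drop incomplete rows
padded = demo.ffill()                    # option 2: copy the previous value

# Option 3: most frequent value for object columns, median for numeric ones.
fill_values = pd.Series({
    'wifi_id1': demo['wifi_id1'].value_counts().index[0],
    'wifi_strong1': demo['wifi_strong1'].median(),
})
imputed = demo.fillna(fill_values)

print(missing_per_column)
print(imputed)
```

Note that forward fill leaves a gap untouched when it sits in the very first row, because there is no earlier value to copy.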

At this point we can use print to see what our data looks like.

print(train.info())

<class 'pandas.core.frame.DataFrame' at 0x0000024527C50D08>
Int64Index: 467 entries, 0 to 499
Data columns (total 38 columns):
user_id          467 non-null object
shop_id          467 non-null object
longitude        467 non-null float64
latitude         467 non-null float64
wifi_id1         467 non-null object
wifi_strong1     467 non-null int64
con_sta1         467 non-null bool
wifi_id2         467 non-null object
wifi_strong2     467 non-null int64
con_sta2         467 non-null object
wifi_id3         467 non-null object
wifi_strong3     467 non-null float64
con_sta3         467 non-null object
wifi_id4         467 non-null object
wifi_strong4     467 non-null float64
con_sta4         467 non-null object
wifi_id5         467 non-null object
wifi_strong5     467 non-null float64
con_sta5         467 non-null object
wifi_id6         467 non-null object
wifi_strong6     467 non-null float64
con_sta6         467 non-null object
wifi_id7         467 non-null object
wifi_strong7     467 non-null float64
con_sta7         467 non-null object
wifi_id8         467 non-null object
wifi_strong8     467 non-null float64
con_sta8         467 non-null object
wifi_id9         467 non-null object
wifi_strong9     467 non-null float64
con_sta9         467 non-null object
wifi_id10        467 non-null object
wifi_strong10    467 non-null float64
con_sta10        467 non-null object
Year             467 non-null int64
Month            467 non-null int64
weekday          467 non-null int64
time             467 non-null object
dtypes: bool(1), float64(10), int64(5), object(22)
memory usage: 139.1+ KB
None

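This shows the structure of our data clearly: how many columns there are, how many values each has, and so on. Whether there are nulls can be judged from the non-null counts. If we put print(train.info()) before the missing-value handling, we get: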

<class 'pandas.core.frame.DataFrame' at 0x000001ECFA6D6718>
RangeIndex: 500 entries, 0 to 499

There were 500 rows; after processing only 467 remain, so quite a few were dropped. We can print the info for tests in the same way:

<class 'pandas.core.frame.DataFrame' at 0x0000019E13A96F48>
RangeIndex: 500 entries, 0 to 499

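All 500 rows are still there; every gap was filled. I only pasted the header of the output because the full listing is long. Notice that the data has four types: bool, float, int and object. XGBoost is built on regression trees and only handles numeric data, so we have to convert. What do we do with the string columns? We use LabelEncoder: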

for f in train.columns:
    if train[f].dtype=='object':
        if f != 'shop_id':
            print(f)
            lbl = preprocessing.LabelEncoder()
            train[f] = lbl.fit_transform(list(train[f].values))
for f in tests.columns:
    if tests[f].dtype == 'object':
        print(f)
        lbl = preprocessing.LabelEncoder()
        tests[f] = lbl.fit_transform(list(tests[f].values))

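This code calls LabelEncoder from sklearn's preprocessing module to label-encode the data, which turns each distinct value into an integer. (Other tools in preprocessing do things like normalization; LabelEncoder itself only maps values to integers.) In lbl = preprocessing.LabelEncoder(), lbl is just a short name for a new encoder object; that line only creates it and encodes nothing yet. The second statement, lbl.fit_transform(list(train[f].values)), encodes every value in that column of train. Printing train[f].values before and after makes this visible: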

print(train[f].values)
train[f] = lbl.fit_transform(list(train[f].values))
print(train[f].values)

In my own run I also printed a line of 0s and slashes to separate the two outputs. We get:

user_id
['u_376' 'u_376' 'u_1041' 'u_1158' 'u_1654' 'u_2733' 'u_2848' 'u_3063'
 'u_3063' 'u_3063' 'u_3604' 'u_4250' 'u_4508' 'u_5026' 'u_5488' 'u_5488'
 'u_5602' 'u_5602' 'u_5602' 'u_5870' 'u_6429' 'u_6429' 'u_6870' 'u_6910'
 'u_7037' 'u_7079' 'u_7869' 'u_8045' 'u_8209']
[ 7  7  0  1  2  3  4  5  5  5  6  8  9 10 11 11 12 12 12 13 14 14 15 16 17
 18 19 20 21]

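We can see that LabelEncoder has turned our str data into numbers, following its own rule (the index of each value among the sorted unique values). In the loop over train you can see that I deliberately skipped shop_id. The reason is that shop_id is what we have to submit, so it must not be encoded in any way and has to stay a str.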
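One thing to keep in mind with the loops above is that train and tests each get their own encoder, so the same string need not map to the same integer in both frames. The sketch below, on made-up WiFi ids, shows the effect and one possible alternative: fitting a single encoder on the values from both frames. This is not the author's method, only an illustration; the names train_demo, tests_demo and shared are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up WiFi ids; 'b_20' occurs in both frames.
train_demo = pd.DataFrame({'wifi_id1': ['b_30', 'b_10', 'b_20']})
tests_demo = pd.DataFrame({'wifi_id1': ['b_20', 'b_40']})

# Separate encoders: the same id can get different codes.
separate_train = LabelEncoder().fit_transform(train_demo['wifi_id1'])
separate_tests = LabelEncoder().fit_transform(tests_demo['wifi_id1'])

# One shared encoder fitted on every value seen in either frame.
shared = LabelEncoder()
shared.fit(pd.concat([train_demo['wifi_id1'], tests_demo['wifi_id1']]))
shared_train = shared.transform(train_demo['wifi_id1'])
shared_tests = shared.transform(tests_demo['wifi_id1'])

print(list(shared.classes_))                   # sorted unique values
print(separate_train[2], separate_tests[0])    # codes for 'b_20' differ
print(shared_train[2], shared_tests[0])        # codes for 'b_20' agree
```

With the shared encoder, a given wifi_id means the same thing to the model at training time and at prediction time.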

Next we need to convert train and tests to matrices so that XGBoost can work with them:

feature_columns_to_use = ['Year', 'Month', 'weekday',
'time', 'longitude', 'latitude',
'wifi_id1', 'wifi_strong1', 'con_sta1',
 'wifi_id2', 'wifi_strong2', 'con_sta2',
'wifi_id3', 'wifi_strong3', 'con_sta3',
'wifi_id4', 'wifi_strong4', 'con_sta4',
'wifi_id5', 'wifi_strong5', 'con_sta5',
'wifi_id6', 'wifi_strong6', 'con_sta6',
'wifi_id7', 'wifi_strong7', 'con_sta7',
'wifi_id8', 'wifi_strong8', 'con_sta8',
'wifi_id9', 'wifi_strong9', 'con_sta9',
'wifi_id10', 'wifi_strong10', 'con_sta10',]
train_for_matrix = train[feature_columns_to_use]
test_for_matrix = tests[feature_columns_to_use]
train_X = train_for_matrix.to_numpy()  # as_matrix() was removed in pandas 1.0
test_X = test_for_matrix.to_numpy()
train_y = train['shop_id']

The training target is shop_id, so train_y is shop_id.
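The 36 feature names follow a regular pattern, so the list can also be generated rather than typed out. A minimal sketch, assuming the column naming shown above; base_columns, wifi_columns and feature_columns are names introduced here for illustration.

```python
# Columns that are not repeated per WiFi access point.
base_columns = ['Year', 'Month', 'weekday', 'time', 'longitude', 'latitude']

# wifi_id1, wifi_strong1, con_sta1, wifi_id2, ... up to index 10.
wifi_columns = [f'{prefix}{i}'
                for i in range(1, 11)
                for prefix in ('wifi_id', 'wifi_strong', 'con_sta')]

feature_columns = base_columns + wifi_columns
print(len(feature_columns))
print(feature_columns[6:9])
```

The result has the same order as the hand-written feature_columns_to_use list, so it can be used in its place.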

Feeding the model to build the decision trees

gbm = xgb.XGBClassifier(silent=1, max_depth=10, n_estimators=1000, learning_rate=0.05)
gbm.fit(train_X, train_y)

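These two statements could be merged into one. We set the parameters inside XGBClassifier. I list all its parameters and their default values here; the content comes from the XGBoost source code: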

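max_depth=3: the maximum depth of a tree; the default is three levels. The larger max_depth is, the more specific and local the patterns the model learns.

learning_rate=0.1: the learning rate, i.e. the factor each new tree is multiplied by in gradient boosting. The smaller it is, the slower the loss falls, but the more precisely.

n_estimators=100: the maximum number of boosting iterations, or equivalently the maximum number of weak learners. In general, too small an n_estimators underfits and too large a value costs a lot of computation; past a certain point, increasing n_estimators brings very little improvement, so a moderate value is usually chosen. The default is 100.

silent=True: whether to print messages while the xgboost trees are trained. True means silent mode, with no training messages printed; False prints them. (Newer XGBoost versions replace this parameter with verbosity.)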

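objective="binary:logistic": defines the loss function to be minimized. The most common values are:

binary:logistic: logistic regression for binary classification; returns the predicted probability (not the class).

multi:softmax: a multiclass classifier using softmax; returns the predicted class (not probabilities). In this case you also need to set one more parameter: num_class (the number of classes).

multi:softprob: the same as multi:softmax, but returns the probability of each sample belonging to each class.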

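nthread=-1: thread control. Set it according to your CPU cores, to however many threads you want to use. If you want to use all cores, leave it unset and the algorithm detects them automatically.

gamma=0: a node is split only if the split reduces the loss. gamma specifies the minimum loss reduction required to split a node. The larger the value, the more conservative the algorithm. The right value depends closely on the loss function, so it needs tuning.

min_child_weight=1: the minimum sum of instance weights required in a leaf. It is similar to GBM's min_samples_leaf but not identical: XGBoost's parameter is a minimum sum of sample weights, while GBM's is a minimum number of samples. It is used to prevent overfitting. A larger value keeps the model from learning patterns specific to a few local samples, but too high a value causes underfitting. Tune it with cross-validation (CV).

max_delta_step=0: limits the maximum step allowed for each tree's weight estimate. 0 means no limit; a positive value makes the updates more conservative. It usually does not need to be set, but it can help logistic regression when the classes are extremely imbalanced.

subsample=1: the same as subsample in GBM. It controls the fraction of rows randomly sampled for each tree. Lowering it makes the algorithm more conservative and helps prevent overfitting, but too small a value can cause underfitting. Typical values: 0.5-1

colsample_bytree=1: the fraction of columns randomly sampled for each tree (each column is a feature). Typical values: 0.5-1

colsample_bylevel=1: the fraction of columns sampled for the splits at each level of the tree. subsample and colsample_bytree together can achieve a similar effect.

reg_alpha=0: the L1 regularization term on the weights (as in Lasso regression). It can be used with very high-dimensional data to make the algorithm faster.

reg_lambda=1: the L2 regularization term on the weights. It controls the regularization part of XGBoost; the larger it is, the more heavily complex trees are penalized.

scale_pos_weight=1: when the classes are highly imbalanced, setting this parameter to a positive value can help, since it controls the balance between positive and negative sample weights.

base_score=0.5: the initial prediction score of all instances, a global bias. Given enough iterations, changing this value has little effect.

seed=0: the random seed. Setting it makes results that involve randomness reproducible, and it can also be used when tuning parameters.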
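Several of these parameters are easier to place with the standard XGBoost objective in view. The formulas below follow the usual textbook formulation rather than anything stated in this article: $g_i$ and $h_i$ denote the first and second derivatives of the loss for sample $i$, $T$ is the number of leaves of a tree, $w_j$ are its leaf weights, and the subscripts $L$ and $R$ refer to the left and right child of a candidate split. The gain is written for the case reg_alpha = 0.

```latex
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i)
\qquad (\eta = \texttt{learning\_rate})

\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
\qquad (\gamma = \texttt{gamma},\ \lambda = \texttt{reg\_lambda},\ \alpha = \texttt{reg\_alpha})

\text{Gain} = \tfrac{1}{2}\left[
  \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma,
\qquad G = \sum_i g_i,\quad H = \sum_i h_i
```

Read this way, a split is kept only when its gain is positive, which is why a larger gamma makes the trees more conservative, and min_child_weight acts as a lower bound on $H_L$ and $H_R$.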

Running the data through the trees to get predictions

predictions = gbm.predict(test_X)

The data in tests is passed through the trained model to produce the predictions.

submission = pd.DataFrame({'row_id': tests['row_id'],
                            'shop_id': predictions})
print(submission)
submission.to_csv("submission.csv", index=False)

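This writes the predictions to a csv file. Mind the column order: row_id first, shop_id second. index=False means the row index is not written; change it to True and the index of every row is written as well.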
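The effect of index=False is easy to see on a two-row frame. A minimal sketch with made-up values; submission_demo is a name introduced for the example.

```python
import pandas as pd

# Made-up predictions for two rows.
submission_demo = pd.DataFrame({'row_id': [101, 102],
                                'shop_id': ['s_a', 's_b']})

# With no path given, to_csv returns the CSV text instead of writing a file.
without_index = submission_demo.to_csv(index=False)
with_index = submission_demo.to_csv(index=True)

print(without_index)
print(with_index)
```

With index=True the file gains an unnamed first column holding 0, 1, ..., which a submission checker expecting only row_id and shop_id would likely reject.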

Appendix

References

Machine Learning Series (12): Complete Guide to XGBoost Parameter Tuning (with Python code): http://blog.csdn.net/han_xiaoyang/article/details/52665396
Kaggle competition: the Titanic disaster: https://www.kaggle.com/c/titanic

Full code

import pandas as pd
import xgboost as xgb
from sklearn import preprocessing


train = pd.read_csv(r'D:\mall_location\train.csv')
tests = pd.read_csv(r'D:\mall_location\test.csv')

train['time_stamp'] = pd.to_datetime(pd.Series(train['time_stamp']))
tests['time_stamp'] = pd.to_datetime(pd.Series(tests['time_stamp']))

print(train.info())

train['Year'] = train['time_stamp'].apply(lambda x:x.year)
train['Month'] = train['time_stamp'].apply(lambda x: x.month)
train['weekday'] = train['time_stamp'].apply(lambda x: x.weekday())
train['time'] = train['time_stamp'].dt.time
tests['Year'] = tests['time_stamp'].apply(lambda x: x.year)
tests['Month'] = tests['time_stamp'].apply(lambda x: x.month)
tests['weekday'] = tests['time_stamp'].dt.dayofweek
tests['time'] = tests['time_stamp'].dt.time
train = train.drop('time_stamp', axis=1)
train = train.dropna(axis=0)
tests = tests.drop('time_stamp', axis=1)
tests = tests.ffill()  # forward fill; replaces the deprecated fillna(method='pad')
for f in train.columns:
    if train[f].dtype=='object':
        if f != 'shop_id':
            print(f)
            lbl = preprocessing.LabelEncoder()
            train[f] = lbl.fit_transform(list(train[f].values))
for f in tests.columns:
    if tests[f].dtype == 'object':
        print(f)
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(tests[f].values))
        tests[f] = lbl.transform(list(tests[f].values))


feature_columns_to_use = ['Year', 'Month', 'weekday',
'time', 'longitude', 'latitude',
'wifi_id1', 'wifi_strong1', 'con_sta1',
 'wifi_id2', 'wifi_strong2', 'con_sta2',
'wifi_id3', 'wifi_strong3', 'con_sta3',
'wifi_id4', 'wifi_strong4', 'con_sta4',
'wifi_id5', 'wifi_strong5', 'con_sta5',
'wifi_id6', 'wifi_strong6', 'con_sta6',
'wifi_id7', 'wifi_strong7', 'con_sta7',
'wifi_id8', 'wifi_strong8', 'con_sta8',
'wifi_id9', 'wifi_strong9', 'con_sta9',
'wifi_id10', 'wifi_strong10', 'con_sta10',]

big_train = train[feature_columns_to_use]
big_test = tests[feature_columns_to_use]
train_X = big_train.to_numpy()  # as_matrix() was removed in pandas 1.0
test_X = big_test.to_numpy()
train_y = train['shop_id']

gbm = xgb.XGBClassifier(silent=1, max_depth=10,
                    n_estimators=1000, learning_rate=0.05)
gbm.fit(train_X, train_y)
predictions = gbm.predict(test_X)

submission = pd.DataFrame({'row_id': tests['row_id'],
                            'shop_id': predictions})
print(submission)
submission.to_csv("submission.csv",index=False)
