<!-- TOC -->
<!-- /TOC -->

Algorithm
- This project tackles the Homesite Quote Conversion competition on Kaggle: using xgboost's sklearn interface, we build a model on the data and predict whether a customer will purchase the quoted insurance plan.
```python
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
```
```python
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
```
<div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>QuoteNumber</th> <th>Original_Quote_Date</th> <th>QuoteConversion_Flag</th> <th>Field6</th> <th>Field7</th> <th>Field8</th> <th>Field9</th> <th>Field10</th> <th>Field11</th> <th>Field12</th> <th>...</th> <th>GeographicField59A</th> <th>GeographicField59B</th> <th>GeographicField60A</th> <th>GeographicField60B</th> <th>GeographicField61A</th> <th>GeographicField61B</th> <th>GeographicField62A</th> <th>GeographicField62B</th> <th>GeographicField63</th> <th>GeographicField64</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>2013-08-16</td> <td>0</td> <td>B</td> <td>23</td> <td>0.9403</td> <td>0.0006</td> <td>965</td> <td>1.0200</td> <td>N</td> <td>...</td> <td>9</td> <td>9</td> <td>-1</td> <td>8</td> <td>-1</td> <td>18</td> <td>-1</td> <td>10</td> <td>N</td> <td>CA</td> </tr> <tr> <th>1</th> <td>2</td> <td>2014-04-22</td> <td>0</td> <td>F</td> <td>7</td> <td>1.0006</td> <td>0.0040</td> <td>548</td> <td>1.2433</td> <td>N</td> <td>...</td> <td>10</td> <td>10</td> <td>-1</td> <td>11</td> <td>-1</td> <td>17</td> <td>-1</td> <td>20</td> <td>N</td> <td>NJ</td> </tr> <tr> <th>2</th> <td>4</td> <td>2014-08-25</td> <td>0</td> <td>F</td> <td>7</td> <td>1.0006</td> <td>0.0040</td> <td>548</td> <td>1.2433</td> <td>N</td> <td>...</td> <td>15</td> <td>18</td> <td>-1</td> <td>21</td> <td>-1</td> <td>11</td> <td>-1</td> <td>8</td> <td>N</td> <td>NJ</td> </tr> <tr> <th>3</th> <td>6</td> <td>2013-04-15</td> <td>0</td> <td>J</td> <td>10</td> <td>0.9769</td> <td>0.0004</td> <td>1,165</td> <td>1.2665</td> <td>N</td> <td>...</td> <td>6</td> <td>5</td> <td>-1</td> <td>10</td> <td>-1</td> <td>9</td> <td>-1</td> <td>21</td> <td>N</td> <td>TX</td> </tr> <tr> <th>4</th> <td>8</td> <td>2014-01-25</td> <td>0</td> <td>E</td> <td>23</td> <td>0.9472</td> <td>0.0006</td> <td>1,487</td> <td>1.3045</td> <td>N</td> <td>...</td> <td>18</td> <td>22</td> <td>-1</td> <td>10</td> <td>-1</td> <td>11</td> <td>-1</td> <td>12</td> <td>N</td> <td>IL</td> </tr> </tbody> </table> <p>5 rows × 299 columns</p> </div>
```python
# QuoteNumber is just a row identifier, so drop it from both sets
train = train.drop('QuoteNumber', axis=1)
test = test.drop('QuoteNumber', axis=1)
```
Converting the date format
```python
train['Date'] = pd.to_datetime(train['Original_Quote_Date'])
train = train.drop('Original_Quote_Date', axis=1)
test['Date'] = pd.to_datetime(test['Original_Quote_Date'])
test = test.drop('Original_Quote_Date', axis=1)

train['year'] = train['Date'].dt.year
train['month'] = train['Date'].dt.month
train['weekday'] = train['Date'].dt.weekday
train.head()
```
<div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>QuoteConversion_Flag</th> <th>Field6</th> <th>Field7</th> <th>Field8</th> <th>Field9</th> <th>Field10</th> <th>Field11</th> <th>Field12</th> <th>CoverageField1A</th> <th>CoverageField1B</th> <th>...</th> <th>GeographicField61A</th> <th>GeographicField61B</th> <th>GeographicField62A</th> <th>GeographicField62B</th> <th>GeographicField63</th> <th>GeographicField64</th> <th>Date</th> <th>year</th> <th>month</th> <th>weekday</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>0</td> <td>B</td> <td>23</td> <td>0.9403</td> <td>0.0006</td> <td>965</td> <td>1.0200</td> <td>N</td> <td>17</td> <td>23</td> <td>...</td> <td>-1</td> <td>18</td> <td>-1</td> <td>10</td> <td>N</td> <td>CA</td> <td>2013-08-16</td> <td>2013</td> <td>8</td> <td>4</td> </tr> <tr> <th>1</th> <td>0</td> <td>F</td> <td>7</td> <td>1.0006</td> <td>0.0040</td> <td>548</td> <td>1.2433</td> <td>N</td> <td>6</td> <td>8</td> <td>...</td> <td>-1</td> <td>17</td> <td>-1</td> <td>20</td> <td>N</td> <td>NJ</td> <td>2014-04-22</td> <td>2014</td> <td>4</td> <td>1</td> </tr> <tr> <th>2</th> <td>0</td> <td>F</td> <td>7</td> <td>1.0006</td> <td>0.0040</td> <td>548</td> <td>1.2433</td> <td>N</td> <td>7</td> <td>12</td> <td>...</td> <td>-1</td> <td>11</td> <td>-1</td> <td>8</td> <td>N</td> <td>NJ</td> <td>2014-08-25</td> <td>2014</td> <td>8</td> <td>0</td> </tr> <tr> <th>3</th> <td>0</td> <td>J</td> <td>10</td> <td>0.9769</td> <td>0.0004</td> <td>1,165</td> <td>1.2665</td> <td>N</td> <td>3</td> <td>2</td> <td>...</td> <td>-1</td> <td>9</td> <td>-1</td> <td>21</td> <td>N</td> <td>TX</td> <td>2013-04-15</td> <td>2013</td> <td>4</td> <td>0</td> </tr> <tr> <th>4</th> <td>0</td> <td>E</td> <td>23</td> <td>0.9472</td> <td>0.0006</td> <td>1,487</td> <td>1.3045</td> <td>N</td> <td>8</td> <td>13</td> <td>...</td> <td>-1</td> <td>11</td> <td>-1</td> <td>12</td> <td>N</td> <td>IL</td> <td>2014-01-25</td> <td>2014</td> <td>1</td> <td>5</td> </tr> </tbody> </table> <p>5 rows × 301 columns</p> </div>

Add the same date features to the test set, then drop the intermediate Date column:
```python
test['year'] = test['Date'].dt.year
test['month'] = test['Date'].dt.month
test['weekday'] = test['Date'].dt.weekday

train = train.drop('Date', axis=1)
test = test.drop('Date', axis=1)
```
Check the data types
```python
train.dtypes
```

```
QuoteConversion_Flag      int64
Field6                   object
Field7                    int64
Field8                  float64
Field9                  float64
Field10                  object
Field11                 float64
Field12                  object
CoverageField1A           int64
CoverageField1B           int64
CoverageField2A           int64
CoverageField2B           int64
CoverageField3A           int64
CoverageField3B           int64
CoverageField4A           int64
CoverageField4B           int64
CoverageField5A           int64
CoverageField5B           int64
CoverageField6A           int64
CoverageField6B           int64
CoverageField8           object
CoverageField9           object
CoverageField11A          int64
CoverageField11B          int64
SalesField1A              int64
SalesField1B              int64
SalesField2A              int64
SalesField2B              int64
SalesField3               int64
SalesField4               int64
                         ...
GeographicField50B        int64
GeographicField51A        int64
GeographicField51B        int64
GeographicField52A        int64
GeographicField52B        int64
GeographicField53A        int64
GeographicField53B        int64
GeographicField54A        int64
GeographicField54B        int64
GeographicField55A        int64
GeographicField55B        int64
GeographicField56A        int64
GeographicField56B        int64
GeographicField57A        int64
GeographicField57B        int64
GeographicField58A        int64
GeographicField58B        int64
GeographicField59A        int64
GeographicField59B        int64
GeographicField60A        int64
GeographicField60B        int64
GeographicField61A        int64
GeographicField61B        int64
GeographicField62A        int64
GeographicField62B        int64
GeographicField63        object
GeographicField64        object
year                      int64
month                     int64
weekday                   int64
Length: 300, dtype: object
```
Check the DataFrame's summary info
```python
train.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260753 entries, 0 to 260752
Columns: 300 entries, QuoteConversion_Flag to weekday
dtypes: float64(6), int64(267), object(27)
memory usage: 596.8+ MB
```
Fill in missing values. The sentinel -999 is used here so the same value can later be passed to XGBoost via its `missing` parameter.
```python
train = train.fillna(-999)
test = test.fillna(-999)
```
Converting the categorical (object) columns
```python
from sklearn import preprocessing

# Label-encode every object column; fit on train + test combined so that
# both sets share the same label mapping
features = list(train.columns[1:])
for i in features:
    if train[i].dtype == 'object':
        le = preprocessing.LabelEncoder()
        le.fit(list(train[i].values) + list(test[i].values))
        train[i] = le.transform(list(train[i].values))
        test[i] = le.transform(list(test[i].values))
```
Setting the model parameters
```python
# Brute-force scan over all parameters; a few tricks:
# - max_depth is usually 6, 7, or 8
# - learning_rate is around 0.05, but small changes can make a big difference
# - tuning min_child_weight, subsample and colsample_bytree is much of the
#   fight against overfitting
# - n_estimators is the number of boosting rounds
# - finally, ensembling xgboost over multiple seeds may reduce variance
xgb_model = xgb.XGBClassifier()
parameters = {'nthread': [4],  # with hyperthreading, xgboost may become slower
              'objective': ['binary:logistic'],
              'learning_rate': [0.05, 0.1],  # the so-called `eta` value
              'max_depth': [6],
              'min_child_weight': [11],
              'silent': [1],
              'subsample': [0.8],
              'colsample_bytree': [0.7],
              'n_estimators': [5],  # number of trees; change to 1000 for better results
              'missing': [-999],
              'seed': [1337]}
```
```python
sfolder = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
clf = GridSearchCV(xgb_model, parameters, n_jobs=4,
                   cv=sfolder.split(train[features], train["QuoteConversion_Flag"]),
                   scoring='roc_auc', verbose=2, refit=True,
                   return_train_score=True)
clf.fit(train[features], train["QuoteConversion_Flag"])
```

```
Fitting 5 folds for each of 2 candidates, totalling 10 fits

[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:  2.4min finished

GridSearchCV(cv=<generator object _BaseKFold.split at 0x0000000018459888>,
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree',
              colsample_bylevel=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=4,
       param_grid={'nthread': [4], 'objective': ['binary:logistic'],
              'learning_rate': [0.05, 0.1], 'max_depth': [6],
              'min_child_weight': [11], 'silent': [1], 'subsample': [0.8],
              'colsample_bytree': [0.7], 'n_estimators': [5],
              'missing': [-999], 'seed': [1337]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='roc_auc', verbose=2)
```
```python
clf.grid_scores_
```

```
c:\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)

[mean: 0.94416, std: 0.00118, params: {'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 6, 'min_child_weight': 11, 'missing': -999, 'n_estimators': 5, 'nthread': 4, 'objective': 'binary:logistic', 'seed': 1337, 'silent': 1, 'subsample': 0.8},
 mean: 0.94589, std: 0.00120, params: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 6, 'min_child_weight': 11, 'missing': -999, 'n_estimators': 5, 'nthread': 4, 'objective': 'binary:logistic', 'seed': 1337, 'silent': 1, 'subsample': 0.8}]
```
```python
pd.DataFrame(clf.cv_results_['params'])
```
<div> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>colsample_bytree</th> <th>learning_rate</th> <th>max_depth</th> <th>min_child_weight</th> <th>missing</th> <th>n_estimators</th> <th>nthread</th> <th>objective</th> <th>seed</th> <th>silent</th> <th>subsample</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>0.7</td> <td>0.05</td> <td>6</td> <td>11</td> <td>-999</td> <td>5</td> <td>4</td> <td>binary:logistic</td> <td>1337</td> <td>1</td> <td>0.8</td> </tr> <tr> <th>1</th> <td>0.7</td> <td>0.10</td> <td>6</td> <td>11</td> <td>-999</td> <td>5</td> <td>4</td> <td>binary:logistic</td> <td>1337</td> <td>1</td> <td>0.8</td> </tr> </tbody> </table> </div>
```python
best_parameters, score, _ = max(clf.grid_scores_, key=lambda x: x[1])
print('Raw AUC score:', score)
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
```

```
Raw AUC score: 0.9458947562485674
colsample_bytree: 0.7
learning_rate: 0.1
max_depth: 6
min_child_weight: 11
missing: -999
n_estimators: 5
nthread: 4
objective: 'binary:logistic'
seed: 1337
silent: 1
subsample: 0.8

c:\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
```
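As the warning says, `grid_scores_` is removed in scikit-learn 0.20; the same information lives in `cv_results_` and, since we fit with `refit=True`, in the `best_params_` and `best_score_` attributes. A minimal equivalent of the report above using only the non-deprecated attributes:

```python
# Same report as above, via the non-deprecated GridSearchCV attributes
print('Raw AUC score:', clf.best_score_)
for param_name in sorted(clf.best_params_.keys()):
    print("%s: %r" % (param_name, clf.best_params_[param_name]))
```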
```python
# Predict conversion probabilities for the test set and write the submission
test_probs = clf.predict_proba(test[features])[:, 1]
sample = pd.read_csv('sample_submission.csv')
sample.QuoteConversion_Flag = test_probs
sample.to_csv("xgboost_best_parameter_submission.csv", index=False)
```
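The comment block above the parameter grid also suggests ensembling xgboost over multiple seeds to reduce variance. A minimal sketch of that idea, reusing the best grid values with the older xgboost sklearn wrapper used here (which accepts `seed`); the seed list is arbitrary and illustrative:

```python
# Average predicted probabilities over models that differ only in seed
seed_probs = []
for seed in [1337, 2017, 9999]:  # illustrative seeds
    model = xgb.XGBClassifier(objective='binary:logistic', learning_rate=0.1,
                              max_depth=6, min_child_weight=11, subsample=0.8,
                              colsample_bytree=0.7, n_estimators=5,
                              missing=-999, seed=seed)
    model.fit(train[features], train["QuoteConversion_Flag"])
    seed_probs.append(model.predict_proba(test[features])[:, 1])
ensemble_probs = np.mean(seed_probs, axis=0)
```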
```python
clf.best_estimator_.predict_proba(test[features])
```

```
array([[0.6988076 , 0.3011924 ],
       [0.6787684 , 0.3212316 ],
       [0.6797658 , 0.32023418],
       ...,
       [0.5018287 , 0.4981713 ],
       [0.6988076 , 0.3011924 ],
       [0.62464744, 0.37535256]], dtype=float32)
```
The 0.5 cutoff used below can be set per project according to the actual requirements; see the threshold-selection sketch after the confusion matrix.
```python
from sklearn import metrics

# Binarize both submissions at the 0.5 cutoff and measure how often they agree
keras_result = pd.read_csv('keras_nn_test.csv')
result1 = [1 if i > 0.5 else 0 for i in keras_result['QuoteConversion_Flag']]
xgb_result = pd.read_csv('xgboost_best_parameter_submission.csv')
result2 = [1 if i > 0.5 else 0 for i in xgb_result['QuoteConversion_Flag']]
metrics.accuracy_score(result1, result2)
```

```
0.8566004740099864
```
```python
metrics.confusion_matrix(result1, result2)
```

```
array([[148836,  24862],
       [    66,     72]], dtype=int64)
```
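As noted above, the 0.5 cutoff is adjustable. One possible way to choose it, sketched here against the training labels for lack of a labeled test set (a proper held-out validation split would be preferable), is to maximize F1 along the precision-recall curve:

```python
from sklearn.metrics import precision_recall_curve

# Probabilities from the refit best estimator on the (seen) training data
train_probs = clf.predict_proba(train[features])[:, 1]
precision, recall, thresholds = precision_recall_curve(
    train["QuoteConversion_Flag"], train_probs)

# precision/recall have one more entry than thresholds, hence the [:-1]
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_cutoff = thresholds[f1[:-1].argmax()]
print('suggested cutoff:', best_cutoff)
```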
Conclusions
- The quote date was preprocessed into year, month and weekday features.
- The categorical columns were converted with LabelEncoder. I think this deserves reconsideration: personally I suspect one-hot encoding should be used for the categorical features instead of LabelEncoder (an open question; see the sketch after this list).
- Label encoding is useful in some situations, but those situations are quite limited. Another example: given [dog, cat, dog, mouse, cat], we convert it to [1, 2, 1, 3, 2]. This produces a strange artifact: the average of dog and mouse is cat. So label encoding has not seen wide use.
- Applying the resulting model to the test set gives a raw AUC of 0.94, yet the corresponding accuracy is only about 85% (the figure above actually measures agreement with the keras model's predictions at the 0.5 cutoff). There is little real classification power: far too many quotes that are actually 0 get predicted as 1, i.e. the false-positive rate is too high, so the real-world conversion rate would not be high.
- Many tunable parameters of the model were left untouched; if you are interested in parameter tuning, see the example in Meituan's text-classification project.
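As mentioned in the list above, one-hot encoding is the usual alternative to LabelEncoder. A minimal sketch with `pd.get_dummies`, which would replace the label-encoding step earlier (i.e. it must run on the raw object columns, before they are turned into integers):

```python
# Concatenate so train and test end up with identical dummy columns;
# QuoteConversion_Flag is NaN for the test rows and is dropped there.
combined = pd.concat([train, test], keys=['train', 'test'])
object_cols = [c for c in combined.columns if combined[c].dtype == 'object']
combined = pd.get_dummies(combined, columns=object_cols)
train_ohe = combined.loc['train']
test_ohe = combined.loc['test'].drop('QuoteConversion_Flag', axis=1)
```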