First, understand the datasets: decide which dataset serves as the base table, then enrich it with additional features to build the final training set.
Next, check the required format of the submission: for a binary classification problem, is the final output a hard prediction (e.g. 0/1) or a predicted probability (e.g. 5e-4)?
Most importantly, read the manual carefully to avoid mistakes caused by your own missteps.
import pandas as pd

p = pd.read_csv('../data/A/test_prof_bill.csv')
p.head()
View the first five rows of the table to get a quick sense of its general content and some example values.
p.info()
Inspect the table's structure in detail, including column names, element dtypes, and null-value status.
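As a complement to info(), isnull().sum() gives an exact per-column null count; a minimal sketch on a toy frame (contents illustrative only):

```python
import pandas as pd

# Toy frame standing in for `p`; column names are illustrative only
p = pd.DataFrame({'a': [1, 2, None], 'b': ['x', 'y', 'z']})

nulls = p.isnull().sum()  # null count per column
print(nulls)
```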
import os
import seaborn as sns
import matplotlib.pyplot as plt

color = sns.color_palette()

# Per-class counts of the label column
group_df = train_L.標籤.value_counts().reset_index()
k = group_df['標籤'].sum()

plt.figure(figsize=(12, 8))
sns.barplot(x=group_df['index'], y=group_df.標籤 / k, alpha=0.8, color=color[0])
print(group_df.標籤 / k)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Attributed', fontsize=12)
plt.title('Frequency of Attributed', fontsize=16)
plt.xticks(rotation='vertical')
plt.show()
This gives an overview of the distribution of one column. The method is mainly used to count the positive and negative samples in the data; a sampling ratio can then be chosen based on the result.
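The per-class frequencies plotted above can also be read straight off value_counts(normalize=True); a minimal sketch on a hypothetical label series:

```python
import pandas as pd

# Hypothetical label series: four negatives, one positive
labels = pd.Series([0, 0, 0, 0, 1])

freq = labels.value_counts(normalize=True)  # per-class frequency
pos_ratio = freq.get(1, 0.0)
neg_ratio = freq.get(0, 0.0)
print(pos_ratio, neg_ratio)
```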
colormap = plt.cm.viridis

plt.figure(figsize=(16, 16))
plt.title('The Absolute Correlation Coefficient of Features', y=1.05, size=15)
sns.heatmap(abs(bg.astype(float).corr()), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()
Plot the visualisation using the correlation coefficient as the metric (the code above uses corr(), i.e. Pearson correlation, taken in absolute value).
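The same correlation matrix can also be queried programmatically to list the highly correlated pairs instead of reading them off the heatmap; a minimal sketch on a toy frame (column names illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `bg`
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 8],   # perfectly correlated with 'a'
    'c': [4, 1, 3, 2],
})

corr = df.astype(float).corr().abs()
# Keep the upper triangle only so each pair is reported once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high = pairs[pairs > 0.95]
print(high)
```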
To illustrate with a single example: this dataset contains a behaviour field with seven categories, A through G, and the relative frequencies of these seven behaviours have a noticeable effect on the result.
Now count the occurrences of these seven behaviours per user: both the number of occurrences of each individual behaviour and the user's total number of behaviours.
count = p1.groupby(['用戶標識', '行爲類型']).count()  # occurrences per user per behaviour type
maxi = p1.groupby(['用戶標識', '行爲類型']).max()
Then merge the two temporary tables:
merge = pd.merge(c, a, how='left', on='用戶標識')  # left join
The computed probability of each behaviour is used as a feature:
import csv

with open('../data/A/behavier_ratio.csv', 'rt', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    next(csvfile)  # skip the header row
    writefile = open('../data/A/behavier_analy.csv', 'w+', newline='')
    writer = csv.writer(writefile)
    flag = 1
    user = '1'
    tmp = []
    l = []
    for raw in reader:
        if raw[0] != user:
            flag = 0
            user = raw[0]
        if flag == 0:
            if len(l) != 0:
                writer.writerow(l)  # flush the previous user's row
            # one slot for the user id plus one per behaviour type
            l = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
            l[0] = raw[0]
            l[int(raw[1]) + 1] = raw[4]
            flag = 1
        else:
            l[int(raw[1]) + 1] = raw[4]
        tmp = l
    writer.writerow(tmp)  # write the last user's row
    writefile.close()
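The row-by-row CSV rewrite above can also be expressed as a pandas pivot, which avoids the manual user-change bookkeeping; a minimal sketch on a toy frame with the same 用戶標識/行爲類型 columns (data illustrative only):

```python
import pandas as pd

# Toy stand-in for p1: one row per behaviour event
p1 = pd.DataFrame({
    '用戶標識': [1, 1, 1, 2, 2],
    '行爲類型': [0, 0, 1, 1, 1],
})

# Count events per user per behaviour type, one column per type
counts = p1.groupby(['用戶標識', '行爲類型']).size().unstack(fill_value=0)
# Divide by each user's total to get per-user behaviour frequencies
ratios = counts.div(counts.sum(axis=1), axis=0)
print(ratios)
```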
Because of the data itself, null values can appear during the transformation; they can be filled crudely:
a.fillna(0,inplace=True)
Note: without the inplace parameter, fillna does not fill the original DataFrame; it returns a filled copy instead.
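The difference can be checked directly: without inplace=True, fillna returns a filled copy and leaves the original untouched:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan]})

b = a.fillna(0)            # returns a filled copy; `a` is unchanged
still_has_nan = a['x'].isna().any()

a.fillna(0, inplace=True)  # fills `a` itself
print(still_has_nan, a['x'].tolist(), b['x'].tolist())
```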
a = a.drop(['Unnamed: 0'], axis=1)  # drop the stray index column
正常來講,讀取這樣既可:
total.to_csv('../data/A/total_del.csv')
c = pd.read_csv('../data/A/count_del.csv')
But sometimes the table has no header row:
b = pd.read_csv('../data/A/bankStatement_analy.csv', header=None,
                names=['用戶標識', 'type0_ratio', 'type1_ratio', 'type0_money', 'type1_money'])
a.to_csv('../data/A/test_3.csv', index=False)
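A quick self-contained check of header=None plus names, using an in-memory file instead of the paths above:

```python
import io
import pandas as pd

# Headerless CSV content standing in for the file on disk
raw = "1,0.5,0.5\n2,1.0,0.0\n"

b = pd.read_csv(io.StringIO(raw), header=None,
                names=['用戶標識', 'type0_ratio', 'type1_ratio'])
print(b.columns.tolist(), b.shape)
```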
This time an XGBoost model is used, with a Bayesian optimiser to select the best parameters:
As for the evaluation metric, this competition uses the KS statistic; having always used the AUC metric before, I lost quite a bit of ground early on because of this.
對於貝葉斯優化器來講,默認的評價函數並無ks,所以須要本身實現:
from sklearn import metrics

def eval_ks(estimator, x, y):
    preds = estimator.predict_proba(x)  # predicted probabilities
    preds = preds[:, 1]
    # roc_curve takes the true labels and predicted scores, and returns the
    # false-positive rates, true-positive rates, and decision thresholds
    fpr, tpr, thresholds = metrics.roc_curve(y, preds)
    ks = 0
    for i in range(len(thresholds)):
        if abs(tpr[i] - fpr[i]) > ks:
            ks = abs(tpr[i] - fpr[i])
    print('KS score = ', ks)
    return ks
The official manual shows that a custom scoring function must accept three arguments: the estimator, the features, and the true labels.
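The threshold loop in eval_ks is equivalent to a one-line vectorised maximum over the ROC points; a quick check on toy labels (data illustrative only):

```python
import numpy as np
from sklearn import metrics

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_prob)
ks = np.max(np.abs(tpr - fpr))  # vectorised form of the loop in eval_ks
print(ks)
```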
import pandas as pd
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold
# SETTINGS - CHANGE THESE TO GET SOMETHING MEANINGFUL
ITERATIONS = 100 # 1000
TRAINING_SIZE = 100000 # 20000000
TEST_SIZE = 40000
# Load data
train = pd.read_csv(
'../data/step2/train2_1.csv'
)
X = train.drop(['label'], axis=1)
Y = train['label']
bayes_cv_tuner = BayesSearchCV(
estimator = xgb.XGBClassifier(
n_jobs = 1,
objective = 'binary:logistic',
eval_metric = 'auc',
silent=1,
tree_method='approx'
),
search_spaces = {
'learning_rate': (0.01, 1.0, 'log-uniform'),
'max_depth': (0, 50),
'max_delta_step': (0, 20),
'subsample': (0.01, 1.0, 'uniform'),
'colsample_bytree': (0.01, 1.0, 'uniform'),
'colsample_bylevel': (0.01, 1.0, 'uniform'),
'reg_lambda': (1e-9, 1000, 'log-uniform'),
'reg_alpha': (1e-9, 1.0, 'log-uniform'),
'gamma': (1e-9, 0.5, 'log-uniform'),
'min_child_weight': (0, 5),
'n_estimators': (50, 100),
'scale_pos_weight': (1e-6, 500, 'log-uniform')
},
scoring = eval_ks,
cv = StratifiedKFold(
n_splits=3,
shuffle=True,
random_state=42
),
n_jobs = 3,
n_iter = ITERATIONS,
verbose = 0,
refit = True,
random_state = 42
)
result = bayes_cv_tuner.fit(X.values, Y.values)
all_models = pd.DataFrame(bayes_cv_tuner.cv_results_)
best_params = pd.Series(bayes_cv_tuner.best_params_)
print('Model #{}\nBest KS: {}\nBest params: {}\n'.format(
len(all_models),
np.round(bayes_cv_tuner.best_score_, 4),
bayes_cv_tuner.best_params_
))
# Save all model results
clf_name = bayes_cv_tuner.estimator.__class__.__name__
all_models.to_csv('../data/_cv_results.csv')
The result of training is the best-scoring parameter configuration found within the defined number of iterations; apply that configuration to predict on the test set:
import csv

test = pd.read_csv('../data/B/test2_1.csv')

clf = xgb.XGBClassifier(colsample_bylevel=0.782142304086966,
                        colsample_bytree=0.9019863190224396,
                        gamma=0.0001491431487281734,
                        learning_rate=0.1675067687563292,
                        max_delta_step=3, max_depth=10,
                        min_child_weight=4, n_estimators=76,
                        reg_alpha=0.0026534914283041435,
                        reg_lambda=211.46421106591836,
                        scale_pos_weight=0.5414848749017023,
                        subsample=0.8406121867576984)
clf.fit(X, Y)
preds = clf.predict_proba(test)  # returns probabilities; predict() would return 0/1 labels

upload = pd.DataFrame()
upload['客戶號'] = test['用戶標識']
# predict_proba returns two columns: the probability of class 0 and of class 1;
# per the task, the class-1 probability is used as the final score
upload['違約機率'] = preds[:, 1]
upload.to_csv('../data/A/upload.csv', index=False)

# Copy the file while dropping its header row
with open('../data/A/upload.csv', 'rt', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    next(csvfile)
    writefile = open('../data/A/up.csv', 'w+', newline='')
    writer = csv.writer(writefile)
    for raw in reader:
        writer.writerow(raw)
    writefile.close()