Data Mining

0 Before You Start

  First, understand what each dataset means, and decide which dataset serves as the base table to be enriched with additional features, finally forming the training set.

  Next, check the required format of the prediction output: for a binary classification problem, does the submission require the final hard prediction (e.g. 0, 1) or a predicted probability (e.g. 5e-4)?
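
  For example, with a hypothetical fitted scikit-learn-style classifier clf and test features X_test (both names are assumptions for illustration), the two output formats come from different calls:

labels = clf.predict(X_test)               # hard labels, e.g. array([0, 1, 0, ...])
probs = clf.predict_proba(X_test)[:, 1]    # probability of the positive class, e.g. 5e-4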

  Most importantly, read the manual, to avoid mistakes caused by your own operations.

1 數據挖掘基礎操做

1.1 Inspecting a Table

import pandas as pd

p = pd.read_csv('../data/A/test_prof_bill.csv')
p.head()

  View the first five rows of the table to get a quick idea of what the data roughly contains and what the values look like.

p.info()

  Inspect the composition of the table in detail, including column names, the element types of each column, and the null-value situation.

import seaborn as sns
import matplotlib.pyplot as plt

color = sns.color_palette()
# distribution of the 標籤 (label) column: count each value and convert to a frequency
group_df = train_L.標籤.value_counts().reset_index()
k = group_df['標籤'].sum()
plt.figure(figsize=(12, 8))
sns.barplot(x=group_df['index'], y=group_df.標籤 / k, alpha=0.8, color=color[0])
print(group_df.標籤 / k)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Attributed', fontsize=12)
plt.title('Frequency of Attributed', fontsize=16)
plt.xticks(rotation='vertical')
plt.show()

  This gives an overview of the distribution of a single column in the table. It is mainly used to count the positive and negative samples in the data; based on the counts, a sampling ratio can be chosen.
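
  For instance, if the positive class is rare, the majority class can be downsampled to a chosen ratio before training. A minimal sketch, assuming train_L carries the binary 標籤 column used above and that a 5:1 negative-to-positive ratio is wanted (the ratio is only an illustration):

pos = train_L[train_L['標籤'] == 1]
neg = train_L[train_L['標籤'] == 0]
neg_sampled = neg.sample(n=len(pos) * 5, random_state=42)  # keep 5 negatives per positive
train_balanced = pd.concat([pos, neg_sampled]).sample(frac=1, random_state=42)  # shuffle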

1.2 Checking Each Base Feature's Influence on the Target

colormap = plt.cm.viridis
plt.figure(figsize=(16, 16))
plt.title('The Absolute Correlation Coefficient of Features', y=1.05, size=15)
sns.heatmap(abs(bg.astype(float).corr()),
            linewidths=0.1, vmax=1.0, square=True,
            cmap=colormap, linecolor='white', annot=True)
plt.show()

  This draws a heatmap using the absolute correlation coefficient as the measure (DataFrame.corr() computes Pearson correlation, not covariance).

 

1.3 Creating New Features

  Just one example: this data contains a behavior field with seven types, A through G. The proportion of occurrences of each of these seven behaviors has a noticeable influence on the target.

 

  Now count the occurrences of the seven behaviors per user: both how many times a user performs each behavior and the user's total number of behaviors:

count = p1.groupby(['用戶標識','行爲類型']).count()  # occurrences per (user, behavior type)
maxi = p1.groupby(['用戶標識','行爲類型']).max()    # per-group maximum of the remaining columns

  Then merge the two temporary tables:

merge = pd.merge(c, a, how='left', on='用戶標識')  # left join on the user id

  The computed probability of each behavior is then used as a feature:

import csv

with open('../data/A/behavier_ratio.csv', 'rt', encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    writefile = open('../data/A/behavier_analy.csv', 'w+', newline='')
    writer = csv.writer(writefile)
    flag = 1
    user = ''   # sentinel so the first row always starts a new user
    tmp = []
    l = []
    for raw in reader:
        if raw[0] != user:          # reached the rows of the next user
            flag = 0
            user = raw[0]
        if flag == 0:
            if len(l) != 0:
                writer.writerow(l)  # write out the previous user's finished row
            # one output row per user: [user id, ratio for each behavior type]
            l = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
            l[0] = raw[0]
            l[int(raw[1]) + 1] = raw[4]
            flag = 1
        else:
            l[int(raw[1]) + 1] = raw[4]
        tmp = l
writer.writerow(tmp)  # the last user's row still has to be written
writefile.close()
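
  As a cross-check, the same per-user ratios can also be computed directly in pandas instead of looping over the CSV. A minimal sketch, assuming p1 still holds the 用戶標識 and 行爲類型 columns used above:

# one row per user, one column per behavior type, values = share of that user's behaviors
ratio = pd.crosstab(p1['用戶標識'], p1['行爲類型'], normalize='index').reset_index()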

  Because of the data itself, null values appear during this transformation; they can be filled with brute force:

a.fillna(0,inplace=True)

  Note: without the inplace parameter, the fill is not applied to the original DataFrame in place; it only returns a filled copy.
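
  Equivalently, assign the returned copy back:

a = a.fillna(0)  # same effect as a.fillna(0, inplace=True)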

1.4 Dropping Useless Features

a = a.drop(['Unnamed: 0'], axis=1)  # drop the stray index column written by a to_csv without index=False

1.5 Saving and Loading Results

  正常來講,讀取這樣既可:

total.to_csv('../data/A/total_del.csv')
c = pd.read_csv('../data/A/count_del.csv')

  But you will sometimes run into tables that have no header row:

# names= supplies the column names when the file has no header row
b = pd.read_csv('../data/A/bankStatement_analy.csv', header=None,
                names=['用戶標識', 'type0_ratio', 'type1_ratio', 'type0_money', 'type1_money'])
a.to_csv('../data/A/test_3.csv', index=False)  # index=False avoids writing an extra index column

2 Model Selection and Tuning

  This time the xgboost model was chosen, and a Bayesian optimizer is used to search for the best parameters.

  As for the evaluation metric, this competition uses the KS value (the maximum of |TPR - FPR| over all thresholds); since I had always used the AUC metric, I lost quite a bit of ground in the early stages.

 

   對於貝葉斯優化器來講,默認的評價函數並無ks,所以須要本身實現:

from sklearn import metrics

def eval_ks(estimator, x, y):
    preds = estimator.predict_proba(x)  # predicted probabilities, one column per class
    preds = preds[:, 1]                 # keep the probability of the positive class
    # pass in the true labels and predicted scores; get back the FPR, TPR and
    # the thresholds that separate positive from negative predictions
    fpr, tpr, thresholds = metrics.roc_curve(y, preds)
    ks = 0
    for i in range(len(thresholds)):
        if abs(tpr[i] - fpr[i]) > ks:
            ks = abs(tpr[i] - fpr[i])   # KS = max |TPR - FPR| over all thresholds
    print('KS score = ', ks)
    return ks

  According to the official manual, a custom scoring function must take three arguments: the estimator, the features, and the true labels.
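
  Alternatively, scikit-learn's make_scorer can wrap a plain metric function. This is only a sketch of that option, not what was used here; needs_proba=True is the classic argument name (newer scikit-learn versions expose response_method instead):

from sklearn.metrics import make_scorer, roc_curve

def ks_stat(y_true, y_proba):
    # KS statistic: largest gap between TPR and FPR over all thresholds
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    return max(tpr - fpr)

ks_scorer = make_scorer(ks_stat, needs_proba=True)  # the scorer receives positive-class probabilities
# could then be passed as scoring=ks_scorer instead of scoring=eval_ks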

import pandas as pd
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

# SETTINGS - CHANGE THESE TO GET SOMETHING MEANINGFUL
ITERATIONS = 100 # 1000
TRAINING_SIZE = 100000 # 20000000
TEST_SIZE = 40000

# Load data
train = pd.read_csv(
    '../data/step2/train2_1.csv'
)

X = train.drop(['label'], axis=1)
Y = train['label']

bayes_cv_tuner = BayesSearchCV(
    estimator = xgb.XGBClassifier(
        n_jobs = 1,
        objective = 'binary:logistic',
        eval_metric = 'auc',
        silent=1,
        tree_method='approx'
    ),
    search_spaces = {
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'min_child_weight': (0, 5),
        'max_depth': (0, 50),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'reg_alpha': (1e-9, 1.0, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'n_estimators': (50, 100),
        'scale_pos_weight': (1e-6, 500, 'log-uniform')
    },
    scoring = eval_ks,   # the custom KS scorer defined above
    cv = StratifiedKFold(
        n_splits=3,
        shuffle=True,
        random_state=42
    ),
    n_jobs = 3,
    n_iter = ITERATIONS,
    verbose = 0,
    refit = True,
    random_state = 42
)

result = bayes_cv_tuner.fit(X.values, Y.values)

all_models = pd.DataFrame(bayes_cv_tuner.cv_results_)
best_params = pd.Series(bayes_cv_tuner.best_params_)
print('Model #{}\nBest KS: {}\nBest params: {}\n'.format(
    len(all_models),
    np.round(bayes_cv_tuner.best_score_, 4),
    bayes_cv_tuner.best_params_
))

# Save all model results
clf_name = bayes_cv_tuner.estimator.__class__.__name__
all_models.to_csv('../data/_cv_results.csv')

 

  The result of the search is the parameter configuration with the best score across the defined number of iterations; that configuration is then used to predict on the test set:

import csv

test = pd.read_csv('../data/B/test2_1.csv')
clf = xgb.XGBClassifier(
    colsample_bylevel=0.782142304086966, colsample_bytree=0.9019863190224396,
    gamma=0.0001491431487281734, learning_rate=0.1675067687563292,
    max_delta_step=3, max_depth=10, min_child_weight=4, n_estimators=76,
    reg_alpha=0.0026534914283041435, reg_lambda=211.46421106591836,
    scale_pos_weight=0.5414848749017023, subsample=0.8406121867576984
)
clf.fit(X, Y)
preds = clf.predict_proba(test)  # predict_proba gives probabilities; predict would give hard 0/1 labels
upload = pd.DataFrame()
upload['客戶號'] = test['用戶標識']
# the model outputs two columns: column 0 is the probability of label 0, column 1 of label 1;
# the task asks for the probability of label 1, so take the second column
upload['違約機率'] = preds[:, 1]
upload.to_csv('../data/A/upload.csv', index=False)

# rewrite the submission without the header row
with open('../data/A/upload.csv', 'rt', encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header
    writefile = open('../data/A/up.csv', 'w+', newline='')
    writer = csv.writer(writefile)
    for raw in reader:
        writer.writerow(raw)
writefile.close()
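
  The same headerless file can also be written in one step, skipping the copy loop above; a small alternative using pandas only:

upload.to_csv('../data/A/up.csv', index=False, header=False)  # header=False drops the column names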