Source: Kaggle
Author: Will Koehrsen
Compiled and translated by: ronghuaiyang
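Kaggle's credit card default risk prediction competition is a valuable reference for anyone working in risk control or large-scale data mining. The walkthrough is detailed, well suited to beginners, and covers everything from data processing to model building. The article is long, so it is published in several installments. This is part four, which covers the conclusions and an introduction to LightGBM.
In this article we showed how to get started with a Kaggle machine learning competition. We first need to understand the data, the task, and the evaluation metric. We then performed EDA to look for relationships, trends, and anomalies in the data that could help with modeling. Along the way we encoded categorical features, imputed missing values, and scaled features to a common range. We also created new features from the existing data to see whether they help the model.
Once the data was prepared and the feature engineering was done, we implemented a baseline model and then built more complex models to beat it. We also ran experiments to measure the effect of the new variables we had added.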
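The preprocessing steps named above can be sketched without any libraries. This is a minimal, dependency-free illustration rather than the notebook's own code, which relies on pandas and scikit-learn. The helper names `one_hot`, `impute_median`, and `min_max_scale` and the toy values are assumptions made up for the example.

```python
from statistics import median


def one_hot(values, categories):
    """Encode each value as a 0/1 list over a fixed category order."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    fill = median(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


def min_max_scale(column, low=0.0, high=1.0):
    """Linearly rescale a column so its values span [low, high]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [low for _ in column]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]


# Toy data (hypothetical values, not taken from the competition files)
contract_type = ["Cash loans", "Revolving loans", "Cash loans"]
income = [100.0, None, 300.0, 200.0]

# Fix the category order once so every dataset gets the same columns
categories = sorted(set(contract_type))
encoded = one_hot(contract_type, categories)
filled = impute_median(income)
scaled = min_max_scale(filled)

print(encoded)  # [[1, 0], [0, 1], [1, 0]]
print(filled)   # [100.0, 200.0, 300.0, 200.0]
print(scaled)   # [0.0, 0.5, 1.0, 0.5]
```

In a real pipeline the median and the min/max would normally be computed on the training set only and then reused on the test set, so that no information leaks from the test data.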
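We followed the general outline of a machine learning project:
1. Understand the problem and the data
2. Clean and format the data
3. Exploratory data analysis
4. Baseline model
5. Improved model
6. Model interpretation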
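In machine learning competitions, whatever the problem, we generally care only about achieving the best score and pay little attention to interpretability. However, by understanding how our model makes its predictions, we can improve it by correcting the examples it gets wrong. In future work we will build more complex models, bring in more data, and improve our score.
What's Next
I will keep improving this project. The planned follow-up topics are:
Manual feature engineering, part one
Manual feature engineering, part two
Introduction to automated feature engineering
Advanced automated feature engineering
Feature selection
Introduction to model hyperparameter tuning: grid and random search
Feedback is welcome!
Just for Fun: Light Gradient Boosting Machine
Now we can try a real machine learning model: the gradient boosting machine from the LightGBM library. Gradient boosting is currently one of the leading approaches for structured data, especially on Kaggle. The code may look intimidating, but it is just a series of small steps that build up to a complete model. I added it to show how much potential this project still has: with this method we can get a somewhat better score. In future work we will use more advanced models together with more feature engineering and feature selection.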
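To make the idea of a gradient boosting machine concrete, here is a minimal pure-Python sketch. It is not LightGBM's algorithm, which grows histogram-based, leaf-wise trees and uses a binary log loss for this task. The sketch assumes a single numeric feature, squared-error loss, and depth-one trees (stumps), and it stops early when the validation loss has not improved for a number of rounds. The names `fit_stump`, `boost`, and `patience` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def mse(ys, preds):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(train, valid, n_estimators=200, learning_rate=0.1, patience=10):
    """Boost stumps on residuals; stop when validation loss stalls."""
    (x_tr, y_tr), (x_va, y_va) = train, valid
    base = sum(y_tr) / len(y_tr)  # initial prediction: the training mean
    pred_tr = [base] * len(x_tr)
    pred_va = [base] * len(x_va)
    stumps, best_loss, best_iter = [], mse(y_va, pred_va), 0
    for i in range(1, n_estimators + 1):
        # For squared loss the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(y_tr, pred_tr)]
        stump = fit_stump(x_tr, residuals)
        stumps.append(stump)
        pred_tr = [p + learning_rate * stump(x) for p, x in zip(pred_tr, x_tr)]
        pred_va = [p + learning_rate * stump(x) for p, x in zip(pred_va, x_va)]
        loss = mse(y_va, pred_va)
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:  # no improvement for `patience` rounds
            break
    kept = stumps[:best_iter]  # discard the rounds after the best iteration

    def predict(x):
        return base + learning_rate * sum(s(x) for s in kept)

    return predict, best_iter, best_loss


# Toy regression problem: y = x^2 on two interleaved grids
x_tr = [i / 10 for i in range(20)]
y_tr = [x ** 2 for x in x_tr]
x_va = [i / 10 + 0.05 for i in range(20)]
y_va = [x ** 2 for x in x_va]

predict, best_iter, best_loss = boost((x_tr, y_tr), (x_va, y_va))
print(best_iter, best_loss)
```

Each round fits a small tree to what the current ensemble still gets wrong and adds a damped copy of it, which is why a low learning rate usually needs many estimators and why a validation set is used to decide when to stop.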
In [55]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import gc

def model(features, test_features, encoding = 'ohe', n_folds = 5):
    """Train and test a light gradient boosting model using
    cross validation.

    Parameters
    --------
        features (pd.DataFrame):
            dataframe of training features to use
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame):
            dataframe of testing features to use
            for making predictions with the model.
        encoding (str, default = 'ohe'):
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds (int, default = 5): number of folds to use for cross validation

    Return
    --------
        submission (pd.DataFrame):
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame):
            dataframe with the feature importances from the model.
        valid_metrics (pd.DataFrame):
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.
    """

    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']

    # Extract the labels for training
    labels = features['TARGET']

    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])

    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)

        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)

        # No categorical indices to record
        cat_indices = 'auto'

    # Integer label encoding
    elif encoding == 'le':

        # Create a label encoder
        label_encoder = LabelEncoder()

        # List for storing categorical indices
        cat_indices = []

        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)

    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")

    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)

    # Extract feature names
    feature_names = list(features.columns)

    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)

    # Create the kfold object
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)

    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))

    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])

    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])

    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []

    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):

        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]

        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary',
                                   class_weight = 'balanced', learning_rate = 0.05,
                                   reg_alpha = 0.1, reg_lambda = 0.1,
                                   subsample = 0.8, n_jobs = -1, random_state = 50)

        # Train the model
        # Note: fit() accepts early_stopping_rounds and verbose only in LightGBM < 4.0;
        # newer releases expect callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)]
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  early_stopping_rounds = 100, verbose = 200)

        # Record the best iteration
        best_iteration = model.best_iteration_

        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits

        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits

        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]

        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']

        valid_scores.append(valid_score)
        train_scores.append(train_score)

        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()

    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})

    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})

    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)

    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))

    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')

    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores})

    return submission, feature_importances, metrics
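The bookkeeping in the function above (out-of-fold predictions for the training rows, averaged predictions for the test rows) is easier to see in isolation. The following is a library-free sketch under simplifying assumptions: `k_fold_indices` and `fit_mean_model` are hypothetical helpers, and the stand-in "model" just predicts the mean label of its training fold.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=50):
    """Shuffle the row indices and split them into n_folds validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_mean_model(labels):
    """Stand-in 'model': always predicts the mean label it was trained on."""
    mean = sum(labels) / len(labels)
    return lambda rows: [mean for _ in rows]


labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # toy training labels
test_rows = ["a", "b", "c"]               # toy test set
n_folds = 5

out_of_fold = [None] * len(labels)
test_predictions = [0.0] * len(test_rows)

for valid_idx in k_fold_indices(len(labels), n_folds):
    held_out = set(valid_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    predict = fit_mean_model([labels[i] for i in train_idx])

    # Each training row is predicted exactly once, by a model that never saw it
    for i, p in zip(valid_idx, predict(valid_idx)):
        out_of_fold[i] = p

    # Every fold's model contributes 1/n_folds of the final test prediction
    for j, p in enumerate(predict(test_rows)):
        test_predictions[j] += p / n_folds

print(out_of_fold)
print(test_predictions)
```

Because every out-of-fold prediction comes from a model that did not train on that row, scoring `out_of_fold` against the labels gives an estimate of generalization that uses the whole training set.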
In [56]:
submission, fi, metrics = model(app_train, app_test)
print('Baseline metrics')
print(metrics)
Training Data Shape: (307511, 239)
Testing Data Shape: (48744, 239)
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.754949 train's auc: 0.79887
Early stopping, best iteration is:
[208] valid's auc: 0.755109 train's auc: 0.80025
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.758539 train's auc: 0.798518
Early stopping, best iteration is:
[217] valid's auc: 0.758619 train's auc: 0.801374
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.762652 train's auc: 0.79774
[400] valid's auc: 0.762202 train's auc: 0.827288
Early stopping, best iteration is:
[320] valid's auc: 0.763103 train's auc: 0.81638
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.757496 train's auc: 0.799107
Early stopping, best iteration is:
[183] valid's auc: 0.75759 train's auc: 0.796125
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.758099 train's auc: 0.798268
Early stopping, best iteration is:
[227] valid's auc: 0.758251 train's auc: 0.802746
Baseline metrics
fold train valid
0 0 0.800250 0.755109
1 1 0.801374 0.758619
2 2 0.816380 0.763103
3 3 0.796125 0.757590
4 4 0.802746 0.758251
5 overall 0.803375 0.758537
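The numbers in this table are ROC AUC values: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force sketch of that definition follows (the function name `roc_auc` and the toy scores are made up, and scikit-learn's `roc_auc_score` is what the notebook actually calls). It also shows why the overall validation value, which the function computes on the pooled out-of-fold predictions, need not equal the mean of the per-fold values. In the table the pooled value is 0.758537 while the five folds average to about 0.758534, whereas the overall training value is the plain mean of the folds.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Two toy validation folds (hypothetical scores)
fold_a = ([0, 1], [0.2, 0.3])
fold_b = ([0, 1], [0.6, 0.9])

per_fold = [roc_auc(*fold_a), roc_auc(*fold_b)]
pooled = roc_auc(fold_a[0] + fold_b[0], fold_a[1] + fold_b[1])

print(per_fold)  # [1.0, 1.0]: each fold ranks its own positive first
print(pooled)    # 0.75: fold_b's negative outranks fold_a's positive
```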
In [57]:
fi_sorted = plot_feature_importances(fi)
In [58]:
submission.to_csv('baseline_lgb.csv', index = False)
This submission should score about 0.735 on the leaderboard. We will aim for higher scores in future work.
In [59]:
app_train_domain['TARGET'] = train_labels
# Test the domain knowledge features
submission_domain, fi_domain, metrics_domain = model(app_train_domain, app_test_domain)
print('Baseline with domain knowledge features metrics')
print(metrics_domain)
Training Data Shape: (307511, 243)
Testing Data Shape: (48744, 243)
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.762577 train's auc: 0.804531
Early stopping, best iteration is:
[237] valid's auc: 0.762858 train's auc: 0.810671
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.765594 train's auc: 0.804304
Early stopping, best iteration is:
[227] valid's auc: 0.765861 train's auc: 0.808665
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.770139 train's auc: 0.803753
[400] valid's auc: 0.770328 train's auc: 0.834338
Early stopping, best iteration is:
[302] valid's auc: 0.770629 train's auc: 0.820401
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.765653 train's auc: 0.804487
Early stopping, best iteration is:
[262] valid's auc: 0.766318 train's auc: 0.815066
Training until validation scores don't improve for 100 rounds.
[200] valid's auc: 0.764456 train's auc: 0.804527
Early stopping, best iteration is:
[235] valid's auc: 0.764517 train's auc: 0.810422
Baseline with domain knowledge features metrics
fold train valid
0 0 0.810671 0.762858
1 1 0.808665 0.765861
2 2 0.820401 0.770629
3 3 0.815066 0.766318
4 4 0.810422 0.764517
5 overall 0.813045 0.766050
In [60]:
fi_sorted = plot_feature_importances(fi_domain)
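Once again, the features we engineered earlier rank among the most important. This raises the question of whether the domain knowledge features actually improve this model's score.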
In [61]:
submission_domain.to_csv('baseline_lgb_domain_features.csv', index = False)
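This time the model scores 0.754, which shows that the domain features do improve the model and that feature engineering is a very important part of the process. (This is true of every machine learning problem!)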
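As a quick check on how consistent the improvement is, the per-fold validation AUCs printed by the two runs can be compared directly. The short sketch below only restates numbers already shown in the output tables and the two leaderboard scores quoted in the text; the variable names are made up.

```python
# Validation AUC per fold, copied from the two metrics tables
baseline_valid = [0.755109, 0.758619, 0.763103, 0.757590, 0.758251]
domain_valid = [0.762858, 0.765861, 0.770629, 0.766318, 0.764517]

# Gain from the domain knowledge features in each fold
gains = [round(d - b, 6) for b, d in zip(baseline_valid, domain_valid)]
mean_gain = sum(gains) / len(gains)

print(gains)
print(round(mean_gain, 6))
print(round(0.754 - 0.735, 3))  # leaderboard gain quoted in the text
```

Every fold improves by roughly 0.006 to 0.009 AUC, so the gain does not depend on one lucky split, and the leaderboard gain of about 0.019 points in the same direction.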
This article may be freely reposted. Please credit the author and link to the original when doing so.
This article was shared from the WeChat official account AI公園 (AI_Paradise).