Introduction to Kaggle Competitions: Home Credit Default Risk (Part 5)



This article is taken from a Kaggle kernel.

Author: Will Koehrsen

Translated by: ronghuaiyang

Kaggle's credit default risk prediction competition is a very valuable reference; anyone working in risk control or data mining will find it worth studying. It is detailed, well suited to beginners, and covers everything from data processing to model building. Because the article is long, it is being published in several installments; this is the fifth, covering the conclusion and an introduction to LightGBM.

Conclusion

本文中,咱們展現瞭如何開始一個Kaggle的機器學習競賽。咱們須要首先理解數據,咱們的任務,還有評估的方法。而後咱們會進行EDA來觀察數據相互之間的關係,趨勢或者異常值等來幫助咱們建模。在這個過程當中,咱們會使用類別特徵編碼,插補缺失值,縮放數據到一個範圍。咱們還會從已有的數據找那個建立新的特徵,看看是否是對模型有幫助。ide

Once the data was prepared and the feature engineering done, we implemented a baseline model and then built more complex models to beat it. We also ran experiments to measure the effect of the new variables we added.

We followed the general outline of a machine learning project:

  1. Understand the problem and the data

  2. Data cleaning and formatting

  3. Exploratory data analysis

  4. Baseline model

  5. Improved models

  6. Model interpretation

Machine learning competitions vary little from problem to problem: we usually focus only on achieving the best result and pay little attention to interpretability. However, by understanding how our model makes its predictions, we can try to improve it by correcting the examples it gets wrong. In the future we will build more complex models, look at more of the data, and improve our score.

What's Next

I will keep improving this project; some of the upcoming installments are:

  • Manual feature engineering, part 1

  • Manual feature engineering, part 2

  • An introduction to automated feature engineering

  • Advanced automated feature engineering

  • Feature selection

  • An introduction to model hyperparameter tuning: grid and random search

Feedback is welcome!

Just for Fun: Light Gradient Boosting Machine

Now we can try an actual machine learning model: the gradient boosting machine from LightGBM. This method is currently state of the art for structured data, especially on Kaggle. Although the code below may look intimidating, it is just a sequence of steps for building the model. I added it to show how much potential this project still has: this method alone earns a somewhat better score. Later on we will see more advanced models, more feature engineering, and feature selection.

In [55]:

import gc

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

def model(features, test_features, encoding = 'ohe', n_folds = 5):
   
   """Train and test a light gradient boosting model using
   cross validation.
   
   Parameters
   --------
       features (pd.DataFrame):
           dataframe of training features to use
           for training a model. Must include the TARGET column.
       test_features (pd.DataFrame):
           dataframe of testing features to use
           for making predictions with the model.
       encoding (str, default = 'ohe'):
           method for encoding categorical variables. Either 'ohe'
           for one-hot encoding or 'le' for integer label encoding.
       n_folds (int, default = 5):
           number of folds to use for cross validation.
       
   Returns
   --------
       submission (pd.DataFrame):
           dataframe with `SK_ID_CURR` and `TARGET` probabilities
           predicted by the model.
       feature_importances (pd.DataFrame):
           dataframe with the feature importances from the model.
       valid_metrics (pd.DataFrame):
           dataframe with training and validation metrics (ROC AUC) for each fold and overall.
       
   """
   
   # Extract the ids
   train_ids = features['SK_ID_CURR']
   test_ids = test_features['SK_ID_CURR']
   
   # Extract the labels for training
   labels = features['TARGET']
   
   # Remove the ids and target
   features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
   test_features = test_features.drop(columns = ['SK_ID_CURR'])
   
   
   # One Hot Encoding
   if encoding == 'ohe':
       features = pd.get_dummies(features)
       test_features = pd.get_dummies(test_features)
       
       # Align the dataframes by the columns
       features, test_features = features.align(test_features, join = 'inner', axis = 1)
       
       # No categorical indices to record
       cat_indices = 'auto'
   
   # Integer label encoding
   elif encoding == 'le':
       
       # Create a label encoder
       label_encoder = LabelEncoder()
       
       # List for storing categorical indices
       cat_indices = []
       
       # Iterate through each column
       for i, col in enumerate(features):
           if features[col].dtype == 'object':
               # Map the categorical features to integers
               features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
               test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

               # Record the categorical indices
               cat_indices.append(i)
   
   # Catch error if label encoding scheme is not valid
   else:
       raise ValueError("Encoding must be either 'ohe' or 'le'")
       
   print('Training Data Shape: ', features.shape)
   print('Testing Data Shape: ', test_features.shape)
   
   # Extract feature names
   feature_names = list(features.columns)
   
   # Convert to np arrays
   features = np.array(features)
   test_features = np.array(test_features)
   
   # Create the kfold object
   k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
   
   # Empty array for feature importances
   feature_importance_values = np.zeros(len(feature_names))
   
   # Empty array for test predictions
   test_predictions = np.zeros(test_features.shape[0])
   
   # Empty array for out of fold validation predictions
   out_of_fold = np.zeros(features.shape[0])
   
   # Lists for recording validation and training scores
   valid_scores = []
   train_scores = []
   
   # Iterate through each fold
   for train_indices, valid_indices in k_fold.split(features):
       
       # Training data for the fold
       train_features, train_labels = features[train_indices], labels[train_indices]
       # Validation data for the fold
       valid_features, valid_labels = features[valid_indices], labels[valid_indices]
       
       # Create the model
       model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary',
                                  class_weight = 'balanced', learning_rate = 0.05,
                                  reg_alpha = 0.1, reg_lambda = 0.1,
                                  subsample = 0.8, n_jobs = -1, random_state = 50)
       
       # Train the model
       model.fit(train_features, train_labels, eval_metric = 'auc',
                 eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                 eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                 early_stopping_rounds = 100, verbose = 200)
       
       # Record the best iteration
       best_iteration = model.best_iteration_
       
       # Record the feature importances
       feature_importance_values += model.feature_importances_ / k_fold.n_splits
       
       # Make predictions
       test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
       
       # Record the out of fold predictions
       out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
       
       # Record the best score
       valid_score = model.best_score_['valid']['auc']
       train_score = model.best_score_['train']['auc']
       
       valid_scores.append(valid_score)
       train_scores.append(train_score)
       
       # Clean up memory
       gc.enable()
       del model, train_features, valid_features
       gc.collect()
       
   # Make the submission dataframe
   submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
   
   # Make the feature importance dataframe
   feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
   
   # Overall validation score
   valid_auc = roc_auc_score(labels, out_of_fold)
   
   # Add the overall scores to the metrics
   valid_scores.append(valid_auc)
   train_scores.append(np.mean(train_scores))
   
   # Needed for creating dataframe of validation scores
   fold_names = list(range(n_folds))
   fold_names.append('overall')
   
   # Dataframe of validation scores
   metrics = pd.DataFrame({'fold': fold_names,
                           'train': train_scores,
                           'valid': valid_scores})
   
   return submission, feature_importances, metrics

In [56]:

submission, fi, metrics = model(app_train, app_test)
print('Baseline metrics')
print(metrics)
Training Data Shape:  (307511, 239)
Testing Data Shape:  (48744, 239)
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.754949   train's auc: 0.79887
Early stopping, best iteration is:
[208]   valid's auc: 0.755109   train's auc: 0.80025
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.758539   train's auc: 0.798518
Early stopping, best iteration is:
[217]   valid's auc: 0.758619   train's auc: 0.801374
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.762652   train's auc: 0.79774
[400]   valid's auc: 0.762202   train's auc: 0.827288
Early stopping, best iteration is:
[320]   valid's auc: 0.763103   train's auc: 0.81638
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.757496   train's auc: 0.799107
Early stopping, best iteration is:
[183]   valid's auc: 0.75759    train's auc: 0.796125
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.758099   train's auc: 0.798268
Early stopping, best iteration is:
[227]   valid's auc: 0.758251   train's auc: 0.802746
Baseline metrics
     fold     train     valid
0        0  0.800250  0.755109
1        1  0.801374  0.758619
2        2  0.816380  0.763103
3        3  0.796125  0.757590
4        4  0.802746  0.758251
5  overall  0.803375  0.758537

In [57]:

fi_sorted = plot_feature_importances(fi)
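The plot_feature_importances helper was defined in an earlier installment of this series. For readers starting here, a minimal sketch of an equivalent function, assuming only that fi has 'feature' and 'importance' columns as returned by model above, might look like this:

import matplotlib.pyplot as plt

def plot_feature_importances(df):
    """Plot the 15 largest feature importances and return the sorted frame."""
    # Sort by importance and normalize so the importances sum to one
    df = df.sort_values('importance', ascending=False).reset_index(drop=True)
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Horizontal bar chart with the most important feature on top
    plt.figure(figsize=(10, 6))
    plt.barh(df['feature'].head(15)[::-1],
             df['importance_normalized'].head(15)[::-1],
             align='center', edgecolor='k')
    plt.xlabel('Normalized Importance')
    plt.title('Feature Importances')
    plt.show()
    return df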

In [58]:

submission.to_csv('baseline_lgb.csv', index = False)

This submission should score around 0.735 on the leaderboard; we will improve on that later.

In [59]:

app_train_domain['TARGET'] = train_labels

# Test the domain knowledge features
submission_domain, fi_domain, metrics_domain = model(app_train_domain, app_test_domain)
print('Baseline with domain knowledge features metrics')
print(metrics_domain)
Training Data Shape:  (307511, 243)
Testing Data Shape:  (48744, 243)
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.762577   train's auc: 0.804531
Early stopping, best iteration is:
[237]   valid's auc: 0.762858   train's auc: 0.810671
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.765594   train's auc: 0.804304
Early stopping, best iteration is:
[227]   valid's auc: 0.765861   train's auc: 0.808665
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.770139   train's auc: 0.803753
[400]   valid's auc: 0.770328   train's auc: 0.834338
Early stopping, best iteration is:
[302]   valid's auc: 0.770629   train's auc: 0.820401
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.765653   train's auc: 0.804487
Early stopping, best iteration is:
[262]   valid's auc: 0.766318   train's auc: 0.815066
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.764456   train's auc: 0.804527
Early stopping, best iteration is:
[235]   valid's auc: 0.764517   train's auc: 0.810422
Baseline with domain knowledge features metrics
     fold     train     valid
0        0  0.810671  0.762858
1        1  0.808665  0.765861
2        2  0.820401  0.770629
3        3  0.815066  0.766318
4        4  0.810422  0.764517
5  overall  0.813045  0.766050

In [60]:

fi_sorted = plot_feature_importances(fi_domain)

Once again we see the features we engineered earlier among the most important. Seeing this, we might wonder whether the domain knowledge features also help with this method.
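As a reminder of what those were (a sketch recalling the earlier installment; the training and testing shapes printed above confirm that four columns were added), the domain knowledge features were simple ratios built from existing columns:

# Four hand-built ratio features from the earlier installment
# (shown here as a sketch, not re-run in this notebook)
app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']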

In [61]:

submission_domain.to_csv('baseline_lgb_domain_features.csv', index = False)

This time our model scores about 0.754 when submitted, which shows that the domain knowledge features do improve the model. Feature engineering really is one of the most important parts of the process (as it is in every machine learning problem!).

This article may be freely reposted; when doing so, please credit the author and link to the original.




This article is shared from the WeChat public account AI公園 (AI_Paradise).