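In an earlier post, Kaggle: Home Credit Default Risk Data Exploration and Visualization (1), I walked through the data exploration and visualization work in an excellent kernel from the Home Credit Default Risk competition. This post covers how that kernel approaches feature engineering. Original kernel: Start Here: A Gentle Introduction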
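Feature engineering is a general process that can involve both feature construction (adding new features from the existing data) and feature selection (choosing only the most important features, or applying other dimensionality reduction methods). There are many techniques we can use to both create and select features.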
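We will do a lot of feature engineering when we start using the other data sources, but in this post we will try only two simple feature construction methods: polynomial features and domain knowledge features.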
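One simple feature construction method is called polynomial features. With this method we make features that are powers of existing features as well as products of existing features. For example, we can create the variables EXT_SOURCE_1^2 and EXT_SOURCE_2^2, and also variables such as EXT_SOURCE_1 x EXT_SOURCE_2, EXT_SOURCE_1 x EXT_SOURCE_2^2, EXT_SOURCE_1^2 x EXT_SOURCE_2^2, and so on. Features that combine several individual variables are called interaction terms because they capture the interactions between variables. In other words, while two variables by themselves may not have a strong influence on the target, combining them into a single interaction variable might reveal a relationship with the target. Interaction terms are commonly used in statistical models to capture the effects of multiple variables, but I have not seen them used as often in machine learning. Nonetheless, we can try a few to see whether they help our model predict whether a client will repay a loan.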
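To make the idea of an interaction term concrete, here is a minimal sketch on synthetic data (not the Home Credit data). The target, the variable names, and the `pearson` helper are all assumptions made up for illustration: the target is 1 whenever two independent variables share a sign, so neither variable alone should correlate with it, while their product should.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)
n_samples = 20000

# Two independent synthetic variables (assumption: standard normal draws)
x1 = [random.gauss(0, 1) for _ in range(n_samples)]
x2 = [random.gauss(0, 1) for _ in range(n_samples)]

# Hypothetical target: 1 when the two variables share a sign, else 0
target = [1 if a * b > 0 else 0 for a, b in zip(x1, x2)]

# Interaction term: the product of the two variables
interaction = [a * b for a, b in zip(x1, x2)]

corr_x1 = pearson(x1, target)
corr_x2 = pearson(x2, target)
corr_interaction = pearson(interaction, target)

print('x1 vs target:      %.3f' % corr_x1)
print('x2 vs target:      %.3f' % corr_x2)
print('x1 * x2 vs target: %.3f' % corr_interaction)
```

This is a deliberately extreme case. On real data the gain from an interaction term is usually much smaller, which is why it has to be checked against the target rather than assumed.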
Jake VanderPlas covers polynomial features in his excellent book Python Data Science Handbook, for those who want more information.
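We will create polynomial features from the EXT_SOURCE variables and the DAYS_BIRTH variable. Scikit-Learn has a useful class called PolynomialFeatures that creates the polynomials and the interaction terms up to a specified degree. We use a degree of 3 to see the results (when creating polynomial features we want to avoid too high a degree, both because the number of features grows rapidly with the degree and because we can run into problems with overfitting).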
# Make a new dataframe for polynomial features
poly_features = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
# Imputer for handling missing values
# (SimpleImputer replaces sklearn.preprocessing.Imputer, which is gone from recent scikit-learn releases)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')
poly_target = poly_features['TARGET']
poly_features = poly_features.drop(columns = ['TARGET'])
# Need to impute missing values
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)
from sklearn.preprocessing import PolynomialFeatures
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 3)
# Train the polynomial features
poly_transformer.fit(poly_features)
# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)
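The second number in the printed shape can be worked out by hand. As a rough sketch using only the standard library (the helper name `count_polynomial_terms` is my own and not part of scikit-learn), the following enumerates every monomial of total degree 0 through d in four variables. That is the set of columns PolynomialFeatures produces with its default settings, including the constant bias column.

```python
from itertools import combinations_with_replacement
from math import comb


def count_polynomial_terms(n_features, degree):
    """Count all monomials of total degree 0..degree in n_features variables."""
    return sum(
        1
        for d in range(degree + 1)
        for _ in combinations_with_replacement(range(n_features), d)
    )


n_features = 4  # EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3, DAYS_BIRTH
counts = {degree: count_polynomial_terms(n_features, degree) for degree in range(1, 7)}

for degree, count in counts.items():
    # The closed form C(n_features + degree, degree) should agree with the enumeration
    assert count == comb(n_features + degree, degree)
    print('degree %d -> %d output columns' % (degree, count))
```

With four inputs the count roughly doubles with each extra degree at first, and the growth is much steeper when there are more input columns, which is the practical reason to keep the degree low.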
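This creates a considerable number of new features. To get their names, we use the transformer's get_feature_names_out method (called get_feature_names in older scikit-learn releases).
poly_transformer.get_feature_names_out(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])[:15]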
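There are 35 features, made up of the individual variables raised to powers up to degree 3 and the interaction terms. Now we can check whether any of these new features are correlated with the target.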
# Create a dataframe of the features
poly_features = pd.DataFrame(poly_features,
                             columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                                               'EXT_SOURCE_3', 'DAYS_BIRTH']))
# Add in the target
poly_features['TARGET'] = poly_target
# Find the correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values()
# Display most negative and most positive
print(poly_corrs.head(10))
print(poly_corrs.tail(5))
[Figure: sorted correlation coefficients of the original features]
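Several of the new variables have a greater correlation with the target (in terms of absolute magnitude) than the original features. When we build machine learning models, we can try with and without these features to determine whether they actually help the model learn.
We will add these features to a copy of the training and testing data and then evaluate models with and without them. Often in machine learning, the only way to know whether an approach will work is to try it out!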
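Because `sort_values()` orders by signed value, the strongest correlations end up split between the head and the tail of the series. A small sketch of ranking by absolute magnitude instead is shown here; the feature names and numbers are made up for illustration and are not results from the competition data.

```python
import pandas as pd

# Made-up correlation values for illustration; these are not results from the data
corrs = pd.Series({
    'feature_a': -0.19,
    'feature_b': 0.08,
    'feature_c': -0.02,
    'feature_d': 0.15,
    'TARGET': 1.0,
})

# Drop the target itself, then order by absolute correlation, strongest first
ranked = corrs.drop('TARGET').abs().sort_values(ascending=False)
print(ranked)
```

Applied to the real series, the same pattern would be `poly_corrs.drop('TARGET').abs().sort_values(ascending=False)`.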
# Put test features into dataframe
poly_features_test = pd.DataFrame(poly_features_test,
                                  columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                                                    'EXT_SOURCE_3', 'DAYS_BIRTH']))
# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')
# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')
# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)
# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape: ', app_test_poly.shape)
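One detail of the alignment step is easy to miss: `align` with `join = 'inner'` and `axis = 1` keeps only the columns present in both frames, so a column that exists only in the training data, such as TARGET, is dropped. The sketch below uses two toy frames with hypothetical values to show the effect and one way to keep the labels by saving them first.

```python
import pandas as pd

# Toy frames with hypothetical values; only the column layout matters here
train = pd.DataFrame({'SK_ID_CURR': [1, 2], 'EXT_SOURCE_1': [0.1, 0.2], 'TARGET': [0, 1]})
test = pd.DataFrame({'SK_ID_CURR': [3, 4], 'EXT_SOURCE_1': [0.3, 0.4]})

# Save the labels before aligning, because TARGET exists only in the training frame
train_labels = train['TARGET']

# Keep only the columns present in both frames
train_aligned, test_aligned = train.align(test, join='inner', axis=1)
columns_after_align = list(train_aligned.columns)
print(columns_after_align)
print(list(test_aligned.columns))

# Restore the labels on the training frame
train_aligned['TARGET'] = train_labels
print(list(train_aligned.columns))
```

If the aligned training frame is going to be used for model fitting, the labels need to be kept somewhere, either in a separate variable as here or by adding the column back after aligning.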
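Maybe it is not entirely correct to call this "domain knowledge", because I am not a credit expert, but perhaps we could call it "attempts at applying limited financial knowledge". In this frame of mind, we can make a few features that attempt to capture what we think may be important for telling whether a client will default on a loan. Here I use four features inspired by this script by Aguiar: CREDIT_INCOME_PERCENT (the credit amount relative to the client's income), ANNUITY_INCOME_PERCENT (the loan annuity relative to the client's income), CREDIT_TERM (the annuity relative to the credit amount), and DAYS_EMPLOYED_PERCENT (the days employed relative to the client's age in days).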
app_train_domain = app_train.copy()
app_test_domain = app_test.copy()
app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']
app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']
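The same four ratios are computed twice above, once for the training data and once for the test data. One possible way to keep the two in sync is to wrap them in a helper. This is only a sketch: the function name `add_domain_features` is my own, and the single applicant below uses made-up values rather than anything from the dataset.

```python
import pandas as pd


def add_domain_features(df):
    """Return a copy of df with the four ratio features added."""
    out = df.copy()
    out['CREDIT_INCOME_PERCENT'] = out['AMT_CREDIT'] / out['AMT_INCOME_TOTAL']
    out['ANNUITY_INCOME_PERCENT'] = out['AMT_ANNUITY'] / out['AMT_INCOME_TOTAL']
    out['CREDIT_TERM'] = out['AMT_ANNUITY'] / out['AMT_CREDIT']
    out['DAYS_EMPLOYED_PERCENT'] = out['DAYS_EMPLOYED'] / out['DAYS_BIRTH']
    return out


# One made-up applicant; the values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    'AMT_CREDIT': [400000.0],
    'AMT_INCOME_TOTAL': [200000.0],
    'AMT_ANNUITY': [20000.0],
    'DAYS_EMPLOYED': [-1000.0],
    'DAYS_BIRTH': [-10000.0],
})

toy_domain = add_domain_features(toy)
print(toy_domain.iloc[0])
```

With such a helper, `add_domain_features(app_train)` and `add_domain_features(app_test)` would give the same columns as the explicit code, and any new ratio only has to be written once.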
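We should explore these domain knowledge variables visually in a graph. For all of them, we will make the same KDE (kernel density estimate) plot, colored by the value of TARGET.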
plt.figure(figsize = (12, 20))
# iterate through the new features
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    # create a new subplot for each feature
    plt.subplot(4, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 0, feature], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 1, feature], label = 'target == 1')
    # Label the plots (the loop variable is feature, not source)
    plt.title('Distribution of %s by Target Value' % feature)
    plt.xlabel('%s' % feature); plt.ylabel('Density')
    # Show the legend explicitly, since some seaborn versions do not add it automatically
    plt.legend()
plt.tight_layout(h_pad = 2.5)