Kaggle Competition: House Prices Prediction


Full code: see the Kaggle kernel and GitHub.


Competition page: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The gist of this competition: you are given 79 features and must predict the sale price (SalePrice). The features include both categorical and numerical variables, and there are plenty of missing values. Fortunately, the organizers provide data_description.txt, which explains what each feature means; once you understand it, most of the missing values can be imputed without trouble.

The first thing to do in any competition is to understand its evaluation metric; getting it wrong at the start can waste all the later effort. House Prices is scored with root mean squared error (RMSE) between the logarithm of the predicted price and the logarithm of the actual price, a common metric for regression problems:

\[\sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{N}}\]
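
Since the score is computed on log prices, the rest of this post models log(SalePrice) and exponentiates the final predictions. A minimal sketch of the metric (an illustrative helper, not part of the original kernel):

import numpy as np

def rmse_log(y_true, y_pred):
    # RMSE between the logs of the actual and predicted sale prices
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))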



My current score is 0.11421.


Two things contributed the most to my score:

  • Feature engineering: mainly rank-based encoding of categorical variables, feature combinations, and PCA
  • Model ensembling: mainly weighted averaging and stacking

Each of these is explained below.


Contents:

  1. Exploratory Visualization
  2. Data Cleaning
  3. Feature Engineering
  4. Basic Modeling & Evaluation
  5. Hyperparameter Tuning
  6. Ensemble Methods




Exploratory Visualization

Since there are many raw features, only the year built (YearBuilt) is visualized here:

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15, 8))
sns.boxplot(x=train.YearBuilt, y=train.SalePrice)

Newer houses are generally expected to be more expensive and older ones cheaper, and the plot broadly confirms that trend. Because YearBuilt takes many distinct values (from 1872 to 2010), one-hot encoding it would make the data overly sparse, so in the feature engineering step it is label-encoded (LabelEncoder) instead.
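
As a rough sketch of that encoding step (assuming full is the combined train/test DataFrame used in the later sections; the exact pipeline is in the full code):

from sklearn.preprocessing import LabelEncoder

# treat the year feature as an ordinal label rather than one-hot columns
full['YearBuilt'] = LabelEncoder().fit_transform(full['YearBuilt'])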






Data Cleaning

The main task here is handling missing values. First, count the missing values per feature:

# full is the combined train/test DataFrame (hence the 1459 missing SalePrice values from the test set)
aa = full.isnull().sum()
aa[aa>0].sort_values(ascending=False)
PoolQC          2908
MiscFeature     2812
Alley           2719
Fence           2346
SalePrice       1459
FireplaceQu     1420
LotFrontage      486
GarageQual       159
GarageCond       159
GarageFinish     159
GarageYrBlt      159
GarageType       157
BsmtExposure      82
BsmtCond          82
BsmtQual          81
BsmtFinType2      80
BsmtFinType1      79
MasVnrType        24
MasVnrArea        23
MSZoning           4
BsmtFullBath       2
BsmtHalfBath       2
Utilities          2
Functional         2
Electrical         1
BsmtUnfSF          1
Exterior1st        1
Exterior2nd        1
TotalBsmtSF        1
GarageCars         1
BsmtFinSF2         1
BsmtFinSF1         1
KitchenQual        1
SaleType           1
GarageArea         1

If you read data_description carefully, you will find that many of the missing values follow a pattern. For example, the first feature in the table above, PoolQC, describes pool quality; a missing value means the house has no pool at all, so it can be filled with "None".

All of the features below can be filled with "None":

cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col].fillna("None", inplace=True)


The features below mostly describe areas; for example, TotalBsmtSF is the total basement area. If a house has no basement, the missing value is filled with 0.

cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
    full[col].fillna(0, inplace=True)


LotFrontage correlates strongly with LotAreaCut and Neighborhood, so it is imputed with the median within groups defined by these two features.

full['LotFrontage']=full.groupby(['LotAreaCut','Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))
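
Note that LotAreaCut is not a raw column in the dataset; it is a binned version of LotArea created so that the imputation groups are not too granular. A minimal sketch (the 10-bin quantile cut is an assumption; see the full code for the actual binning):

import pandas as pd

# bin LotArea into quantiles so LotFrontage can be imputed per (lot-size bin, Neighborhood) group
full['LotAreaCut'] = pd.qcut(full['LotArea'], 10)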






Feature Engineering

Rank-based encoding of categorical variables

For categorical features, the usual move is pandas get_dummies, but in this competition that alone may not be enough. So the approach here is to group the data by each feature, compute the mean and median of SalePrice for each level, and then assign ordinal values based on that ranking. An example:

MSSubClass identifies the dwelling type. Group the data by it:

full.groupby(['MSSubClass'])[['SalePrice']].agg(['mean','median','count'])

Then assign ranks according to the resulting table:

'180': 1
'30': 2,  '45': 2
'190': 3, '50': 3, '90': 3
'85': 4,  '40': 4, '160': 4
'70': 5,  '20': 5, '75': 5, '80': 5, '150': 5
'120': 6, '60': 6

I ranked roughly twenty features this way; see the full code for the complete list.
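
A sketch of how one of these rankings is applied (the oMSSubClass column name follows the o-prefix convention seen in the combined features later, e.g. oMSZoning and oNeighborhood; the mapping itself comes from the table above):

# ordinal value for each MSSubClass level, ranked by its mean/median SalePrice
msubclass_rank = {'180': 1,
                  '30': 2, '45': 2,
                  '190': 3, '50': 3, '90': 3,
                  '85': 4, '40': 4, '160': 4,
                  '70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
                  '120': 6, '60': 6}
full['oMSSubClass'] = full['MSSubClass'].astype(str).map(msubclass_rank)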


Feature combinations

Combining raw features often yields surprising gains, but this dataset has too many raw features to try every combination. So Lasso is used first to select the more important features, and only those are combined.

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.001)
lasso.fit(X_scaled, y_log)
FI_lasso = pd.DataFrame({"Feature Importance": lasso.coef_}, index=data_pipe.columns)

# plot the non-zero coefficients as a rough ranking of feature importance
FI_lasso[FI_lasso["Feature Importance"] != 0].sort_values("Feature Importance").plot(kind="barh", figsize=(15, 25))
plt.xticks(rotation=90)
plt.show()


These are the features that were eventually added; they also reflect a number of other experiments along the way:

from sklearn.base import BaseEstimator, TransformerMixin

class add_feature(BaseEstimator, TransformerMixin):
    def __init__(self,additional=1):
        self.additional = additional
    
    def fit(self,X,y=None):
        return self
    
    def transform(self,X):
        if self.additional==1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]   
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
            
        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]   
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
            
            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["oMSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["oMSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["oMSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["oNeighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["oNeighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["oNeighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]
            
            X["-_oFunctional_TotalHouse"] = X["oFunctional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["oFunctional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["oCondition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["oCondition1"] + X["OverallQual"]
            
           
            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"]+X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"]+X["EnclosedPorch"]+X["3SsnPorch"]+X["ScreenPorch"]
            X["TotalPlace"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"] + X["OpenPorchSF"]+X["EnclosedPorch"]+X["3SsnPorch"]+X["ScreenPorch"]

        return X
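
A usage sketch (any additional value other than 1 switches on the full set of combinations; the o-prefixed ordinal columns such as oMSZoning are assumed to have been created by the ranking step above):

full = add_feature(additional=2).fit_transform(full)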


PCA

PCA is a very important step and contributed a lot to the final score. The newly added features are highly correlated with the original ones, which can lead to strong multicollinearity, and PCA removes exactly that correlation. Since the goal here is decorrelation rather than dimensionality reduction, n_components is kept close to the original number of dimensions; this came out of repeated experiments (add a certain number of features, then project back to roughly the same number of dimensions).

from sklearn.decomposition import PCA

pca = PCA(n_components=410)
X_scaled = pca.fit_transform(X_scaled)
test_X_scaled = pca.transform(test_X_scaled)






Basic Modeling & Evaluation

First, define a cross-validated RMSE evaluation function:

from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y):
    # cross_val_score returns negative MSE, so negate it and take the square root
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse


Thirteen algorithms are evaluated as baselines with 5-fold cross-validation (a sketch of the corresponding models list follows the list below):

  • LinearRegression

  • Ridge
  • Lasso
  • Random Forest
  • Gradient Boosting Tree
  • Support Vector Regression
  • Linear Support Vector Regression
  • ElasticNet
  • Stochastic Gradient Descent
  • BayesianRidge
  • KernelRidge
  • ExtraTreesRegressor
  • XgBoost

names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela","SGD","Bay","Ker","Extra","Xgb"]
for name, model in zip(names, models):
    score = rmse_cv(model, X_scaled, y_log)
    print("{}: {:.6f}, {:.4f}".format(name,score.mean(),score.std()))



The results are shown below. Overall the tree models underperform the linear models, probably because of the sparsity introduced by get_dummies; note that none of these models have been tuned yet.

LR: 1026870159.526766, 488528070.4534
Ridge: 0.117596, 0.0091
Lasso: 0.121474, 0.0060
RF: 0.140764, 0.0052
GBR: 0.124154, 0.0072
SVR: 0.112727, 0.0047
LinSVR: 0.121564, 0.0081
Ela: 0.111113, 0.0059
SGD: 0.159686, 0.0092
Bay: 0.110577, 0.0060
Ker: 0.109276, 0.0055
Extra: 0.136668, 0.0073
Xgb: 0.126614, 0.0070


Next, build a small helper for hyperparameter tuning. Keep in mind that the evaluation metric is RMSE, so the printed scores should be RMSE as well.

from sklearn.model_selection import GridSearchCV

class grid():
    def __init__(self,model):
        self.model = model
    
    def grid_get(self,X,y,param_grid):
        grid_search = GridSearchCV(self.model,param_grid,cv=5, scoring="neg_mean_squared_error")
        grid_search.fit(X,y)
        print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        print(pd.DataFrame(grid_search.cv_results_)[['params','mean_test_score','std_test_score']])


For example, tuning Lasso:

grid(Lasso()).grid_get(X_scaled,y_log,{'alpha': [0.0003,0.0002,0.0004,0.0005,0.0007,0.0006,0.0009,0.0008],'max_iter':[10000]})
{'max_iter': 10000, 'alpha': 0.0005} 0.111296607965
                                 params  mean_test_score  std_test_score
0  {'max_iter': 10000, 'alpha': 0.0003}         0.111869        0.001513
1  {'max_iter': 10000, 'alpha': 0.0002}         0.112745        0.001753
2  {'max_iter': 10000, 'alpha': 0.0004}         0.111463        0.001392
3  {'max_iter': 10000, 'alpha': 0.0005}         0.111297        0.001339
4  {'max_iter': 10000, 'alpha': 0.0007}         0.111538        0.001284
5  {'max_iter': 10000, 'alpha': 0.0006}         0.111359        0.001315
6  {'max_iter': 10000, 'alpha': 0.0009}         0.111915        0.001206
7  {'max_iter': 10000, 'alpha': 0.0008}         0.111706        0.001229


After many long rounds of testing, these six models were chosen:

lasso = Lasso(alpha=0.0005,max_iter=10000)
ridge = Ridge(alpha=60)
svr = SVR(gamma= 0.0004,kernel='rbf',C=13,epsilon=0.009)
ker = KernelRidge(alpha=0.2 ,kernel='polynomial',degree=3 , coef0=0.8)
ela = ElasticNet(alpha=0.005,l1_ratio=0.08,max_iter=10000)
bay = BayesianRidge()






Ensemble Methods

Weighted averaging

Average the individual models' predictions with given weights:

from sklearn.base import BaseEstimator, RegressorMixin, clone

class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self,mod,weight):
        self.mod = mod
        self.weight = weight
    
    def fit(self,X,y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X,y)
        return self

    def predict(self,X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        # for every data point, single model prediction times weight, then add them together
        for data in range(pred.shape[1]):
            single = [pred[model,data]*weight for model,weight in zip(range(pred.shape[0]),self.weight)]
            w.append(np.sum(single))
        return w

# w1 ... w6 are the weights assigned to the six models (the chosen values are not shown in the post)
weight_avg = AverageWeight(mod = [lasso,ridge,svr,ker,ela,bay],weight=[w1,w2,w3,w4,w5,w6])
score = rmse_cv(weight_avg,X_scaled,y_log)
print(score.mean())           # 0.10768459878025885

The score is 0.10768, better than any single model.

However, using only SVR and KernelRidge works even better; apparently the other models drag the average down.

weight_avg = AverageWeight(mod = [svr,ker],weight=[0.5,0.5])
score = rmse_cv(weight_avg,X_scaled,y_log)
print(score.mean())           # 0.10668349587195189


Stacking

The idea behind stacking:




With two-level stacking of this kind, say 5 models in the first level and 1 meta-model in the second, the job of the first-level models is to produce an \(\mathbb{R}^{n \times m}\) feature matrix that is used to train the second-level model, where n is the number of training rows and m is the number of first-level models.

from sklearn.model_selection import KFold

class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self,mod,meta_model):
        self.mod = mod
        self.meta_model = meta_model
        self.kf = KFold(n_splits=5, random_state=42, shuffle=True)
        
    def fit(self,X,y):
        self.saved_model = [list() for i in self.mod]
        oof_train = np.zeros((X.shape[0], len(self.mod)))
        
        for i,model in enumerate(self.mod):
            for train_index, val_index in self.kf.split(X,y):
                renew_model = clone(model)
                renew_model.fit(X[train_index], y[train_index])
                self.saved_model[i].append(renew_model)
                oof_train[val_index,i] = renew_model.predict(X[val_index])
        
        self.meta_model.fit(oof_train,y)
        return self
    
    def predict(self,X):
        # average the 5 fold-models' test predictions for each base model,
        # then feed the resulting matrix to the meta model
        whole_test = np.column_stack([np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
                                      for single_model in self.saved_model])
        return self.meta_model.predict(whole_test)
    
    def get_oof(self,X,y,test_X):
        oof = np.zeros((X.shape[0],len(self.mod)))
        test_single = np.zeros((test_X.shape[0],5))
        test_mean = np.zeros((test_X.shape[0],len(self.mod)))
        for i,model in enumerate(self.mod):
            for j, (train_index,val_index) in enumerate(self.kf.split(X,y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index],y[train_index])
                oof[val_index,i] = clone_model.predict(X[val_index])
                test_single[:,j] = clone_model.predict(test_X)
            test_mean[:,i] = test_single.mean(axis=1)
        return oof, test_mean


At first I used the get_oof method to extract the first-level feature matrix and concatenate it with the original features; the CV score dropped to 0.1018, yet the leaderboard score got worse, so this approach seems to overfit.

# a, b hold the training feature matrix and the log-transformed target (X_scaled and y_log from the earlier steps)
X_train_stack, X_test_stack = stack_model.get_oof(a,b,test_X_scaled)
X_train_add = np.hstack((a,X_train_stack))          
X_test_add = np.hstack((test_X_scaled,X_test_stack))
print(rmse_cv(stack_model,X_train_add,b).mean())    # 0.101824682747


For the final submission I used Lasso, Ridge, SVR, KernelRidge, ElasticNet, and BayesianRidge as first-level models and KernelRidge as the second-level meta-model.

stack_model = stacking(mod=[lasso,ridge,svr,ker,ela,bay],meta_model=ker)
stack_model.fit(a,b)
pred = np.exp(stack_model.predict(test_X_scaled))   # undo the log transform applied to SalePrice

result=pd.DataFrame({'Id':test.Id, 'SalePrice':pred})
result.to_csv("submission.csv",index=False)