Competition page: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The overall task is to predict the house sale price (SalePrice) from 79 explanatory features, which include both categorical and numerical variables and contain a large number of missing values. Fortunately the organizers provide data_description.txt, which explains the meaning of each feature; once you understand it, most of the missing values can be imputed sensibly.
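The code in this post assumes the train and test CSVs have been loaded and concatenated into a single frame named `full` (the variable used throughout). A minimal sketch of that setup, with the file paths as assumptions:

```python
import numpy as np
import pandas as pd

# Assumed file locations; adjust to wherever the Kaggle data lives.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Concatenate train and test so imputation and encoding are applied consistently.
# The test set has no SalePrice column, so those entries become NaN in `full`.
full = pd.concat([train, test], ignore_index=True)
print(full.shape)
```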
The first thing to do in any competition is to understand its evaluation metric; get that wrong at the start and all the later work may be wasted. House Prices is evaluated with the root mean squared error (RMSE), a common metric for regression problems:
\[\sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y_i})^2}{N}}\]
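As the later code suggests (training on y_log and applying np.exp to the final predictions), the target is modeled on the log scale, so in practice the metric is the RMSE between log prices. A minimal sketch of computing it with NumPy (the toy arrays below are made up purely for illustration):

```python
import numpy as np

# Toy values purely for illustration.
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# RMSE applied to log targets, i.e. the formula above with y replaced by log(y).
rmse = np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))
print(rmse)
```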
My current score is 0.11421.
Two things contributed the most to improving my score; both are explained in the sections below.
Since there are many raw features, only the construction year (YearBuilt) is visualized here:
```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 8))
sns.boxplot(train.YearBuilt, train.SalePrice)
```
New houses are generally more expensive than old ones, and the plot roughly shows that trend. Since YearBuilt takes many distinct values (from 1872 to 2010), one-hot encoding it would produce overly sparse data, so in feature engineering it is label-encoded (LabelEncoder) instead.
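A minimal sketch of what that label encoding might look like (the use of scikit-learn's LabelEncoder in exactly this way is an assumption, not the post's exact code):

```python
from sklearn.preprocessing import LabelEncoder

# Map each distinct construction year to an integer code instead of one-hot columns.
full["YearBuilt"] = LabelEncoder().fit_transform(full["YearBuilt"])
```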
The main work in this step is handling missing values. First, count the missing values per feature:
```python
aa = full.isnull().sum()
aa[aa > 0].sort_values(ascending=False)
```
```
PoolQC          2908
MiscFeature     2812
Alley           2719
Fence           2346
SalePrice       1459
FireplaceQu     1420
LotFrontage      486
GarageQual       159
GarageCond       159
GarageFinish     159
GarageYrBlt      159
GarageType       157
BsmtExposure      82
BsmtCond          82
BsmtQual          81
BsmtFinType2      80
BsmtFinType1      79
MasVnrType        24
MasVnrArea        23
MSZoning           4
BsmtFullBath       2
BsmtHalfBath       2
Utilities          2
Functional         2
Electrical         1
BsmtUnfSF          1
Exterior1st        1
Exterior2nd        1
TotalBsmtSF        1
GarageCars         1
BsmtFinSF2         1
BsmtFinSF1         1
KitchenQual        1
SaleType           1
GarageArea         1
```
If we read data_description.txt carefully, many of these missing values turn out to be explainable. For example, the first entry above, PoolQC, describes the pool quality; a missing value means the house simply has no pool, so it can be filled with "None".
All of the following features can be filled with "None":
cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"] for col in cols1: full[col].fillna("None", inplace=True)
Most of the following features represent areas of some kind; for example TotalBsmtSF is the basement area. If a house has no basement, the missing value is filled with 0.
cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"] for col in cols: full[col].fillna(0, inplace=True)
LotFrontage is strongly related to LotAreaCut and Neighborhood, so its missing values are imputed with the median within groups defined by those two features.
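LotAreaCut is not defined in this excerpt; a plausible construction, stated here as an assumption, is a binned version of LotArea, for example:

```python
# Assumed helper: discretize LotArea into 10 quantile bins so it can serve as a grouping key.
full["LotAreaCut"] = pd.qcut(full.LotArea, 10)
```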
```python
full['LotFrontage'] = full.groupby(['LotAreaCut', 'Neighborhood'])['LotFrontage'].transform(
    lambda x: x.fillna(x.median()))
```
For categorical features, the usual approach is pandas get_dummies, but in this competition that alone may not be enough. So in addition I group the data by each categorical feature, compute the mean and median of SalePrice for each value, and assign ordinal codes according to that ranking (see the sketch after the example below). Here is an example:
MSSubClass identifies the type of dwelling. Group the data by it:
```python
full.groupby(['MSSubClass'])[['SalePrice']].agg(['mean', 'median', 'count'])
```
Then assign ordinal values based on the ordering in that table:
```
'180': 1,
'30': 2, '45': 2,
'190': 3, '50': 3, '90': 3,
'85': 4, '40': 4, '160': 4,
'70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
'120': 6, '60': 6
```
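A minimal sketch of how such a mapping can be applied (map_values and the output column name oMSSubClass are hypothetical here, though the o-prefix matches the oMSZoning / oNeighborhood columns used in the feature-combination code later):

```python
# Hypothetical helper: map MSSubClass values to the ordinal codes derived above.
map_values = {'180': 1, '30': 2, '45': 2,
              '190': 3, '50': 3, '90': 3,
              '85': 4, '40': 4, '160': 4,
              '70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
              '120': 6, '60': 6}
full["oMSSubClass"] = full["MSSubClass"].astype(str).map(map_values)
```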
I encoded roughly twenty-odd features this way; see the full code for the details.
Combining raw features often produces surprisingly good results, but this dataset has too many raw features to try every combination. So I first run Lasso for feature selection and only combine the more important features.
```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.001)
lasso.fit(X_scaled, y_log)
FI_lasso = pd.DataFrame({"Feature Importance": lasso.coef_}, index=data_pipe.columns)
FI_lasso[FI_lasso["Feature Importance"] != 0].sort_values("Feature Importance").plot(
    kind="barh", figsize=(15, 25))
plt.xticks(rotation=90)
plt.show()
```
These are the features I ended up adding; the class below also reflects various other experiments along the way:
```python
from sklearn.base import BaseEstimator, TransformerMixin

class add_feature(BaseEstimator, TransformerMixin):
    def __init__(self, additional=1):
        self.additional = additional

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.additional == 1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]

            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["oMSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["oMSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["oMSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["oNeighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["oNeighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["oNeighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]

            X["-_oFunctional_TotalHouse"] = X["oFunctional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["oFunctional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["oCondition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["oCondition1"] + X["OverallQual"]

            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"] + X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            X["TotalPlace"] = (X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
                               + X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"])

        return X
```
PCA is a very important step here and contributed a lot to the final score. The features I added are all highly correlated with the original ones, which can cause strong multicollinearity, and PCA is exactly what removes that correlation. Since the goal is decorrelation rather than dimensionality reduction, n_components is set to roughly the same number of dimensions as before; this ratio (add about this many features up front, then project back down to about as many dimensions) came out of a lot of experimentation.
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=410)
X_scaled = pca.fit_transform(X_scaled)
test_X_scaled = pca.transform(test_X_scaled)
```
First, define a cross-validated RMSE evaluation function:
```python
import numpy as np
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y):
    # neg_mean_squared_error is negated by sklearn, so flip the sign before taking the root
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
```
Thirteen algorithms are evaluated with 5-fold cross-validation to establish a baseline:
LinearRegression
Ridge
Lasso
RandomForestRegressor
GradientBoostingRegressor
SVR
LinearSVR
ElasticNet
SGDRegressor
BayesianRidge
KernelRidge
ExtraTreesRegressor
XGBoost
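The `models` list used in the loop below is not shown in this excerpt; a minimal sketch of how it might be built, using scikit-learn/xgboost defaults as an assumption (the original presumably sets its own parameters):

```python
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, ElasticNet,
                                  SGDRegressor, BayesianRidge)
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.kernel_ridge import KernelRidge
from xgboost import XGBRegressor

# Untuned baseline models, in the same order as the names list below.
models = [LinearRegression(), Ridge(), Lasso(), RandomForestRegressor(),
          GradientBoostingRegressor(), SVR(), LinearSVR(), ElasticNet(),
          SGDRegressor(), BayesianRidge(), KernelRidge(),
          ExtraTreesRegressor(), XGBRegressor()]
```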
names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela","SGD","Bay","Ker","Extra","Xgb"] for name, model in zip(names, models): score = rmse_cv(model, X_scaled, y_log) print("{}: {:.6f}, {:.4f}".format(name,score.mean(),score.std()))
The results are below. On the whole, the tree-based models do worse than the linear models, probably because of the sparsity introduced by get_dummies; keep in mind that none of these models have been tuned yet.
```
LR: 1026870159.526766, 488528070.4534
Ridge: 0.117596, 0.0091
Lasso: 0.121474, 0.0060
RF: 0.140764, 0.0052
GBR: 0.124154, 0.0072
SVR: 0.112727, 0.0047
LinSVR: 0.121564, 0.0081
Ela: 0.111113, 0.0059
SGD: 0.159686, 0.0092
Bay: 0.110577, 0.0060
Ker: 0.109276, 0.0055
Extra: 0.136668, 0.0073
Xgb: 0.126614, 0.0070
```
Next, set up a small helper for hyperparameter tuning. Always keep in mind that the evaluation metric is RMSE, so the printed scores must be RMSE as well.
```python
from sklearn.model_selection import GridSearchCV

class grid():
    def __init__(self, model):
        self.model = model

    def grid_get(self, X, y, param_grid):
        grid_search = GridSearchCV(self.model, param_grid, cv=5, scoring="neg_mean_squared_error")
        grid_search.fit(X, y)
        print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
        # Convert the negated MSE scores back to RMSE before printing the results table
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        print(pd.DataFrame(grid_search.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
```
For example, tuning Lasso:
```python
grid(Lasso()).grid_get(X_scaled, y_log,
                       {'alpha': [0.0004, 0.0005, 0.0007, 0.0006, 0.0009, 0.0008], 'max_iter': [10000]})
```
```
{'max_iter': 10000, 'alpha': 0.0005} 0.111296607965
                                 params  mean_test_score  std_test_score
0  {'max_iter': 10000, 'alpha': 0.0003}         0.111869        0.001513
1  {'max_iter': 10000, 'alpha': 0.0002}         0.112745        0.001753
2  {'max_iter': 10000, 'alpha': 0.0004}         0.111463        0.001392
3  {'max_iter': 10000, 'alpha': 0.0005}         0.111297        0.001339
4  {'max_iter': 10000, 'alpha': 0.0007}         0.111538        0.001284
5  {'max_iter': 10000, 'alpha': 0.0006}         0.111359        0.001315
6  {'max_iter': 10000, 'alpha': 0.0009}         0.111915        0.001206
7  {'max_iter': 10000, 'alpha': 0.0008}         0.111706        0.001229
```
After many long rounds of testing, I finally settled on these six models:
```python
from sklearn.linear_model import Lasso, Ridge, ElasticNet, BayesianRidge
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge

lasso = Lasso(alpha=0.0005, max_iter=10000)
ridge = Ridge(alpha=60)
svr = SVR(gamma=0.0004, kernel='rbf', C=13, epsilon=0.009)
ker = KernelRidge(alpha=0.2, kernel='polynomial', degree=3, coef0=0.8)
ela = ElasticNet(alpha=0.005, l1_ratio=0.08, max_iter=10000)
bay = BayesianRidge()
```
First, take a weighted average of the models' predictions according to their weights:
```python
from sklearn.base import BaseEstimator, RegressorMixin, clone

class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self, mod, weight):
        self.mod = mod
        self.weight = weight

    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        # for every data point, multiply each model's prediction by its weight, then sum
        for data in range(pred.shape[1]):
            single = [pred[model, data] * weight for model, weight in zip(range(pred.shape[0]), self.weight)]
            w.append(np.sum(single))
        return w
```
```python
weight_avg = AverageWeight(mod=[lasso, ridge, svr, ker, ela, bay],
                           weight=[w1, w2, w3, w4, w5, w6])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())  # 0.10768459878025885
```
The score is 0.10768, better than any single model.
However, averaging only SVR and KernelRidge works even better, so the other models seem to have been dragging the result down.
```python
weight_avg = AverageWeight(mod=[svr, ker], weight=[0.5, 0.5])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())  # 0.10668349587195189
```
The idea behind stacking is illustrated in the figure below:
With two-layer stacking as in the figure, the first layer has 5 models and the second layer has 1 meta-model. The job of the first-layer models is to produce an \(\mathbb{R}^{n \times m}\) feature matrix that is used as training input for the second-layer model, where n is the number of training rows and m is the number of first-layer models.
```python
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
from sklearn.model_selection import KFold

class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod
        self.meta_model = meta_model
        self.kf = KFold(n_splits=5, random_state=42, shuffle=True)

    def fit(self, X, y):
        self.saved_model = [list() for i in self.mod]
        oof_train = np.zeros((X.shape[0], len(self.mod)))

        for i, model in enumerate(self.mod):
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)
                renew_model.fit(X[train_index], y[train_index])
                self.saved_model[i].append(renew_model)
                # out-of-fold predictions become the meta-model's training features
                oof_train[val_index, i] = renew_model.predict(X[val_index])

        self.meta_model.fit(oof_train, y)
        return self

    def predict(self, X):
        # average the per-fold models of each base learner, then stack the columns
        whole_test = np.column_stack([np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
                                      for single_model in self.saved_model])
        return self.meta_model.predict(whole_test)

    def get_oof(self, X, y, test_X):
        oof = np.zeros((X.shape[0], len(self.mod)))
        test_single = np.zeros((test_X.shape[0], 5))
        test_mean = np.zeros((test_X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for j, (train_index, val_index) in enumerate(self.kf.split(X, y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index], y[train_index])
                oof[val_index, i] = clone_model.predict(X[val_index])
                test_single[:, j] = clone_model.predict(test_X)
            test_mean[:, i] = test_single.mean(axis=1)
        return oof, test_mean
```
At first I used get_oof to extract the first-layer feature matrix and concatenate it with the original features. The CV score dropped to 0.1018, but the leaderboard score actually got worse, so this approach appears to overfit.
```python
X_train_stack, X_test_stack = stack_model.get_oof(a, b, test_X_scaled)
X_train_add = np.hstack((a, X_train_stack))
X_test_add = np.hstack((test_X_scaled, X_test_stack))
print(rmse_cv(stack_model, X_train_add, b).mean())  # 0.101824682747
```
For the final submission I used Lasso, Ridge, SVR, KernelRidge, ElasticNet, and BayesianRidge as the first-layer models and KernelRidge as the second-layer meta-model.
```python
stack_model = stacking(mod=[lasso, ridge, svr, ker, ela, bay], meta_model=ker)
stack_model.fit(a, b)
# predictions are on the log(SalePrice) scale, so exponentiate back before submitting
pred = np.exp(stack_model.predict(test_X_scaled))
result = pd.DataFrame({'Id': test.Id, 'SalePrice': pred})
result.to_csv("submission.csv", index=False)
```