This project comes from a classic Kaggle competition: house price prediction. The write-up is split into two parts, data analysis and data mining; this post is the data analysis part.
Many factors influence house prices. The dataset for this competition has 79 variables covering almost every aspect of residential homes in Ames, Iowa, and the task is to predict the final sale price.
Predict the price of each house: for every Id in the test set, give the corresponding value of the SalePrice variable.
Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
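For reference, a file in this format can be written out with pandas at the end of the pipeline; this is only a minimal sketch, where test_ids and pred are hypothetical stand-ins for the test-set Ids and a trained model's predictions:

import numpy as np
import pandas as pd

# hypothetical values: in practice, Id comes from the test set and pred from a trained model
test_ids = [1461, 1462, 1463]
pred = np.array([169000.1, 187724.1233, 175221.0])

submission = pd.DataFrame({'Id': test_ids, 'SalePrice': pred})
submission.to_csv('submission.csv', index=False)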
First we load the data and take a look:
import numpy as np
import pandas as pd

train_df = pd.read_csv('./input/train.csv', index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
train_df.head()
We can see there are 80 columns, that is, 79 features plus the SalePrice target.
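A quick sanity check of that column count, using the train_df loaded above:

# train.csv has 81 columns; with Id used as the index, 80 remain: 79 features plus SalePrice
print(train_df.shape)  # expected: (1460, 80)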
Next we merge the training set and the test set. Doing so makes data preprocessing more convenient: the features of both sets get transformed into the same format, and once preprocessing is finished we split them apart again.
We know that SalePrice, our training target, appears only in the training set and not in the test set, so we need to take this column out before merging. Before removing it, let's first have a look at it, that is, inspect its distribution.
prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
prices.hist()
Because the label itself is not smooth, we first need to smooth it (normalize its distribution) so that the model learns more accurately. Here I use log1p, i.e. log(x+1). Note that since we smooth the data at this step, we must transform the smoothed predictions back when computing the final results; the inverse function of log1p() is expm1(), and we will go into the details when it is used later.
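As a small illustration of that inverse relationship, a round trip through log1p and expm1 recovers the original value:

import numpy as np

price = 169000.0
log_price = np.log1p(price)  # log(169000 + 1) ≈ 12.038
back = np.expm1(log_price)   # ≈ 169000.0, back on the original scale
print(log_price, back)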
Then we take this column out:
y_train = np.log1p(train_df.pop('SalePrice'))
y_train.head()
We get:
Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64
At this point, y_train holds the (log-transformed) SalePrice column.
Then we concatenate the two datasets:
df = pd.concat((train_df, test_df), axis=0)
Check its shape:
df.shape
(2919, 79)
df is the merged DataFrame.
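Once preprocessing is done, splitting the combined frame back apart can be as simple as indexing by the original row labels; a minimal sketch, assuming the index is preserved through preprocessing (the variable names here are just placeholders):

# split the combined DataFrame back into its train and test parts by index
dummy_train_df = df.loc[train_df.index]
dummy_test_df = df.loc[test_df.index]
print(dummy_train_df.shape, dummy_test_df.shape)  # (1460, 79) (1459, 79)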
According to the description Kaggle provides, the features and their meanings are as follows:
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
Next we analyze the features. Listed above are one target variable, SalePrice, and 79 features, which is quite a lot; this feature analysis step lays the groundwork for the feature engineering that follows.
Let's check which features have missing values:
print(pd.isnull(df).sum())
This output is not easy to scan, so let's first look at the 10 features with the most missing values:
df.isnull().sum().sort_values(ascending=False).head(10)
To show the situation more clearly, we measure missingness as a percentage:
df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': df_na})
missing_data.head(10)
Visualize it:
import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
We can see that PoolQC, MiscFeature, Alley, Fence, and FireplaceQu are missing in large numbers, LotFrontage has a missing rate of 16.7%, and GarageType, GarageFinish, GarageQual, and GarageCond have similar missing rates. Some of these features are categorical and some are numerical; how to handle their missing values will be discussed in the feature engineering part.
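To see which of these columns pandas currently treats as categorical (object dtype) and which as numerical, a quick check on the combined df can help; a small sketch:

# object dtype -> categorical feature, numeric dtypes -> numerical feature
print(df.dtypes.value_counts())
print(df['PoolQC'].dtype)       # object: a categorical feature
print(df['LotFrontage'].dtype)  # float64: a numerical feature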
Finally, we run a correlation analysis on the features and look at the heatmap:
corrmat = train_df.corr()
plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=0.9, square=True)
We can see that some features are strongly correlated with each other, which easily leads to overfitting, so they need to be removed. In the next post, on data mining, we will process these features and train the models.
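One rough way to surface the most strongly related pairs behind that heatmap is to unstack the correlation matrix; a sketch (each pair shows up twice because the matrix is symmetric):

# rank absolute pairwise correlations, dropping the self-correlations on the diagonal
corr_pairs = corrmat.abs().unstack()
corr_pairs = corr_pairs[corr_pairs < 1.0].sort_values(ascending=False)
print(corr_pairs.head(10))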
Comments and corrections are welcome.