This project comes from a classic Kaggle competition: house price prediction. The write-up is split into two parts, data analysis and data mining; this post is the data analysis part.
Many factors influence house prices. The dataset for this competition has 79 variables covering almost every aspect of residential homes in Ames, Iowa, and the task is to predict the final sale price.
Predict the price of each house: for every Id in the test set, give the corresponding value of the SalePrice variable.
Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
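For reference, a file in this format can be written out with pandas at the end of the pipeline; this is only a minimal sketch, where test_ids and pred are hypothetical stand-ins for the test-set Ids and a trained model's predictions:

import numpy as np
import pandas as pd

# hypothetical values: in practice, Id comes from the test set and pred from a trained model
test_ids = [1461, 1462, 1463]
pred = np.array([169000.1, 187724.1233, 175221.0])

submission = pd.DataFrame({'Id': test_ids, 'SalePrice': pred})
submission.to_csv('submission.csv', index=False)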
First we load the data and take a look:
import numpy as np
import pandas as pd

train_df = pd.read_csv('./input/train.csv', index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
train_df.head()
We can see there are 80 columns, that is, 79 features plus the SalePrice target.
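A quick sanity check of that column count, using the train_df loaded above:

# train.csv has 81 columns; with Id used as the index, 80 remain: 79 features plus SalePrice
print(train_df.shape)  # expected: (1460, 80)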
Next we merge the training set and the test set. Doing so makes data preprocessing more convenient: the features of both sets get transformed into the same format, and once preprocessing is finished we split them apart again.
We know that SalePrice, our training target, appears only in the training set and not in the test set, so we need to take this column out before merging. Before removing it, let's first have a look at it, that is, inspect its distribution.
prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
prices.hist()
Because the label itself is not smooth, we first need to smooth it (normalize its distribution) so that the model learns more accurately. Here I use log1p, i.e. log(x+1). Note that since we smooth the data at this step, we must transform the smoothed predictions back when computing the final results; the inverse function of log1p() is expm1(), and we will go into the details when it is used later.
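As a small illustration of that inverse relationship, a round trip through log1p and expm1 recovers the original value:

import numpy as np

price = 169000.0
log_price = np.log1p(price)  # log(169000 + 1) ≈ 12.038
back = np.expm1(log_price)   # ≈ 169000.0, back on the original scale
print(log_price, back)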
Then we take this column out:
y_train = np.log1p(train_df.pop('SalePrice'))
y_train.head()
We get:
Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64
At this point, y_train holds the (log-transformed) SalePrice column.
Then we concatenate the two datasets:
df = pd.concat((train_df, test_df), axis=0)
Check its shape:
df.shape
(2919, 79)
df is the merged DataFrame.
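Once preprocessing is done, splitting the combined frame back apart can be as simple as indexing by the original row labels; a minimal sketch, assuming the index is preserved through preprocessing (the variable names here are just placeholders):

# split the combined DataFrame back into its train and test parts by index
dummy_train_df = df.loc[train_df.index]
dummy_test_df = df.loc[test_df.index]
print(dummy_train_df.shape, dummy_test_df.shape)  # (1460, 79) (1459, 79)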
According to the description Kaggle provides, the features and their meanings are as follows:
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
Next we analyze the features. Listed above are one target variable, SalePrice, and 79 features, which is quite a lot; this feature analysis step lays the groundwork for the feature engineering that follows.
Let's check which features have missing values:
print(pd.isnull(df).sum())
This output is not easy to scan, so let's first look at the 10 features with the most missing values:
df.isnull().sum().sort_values(ascending=False).head(10)
To show the situation more clearly, we measure missingness as a percentage:
df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': df_na})
missing_data.head(10)
Visualize it:
import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
We can see that PoolQC, MiscFeature, Alley, Fence, and FireplaceQu are missing in large numbers, LotFrontage has a missing rate of 16.7%, and GarageType, GarageFinish, GarageQual, and GarageCond have similar missing rates. Some of these features are categorical and some are numerical; how to handle their missing values will be discussed in the feature engineering part.
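To see which of these columns pandas currently treats as categorical (object dtype) and which as numerical, a quick check on the combined df can help; a small sketch:

# object dtype -> categorical feature, numeric dtypes -> numerical feature
print(df.dtypes.value_counts())
print(df['PoolQC'].dtype)       # object: a categorical feature
print(df['LotFrontage'].dtype)  # float64: a numerical feature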
Finally, we run a correlation analysis on the features and look at the heatmap:
corrmat = train_df.corr()
plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=0.9, square=True)
We can see that some features are strongly correlated with each other, which easily leads to overfitting, so they need to be removed. In the next post, on data mining, we will process these features and train the models.
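One rough way to surface the most strongly related pairs behind that heatmap is to unstack the correlation matrix; a sketch (each pair shows up twice because the matrix is symmetric):

# rank absolute pairwise correlations, dropping the self-correlations on the diagonal
corr_pairs = corrmat.abs().unstack()
corr_pairs = corr_pairs[corr_pairs < 1.0].sort_values(ascending=False)
print(corr_pairs.head(10))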
Comments and corrections are welcome.