Happiness Prediction with an XGBoost Model: Alibaba Tianchi Learning Competition

Date: 2020-12-21

This article is based on the Alibaba Tianchi learning competition 《快來一塊兒挖掘幸福感！》 ("Come and Mine Happiness Together!").

Loading the data

The complete version of the dataset, happiness_train_complete.csv, is loaded.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

# Use the id column as the DataFrame index and parse survey_time as datetime
data_origin = pd.read_csv('./data/happiness_train_complete.csv', index_col='id', parse_dates=['survey_time'], encoding='gbk')

Exploring basic dataset information

Print the first 5 rows for a quick look.

data_origin.head()

	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	religion	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
id
1	4	1	12	32	59	2015-08-04 14:18:00	1	1959	1	1	...	4	50	60	50	50	30.0	30	50	50	50
2	4	2	18	52	85	2015-07-21 15:04:00	1	1992	1	1	...	3	90	70	70	80	85.0	70	90	60	60
3	4	2	29	83	126	2015-07-21 13:24:00	2	1967	1	0	...	4	90	80	75	79	80.0	90	90	90	75
4	5	2	10	28	51	2015-07-25 17:33:00	2	1943	1	1	...	3	100	90	70	80	80.0	90	90	80	80
5	4	1	7	18	36	2015-08-10 09:50:00	2	1994	1	1	...	2	50	50	50	50	50.0	50	50	50	50

5 rows × 139 columns

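View the detailed information of the data: 8000 records and 139 columns (features).

In the output, the second column is the feature name, the third is the number of non-null records, and the fourth is the data type of the feature.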

# On pandas older than 1.2, pass null_counts=True instead of show_counts=True.
data_origin.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8000 entries, 1 to 8000
Data columns (total 139 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   happiness             8000 non-null   int64         
 1   survey_type           8000 non-null   int64         
 2   province              8000 non-null   int64         
 3   city                  8000 non-null   int64         
 4   county                8000 non-null   int64         
 5   survey_time           8000 non-null   datetime64[ns]
 6   gender                8000 non-null   int64         
 7   birth                 8000 non-null   int64         
 8   nationality           8000 non-null   int64         
 9   religion              8000 non-null   int64         
 10  religion_freq         8000 non-null   int64         
 11  edu                   8000 non-null   int64         
 12  edu_other             3 non-null      object        
 13  edu_status            6880 non-null   float64       
 14  edu_yr                6028 non-null   float64       
 15  income                8000 non-null   int64         
 16  political             8000 non-null   int64         
 17  join_party            824 non-null    float64       
 18  floor_area            8000 non-null   float64       
 19  property_0            8000 non-null   int64         
 20  property_1            8000 non-null   int64         
 21  property_2            8000 non-null   int64         
 22  property_3            8000 non-null   int64         
 23  property_4            8000 non-null   int64         
 24  property_5            8000 non-null   int64         
 25  property_6            8000 non-null   int64         
 26  property_7            8000 non-null   int64         
 27  property_8            8000 non-null   int64         
 28  property_other        66 non-null     object        
 29  height_cm             8000 non-null   int64         
 30  weight_jin            8000 non-null   int64         
 31  health                8000 non-null   int64         
 32  health_problem        8000 non-null   int64         
 33  depression            8000 non-null   int64         
 34  hukou                 8000 non-null   int64         
 35  hukou_loc             7996 non-null   float64       
 36  media_1               8000 non-null   int64         
 37  media_2               8000 non-null   int64         
 38  media_3               8000 non-null   int64         
 39  media_4               8000 non-null   int64         
 40  media_5               8000 non-null   int64         
 41  media_6               8000 non-null   int64         
 42  leisure_1             8000 non-null   int64         
 43  leisure_2             8000 non-null   int64         
 44  leisure_3             8000 non-null   int64         
 45  leisure_4             8000 non-null   int64         
 46  leisure_5             8000 non-null   int64         
 47  leisure_6             8000 non-null   int64         
 48  leisure_7             8000 non-null   int64         
 49  leisure_8             8000 non-null   int64         
 50  leisure_9             8000 non-null   int64         
 51  leisure_10            8000 non-null   int64         
 52  leisure_11            8000 non-null   int64         
 53  leisure_12            8000 non-null   int64         
 54  socialize             8000 non-null   int64         
 55  relax                 8000 non-null   int64         
 56  learn                 8000 non-null   int64         
 57  social_neighbor       7204 non-null   float64       
 58  social_friend         7204 non-null   float64       
 59  socia_outing          8000 non-null   int64         
 60  equity                8000 non-null   int64         
 61  class                 8000 non-null   int64         
 62  class_10_before       8000 non-null   int64         
 63  class_10_after        8000 non-null   int64         
 64  class_14              8000 non-null   int64         
 65  work_exper            8000 non-null   int64         
 66  work_status           2951 non-null   float64       
 67  work_yr               2951 non-null   float64       
 68  work_type             2951 non-null   float64       
 69  work_manage           2951 non-null   float64       
 70  insur_1               8000 non-null   int64         
 71  insur_2               8000 non-null   int64         
 72  insur_3               8000 non-null   int64         
 73  insur_4               8000 non-null   int64         
 74  family_income         7999 non-null   float64       
 75  family_m              8000 non-null   int64         
 76  family_status         8000 non-null   int64         
 77  house                 8000 non-null   int64         
 78  car                   8000 non-null   int64         
 79  invest_0              8000 non-null   int64         
 80  invest_1              8000 non-null   int64         
 81  invest_2              8000 non-null   int64         
 82  invest_3              8000 non-null   int64         
 83  invest_4              8000 non-null   int64         
 84  invest_5              8000 non-null   int64         
 85  invest_6              8000 non-null   int64         
 86  invest_7              8000 non-null   int64         
 87  invest_8              8000 non-null   int64         
 88  invest_other          29 non-null     object        
 89  son                   8000 non-null   int64         
 90  daughter              8000 non-null   int64         
 91  minor_child           6934 non-null   float64       
 92  marital               8000 non-null   int64         
 93  marital_1st           7172 non-null   float64       
 94  s_birth               6282 non-null   float64       
 95  marital_now           6230 non-null   float64       
 96  s_edu                 6282 non-null   float64       
 97  s_political           6282 non-null   float64       
 98  s_hukou               6282 non-null   float64       
 99  s_income              6282 non-null   float64       
 100 s_work_exper          6282 non-null   float64       
 101 s_work_status         2565 non-null   float64       
 102 s_work_type           2565 non-null   float64       
 103 f_birth               8000 non-null   int64         
 104 f_edu                 8000 non-null   int64         
 105 f_political           8000 non-null   int64         
 106 f_work_14             8000 non-null   int64         
 107 m_birth               8000 non-null   int64         
 108 m_edu                 8000 non-null   int64         
 109 m_political           8000 non-null   int64         
 110 m_work_14             8000 non-null   int64         
 111 status_peer           8000 non-null   int64         
 112 status_3_before       8000 non-null   int64         
 113 view                  8000 non-null   int64         
 114 inc_ability           8000 non-null   int64         
 115 inc_exp               8000 non-null   float64       
 116 trust_1               8000 non-null   int64         
 117 trust_2               8000 non-null   int64         
 118 trust_3               8000 non-null   int64         
 119 trust_4               8000 non-null   int64         
 120 trust_5               8000 non-null   int64         
 121 trust_6               8000 non-null   int64         
 122 trust_7               8000 non-null   int64         
 123 trust_8               8000 non-null   int64         
 124 trust_9               8000 non-null   int64         
 125 trust_10              8000 non-null   int64         
 126 trust_11              8000 non-null   int64         
 127 trust_12              8000 non-null   int64         
 128 trust_13              8000 non-null   int64         
 129 neighbor_familiarity  8000 non-null   int64         
 130 public_service_1      8000 non-null   int64         
 131 public_service_2      8000 non-null   int64         
 132 public_service_3      8000 non-null   int64         
 133 public_service_4      8000 non-null   int64         
 134 public_service_5      8000 non-null   float64       
 135 public_service_6      8000 non-null   int64         
 136 public_service_7      8000 non-null   int64         
 137 public_service_8      8000 non-null   int64         
 138 public_service_9      8000 non-null   int64         
dtypes: datetime64[ns](1), float64(25), int64(110), object(3)
memory usage: 8.5+ MB

View the summary statistics of the data.

data_origin.describe()

	happiness	survey_type	province	city	county	gender	birth	nationality	religion	religion_freq	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
count	8000.000000	8000.000000	8000.000000	8000.000000	8000.000000	8000.00000	8000.000000	8000.00000	8000.000000	8000.000000	...	8000.000000	8000.000000	8000.000000	8000.000000	8000.000000	8000.000000	8000.000000	8000.00000	8000.000000	8000.000000
mean	3.850125	1.405500	15.155375	42.564750	70.619000	1.53000	1964.707625	1.37350	0.772250	1.427250	...	3.722250	70.809500	68.170000	62.737625	66.320125	62.794187	67.064000	66.09625	65.626750	67.153750
std	0.938228	0.491019	8.917100	27.187404	38.747503	0.49913	16.842865	1.52882	1.071459	1.408441	...	1.143358	21.184742	20.549943	24.771319	22.049437	23.463162	21.586817	23.08568	23.827493	22.502203
min	-8.000000	1.000000	1.000000	1.000000	1.000000	1.00000	1921.000000	-8.00000	-8.000000	-8.000000	...	-8.000000	-3.000000	-3.000000	-3.000000	-3.000000	-3.000000	-3.000000	-3.00000	-3.000000	-3.000000
25%	4.000000	1.000000	7.000000	18.000000	37.000000	1.00000	1952.000000	1.00000	1.000000	1.000000	...	3.000000	60.000000	60.000000	50.000000	60.000000	55.000000	60.000000	60.00000	60.000000	60.000000
50%	4.000000	1.000000	15.000000	42.000000	73.000000	2.00000	1965.000000	1.00000	1.000000	1.000000	...	4.000000	79.000000	70.000000	70.000000	70.000000	70.000000	70.000000	70.00000	70.000000	70.000000
75%	4.000000	2.000000	22.000000	65.000000	104.000000	2.00000	1977.000000	1.00000	1.000000	1.000000	...	5.000000	80.000000	80.000000	80.000000	80.000000	80.000000	80.000000	80.00000	80.000000	80.000000
max	5.000000	2.000000	31.000000	89.000000	134.000000	2.00000	1997.000000	8.00000	1.000000	9.000000	...	5.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.000000	100.00000	100.000000	100.000000

8 rows × 135 columns

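Data preprocessing

Handling missing values

Check the missing values within subsets of the features, where

required_list holds the required questionnaire items
continuous_list holds the features that are continuous variables
categorical_list holds the (nominal) categorical variables

All other features are ordinal categorical variables.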

required_list = ['survey_type', 'province', 'city', 'county', 'survey_time', 'gender', 'birth', 'nationality', 'religion',
                 'religion_freq', 'edu', 'income', 'political', 'floor_area', 'height_cm', 'weight_jin', 'health', 'health_problem',
                 'depression', 'hukou', 'socialize', 'relax', 'learn', 'equity', 'class', 'work_exper', 'work_status', 'work_yr', 'work_type',
                 'work_manage', 'family_income', 'family_m', 'family_status', 'house', 'car', 'marital', 'status_peer', 'status_3_before', 
                 'view', 'inc_ability']
continuous_list = ['birth', 'edu_yr', 'income', 'floor_area', 'height_cm', 'weight_jin', 'work_yr', 'family_income', 'family_m', 'house', 'son', 
                   'daughter', 'minor_child', 'marital_1st', 's_birth', 'marital_now', 's_income', 'f_birth', 'm_birth', 'inc_exp',
                  'public_service_1', 'public_service_2', 'public_service_3', 'public_service_4', 'public_service_5', 'public_service_6',
                  'public_service_7', 'public_service_8', 'public_service_9']
categorical_list = ['survey_type', 'province', 'gender', 'nationality']

Missing values in required items

Check which required items contain missing values.

data_origin[required_list].isna().sum()[data_origin[required_list].isna().sum() > 0].to_frame().T

	work_status	work_yr	work_type	work_manage	family_income
0	5049	5049	5049	5049	1
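The one-liner above evaluates `isna().sum()` twice. A minimal sketch of the same summary with the counts stored once is given here; the toy frame and its values are made up for illustration and only stand in for `data_origin[required_list]`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_origin[required_list]; the values are made up.
toy = pd.DataFrame({
    "work_status": [1.0, np.nan, np.nan, 3.0],
    "work_yr": [10.0, np.nan, np.nan, 2.0],
    "family_income": [5e4, 2e4, np.nan, 8e4],
    "gender": [1, 2, 2, 1],
})

# Count missing values once, then keep only the columns that have any.
na_counts = toy.isna().sum()
na_summary = na_counts[na_counts > 0].to_frame().T
print(na_summary)
```

Storing the counts in a variable also makes the later checks in this section easier to read, since the same pattern is repeated for each feature group.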

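where

work_status is the current employment situation
work_yr is the total number of years worked
work_type is the nature of the current job
work_manage is the management responsibility in the current job
family_income is the total household income for the whole of last year

Start with the four features that begin with work_. Their missing counts are identical, which suggests the questions were skipped because of the way the questionnaire is filled in.

Checking the questionnaire for the corresponding questions shows that they follow work_exper (work experience and status): depending on the work experience reported, the four questions above are skipped.

See the questionnaire item for work_exper.

Only category 1 of work_exper leads to these questions; every other category skips them. So we output work_exper for the records where the four columns are missing and check whether they all fall outside category 1.

The output below shows that, where the four features are missing, the corresponding work_exper value is in most cases not 1.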

data_origin.loc[data_origin[required_list].isna().sum(axis=1)[data_origin[required_list].isna().sum(axis=1) > 0].index, 'work_exper'].to_frame().plot.hist()
pd.value_counts(data_origin.loc[data_origin[required_list].isna().sum(axis=1)[data_origin[required_list].isna().sum(axis=1) > 0].index, 'work_exper'])

5    1968
3    1242
4    1065
2     387
6     380
1       7
Name: work_exper, dtype: int64
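The skip-logic check above can be written with a named boolean mask, which avoids repeating the long indexing expression. This is only a sketch on made-up records; the rule that work_status should be answered when work_exper is 1 is taken from the article, everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Made-up records: work_status should only be filled when work_exper == 1.
toy = pd.DataFrame({
    "work_exper": [1, 1, 3, 5, 1, 4],
    "work_status": [2.0, np.nan, np.nan, np.nan, 3.0, np.nan],
})

missing = toy["work_status"].isna()
# Distribution of the gate question among rows where the follow-up is missing.
gate_counts = toy.loc[missing, "work_exper"].value_counts()
# Rows that were supposed to answer (work_exper == 1) but did not.
inconsistent = toy[missing & (toy["work_exper"] == 1)]
print(gate_counts)
print(inconsistent.index.tolist())
```

The `inconsistent` frame plays the role of the 7 records that the article inspects and drops next.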

Take a closer look at the records whose value is 1.

(data_origin[data_origin[required_list].isna().sum(axis=1) > 0])[(data_origin[data_origin[required_list].isna().sum(axis=1) > 0].work_exper == 1)]

	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	religion	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
id
692	4	2	21	64	101	2015-07-20 11:12:00	2	1975	1	1	...	5	80	70	80	80	80.0	80	80	80	80
841	4	2	31	88	133	2015-08-17 13:49:00	2	1971	1	0	...	4	50	30	-2	-2	-2.0	50	50	50	70
1411	4	2	2	2	9	2015-07-23 09:25:00	1	1967	8	1	...	4	90	85	80	90	90.0	92	93	94	90
3117	4	1	4	7	18	2015-10-03 16:02:00	1	1980	1	1	...	2	30	35	30	40	60.0	40	30	70	70
4783	5	2	22	65	103	2015-07-08 18:45:00	1	1955	1	1	...	5	90	90	90	90	80.0	90	80	90	90
5589	5	2	16	46	78	2015-07-29 11:34:00	2	1964	1	1	...	3	89	63	67	75	74.0	67	65	78	79
7368	4	2	21	64	101	2015-07-19 08:32:00	2	1963	1	1	...	5	70	70	70	60	70.0	70	60	60	60

7 rows × 139 columns

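There are 7 records with work_exper equal to 1. These respondents should have answered the four questions, so the records are dropped.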

data_origin.drop((data_origin[data_origin[required_list].isna().sum(axis=1) > 0])[(data_origin[data_origin[required_list].isna().sum(axis=1) > 0].work_exper == 1)].index, inplace=True)

Only 1 record is missing family_income, which does not affect the size of the data, so it is simply dropped.

data_origin.drop(data_origin['family_income'].isna()[data_origin['family_income'].isna()].index, inplace=True)

Missing values in continuous features

Check the missing values in the continuous features.

data_origin[continuous_list].isna().sum()[data_origin[continuous_list].isna().sum() > 0].to_frame().T

	edu_yr	work_yr	minor_child	marital_1st	s_birth	marital_now	s_income
0	1970	5041	1066	828	1718	1770	1718

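where

edu_yr is the year in which the highest completed level of education was obtained
work_yr is the total number of years worked from the first non-farm job to the current job
minor_child is the number of children under 18
marital_1st is the time of the first marriage
s_birth is the birth year of the current spouse or cohabiting partner
marital_now is the year of marriage to the current spouse
s_income is the total income of the spouse or cohabiting partner for the whole of last year

For edu_yr, the year in which the highest completed level of education was obtained, look at the distribution of edu_status among the records where it is missing.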

data_origin[data_origin['edu_yr'].isna()]['edu_status'].plot.hist()
pd.value_counts(data_origin[data_origin['edu_yr'].isna()]['edu_status'])

2.0    746
3.0    103
4.0      1
1.0      1
Name: edu_status, dtype: int64

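Looking at edu_status for the records with missing edu_yr: only records with option 4 (graduated) are expected to give a graduation year in edu_yr, so the record with edu_status equal to 4 and edu_yr missing should be dropped.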

data_origin.drop(data_origin[(data_origin['edu_status'] == 4) & (data_origin['edu_yr'].isna())].index, inplace=True)

data_origin.shape

(7991, 139)

For minor_child, check two other features on the records where it is missing: son and daughter, the number of sons and daughters. If both are 0, minor_child is filled with 0 as well.

print(data_origin[np.array(data_origin['minor_child'].isna())].loc[:, 'son'].sum())
print(data_origin[np.array(data_origin['minor_child'].isna())].loc[:, 'daughter'].sum())
data_origin[np.array(data_origin['minor_child'].isna())].loc[:, 'son':'daughter']

0
0

	son	daughter
id
2	0	0
5	0	0
9	0	0
29	0	0
31	0	0
...	...	...
7967	0	0
7972	0	0
7991	0	0
7999	0	0
8000	0	0

1066 rows × 2 columns

For the records with missing minor_child, the numbers of sons and daughters are also 0, so the missing minor_child values are filled with 0.

data_origin['minor_child'] = data_origin['minor_child'].fillna(0)
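In this dataset every record with a missing minor_child has no sons and no daughters, so an unconditional fill is safe. As a sketch of a more defensive variant, the fill can be restricted to exactly those rows, so that a record reporting children but no minor_child stays missing. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Made-up records; row 3 reports children but has no minor_child answer.
toy = pd.DataFrame({
    "son": [0, 1, 0, 2],
    "daughter": [0, 0, 0, 1],
    "minor_child": [np.nan, 1.0, np.nan, np.nan],
})

# Fill with 0 only where the respondent reports no children at all.
no_children = (toy["son"] == 0) & (toy["daughter"] == 0)
fill_mask = toy["minor_child"].isna() & no_children
toy.loc[fill_mask, "minor_child"] = 0
print(toy)
```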

For the records with missing marital_1st, check whether the corresponding marital value is 1, which means never married.

print(data_origin[np.array(data_origin['marital_1st'].isna())]['marital'].sum() == data_origin[np.array(data_origin['marital_1st'].isna())]['marital'].shape[0])
data_origin[np.array(data_origin['marital_1st'].isna())]['marital'].plot.hist()
pd.value_counts(data_origin[np.array(data_origin['marital_1st'].isna())]['marital'])

True





1    828
Name: marital, dtype: int64

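The output shows that all records with missing marital_1st are unmarried respondents, so these missing values are legitimate.

Next, look at missing s_birth, the birth year of the current spouse or cohabiting partner. First check the marital status of the missing records to see whether they have no spouse or cohabiting partner.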

data_origin[data_origin['s_birth'].isna()]['marital'].plot.hist()
pd.value_counts(data_origin[data_origin['s_birth'].isna()]['marital'])

1    828
7    718
6    171
2      1
Name: marital, dtype: int64

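According to the output, the marital values 1, 6 and 7 mean never married, divorced and widowed, so a missing s_birth is legitimate for them. The value 2 means cohabiting, and only one such record is missing s_birth, so it is simply dropped.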

data_origin.drop(data_origin[data_origin['s_birth'].isna()]['marital'][data_origin[data_origin['s_birth'].isna()]['marital'] == 2].index, inplace=True)

For marital_now, the year of marriage to the current spouse, first output marital to see whether the not-married condition holds.

data_origin[data_origin['marital_now'].isna()]['marital'].plot.hist()
pd.value_counts(data_origin[data_origin['marital_now'].isna()]['marital'])

1    828
7    718
6    171
2     51
3      1
Name: marital, dtype: int64

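According to the output, 1 and 2 (never married and cohabiting) describe respondents who are not married, so the missing value is legitimate.

The values 3, 6 and 7 mean first marriage with spouse, divorced and widowed. Only 3 describes someone who currently has a spouse and is married, so that record should be dropped.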

data_origin.drop(data_origin[data_origin['marital_now'].isna()].loc[data_origin[data_origin['marital_now'].isna()]['marital'] == 3].index, inplace=True)

data_origin.shape

(7989, 139)

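For missing s_income, the total income of the spouse or cohabiting partner for the whole of last year, check marital to see whether the no-spouse-or-partner condition holds.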

data_origin[data_origin['s_income'].isna()]['marital'].plot.hist()
pd.value_counts(data_origin[data_origin['s_income'].isna()]['marital'])

1    828
7    718
6    171
Name: marital, dtype: int64

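For the records with missing s_income, the marital status is always never married, divorced or widowed, so the missing s_income values are legitimate.

Missing values in categorical variables

Check the missing values in the categorical variables. All counts are 0, so nothing is missing.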

data_origin[categorical_list].isna().sum().to_frame().T

	survey_type	province	gender	nationality
0	0	0	0	0

Missing values across all features

Check the missing values across all features.

data_origin.isna().sum()[data_origin.isna().sum() > 0].to_frame().T

	edu_other	edu_status	edu_yr	join_party	property_other	hukou_loc	social_neighbor	social_friend	work_status	work_yr	...	marital_1st	s_birth	marital_now	s_edu	s_political	s_hukou	s_income	s_work_exper	s_work_status	s_work_type
0	7986	1119	1969	7167	7923	4	795	795	5038	5038	...	828	1717	1768	1717	1717	1717	1717	1717	5427	5427

1 rows × 23 columns

First, edu_other is only filled in when edu is 14. Check whether any record with missing edu_other has edu equal to 14; if so, edu_other should not be missing and the record should be dropped.

data_origin[data_origin['edu_other'].isna()][data_origin[data_origin['edu_other'].isna()]['edu'] == 14]

	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	religion	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
id
1242	4	2	3	6	13	2015-09-24 17:58:00	1	1971	1	1	...	5	100	90	60	80	70.0	80	70	60	50
3651	3	2	3	6	13	2015-09-24 20:25:00	1	1953	1	1	...	5	100	100	60	50	70.0	50	30	70	40
5330	2	2	3	6	13	2015-09-25 07:57:00	1	1953	1	1	...	5	100	100	100	100	100.0	100	30	100	50

3 rows × 139 columns

Among the records with edu equal to 14, there are 3 whose edu_other is missing, so these 3 records are dropped.

data_origin.drop(data_origin[data_origin['edu_other'].isna()][data_origin[data_origin['edu_other'].isna()]['edu'] == 14].index, inplace=True)

For the records with missing edu_status, first check which values edu takes.

data_origin[data_origin['edu_status'].isna()]['edu'].plot.hist()
pd.value_counts(data_origin[data_origin['edu_status'].isna()]['edu'])

1    1052
2      65
3       2
Name: edu, dtype: int64

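For the records with missing edu_status, the corresponding edu levels are: no education at all (1), private tutoring school or literacy class (2), and primary school (3). Values 1 and 2 skip the edu_status question, so a missing edu_status is legitimate for them. The records with edu equal to 3 are therefore dropped.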

data_origin.drop(data_origin[data_origin['edu_status'].isna()][data_origin[data_origin['edu_status'].isna()]['edu'] == 3].index, inplace=True)

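For join_party, the date on which a current party member joined the party, a missing value is only correct when the political status is not party member. Check the distribution.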

data_origin[data_origin['join_party'].isna()]['political'].plot.hist()
pd.value_counts(data_origin[data_origin['join_party'].isna()]['political'])

1    6703
 2     402
-8      41
 3      11
 4       5
Name: political, dtype: int64

The histogram and the counts show 5 records whose political value is 4 but whose party-joining date is not filled in, so these 5 records are dropped.

data_origin.drop(data_origin[data_origin['join_party'].isna()][data_origin[data_origin['join_party'].isna()]['political'] == 4].index, inplace=True)

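For hukou_loc, the current place of household registration, check the hukou value of the missing records. They are all 7, meaning no hukou, so the missing values are legitimate.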

data_origin[data_origin['hukou_loc'].isna()]['hukou'].to_frame()

	hukou
id
589	7
3657	7
3799	7
7811	7

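For social_neighbor and social_friend, described as the frequency of social and recreational activities with other friends and the number of nights spent away from home because of a holiday or a visit to relatives and friends, first look at the distribution of socialize among the records where they are missing.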

data_origin[data_origin['social_neighbor'].isna()]['socialize'].plot.hist()
pd.value_counts(data_origin[data_origin['social_neighbor'].isna()]['socialize'])

1    793
Name: socialize, dtype: int64

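All records with missing social_neighbor and social_friend have socialize (how often free time is spent socializing) equal to 1, meaning never, so the missing values of both features can be filled with 1.

data_origin['social_neighbor'] = data_origin['social_neighbor'].fillna(1)
data_origin['social_friend'] = data_origin['social_friend'].fillna(1)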

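For the features from s_edu to s_work_exper, the number of missing records is identical, so the missing records of these features may well come from the same group of respondents.

First look at the distribution of marital among the records with missing s_edu.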

data_origin[data_origin['s_edu'].isna()]['marital'].plot.hist()
pd.value_counts(data_origin[data_origin['s_edu'].isna()]['marital'])

1    827
7    717
6    171
Name: marital, dtype: int64

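The marital status of every record with missing s_edu is never married, divorced or widowed, that is, no spouse or cohabiting partner, so these are legitimate missing values.

The same holds for s_political through s_work_exper.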
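The claim that the spouse features are missing on the same records can be checked directly instead of being inferred from equal counts. A minimal sketch on made-up records follows; the column names come from the dataset, while the values and the short `spouse_cols` list are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up records: spouse columns are missing for marital codes 1, 6 and 7.
toy = pd.DataFrame({
    "marital": [1, 3, 7, 3, 6],
    "s_edu": [np.nan, 4.0, np.nan, 6.0, np.nan],
    "s_political": [np.nan, 1.0, np.nan, 1.0, np.nan],
    "s_hukou": [np.nan, 1.0, np.nan, 2.0, np.nan],
})
spouse_cols = ["s_edu", "s_political", "s_hukou"]

# True when every spouse column is missing on exactly the same rows as s_edu.
reference = toy["s_edu"].isna()
same_rows = all((toy[col].isna() == reference).all() for col in spouse_cols)

# Marital codes of the rows concerned.
marital_codes = sorted(toy.loc[reference, "marital"].unique())
print(same_rows, marital_codes)
```

Equal missing counts are necessary but not sufficient for identical missing rows, which is why the row-wise comparison is the stronger check.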

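For s_work_status, the current work situation of the spouse or cohabiting partner, first check the questionnaire.

s_work_status and s_work_type are only to be answered when s_work_exper is 1. All other options skip them, so their missing values are legitimate.

Next, look at the distribution of s_work_exper among the records with missing s_work_status.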

data_origin[data_origin['s_work_status'].isna()]['s_work_exper'].plot.hist()
pd.value_counts(data_origin[data_origin['s_work_status'].isna()]['s_work_exper'])

5.0    1424
3.0    1017
4.0     823
6.0     221
2.0     217
1.0       1
Name: s_work_exper, dtype: int64

Only 1 record has s_work_exper equal to 1, so it is simply dropped.

data_origin.drop(data_origin[data_origin['s_work_status'].isna()][data_origin[data_origin['s_work_status'].isna()]['s_work_exper'] == 1].index, inplace=True)

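In the questionnaire, the negative codes have the same meaning for every question: -1 means not applicable, -2 means don't know, -3 means refused to answer, and -8 means unable to answer.

Here the negative values of every feature are replaced with that feature's median. The code takes this median over the distinct values found in rows that contain no negative code.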

data_origin.shape

(7978, 139)

no_ne_rows_index = (data_origin.drop(['survey_time', 'edu_other', 'property_other', 'invest_other'], axis=1) < 0).sum(axis=1)[(data_origin.drop(['survey_time', 'edu_other', 'property_other', 'invest_other'], axis=1) < 0).sum(axis=1) == 0].index

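for column, content in data_origin.items():
    if pd.api.types.is_numeric_dtype(content):
        # Median of the distinct values found in rows without any negative code, computed once per column.
        fill_value = pd.Series(data_origin.loc[no_ne_rows_index, column].unique()).median()
        # NaN < 0 evaluates to False, so missing values are left untouched at this step.
        data_origin[column] = content.mask(content < 0, fill_value)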
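To make the replacement rule concrete, here is a small sketch on made-up data. It is a simplified variant and not the article's exact rule: it uses the plain median of each column's non-negative values, whereas the article takes the median of the distinct values in rows that contain no negative code at all.

```python
import numpy as np
import pandas as pd

# Made-up answers; negative numbers are questionnaire codes, NaN is a skipped question.
toy = pd.DataFrame({
    "health": [3, -8, 4, 5, 3],
    "income": [20000.0, 50000.0, -2.0, np.nan, 30000.0],
})

for column in toy.columns:
    # Valid answers only: the comparison drops both negative codes and NaN.
    valid = toy.loc[toy[column] >= 0, column]
    # where(cond, other) keeps entries where cond is True; NaN is kept as NaN.
    toy[column] = toy[column].where(~(toy[column] < 0), valid.median())

print(toy)
```

Which median to use is a design choice: the median of distinct values ignores how often each answer occurs, so the two variants can differ noticeably on skewed ordinal scales.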

Once all negative values have been replaced, every NaN is filled with one uniform value, -1.

data_origin.fillna(-1, inplace=True)

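At this point the missing values of all features have been handled.

Text data processing

Among all the features, three (edu_other, property_other and invest_other) hold string data and need to be converted to an ordinal encoding.

First look at what was entered for edu_other.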

data_origin[data_origin['edu_other'] != -1]['edu_other'].to_frame()

	edu_other
id
1170	夜校
2513	夜校
4926	夜校

Every edu_other entry is 夜校 (night school). Convert the strings to ordinal codes.

data_origin['edu_other'] = data_origin['edu_other'].astype('category').values.codes + 1
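As a sketch of what the category-code conversion does, the example below encodes a small made-up Series. It uses the `.cat.codes` accessor on a Series of strings only; the article's column mixes the integer placeholder -1 with strings, so here the assumed placeholder is the string "missing" to keep the category order well defined.

```python
import pandas as pd

# Made-up free-text answers; "missing" stands in for the article's -1 placeholder.
answers = pd.Series(["missing", "b", "a", "b"])

# Categories are sorted lexicographically: ['a', 'b', 'missing'] -> codes 0, 1, 2.
codes = answers.astype("category").cat.codes + 1
print(codes.tolist())
```

The resulting integers only identify the categories. Their order follows the sort order of the labels and carries no meaning of its own.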

Next is property_other, which records who holds the property rights of the home. First check what was entered in the questionnaire.

data_origin[data_origin['property_other'] != -1]['property_other'].to_frame()

	property_other
id
76	無產權
92	已購買，但未過戶
99	家庭共同全部
132	待辦
455	沒有產權
...	...
7376	家人共有
7746	全家人共有
7776	兄弟共有
7821	未分家，全家全部
7917	家人共有

66 rows × 1 columns

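Many of the entries mean the same thing. For example, 家庭共同全部 and 全家全部 both say the home belongs to the whole family. In Python, however, they can only be mapped manually, one by one.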

#data_origin.loc[[8009, 9212, 9759, 10517], 'property_other'] = '多人擁有'
#data_origin.loc[[8014, 8056, 10264], 'property_other'] = '未過戶'
#data_origin.loc[[8471, 8825, 9597, 9810, 9842, 9967, 10069, 10166, 10203, 10469], 'property_other'] = '全家擁有'
#data_origin.loc[[8553, 8596, 9605, 10421, 10814], 'property_other'] = '無產權'

data_origin.loc[[76, 132, 455, 495, 1415, 2511, 2792, 2956, 3647, 4147, 4193, 4589, 5023, 5382, 5492, 6102, 6272, 6339, 
                6507, 7184, 7239], 'property_other'] = '無產權'
data_origin.loc[[92, 1888, 2703, 3381, 5654], 'property_other'] = '未過戶'
data_origin.loc[[99, 619, 2728, 3062, 3222, 3251, 3696, 5283, 6191, 7295, 7376, 7746, 7821, 7917], 'property_other'] = '全家擁有'
data_origin.loc[[1597, 4993, 5398, 5899, 7240, 7776], 'property_other'] = '多人擁有'
data_origin.loc[[6469, 6891], 'property_other'] = '小產權'
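Assigning labels by record id works for this one file but has to be redone for any other split. One possible alternative is to map the text itself with a few keyword rules. The sketch below is an assumption, not the article's method: `RULES` and `normalize_property` are hypothetical names, and the keywords were chosen only from the sample answers visible above.

```python
import pandas as pd

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("未過戶", "未過戶"),
    ("無產權", "無產權"),
    ("沒有產權", "無產權"),
    ("全家", "全家擁有"),
    ("家庭", "全家擁有"),
    ("家人", "全家擁有"),
    ("共有", "多人擁有"),
]

def normalize_property(text):
    """Map a free-text answer to a canonical label; unmatched text is returned unchanged."""
    if not isinstance(text, str):
        return text  # leave placeholders such as -1 untouched
    for keyword, label in RULES:
        if keyword in text:
            return label
    return text

# Sample answers copied from the output above, plus the -1 placeholder.
sample = pd.Series(["無產權", "已購買，但未過戶", "家庭共同全部", "待辦",
                    "沒有產權", "兄弟共有", "未分家，全家全部", -1])
normalized = sample.map(normalize_property)
print(normalized.tolist())
```

Answers that match no rule, such as 待辦, pass through unchanged and would still need a manual decision.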

Encode the strings as integer ordinal codes.

data_origin['property_other'] = data_origin['property_other'].astype('category').values.codes + 1

Look at what was entered for invest_other, the investment activities the respondent is engaged in.

pd.DataFrame(data_origin[data_origin['invest_other'] != -1]['invest_other'].unique())

	0
0	理財產品
1	民間借貸
2	銀行理財
3	儲蓄存款
4	理財
5	銀行存款利息
6	活期儲蓄
7	投資服務業、傢俱業
8	銀行存款
9	我的融資
10	租房
11	老人家不清楚
12	家中有部分土地承包出去
13	沒有
14	高利貸
15	彩票
16	本身沒有，兒女不清楚
17	網上理財
18	統籌
19	福利車票
20	其餘理財產品
21	商業萬能保險
22	投資開發區
23	字畫、茶壺

In the same way, convert it to integer ordinal codes.

data_origin['invest_other'] = data_origin['invest_other'].astype('category').values.codes + 1

Outlier handling

data_nona = data_origin.copy()

Draw box plots to inspect the features for abnormal values,

and drop the outlying records.

sns.boxplot(x=data_nona['house'])

<AxesSubplot:xlabel='house'>

data_nona.drop(data_nona[data_nona['house'] > 25].index, inplace=True)

sns.boxplot(x=data_nona['family_m'])

<AxesSubplot:xlabel='family_m'>

data_nona.drop(data_nona[data_nona['family_m'] > 40].index, inplace=True)

sns.boxplot(x=data_nona['inc_exp'])

<AxesSubplot:xlabel='inc_exp'>

data_nona.drop(data_nona[data_nona['inc_exp'] > 0.6e8].index, inplace=True)
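The cut-offs used above (25, 40 and 0.6e8) are read off the box plots by eye. For comparison, the whisker limits that a standard box plot draws can be computed with the 1.5 × IQR rule. This is a sketch of that common rule on made-up values, not the thresholds the article applies; `iqr_bounds` is a hypothetical helper name.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits of the k * IQR rule used for box plot whiskers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up household sizes with one implausible entry.
family_m = pd.Series([2, 3, 3, 4, 4, 5, 50])
lower, upper = iqr_bounds(family_m)
outliers = family_m[(family_m < lower) | (family_m > upper)]
print(lower, upper, outliers.tolist())
```

On strongly skewed features such as income, this rule flags far more records than a hand-picked threshold, so it is better treated as a starting point for inspection than as an automatic filter.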

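Look at the distribution of survey months. All questionnaires were filled in during 2015, so only outliers in the month need to be checked.

The plot shows that the survey started in June. The questionnaires dated February are abnormal records and should be dropped.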

sns.boxplot(x=data_nona['survey_time'].dt.month)

<AxesSubplot:xlabel='survey_time'>

data_nona.drop(data_nona[data_nona['survey_time'].dt.month < 6].index, inplace=True)

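Feature construction

Feature construction is also referred to as feature crossing, feature combination or data transformation.

Discretizing continuous variables

Besides its computational benefits, discretization can introduce non-linearity and makes it easy to build cross features. Discrete features are easy to add and remove, which helps iterate on the model quickly. In very noisy settings, discretization can also reduce the noise carried by a feature and improve its expressive power.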

pd.DataFrame(continuous_list)

	0
0	birth
1	edu_yr
2	income
3	floor_area
4	height_cm
5	weight_jin
6	work_yr
7	family_income
8	family_m
9	house
10	son
11	daughter
12	minor_child
13	marital_1st
14	s_birth
15	marital_now
16	s_income
17	f_birth
18	m_birth
19	inc_exp
20	public_service_1
21	public_service_2
22	public_service_3
23	public_service_4
24	public_service_5
25	public_service_6
26	public_service_7
27	public_service_8
28	public_service_9

Bin every continuous variable, then encode each interval to produce a new discrete feature.

for column in continuous_list:
    cut = pd.qcut(data_nona[column], q=5, duplicates='drop')
    cat = cut.values
    codes = cat.codes
    data_nona[column + '_discrete'] = codes
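The behaviour of `pd.qcut` with `q=5` and `duplicates='drop'` can be seen on two small made-up Series. The sketch shows equal-frequency binning on evenly spread values, and how heavily tied values produce fewer than five bins because coinciding quantile edges are merged.

```python
import pandas as pd

# Ten evenly spread values split into five equal-frequency bins.
even = pd.Series(range(10))
even_codes = pd.qcut(even, q=5, duplicates="drop").cat.codes
print(even_codes.tolist())

# Heavily tied values: several quantile edges coincide, so bins are merged.
tied = pd.Series([0] * 8 + [1, 2])
tied_codes = pd.qcut(tied, q=5, duplicates="drop").cat.codes
print(tied_codes.nunique())
```

This explains why some of the `_discrete` columns in the data have a maximum code below 4. Note also that the -1 placeholder filled in earlier is binned together with the smallest real values.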

for column, content in data_nona.items():
    if pd.api.types.is_numeric_dtype(content):
        data_nona[column] = content.astype('int')

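Feature selection

Discretizing the continuous variables produced new features with the suffix _discrete, so the original continuous features are dropped. A copy that still contains them is saved first and is used for the feature analysis.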

data_nona.to_csv('./data/happiness_train_complete_analysis.csv')

data_nona.drop(continuous_list, axis=1, inplace=True)

data_nona.to_csv('./data/happiness_train_complete_nona.csv')

Feature analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

data = pd.read_csv('./data/happiness_train_complete_analysis.csv', index_col='id', parse_dates=['survey_time'])

data.head()

	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	religion	...	inc_exp_discrete	public_service_1_discrete	public_service_2_discrete	public_service_3_discrete	public_service_4_discrete	public_service_5_discrete	public_service_6_discrete	public_service_7_discrete	public_service_8_discrete	public_service_9_discrete
id
1	4	1	12	32	59	2015-08-04 14:18:00	1	1959	1	1	...	2	0	0	0	0	0	0	0	0	0
2	4	2	18	52	85	2015-07-21 15:04:00	1	1992	1	1	...	2	4	1	2	3	4	1	4	0	0
3	4	2	29	83	126	2015-07-21 13:24:00	2	1967	1	0	...	3	4	2	3	3	3	4	4	4	2
4	5	2	10	28	51	2015-07-25 17:33:00	2	1943	1	1	...	0	4	3	2	3	3	4	4	3	2
5	4	1	7	18	36	2015-08-10 09:50:00	2	1994	1	1	...	4	0	0	0	0	0	0	0	0	0

5 rows × 168 columns

data.describe()

	happiness	survey_type	province	city	county	gender	birth	nationality	religion	religion_freq	...	inc_exp_discrete	public_service_1_discrete	public_service_2_discrete	public_service_3_discrete	public_service_4_discrete	public_service_5_discrete	public_service_6_discrete	public_service_7_discrete	public_service_8_discrete	public_service_9_discrete
count	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	...	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000	7968.000000
mean	3.866466	1.405120	15.158258	42.572164	70.631903	1.530748	1964.710216	1.399724	0.880271	1.452560	...	1.725653	1.665537	1.272214	1.841365	1.613328	1.848519	1.643449	1.651732	1.654869	1.302962
std	0.818844	0.490946	8.915876	27.183764	38.736751	0.499085	16.845155	1.466409	0.324665	1.358444	...	1.338535	1.420309	1.108440	1.342524	1.499494	1.297290	1.533445	1.544477	1.511468	1.078601
min	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1921.000000	1.000000	0.000000	1.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	4.000000	1.000000	7.000000	18.000000	37.000000	1.000000	1952.000000	1.000000	1.000000	1.000000	...	1.000000	0.000000	0.000000	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000
50%	4.000000	1.000000	15.000000	42.000000	73.000000	2.000000	1965.000000	1.000000	1.000000	1.000000	...	1.000000	2.000000	1.000000	2.000000	1.000000	2.000000	1.000000	1.000000	1.000000	1.000000
75%	4.000000	2.000000	22.000000	65.000000	104.000000	2.000000	1977.000000	1.000000	1.000000	1.000000	...	3.000000	2.000000	2.000000	3.000000	3.000000	3.000000	3.000000	3.000000	3.000000	2.000000
max	5.000000	2.000000	31.000000	89.000000	134.000000	2.000000	1997.000000	8.000000	1.000000	9.000000	...	4.000000	4.000000	3.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	3.000000

8 rows × 167 columns

# On pandas older than 1.2, pass null_counts=True instead of show_counts=True.
data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7968 entries, 1 to 8000
Data columns (total 168 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   happiness                  7968 non-null   int64         
 1   survey_type                7968 non-null   int64         
 2   province                   7968 non-null   int64         
 3   city                       7968 non-null   int64         
 4   county                     7968 non-null   int64         
 5   survey_time                7968 non-null   datetime64[ns]
 6   gender                     7968 non-null   int64         
 7   birth                      7968 non-null   int64         
 8   nationality                7968 non-null   int64         
 9   religion                   7968 non-null   int64         
 10  religion_freq              7968 non-null   int64         
 11  edu                        7968 non-null   int64         
 12  edu_other                  7968 non-null   int64         
 13  edu_status                 7968 non-null   int64         
 14  edu_yr                     7968 non-null   int64         
 15  income                     7968 non-null   int64         
 16  political                  7968 non-null   int64         
 17  join_party                 7968 non-null   int64         
 18  floor_area                 7968 non-null   int64         
 19  property_0                 7968 non-null   int64         
 20  property_1                 7968 non-null   int64         
 21  property_2                 7968 non-null   int64         
 22  property_3                 7968 non-null   int64         
 23  property_4                 7968 non-null   int64         
 24  property_5                 7968 non-null   int64         
 25  property_6                 7968 non-null   int64         
 26  property_7                 7968 non-null   int64         
 27  property_8                 7968 non-null   int64         
 28  property_other             7968 non-null   int64         
 29  height_cm                  7968 non-null   int64         
 30  weight_jin                 7968 non-null   int64         
 31  health                     7968 non-null   int64         
 32  health_problem             7968 non-null   int64         
 33  depression                 7968 non-null   int64         
 34  hukou                      7968 non-null   int64         
 35  hukou_loc                  7968 non-null   int64         
 36  media_1                    7968 non-null   int64         
 37  media_2                    7968 non-null   int64         
 38  media_3                    7968 non-null   int64         
 39  media_4                    7968 non-null   int64         
 40  media_5                    7968 non-null   int64         
 41  media_6                    7968 non-null   int64         
 42  leisure_1                  7968 non-null   int64         
 43  leisure_2                  7968 non-null   int64         
 44  leisure_3                  7968 non-null   int64         
 45  leisure_4                  7968 non-null   int64         
 46  leisure_5                  7968 non-null   int64         
 47  leisure_6                  7968 non-null   int64         
 48  leisure_7                  7968 non-null   int64         
 49  leisure_8                  7968 non-null   int64         
 50  leisure_9                  7968 non-null   int64         
 51  leisure_10                 7968 non-null   int64         
 52  leisure_11                 7968 non-null   int64         
 53  leisure_12                 7968 non-null   int64         
 54  socialize                  7968 non-null   int64         
 55  relax                      7968 non-null   int64         
 56  learn                      7968 non-null   int64         
 57  social_neighbor            7968 non-null   int64         
 58  social_friend              7968 non-null   int64         
 59  socia_outing               7968 non-null   int64         
 60  equity                     7968 non-null   int64         
 61  class                      7968 non-null   int64         
 62  class_10_before            7968 non-null   int64         
 63  class_10_after             7968 non-null   int64         
 64  class_14                   7968 non-null   int64         
 65  work_exper                 7968 non-null   int64         
 66  work_status                7968 non-null   int64         
 67  work_yr                    7968 non-null   int64         
 68  work_type                  7968 non-null   int64         
 69  work_manage                7968 non-null   int64         
 70  insur_1                    7968 non-null   int64         
 71  insur_2                    7968 non-null   int64         
 72  insur_3                    7968 non-null   int64         
 73  insur_4                    7968 non-null   int64         
 74  family_income              7968 non-null   int64         
 75  family_m                   7968 non-null   int64         
 76  family_status              7968 non-null   int64         
 77  house                      7968 non-null   int64         
 78  car                        7968 non-null   int64         
 79  invest_0                   7968 non-null   int64         
 80  invest_1                   7968 non-null   int64         
 81  invest_2                   7968 non-null   int64         
 82  invest_3                   7968 non-null   int64         
 83  invest_4                   7968 non-null   int64         
 84  invest_5                   7968 non-null   int64         
 85  invest_6                   7968 non-null   int64         
 86  invest_7                   7968 non-null   int64         
 87  invest_8                   7968 non-null   int64         
 88  invest_other               7968 non-null   int64         
 89  son                        7968 non-null   int64         
 90  daughter                   7968 non-null   int64         
 91  minor_child                7968 non-null   int64         
 92  marital                    7968 non-null   int64         
 93  marital_1st                7968 non-null   int64         
 94  s_birth                    7968 non-null   int64         
 95  marital_now                7968 non-null   int64         
 96  s_edu                      7968 non-null   int64         
 97  s_political                7968 non-null   int64         
 98  s_hukou                    7968 non-null   int64         
 99  s_income                   7968 non-null   int64         
 100 s_work_exper               7968 non-null   int64         
 101 s_work_status              7968 non-null   int64         
 102 s_work_type                7968 non-null   int64         
 103 f_birth                    7968 non-null   int64         
 104 f_edu                      7968 non-null   int64         
 105 f_political                7968 non-null   int64         
 106 f_work_14                  7968 non-null   int64         
 107 m_birth                    7968 non-null   int64         
 108 m_edu                      7968 non-null   int64         
 109 m_political                7968 non-null   int64         
 110 m_work_14                  7968 non-null   int64         
 111 status_peer                7968 non-null   int64         
 112 status_3_before            7968 non-null   int64         
 113 view                       7968 non-null   int64         
 114 inc_ability                7968 non-null   int64         
 115 inc_exp                    7968 non-null   int64         
 116 trust_1                    7968 non-null   int64         
 117 trust_2                    7968 non-null   int64         
 118 trust_3                    7968 non-null   int64         
 119 trust_4                    7968 non-null   int64         
 120 trust_5                    7968 non-null   int64         
 121 trust_6                    7968 non-null   int64         
 122 trust_7                    7968 non-null   int64         
 123 trust_8                    7968 non-null   int64         
 124 trust_9                    7968 non-null   int64         
 125 trust_10                   7968 non-null   int64         
 126 trust_11                   7968 non-null   int64         
 127 trust_12                   7968 non-null   int64         
 128 trust_13                   7968 non-null   int64         
 129 neighbor_familiarity       7968 non-null   int64         
 130 public_service_1           7968 non-null   int64         
 131 public_service_2           7968 non-null   int64         
 132 public_service_3           7968 non-null   int64         
 133 public_service_4           7968 non-null   int64         
 134 public_service_5           7968 non-null   int64         
 135 public_service_6           7968 non-null   int64         
 136 public_service_7           7968 non-null   int64         
 137 public_service_8           7968 non-null   int64         
 138 public_service_9           7968 non-null   int64         
 139 birth_discrete             7968 non-null   int64         
 140 edu_yr_discrete            7968 non-null   int64         
 141 income_discrete            7968 non-null   int64         
 142 floor_area_discrete        7968 non-null   int64         
 143 height_cm_discrete         7968 non-null   int64         
 144 weight_jin_discrete        7968 non-null   int64         
 145 work_yr_discrete           7968 non-null   int64         
 146 family_income_discrete     7968 non-null   int64         
 147 family_m_discrete          7968 non-null   int64         
 148 house_discrete             7968 non-null   int64         
 149 son_discrete               7968 non-null   int64         
 150 daughter_discrete          7968 non-null   int64         
 151 minor_child_discrete       7968 non-null   int64         
 152 marital_1st_discrete       7968 non-null   int64         
 153 s_birth_discrete           7968 non-null   int64         
 154 marital_now_discrete       7968 non-null   int64         
 155 s_income_discrete          7968 non-null   int64         
 156 f_birth_discrete           7968 non-null   int64         
 157 m_birth_discrete           7968 non-null   int64         
 158 inc_exp_discrete           7968 non-null   int64         
 159 public_service_1_discrete  7968 non-null   int64         
 160 public_service_2_discrete  7968 non-null   int64         
 161 public_service_3_discrete  7968 non-null   int64         
 162 public_service_4_discrete  7968 non-null   int64         
 163 public_service_5_discrete  7968 non-null   int64         
 164 public_service_6_discrete  7968 non-null   int64         
 165 public_service_7_discrete  7968 non-null   int64         
 166 public_service_8_discrete  7968 non-null   int64         
 167 public_service_9_discrete  7968 non-null   int64         
dtypes: datetime64[ns](1), int64(167)
memory usage: 10.3 MB

First, look at the distribution of the happiness level. Most respondents fall into the "fairly happy" category.

sns.set_theme(style="darkgrid")
sns.displot(data, x="happiness", facet_kws=dict(margin_titles=True))

<seaborn.axisgrid.FacetGrid at 0x25128009850>

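A scatter plot of each respondent's income against happiness shows that, as income rises, most points fall at the higher happiness levels. Even so, some respondents with very high incomes sit at the "neither happy nor unhappy" level.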

sns.set_theme(style="whitegrid")
f, ax = plt.subplots()
sns.despine(f, left=True, bottom=True)
sns.scatterplot(x="happiness", y="income",
                size="income",
                palette="ch:r=-.2,d=.3_r",
                data=data, ax=ax)

<AxesSubplot:xlabel='happiness', ylabel='income'>

Histograms of happiness split by gender show no strong class imbalance across the gender feature.

sns.set_theme(style="darkgrid")
sns.displot(
    data, x="happiness", col="gender",
    facet_kws=dict(margin_titles=True)
)

<seaborn.axisgrid.FacetGrid at 0x25128b8e1f0>

The line plot shows that happiness rises as the education level (edu) increases.

sns.set_theme(style="ticks")
# Mean happiness at each education level; the band is seaborn's default confidence interval
sns.relplot(
    data=data,
    x="edu", y="happiness",
    kind="line",
    facet_kws=dict(sharex=False)
)

<seaborn.axisgrid.FacetGrid at 0x251284bdeb0>

A box plot of birth year at each happiness level shows that the birth-year distributions are much the same across happiness levels.

sns.set_theme(style="ticks", palette="pastel")
sns.boxplot(x="happiness", y="birth",
            data=data)
sns.despine(offset=10, trim=True)

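Splitting the records by whether the respondent holds a religious belief, the split violin plot of happiness against health status shows a trend: respondents with higher happiness are mostly concentrated at the better health levels. It also suggests that the number of religious respondents gradually increases as health and happiness improve.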

sns.set_theme(style="whitegrid")
sns.violinplot(data=data, x="happiness", y="health", hue="religion",
               split=True, inner="quart", linewidth=1)
sns.despine(left=True)

A bivariate histogram shows that, even among the fairly happy majority, the number of properties owned does not grow much.

import seaborn as sns
sns.set_theme(style="ticks")
g = sns.JointGrid(data=data, x="happiness", y="house", marginal_ticks=True)

# Keep a linear scale on the y axis
g.ax_joint.set(yscale="linear")

# Create an inset legend for the histogram colorbar
cax = g.fig.add_axes([.15, .55, .02, .2])

# Add the joint and marginal histogram plots
g.plot_joint(
    sns.histplot, discrete=(True, False),
    cmap="light:#03012d", pmax=.8, cbar=True, cbar_ax=cax
)
g.plot_marginals(sns.histplot, element="step", color="#03012d")

<seaborn.axisgrid.JointGrid at 0x251288fee20>

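A kernel density estimate of happiness against floor area shows the same pattern: for most fairly happy respondents, floor area is not concentrated at a very high level, although happiness does tend to increase with floor area.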

sns.set_theme(style="ticks")
g = sns.jointplot(
    data=data[data['floor_area'] < 600],
    x="happiness", y="floor_area",
    kind="kde",
)

A heat map of the features shows, through colour depth, how strongly each pair of features is correlated.

sns.set_theme(style="whitegrid")
corr_list = ['survey_type', 'province', 'city', 'county', 'survey_time', 'gender', 'birth', 'nationality', 'religion',
                 'religion_freq', 'edu', 'income', 'political', 'floor_area', 'height_cm', 'weight_jin', 'health', 'health_problem',
                 'depression', 'hukou', 'socialize', 'relax', 'learn', 'equity', 'class', 'work_exper', 'work_status', 'work_yr', 'work_type',
                 'work_manage', 'family_income', 'family_m', 'family_status', 'house', 'car', 'marital', 'status_peer', 'status_3_before', 
                 'view', 'inc_ability']
df = data
corr_mat = data[corr_list].corr().stack().reset_index(name="correlation")
g = sns.relplot(
    data=corr_mat,
    x="level_0", y="level_1", hue="correlation", size="correlation",
    palette="vlag", hue_norm=(-1, 1), edgecolor=".7",
    height=10, sizes=(50, 250), size_norm=(-.2, .8),
)
g.set(xlabel="", ylabel="", aspect="equal")
g.despine(left=True, bottom=True)
g.ax.margins(.02)
for label in g.ax.get_xticklabels():
    label.set_rotation(90)
for artist in g.legend.legendHandles:
    artist.set_edgecolor(".7")

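A bar chart compares the number of happy respondents with the total number of respondents in each province. Hubei has the most respondents but a relatively modest number of happy ones; Henan and Shandong have very high shares of happy respondents; and although Inner Mongolia has the fewest respondents, its share of happy respondents is very high.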

sns.set_theme(style="whitegrid")

province_total = data['province'].groupby(data['province']).count().sort_values(ascending=False).to_frame()
province_total.columns = ['total']
# Number of respondents with happiness > 3 in each province (same order as province_total)
happiness_involved = []
for index in province_total.index:
    in_province = data[data['province'] == index]
    happiness_involved.append((in_province['happiness'] > 3).sum())
happiness_involved = pd.DataFrame(happiness_involved, index=province_total.index)
happiness_involved.columns = ['involved']
province_total['province'] = province_total.index.map({
    1 : 'Shanghai', 2 : 'Yunnan', 3 : 'Neimeng', 4 : 'Beijing', 5 : 'Jilin', 6 : 'Sichuan', 7 : 'Tianjin', 8 : 'Ningxia',
    9 : 'Anhui', 10 : 'Shandong', 11 : 'Shanxi', 12 : 'Guangdong', 13 : 'Guangxi', 14 : 'Xinjiang', 15 : 'Jiangsu',
    16 : 'Jiangxi', 17 : 'Hebei', 18 : 'Henan', 19 : 'Zhejiang', 20 : 'Hainan', 21 : 'Hubei', 22 : 'Hunan', 23 : 'Gansu',
    24 : 'Fujian', 25 : 'Xizang', 26 : 'Guizhou', 27 : 'Liaoning', 28 : 'Chongqing', 29 : 'Shaanxi', 30 : 'Qinghai', 31 : 'Heilongjiang'})
happiness_involved['province'] = province_total['province']

f, ax = plt.subplots(figsize=(6, 15))

sns.set_color_codes("pastel")
sns.barplot(x="total", y="province", data=province_total,
            label="Total", color="b")

sns.set_color_codes("muted")
sns.barplot(x="involved", y="province", data=happiness_involved,
            label="Happy (happiness > 3)", color="b")

ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(ylabel="", xlabel="Happiness of every province")
sns.despine(left=True, bottom=True)

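A stacked histogram of perceived social fairness (equity), broken down by happiness level, shows that most respondents consider today's society fairly equitable, yet nearly half still feel it is not very fair.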

sns.set_theme(style="ticks")
f, ax = plt.subplots(figsize=(7, 5))
sns.despine(f)
sns.histplot(
    data, hue='happiness',
    x="equity",
    multiple="stack",
    palette="light:m_r",
    edgecolor=".3",
    linewidth=.5
)

<AxesSubplot:xlabel='equity', ylabel='Count'>

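In the multivariate scatter plot, respondents with high happiness are spread evenly across heights and weights; body shape has little effect on happiness.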

sns.set_theme(style="white")
sns.relplot(x="height_cm", y="weight_jin", hue="happiness", size="health",
             alpha=.5, palette="muted", data=data)

<seaborn.axisgrid.FacetGrid at 0x2512b5a2ca0>

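A line plot with an error band puts happiness on the horizontal axis and expected annual income on the vertical axis. Respondents with lower happiness tend to expect a very high annual income, with a very wide error band. As happiness rises, the expected income does not increase, and the error band narrows.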

sns.set_theme(style="ticks")
# Mean expected income at each happiness level, with seaborn's default confidence band
sns.relplot(
    data=data,
    x="happiness", y="inc_exp",
    kind="line",
    aspect=.75, facet_kws=dict(sharex=False)
)

<seaborn.axisgrid.FacetGrid at 0x2512b5e8d90>

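Model building

Introduction to the XGBoost model

XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on the major distributed environments (such as Hadoop, SGE and MPI) and can handle billions of examples.

XGBoost stands for Extreme Gradient Boosting.

Decision tree ensembles

First, consider the model that XGBoost uses: decision tree ensembles. A tree ensemble model consists of a set of classification and regression trees (CART). The figure below gives a simple example of a CART that classifies whether a person likes playing computer games.

Each family member is assigned to a leaf, and each leaf carries a score. A CART differs slightly from a decision tree, in which each leaf holds only a decision value. In a CART, a real-valued score is associated with each leaf, which gives a richer interpretation than classification alone. It also allows a more principled and unified approach to optimization.

In practice, a single tree is usually not strong enough. What is actually used is an ensemble model, which sums the predictions of several trees.

The figure above shows an ensemble of two trees. The prediction scores of the individual trees are added together to give the final score. An important point is that the two trees try to complement each other. The model can be written as: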

\[\hat{y}_i=\sum_{k=1}^Kf_k\left (x_i\right ),f_k\in\mathcal{F} \]

where \(K\) is the number of trees, \(f_k\) is a function in the function space \(\mathcal{F}\), and \(\mathcal{F}\) is the set of all possible CARTs. The objective function to be optimized is:

\[\mathit{obj}\left (\theta\right )=\sum_i^n\ell\left (y_i,\hat{y}_i\right )+\sum_{k=1}^K\Omega\left (f_k\right ) \]

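Random forests and boosted trees are really the same model; the difference lies in how they are trained. A prediction routine written for tree ensembles therefore needs to be written only once, and it works for both random forests and boosted trees.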
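The additive model can be illustrated with a minimal sketch. The two toy trees, their split features and their leaf scores are made up for illustration and are not taken from the dataset; the only point is that the ensemble prediction is the sum of the leaf scores returned by each tree.

```python
# Hypothetical toy trees: each maps a feature dict to a leaf score.
def tree_age(person):
    """First tree: splits on age, then on gender."""
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0


def tree_computer(person):
    """Second tree: splits on daily computer use."""
    return 0.9 if person["uses_computer_daily"] else -0.9


def ensemble_predict(trees, x):
    """y_hat = sum_k f_k(x): add up the leaf scores of all trees."""
    return sum(f(x) for f in trees)


trees = [tree_age, tree_computer]
boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandpa = {"age": 70, "is_male": True, "uses_computer_daily": False}

boy_score = ensemble_predict(trees, boy)          # 2.0 + 0.9
grandpa_score = ensemble_predict(trees, grandpa)  # -1.0 + (-0.9)
```

The same `ensemble_predict` function would serve a random forest or a boosted ensemble, since both are just collections of trees whose outputs are combined.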

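Tree boosting

As with all supervised learning, training the trees starts by defining an objective function and optimizing it.

An objective function should always contain a training loss and a regularization term.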

\[\mathit{obj}=\sum_i^n\ell\left (y_i,\hat{y}_i\right )+\sum_{k=1}^K\Omega\left (f_k\right ) \]

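Additive training

The parameters to learn are the functions \(f_k\), each of which contains the structure of a tree and its leaf scores. Learning the tree structure is much harder than a traditional optimization problem in which one can simply take the gradient. Learning all the trees at once is intractable. Instead, an additive strategy is used: keep what has already been learned fixed and add one new tree at a time. The prediction at step \(t\) is written \(\hat{y}_i^{(t)}\):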

\[\begin{split} \hat{y}_i^{\left (0\right )}&=0\\ \hat{y}_i^{\left (1\right )}&=f_1\left (x_i\right )=\hat{y}_i^{\left (0\right )}+f_1\left (x_i\right )\\ \hat{y}_i^{\left (2\right )}&=f_1\left (x_i\right )+f_2\left (x_i\right )=\hat{y}_i^{\left (1\right )}+f_2\left (x_i\right )\\ &\dots\\ \hat{y}_i^{\left (t\right )}&=\sum_{k=1}^tf_k\left (x_i\right )=\hat{y}_i^{\left (t-1\right )}+f_t\left (x_i\right )\\ \end{split} \]
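The recursion can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: each \(f_t\) is a one-split stump fit to the current residuals under squared error, with no regularization and no learning rate. The helper names (`fit_stump`, `stump_predict`, `mse`) and the toy data are hypothetical, and this is not the criterion XGBoost itself uses to build a tree.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < thr]
        right = [r for x, r in zip(xs, residuals) if x >= thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lv, rv)
    return best[1:]


def stump_predict(stump, x):
    thr, lv, rv = stump
    return lv if x < thr else rv


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy one-dimensional data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.5, 3.0, 3.5, 5.0, 5.5]

y_hat = [0.0] * len(xs)            # y_hat^(0) = 0
history = [mse(ys, y_hat)]
for t in range(3):
    residuals = [y - p for y, p in zip(ys, y_hat)]
    f_t = fit_stump(xs, residuals)
    # y_hat^(t) = y_hat^(t-1) + f_t(x_i); earlier trees are left untouched
    y_hat = [p + stump_predict(f_t, x) for p, x in zip(y_hat, xs)]
    history.append(mse(ys, y_hat))
```

Each pass through the loop leaves the earlier stumps as they are and adds one more, so `history` records the training error after 0, 1, 2 and 3 trees.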

Which tree should be added at each step? The natural choice is the tree that optimizes the objective function.

\[\begin{split} \mathit{obj}^{(t)}&=\sum_{i=1}^n\ell(y_i,\hat{y}_i^{(t)})+\sum_{i=1}^t\Omega(f_i)\\ &=\sum_{i=1}^n\ell(y_i,\hat{y}^{(t-1)}_i+f_t(x_i))+\Omega(f_t)+C \end{split} \]

If mean squared error (MSE) is used as the loss function, the objective becomes:

\[\begin{split} \mathit{obj}^{(t)}&=\sum_{i=1}^n\ell(y_i,\hat{y}^{(t-1)}_i+f_t(x_i))+\Omega(f_t)+C\\ &=\sum_{i=1}^n(y_i-(\hat{y}_i^{(t-1)}+f_t(x_i)))^2+\Omega(f_t)+C\\ &=\sum_{i=1}^n((y_i-\hat{y}_i^{(t-1)})-f_t(x_i))^2+\Omega(f_t)+C\\ &=\sum_{i=1}^n((y_i-\hat{y}_i^{(t-1)})^2-2(y_i-\hat{y}_i^{(t-1)})f_t(x_i)+f_t(x_i)^2)+\Omega(f_t)+C\\ &=\sum_{i=1}^n(-2(y_i-\hat{y}_i^{(t-1)})f_t(x_i)+f_t(x_i)^2)+\Omega(f_t)+C\\ \end{split} \]

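The MSE form is convenient: it has a first-order term (usually called the residual) and a quadratic term. For other loss functions (such as the logistic loss) it is not so easy to obtain such a clean form. In general, therefore, the loss function is expanded with Taylor's formula up to the second order:

Taylor's formula: if the function \(f(x)\) has derivatives up to order \((n+1)\) on an open interval \((a,b)\) containing \(x_0\), then for any \(x\in(a,b)\)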

\[f(x)=\frac{f(x_0)}{0!}+\frac{f'(x_0)}{1!}(x-x_0)+\frac{f''(x_0)}{2!}(x-x_0)^2+\dots+\frac{f^{(n)}(x_0)}{n!}(x-x_0)^n+R_n(x) \]

\[\begin{split} \mathit{obj}^{(t)}&=\sum_{i=1}^n\ell(y_i,\hat{y}^{(t-1)}_i+f_t(x_i))+\Omega(f_t)+C\\ &\approx\sum_{i=1}^n[\frac{\ell(y_i,\hat{y}^{(t-1)}_i)}{0!}+\frac{\ell'(y_i,\hat{y}^{(t-1)}_i)}{1!}(\hat{y}^{(t)}_i-\hat{y}^{(t-1)}_i)+\frac{\ell''(y_i,\hat{y}^{(t-1)}_i)}{2!}(\hat{y}^{(t)}_i-\hat{y}^{(t-1)}_i)^2]+\Omega(f_t)+C\\ &=\sum_{i=1}^n[\ell(y_i,\hat{y}^{(t-1)}_i)+\ell'(y_i,\hat{y}^{(t-1)}_i)f_t(x_i)+\frac{1}{2}\ell''(y_i,\hat{y}^{(t-1)}_i)f_t(x_i)^2]+\Omega(f_t)+C\\ &=\sum_{i=1}^n[\ell(y_i,\hat{y}^{(t-1)}_i)+g_if_t(x_i)+\frac{1}{2}h_if_t(x_i)^2]+\Omega(f_t)+C\\ \end{split} \]

where \(g_i\) and \(h_i\) are defined as:

\[\begin{split} g_i&=\partial_{\hat{y}_i^{(t-1)}}\ell(y_i,\hat{y}^{(t-1)}_i)\\ h_i&=\partial^2_{\hat{y}^{(t-1)}_i}\ell(y_i,\hat{y}^{(t-1)}_i) \end{split} \]

After removing all the constants, the objective at step \(t\) becomes:

\[\sum_{i=1}^n[g_if_t(x_i)+\frac{1}{2}h_if_t(x_i)^2]+\Omega(f_t) \]

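This is the optimization goal for the new tree. An important advantage of this definition is that the value of the objective depends only on \(g_i\) and \(h_i\). This is how XGBoost supports custom loss functions: any loss, including logistic regression and pairwise ranking, can be optimized with exactly the same solver, which takes \(g_i\) and \(h_i\) as input.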
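As a worked illustration, the sketch below writes out \(g_i\) and \(h_i\) for two losses and checks the second-order approximation numerically. The function names and the sample values are hypothetical. The squared-error case uses \(\ell=(y-\hat{y})^2\) as in the MSE derivation, and the logistic case assumes \(\hat{y}\) is a raw score passed through a sigmoid.

```python
import math


def squared_error_grad_hess(y, y_hat):
    """l = (y - y_hat)^2  ->  g = 2 (y_hat - y), h = 2."""
    return 2.0 * (y_hat - y), 2.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(y, y_hat):
    """l = -[y log p + (1 - y) log(1 - p)] with p = sigmoid(y_hat)."""
    p = sigmoid(y_hat)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def logistic_grad_hess(y, y_hat):
    """First and second derivative of the logistic loss w.r.t. y_hat."""
    p = sigmoid(y_hat)
    return p - y, p * (1.0 - p)


# Compare the exact loss after adding a small tree output f
# with its second-order Taylor approximation around y_prev.
y, y_prev, f = 1.0, 0.3, 0.1
g, h = logistic_grad_hess(y, y_prev)
exact = logistic_loss(y, y_prev + f)
approx = logistic_loss(y, y_prev) + g * f + 0.5 * h * f ** 2
```

For squared error the expansion is exact, which is why the MSE objective already has the form \(g_if_t(x_i)+\frac{1}{2}h_if_t(x_i)^2\) plus a constant. For the logistic loss `exact` and `approx` agree closely as long as `f` is small.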

Model complexity

Next, define the complexity of a tree, \(\Omega(f)\). First, refine the definition of the tree \(f(x)\) as:

\[f_t(x)=w_{q(x)},w\in\mathbb{R}^T,q:\mathbb{R}^d\rightarrow \{1,2,\dots,T\}. \]

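where \(w\) is the vector of scores on the leaves, \(q\) is a function that assigns each data point to its leaf, and \(T\) is the number of leaves. In XGBoost, the complexity is defined as: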

\[\Omega(f)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^Tw_j^2 \]

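There is more than one way to define complexity, but this one works well in practice. Regularization is a part that most tree packages handle less carefully or simply ignore. This is because the traditional treatment of tree learning only emphasized reducing impurity and left the control of model complexity to heuristics. Defining it formally gives a better understanding of what is being learned and leads to models that generalize better.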
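The complexity term is simple enough to compute by hand. The sketch below does so for a hypothetical two-leaf tree; the leaf scores and the values of \(\gamma\) and \(\lambda\) are made up for illustration.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


# Two leaves with scores 0.6 and -0.8:
# 0.5 * 2 + 0.5 * 1.0 * (0.36 + 0.64) = 1.5
omega = tree_complexity([0.6, -0.8], gamma=0.5, lam=1.0)
```

The \(\gamma T\) part penalizes the number of leaves, while the \(\lambda\) part penalizes large leaf scores.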

The structure score

From the derivation above, the objective value of the tree added at step \(t\) can be written as:

\[\begin{split} \mathit{obj}^{(t)}&\approx\sum_{i=1}^n[g_iw_{q(x_i)}+\frac{1}{2}h_iw^2_{q(x_i)}]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^Tw_j^2\\ &=[g_1w_{q(x_1)}+\frac{1}{2}h_1w^2_{q(x_1)}+g_2w_{q(x_2)}+\frac{1}{2}h_2w^2_{q(x_2)}+\dots+g_nw_{q(x_n)}+\frac{1}{2}h_nw^2_{q(x_n)}]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^Tw_j^2\\ &=\sum_{j=1}^T[(\sum_{i\in I_j}g_i)w_j+\frac{1}{2}(\sum_{i\in I_j}h_i)w^2_j]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^Tw_j^2\\ &=\sum_{j=1}^T[(\sum_{i\in I_j}g_i)w_j+\frac{1}{2}(\sum_{i\in I_j}h_i)w^2_j+\frac{1}{2}\lambda w_j^2]+\gamma T\\ &=\sum_{j=1}^T[(\sum_{i\in I_j}g_i)w_j+\frac{1}{2}(\sum_{i\in I_j}h_i+\lambda)w^2_j]+\gamma T\\ \end{split} \]

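where \(I_j=\{i|q(x_i)=j\}\) is the set of indices of the data points assigned to the \(j\)-th leaf. The index of summation changes in the third line because all data points on the same leaf receive the same score. The expression can be compressed further by defining \(G_j=\sum_{i\in I_j}g_i\) and \(H_j=\sum_{i\in I_j}h_i\):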

\[\mathit{obj}^{(t)}=\sum_{j=1}^T[G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2]+\gamma T \]

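Here the \(w_j\) are independent of each other, and the expression \(G_jw_j+\frac{1}{2}(H_j+\lambda)w_j^2\) is quadratic in \(w_j\). For a given structure \(q(x)\), the best \(w_j\) and the best attainable objective value are therefore: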

\[\begin{split} w_j^\ast&=-\frac{G_j}{H_j+\lambda}\\ \mathit{obj}^\ast&=-\frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda}+\gamma T \end{split} \]

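The last formula measures how good a tree structure \(q(x)\) is.

Basically, for a given tree structure, the statistics \(g_i\) and \(h_i\) are pushed to the leaves they belong to and summed, and the formula is then used to compute how good the tree is. This score is like the impurity measure in a decision tree, except that it also takes model complexity into account.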
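The procedure just described can be sketched as follows. The function name `structure_score`, the leaf assignment and the values of \(g_i\) and \(h_i\) are hypothetical; the \(h_i=2\) values correspond to the squared-error loss.

```python
def structure_score(g, h, leaf_of, n_leaves, lam=1.0, gamma=0.0):
    """Return the optimal leaf weights w_j* and the objective obj*.

    leaf_of[i] is the leaf index q(x_i) of data point i.
    """
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    # Push each point's statistics to its leaf and sum them there.
    for gi, hi, j in zip(g, h, leaf_of):
        G[j] += gi
        H[j] += hi
    weights = [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]
    obj = -0.5 * sum(Gj ** 2 / (Hj + lam) for Gj, Hj in zip(G, H)) + gamma * n_leaves
    return weights, obj


# Four data points, the first two in leaf 0 and the last two in leaf 1.
g = [-2.0, -1.0, 1.0, 3.0]
h = [2.0, 2.0, 2.0, 2.0]
leaf_of = [0, 0, 1, 1]

# G = [-3, 4], H = [4, 4]  ->  weights = [0.6, -0.8], obj = -2.5 + 1.0 = -1.5
weights, obj = structure_score(g, h, leaf_of, n_leaves=2, lam=1.0, gamma=0.5)
```

A lower `obj` means a better structure, so two candidate structures can be compared directly by this number.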

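Learning the tree structure

Now that there is a way to measure how good a tree is, the ideal approach would be to enumerate all possible trees and pick the best one. In practice this is intractable, so the tree is optimized one level at a time. Specifically, a leaf is split into two leaves, and the gain in score is: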

\[\mathit{Gain}=\frac{1}{2}\left [\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma \]

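This formula can be decomposed into four parts: the score on the new left leaf, the score on the new right leaf, the score on the original leaf, and the regularization on the additional leaf. An important consequence is that if the improvement in score (the bracketed term times \(\frac{1}{2}\)) is smaller than \(\gamma\), it is better not to add the branch. This is exactly the pruning technique used in tree-based models.

For real-valued data, one usually wants to search for the optimal split point. An efficient way to do this is to sort all the instances (records), as in the figure below.

A left-to-right scan is then enough to compute the structure score of every possible split and to find the best split point quickly.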
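The scan can be sketched for a single feature as follows. This is a simplified illustration rather than XGBoost's actual implementation: it keeps running sums \(G_L\) and \(H_L\) while walking the sorted instances, evaluates the gain formula at each boundary, and returns no split when no candidate has positive gain. The function names and toy values are hypothetical.

```python
def leaf_score(G, H, lam):
    """G^2 / (H + lambda), the building block of the gain formula."""
    return G * G / (H + lam)


def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan the sorted instances left to right; return (best_gain, threshold).

    Returns (0.0, None) when no split has positive gain (pruning).
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    G_total, H_total = sum(g), sum(h)
    G_left = H_left = 0.0
    best_gain, best_thr = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_left += g[i]
        H_left += h[i]
        if x[i] == x[nxt]:
            continue  # cannot split between equal feature values
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (leaf_score(G_left, H_left, lam)
                      + leaf_score(G_right, H_right, lam)
                      - leaf_score(G_total, H_total, lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[nxt]) / 2
    return best_gain, best_thr


x = [3.0, 1.0, 4.0, 2.0]
g = [1.0, -2.0, 3.0, -1.0]
h = [2.0, 2.0, 2.0, 2.0]

# Best split is between x = 2 and x = 3, with gain 35/18 (about 1.944).
gain, thr = best_split(x, g, h, lam=1.0, gamma=0.5)

# With a large gamma every candidate has negative gain, so no split is made.
pruned_gain, pruned_thr = best_split(x, g, h, lam=1.0, gamma=5.0)
```

Because the left-hand sums are updated incrementally, the scan itself is linear in the number of instances once they are sorted.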

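Limitations of additive tree learning

Because enumerating all possible tree structures is intractable, one split is added at a time. This approach works well most of the time, but there are edge cases in which it fails, because only one feature dimension is considered at a time. For an example of such a degenerate result, see Can Gradient Boosting Learn Simple Arithmetic?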

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

train = pd.read_csv('./data/happiness_train_complete_nona.csv', index_col='id', parse_dates=['survey_time'])
test = pd.read_csv('./data/happiness_test_complete_nona.csv', index_col='id', parse_dates=['survey_time'])
submit = pd.read_csv('./data/happiness_submit.csv', index_col='id')

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

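# Features: every column except the target and the raw survey timestamp
X = train.drop(['happiness', 'survey_time'], axis=1)
y = train['happiness']

# Hold out part of the training data for validation (default split: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y)

from xgboost import XGBRegressor
from xgboost import plot_importance

# gamma: minimum loss reduction required to make a split; learning_rate: shrinkage applied to each new tree
model = XGBRegressor(gamma=0.1, learning_rate=0.1)
model.fit(X_train, y_train)
# Mean squared error on the held-out split
mean_squared_error(y_test, model.predict(X_test))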

0.4596381608913307

predict = pd.DataFrame({'happiness' : model.predict(test.drop('survey_time', axis=1))}, index=test.index)

submit.loc[predict.index, 'happiness'] = predict['happiness']

submit.to_csv('./data/predict.csv')
