Predict Future Sales
- Data analysis
  - 1. Basic data processing
  - 2. Data exploration
  - 3 Feature engineering
    - 3.1 Aggregating monthly sales
    - 3.2 Merging related information
    - 3.2 Historical information
      - 3.2.1 Lag operation to generate lagged features, with selectable lag months
      - 3.2.2 Monthly sales history per item-shop pair
      - 3.2.3 Historical mean monthly sales over all item-shop pairs
      - 3.2.4 Mean monthly sales and lag features per item
      - 3.2.5 Mean monthly sales and lag features per shop
      - 3.2.6 Mean monthly sales and lag features per item category
      - 3.2.7 Mean monthly sales and lag features per category-shop pair
      - 3.2.8 Mean monthly sales and lag features per category major type
      - 3.2.9 Mean monthly sales and lag features per item-major-type pair
      - 3.2.10 Mean monthly sales and lag features per shop city
      - 3.2.11 Mean monthly sales and lag features per item-city pair
      - 3.2.12 Trend feature: price changes over the last six months
      - 3.2.13 Days in each month
      - 3.2.14 First and last sales
  - 4. Modeling
Predict Future Sales is a Kaggle competition. The data is a time series of daily sales records provided by 1C Company, one of the largest Russian software firms. It covers 34 consecutive months of shops, items, prices, and daily sales; the task is to predict each item's sales in each shop for the 35th month. The evaluation metric is RMSE. The baseline score is 1.16777; my score is 0.89896, currently ranked 178/3200.
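The metric can be reproduced locally as a sanity check. A minimal sketch (the clipping of true targets to [0, 20] follows the competition setup described later in this writeup; the function name is my own):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error; true monthly counts are clipped into [0, 20]
    y_true = np.clip(y_true, 0, 20)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# A true value of 30 is clipped to 20, so a prediction of 20 is a perfect hit
print(rmse(np.array([0.0, 5.0, 30.0]), np.array([0.0, 5.0, 20.0])))  # 0.0
```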
File | Description |
---|---|
sales_train.csv | Training set (daily history for date_block_num 0 to 33: sales and prices of each item in each shop) |
test.csv | Test set (shop and item pairs for date_block_num 34) |
items.csv | Item details (item_name, item_id, item_category_id) |
item_categories.csv | Item category details (item_category_name, item_category_id) |
shops.csv | Shop details (shop_name, shop_id) |
Data analysis
1. Basic data processing
1.1 Load the datasets
- The training set has six columns: date, month index, shop, item, price, and daily sales count.
- The test set has three columns: ID, shop, and item.
sales_train = pd.read_csv('input/sales_train.csv.gz')
test = pd.read_csv('input/test.csv.gz')
sales_train.head()
date | date_block_num | shop_id | item_id | item_price | item_cnt_day | |
---|---|---|---|---|---|---|
0 | 02.01.2013 | 0 | 59 | 22154 | 999.00 | 1.0 |
1 | 03.01.2013 | 0 | 25 | 2552 | 899.00 | 1.0 |
2 | 05.01.2013 | 0 | 25 | 2552 | 899.00 | -1.0 |
3 | 06.01.2013 | 0 | 25 | 2554 | 1709.05 | 1.0 |
4 | 15.01.2013 | 0 | 25 | 2555 | 1099.00 | 1.0 |
test.head()
ID | shop_id | item_id | |
---|---|---|---|
0 | 0 | 5 | 5037 |
1 | 1 | 5 | 5320 |
2 | 2 | 5 | 5233 |
3 | 3 | 5 | 5232 |
4 | 4 | 5 | 5268 |
- The training set has 21,807 unique items and 60 shops, with 2,935,849 rows of daily records in total.
- The test set has 5,100 items and 42 shops: exactly 5100 × 42 = 214,200 item-shop pairs.
print('how many lines in train set:', sales_train.shape)
print('unique items in train set:', sales_train['item_id'].nunique())
print('unique shops in train set:', sales_train['shop_id'].nunique())
print('how many lines in test set:', test.shape)
print('unique items in test set:', test['item_id'].nunique())
print('unique shops in test set:', test['shop_id'].nunique())
how many lines in train set: (2935849, 6)
unique items in train set: 21807
unique shops in train set: 60
how many lines in test set: (214200, 3)
unique items in test set: 5100
unique shops in test set: 42
Check the basic information and whether any values are missing (NaN).
The shape is (2935849, 6); sales_train has no missing values and no NaNs.
print('----------head---------')
print(sales_train.head(5))
print('------information------')
print(sales_train.info())
print('-----missing value-----')
print(sales_train.isnull().sum())
print('--------nan value------')
print(sales_train.isna().sum())
----------head---------
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0
------information------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB
None
-----missing value-----
date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64
--------nan value------
date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64
1.2 Baseline prediction
First, reproduce the baseline model. In this competition, the baseline simply uses the 34th month's sales as the 35th month's sales, i.e., treats the October 2015 results as the November 2015 prediction. It should score about 1.16777.
- The training data records item-shop-daily sales, while the target is item-shop-monthly sales, so groupby() and agg() must be used appropriately.
- Item-shop pairs absent from the training set are filled with zero, and the final predictions are clipped to the [0, 20] range.
sales_train_subset = sales_train[sales_train['date_block_num'] == 33]
sales_train_subset.head()

grouped = sales_train_subset[['shop_id','item_id','item_cnt_day']].groupby(['shop_id','item_id']).agg({'item_cnt_day':'sum'}).reset_index()
grouped = grouped.rename(columns={'item_cnt_day': 'item_cnt_month'})
grouped.head()
shop_id | item_id | item_cnt_month | |
---|---|---|---|
0 | 2 | 31 | 1.0 |
1 | 2 | 486 | 3.0 |
2 | 2 | 787 | 1.0 |
3 | 2 | 794 | 1.0 |
4 | 2 | 968 | 1.0 |
test = pd.read_csv('../readonly/final_project_data/test.csv.gz')
test = pd.merge(test, grouped, on=['shop_id','item_id'], how='left')
print(test.head())
test['item_cnt_month'] = test['item_cnt_month'].fillna(0).clip(0,20)
print(test.head())
test = test[['ID','item_cnt_month']]
submission = test.set_index('ID')
submission.to_csv('submission_baseline.csv')
   ID  shop_id  item_id  item_cnt_month
0   0        5     5037             NaN
1   1        5     5320             NaN
2   2        5     5233             1.0
3   3        5     5232             NaN
4   4        5     5268             NaN
   ID  shop_id  item_id  item_cnt_month
0   0        5     5037             0.0
1   1        5     5320             0.0
2   2        5     5233             1.0
3   3        5     5232             0.0
4   4        5     5268             0.0
1.3 Reducing memory usage
Heavy feature extraction follows, which consumes a lot of memory, and a large feature set also burdens model training. Many columns in the training set have a small dynamic range: date_block_num, shop_id and item_id fit in int16, while item_price and item_cnt_day fit in float32. This roughly halves memory usage without losing information: from 134.4+ MB down to 61.6+ MB.
def downcast_dtypes(df):
    cols_float64 = [c for c in df if df[c].dtype == 'float64']
    cols_int64_32 = [c for c in df if df[c].dtype in ['int64', 'int32']]
    df[cols_float64] = df[cols_float64].astype(np.float32)
    df[cols_int64_32] = df[cols_int64_32].astype(np.int16)
    return df

sales_train = downcast_dtypes(sales_train)
test = downcast_dtypes(test)
sales_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int16
shop_id           int16
item_id           int16
item_price        float32
item_cnt_day      float32
dtypes: float32(2), int16(3), object(1)
memory usage: 61.6+ MB
2. Data exploration
2.1 Training set analysis: sales_train
2.1.1 Sales per item
Use pivot_table to view each item's monthly sales. pivot_table() serves a similar purpose to groupby() but is more flexible, allowing more manipulation of the columns.
sales_by_item_id = sales_train.pivot_table(index=['item_id'], values=['item_cnt_day'],
                                           columns='date_block_num', aggfunc=np.sum,
                                           fill_value=0).reset_index()
sales_by_item_id.columns = sales_by_item_id.columns.droplevel().map(str)
sales_by_item_id = sales_by_item_id.reset_index(drop=True).rename_axis(None, axis=1)
sales_by_item_id.columns.values[0] = 'item_id'
sales_by_item_id.tail()
item_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21802 | 22165 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21803 | 22166 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 | 24 | 32 | 25 | 24 | 21 | 13 | 10 | 15 | 12 | 13 | 13 | 12 | 16 | 11 | 7 | 8 | 12 | 4 | 8 | 10 | 8 | 11 | 5 | 11 |
21804 | 22167 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 146 | 96 | 83 | 66 | 57 | 47 | 59 | 41 | 56 | 47 | 47 | 39 | 49 | 49 | 40 | 33 | 46 | 40 | 38 | 31 | 33 | 34 | 29 | 21 | 37 |
21805 | 22168 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21806 | 22169 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The table above confirms there are 21,807 items in total. Using sum() shows how the total sales of all items change over time.
sales_by_item_id.sum()[1:].plot(legend=True, label="Monthly sum")
<matplotlib.axes._subplots.AxesSubplot at 0x1e0806f0fd0>
Check how many items had no sales during the most recent six consecutive months, and how many of them appear in the test set.
- Of the 21,807 items in the training set, 12,391 had no sales in the last six months.
- Of the 5,100 items in the test set, 164 had no sales in the last six months of training; they appear 164 × 42 = 6,888 times.
- Tip: in the final predictions, we can confidently set the sales of these items to zero.
outdated_items = sales_by_item_id[sales_by_item_id.loc[:, '27':].sum(axis=1) == 0]
print('Outdated items:', len(outdated_items))
test = pd.read_csv('../readonly/final_project_data/test.csv.gz')
print('unique items in test set:', test['item_id'].nunique())
print('Outdated items in test set:', test[test['item_id'].isin(outdated_items['item_id'])]['item_id'].nunique())
Outdated items: 12391
unique items in test set: 5100
Outdated items in test set: 164
The training set contains 6 duplicated rows; we can drop or keep them, since they barely affect the result.
print("duplicated lines in sales_train is", len(sales_train[sales_train.duplicated()]))
duplicated lines in sales_train is 6
2.1.2 Sales per shop
There are 60 shops, located in 31 cities; the city can serve as a feature of the shop.
First, check which shops opened only recently and which have already closed, again using the last six months of data.
- shop_id = 36 is a new shop.
- shop_id = [0 1 8 11 13 17 23 29 30 32 33 40 43 54] can be considered closed.
- Tip: for the new shop, which has no history, month 33 can be used directly to predict month 34; for closed shops, predicted sales can simply be set to zero.
sales_by_shop_id = sales_train.pivot_table(index=['shop_id'], values=['item_cnt_day'],
                                           columns='date_block_num', aggfunc=np.sum,
                                           fill_value=0).reset_index()
sales_by_shop_id.columns = sales_by_shop_id.columns.droplevel().map(str)
sales_by_shop_id = sales_by_shop_id.reset_index(drop=True).rename_axis(None, axis=1)
sales_by_shop_id.columns.values[0] = 'shop_id'

for i in range(27, 34):
    print('Not exists in month', i,
          sales_by_shop_id['shop_id'][sales_by_shop_id.loc[:, '0':str(i)].sum(axis=1) == 0].unique())

for i in range(27, 34):
    print('Shop is outdated for month', i,
          sales_by_shop_id['shop_id'][sales_by_shop_id.loc[:, str(i):].sum(axis=1) == 0].unique())
Not exists in month 27 [36]
Not exists in month 28 [36]
Not exists in month 29 [36]
Not exists in month 30 [36]
Not exists in month 31 [36]
Not exists in month 32 [36]
Not exists in month 33 []
Shop is outdated for month 27 [ 0  1  8 11 13 17 23 30 32 40 43]
Shop is outdated for month 28 [ 0  1  8 11 13 17 23 30 32 33 40 43 54]
Shop is outdated for month 29 [ 0  1  8 11 13 17 23 29 30 32 33 40 43 54]
Shop is outdated for month 30 [ 0  1  8 11 13 17 23 29 30 32 33 40 43 54]
Shop is outdated for month 31 [ 0  1  8 11 13 17 23 29 30 32 33 40 43 54]
Shop is outdated for month 32 [ 0  1  8 11 13 17 23 29 30 32 33 40 43 54]
Shop is outdated for month 33 [ 0  1  8 11 13 17 23 27 29 30 32 33 40 43 51 54]
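The closed-shop tip above can be applied as a post-processing step on a submission. A minimal sketch, assuming `test` and `submission` frames with the layout used throughout this writeup (the tiny frames here are made-up stand-ins; the closed-shop list comes from the analysis above):

```python
import pandas as pd

# Hypothetical stand-ins for the real test set and a model's submission
test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [36, 0, 5], 'item_id': [10, 20, 30]})
submission = pd.DataFrame({'ID': [0, 1, 2], 'item_cnt_month': [3.0, 2.0, 1.0]})

# Shops considered closed over the last six months (from the analysis above)
closed_shops = [0, 1, 8, 11, 13, 17, 23, 29, 30, 32, 33, 40, 43, 54]

# Zero out predictions for rows belonging to closed shops
closed_ids = test.loc[test['shop_id'].isin(closed_shops), 'ID']
submission.loc[submission['ID'].isin(closed_ids), 'item_cnt_month'] = 0.0
print(submission['item_cnt_month'].tolist())  # [3.0, 0.0, 1.0]
```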
2.1.3 Sales per item category
To use item categories, first read the category information and merge it into sales_train.
item_categories = pd.read_csv('../readonly/final_project_data/items.csv')
item_categories = item_categories[['item_id','item_category_id']]
item_categories.head()
item_id | item_category_id | |
---|---|---|
0 | 0 | 40 |
1 | 1 | 76 |
2 | 2 | 40 |
3 | 3 | 40 |
4 | 4 | 40 |
sales_train_merge_cat = pd.merge(sales_train, item_categories, on='item_id', how='left')
sales_train_merge_cat.head()
date | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_category_id | |
---|---|---|---|---|---|---|---|
0 | 02.01.2013 | 0 | 59 | 22154 | 999.000000 | 1.0 | 37 |
1 | 03.01.2013 | 0 | 25 | 2552 | 899.000000 | 1.0 | 58 |
2 | 05.01.2013 | 0 | 25 | 2552 | 899.000000 | -1.0 | 58 |
3 | 06.01.2013 | 0 | 25 | 2554 | 1709.050049 | 1.0 | 58 |
4 | 15.01.2013 | 0 | 25 | 2555 | 1099.000000 | 1.0 | 56 |
2.1.4 Outliers in sales and price
Find the sales-volume and price outliers in sales_train, then remove them.
plt.figure(figsize=(10,4))
plt.xlim(-100, 3000)
sns.boxplot(x=sales_train['item_cnt_day'])
print('Sale volume outliers:', sales_train['item_cnt_day'][sales_train['item_cnt_day'] > 1001].unique())

plt.figure(figsize=(10,4))
plt.xlim(-10000, 320000)
sns.boxplot(x=sales_train['item_price'])
print('Sale price outliers:', sales_train['item_price'][sales_train['item_price'] > 300000].unique())
Sale volume outliers: [2169.]
Sale price outliers: [307980.]
sales_train = sales_train[sales_train['item_cnt_day'] < 1001]
sales_train = sales_train[sales_train['item_price'] < 300000]

plt.figure(figsize=(10,4))
plt.xlim(-100, 3000)
sns.boxplot(x=sales_train['item_cnt_day'])

plt.figure(figsize=(10,4))
plt.xlim(-10000, 320000)
sns.boxplot(x=sales_train['item_price'])
<matplotlib.axes._subplots.AxesSubplot at 0x1e080864a20>
One record has a negative price; fill it with the median price of that item in the same shop and month.
sales_train[sales_train['item_price']<0]
date | date_block_num | shop_id | item_id | item_price | item_cnt_day | |
---|---|---|---|---|---|---|
484683 | 15.05.2013 | 4 | 32 | 2973 | -1.0 | 1.0 |
median = sales_train[(sales_train['date_block_num'] == 4) & (sales_train['shop_id'] == 32)
                     & (sales_train['item_id'] == 2973) & (sales_train['item_price'] > 0)].item_price.median()
sales_train.loc[sales_train['item_price'] < 0, 'item_price'] = median
print(median)
1874.0
2.2 Test set analysis
The test set has 5,100 items and 42 shops: exactly 5100 × 42 = 214,200 item-shop pairs, which fall into three groups:
- 363 items never appear in the training set: 363 × 42 = 15,246 item-shop pairs with no data at all, about 7%.
- 87,550 item-shop pairs where the item appears in training but the pair itself does not, about 42%.
- 111,404 item-shop pairs that appear in full in the training set, about 51%.
test = pd.read_csv('../readonly/final_project_data/test.csv.gz')
good_sales = test.merge(sales_train, on=['item_id','shop_id'], how='left').dropna()
good_pairs = test[test['ID'].isin(good_sales['ID'])]
no_data_items = test[~(test['item_id'].isin(sales_train['item_id']))]

print('1. Number of good pairs:', len(good_pairs))
print('2. No Data Items:', len(no_data_items))
print('3. Only Item_id Info:', len(test) - len(no_data_items) - len(good_pairs))
1. Number of good pairs: 111404
2. No Data Items: 15246
3. Only Item_id Info: 87550
no_data_items.head()
ID | shop_id | item_id | |
---|---|---|---|
1 | 1 | 5 | 5320 |
4 | 4 | 5 | 5268 |
45 | 45 | 5 | 5826 |
64 | 64 | 5 | 3538 |
65 | 65 | 5 | 3571 |
2.3 Shop features
2.3.1 Cleaning shop information
The shop name already encodes several features and can be decomposed as:
City | Type | Name
shops = pd.read_csv('../readonly/final_project_data/shops.csv')
shops.head()
shop_name | shop_id | |
---|---|---|
0 | !Якутск Орджоникидзе, 56 фран | 0 |
1 | !Якутск ТЦ "Центральный" фран | 1 |
2 | Адыгея ТЦ "Мега" | 2 |
3 | Балашиха ТРК "Октябрь-Киномир" | 3 |
4 | Волжский ТЦ "Волга Молл" | 4 |
After inspection, the following shop names turn out to refer to the same shops, so their shop_ids can be merged:
* 11 => 10
* 1 => 58
* 0 => 57
* 40 => 39
Checking the test set, shop ids [0, 1, 11, 40] are all absent.
- shop_id = 0 and 1 existed for only two months, and shop_id = 57 and 58 look like their successors.
- shop_id = 11 exists only in date_block 25, and shop_id = 10 is missing data only in that month.
- shop_id = 40 exists only in date_block [14, 25], while shop_id = 39 is present continuously after date_block 14.
- shop_id = 46 has an extra space in the middle of its name, which would affect encoding and must be removed: Сергиев Посад ТЦ 「7Я」
- Judging by their names, shops 12 and 55 are both online stores, and their sales are highly correlated; I just haven't figured out how to use this information.
sales12 = np.array(sales_by_shop_id.loc[sales_by_shop_id['shop_id'] == 12].values)
sales12 = sales12[:, 1:].reshape(-1)
sales55 = np.array(sales_by_shop_id.loc[sales_by_shop_id['shop_id'] == 55].values)
sales55 = sales55[:, 1:].reshape(-1)
months = np.array(sales_by_shop_id.loc[sales_by_shop_id['shop_id'] == 12].columns[1:])
np.corrcoef(sales12, sales55)
array([[1. , 0.69647514], [0.69647514, 1. ]])
test.shop_id.sort_values().unique()
array([ 2, 3, 4, 5, 6, 7, 10, 12, 14, 15, 16, 18, 19, 21, 22, 24, 25, 26, 28, 31, 34, 35, 36, 37, 38, 39, 41, 42, 44, 45, 46, 47, 48, 49, 50, 52, 53, 55, 56, 57, 58, 59], dtype=int64)
sales_train.loc[sales_train['shop_id'] == 0, 'shop_id'] = 57
sales_train.loc[sales_train['shop_id'] == 1, 'shop_id'] = 58
sales_train.loc[sales_train['shop_id'] == 11, 'shop_id'] = 10
sales_train.loc[sales_train['shop_id'] == 40, 'shop_id'] = 39
2.3.2 Encoding shop information
shops['shop_name'] = shops['shop_name'].apply(lambda x: x.lower()).str.replace('[^\w\s]', '').str.replace('\d+', '').str.strip()
shops['shop_city'] = shops['shop_name'].str.partition(' ')[0]
shops['shop_type'] = shops['shop_name'].apply(lambda x: 'мтрц' if 'мтрц' in x else
                                              'трц' if 'трц' in x else
                                              'трк' if 'трк' in x else
                                              'тц' if 'тц' in x else
                                              'тк' if 'тк' in x else 'NO_DATA')
shops.head()
shop_name | shop_id | shop_city | shop_type | |
---|---|---|---|---|
0 | якутск орджоникидзе фран | 0 | якутск | NO_DATA |
1 | якутск тц центральный фран | 1 | якутск | тц |
2 | адыгея тц мега | 2 | адыгея | тц |
3 | балашиха трк октябрькиномир | 3 | балашиха | трк |
4 | волжский тц волга молл | 4 | волжский | тц |
shops['shop_city_code'] = LabelEncoder().fit_transform(shops['shop_city'])
shops['shop_type_code'] = LabelEncoder().fit_transform(shops['shop_type'])
shops.head()
shop_name | shop_id | shop_city | shop_type | shop_city_code | shop_type_code | |
---|---|---|---|---|---|---|
0 | якутск орджоникидзе фран | 0 | якутск | NO_DATA | 29 | 0 |
1 | якутск тц центральный фран | 1 | якутск | тц | 29 | 5 |
2 | адыгея тц мега | 2 | адыгея | тц | 0 | 5 |
3 | балашиха трк октябрькиномир | 3 | балашиха | трк | 1 | 3 |
4 | волжский тц волга молл | 4 | волжский | тц | 2 | 5 |
2.4 Item category features
The distance between item categories is hard to define, so one-hot encoding is more appropriate than ordinal codes.
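As a toy sketch of what one-hot encoding looks like here, `pd.get_dummies` creates one binary column per category value, so no artificial ordering is implied (the tiny frame below is illustrative; the code that follows in this section actually label-encodes the split type/subtype):

```python
import pandas as pd

# Toy category frame; the real item_categories.csv has 84 categories
cats = pd.DataFrame({'item_category_id': [0, 1, 2],
                     'type': ['PC', 'Аксессуары', 'Аксессуары']})

# One binary indicator column per distinct category type
one_hot = pd.get_dummies(cats['type'], prefix='type')
print(list(one_hot.columns))  # ['type_PC', 'type_Аксессуары']
```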
categories = pd.read_csv('../readonly/final_project_data/item_categories.csv')
lines1 = [26, 27, 28, 29, 30, 31]
lines2 = [81, 82]

# Insert a '-' separator into category names that lack one,
# so the later split('-') yields a consistent type/subtype
for index in lines1:
    category_name = categories.loc[index, 'item_category_name']
    category_name = category_name.replace('Игры', 'Игры -')
    categories.loc[index, 'item_category_name'] = category_name

for index in lines2:
    category_name = categories.loc[index, 'item_category_name']
    category_name = category_name.replace('Чистые', 'Чистые -')
    categories.loc[index, 'item_category_name'] = category_name

category_name = categories.loc[32, 'item_category_name']
category_name = category_name.replace('Карты оплаты', 'Карты оплаты -')
categories.loc[32, 'item_category_name'] = category_name
categories.head()
item_category_name | item_category_id | |
---|---|---|
0 | PC - Гарнитуры/Наушники | 0 |
1 | Аксессуары - PS2 | 1 |
2 | Аксессуары - PS3 | 2 |
3 | Аксессуары - PS4 | 3 |
4 | Аксессуары - PSP | 4 |
categories['split'] = categories['item_category_name'].str.split('-')
categories['type'] = categories['split'].map(lambda x: x[0].strip())
categories['subtype'] = categories['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
categories = categories[['item_category_id','type','subtype']]
categories.head()
item_category_id | type | subtype | |
---|---|---|---|
0 | 0 | PC | Гарнитуры/Наушники |
1 | 1 | Аксессуары | PS2 |
2 | 2 | Аксессуары | PS3 |
3 | 3 | Аксессуары | PS4 |
4 | 4 | Аксессуары | PSP |
categories['cat_type_code'] = LabelEncoder().fit_transform(categories['type'])
categories['cat_subtype_code'] = LabelEncoder().fit_transform(categories['subtype'])
categories.head()
item_category_id | type | subtype | cat_type_code | cat_subtype_code | |
---|---|---|---|---|---|
0 | 0 | PC | Гарнитуры/Наушники | 0 | 33 |
1 | 1 | Аксессуары | PS2 | 1 | 13 |
2 | 2 | Аксессуары | PS3 | 1 | 14 |
3 | 3 | Аксессуары | PS4 | 1 | 15 |
4 | 4 | Аксессуары | PSP | 1 | 17 |
3 Feature engineering
3.1 Aggregating monthly sales
First, aggregate the training data into monthly sales.
ts = time.time()
matrix = []
cols = ['date_block_num','shop_id','item_id']
for i in range(34):
    sales = sales_train[sales_train.date_block_num == i]
    matrix.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype='int16'))
matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix['date_block_num'] = matrix['date_block_num'].astype(np.int8)
matrix['shop_id'] = matrix['shop_id'].astype(np.int8)
matrix['item_id'] = matrix['item_id'].astype(np.int16)
matrix.sort_values(cols, inplace=True)
time.time() - ts

sales_train['revenue'] = sales_train['item_price'] * sales_train['item_cnt_day']

groupby = sales_train.groupby(['item_id','shop_id','date_block_num']).agg({'item_cnt_day':'sum'})
groupby.columns = ['item_cnt_month']
groupby.reset_index(inplace=True)
matrix = matrix.merge(groupby, on=['item_id','shop_id','date_block_num'], how='left')
matrix['item_cnt_month'] = matrix['item_cnt_month'].fillna(0).clip(0,20).astype(np.float16)
matrix.head()

test['date_block_num'] = 34
test['date_block_num'] = test['date_block_num'].astype(np.int8)
test['shop_id'] = test['shop_id'].astype(np.int8)
test['item_id'] = test['item_id'].astype(np.int16)
test.shape

cols = ['date_block_num','shop_id','item_id']
matrix = pd.concat([matrix, test[['item_id','shop_id','date_block_num']]],
                   ignore_index=True, sort=False, keys=cols)
matrix.fillna(0, inplace=True)  # month 34
print(matrix.head())
   date_block_num  shop_id  item_id  item_cnt_month
0               0        2       19             0.0
1               0        2       27             1.0
2               0        2       28             0.0
3               0        2       29             0.0
4               0        2       32             0.0
Make sure the matrix contains no NA or null values.
print(matrix['item_cnt_month'].isna().sum())
print(matrix['item_cnt_month'].isnull().sum())
0
0
3.2 Merging related information
Merge the shop and item-category information obtained above into the matrix.
ts = time.time()
matrix = matrix.merge(items[['item_id','item_category_id']], on=['item_id'], how='left')
matrix = matrix.merge(categories[['item_category_id','cat_type_code','cat_subtype_code']], on=['item_category_id'], how='left')
matrix = matrix.merge(shops[['shop_id','shop_city_code','shop_type_code']], on=['shop_id'], how='left')
matrix['shop_city_code'] = matrix['shop_city_code'].astype(np.int8)
matrix['shop_type_code'] = matrix['shop_type_code'].astype(np.int8)
matrix['item_category_id'] = matrix['item_category_id'].astype(np.int8)
matrix['cat_type_code'] = matrix['cat_type_code'].astype(np.int8)
matrix['cat_subtype_code'] = matrix['cat_subtype_code'].astype(np.int8)
time.time() - ts
4.71001935005188
matrix.head()
date_block_num | shop_id | item_id | item_cnt_month | item_category_id | cat_type_code | cat_subtype_code | shop_city_code | shop_type_code | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 19 | 0.0 | 40 | 7 | 6 | 0 | 5 |
1 | 0 | 2 | 27 | 1.0 | 19 | 5 | 14 | 0 | 5 |
2 | 0 | 2 | 28 | 0.0 | 30 | 5 | 12 | 0 | 5 |
3 | 0 | 2 | 29 | 0.0 | 23 | 5 | 20 | 0 | 5 |
4 | 0 | 2 | 32 | 0.0 | 40 | 7 | 6 | 0 | 5 |
matrix.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11056140 entries, 0 to 11056139
Data columns (total 9 columns):
date_block_num      int8
shop_id             int8
item_id             int16
item_cnt_month      float16
item_category_id    int8
cat_type_code       int8
cat_subtype_code    int8
shop_city_code      int8
shop_type_code      int8
dtypes: float16(1), int16(1), int8(7)
memory usage: 200.3 MB
3.2 Historical information
With the information merged, lag operations are now needed to generate historical features. For example, sales in months 0-33 can serve as lag-1 features for months 1-34. Following the scheme below, fifteen kinds of features are produced.
- Monthly sales history of each item-shop pair, lagged by [1,2,3,6,12] months. This is the most intuitive one.
- History of the mean monthly sales over all item-shop pairs, lagged by [1,2,3,6,12] months.
- History of each item's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each shop's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each item category's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each category-shop pair's mean monthly sales, lagged by [1,2,3,6,12] months.
These six lags are fairly direct, targeting items, shops, and categories. But sales trends may also relate to the category major type, the shop city, the item price, and the number of days in each month, so the following statistics and lags are needed as well. The model's feature importance output can guide selecting and tuning these features.
- History of each category major type's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each shop city's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each item-city pair's mean monthly sales, lagged by [1,2,3,6,12] months.
Beyond these combinations, the following features may also be useful:
- First-sale month of each item
- Last-sale month of each item
- First-sale month of each item-shop pair
- Last-sale month of each item-shop pair
- Price change of each item
- Number of days in each month
3.2.1 Lag operation to generate lagged features, with selectable lag months
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col + '_lag_' + str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df
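A toy run shows the effect (the numbers are made up): the lag-1 column for month t holds month t-1's value, with NaN where no history exists.

```python
import pandas as pd

def lag_feature(df, lags, col):
    # For each lag i, merge in the value of `col` from i months earlier
    tmp = df[['date_block_num', 'shop_id', 'item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num', 'shop_id', 'item_id', col + '_lag_' + str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num', 'shop_id', 'item_id'], how='left')
    return df

# One item in one shop over three months, with invented sales numbers
toy = pd.DataFrame({'date_block_num': [0, 1, 2],
                    'shop_id': [5, 5, 5],
                    'item_id': [100, 100, 100],
                    'item_cnt_month': [3.0, 7.0, 2.0]})
out = lag_feature(toy, [1], 'item_cnt_month')
print(out['item_cnt_month_lag_1'].tolist())  # [nan, 3.0, 7.0]
```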
3.2.2 Monthly sales history per item-shop pair
For each month's item-shop pairs, take the sales 1, 2, 3, 6 and 12 months earlier. These values are on the same scale as the prediction target, so no averaging is needed. Many values will be NaN because the corresponding history does not exist, for the reasons analyzed earlier.
ts = time.time()
matrix = lag_feature(matrix, [1,2,3,6,12], 'item_cnt_month')
time.time() - ts
27.62108063697815
3.2.3 Historical mean monthly sales over all item-shop pairs
Compute each month's sales over all item-shop pairs of that month, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num'], how='left')
matrix['date_avg_item_cnt'] = matrix['date_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_avg_item_cnt')
matrix.drop(['date_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
33.013164043426514
3.2.4 Mean monthly sales and lag features per item
Compute each item's monthly sales across all shops, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_cnt'] = matrix['date_item_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_avg_item_cnt')
matrix.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
3.2.5 Mean monthly sales and lag features per shop
Compute each shop's monthly sales across all its items, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['date_shop_avg_item_cnt'] = matrix['date_shop_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_shop_avg_item_cnt')
matrix.drop(['date_shop_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
3.2.6 Mean monthly sales and lag features per item category
Compute each item category's monthly sales across all its items and shops, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_category_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_cat_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_category_id'], how='left')
matrix['date_cat_avg_item_cnt'] = matrix['date_cat_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_cat_avg_item_cnt')
matrix.drop(['date_cat_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
3.2.7 Mean monthly sales and lag features per category-shop pair
Compute each category-shop pair's monthly sales across all of that pair's records, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_category_id','shop_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_cat_shop_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_category_id','shop_id'], how='left')
matrix['date_cat_shop_avg_item_cnt'] = matrix['date_cat_shop_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_cat_shop_avg_item_cnt')
matrix.drop(['date_cat_shop_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
15.178605556488037
matrix.info()
3.2.8 Mean monthly sales and lag features per category major type
Compute each category major type's monthly sales across all of its records, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'cat_type_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_type_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','cat_type_code'], how='left')
matrix['date_type_avg_item_cnt'] = matrix['date_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_type_avg_item_cnt')
matrix.drop(['date_type_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
14.34829592704773
3.2.9 Mean monthly sales and lag features per item-major-type pair
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_id','cat_type_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_type_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_id','cat_type_code'], how='left')
matrix['date_item_type_avg_item_cnt'] = matrix['date_item_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_type_avg_item_cnt')
matrix.drop(['date_item_type_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
14.34829592704773
3.2.10 Mean monthly sales and lag features per shop city
Compute each shop city's monthly sales across all of its records, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_city_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_city_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_city_code'], how='left')
matrix['date_city_avg_item_cnt'] = matrix['date_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_city_avg_item_cnt')
matrix.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
14.687093496322632
3.2.11 Mean monthly sales and lag features per item-city pair
Compute each item-city pair's monthly sales across all of its records, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num','item_id', 'shop_city_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_city_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id', 'shop_city_code'], how='left')
matrix['date_item_city_avg_item_cnt'] = matrix['date_item_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_city_avg_item_cnt')
matrix.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
14.687093496322632
3.2.12 Trend feature: price changes over the last six months
ts = time.time()
group = sales_train.groupby(['item_id']).agg({'item_price': ['mean']})
group.columns = ['item_avg_item_price']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['item_id'], how='left')
matrix['item_avg_item_price'] = matrix['item_avg_item_price'].astype(np.float16)

group = sales_train.groupby(['date_block_num','item_id']).agg({'item_price': ['mean']})
group.columns = ['date_item_avg_item_price']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_price'] = matrix['date_item_avg_item_price'].astype(np.float16)

lags = [1,2,3,4,5,6,12]
matrix = lag_feature(matrix, lags, 'date_item_avg_item_price')

for i in lags:
    matrix['delta_price_lag_' + str(i)] = \
        (matrix['date_item_avg_item_price_lag_' + str(i)] - matrix['item_avg_item_price']) / matrix['item_avg_item_price']

def select_trend(row):
    # Pick the most recent non-zero, non-NaN lagged price delta
    for i in lags:
        if row['delta_price_lag_' + str(i)]:
            return row['delta_price_lag_' + str(i)]
    return 0

matrix['delta_price_lag'] = matrix.apply(select_trend, axis=1)
matrix['delta_price_lag'] = matrix['delta_price_lag'].astype(np.float16)
matrix['delta_price_lag'].fillna(0, inplace=True)

# https://stackoverflow.com/questions/31828240/first-non-null-value-per-row-from-a-list-of-pandas-columns/31828559
# matrix['price_trend'] = matrix[['delta_price_lag_1','delta_price_lag_2','delta_price_lag_3']].bfill(axis=1).iloc[:, 0]
# Invalid dtype for backfill_2d [float16]

features_to_drop = ['item_avg_item_price', 'date_item_avg_item_price']
for i in lags:
    features_to_drop += ['date_item_avg_item_price_lag_' + str(i)]
    features_to_drop += ['delta_price_lag_' + str(i)]

matrix.drop(features_to_drop, axis=1, inplace=True)
time.time() - ts
601.2605240345001
3.2.13 Days in each month
matrix['month'] = matrix['date_block_num'] % 12
days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
matrix['days'] = matrix['month'].map(days).astype(np.int8)
3.2.14 First and last sales
Months since the last sale for each shop/item pair and for each item. I use a programmatic approach:
Create a hash table keyed by {shop_id, item_id} with date_block_num as the value. Iterate over the rows from the top. For each row, if {row.shop_id, row.item_id} is not yet in the table, add it with value row.date_block_num; if the key is already present, compute the difference between the cached value and row.date_block_num.
ts = time.time()
cache = {}
matrix['item_shop_last_sale'] = -1
matrix['item_shop_last_sale'] = matrix['item_shop_last_sale'].astype(np.int8)
for idx, row in matrix.iterrows():
    key = str(row.item_id) + ' ' + str(row.shop_id)
    if key not in cache:
        if row.item_cnt_month != 0:
            cache[key] = row.date_block_num
    else:
        last_date_block_num = cache[key]
        matrix.at[idx, 'item_shop_last_sale'] = row.date_block_num - last_date_block_num
        cache[key] = row.date_block_num
time.time() - ts
Months since the first sale for each shop/item pair and for each item.
ts = time.time()
matrix['item_shop_first_sale'] = matrix['date_block_num'] - matrix.groupby(['item_id','shop_id'])['date_block_num'].transform('min')
matrix['item_first_sale'] = matrix['date_block_num'] - matrix.groupby('item_id')['date_block_num'].transform('min')
time.time() - ts
2.4333603382110596
Since 12-month lags are used, many rows necessarily contain NA values. Drop the first 11 months from the matrix, then fill the remaining NA lag values with 0.
ts = time.time()
matrix = matrix[matrix.date_block_num > 11]
time.time() - ts
1.0133898258209229
ts = time.time()
def fill_na(df):
    for col in df.columns:
        if ('_lag_' in col) & (df[col].isnull().any()):
            if ('item_cnt' in col):
                df[col].fillna(0, inplace=True)
    return df
matrix = fill_na(matrix)
time.time() - ts
4. Modeling
4.1 LightGBM model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import gc
import pickle
from itertools import product
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)
#import sklearn.model_selection.KFold as KFold
def plot_features(booster, figsize):
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    return plot_importance(booster=booster, ax=ax)
data = pd.read_pickle('data_simple.pkl')
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)
del data
gc.collect();
import lightgbm as lgb

ts = time.time()
train_data = lgb.Dataset(data=X_train, label=Y_train)
valid_data = lgb.Dataset(data=X_valid, label=Y_valid)
time.time() - ts

params = {"objective": "regression",
          "metric": "rmse",
          'n_estimators': 10000,
          'early_stopping_rounds': 50,
          "num_leaves": 200,
          "learning_rate": 0.01,
          "bagging_fraction": 0.9,
          "feature_fraction": 0.3,
          "bagging_seed": 0}

lgb_model = lgb.train(params, train_data, valid_sets=[train_data, valid_data], verbose_eval=1000)
Y_test = lgb_model.predict(X_test).clip(0, 20)
Some of the predicted values should be forced to zero.
4.2 Post-processing
4.2.1 Outdated items
Set the predicted sales to zero for items that had no sales in the last 12 months.
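A minimal sketch of this step, assuming an outdated-item id set computed the same way as in section 2.1.1 but over a 12-month window (all names and numbers below are illustrative stand-ins):

```python
import pandas as pd

# Hypothetical stand-ins: ids of items with no sales in the last 12 months,
# the test frame, and the model's raw submission
outdated_ids = {100, 200}
test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [5, 5, 5], 'item_id': [100, 300, 200]})
submission = pd.DataFrame({'ID': [0, 1, 2], 'item_cnt_month': [1.5, 2.0, 0.5]})

# Zero out predictions for items that have been out of sale for a year
dead_ids = test.loc[test['item_id'].isin(outdated_ids), 'ID']
submission.loc[submission['ID'].isin(dead_ids), 'item_cnt_month'] = 0.0
print(submission['item_cnt_month'].tolist())  # [0.0, 2.0, 0.0]
```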