Predict Future Sales
- Data analysis
  - 1. Basic data processing
  - 2. Data exploration
  - 3 Feature engineering
    - 3.1 Aggregating monthly sales
    - 3.2 Merging related information
    - 3.2 Historical information
      - 3.2.1 Lag operation to generate lagged features, with selectable lag months
      - 3.2.2 Monthly sales history per item-shop pair
      - 3.2.3 Historical mean monthly sales over all item-shop pairs
      - 3.2.4 Mean monthly sales and lag features per item
      - 3.2.5 Mean monthly sales and lag features per shop
      - 3.2.6 Mean monthly sales and lag features per item category
      - 3.2.7 Mean monthly sales and lag features per category-shop pair
      - 3.2.8 Mean monthly sales and lag features per category major type
      - 3.2.9 Mean monthly sales and lag features per item-major-type pair
      - 3.2.10 Mean monthly sales and lag features per shop city
      - 3.2.11 Mean monthly sales and lag features per item-city pair
      - 3.2.12 Trend feature: price changes over the last six months
      - 3.2.13 Days in each month
      - 3.2.14 First and last sales
  - 4. Modeling
Predict Future Sales is a Kaggle competition. The data is a time series of daily sales records provided by 1C Company, one of the largest Russian software firms. It covers 34 consecutive months of shops, items, prices, and daily sales; the task is to predict each item's sales in each shop for the 35th month. The evaluation metric is RMSE. The baseline score is 1.16777; my score is 0.89896, currently ranked 178/3200.
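The metric can be reproduced locally as a sanity check. A minimal sketch (the clipping of true targets to [0, 20] follows the competition setup described later in this writeup; the function name is my own):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error; true monthly counts are clipped into [0, 20]
    y_true = np.clip(y_true, 0, 20)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# A true value of 30 is clipped to 20, so a prediction of 20 is a perfect hit
print(rmse(np.array([0.0, 5.0, 30.0]), np.array([0.0, 5.0, 20.0])))  # 0.0
```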
File | Description |
---|---|
sales_train.csv | Training set (daily history for date_block_num 0 to 33: sales and prices of each item in each shop) |
test.csv | Test set (shop and item pairs for date_block_num 34) |
items.csv | Item details (item_name, item_id, item_category_id) |
item_categories.csv | Item category details (item_category_name, item_category_id) |
shops.csv | Shop details (shop_name, shop_id) |
Data analysis
1. Basic data processing
1.1 Load the datasets
- The training set has six columns: date, month index, shop, item, price, and daily sales count.
- The test set has three columns: ID, shop, and item.
sales_train = pd.read_csv('input/sales_train.csv.gz')
test = pd.read_csv('input/test.csv.gz')
sales_train.head()
date | date_block_num | shop_id | item_id | item_price | item_cnt_day | |
---|---|---|---|---|---|---|
0 | 02.01.2013 | 0 | 59 | 22154 | 999.00 | 1.0 |
1 | 03.01.2013 | 0 | 25 | 2552 | 899.00 | 1.0 |
2 | 05.01.2013 | 0 | 25 | 2552 | 899.00 | -1.0 |
3 | 06.01.2013 | 0 | 25 | 2554 | 1709.05 | 1.0 |
4 | 15.01.2013 | 0 | 25 | 2555 | 1099.00 | 1.0 |
test.head()
ID | shop_id | item_id | |
---|---|---|---|
0 | 0 | 5 | 5037 |
1 | 1 | 5 | 5320 |
2 | 2 | 5 | 5233 |
3 | 3 | 5 | 5232 |
4 | 4 | 5 | 5268 |
- The training set has 21,807 unique items and 60 shops, with 2,935,849 rows of daily records in total.
- The test set has 5,100 items and 42 shops: exactly 5100 × 42 = 214,200 item-shop pairs.
print('how many lines in train set:', sales_train.shape)
print('unique items in train set:', sales_train['item_id'].nunique())
print('unique shops in train set:', sales_train['shop_id'].nunique())
print('how many lines in test set:', test.shape)
print('unique items in test set:', test['item_id'].nunique())
print('unique shops in test set:', test['shop_id'].nunique())
how many lines in train set: (2935849, 6)
unique items in train set: 21807
unique shops in train set: 60
how many lines in test set: (214200, 3)
unique items in test set: 5100
unique shops in test set: 42
Check the basic information and whether any values are missing (NaN).
The shape is (2935849, 6); sales_train has no missing values and no NaNs.
print('----------head---------')
print(sales_train.head(5))
print('------information------')
print(sales_train.info())
print('-----missing value-----')
print(sales_train.isnull().sum())
print('--------nan value------')
print(sales_train.isna().sum())
----------head---------
         date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0  02.01.2013               0       59    22154      999.00           1.0
1  03.01.2013               0       25     2552      899.00           1.0
2  05.01.2013               0       25     2552      899.00          -1.0
3  06.01.2013               0       25     2554     1709.05           1.0
4  15.01.2013               0       25     2555     1099.00           1.0
------information------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB
None
-----missing value-----
date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64
--------nan value------
date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64
1.2 Baseline prediction
First, reproduce the baseline model. In this competition, the baseline simply uses the 34th month's sales as the 35th month's sales, i.e., treats the October 2015 results as the November 2015 prediction. It should score about 1.16777.
- The training data records item-shop-daily sales, while the target is item-shop-monthly sales, so groupby() and agg() must be used appropriately.
- Item-shop pairs absent from the training set are filled with zero, and the final predictions are clipped to the [0, 20] range.
sales_train_subset = sales_train[sales_train['date_block_num'] == 33]
sales_train_subset.head()

grouped = sales_train_subset[['shop_id','item_id','item_cnt_day']].groupby(['shop_id','item_id']).agg({'item_cnt_day':'sum'}).reset_index()
grouped = grouped.rename(columns={'item_cnt_day': 'item_cnt_month'})
grouped.head()
shop_id | item_id | item_cnt_month | |
---|---|---|---|
0 | 2 | 31 | 1.0 |
1 | 2 | 486 | 3.0 |
2 | 2 | 787 | 1.0 |
3 | 2 | 794 | 1.0 |
4 | 2 | 968 | 1.0 |
test = pd.read_csv('../readonly/final_project_data/test.csv.gz')
test = pd.merge(test, grouped, on=['shop_id','item_id'], how='left')
print(test.head())
test['item_cnt_month'] = test['item_cnt_month'].fillna(0).clip(0,20)
print(test.head())
test = test[['ID','item_cnt_month']]
submission = test.set_index('ID')
submission.to_csv('submission_baseline.csv')
   ID  shop_id  item_id  item_cnt_month
0   0        5     5037             NaN
1   1        5     5320             NaN
2   2        5     5233             1.0
3   3        5     5232             NaN
4   4        5     5268             NaN
   ID  shop_id  item_id  item_cnt_month
0   0        5     5037             0.0
1   1        5     5320             0.0
2   2        5     5233             1.0
3   3        5     5232             0.0
4   4        5     5268             0.0
1.3 Reducing memory usage
Heavy feature extraction follows, which consumes a lot of memory, and a large feature set also burdens model training. Many columns in the training set have a small dynamic range: date_block_num, shop_id and item_id fit in int16, while item_price and item_cnt_day fit in float32. This roughly halves memory usage without losing information: from 134.4+ MB down to 61.6+ MB.
def downcast_dtypes(df):
    cols_float64 = [c for c in df if df[c].dtype == 'float64']
    cols_int64_32 = [c for c in df if df[c].dtype in ['int64', 'int32']]
    df[cols_float64] = df[cols_float64].astype(np.float32)
    df[cols_int64_32] = df[cols_int64_32].astype(np.int16)
    return df

sales_train = downcast_dtypes(sales_train)
test = downcast_dtypes(test)
sales_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int16
shop_id           int16
item_id           int16
item_price        float32
item_cnt_day      float32
dtypes: float32(2), int16(3), object(1)
memory usage: 61.6+ MB
2. Data exploration
2.1 Training set analysis: sales_train
2.1.1 Sales per item
Use pivot_table to view each item's monthly sales. pivot_table() serves a similar purpose to groupby() but is more flexible, allowing more manipulation of the columns.
sales_by_item_id = sales_train.pivot_table(index=['item_id'], values=['item_cnt_day'],
                                           columns='date_block_num', aggfunc=np.sum,
                                           fill_value=0).reset_index()
sales_by_item_id.columns = sales_by_item_id.columns.droplevel().map(str)
sales_by_item_id = sales_by_item_id.reset_index(drop=True).rename_axis(None, axis=1)
sales_by_item_id.columns.values[0] = 'item_id'
sales_by_item_id.tail()
item_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21802 | 22165 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21803 | 22166 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 | 24 | 32 | 25 | 24 | 21 | 13 | 10 | 15 | 12 | 13 | 13 | 12 | 16 | 11 | 7 | 8 | 12 | 4 | 8 | 10 | 8 | 11 | 5 | 11 |
21804 | 22167 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 146 | 96 | 83 | 66 | 57 | 47 | 59 | 41 | 56 | 47 | 47 | 39 | 49 | 49 | 40 | 33 | 46 | 40 | 38 | 31 | 33 | 34 | 29 | 21 | 37 |
21805 | 22168 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21806 | 22169 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The table above confirms there are 21,807 items in total. Using sum() shows how the total sales of all items change over time.
sales_by_item_id.sum()[1:].plot(legend=True, label="Monthly sum")
<matplotlib.axes._subplots.AxesSubplot at 0x1e0806f0fd0>
Check how many items had no sales during the most recent six consecutive months, and how many of them appear in the test set.
- Of the 21,807 items in the training set, 12,391 had no sales in the last six months.
- Of the 5,100 items in the test set, 164 had no sales in the last six months of training; they appear 164 × 42 = 6,888 times.
- Tip: in the final predictions, we can confidently set the sales of these items to zero.
outdated_items = sales_by_item_id[sales_by_item_id.loc[:, '27':].sum(axis=1) == 0]
print('Outdated items:', len(outdated_items))
test = pd.read_csv('../readonly/final_project_data/test.csv.gz')
print('unique items in test set:', test['item_id'].nunique())
print('Outdated items in test set:', test[test['item_id'].isin(outdated_items['item_id'])]['item_id'].nunique())
Outdated items: 12391
unique items in test set: 5100
Outdated items in test set: 164
The training set contains 6 duplicated rows; we can drop or keep them, since they barely affect the result.
print("duplicated lines in sales_train is", len(sales_train[sales_train.duplicated()]))
duplicated lines in sales_train is 6
2.1.2 Sales per shop
There are 60 shops, located in 31 cities; the city can serve as a feature of the shop.
First, check which shops opened only recently and which have already closed, again using the last six months of data.
- shop_id = 36 is a new shop.
- shop_id = [0 1 8 11 13 17 23 29 30 32 33 40 43 54] can be considered closed.
- Tip: for the new shop, which has no history, month 33 can be used directly to predict month 34; for closed shops, predicted sales can simply be set to zero.
sales_by_shop_id = sales_train.pivot_table(index=['shop_id'], values=['item_cnt_day'],
                                           columns='date_block_num', aggfunc=np.sum,
                                           fill_value=0).reset_index()
sales_by_shop_id.columns = sales_by_shop_id.columns.droplevel().map(str)
sales_by_shop_id = sales_by_shop_id.reset_index(drop=True).rename_axis(None, axis=1)
sales_by_shop_id.columns.values[0] = 'shop_id'

for i in range(27, 34):
    print('Not exists in month', i,
          sales_by_shop_id['shop_id'][sales_by_shop_id.loc[:, '0':str(i)].sum(axis=1) == 0].unique())

for i in range(27, 34):
    print('Shop is outdated for month', i,
          sales_by_shop_id['shop_id'][sales_by_shop_id.loc[:, str(i):].sum(axis=1) == 0].unique())
Not exists in month 27 [36]
Not exists in month 28 [36]
Not exists in month 29 [36]
Not exists in month 30 [36]
Not exists in month 31 [36]
Not exists in month 32 [36]
Not exists in month 33 []
Shop is outdated for month 27 [ 0  1  8 11 13 17 23 30 32 40 43]
Shop is outdated for month 28 [ 0  1  8 11 13 17 23 30 32 33 40 43 54]
Shop is outdated for month 29 [ 0  1  8 11 13 17 23 29 30 32 33 40 43 54]
Shop is outdated for month 30 [ 0  1  8 11 13 17 23 29 30 32 33 40 43 54]
Shop is outdated for month 31 [ 0  1  8 11 13 17 23 29 30 32 33 40 43 54]
Shop is outdated for month 32 [ 0  1  8 11 13 17 23 29 30 32 33 40 43 54]
Shop is outdated for month 33 [ 0  1  8 11 13 17 23 27 29 30 32 33 40 43 51 54]
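The closed-shop tip above can be applied as a post-processing step on a submission. A minimal sketch, assuming `test` and `submission` frames with the layout used throughout this writeup (the tiny frames here are made-up stand-ins; the closed-shop list comes from the analysis above):

```python
import pandas as pd

# Hypothetical stand-ins for the real test set and a model's submission
test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [36, 0, 5], 'item_id': [10, 20, 30]})
submission = pd.DataFrame({'ID': [0, 1, 2], 'item_cnt_month': [3.0, 2.0, 1.0]})

# Shops considered closed over the last six months (from the analysis above)
closed_shops = [0, 1, 8, 11, 13, 17, 23, 29, 30, 32, 33, 40, 43, 54]

# Zero out predictions for rows belonging to closed shops
closed_ids = test.loc[test['shop_id'].isin(closed_shops), 'ID']
submission.loc[submission['ID'].isin(closed_ids), 'item_cnt_month'] = 0.0
print(submission['item_cnt_month'].tolist())  # [3.0, 0.0, 1.0]
```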
2.1.3 Sales per item category
To use item categories, first read the category information and merge it into sales_train.
item_categories = pd.read_csv('../readonly/final_project_data/items.csv')
item_categories = item_categories[['item_id','item_category_id']]
item_categories.head()
item_id | item_category_id | |
---|---|---|
0 | 0 | 40 |
1 | 1 | 76 |
2 | 2 | 40 |
3 | 3 | 40 |
4 | 4 | 40 |
sales_train_merge_cat = pd.merge(sales_train, item_categories, on='item_id', how='left')
sales_train_merge_cat.head()
date | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_category_id | |
---|---|---|---|---|---|---|---|
0 | 02.01.2013 | 0 | 59 | 22154 | 999.000000 | 1.0 | 37 |
1 | 03.01.2013 | 0 | 25 | 2552 | 899.000000 | 1.0 | 58 |
2 | 05.01.2013 | 0 | 25 | 2552 | 899.000000 | -1.0 | 58 |
3 | 06.01.2013 | 0 | 25 | 2554 | 1709.050049 | 1.0 | 58 |
4 | 15.01.2013 | 0 | 25 | 2555 | 1099.000000 | 1.0 | 56 |
2.1.4 Outliers in sales and price
Find the sales-volume and price outliers in sales_train, then remove them.
plt.figure(figsize=(10,4))
plt.xlim(-100, 3000)
sns.boxplot(x=sales_train['item_cnt_day'])
print('Sale volume outliers:', sales_train['item_cnt_day'][sales_train['item_cnt_day'] > 1001].unique())

plt.figure(figsize=(10,4))
plt.xlim(-10000, 320000)
sns.boxplot(x=sales_train['item_price'])
print('Sale price outliers:', sales_train['item_price'][sales_train['item_price'] > 300000].unique())
Sale volume outliers: [2169.]
Sale price outliers: [307980.]
sales_train = sales_train[sales_train['item_cnt_day'] < 1001]
sales_train = sales_train[sales_train['item_price'] < 300000]

plt.figure(figsize=(10,4))
plt.xlim(-100, 3000)
sns.boxplot(x=sales_train['item_cnt_day'])

plt.figure(figsize=(10,4))
plt.xlim(-10000, 320000)
sns.boxplot(x=sales_train['item_price'])
<matplotlib.axes._subplots.AxesSubplot at 0x1e080864a20>
One record has a negative price; fill it with the median price of that item in the same shop and month.
sales_train[sales_train['item_price']<0]
date | date_block_num | shop_id | item_id | item_price | item_cnt_day | |
---|---|---|---|---|---|---|
484683 | 15.05.2013 | 4 | 32 | 2973 | -1.0 | 1.0 |
median = sales_train[(sales_train['date_block_num'] == 4) & (sales_train['shop_id'] == 32)
                     & (sales_train['item_id'] == 2973) & (sales_train['item_price'] > 0)].item_price.median()
sales_train.loc[sales_train['item_price'] < 0, 'item_price'] = median
print(median)
1874.0
2.2 Test set analysis
The test set has 5,100 items and 42 shops: exactly 5100 × 42 = 214,200 item-shop pairs, which fall into three groups:
- 363 items never appear in the training set: 363 × 42 = 15,246 item-shop pairs with no data at all, about 7%.
- 87,550 item-shop pairs where the item appears in training but the pair itself does not, about 42%.
- 111,404 item-shop pairs that appear in full in the training set, about 51%.
test = pd.read_csv('../readonly/final_project_data/test.csv.gz')
good_sales = test.merge(sales_train, on=['item_id','shop_id'], how='left').dropna()
good_pairs = test[test['ID'].isin(good_sales['ID'])]
no_data_items = test[~(test['item_id'].isin(sales_train['item_id']))]

print('1. Number of good pairs:', len(good_pairs))
print('2. No Data Items:', len(no_data_items))
print('3. Only Item_id Info:', len(test) - len(no_data_items) - len(good_pairs))
1. Number of good pairs: 111404
2. No Data Items: 15246
3. Only Item_id Info: 87550
no_data_items.head()
ID | shop_id | item_id | |
---|---|---|---|
1 | 1 | 5 | 5320 |
4 | 4 | 5 | 5268 |
45 | 45 | 5 | 5826 |
64 | 64 | 5 | 3538 |
65 | 65 | 5 | 3571 |
2.3 Shop features
2.3.1 Cleaning shop information
The shop name already encodes several features and can be decomposed as:
City | Type | Name
shops = pd.read_csv('../readonly/final_project_data/shops.csv')
shops.head()
shop_name | shop_id | |
---|---|---|
0 | !Якутск Орджоникидзе, 56 фран | 0 |
1 | !Якутск ТЦ "Центральный" фран | 1 |
2 | Адыгея ТЦ "Мега" | 2 |
3 | Балашиха ТРК "Октябрь-Киномир" | 3 |
4 | Волжский ТЦ "Волга Молл" | 4 |
After inspection, the following shop names turn out to refer to the same shops, so their shop_ids can be merged:
* 11 => 10
* 1 => 58
* 0 => 57
* 40 => 39
Checking the test set, shop ids [0, 1, 11, 40] are all absent.
- shop_id = 0 and 1 existed for only two months, and shop_id = 57 and 58 look like their successors.
- shop_id = 11 exists only in date_block 25, and shop_id = 10 is missing data only in that month.
- shop_id = 40 exists only in date_block [14, 25], while shop_id = 39 is present continuously after date_block 14.
- shop_id = 46 has an extra space in the middle of its name, which would affect encoding and must be removed: Сергиев Посад ТЦ 「7Я」
- Judging by their names, shops 12 and 55 are both online stores, and their sales are highly correlated; I just haven't figured out how to use this information.
sales12 = np.array(sales_by_shop_id.loc[sales_by_shop_id['shop_id'] == 12].values)
sales12 = sales12[:, 1:].reshape(-1)
sales55 = np.array(sales_by_shop_id.loc[sales_by_shop_id['shop_id'] == 55].values)
sales55 = sales55[:, 1:].reshape(-1)
months = np.array(sales_by_shop_id.loc[sales_by_shop_id['shop_id'] == 12].columns[1:])
np.corrcoef(sales12, sales55)
array([[1. , 0.69647514], [0.69647514, 1. ]])
test.shop_id.sort_values().unique()
array([ 2, 3, 4, 5, 6, 7, 10, 12, 14, 15, 16, 18, 19, 21, 22, 24, 25, 26, 28, 31, 34, 35, 36, 37, 38, 39, 41, 42, 44, 45, 46, 47, 48, 49, 50, 52, 53, 55, 56, 57, 58, 59], dtype=int64)
sales_train.loc[sales_train['shop_id'] == 0, 'shop_id'] = 57
sales_train.loc[sales_train['shop_id'] == 1, 'shop_id'] = 58
sales_train.loc[sales_train['shop_id'] == 11, 'shop_id'] = 10
sales_train.loc[sales_train['shop_id'] == 40, 'shop_id'] = 39
2.3.2 Encoding shop information
shops['shop_name'] = shops['shop_name'].apply(lambda x: x.lower()).str.replace('[^\w\s]', '').str.replace('\d+', '').str.strip()
shops['shop_city'] = shops['shop_name'].str.partition(' ')[0]
shops['shop_type'] = shops['shop_name'].apply(lambda x: 'мтрц' if 'мтрц' in x else
                                              'трц' if 'трц' in x else
                                              'трк' if 'трк' in x else
                                              'тц' if 'тц' in x else
                                              'тк' if 'тк' in x else 'NO_DATA')
shops.head()
shop_name | shop_id | shop_city | shop_type | |
---|---|---|---|---|
0 | якутск орджоникидзе фран | 0 | якутск | NO_DATA |
1 | якутск тц центральный фран | 1 | якутск | тц |
2 | адыгея тц мега | 2 | адыгея | тц |
3 | балашиха трк октябрькиномир | 3 | балашиха | трк |
4 | волжский тц волга молл | 4 | волжский | тц |
shops['shop_city_code'] = LabelEncoder().fit_transform(shops['shop_city'])
shops['shop_type_code'] = LabelEncoder().fit_transform(shops['shop_type'])
shops.head()
shop_name | shop_id | shop_city | shop_type | shop_city_code | shop_type_code | |
---|---|---|---|---|---|---|
0 | якутск орджоникидзе фран | 0 | якутск | NO_DATA | 29 | 0 |
1 | якутск тц центральный фран | 1 | якутск | тц | 29 | 5 |
2 | адыгея тц мега | 2 | адыгея | тц | 0 | 5 |
3 | балашиха трк октябрькиномир | 3 | балашиха | трк | 1 | 3 |
4 | волжский тц волга молл | 4 | волжский | тц | 2 | 5 |
2.4 Item category features
The distance between item categories is hard to define, so one-hot encoding is more appropriate than ordinal codes.
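As a toy sketch of what one-hot encoding looks like here, `pd.get_dummies` creates one binary column per category value, so no artificial ordering is implied (the tiny frame below is illustrative; the code that follows in this section actually label-encodes the split type/subtype):

```python
import pandas as pd

# Toy category frame; the real item_categories.csv has 84 categories
cats = pd.DataFrame({'item_category_id': [0, 1, 2],
                     'type': ['PC', 'Аксессуары', 'Аксессуары']})

# One binary indicator column per distinct category type
one_hot = pd.get_dummies(cats['type'], prefix='type')
print(list(one_hot.columns))  # ['type_PC', 'type_Аксессуары']
```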
categories = pd.read_csv('../readonly/final_project_data/item_categories.csv')
lines1 = [26, 27, 28, 29, 30, 31]
lines2 = [81, 82]

# Insert a '-' separator into category names that lack one,
# so the later split('-') yields a consistent type/subtype
for index in lines1:
    category_name = categories.loc[index, 'item_category_name']
    category_name = category_name.replace('Игры', 'Игры -')
    categories.loc[index, 'item_category_name'] = category_name

for index in lines2:
    category_name = categories.loc[index, 'item_category_name']
    category_name = category_name.replace('Чистые', 'Чистые -')
    categories.loc[index, 'item_category_name'] = category_name

category_name = categories.loc[32, 'item_category_name']
category_name = category_name.replace('Карты оплаты', 'Карты оплаты -')
categories.loc[32, 'item_category_name'] = category_name
categories.head()
item_category_name | item_category_id | |
---|---|---|
0 | PC - Гарнитуры/Наушники | 0 |
1 | Аксессуары - PS2 | 1 |
2 | Аксессуары - PS3 | 2 |
3 | Аксессуары - PS4 | 3 |
4 | Аксессуары - PSP | 4 |
categories['split'] = categories['item_category_name'].str.split('-')
categories['type'] = categories['split'].map(lambda x: x[0].strip())
categories['subtype'] = categories['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
categories = categories[['item_category_id','type','subtype']]
categories.head()
item_category_id | type | subtype | |
---|---|---|---|
0 | 0 | PC | Гарнитуры/Наушники |
1 | 1 | Аксессуары | PS2 |
2 | 2 | Аксессуары | PS3 |
3 | 3 | Аксессуары | PS4 |
4 | 4 | Аксессуары | PSP |
categories['cat_type_code'] = LabelEncoder().fit_transform(categories['type'])
categories['cat_subtype_code'] = LabelEncoder().fit_transform(categories['subtype'])
categories.head()
item_category_id | type | subtype | cat_type_code | cat_subtype_code | |
---|---|---|---|---|---|
0 | 0 | PC | Гарнитуры/Наушники | 0 | 33 |
1 | 1 | Аксессуары | PS2 | 1 | 13 |
2 | 2 | Аксессуары | PS3 | 1 | 14 |
3 | 3 | Аксессуары | PS4 | 1 | 15 |
4 | 4 | Аксессуары | PSP | 1 | 17 |
3 Feature engineering
3.1 Aggregating monthly sales
First, aggregate the training data into monthly sales.
ts = time.time()
matrix = []
cols = ['date_block_num','shop_id','item_id']
for i in range(34):
    sales = sales_train[sales_train.date_block_num == i]
    matrix.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype='int16'))
matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix['date_block_num'] = matrix['date_block_num'].astype(np.int8)
matrix['shop_id'] = matrix['shop_id'].astype(np.int8)
matrix['item_id'] = matrix['item_id'].astype(np.int16)
matrix.sort_values(cols, inplace=True)
time.time() - ts

sales_train['revenue'] = sales_train['item_price'] * sales_train['item_cnt_day']

groupby = sales_train.groupby(['item_id','shop_id','date_block_num']).agg({'item_cnt_day':'sum'})
groupby.columns = ['item_cnt_month']
groupby.reset_index(inplace=True)
matrix = matrix.merge(groupby, on=['item_id','shop_id','date_block_num'], how='left')
matrix['item_cnt_month'] = matrix['item_cnt_month'].fillna(0).clip(0,20).astype(np.float16)
matrix.head()

test['date_block_num'] = 34
test['date_block_num'] = test['date_block_num'].astype(np.int8)
test['shop_id'] = test['shop_id'].astype(np.int8)
test['item_id'] = test['item_id'].astype(np.int16)
test.shape

cols = ['date_block_num','shop_id','item_id']
matrix = pd.concat([matrix, test[['item_id','shop_id','date_block_num']]],
                   ignore_index=True, sort=False, keys=cols)
matrix.fillna(0, inplace=True)  # month 34
print(matrix.head())
   date_block_num  shop_id  item_id  item_cnt_month
0               0        2       19             0.0
1               0        2       27             1.0
2               0        2       28             0.0
3               0        2       29             0.0
4               0        2       32             0.0
Make sure the matrix contains no NA or null values.
print(matrix['item_cnt_month'].isna().sum())
print(matrix['item_cnt_month'].isnull().sum())
0
0
3.2 Merging related information
Merge the shop and item-category information obtained above into the matrix.
ts = time.time()
matrix = matrix.merge(items[['item_id','item_category_id']], on=['item_id'], how='left')
matrix = matrix.merge(categories[['item_category_id','cat_type_code','cat_subtype_code']], on=['item_category_id'], how='left')
matrix = matrix.merge(shops[['shop_id','shop_city_code','shop_type_code']], on=['shop_id'], how='left')
matrix['shop_city_code'] = matrix['shop_city_code'].astype(np.int8)
matrix['shop_type_code'] = matrix['shop_type_code'].astype(np.int8)
matrix['item_category_id'] = matrix['item_category_id'].astype(np.int8)
matrix['cat_type_code'] = matrix['cat_type_code'].astype(np.int8)
matrix['cat_subtype_code'] = matrix['cat_subtype_code'].astype(np.int8)
time.time() - ts
4.71001935005188
matrix.head()
date_block_num | shop_id | item_id | item_cnt_month | item_category_id | cat_type_code | cat_subtype_code | shop_city_code | shop_type_code | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2 | 19 | 0.0 | 40 | 7 | 6 | 0 | 5 |
1 | 0 | 2 | 27 | 1.0 | 19 | 5 | 14 | 0 | 5 |
2 | 0 | 2 | 28 | 0.0 | 30 | 5 | 12 | 0 | 5 |
3 | 0 | 2 | 29 | 0.0 | 23 | 5 | 20 | 0 | 5 |
4 | 0 | 2 | 32 | 0.0 | 40 | 7 | 6 | 0 | 5 |
matrix.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11056140 entries, 0 to 11056139
Data columns (total 9 columns):
date_block_num      int8
shop_id             int8
item_id             int16
item_cnt_month      float16
item_category_id    int8
cat_type_code       int8
cat_subtype_code    int8
shop_city_code      int8
shop_type_code      int8
dtypes: float16(1), int16(1), int8(7)
memory usage: 200.3 MB
3.2 Historical information
With the information merged, lag operations are now needed to generate historical features. For example, sales in months 0-33 can serve as lag-1 features for months 1-34. Following the scheme below, fifteen kinds of features are produced.
- Monthly sales history of each item-shop pair, lagged by [1,2,3,6,12] months. This is the most intuitive one.
- History of the mean monthly sales over all item-shop pairs, lagged by [1,2,3,6,12] months.
- History of each item's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each shop's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each item category's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each category-shop pair's mean monthly sales, lagged by [1,2,3,6,12] months.
These six lags are fairly direct, targeting items, shops, and categories. But sales trends may also relate to the category major type, the shop city, the item price, and the number of days in each month, so the following statistics and lags are needed as well. The model's feature importance output can guide selecting and tuning these features.
- History of each category major type's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each shop city's mean monthly sales, lagged by [1,2,3,6,12] months.
- History of each item-city pair's mean monthly sales, lagged by [1,2,3,6,12] months.
Beyond these combinations, the following features may also be useful:
- First-sale month of each item
- Last-sale month of each item
- First-sale month of each item-shop pair
- Last-sale month of each item-shop pair
- Price change of each item
- Number of days in each month
3.2.1 Lag operation to generate lagged features, with selectable lag months
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col + '_lag_' + str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df
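A toy run shows the effect (the numbers are made up): the lag-1 column for month t holds month t-1's value, with NaN where no history exists.

```python
import pandas as pd

def lag_feature(df, lags, col):
    # For each lag i, merge in the value of `col` from i months earlier
    tmp = df[['date_block_num', 'shop_id', 'item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num', 'shop_id', 'item_id', col + '_lag_' + str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num', 'shop_id', 'item_id'], how='left')
    return df

# One item in one shop over three months, with invented sales numbers
toy = pd.DataFrame({'date_block_num': [0, 1, 2],
                    'shop_id': [5, 5, 5],
                    'item_id': [100, 100, 100],
                    'item_cnt_month': [3.0, 7.0, 2.0]})
out = lag_feature(toy, [1], 'item_cnt_month')
print(out['item_cnt_month_lag_1'].tolist())  # [nan, 3.0, 7.0]
```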
3.2.2 Monthly sales history per item-shop pair
For each month's item-shop pairs, take the sales 1, 2, 3, 6 and 12 months earlier. These values are on the same scale as the prediction target, so no averaging is needed. Many values will be NaN because the corresponding history does not exist, for the reasons analyzed earlier.
ts = time.time()
matrix = lag_feature(matrix, [1,2,3,6,12], 'item_cnt_month')
time.time() - ts
27.62108063697815
3.2.3 Historical mean monthly sales over all item-shop pairs
Compute each month's sales over all item-shop pairs of that month, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num'], how='left')
matrix['date_avg_item_cnt'] = matrix['date_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_avg_item_cnt')
matrix.drop(['date_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
33.013164043426514
3.2.4 Mean monthly sales and lag features per item
Compute each item's monthly sales across all shops, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_cnt'] = matrix['date_item_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_avg_item_cnt')
matrix.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
3.2.5 Mean monthly sales and lag features per shop
Compute each shop's monthly sales across all its items, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['date_shop_avg_item_cnt'] = matrix['date_shop_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_shop_avg_item_cnt')
matrix.drop(['date_shop_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
3.2.6 Mean monthly sales and lag features per item category
Compute each item category's monthly sales across all its items and shops, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_category_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_cat_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_category_id'], how='left')
matrix['date_cat_avg_item_cnt'] = matrix['date_cat_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_cat_avg_item_cnt')
matrix.drop(['date_cat_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
3.2.7 Mean monthly sales and lag features per category-shop pair
Compute each category-shop pair's monthly sales across all of that pair's records, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_category_id','shop_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_cat_shop_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_category_id','shop_id'], how='left')
matrix['date_cat_shop_avg_item_cnt'] = matrix['date_cat_shop_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_cat_shop_avg_item_cnt')
matrix.drop(['date_cat_shop_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
15.178605556488037
matrix.info()
3.2.8 Mean monthly sales and lag features per category major type
Compute each category major type's monthly sales across all of its records, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'cat_type_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_type_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','cat_type_code'], how='left')
matrix['date_type_avg_item_cnt'] = matrix['date_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_type_avg_item_cnt')
matrix.drop(['date_type_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
14.34829592704773
3.2.9 Mean monthly sales and lag features per item-major-type pair
ts = time.time()
group = matrix.groupby(['date_block_num', 'item_id','cat_type_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_type_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_id','cat_type_code'], how='left')
matrix['date_item_type_avg_item_cnt'] = matrix['date_item_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_type_avg_item_cnt')
matrix.drop(['date_item_type_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
14.34829592704773
3.2.10 Mean monthly sales and lag features per shop city
Compute each shop city's monthly sales across all of its records, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_city_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_city_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_city_code'], how='left')
matrix['date_city_avg_item_cnt'] = matrix['date_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_city_avg_item_cnt')
matrix.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
14.687093496322632
3.2.11 Mean monthly sales and lag features per item-city pair
Compute each item-city pair's monthly sales across all of its records, so a mean is needed. Then lag as before.
ts = time.time()
group = matrix.groupby(['date_block_num','item_id', 'shop_city_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_item_city_avg_item_cnt']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id', 'shop_city_code'], how='left')
matrix['date_item_city_avg_item_cnt'] = matrix['date_item_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_city_avg_item_cnt')
matrix.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
14.687093496322632
3.2.12 Trend feature: price changes over the last six months
ts = time.time()
group = sales_train.groupby(['item_id']).agg({'item_price': ['mean']})
group.columns = ['item_avg_item_price']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['item_id'], how='left')
matrix['item_avg_item_price'] = matrix['item_avg_item_price'].astype(np.float16)

group = sales_train.groupby(['date_block_num','item_id']).agg({'item_price': ['mean']})
group.columns = ['date_item_avg_item_price']
group.reset_index(inplace=True)
matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_price'] = matrix['date_item_avg_item_price'].astype(np.float16)

lags = [1,2,3,4,5,6,12]
matrix = lag_feature(matrix, lags, 'date_item_avg_item_price')

for i in lags:
    matrix['delta_price_lag_' + str(i)] = \
        (matrix['date_item_avg_item_price_lag_' + str(i)] - matrix['item_avg_item_price']) / matrix['item_avg_item_price']

def select_trend(row):
    # Pick the most recent non-zero, non-NaN lagged price delta
    for i in lags:
        if row['delta_price_lag_' + str(i)]:
            return row['delta_price_lag_' + str(i)]
    return 0

matrix['delta_price_lag'] = matrix.apply(select_trend, axis=1)
matrix['delta_price_lag'] = matrix['delta_price_lag'].astype(np.float16)
matrix['delta_price_lag'].fillna(0, inplace=True)

# https://stackoverflow.com/questions/31828240/first-non-null-value-per-row-from-a-list-of-pandas-columns/31828559
# matrix['price_trend'] = matrix[['delta_price_lag_1','delta_price_lag_2','delta_price_lag_3']].bfill(axis=1).iloc[:, 0]
# Invalid dtype for backfill_2d [float16]

features_to_drop = ['item_avg_item_price', 'date_item_avg_item_price']
for i in lags:
    features_to_drop += ['date_item_avg_item_price_lag_' + str(i)]
    features_to_drop += ['delta_price_lag_' + str(i)]

matrix.drop(features_to_drop, axis=1, inplace=True)
time.time() - ts
601.2605240345001
3.2.13 Days in each month
matrix['month'] = matrix['date_block_num'] % 12
days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
matrix['days'] = matrix['month'].map(days).astype(np.int8)
3.2.14 First and last sales
Months since the last sale for each shop/item pair and for each item. I use a programmatic approach:
Create a hash table keyed by {shop_id, item_id} with date_block_num as the value. Iterate over the rows from the top. For each row, if {row.shop_id, row.item_id} is not yet in the table, add it with value row.date_block_num; if the key is already present, compute the difference between the cached value and row.date_block_num.
ts = time.time()
cache = {}
matrix['item_shop_last_sale'] = -1
matrix['item_shop_last_sale'] = matrix['item_shop_last_sale'].astype(np.int8)
for idx, row in matrix.iterrows():
    key = str(row.item_id) + ' ' + str(row.shop_id)
    if key not in cache:
        if row.item_cnt_month != 0:
            cache[key] = row.date_block_num
    else:
        last_date_block_num = cache[key]
        matrix.at[idx, 'item_shop_last_sale'] = row.date_block_num - last_date_block_num
        cache[key] = row.date_block_num
time.time() - ts
Months since the first sale for each shop/item pair and for each item.
ts = time.time()
matrix['item_shop_first_sale'] = matrix['date_block_num'] - matrix.groupby(['item_id','shop_id'])['date_block_num'].transform('min')
matrix['item_first_sale'] = matrix['date_block_num'] - matrix.groupby('item_id')['date_block_num'].transform('min')
time.time() - ts
2.4333603382110596
Since 12-month lags are used, many rows necessarily contain NA values. Drop the first 11 months from the matrix, then fill the remaining NA lag values with 0.
ts = time.time()
matrix = matrix[matrix.date_block_num > 11]
time.time() - ts
1.0133898258209229
ts = time.time()
def fill_na(df):
    for col in df.columns:
        if ('_lag_' in col) & (df[col].isnull().any()):
            if ('item_cnt' in col):
                df[col].fillna(0, inplace=True)
    return df
matrix = fill_na(matrix)
time.time() - ts
4. Modeling
4.1 LightGBM model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import gc
import pickle
from itertools import product
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)
#import sklearn.model_selection.KFold as KFold
def plot_features(booster, figsize):
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    return plot_importance(booster=booster, ax=ax)
data = pd.read_pickle('data_simple.pkl')
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)
del data
gc.collect();
import lightgbm as lgb

ts = time.time()
train_data = lgb.Dataset(data=X_train, label=Y_train)
valid_data = lgb.Dataset(data=X_valid, label=Y_valid)
time.time() - ts

params = {"objective": "regression",
          "metric": "rmse",
          'n_estimators': 10000,
          'early_stopping_rounds': 50,
          "num_leaves": 200,
          "learning_rate": 0.01,
          "bagging_fraction": 0.9,
          "feature_fraction": 0.3,
          "bagging_seed": 0}

lgb_model = lgb.train(params, train_data, valid_sets=[train_data, valid_data], verbose_eval=1000)
Y_test = lgb_model.predict(X_test).clip(0, 20)
Some of the predicted values should be forced to zero.
4.2 Post-processing
4.2.1 Outdated items
Set the predicted sales to zero for items that had no sales in the last 12 months.
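A minimal sketch of this step, assuming an outdated-item id set computed the same way as in section 2.1.1 but over a 12-month window (all names and numbers below are illustrative stand-ins):

```python
import pandas as pd

# Hypothetical stand-ins: ids of items with no sales in the last 12 months,
# the test frame, and the model's raw submission
outdated_ids = {100, 200}
test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [5, 5, 5], 'item_id': [100, 300, 200]})
submission = pd.DataFrame({'ID': [0, 1, 2], 'item_cnt_month': [1.5, 2.0, 0.5]})

# Zero out predictions for items that have been out of sale for a year
dead_ids = test.loc[test['item_id'].isin(outdated_ids), 'ID']
submission.loc[submission['ID'].isin(dead_ids), 'item_cnt_month'] = 0.0
print(submission['item_cnt_month'].tolist())  # [0.0, 2.0, 0.0]
```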