Deep Interest Network (DIN) was proposed by Alimama's precision targeting and base algorithms team in June 2017. It targets CTR prediction in the e-commerce industry, with an emphasis on fully exploiting the information contained in users' historical behavior data.
This series walks through the paper and its source code, and along the way reviews some related deep-learning concepts and their TensorFlow implementations. This is the second article; it analyzes how the training data is generated and how user sequences are modeled.
Let us first give an overview of what DIN does. The user behavior sequence is the core input; around this sequence, a whole set of user, item, and item-attribute data is also needed. DIN therefore requires the data files produced by the pipeline described below.
The prepare_data.sh script drives the data processing and generates the various data files. Its content is as follows.
```bash
export PATH="~/anaconda4/bin:$PATH"
wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books.json.gz
wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Books.json.gz
gunzip reviews_Books.json.gz
gunzip meta_Books.json.gz
python script/process_data.py meta_Books.json reviews_Books_5.json
python script/local_aggretor.py
python script/split_by_user.py
python script/generate_voc.py
```
From this script we can see what each processing step does:

- process_data.py parses the two raw JSON files into item-info and reviews-info, then builds the positive/negative samples (jointed-new) and tags each user's last two samples (jointed-new-split-info);
- local_aggretor.py generates the cumulative user behavior sequences (local_train / local_test);
- split_by_user.py splits the sequences into training and validation sets (local_train_splitByUser / local_test_splitByUser);
- generate_voc.py builds the three vocabularies uid_voc.pkl, mid_voc.pkl, and cat_voc.pkl.
The paper uses the Amazon Product Data dataset, which consists of two files: reviews_Electronics_5.json and meta_Electronics.json.
reviews_Electronics_5.json holds the user reviews, and meta_Electronics.json holds the item metadata. Their fields are as follows:
| reviews_Electronics field | Description |
| --- | --- |
| reviewerID | reviewer ID, e.g. [A2SUAM1J3GNN3B] |
| asin | product ID, e.g. [0000013714] |
| reviewerName | reviewer nickname |
| helpful | helpfulness rating of the review, e.g. 2/3 |
| reviewText | review text |
| overall | rating of the product |
| summary | review summary |
| unixReviewTime | time of the review (Unix time) |
| reviewTime | time of the review (raw) |
| meta_Electronics field | Description |
| --- | --- |
| asin | product ID |
| title | name of the product |
| imUrl | URL of the product image |
| categories | list of categories the product belongs to |
| description | product description |
The user behavior in this dataset is rich: every user and every item has more than 5 reviews. The features include goods_id, cate_id, and the user's reviewed goods_id_list and cate_id_list. All behaviors of a user form a sequence (b1, b2, ..., bk, ..., bn).

The task is to predict the (k+1)-th reviewed item from the first k reviewed items. The training dataset is generated with k = 1, 2, ..., n-2 for each user.
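To make this k-prefix idea concrete, here is a tiny illustrative sketch; the item ids are made up, and the real pipeline works on the files produced by the scripts below.

```python
# Minimal sketch of "predict the (k+1)-th reviewed item from the first k".
# The ids are hypothetical placeholders, not real data.
behaviors = ["b1", "b2", "b3", "b4", "b5"]    # one user's reviews, ordered by time

samples = []
for k in range(1, len(behaviors) - 1):        # k = 1 .. n-2
    hist = behaviors[:k]                      # the first k reviewed items
    samples.append((hist, behaviors[k]))      # predict the (k+1)-th one

for hist, nxt in samples:
    print(hist, "->", nxt)
# ['b1'] -> b2, ['b1', 'b2'] -> b3, ['b1', 'b2', 'b3'] -> b4
```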
Processing these two JSON files produces two metadata files, item-info and reviews-info:
```bash
python script/process_data.py meta_Books.json reviews_Books_5.json
```
The code is a simple field extraction:
```python
def process_meta(file):
    fi = open(file, "r")
    fo = open("item-info", "w")
    for line in fi:
        obj = eval(line)
        cat = obj["categories"][0][-1]
        print>>fo, obj["asin"] + "\t" + cat

def process_reviews(file):
    fi = open(file, "r")
    user_map = {}
    fo = open("reviews-info", "w")
    for line in fi:
        obj = eval(line)
        userID = obj["reviewerID"]
        itemID = obj["asin"]
        rating = obj["overall"]
        time = obj["unixReviewTime"]
        print>>fo, userID + "\t" + itemID + "\t" + str(rating) + "\t" + str(time)
```
The generated files look like this.
reviews-info format: userID, itemID, rating, timestamp
```
A2S166WSCFIFP5   000100039X   5.0   1071100800
A1BM81XB4QHOA3   000100039X   5.0   1390003200
A1MOSTXNIO5MPJ   000100039X   5.0   1317081600
A2XQ5LZHTD4AFT   000100039X   5.0   1033948800
A3V1MKC2BVWY48   000100039X   5.0   1390780800
A12387207U8U24   000100039X   5.0   1206662400
```
item-info format: product ID, product category. It is essentially a mapping table: product 0001048791 maps to the category Books.
```
0001048791   Books
0001048775   Books
0001048236   Books
0000401048   Books
0001019880   Books
0001048813   Books
```
Negative samples are constructed by the manual_join function. The logic is:

- sort each user's behavior records by timestamp;
- for every behavior, draw a random item id from item_list as the negative item (draw again if it happens to equal the positive item), and write it out with label 0;
- write the original behavior out with label 1 as the positive sample, attaching its category from meta_map (or default_cat if unknown).

For example, the inputs look like this.
The item list (item_list) is:
```
item_list =
 0000000 = {str} '000100039X'
 0000001 = {str} '000100039X'
 0000002 = {str} '000100039X'
 0000003 = {str} '000100039X'
 0000004 = {str} '000100039X'
 0000005 = {str} '000100039X'
```
The users' behavior sequences (user_map) look like this:
```
user_map = {dict: 603668}
 'A1BM81XB4QHOA3' = {list: 6}
  0 = {tuple: 2} ('A1BM81XB4QHOA3\t000100039X\t5.0\t1390003200', 1390003200.0)
  1 = {tuple: 2} ('A1BM81XB4QHOA3\t0060838582\t5.0\t1190851200', 1190851200.0)
  2 = {tuple: 2} ('A1BM81XB4QHOA3\t0743241924\t4.0\t1143158400', 1143158400.0)
  3 = {tuple: 2} ('A1BM81XB4QHOA3\t0848732391\t2.0\t1300060800', 1300060800.0)
  4 = {tuple: 2} ('A1BM81XB4QHOA3\t0884271781\t5.0\t1403308800', 1403308800.0)
  5 = {tuple: 2} ('A1BM81XB4QHOA3\t1885535104\t5.0\t1390003200', 1390003200.0)
 'A1MOSTXNIO5MPJ' = {list: 9}
  0 = {tuple: 2} ('A1MOSTXNIO5MPJ\t000100039X\t5.0\t1317081600', 1317081600.0)
  1 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0143142941\t4.0\t1211760000', 1211760000.0)
  2 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0310325366\t1.0\t1259712000', 1259712000.0)
  3 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0393062112\t5.0\t1179964800', 1179964800.0)
  4 = {tuple: 2} ('A1MOSTXNIO5MPJ\t0872203247\t3.0\t1211760000', 1211760000.0)
  5 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1455504181\t5.0\t1398297600', 1398297600.0)
  6 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1596917024\t5.0\t1369440000', 1369440000.0)
  7 = {tuple: 2} ('A1MOSTXNIO5MPJ\t1600610676\t5.0\t1276128000', 1276128000.0)
  8 = {tuple: 2} ('A1MOSTXNIO5MPJ\t9380340141\t3.0\t1369440000', 1369440000.0)
```
The code is as follows:
```python
import random  # needed for negative sampling (imported at module level in process_data.py)

def manual_join():
    f_rev = open("reviews-info", "r")
    user_map = {}
    item_list = []
    for line in f_rev:
        line = line.strip()
        items = line.split("\t")
        if items[0] not in user_map:
            user_map[items[0]] = []
        user_map[items[0]].append(("\t".join(items), float(items[-1])))
        item_list.append(items[1])
    f_meta = open("item-info", "r")
    meta_map = {}
    for line in f_meta:
        arr = line.strip().split("\t")
        if arr[0] not in meta_map:
            meta_map[arr[0]] = arr[1]
            arr = line.strip().split("\t")
    fo = open("jointed-new", "w")
    for key in user_map:
        sorted_user_bh = sorted(user_map[key], key=lambda x: x[1])  # sort the user's behaviors by time
        for line, t in sorted_user_bh:  # for each user behavior
            items = line.split("\t")
            asin = items[1]
            j = 0
            while True:
                asin_neg_index = random.randint(0, len(item_list) - 1)  # random index into item_list
                asin_neg = item_list[asin_neg_index]                    # random item id
                if asin_neg == asin:  # if it happens to be the positive item, draw again
                    continue
                items[1] = asin_neg
                # write the negative sample
                print>>fo, "0" + "\t" + "\t".join(items) + "\t" + meta_map[asin_neg]
                j += 1
                if j == 1:  # negative sampling frequency
                    break
            # write the positive sample
            if asin in meta_map:
                print>>fo, "1" + "\t" + line + "\t" + meta_map[asin]
            else:
                print>>fo, "1" + "\t" + line + "\t" + "default_cat"
```
An excerpt of the resulting file, which consists of alternating negative and positive samples:
```
0  A10000012B7CGYKOMPQ4L  140004314X  5.0  1355616000  Books
1  A10000012B7CGYKOMPQ4L  000100039X  5.0  1355616000  Books
0  A10000012B7CGYKOMPQ4L  1477817603  5.0  1355616000  Books
1  A10000012B7CGYKOMPQ4L  0393967972  5.0  1355616000  Books
0  A10000012B7CGYKOMPQ4L  0778329933  5.0  1355616000  Books
1  A10000012B7CGYKOMPQ4L  0446691437  5.0  1355616000  Books
0  A10000012B7CGYKOMPQ4L  B006P5CH1O  4.0  1355616000  Collections & Anthologies
```
The next step (split_test) tags the samples so that the last two samples on each user's timeline can be identified.

As a result, in the jointed-new-split-info file the two records per user prefixed with 20190119 are that user's last two behavior records: exactly one positive and one negative sample, and also the last two in time.
The code:
```python
def split_test():
    fi = open("jointed-new", "r")
    fo = open("jointed-new-split-info", "w")
    user_count = {}
    for line in fi:
        line = line.strip()
        user = line.split("\t")[1]
        if user not in user_count:
            user_count[user] = 0
        user_count[user] += 1
    fi.seek(0)
    i = 0
    last_user = "A26ZDKC53OP6JD"
    for line in fi:
        line = line.strip()
        user = line.split("\t")[1]
        if user == last_user:
            if i < user_count[user] - 2:  # 1 + negative samples
                print >> fo, "20180118" + "\t" + line
            else:
                print >> fo, "20190119" + "\t" + line
        else:
            last_user = user
            i = 0
            if i < user_count[user] - 2:
                print >> fo, "20180118" + "\t" + line
            else:
                print >> fo, "20190119" + "\t" + line
        i += 1
```
The resulting file:
```
20180118  0  A10000012B7CGYKOMPQ4L  140004314X  5.0  1355616000  Books
20180118  1  A10000012B7CGYKOMPQ4L  000100039X  5.0  1355616000  Books
20180118  0  A10000012B7CGYKOMPQ4L  1477817603  5.0  1355616000  Books
20180118  1  A10000012B7CGYKOMPQ4L  0393967972  5.0  1355616000  Books
20180118  0  A10000012B7CGYKOMPQ4L  0778329933  5.0  1355616000  Books
20180118  1  A10000012B7CGYKOMPQ4L  0446691437  5.0  1355616000  Books
20180118  0  A10000012B7CGYKOMPQ4L  B006P5CH1O  4.0  1355616000  Collections & Anthologies
20180118  1  A10000012B7CGYKOMPQ4L  0486227081  4.0  1355616000  Books
20180118  0  A10000012B7CGYKOMPQ4L  B00HWI5OP4  4.0  1355616000  United States
20180118  1  A10000012B7CGYKOMPQ4L  048622709X  4.0  1355616000  Books
20180118  0  A10000012B7CGYKOMPQ4L  1475005873  4.0  1355616000  Books
20180118  1  A10000012B7CGYKOMPQ4L  0486274268  4.0  1355616000  Books
20180118  0  A10000012B7CGYKOMPQ4L  098960571X  4.0  1355616000  Books
20180118  1  A10000012B7CGYKOMPQ4L  0486404730  4.0  1355616000  Books
20190119  0  A10000012B7CGYKOMPQ4L  1495459225  4.0  1355616000  Books
20190119  1  A10000012B7CGYKOMPQ4L  0830604790  4.0  1355616000  Books
```
local_aggretor.py is used to generate the user behavior sequences.
For example, for the user with reviewerID=0 whose pos_list is [13179, 17993, 28326, 29247, 62275], the generated training samples have the form (reviewerID, hist, pos_item, 1) and (reviewerID, hist, neg_item, 0).
Note that hist contains neither pos_item nor neg_item; hist only contains the items clicked before pos_item. Because DIN uses an attention-like mechanism in which only the historical behaviors attend to the candidate, only items clicked before pos_item are meaningful as history.
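A minimal sketch of the (reviewerID, hist, item, label) pairs described above, with made-up ids; in the real pipeline the negative items are drawn at random by manual_join.

```python
# Hypothetical illustration only: the ids below are invented, and hist contains
# strictly the items clicked BEFORE the current positive item, never the item itself.
reviewer_id = 0
pos_list = [13179, 17993, 28326, 29247, 62275]          # clicked items, in time order
neg_list = [88888, 77777, 66666, 55555, 44444]          # randomly sampled negatives (made up)

samples = []
for i in range(1, len(pos_list)):
    hist = pos_list[:i]                                  # clicks strictly earlier than pos_list[i]
    samples.append((reviewer_id, hist, pos_list[i], 1))  # positive sample
    samples.append((reviewer_id, hist, neg_list[i], 0))  # negative sample with the same hist

for s in samples:
    print(s)
```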
The specific logic is:

- Because the records prefixed with 20190119 are the last two in each user's timeline, the final local_test file contains two cumulative behavior sequences per user, i.e. sequences that cover the user's whole history from start to finish.
- The file naming is a bit confusing here, because the data actually used for training and testing later comes from the local_test file.
- The two sequences are one positive and one negative sample; they differ only in the last item id and the click label, everything else is identical.
The code is as follows:
```python
fin = open("jointed-new-split-info", "r")
ftrain = open("local_train", "w")
ftest = open("local_test", "w")

last_user = "0"
common_fea = ""
line_idx = 0
for line in fin:
    items = line.strip().split("\t")
    ds = items[0]
    clk = int(items[1])
    user = items[2]
    movie_id = items[3]
    dt = items[5]
    cat1 = items[6]

    if ds == "20180118":
        fo = ftrain
    else:
        fo = ftest
    if user != last_user:
        movie_id_list = []
        cate1_list = []
    else:
        history_clk_num = len(movie_id_list)
        cat_str = ""
        mid_str = ""
        # NOTE: in the repository the separator appended after each id is a
        # non-printable control character; it does not render in this listing.
        for c1 in cate1_list:
            cat_str += c1 + ""
        for mid in movie_id_list:
            mid_str += mid + ""
        if len(cat_str) > 0:
            cat_str = cat_str[:-1]
        if len(mid_str) > 0:
            mid_str = mid_str[:-1]
        if history_clk_num >= 1:  # 8 is the average length of user behavior
            print >> fo, items[1] + "\t" + user + "\t" + movie_id + "\t" + cat1 + "\t" + mid_str + "\t" + cat_str
    last_user = user
    if clk:  # if this is a click (positive) record
        movie_id_list.append(movie_id)  # accumulate the clicked movie ids
        cate1_list.append(cat1)         # accumulate the corresponding category ids
    line_idx += 1
```
An excerpt of the resulting local_test data:
```
0  A10000012B7CGYKOMPQ4L  1495459225  Books  000100039X039396797204466914370486227081048622709X04862742680486404730  BooksBooksBooksBooksBooksBooksBooks
1  A10000012B7CGYKOMPQ4L  0830604790  Books  000100039X039396797204466914370486227081048622709X04862742680486404730  BooksBooksBooksBooksBooksBooksBooks
```
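One thing that is easy to miss in the excerpt above: the history and category columns are joined with a non-printable separator character, which is why the ids appear glued together. A small parsing sketch, assuming the separator is the control character `\x02` (an assumption; check the repo's scripts for the actual character):

```python
SEP = "\x02"  # assumption: the non-printable separator written by local_aggretor.py

line = "1\tA10000012B7CGYKOMPQ4L\t0830604790\tBooks\t" \
       + SEP.join(["000100039X", "0393967972", "0446691437"]) + "\t" \
       + SEP.join(["Books", "Books", "Books"])

label, uid, item, cat, mid_str, cat_str = line.split("\t")
print(label, uid, item, cat)
print(mid_str.split(SEP))   # ['000100039X', '0393967972', '0446691437']
print(cat_str.split(SEP))   # ['Books', 'Books', 'Books']
```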
split_by_user.py splits the data into training and validation sets: for each positive/negative pair it draws a random integer between 1 and 10, and if the draw is exactly 2 the pair goes to the validation set (roughly a 90/10 split).
fi = open("local_test", "r") ftrain = open("local_train_splitByUser", "w") ftest = open("local_test_splitByUser", "w") while True: rand_int = random.randint(1, 10) noclk_line = fi.readline().strip() clk_line = fi.readline().strip() if noclk_line == "" or clk_line == "": break if rand_int == 2: print >> ftest, noclk_line print >> ftest, clk_line else: print >> ftrain, noclk_line print >> ftrain, clk_line
An example follows. The format is: label, user id, candidate item id, candidate item category, behavior sequence, category sequence.
```
0  A3BI7R43VUZ1TY  B00JNHU0T2  Literature & Fiction  0989464105B00B01691C14778097321608442845  BooksLiterature & FictionBooksBooks
1  A3BI7R43VUZ1TY  0989464121  Books                 0989464105B00B01691C14778097321608442845  BooksLiterature & FictionBooksBooks
```
generate_voc.py builds three vocabularies, for users, movies (items), and categories; they contain all user ids, movie ids, and category ids that appear in the training data.

In other words, movie id, category, and reviewerID are each turned into a map (mid_voc, cat_voc, uid_voc) whose key is the original value and whose value is an index. The indices are assigned after sorting the keys by occurrence count in descending order; uid indices start at 0, while mid_voc and cat_voc reserve index 0 for default_mid / default_cat and start real entries at 1. The corresponding columns of the raw data can then be converted from the original strings to these indices.
```python
import cPickle

f_train = open("local_train_splitByUser", "r")
uid_dict = {}
mid_dict = {}
cat_dict = {}

iddd = 0
for line in f_train:
    arr = line.strip("\n").split("\t")
    clk = arr[0]
    uid = arr[1]
    mid = arr[2]
    cat = arr[3]
    mid_list = arr[4]
    cat_list = arr[5]
    if uid not in uid_dict:
        uid_dict[uid] = 0
    uid_dict[uid] += 1
    if mid not in mid_dict:
        mid_dict[mid] = 0
    mid_dict[mid] += 1
    if cat not in cat_dict:
        cat_dict[cat] = 0
    cat_dict[cat] += 1
    if len(mid_list) == 0:
        continue
    # the split character below is the non-printable history separator (it does not render here)
    for m in mid_list.split(""):
        if m not in mid_dict:
            mid_dict[m] = 0
        mid_dict[m] += 1
    iddd += 1
    for c in cat_list.split(""):
        if c not in cat_dict:
            cat_dict[c] = 0
        cat_dict[c] += 1

sorted_uid_dict = sorted(uid_dict.iteritems(), key=lambda x: x[1], reverse=True)
sorted_mid_dict = sorted(mid_dict.iteritems(), key=lambda x: x[1], reverse=True)
sorted_cat_dict = sorted(cat_dict.iteritems(), key=lambda x: x[1], reverse=True)

uid_voc = {}
index = 0
for key, value in sorted_uid_dict:
    uid_voc[key] = index
    index += 1

mid_voc = {}
mid_voc["default_mid"] = 0
index = 1
for key, value in sorted_mid_dict:
    mid_voc[key] = index
    index += 1

cat_voc = {}
cat_voc["default_cat"] = 0
index = 1
for key, value in sorted_cat_dict:
    cat_voc[key] = index
    index += 1

cPickle.dump(uid_voc, open("uid_voc.pkl", "w"))
cPickle.dump(mid_voc, open("mid_voc.pkl", "w"))
cPickle.dump(cat_voc, open("cat_voc.pkl", "w"))
```
In the end, we have the files consumed by the DIN model: local_train_splitByUser and local_test_splitByUser (the training and test samples), uid_voc.pkl, mid_voc.pkl, and cat_voc.pkl (the three vocabularies), plus item-info and reviews-info (used by the data iterator for category lookup and negative sampling).
What train.py does is: first evaluate the test set once with the freshly initialized model, then train batch by batch, evaluating on the test set every test_iter iterations.
A trimmed-down version of the code:
```python
def train(
        train_file = "local_train_splitByUser",
        test_file = "local_test_splitByUser",
        uid_voc = "uid_voc.pkl",
        mid_voc = "mid_voc.pkl",
        cat_voc = "cat_voc.pkl",
        batch_size = 128,
        maxlen = 100,
        test_iter = 100,
        save_iter = 100,
        model_type = 'DNN',
        seed = 2,
):
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        # build the training and test data iterators
        train_data = DataIterator(train_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen, shuffle_each_epoch=False)
        test_data = DataIterator(test_file, uid_voc, mid_voc, cat_voc, batch_size, maxlen)
        n_uid, n_mid, n_cat = train_data.get_n()

        # build the model
        model = Model_DIN(n_uid, n_mid, n_cat, EMBEDDING_DIM, HIDDEN_SIZE, ATTENTION_SIZE)

        iter = 0
        lr = 0.001
        for itr in range(3):
            loss_sum = 0.0
            accuracy_sum = 0.
            aux_loss_sum = 0.
            for src, tgt in train_data:
                # prepare one batch of data
                uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, maxlen, return_neg=True)
                # one training step
                loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])
                loss_sum += loss
                accuracy_sum += acc
                aux_loss_sum += aux_loss
                iter += 1
                if (iter % test_iter) == 0:
                    eval(sess, test_data, model, best_model_path)
                    loss_sum = 0.0
                    accuracy_sum = 0.0
                    aux_loss_sum = 0.0
                if (iter % save_iter) == 0:
                    model.save(sess, model_path + "--" + str(iter))
            lr *= 0.5
```
DataIterator is an iterator whose job is to return the next batch of data on each call. This code covers how the data is split into batches and how to construct an iterator in the first place.
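As a reminder of the mechanics DataIterator relies on, here is a generic, minimal sketch of the Python iterator protocol (not the repo's code): `__iter__` returns the object itself, and `__next__` returns one batch per call until it raises StopIteration.

```python
class MiniBatchIterator:
    """Toy illustration of the iterator protocol that DataIterator follows."""

    def __init__(self, records, batch_size=3):
        self.records = records
        self.batch_size = batch_size
        self.pos = 0

    def __iter__(self):                 # enables "for batch in iterator"
        return self

    def __next__(self):                 # return the next batch, or stop
        if self.pos >= len(self.records):
            raise StopIteration
        batch = self.records[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch

    next = __next__                     # Python 2 spelling, as used by the repo's code

for batch in MiniBatchIterator(list(range(10))):
    print(batch)                        # [0, 1, 2], [3, 4, 5], [6, 7, 8], [9]
```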
As mentioned above, the training data format is: label, user id, candidate item id, candidate item category, behavior sequence, category sequence.
The basic logic of the `__init__` function is:

- load the three pkl vocabularies (uid_voc, mid_voc, cat_voc) into self.source_dicts;
- read "item-info" and build self.meta_id_map, which maps each movie id index to its category id index;
- read "reviews-info" and build self.mid_list_for_random, the id list used later for negative sampling;
- record basic parameters such as the batch size, maximum length, and the vocabulary sizes.

The code:
```python
class DataIterator:

    def __init__(self, source,
                 uid_voc,
                 mid_voc,
                 cat_voc,
                 batch_size=128,
                 maxlen=100,
                 skip_empty=False,
                 shuffle_each_epoch=False,
                 sort_by_length=True,
                 max_batch_size=20,
                 minlen=None):
        if shuffle_each_epoch:
            self.source_orig = source
            self.source = shuffle.main(self.source_orig, temporary=True)
        else:
            self.source = fopen(source, 'r')
        self.source_dicts = []
        # load the three pkl files into three dicts stored in self.source_dicts,
        # corresponding to [uid_voc, mid_voc, cat_voc]
        for source_dict in [uid_voc, mid_voc, cat_voc]:
            self.source_dicts.append(load_dict(source_dict))

        # read "item-info" and build the mapping from each movie id index to its
        # category id index; the key line is: self.meta_id_map[mid_idx] = cat_idx
        f_meta = open("item-info", "r")
        meta_map = {}
        for line in f_meta:
            arr = line.strip().split("\t")
            if arr[0] not in meta_map:
                meta_map[arr[0]] = arr[1]
        self.meta_id_map = {}
        for key in meta_map:
            val = meta_map[key]
            if key in self.source_dicts[1]:
                mid_idx = self.source_dicts[1][key]
            else:
                mid_idx = 0
            if val in self.source_dicts[2]:
                cat_idx = self.source_dicts[2][val]
            else:
                cat_idx = 0
            self.meta_id_map[mid_idx] = cat_idx

        # read "reviews-info" and build the id list needed for negative sampling
        f_review = open("reviews-info", "r")
        self.mid_list_for_random = []
        for line in f_review:
            arr = line.strip().split("\t")
            tmp_idx = 0
            if arr[1] in self.source_dicts[1]:
                tmp_idx = self.source_dicts[1][arr[1]]
            self.mid_list_for_random.append(tmp_idx)

        # record basic parameters, e.g. the user / movie / category vocabulary sizes
        self.batch_size = batch_size
        self.maxlen = maxlen
        self.minlen = minlen
        self.skip_empty = skip_empty

        self.n_uid = len(self.source_dicts[0])
        self.n_mid = len(self.source_dicts[1])
        self.n_cat = len(self.source_dicts[2])

        self.shuffle = shuffle_each_epoch
        self.sort_by_length = sort_by_length

        self.source_buffer = []
        self.k = batch_size * max_batch_size

        self.end_of_data = False
```
Afterwards the iterator state looks like this:
```
self = {DataIterator} <data_iterator.DataIterator object at 0x000001F56CB44BA8>
 batch_size = {int} 128
 k = {int} 2560
 maxlen = {int} 100
 meta_id_map = {dict: 367983} {0: 1572, 115840: 1, 282448: 1, 198250: 1, 4275: 1, 260890: 1, 260584: 1, 110331: 1, 116224: 1, 2704: 1, 298259: 1, 47792: 1, 186701: 1, 121548: 1, 147230: 1, 238085: 1, 367828: 1, 270505: 1, 354813: 1...
 mid_list_for_random = {list: 8898041} [4275, 4275, 4275, 4275, 4275, 4275, 4275, 4275...
 minlen = {NoneType} None
 n_cat = {int} 1601
 n_mid = {int} 367983
 n_uid = {int} 543060
 shuffle = {bool} False
 skip_empty = {bool} False
 sort_by_length = {bool} True
 source = {TextIOWrapper} <_io.TextIOWrapper name='local_train_splitByUser' mode='r' encoding='cp936'>
 source_buffer = {list: 0} []
 source_dicts = {list: 3}
  0 = {dict: 543060} {'ASEARD9XL1EWO': 449136, 'AZPJ9LUT0FEPY': 0, 'A2NRV79GKAU726': 16, 'A2GEQVDX2LL4V3': 266686, 'A3R04FKEYE19T6': 354817, 'A3VGDQOR56W6KZ': 4...
  1 = {dict: 367983} {'1594483752': 47396, '0738700797': 159716, '1439110239': 193476...
  2 = {dict: 1601} {'Residential': 1281, 'Poetry': 250, 'Winter Sports': 1390...
```
When the iterator is read, the logic of `__next__` is:

- if self.source_buffer is empty, read up to k lines from the file in one go (think of it as filling the maximum buffer at once), optionally sorting them by history length;
- pop one record at a time from self.source_buffer, map the user id, candidate item id, and category to their vocabulary indices, and build the history mid_list and cat_list;
- for each positive item in the history, draw 5 negative samples from mid_list_for_random;
- keep accumulating records until batch_size of them have been collected, then return the batch.

The code is shown below:
```python
    def __next__(self):
        if self.end_of_data:
            self.end_of_data = False
            self.reset()
            raise StopIteration

        source = []
        target = []

        # if self.source_buffer is empty, read up to k lines from the file,
        # i.e. fill the maximum buffer in one pass
        if len(self.source_buffer) == 0:
            #for k_ in xrange(self.k):
            for k_ in range(self.k):
                ss = self.source.readline()
                if ss == "":
                    break
                self.source_buffer.append(ss.strip("\n").split("\t"))

            # sort by history behavior length, if configured
            if self.sort_by_length:
                his_length = numpy.array([len(s[4].split("")) for s in self.source_buffer])
                tidx = his_length.argsort()
                _sbuf = [self.source_buffer[i] for i in tidx]
                self.source_buffer = _sbuf
            else:
                self.source_buffer.reverse()

        if len(self.source_buffer) == 0:
            self.end_of_data = False
            self.reset()
            raise StopIteration

        try:
            # actual work here: the inner iteration starts
            while True:
                # read from source file and map to word index
                try:
                    ss = self.source_buffer.pop()
                except IndexError:
                    break

                uid = self.source_dicts[0][ss[1]] if ss[1] in self.source_dicts[0] else 0
                mid = self.source_dicts[1][ss[2]] if ss[2] in self.source_dicts[1] else 0
                cat = self.source_dicts[2][ss[3]] if ss[3] in self.source_dicts[2] else 0

                # map the user's historical movie ids to indices -> mid_list
                tmp = []
                for fea in ss[4].split(""):
                    m = self.source_dicts[1][fea] if fea in self.source_dicts[1] else 0
                    tmp.append(m)
                mid_list = tmp

                # map the user's historical category ids to indices -> cat_list
                tmp1 = []
                for fea in ss[5].split(""):
                    c = self.source_dicts[2][fea] if fea in self.source_dicts[2] else 0
                    tmp1.append(c)
                cat_list = tmp1

                # read from source file and map to word index
                #if len(mid_list) > self.maxlen:
                #    continue
                if self.minlen != None:
                    if len(mid_list) <= self.minlen:
                        continue
                if self.skip_empty and (not mid_list):
                    continue

                # for every pos_mid in mid_list, build 5 negative history samples:
                # draw random ids from mid_list_for_random (redraw if equal to pos_mid)
                noclk_mid_list = []
                noclk_cat_list = []
                for pos_mid in mid_list:
                    noclk_tmp_mid = []
                    noclk_tmp_cat = []
                    noclk_index = 0
                    while True:
                        noclk_mid_indx = random.randint(0, len(self.mid_list_for_random)-1)
                        noclk_mid = self.mid_list_for_random[noclk_mid_indx]
                        if noclk_mid == pos_mid:
                            continue
                        noclk_tmp_mid.append(noclk_mid)
                        noclk_tmp_cat.append(self.meta_id_map[noclk_mid])
                        noclk_index += 1
                        if noclk_index >= 5:
                            break
                    noclk_mid_list.append(noclk_tmp_mid)
                    noclk_cat_list.append(noclk_tmp_cat)
                source.append([uid, mid, cat, mid_list, cat_list, noclk_mid_list, noclk_cat_list])
                target.append([float(ss[0]), 1-float(ss[0])])

                if len(source) >= self.batch_size or len(target) >= self.batch_size:
                    break
        except IOError:
            self.end_of_data = True

        # all sentence pairs in maxibatch filtered out because of length
        if len(source) == 0 or len(target) == 0:
            source, target = self.next()

        return source, target
```
After a batch has been fetched from the iterator, it still needs further processing:
```python
uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, noclk_mids, noclk_cats = prepare_data(src, tgt, return_neg=True)
```
This can be understood as regrouping the batch (say 128 records) by field: the 128 uids, mids, cats, and history sequences are each aggregated into one array, and everything is then handed to the model for training together.
An important detail here is the generation of the mask. Its meaning:

- A mask hides certain values so that they have no effect when parameters are updated; a padding mask is one kind of mask.
- In DIN, the user behavior sequences within a batch do not all have the same length; their true lengths are kept in keys_length, so masks are generated here to select only the real historical behaviors.
The code is as follows:
```python
def prepare_data(input, target, maxlen = None, return_neg = False):
    # x: a list of sentences
    # s[4] is the history list (same length as mid_list); its length differs per record
    lengths_x = [len(s[4]) for s in input]
    seqs_mid = [inp[3] for inp in input]
    seqs_cat = [inp[4] for inp in input]
    noclk_seqs_mid = [inp[5] for inp in input]
    noclk_seqs_cat = [inp[6] for inp in input]

    if maxlen is not None:
        new_seqs_mid = []
        new_seqs_cat = []
        new_noclk_seqs_mid = []
        new_noclk_seqs_cat = []
        new_lengths_x = []
        for l_x, inp in zip(lengths_x, input):
            if l_x > maxlen:
                new_seqs_mid.append(inp[3][l_x - maxlen:])
                new_seqs_cat.append(inp[4][l_x - maxlen:])
                new_noclk_seqs_mid.append(inp[5][l_x - maxlen:])
                new_noclk_seqs_cat.append(inp[6][l_x - maxlen:])
                new_lengths_x.append(maxlen)
            else:
                new_seqs_mid.append(inp[3])
                new_seqs_cat.append(inp[4])
                new_noclk_seqs_mid.append(inp[5])
                new_noclk_seqs_cat.append(inp[6])
                new_lengths_x.append(l_x)
        lengths_x = new_lengths_x
        seqs_mid = new_seqs_mid
        seqs_cat = new_seqs_cat
        noclk_seqs_mid = new_noclk_seqs_mid
        noclk_seqs_cat = new_noclk_seqs_cat

        if len(lengths_x) < 1:
            return None, None, None, None

    # lengths_x keeps the true length of each behavior sequence; maxlen_x is the largest one
    n_samples = len(seqs_mid)
    maxlen_x = numpy.max(lengths_x)  # the longest mid_list in this batch; 583 in this example
    neg_samples = len(noclk_seqs_mid[0][0])

    # because the history lengths vary, mid_his etc. pad every sequence to maxlen_x;
    # sequences shorter than maxlen_x are padded with 0 (the matrices are zero-initialized)
    mid_his = numpy.zeros((n_samples, maxlen_x)).astype('int64')                     # shape (128, 583)
    cat_his = numpy.zeros((n_samples, maxlen_x)).astype('int64')
    noclk_mid_his = numpy.zeros((n_samples, maxlen_x, neg_samples)).astype('int64')  # shape (128, 583, 5)
    noclk_cat_his = numpy.zeros((n_samples, maxlen_x, neg_samples)).astype('int64')  # shape (128, 583, 5)
    mid_mask = numpy.zeros((n_samples, maxlen_x)).astype('float32')

    # zip packs the corresponding elements of the iterables into tuples
    for idx, [s_x, s_y, no_sx, no_sy] in enumerate(zip(seqs_mid, seqs_cat, noclk_seqs_mid, noclk_seqs_cat)):
        mid_mask[idx, :lengths_x[idx]] = 1.
        mid_his[idx, :lengths_x[idx]] = s_x
        cat_his[idx, :lengths_x[idx]] = s_y
        # noclk_mid_his and noclk_cat_his are both (128, 583, 5)
        noclk_mid_his[idx, :lengths_x[idx], :] = no_sx   # direct assignment
        noclk_cat_his[idx, :lengths_x[idx], :] = no_sy   # direct assignment

    uids = numpy.array([inp[0] for inp in input])
    mids = numpy.array([inp[1] for inp in input])
    cats = numpy.array([inp[2] for inp in input])

    # pull every uid, mid, cat ... out of the 128-element input list, aggregate, and return
    if return_neg:
        return uids, mids, cats, mid_his, cat_his, mid_mask, numpy.array(target), numpy.array(lengths_x), noclk_mid_his, noclk_cat_his
    else:
        return uids, mids, cats, mid_his, cat_his, mid_mask, numpy.array(target), numpy.array(lengths_x)
```
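The mid_mask built here is consumed later inside the model, in the attention and pooling over the padded history. As a rough, self-contained NumPy sketch (not the repo's actual model code), this is how a padding mask is typically applied: padded positions are pushed to a very negative logit before the softmax so they get essentially zero attention weight.

```python
import numpy as np

def masked_attention_scores(scores, mask):
    """scores: (B, T) raw attention logits for each history position;
    mask: (B, T) with 1.0 for real behaviors and 0.0 for padding
    (same layout as the mid_mask produced by prepare_data)."""
    # push padded positions to a very negative value so softmax gives them ~0 weight
    scores = np.where(mask > 0, scores, -2.0 ** 32 + 1)
    # softmax over the history axis
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# toy example: batch of 2, max history length 4, true lengths 2 and 4
scores = np.random.randn(2, 4).astype("float32")
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1]], dtype="float32")
print(masked_attention_scores(scores, mask))  # each row sums to 1; padded columns are ~0
```

The same mask can also be multiplied element-wise onto the padded history embeddings when sum-pooling them, so that padding positions contribute nothing.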
Finally, the batch is fed into the model for training, which is this step in train.py:
```python
loss, acc, aux_loss = model.train(sess, [uids, mids, cats, mid_his, cat_his, mid_mask, target, sl, lr, noclk_mids, noclk_cats])
```
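For reference, a sketch of the shapes handed to this call for one batch, as implied by prepare_data above (B is the batch size and T = maxlen_x is the longest history in the batch):

```python
# Shapes of the tensors passed to model.train for one batch of B = 128 samples:
#   uids, mids, cats        -> (B,)        user / candidate item / candidate category indices
#   mid_his, cat_his        -> (B, T)      zero-padded behavior histories
#   mid_mask                -> (B, T)      1.0 for real history positions, 0.0 for padding
#   target                  -> (B, 2)      [click, no-click] label
#   sl                      -> (B,)        true history lengths (lengths_x)
#   noclk_mids, noclk_cats  -> (B, T, 5)   5 negative samples per history position
```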