Kaggle in Practice: Click-Through Rate Prediction


Original article. Please credit the source when reposting: http://blog.csdn.net/chengcheng1394/article/details/78940565

Please install TensorFlow 1.0 and Python 3.5.
Project address:
https://github.com/chengstone/kaggle_criteo_ctr_challenge-

Preface
Click-through rate (CTR) prediction estimates the probability that an ad will be clicked by a user. By predicting the outcome of every ad impression, it surfaces the ads a user is most likely to click, and it is one of the most important algorithms in advertising technology.

Dataset download

This time we use the Criteo dataset from Kaggle's Display Advertising Challenge.
To download the dataset, run the following commands in a terminal (script path: ./data/download.sh):
wget --no-check-certificate https://s3-eu-west-1.amazonaws.com/criteo-labs/dac.tar.gz
tar zxf dac.tar.gz
rm -f dac.tar.gz
mkdir raw
mv ./*.txt raw/

After extraction, train.txt is 11.7 GB and test.txt is 1.35 GB.
That is too much data, so we only use the first one million rows of each:
head -n 1000000 test.txt > test_sub100w.txt
head -n 1000000 train.txt > train_sub100w.txt
Then rename the files back to train.txt and test.txt, keeping them in the same location.

Data fields
Label
Target variable that indicates if an ad was clicked (1) or not (0).

I1-I13
A total of 13 columns of integer features (mostly count features).

C1-C26
A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.

The data contains a Label field indicating whether the ad was clicked, 13 numerical features I1-I13 (the dense input), and 26 categorical features C1-C26 (the sparse input).

Network model

The model consists of three sub-networks: an FFM (Field-aware Factorization Machine), an FM (Factorization Machine), and a DNN; the FM branch itself is made of two components, GBDT and FM. Normally the data preprocessing stage involves feature engineering such as crossing and combining features to find the ones that actually help prediction, which is genuinely skilled work.

This time we skip the feature engineering step, combine these components with a deep neural network, and leave the feature selection to the model. FFM uses LibFFM, FM uses LibFM, and GBDT uses LightGBM (you could use xgboost as well).

GBDT
After the training data is fed in, GBDT trains a number of trees. What we use is the leaf node each tree outputs; these leaf nodes are fed to FM as categorical features. For this use of decision trees, see Facebook's paper Practical Lessons from Predicting Clicks on Ads at Facebook.
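To make the hand-off concrete, here is a minimal sketch (illustrative only, not the project's exact code; the real conversion is done by generat_lgb2fm_data further below) of pulling per-tree leaf indices out of a trained LightGBM booster so they can be treated as categorical features for FM:

import numpy as np
import lightgbm as lgb  # assumes a booster trained with lgb.train(), as later in this post

def leaves_as_categorical(gbm, X):
    # pred_leaf=True returns, for every sample, the index of the leaf it falls
    # into in each tree: shape (n_samples, n_trees); each column is one
    # categorical feature for the downstream FM
    leaves = gbm.predict(X, pred_leaf=True)
    return np.asarray(leaves, dtype=np.int32)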


FM
FM addresses the problem of learning feature combinations on large and sparse data. Let's look at the formula (considering only the second-order polynomial case):

y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} w_{ij} x_i x_j

where n is the number of features in a sample, x_i is the value of the i-th feature, and w_0, w_i and w_{ij} are the model parameters.

The formula shows that, on top of a linear model, it adds pairwise feature combinations x_i x_j, which of course only matter when both x_i and x_j are non-zero. In practice, however, learning the parameters of these combinations is hard: the input data is generally sparse, so x_i and x_j are mostly zero, and a combination parameter w_{ij} can only be learned to a meaningful value on samples where both features are non-zero.

For example, among shopping-related features, women may pay more attention to cosmetics or jewelry while men may care more about sports gear or electronics, which shows that learning feature combinations is meaningful. But a product feature may have hundreds or thousands of categories, and categorical features are usually one-hot encoded, so a single field turns into hundreds of dimensions; together with the other categorical fields, the input feature space explodes. Data sparsity is therefore an unavoidable challenge in practice.

To make the second-order parameters trainable, matrix factorization is introduced. In the previous article we discussed a movie recommender system: we built user feature vectors and movie feature vectors, and the dot product of the two gave a user's rating of a movie. Multiplying the user feature matrix by the movie feature matrix yields the full user-by-movie rating matrix.

Looking at that process in reverse, the rating matrix can be factorized into a user matrix and a movie matrix, and each entry of the rating matrix plays the same role as the combination parameter w_{ij} discussed above.

For the parameter matrix W we apply matrix factorization, expressing each parameter w_{ij} as the dot product of two vectors (called latent vectors). The matrix then factorizes as W = V^T V, with each parameter w_{ij} = ⟨v_i, v_j⟩, where v_i is the latent vector of the i-th feature. The second-order FM formula then becomes:

y(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j

This is the idea behind the FM model.
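As a small illustration (not code from the project), the factorized second-order term can be computed in O(nk) time with the usual identity \sum_{i<j}\langle v_i,v_j\rangle x_i x_j = \tfrac{1}{2}\sum_f\big[(\sum_i v_{if}x_i)^2-\sum_i v_{if}^2x_i^2\big]:

import numpy as np

def fm_score(x, w0, w, V):
    # x: (n,) feature vector; w0: bias; w: (n,) linear weights; V: (n, k) latent vectors
    linear = w0 + w @ x
    s = V.T @ x                  # (k,)  sum_i v_{i,f} x_i
    s2 = (V ** 2).T @ (x ** 2)   # (k,)  sum_i v_{i,f}^2 x_i^2
    pairwise = 0.5 * np.sum(s * s - s2)
    return linear + pairwise

rng = np.random.RandomState(0)
x = rng.rand(5)
print(fm_score(x, 0.1, rng.randn(5), V=rng.randn(5, 3)))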

The leaf nodes output by GBDT serve as the training input for the FM model, so for our FM branch we need to train both GBDT and FM. As you can see, this CTR network is considerably more complex than before, with more factors and hyperparameters affecting the final result. Training of the FM and GBDT components is described further below.

FFM
Next we need to train the FFM model. FFM adds the notion of a Field on top of FM. For instance, a product column is a single categorical feature that can be split into many different features, but all of those features belong to the same Field; more generally, all the features derived from one categorical column can be placed in the same Field.

You can think of it as a one-to-many relationship. Take an occupation column: it is one feature, and after one-hot encoding it becomes N features, all of which still belong to occupation, so occupation is one Field.

We learn latent vectors through feature combinations, but now each feature x_i learns a separate latent vector v_{i,f_j} for every Field f_j of the other features. In other words, the latent vector depends not only on the feature but also on the Field. The model's formula:
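Written out with the symbols defined above (this is the standard form from the libffm paper; the bias and linear terms can be kept as in FM):

\phi_{\mathrm{FFM}}(\mathbf{w}, \mathbf{x}) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_{i, f_j}, \mathbf{v}_{j, f_i} \rangle \, x_i x_j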


DNN
Now the DNN part. The input is split in two: numerical features (dense input) and categorical features (sparse input). We still do not one-hot encode the categorical features; instead they go through an embedding layer to produce embedding vectors, which are concatenated with the numerical features and passed through three fully connected layers with ReLU activations. The output of the third fully connected layer is then concatenated with the outputs of the FFM and FM fully connected layers and fed into a final fully connected layer.

The target Label we learn indicates whether the ad was clicked, with only two states: 1 (clicked) and 0 (not clicked). So the last layer of the network performs logistic regression: a sigmoid activation on the final fully connected layer gives the probability that the ad is clicked.

We use LogLoss as the loss function and FTRL as the optimizer.
The paper on FTRL: Ad Click Prediction: a View from the Trenches.
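For reference, the LogLoss that the network minimizes (and that the competition uses for scoring) is, with y_i the label and p_i the predicted click probability:

\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \big]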

Note: I have modified the LibFFM and LibFM code; please use my modified versions from the code repository.

Preprocessing the dataset
Generate the input for the neural network
Generate the input for FFM
Generate the input for GBDT
First we preprocess the inputs for the DNN, FFM and GBDT. For the numerical features, I1-I13 are converted to decimals between 0 and 1. For the categorical features, values used fewer than cutoff times (a hyperparameter) are dropped; the frequently used values are kept as that column's features, and these features are then numbered within each column (field).

For example, suppose there are two categorical columns, C1 and C2. Under C1 there are the values a (more than cutoff occurrences), b (fewer than cutoff) and c (more than cutoff); under C2 there are x and y (both above cutoff). The features that remain are then C1: a, c and C2: x, y. Numbering within each column, the feature ids of a and c under C1 are 0 and 1, and for C2 the ids of x and y are also 0 and 1.
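A minimal sketch of this cutoff-and-number-per-field idea (an illustrative helper, not the project code; the real implementation is the CategoryDictGenerator class shown later, which also reserves id 0 for unknown values):

from collections import Counter

def build_field_dict(values, cutoff):
    # keep only values seen at least `cutoff` times, then number them 0..k-1 within this field
    counts = Counter(v for v in values if v != '')
    kept = sorted(f for f, c in counts.items() if c >= cutoff)
    return {f: i for i, f in enumerate(kept)}

c1 = build_field_dict(['a', 'c', 'a', 'b', 'c'], cutoff=2)  # {'a': 0, 'c': 1}
c2 = build_field_dict(['x', 'y', 'x', 'y'], cutoff=2)       # {'x': 0, 'y': 1}
print(c1, c2)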

The handling of the categorical inputs differs between FFM and GBDT, so let's go through them separately.

GBDT
GBDT's input handling is simpler: each of C1-C26 just uses its own per-column feature id. The GBDT input format is: Label I1-I13 C1-C26, so an actual input row may look like: 0 decimal1 decimal2 ~ decimal13 1 (C1 feature id) 0 (C2 feature id) ~ C26 feature id. Here the C1 feature id is 1, meaning the C1 value at this position is c, and the C2 value is x.

Here is a real generated row: 0 0.05 0.004983 0.05 0 0.021594 0.008 0.15 0.04 0.362 0.166667 0.2 0 0.04 2 3 0 0 1 1 0 3 1 0 0 0 0 3 0 0 1 4 1 3 0 0 2 0 1 0

Apologies if my wording has made the text above confusing; in that case, just read the code. :)

FFM
The FFM input data is more involved; see the explanation on the official GitHub page, excerpted below:

It is important to understand the difference between field and feature. For example, if we have a raw data like this:

Click Advertiser Publisher
===== ========== =========
0 Nike CNN
1 ESPN BBC
Here, we have

* 2 fields: Advertiser and Publisher
* 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC
Usually you will need to build two dictionares, one for field and one for features, like this:

DictField[Advertiser] -> 0
DictField[Publisher] -> 1

DictFeature[Advertiser-Nike] -> 0
DictFeature[Publisher-CNN] -> 1
DictFeature[Advertiser-ESPN] -> 2
DictFeature[Publisher-BBC] -> 3
Then, you can generate FFM format data:

0 0:0:1 1:1:1
1 0:2:1 1:3:1
Note that because these features are categorical, the values here are all ones.

The fields part should be easy to understand. The feature numbering, however, differs from the GBDT case: for GBDT we numbered features independently within each column (C1 has features 0..n and C2 has features 0..n), whereas FFM numbers all features in one shared space. In the example above, C1 is Advertiser with two features and C2 is Publisher with two features, so numbered together they are 0..3. For GBDT we would number them independently, like this:

DictFeature[Advertiser-Nike] -> 0
DictFeature[Advertiser-ESPN] -> 1
DictFeature[Publisher-CNN] -> 0
DictFeature[Publisher-BBC] -> 1
Now suppose there is a third row; let's see how the FFM input is constructed:

Click Advertiser Publisher
===== ========== =========
0 Nike CNN
1 ESPN BBC
0 Lining CNN
Following the rule above, it would look like this:

DictFeature[Advertiser-Nike] -> 0
DictFeature[Publisher-CNN] -> 1
DictFeature[Advertiser-ESPN] -> 2
DictFeature[Publisher-BBC] -> 3
DictFeature[Advertiser-Lining] -> 4
Our FFM input processing differs slightly: after one column is numbered, the next column continues the numbering, so the final feature numbering looks like this:

DictFeature[Advertiser-Nike] -> 0
DictFeature[Advertiser-ESPN] -> 1
DictFeature[Advertiser-Lining] -> 2
DictFeature[Publisher-CNN] -> 3
DictFeature[Publisher-BBC] -> 4
Since our data is numbered starting from I1 through I13, the numbering of C1 starts with an offset of 13.
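A small sketch of how a per-field id becomes one global FFM feature id (illustrative only; preprocess() below does the same thing with categorial_feature_offset and the +13 shift for I1-I13):

dict_sizes = [2, 2]          # e.g. C1 kept 2 feature values, C2 kept 2
offsets = [0]
for size in dict_sizes[:-1]:
    offsets.append(offsets[-1] + size)

def global_feature_id(field_idx, local_id, offsets, n_dense=13):
    # I1-I13 occupy global ids 0-12, so categorical ids start at 13
    return n_dense + offsets[field_idx] + local_id

print(global_feature_id(0, 1, offsets))  # C1, local id 1 -> 14
print(global_feature_id(1, 0, offsets))  # C2, local id 0 -> 15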

Here is a real FFM input row:
0 0:0:0.05 1:1:0.004983 2:2:0.05 3:3:0 4:4:0.021594 5:5:0.008 6:6:0.15 7:7:0.04 8:8:0.362 9:9:0.166667 10:10:0.2 11:11:0 12:12:0.04 13:15:1 14:29:1 15:64:1 16:76:1 17:92:1 18:101:1 19:107:1 20:122:1 21:131:1 22:133:1 23:143:1 24:166:1 25:179:1 26:209:1 27:216:1 28:243:1 29:260:1 30:273:1 31:310:1 32:317:1 33:318:1 34:333:1 35:340:1 36:348:1 37:368:1 38:381:1

DNN
The DNN input data is less complicated: still the I1-I13 decimals and the globally numbered C1-C26 feature ids, just like FFM but without the extra offset of 13, with the Label at the end.
A real row looks like this:
0.05,0.004983,0.05,0,0.021594,0.008,0.15,0.04,0.362,0.166667,0.2,0,0.04,2,16,51,63,79,88,94,109,118,120,130,153,166,196,203,230,247,260,297,304,305,320,327,335,355,368,0
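A tiny parsing sketch (an assumed helper, not from the repo) showing how one such line splits into the three parts the network consumes:

def parse_dnn_line(line):
    parts = line.rstrip('\n').split(',')
    dense = [float(v) for v in parts[:13]]   # I1-I13, already scaled to [0, 1]
    sparse = [int(v) for v in parts[13:39]]  # C1-C26, globally numbered feature ids
    label = int(parts[39])
    return dense, sparse, label

dense, sparse, label = parse_dnn_line(
    "0.05,0.004983,0.05,0,0.021594,0.008,0.15,0.04,0.362,0.166667,0.2,0,0.04,"
    "2,16,51,63,79,88,94,109,118,120,130,153,166,196,203,230,247,260,297,304,"
    "305,320,327,335,355,368,0")
print(len(dense), len(sparse), label)  # 13 26 0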

That covers the explanations; let's look at the code. Since it generates training, validation and test data all at once, it takes a while to run.

Core code walkthrough
See the project repository for the complete code.
The following code comes from Baidu deep_fm's preprocess.py with a few small additions of mine; no need to reinvent the wheel. :)

import os
import sys
import random
import collections
import numpy as np

# There are 13 integer features and 26 categorical features
continous_features = range(1, 14)
categorial_features = range(14, 40)

# Clip integer features. The clip point for each integer feature
# is derived from the 95% quantile of the total values in each feature
continous_clip = [20, 600, 100, 50, 64000, 500, 100, 50, 500, 10, 10, 10, 50]


class ContinuousFeatureGenerator:
    """
    Normalize the integer features to [0, 1] by min-max normalization
    """

    def __init__(self, num_feature):
        self.num_feature = num_feature
        self.min = [sys.maxsize] * num_feature
        self.max = [-sys.maxsize] * num_feature

    def build(self, datafile, continous_features):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    val = features[continous_features[i]]
                    if val != '':
                        val = int(val)
                        if val > continous_clip[i]:
                            val = continous_clip[i]
                        self.min[i] = min(self.min[i], val)
                        self.max[i] = max(self.max[i], val)

    def gen(self, idx, val):
        if val == '':
            return 0.0
        val = float(val)
        return (val - self.min[idx]) / (self.max[idx] - self.min[idx])


class CategoryDictGenerator:
    """
    Generate dictionary for each of the categorical features
    """

    def __init__(self, num_feature):
        self.dicts = []
        self.num_feature = num_feature
        for i in range(0, num_feature):
            self.dicts.append(collections.defaultdict(int))

    def build(self, datafile, categorial_features, cutoff=0):
        with open(datafile, 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')
                for i in range(0, self.num_feature):
                    if features[categorial_features[i]] != '':
                        self.dicts[i][features[categorial_features[i]]] += 1
        for i in range(0, self.num_feature):
            self.dicts[i] = filter(lambda x: x[1] >= cutoff,
                                   self.dicts[i].items())

            self.dicts[i] = sorted(self.dicts[i], key=lambda x: (-x[1], x[0]))
            vocabs, _ = list(zip(*self.dicts[i]))
            self.dicts[i] = dict(zip(vocabs, range(1, len(vocabs) + 1)))
            self.dicts[i]['<unk>'] = 0

    def gen(self, idx, key):
        if key not in self.dicts[idx]:
            res = self.dicts[idx]['<unk>']
        else:
            res = self.dicts[idx][key]
        return res

    def dicts_sizes(self):
        return list(map(len, self.dicts))
def preprocess(datadir, outdir):
    """
    All the 13 integer features are normalzied to continous values and these
    continous features are combined into one vecotr with dimension 13.

    Each of the 26 categorical features are one-hot encoded and all the one-hot
    vectors are combined into one sparse binary vector.
    """
    dists = ContinuousFeatureGenerator(len(continous_features))
    dists.build(os.path.join(datadir, 'train.txt'), continous_features)

    dicts = CategoryDictGenerator(len(categorial_features))
    dicts.build(
        os.path.join(datadir, 'train.txt'), categorial_features, cutoff=200)  # 200 50

    dict_sizes = dicts.dicts_sizes()
    categorial_feature_offset = [0]
    for i in range(1, len(categorial_features)):
        offset = categorial_feature_offset[i - 1] + dict_sizes[i - 1]
        categorial_feature_offset.append(offset)

    random.seed(0)

    # 90% of the data are used for training, and 10% of the data are used
    # for validation.
    train_ffm = open(os.path.join(outdir, 'train_ffm.txt'), 'w')
    valid_ffm = open(os.path.join(outdir, 'valid_ffm.txt'), 'w')

    train_lgb = open(os.path.join(outdir, 'train_lgb.txt'), 'w')
    valid_lgb = open(os.path.join(outdir, 'valid_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'train.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid.txt'), 'w') as out_valid:
            with open(os.path.join(datadir, 'train.txt'), 'r') as f:
                for line in f:
                    features = line.rstrip('\n').split('\t')

                    continous_feats = []
                    continous_vals = []
                    for i in range(0, len(continous_features)):
                        val = dists.gen(i, features[continous_features[i]])
                        continous_vals.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                        continous_feats.append(
                            "{0:.6f}".format(val).rstrip('0').rstrip('.'))  # ('{0}'.format(val))

                    categorial_vals = []
                    categorial_lgb_vals = []
                    for i in range(0, len(categorial_features)):
                        val = dicts.gen(i, features[categorial_features[i]]) + categorial_feature_offset[i]
                        categorial_vals.append(str(val))
                        val_lgb = dicts.gen(i, features[categorial_features[i]])
                        categorial_lgb_vals.append(str(val_lgb))

                    continous_vals = ','.join(continous_vals)
                    categorial_vals = ','.join(categorial_vals)
                    label = features[0]
                    if random.randint(0, 9999) % 10 != 0:
                        out_train.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')
                        train_ffm.write('\t'.join(label) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii, val in enumerate(continous_vals.split(','))]) + '\t')
                        train_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                        train_lgb.write('\t'.join(label) + '\t')
                        train_lgb.write('\t'.join(continous_feats) + '\t')
                        train_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

                    else:
                        out_valid.write(','.join(
                            [continous_vals, categorial_vals, label]) + '\n')
                        valid_ffm.write('\t'.join(label) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:{}'.format(ii, ii, val) for ii, val in enumerate(continous_vals.split(','))]) + '\t')
                        valid_ffm.write('\t'.join(
                            ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                        valid_lgb.write('\t'.join(label) + '\t')
                        valid_lgb.write('\t'.join(continous_feats) + '\t')
                        valid_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

    train_ffm.close()
    valid_ffm.close()

    train_lgb.close()
    valid_lgb.close()

    test_ffm = open(os.path.join(outdir, 'test_ffm.txt'), 'w')
    test_lgb = open(os.path.join(outdir, 'test_lgb.txt'), 'w')

    with open(os.path.join(outdir, 'test.txt'), 'w') as out:
        with open(os.path.join(datadir, 'test.txt'), 'r') as f:
            for line in f:
                features = line.rstrip('\n').split('\t')

                continous_feats = []
                continous_vals = []
                for i in range(0, len(continous_features)):
                    val = dists.gen(i, features[continous_features[i] - 1])
                    continous_vals.append(
                        "{0:.6f}".format(val).rstrip('0').rstrip('.'))
                    continous_feats.append(
                        "{0:.6f}".format(val).rstrip('0').rstrip('.'))  # ('{0}'.format(val))

                categorial_vals = []
                categorial_lgb_vals = []
                for i in range(0, len(categorial_features)):
                    val = dicts.gen(i,
                                    features[categorial_features[i] -
                                             1]) + categorial_feature_offset[i]
                    categorial_vals.append(str(val))

                    val_lgb = dicts.gen(i, features[categorial_features[i] - 1])
                    categorial_lgb_vals.append(str(val_lgb))

                continous_vals = ','.join(continous_vals)
                categorial_vals = ','.join(categorial_vals)

                out.write(','.join([continous_vals, categorial_vals]) + '\n')

                test_ffm.write('\t'.join(['{}:{}:{}'.format(ii, ii, val) for ii, val in enumerate(continous_vals.split(','))]) + '\t')
                test_ffm.write('\t'.join(
                    ['{}:{}:1'.format(ii + 13, str(np.int32(val) + 13)) for ii, val in enumerate(categorial_vals.split(','))]) + '\n')

                test_lgb.write('\t'.join(continous_feats) + '\t')
                test_lgb.write('\t'.join(categorial_lgb_vals) + '\n')

    test_ffm.close()
    test_lgb.close()
    return dict_sizes
Training FFM
The data is ready, so we call LibFFM to train the FFM model.
The learning rate is 0.1 with 32 iterations; the trained model is saved as model_ffm.

cmd = './libffm/libffm/ffm-train --auto-stop -r 0.1 -t 32 -s {nr_thread} -p ./data/valid_ffm.txt ./data/train_ffm.txt model_ffm'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
Training output:

['First check if the text file has already been converted to binary format (1.3 seconds)\n',
'Binary file found. Skip converting text to binary\n',
'First check if the text file has already been converted to binary format (0.2 seconds)\n',
'Binary file found. Skip converting text to binary\n',
'iter tr_logloss va_logloss tr_time\n',
' 1 0.49339 0.48196 12.8\n',
' 2 0.47621 0.47651 25.9\n',
' 3 0.47149 0.47433 39.0\n',
' 4 0.46858 0.47277 51.2\n',
' 5 0.46630 0.47168 63.0\n',
' 6 0.46447 0.47092 74.7\n',
' 7 0.46269 0.47038 86.4\n',
' 8 0.46113 0.47000 98.0\n',
' 9 0.45960 0.46960 109.6\n',
' 10 0.45811 0.46940 121.2\n',
' 11 0.45660 0.46913 132.5\n',
' 12 0.45509 0.46899 144.3\n',
' 13 0.45366 0.46903\n',
'Auto-stop. Use model at 12th iteration.\n']
With the FFM model trained, we feed the training, validation and test data through it to obtain the output of the FFM layer; the output files are named *.out.logit.

cmd = './libffm/libffm/ffm-predict ./data/train_ffm.txt model_ffm tr_ffm.out'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
cmd = './libffm/libffm/ffm-predict ./data/valid_ffm.txt model_ffm va_ffm.out'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
cmd = './libffm/libffm/ffm-predict ./data/test_ffm.txt model_ffm te_ffm.out true'.format(nr_thread=NR_THREAD)
os.popen(cmd).readlines()
Training GBDT
Now we call LightGBM to train the GBDT model. Decision trees overfit easily, so we set the number of trees to 32 and the number of leaves to 30, leave the depth unset, and use a learning rate of 0.05.

import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

def lgb_pred(tr_path, va_path, _sep='\t', iter_num=32):
    # load or create your dataset
    print('Load data...')
    df_train = pd.read_csv(tr_path, header=None, sep=_sep)
    df_test = pd.read_csv(va_path, header=None, sep=_sep)

    y_train = df_train[0].values
    y_test = df_test[0].values
    X_train = df_train.drop(0, axis=1).values
    X_test = df_test.drop(0, axis=1).values

    # create dataset for lightgbm
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

    # specify your configurations as a dict
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': {'l2', 'auc', 'logloss'},
        'num_leaves': 30,
        # 'max_depth': 7,
        'num_trees': 32,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }

    print('Start training...')
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=iter_num,
                    valid_sets=lgb_eval,
                    feature_name=["I1","I2","I3","I4","I5","I6","I7","I8","I9","I10","I11","I12","I13","C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    categorical_feature=["C1","C2","C3","C4","C5","C6","C7","C8","C9","C10","C11","C12","C13","C14","C15","C16","C17","C18","C19","C20","C21","C22","C23","C24","C25","C26"],
                    early_stopping_rounds=5)

    print('Save model...')
    # save model to file
    gbm.save_model('lgb_model.txt')

    print('Start predicting...')
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

    return gbm, y_pred, X_train, y_train
Training output:

[1] valid_0's l2: 0.241954 valid_0's auc: 0.70607
Training until validation scores don't improve for 5 rounds.
[2] valid_0's l2: 0.234704 valid_0's auc: 0.715608
[3] valid_0's l2: 0.228139 valid_0's auc: 0.717791
[4] valid_0's l2: 0.222168 valid_0's auc: 0.72273
[5] valid_0's l2: 0.216728 valid_0's auc: 0.724065
[6] valid_0's l2: 0.211819 valid_0's auc: 0.725036
[7] valid_0's l2: 0.207316 valid_0's auc: 0.727427
[8] valid_0's l2: 0.203296 valid_0's auc: 0.728583
[9] valid_0's l2: 0.199582 valid_0's auc: 0.730092
[10] valid_0's l2: 0.196185 valid_0's auc: 0.730792
[11] valid_0's l2: 0.193063 valid_0's auc: 0.732316
[12] valid_0's l2: 0.190268 valid_0's auc: 0.733773
[13] valid_0's l2: 0.187697 valid_0's auc: 0.734782
[14] valid_0's l2: 0.185351 valid_0's auc: 0.735636
[15] valid_0's l2: 0.183215 valid_0's auc: 0.736346
[16] valid_0's l2: 0.181241 valid_0's auc: 0.737393
[17] valid_0's l2: 0.179468 valid_0's auc: 0.737709
[18] valid_0's l2: 0.177829 valid_0's auc: 0.739096
[19] valid_0's l2: 0.176326 valid_0's auc: 0.740135
[20] valid_0's l2: 0.174948 valid_0's auc: 0.741065
[21] valid_0's l2: 0.173675 valid_0's auc: 0.742165
[22] valid_0's l2: 0.172499 valid_0's auc: 0.742672
[23] valid_0's l2: 0.171471 valid_0's auc: 0.743246
[24] valid_0's l2: 0.17045 valid_0's auc: 0.744415
[25] valid_0's l2: 0.169582 valid_0's auc: 0.744792
[26] valid_0's l2: 0.168746 valid_0's auc: 0.745478
[27] valid_0's l2: 0.167966 valid_0's auc: 0.746282
[28] valid_0's l2: 0.167264 valid_0's auc: 0.74675
[29] valid_0's l2: 0.166582 valid_0's auc: 0.747429
[30] valid_0's l2: 0.16594 valid_0's auc: 0.748392
[31] valid_0's l2: 0.165364 valid_0's auc: 0.748986
[32] valid_0's l2: 0.164844 valid_0's auc: 0.749362
Did not meet early stopping. Best iteration is:
[32] valid_0's l2: 0.164844 valid_0's auc: 0.749362
Save model...
Start predicting...
The rmse of prediction is: 0.406009502303
Let's rank each feature by importance and take a look:
def ret_feat_impt(gbm):
    gain = gbm.feature_importance("gain").reshape(-1, 1) / sum(gbm.feature_importance("gain"))
    col = np.array(gbm.feature_name()).reshape(-1, 1)
    return sorted(np.column_stack((col, gain)), key=lambda x: x[1], reverse=True)
Feature  Gain share (rounded to 4 decimals)
I6       0.1979
I11      0.1892
C13      0.0988
I7       0.0933
C15      0.0784
I1       0.0690
C18      0.0340
C4       0.0319
I13      0.0278
C14      0.0229
C17      0.0176
I3       0.0175
C24      0.0157
C7       0.0142
I8       0.0134
C11      0.0124
C10      0.0110
I5       0.0104
C16      0.0104
I9       0.0099
C2       0.0068
C12      0.0052
I4       0.0047
C26      0.0034
C23      0.0031
C21      0.0009
C19      0.0004
I2, I10, I12, C1, C3, C5, C6, C8, C9, C20, C22 and C25 all have zero gain.
Analyzing the features with eli5
import eli5

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import csv
import numpy as np

with open('./data/train_eli5.csv', 'rt') as f:
    data = list(csv.DictReader(f))

_all_xs = [{k: v for k, v in row.items() if k != 'clicked'} for row in data]
_all_ys = np.array([int(row['clicked']) for row in data])

all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))
899991 items total, 25.5% true
# from xgboost import XGBClassifier
import warnings
# xgboost <= 0.6a2 shows a warning when used with scikit-learn 0.18+
warnings.filterwarnings('ignore', category=UserWarning)
class CSCTransformer:
    def transform(self, xs):
        # work around https://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543
        return xs.tocsc()
    def fit(self, *args):
        return self

clf = lgb.LGBMClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, CSCTransformer(), clf)

def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted

evaluate(pipeline)
Accuracy: 0.776 ± 0.003
booster = clf.booster_  # if this line raises an error, use clf.booster() instead
original_feature_names = booster.feature_name
booster.feature_names = vec.get_feature_names()
# recover original feature names
booster.feature_names = original_feature_names
from eli5 import show_weights
show_weights(clf, vec=vec)


from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)


Generating FM data from LightGBM's output
For the data format, see the description in the libFM 1.4.2 manual.

GBDT is now trained. We need the leaf nodes output by GBDT as the input data X for FM; with 30 leaves per tree, the FM input format is index:value for every non-zero entry of X.

A real row looks like this: 0 0:31 1:61 2:93 3:108 4:149 5:182 6:212 7:242 8:277 9:310 10:334 11:365 12:401 13:434 14:465 15:491 16:527 17:552 18:589 19:619 20:648 21:678 22:697 23:744 24:770 25:806 26:826 27:862 28:899 29:928 30:955 31:988

def generat_lgb2fm_data(outdir, gbm, dump, tr_path, va_path, te_path, _sep='\t'):
    with open(os.path.join(outdir, 'train_lgb2fm.txt'), 'w') as out_train:
        with open(os.path.join(outdir, 'valid_lgb2fm.txt'), 'w') as out_valid:
            with open(os.path.join(outdir, 'test_lgb2fm.txt'), 'w') as out_test:
                df_train_ = pd.read_csv(tr_path, header=None, sep=_sep)
                df_valid_ = pd.read_csv(va_path, header=None, sep=_sep)
                df_test_ = pd.read_csv(te_path, header=None, sep=_sep)

                y_train_ = df_train_[0].values
                y_valid_ = df_valid_[0].values

                X_train_ = df_train_.drop(0, axis=1).values
                X_valid_ = df_valid_.drop(0, axis=1).values
                X_test_ = df_test_.values

                train_leaves = gbm.predict(X_train_, num_iteration=gbm.best_iteration, pred_leaf=True)
                valid_leaves = gbm.predict(X_valid_, num_iteration=gbm.best_iteration, pred_leaf=True)
                test_leaves = gbm.predict(X_test_, num_iteration=gbm.best_iteration, pred_leaf=True)

                tree_info = dump['tree_info']
                tree_counts = len(tree_info)
                for i in range(tree_counts):
                    train_leaves[:, i] = train_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    valid_leaves[:, i] = valid_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    test_leaves[:, i] = test_leaves[:, i] + tree_info[i]['num_leaves'] * i + 1
                    # print(train_leaves[:, i])
                    # print(tree_info[i]['num_leaves'])

                for idx in range(len(y_train_)):
                    out_train.write((str(y_train_[idx]) + '\t'))
                    out_train.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii, val in enumerate(train_leaves[idx]) if float(val) != 0]) + '\n')

                for idx in range(len(y_valid_)):
                    out_valid.write((str(y_valid_[idx]) + '\t'))
                    out_valid.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii, val in enumerate(valid_leaves[idx]) if float(val) != 0]) + '\n')

                for idx in range(len(X_test_)):
                    out_test.write('\t'.join(
                        ['{}:{}'.format(ii, val) for ii, val in enumerate(test_leaves[idx]) if float(val) != 0]) + '\n')
Training FM
The data for training FM is ready, so we call LibFM.
64 iterations of SGD with a learning rate of 0.00000001; the trained model is saved as fm_model.
In the training log, the Train and Test values are accuracy, not loss.

cmd = "./libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 64 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -save_model fm_model"
os.popen(cmd).readlines()
Training output:

['----------------------------------------------------------------------------\n',
'libFM\n',
' Version: 1.4.4\n',
' Author: Steffen Rendle, srendle@libfm.org\n',
' WWW: http://www.libfm.org/\n',
'This program comes with ABSOLUTELY NO WARRANTY; for details see license.txt.\n',
'This is free software, and you are welcome to redistribute it under certain\n',
'conditions; for details see license.txt.\n',
'----------------------------------------------------------------------------\n',
'Loading train...\t\n',
'has x = 1\n',
'has xt = 0\n',
'num_rows=899991\tnum_values=28799712\tnum_features=32\tmin_target=0\tmax_target=1\n',
'Loading test... \t\n',
'has x = 1\n',
'has xt = 0\n',
'num_rows=100009\tnum_values=3200288\tnum_features=32\tmin_target=0\tmax_target=1\n',
'#relations: 0\n',
'Loading meta data...\t\n',
'learnrate=1e-08\n',
'learnrates=1e-08,1e-08,1e-08\n',
'#iterations=64\n',
"SGD: DON'T FORGET TO SHUFFLE THE ROWS IN TRAINING DATA TO GET THE BEST RESULTS.\n",
'#Iter= 0\tTrain=0.625438\tTest=0.619484\n',
'#Iter= 1\tTrain=0.636596\tTest=0.632013\n',
'#Iter= 2\tTrain=0.627663\tTest=0.623114\n',
'#Iter= 3\tTrain=0.609776\tTest=0.606605\n',
'#Iter= 4\tTrain=0.563581\tTest=0.56092\n',
'#Iter= 5\tTrain=0.497907\tTest=0.495655\n',
'#Iter= 6\tTrain=0.461677\tTest=0.461408\n',
'#Iter= 7\tTrain=0.453666\tTest=0.452639\n',
'#Iter= 8\tTrain=0.454026\tTest=0.453419\n',
'#Iter= 9\tTrain=0.456836\tTest=0.455919\n',
'#Iter= 10\tTrain=0.46032\tTest=0.459339\n',
'#Iter= 11\tTrain=0.466546\tTest=0.465358\n',
'#Iter= 12\tTrain=0.473565\tTest=0.472317\n',
'#Iter= 13\tTrain=0.481726\tTest=0.480967\n',
'#Iter= 14\tTrain=0.492357\tTest=0.491216\n',
'#Iter= 15\tTrain=0.504419\tTest=0.502935\n',
'#Iter= 16\tTrain=0.517793\tTest=0.516214\n',
'#Iter= 17\tTrain=0.533604\tTest=0.532102\n',
'#Iter= 18\tTrain=0.552926\tTest=0.5515\n',
'#Iter= 19\tTrain=0.575645\tTest=0.573198\n',
'#Iter= 20\tTrain=0.59418\tTest=0.590887\n',
'#Iter= 21\tTrain=0.610691\tTest=0.607815\n',
'#Iter= 22\tTrain=0.626138\tTest=0.623384\n',
'#Iter= 23\tTrain=0.640751\tTest=0.637923\n',
'#Iter= 24\tTrain=0.65393\tTest=0.652141\n',
'#Iter= 25\tTrain=0.666099\tTest=0.6641\n',
'#Iter= 26\tTrain=0.677933\tTest=0.675419\n',
'#Iter= 27\tTrain=0.689539\tTest=0.687108\n',
'#Iter= 28\tTrain=0.700177\tTest=0.697397\n',
'#Iter= 29\tTrain=0.709265\tTest=0.706156\n',
'#Iter= 30\tTrain=0.716553\tTest=0.713266\n',
'#Iter= 31\tTrain=0.723218\tTest=0.719635\n',
'#Iter= 32\tTrain=0.729163\tTest=0.726065\n',
'#Iter= 33\tTrain=0.734428\tTest=0.731354\n',
'#Iter= 34\tTrain=0.738863\tTest=0.735844\n',
'#Iter= 35\tTrain=0.74284\tTest=0.740323\n',
'#Iter= 36\tTrain=0.746316\tTest=0.743793\n',
'#Iter= 37\tTrain=0.749123\tTest=0.746333\n',
'#Iter= 38\tTrain=0.751573\tTest=0.748493\n',
'#Iter= 39\tTrain=0.753264\tTest=0.750292\n',
'#Iter= 40\tTrain=0.754803\tTest=0.751642\n',
'#Iter= 41\tTrain=0.756011\tTest=0.753062\n',
'#Iter= 42\tTrain=0.756902\tTest=0.753892\n',
'#Iter= 43\tTrain=0.757642\tTest=0.754872\n',
'#Iter= 44\tTrain=0.758293\tTest=0.755372\n',
'#Iter= 45\tTrain=0.758855\tTest=0.755782\n',
'#Iter= 46\tTrain=0.759293\tTest=0.756322\n',
'#Iter= 47\tTrain=0.759695\tTest=0.756652\n',
'#Iter= 48\tTrain=0.760084\tTest=0.756982\n',
'#Iter= 49\tTrain=0.760343\tTest=0.757252\n',
'#Iter= 50\tTrain=0.76055\tTest=0.757332\n',
'#Iter= 51\tTrain=0.760706\tTest=0.757582\n',
'#Iter= 52\tTrain=0.760944\tTest=0.757842\n',
'#Iter= 53\tTrain=0.761035\tTest=0.757952\n',
'#Iter= 54\tTrain=0.761173\tTest=0.758152\n',
'#Iter= 55\tTrain=0.761291\tTest=0.758382\n',
'#Iter= 56\tTrain=0.76142\tTest=0.758412\n',
'#Iter= 57\tTrain=0.761541\tTest=0.758452\n',
'#Iter= 58\tTrain=0.761677\tTest=0.758572\n',
'#Iter= 59\tTrain=0.76175\tTest=0.758692\n',
'#Iter= 60\tTrain=0.761829\tTest=0.758822\n',
'#Iter= 61\tTrain=0.761855\tTest=0.758862\n',
'#Iter= 62\tTrain=0.761918\tTest=0.759002\n',
'#Iter= 63\tTrain=0.761988\tTest=0.758972\n',
'Final\tTrain=0.761988\tTest=0.758972\n',
'Writing FM model to fm_model\n']
With the FM model trained, we feed the training, validation and test data through it to obtain the output of the FM layer; the output files are named *.fm.logits.

cmd = "./libfm/libfm/bin/libFM -task c -train ./data/train_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix tr"
os.popen(cmd).readlines()
cmd = "./libfm/libfm/bin/libFM -task c -train ./data/valid_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix va"
os.popen(cmd).readlines()
cmd = "./libfm/libfm/bin/libFM -task c -train ./data/test_lgb2fm.txt -test ./data/valid_lgb2fm.txt -dim '1,1,8' -iter 32 -method sgd -learn_rate 0.00000001 -regular '0,0,0.01' -init_stdev 0.1 -load_model fm_model -train_off true -prefix te -test2predict true"
os.popen(cmd).readlines()
Building the model
embed_dim = 32
sparse_max = 30000 # sparse_feature_dim = 117568
sparse_dim = 26
dense_dim = 13
out_dim = 400
Define the input placeholders

import tensorflow as tf
def get_inputs():
    dense_input = tf.placeholder(tf.float32, [None, dense_dim], name="dense_input")
    sparse_input = tf.placeholder(tf.int32, [None, sparse_dim], name="sparse_input")
    FFM_input = tf.placeholder(tf.float32, [None, 1], name="FFM_input")
    FM_input = tf.placeholder(tf.float32, [None, 1], name="FM_input")

    targets = tf.placeholder(tf.float32, [None, 1], name="targets")
    LearningRate = tf.placeholder(tf.float32, name="LearningRate")
    return dense_input, sparse_input, FFM_input, FM_input, targets, LearningRate
Feed the categorical features into the embedding layer to obtain the embedding vectors

def get_sparse_embedding(sparse_input):
with tf.name_scope("sparse_embedding"):
sparse_embed_matrix = tf.Variable(tf.random_uniform([sparse_max, embed_dim], -1, 1), name = "sparse_embed_matrix")
sparse_embed_layer = tf.nn.embedding_lookup(sparse_embed_matrix, sparse_input, name = "sparse_embed_layer")
sparse_embed_layer = tf.reshape(sparse_embed_layer, [-1, sparse_dim * embed_dim])
return sparse_embed_layer
Concatenate the numerical features with the embedding vectors and pass them through three fully connected layers with ReLU activations

def get_dnn_layer(dense_input, sparse_embed_layer):
with tf.name_scope("dnn_layer"):
input_combine_layer = tf.concat([dense_input, sparse_embed_layer], 1) #(?, 845 = 832 + 13)
fc1_layer = tf.layers.dense(input_combine_layer, out_dim, name = "fc1_layer", activation=tf.nn.relu)
fc2_layer = tf.layers.dense(fc1_layer, out_dim, name = "fc2_layer", activation=tf.nn.relu)
fc3_layer = tf.layers.dense(fc2_layer, out_dim, name = "fc3_layer", activation=tf.nn.relu)
return fc3_layer
Building the computation graph
As described earlier, the FFM and FM layer outputs each pass through a fully connected layer, are concatenated with the output of the three fully connected layers over the numerical features and embedding vectors, and the result goes into a final fully connected layer for logistic regression.
We use the LogLoss loss and optimize it with FtrlOptimizer.

tf.reset_default_graph()
train_graph = tf.Graph()
with train_graph.as_default():
    dense_input, sparse_input, FFM_input, FM_input, targets, lr = get_inputs()
    sparse_embed_layer = get_sparse_embedding(sparse_input)
    fc3_layer = get_dnn_layer(dense_input, sparse_embed_layer)

    ffm_fc_layer = tf.layers.dense(FFM_input, 1, name="ffm_fc_layer")
    fm_fc_layer = tf.layers.dense(FM_input, 1, name="fm_fc_layer")
    feature_combine_layer = tf.concat([ffm_fc_layer, fm_fc_layer, fc3_layer], 1)  # (?, 402)

    with tf.name_scope("inference"):
        logits = tf.layers.dense(feature_combine_layer, 1, name="logits_layer")
        pred = tf.nn.sigmoid(logits, name="prediction")

    with tf.name_scope("loss"):
        # LogLoss: logistic regression to the click-through rate
        # cost = tf.losses.sigmoid_cross_entropy(targets, logits)
        sigmoid_cost = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits, name="sigmoid_cost")
        logloss_cost = tf.losses.log_loss(labels=targets, predictions=pred)
        cost = logloss_cost  # + sigmoid_cost
        loss = tf.reduce_mean(cost)

    # optimize the loss
    # train_op = tf.train.AdamOptimizer(lr).minimize(loss)  # cost
    global_step = tf.Variable(0, name="global_step", trainable=False)
    optimizer = tf.train.FtrlOptimizer(lr)  # tf.train.FtrlOptimizer(lr) AdamOptimizer
    gradients = optimizer.compute_gradients(loss)  # cost
    train_op = optimizer.apply_gradients(gradients, global_step=global_step)

    # Accuracy
    with tf.name_scope("score"):
        correct_prediction = tf.equal(tf.to_float(pred > 0.5), targets)
        accuracy = tf.reduce_mean(tf.to_float(correct_prediction), name="accuracy")

    # auc, uop = tf.contrib.metrics.streaming_auc(pred, targets)
Hyperparameters
The dataset is large, so we run only one epoch.

# Number of Epochs
num_epochs = 1
# Batch Size
batch_size = 32

# Learning Rate
learning_rate = 0.01
# Show stats for every n number of batches
show_every_n_batches = 25

save_dir = './save'

ffm_tr_out_path = './tr_ffm.out.logit'
ffm_va_out_path = './va_ffm.out.logit'
fm_tr_out_path = './tr.fm.logits'
fm_va_out_path = './va.fm.logits'
train_path = './data/train.txt'
valid_path = './data/valid.txt'
Read the FFM output

ffm_train = pd.read_csv(ffm_tr_out_path, header=None)
ffm_train = ffm_train[0].values

ffm_valid = pd.read_csv(ffm_va_out_path, header=None)
ffm_valid = ffm_valid[0].values
Read the FM output

fm_train = pd.read_csv(fm_tr_out_path, header=None)
fm_train = fm_train[0].values

fm_valid = pd.read_csv(fm_va_out_path, header=None)
fm_valid = fm_valid[0].values
Read the datasets
Read the DNN data together with the FM and FFM outputs and concatenate them

train_data = pd.read_csv(train_path, header=None)
train_data = train_data.values

valid_data = pd.read_csv(valid_path, header=None)
valid_data = valid_data.values

cc_train = np.concatenate((ffm_train.reshape(-1, 1), fm_train.reshape(-1, 1), train_data), 1)
cc_valid = np.concatenate((ffm_valid.reshape(-1, 1), fm_valid.reshape(-1, 1), valid_data), 1)

np.random.shuffle(cc_train)
np.random.shuffle(cc_valid)

train_y = cc_train[:,-1]
test_y = cc_valid[:,-1]

train_X = cc_train[:,0:-1]
test_X = cc_valid[:,0:-1]
Training the network
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import time
import datetime
from sklearn.metrics import log_loss
from sklearn.learning_curve import learning_curve  # old scikit-learn location; not actually used below
from sklearn import metrics

# get_batches and save_params are helper functions defined elsewhere in the project notebook
def train_model(num_epochs):
    losses = {'train': [], 'test': []}
    acc_lst = {'train': [], 'test': []}
    pred_lst = []

    with tf.Session(graph=train_graph) as sess:

        # Keep track of gradient values and sparsity
        grad_summaries = []
        for g, v in gradients:
            if g is not None:
                grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name.replace(':', '_')), g)
                sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name.replace(':', '_')), tf.nn.zero_fraction(g))
                grad_summaries.append(grad_hist_summary)
                grad_summaries.append(sparsity_summary)
        grad_summaries_merged = tf.summary.merge(grad_summaries)

        # Output directory for models and summaries
        timestamp = str(int(time.time()))
        out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
        print("Writing to {}\n".format(out_dir))

        # Summaries for loss and accuracy
        loss_summary = tf.summary.scalar("loss", loss)
        # acc_summary = tf.scalar_summary("accuracy", accuracy)

        # Train Summaries
        train_summary_op = tf.summary.merge([loss_summary, grad_summaries_merged])
        train_summary_dir = os.path.join(out_dir, "summaries", "train")
        train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

        # Inference summaries
        inference_summary_op = tf.summary.merge([loss_summary])
        inference_summary_dir = os.path.join(out_dir, "summaries", "inference")
        inference_summary_writer = tf.summary.FileWriter(inference_summary_dir, sess.graph)

        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        saver = tf.train.Saver()
        for epoch_i in range(num_epochs):

            # generate training and test batches
            train_batches = get_batches(train_X, train_y, batch_size)
            test_batches = get_batches(test_X, test_y, batch_size)

            # training iterations; record the training loss
            for batch_i in range(len(train_X) // batch_size):
                x, y = next(train_batches)

                feed = {
                    dense_input: x.take([2,3,4,5,6,7,8,9,10,11,12,13,14], 1),
                    sparse_input: x.take([15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40], 1),
                    FFM_input: np.reshape(x.take(0, 1), [batch_size, 1]),
                    FM_input: np.reshape(x.take(1, 1), [batch_size, 1]),
                    targets: np.reshape(y, [batch_size, 1]),
                    lr: learning_rate}
                # _ = sess.run([train_op], feed)  # cost
                step, train_loss, summaries, _, prediction, acc = sess.run(
                    [global_step, loss, train_summary_op, train_op, pred, accuracy], feed)  # cost

                prediction = prediction.reshape(y.shape)
                losses['train'].append(train_loss)

                acc_lst['train'].append(acc)
                train_summary_writer.add_summary(summaries, step)

                if np.mean(y) != 0:
                    auc = metrics.roc_auc_score(y, prediction)
                else:
                    auc = -1

                # Show every <show_every_n_batches> batches
                if (epoch_i * (len(train_X) // batch_size) + batch_i) % show_every_n_batches == 0:
                    time_str = datetime.datetime.now().isoformat()
                    print('{}: Epoch {:>3} Batch {:>4}/{} train_loss = {:.3f} accuracy = {} auc = {}'.format(
                        time_str,
                        epoch_i,
                        batch_i,
                        (len(train_X) // batch_size),
                        train_loss,
                        acc,
                        auc))
                    # print(metrics.classification_report(y, np.float32(prediction > 0.5)))

            # iterate over the test data
            for batch_i in range(len(test_X) // batch_size):
                x, y = next(test_batches)

                feed = {
                    dense_input: x.take([2,3,4,5,6,7,8,9,10,11,12,13,14], 1),
                    sparse_input: x.take([15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40], 1),
                    FFM_input: np.reshape(x.take(0, 1), [batch_size, 1]),
                    FM_input: np.reshape(x.take(1, 1), [batch_size, 1]),
                    targets: np.reshape(y, [batch_size, 1]),
                    lr: learning_rate}
                # Get Prediction
                step, test_loss, summaries, prediction, acc = sess.run(
                    [global_step, loss, inference_summary_op, pred, accuracy], feed)  # cost

                # record the test loss and accuracy
                prediction = prediction.reshape(y.shape)
                losses['test'].append(test_loss)

                acc_lst['test'].append(acc)
                inference_summary_writer.add_summary(summaries, step)
                pred_lst.append(prediction)

                if np.mean(y) != 0:
                    auc = metrics.roc_auc_score(y, prediction)
                else:
                    auc = -1

                time_str = datetime.datetime.now().isoformat()
                if (epoch_i * (len(test_X) // batch_size) + batch_i) % show_every_n_batches == 0:
                    print('{}: Epoch {:>3} Batch {:>4}/{} test_loss = {:.3f} accuracy = {} auc = {}'.format(
                        time_str,
                        epoch_i,
                        batch_i,
                        (len(test_X) // batch_size),
                        test_loss,
                        acc,
                        auc))
                    print(metrics.classification_report(y, np.float32(prediction > 0.5)))

        # Save Model
        saver.save(sess, save_dir)  # , global_step=epoch_i
        print('Model Trained and Saved')
        save_params((losses, acc_lst, pred_lst, save_dir))
        return losses, acc_lst, pred_lst, save_dir
losses, acc_lst, pred_lst, load_dir = train_model(1)
Training information on the validation set
Mean accuracy
Mean loss
Mean AUC
Mean predicted click-through rate
Precision, recall, F1 score, etc.
Because most of the data are negative examples and there are few positives, a model that always predicts 0 would already reach about 75% accuracy, so accuracy by itself is not a trustworthy metric.

We need to pay attention to the precision and recall of the positive class, and above all to the LogLoss value, because the competition is evaluated on LogLoss, not AUC.
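Since LogLoss is the metric that matters, a quick sanity check on the stored validation predictions can look like this (a sketch: pred_lst and test_y come from train_model above, and the tail of test_y is trimmed because the last incomplete batch is dropped, as in train_info below):

from sklearn.metrics import log_loss
import numpy as np

val_pred = np.array(pred_lst).reshape(-1, 1)
print("Validation LogLoss: {:.5f}".format(log_loss(test_y[:len(val_pred)], val_pred)))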

def train_info():
print("Test Mean Acc : {}".format(np.mean(acc_lst['test']))) #test_pred_mean
print("Test Mean Loss : {}".format(np.mean(losses['test']))) #test_pred_mean
print("Mean Auc : {}".format(metrics.roc_auc_score(test_y[:-9], np.array(pred_lst).reshape(-1, 1))))
print("Mean prediction : {}".format(np.mean(np.array(pred_lst).reshape(-1, 1))))
print(metrics.classification_report(test_y[:-9], np.float32(np.array(pred_lst).reshape(-1, 1) > 0.5)))
Test Mean Acc : 0.7814300060272217
Test Mean Loss : 0.46838584542274475
Mean Auc : 0.7792937214782675
Mean prediction : 0.2552148997783661
             precision    recall  f1-score   support

         0.0       0.81      0.93      0.86     74426
         1.0       0.63      0.34      0.45     25574

 avg / total       0.76      0.78      0.76    100000
Viewing the loss in TensorBoard


Summary
That is the complete click-through rate prediction pipeline. We did not train on the full dataset, and there are many hyperparameters that could still be tuned. From a single epoch, the LogLoss on the validation set is 0.46 and the other metrics sit between 75% and 80%, roughly matching the accuracy reached by the FFM, GBDT and FM components.

Further reading:
Code for the 3rd place finish for Avazu Click-Through Rate Prediction
Kaggle: Display Advertising Challenge (CTR prediction)
Modeling CTR prediction with machine learning (用機器學習對CTR預估建模)
Beginner's Guide to Click-Through Rate Prediction with Logistic Regression
2nd place solution for Avazu click-through rate prediction competition
A summary of common CTR prediction algorithms in computational advertising (常見計算廣告點擊率預估算法總結)
3 Idiots' Approach for Display Advertising Challenge
Solution to the Outbrain Click Prediction competition
Deep Interest Network for Click-Through Rate Prediction
Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction
Alimama's first public disclosure of its in-house CTR algorithm MLR (重磅!阿里媽媽首次公開自研CTR預估核心算法MLR)
Alibaba's Gai Kun team proposes the Deep Interest Network (阿里蓋坤團隊提出深度興趣網絡)
FFM principles and practice in depth (深入FFM原理與實踐)

That's all for today's sharing.

Author: 你先等等. Source: CSDN. Original: https://blog.csdn.net/chengcheng1394/article/details/78940565. Copyright notice: this is the blogger's original article; please include a link to the original post when reposting.
