CTR Learning Notes & Code Implementation 4 - Deep CTR Models: NFM / AFM

In this post we cover two more relatives of FM: NFM and AFM. Both of them rework the Deep part of Wide&Deep. In the previous post, PNN used the inner and outer products of vectors to extract feature-interaction information. There are only so many kinds of vector products, so NFM shows up with the element-wise (Hadamard) product, and AFM then introduces an attention mechanism that turns NFM's equal-weight sum into a weighted sum.

The code below works on dense inputs, which I find makes the model structure easier to follow. The sparse-input version and the complete code are here 👇
https://github.com/DSXiangLi/CTR

NFM

NFM's innovation lies in the Deep part of Wide&Deep: it inserts a BI-Pooling (Bi-Interaction Pooling) layer between the embedding layer and the fully connected layers. The pairwise element-wise products of the embeddings give \(N*(N-1)/2\) vectors of size \(1*K\), which are then sum-pooled into a single \(1*K\) vector.

\[f_{BI}(V_x) = \sum_{i=1}^n\sum_{j=i+1}^n (x_iv_i) \odot (x_jv_j) \]
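
Just like the FM second-order term, this pairwise sum does not need an explicit loop over all pairs; it equals half the difference between the square of the sum and the sum of the squares (with \(\odot 2\) denoting the element-wise square), which is the form used in the code below:

\[f_{BI}(V_x) = \frac{1}{2}\Big[\big(\sum_{i=1}^n x_iv_i\big)^{\odot 2} - \sum_{i=1}^n (x_iv_i)^{\odot 2}\Big] \]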

The model structure of the Deep part is shown below.

Relationship to other models

If NFM drops the fully connected layers and outputs the BI-Pooling result directly with weight = 1, it reduces to FM, so NFM can learn higher-order feature interactions on top of FM.
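
To make that claim concrete, summing the BI-Pooling vector with an all-ones weight vector \(h\) recovers exactly the second-order term of FM:

\[h^T f_{BI}(V_x)\big|_{h=\mathbf{1}} = \sum_{i=1}^n\sum_{j=i+1}^n \langle v_i, v_j \rangle x_ix_j \]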

One way of putting it I have seen is that DeepFM connects FM and Deep in parallel, while NFM connects FM and Deep in series. That is a reasonable way to look at it, but I feel the essence is that they learn different kinds of information: putting FM on the wide side helps learn second-order 'memorization' features, while putting it on the Deep side helps learn higher-order 'generalization' features.

Both NFM and PNN use vector products to help the fully connected layers extract feature-interaction information. Although one uses the element-wise product and the other the inner product, the difference really comes down to which axis the sum-pooling is taken over: IPNN sums over the \(k\) axis, giving \(N^2\) scalars that are concatenated into the input, while NFM sums over the \(N^2\) axis, giving a \(1*K\) input.

The following example gives a fairly intuitive comparison of how FM, NFM, and IPNN treat the embeddings (the embedding values are kept simple for ease of understanding); a small NumPy sketch reproducing it follows right after.

\[\begin{align} & embedding_1 = [0.5,0.5,0.5]\\ & embedding_2 = [2,2,2]\\ & embedding_3 = [4,4,4]\\ & embedding_1 \odot embedding_2 = [1,1,1]\\ & embedding_1 \odot embedding_3 = [2,2,2]\\ & embedding_2 \odot embedding_3 = [8,8,8]\\ & IPNN = [3,6,24] \\ & NFM = [11,11,11]\\ & FM = [33]\\ \end{align} \]
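
A minimal NumPy sketch (not part of the repo) that reproduces the toy example above and makes the pooling-axis difference explicit:

import numpy as np

# the three embeddings from the toy example
emb = np.array([[0.5, 0.5, 0.5],
                [2.0, 2.0, 2.0],
                [4.0, 4.0, 4.0]])

# all pairwise element-wise products: N*(N-1)/2 vectors of size K
pairs = np.array([emb[i] * emb[j]
                  for i in range(len(emb))
                  for j in range(i + 1, len(emb))])  # shape (3, 3)

ipnn = pairs.sum(axis=1)  # sum over the embedding axis -> one scalar per pair: [ 3.  6. 24.]
nfm = pairs.sum(axis=0)   # sum over the pair axis      -> one K-dim vector:   [11. 11. 11.]
fm = pairs.sum()          # sum over everything         -> a single scalar:    33.0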

A few gripes about NFM

  • Like FNN and PNN, it does a limited job of extracting low-order features
  • The sum-pooling still loses information: different feature interactions affect the target differently, so equal-weight summation is certainly not the best choice, but it does offer a new way to handle feature interactions

Code implementation

@tf_estimator_model
def model_fn_dense(features, labels, mode, params):
    dense_feature, sparse_feature = build_features()
    dense = tf.feature_column.input_layer(features, dense_feature)
    sparse = tf.feature_column.input_layer(features, sparse_feature)

    field_size = len( dense_feature )
    embedding_size = dense_feature[0].variable_shape.as_list()[-1]
    embedding_matrix = tf.reshape( dense, [-1, field_size, embedding_size] )  # batch * field_size *emb_size

    with tf.variable_scope('Linear_output'):
        linear_output = tf.layers.dense( sparse, units=1 )
        add_layer_summary( 'linear_output', linear_output )

    with tf.variable_scope('BI_Pooling'):
        # Bi-Interaction pooling via the square-of-sum minus sum-of-squares identity;
        # the 1/2 factor is dropped here since the following dense layer can absorb it
        sum_square = tf.pow(tf.reduce_sum(embedding_matrix, axis=1), 2)  # (sum_i x_i*v_i)^2 : batch * emb_size
        square_sum = tf.reduce_sum(tf.pow(embedding_matrix, 2), axis=1)  # sum_i (x_i*v_i)^2 : batch * emb_size
        dense = tf.subtract(sum_square, square_sum)
        add_layer_summary( dense.name, dense )

    dense = stack_dense_layer(dense, params['hidden_units'],
                              dropout_rate = params['dropout_rate'], batch_norm = params['batch_norm'],
                              mode = mode, add_summary = True)

    with tf.variable_scope('output'):
        y = linear_output + dense
        add_layer_summary( 'output', y )

    return y
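
For context, a hypothetical usage sketch: the param values and model_dir below are made up, and it assumes the @tf_estimator_model decorator turns the returned logit into a full EstimatorSpec.

import tensorflow as tf

estimator = tf.estimator.Estimator(
    model_fn=model_fn_dense,
    params={
        'hidden_units': [128, 64, 32],  # assumed sizes for the stacked dense layers
        'dropout_rate': 0.2,
        'batch_norm': True
    },
    model_dir='./checkpoint/NFM'  # hypothetical checkpoint directory
)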

AFM

AFM, like NFM, uses the element-wise product to extract feature-interaction information. Unlike NFM, which pools with equal weights, AFM adds an Attention Layer to learn the pooling weights.

The model structure of the Deep part is as follows

\[\begin{align} f_{Att} = \sum_{i=1}^n\sum_{j=i+1}^n a_{ij}(v_ix_i) \odot (v_jx_j) \end{align} \]

The attention part is a simple fully connected net whose output is an \(N(N-1)/2\)-dimensional vector, used as the sum-pooling weights to take a weighted sum of the element-wise interaction vectors. The weighted-sum vector is connected directly to the output without any further fully connected layers. If all the weights are 1, AFM is identical to NFM without its fully connected layers.

\[\begin{align} a_{ij}' &= h^T ReLU(W ((v_ix_i) \odot (v_jx_j)) + b) \\ a_{ij} &= \frac{exp(a_{ij}')}{\sum_{(i,j)}exp(a_{ij}')}\\ \end{align} \]
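
A minimal NumPy sketch (with made-up weights W, b, h and shapes) of the attention net: score each interaction vector, take the softmax over the \(N(N-1)/2\) pairs, then do the weighted sum-pooling:

import numpy as np

np.random.seed(0)
n_pairs, k, t = 3, 4, 8                      # N*(N-1)/2 pairs, embedding size K, attention factor t
pairs = np.random.randn(n_pairs, k)          # element-wise interaction vectors (x_i*v_i) ⊙ (x_j*v_j)
W, b, h = np.random.randn(t, k), np.random.randn(t), np.random.randn(t)

logits = np.maximum(pairs @ W.T + b, 0) @ h          # a'_ij = h^T ReLU(W p_ij + b), shape (n_pairs,)
weights = np.exp(logits) / np.exp(logits).sum()      # softmax over the pairs, not over the feature axis
pooled = (weights[:, None] * pairs).sum(axis=0)      # attention-weighted sum-pooling -> K-dim vector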

A few gripes about AFM

  • Having no fully connected layers limits its ability to express higher-order features, but that is not really the point; AFM's main contribution is bringing the attention idea to feature interactions

Code implementation

@tf_estimator_model
def model_fn_dense(features, labels, mode, params):
    dense_feature, sparse_feature = build_features()
    dense = tf.feature_column.input_layer(features, dense_feature) # concatenated embeddings of the dense features
    sparse = tf.feature_column.input_layer(features, sparse_feature)

    field_size = len( dense_feature )
    embedding_size = dense_feature[0].variable_shape.as_list()[-1]
    embedding_matrix = tf.reshape( dense, [-1, field_size, embedding_size] )  # batch * field_size *emb_size

    with tf.variable_scope('Linear_part'):
        linear_output = tf.layers.dense(sparse, units=1)
        add_layer_summary( 'linear_output', linear_output )

    with tf.variable_scope('Elementwise_Interaction'):
        elementwise_list = []
        for i in range(field_size):
            for j in range(i+1, field_size):
                vi = tf.gather(embedding_matrix, indices=i, axis=1, batch_dims=0,name = 'vi') # batch * emb_size
                vj = tf.gather(embedding_matrix, indices=j, axis=1, batch_dims=0,name = 'vj')
                elementwise_list.append(tf.multiply(vi,vj)) # batch * emb_size
        elementwise_matrix = tf.stack(elementwise_list) # (N*(N-1)/2) * batch * emb_size
        elementwise_matrix = tf.transpose(elementwise_matrix, [1,0,2]) # batch * (N*(N-1)/2) * emb_size

    with tf.variable_scope('Attention_Net'):
        # attention net: one hidden layer, then a projection to a scalar score per interaction pair
        dense = tf.layers.dense(elementwise_matrix, units = params['attention_factor'], activation = 'relu') # batch * (N*(N-1)/2) * t
        add_layer_summary( dense.name, dense )
        attention_logit = tf.layers.dense(dense, units=1) # batch * (N*(N-1)/2) * 1
        # softmax must be taken over the interaction pairs (axis=1); using 'softmax' as the
        # activation of a 1-unit layer would normalize over the last axis and return all ones
        attention_weight = tf.nn.softmax(attention_logit, axis=1)
        add_layer_summary( attention_weight.name, attention_weight)

    with tf.variable_scope('Attention_pooling'):
        interaction_output = tf.reduce_sum(tf.multiply(elementwise_matrix, attention_weight), axis=1) # weighted sum over pairs: batch * emb_size
        interaction_output = tf.layers.dense(interaction_output, units=1) # project the pooled vector to a scalar: batch * 1

    with tf.variable_scope('output'):
        y = interaction_output + linear_output
        add_layer_summary( 'output', y )

    return y

The CTR Learning Notes & Code Implementation series 👇

https://github.com/DSXiangLi/CTR

CTR Learning Notes & Code Implementation 1 - The Prelude to Deep Learning: LR -> FFM
CTR Learning Notes & Code Implementation 2 - Deep CTR Models: MLP -> Wide&Deep
CTR Learning Notes & Code Implementation 3 - Deep CTR Models: FNN -> PNN -> DeepFM


References

  1. Jun Xiao, Hao Ye, et al., 2017, Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks
  2. Xiangnan He, Tat-Seng Chua, 2017, Neural Factorization Machines for Sparse Predictive Analytics
  3. https://zhuanlan.zhihu.com/p/86181485