Wide & Deep Learning Model

Generalized linear models with nonlinear feature transformations (feature engineering + a linear model) are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable (the feature coefficients learned by a linear model are highly interpretable), while generalization requires more feature engineering effort (a linear model needs substantial feature engineering to generalize well).

With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features (a DNN learns low-dimensional embeddings from the sparse feature vectors, so it generalizes well, but it may underfit). However, deep neural networks with embeddings can over-generalize.

Wide & Deep learning—jointly trained wide linear models and deep neural networks—combines the benefits of memorization and generalization for recommender systems.

The Wide Component

The wide component is a generalized linear model of the form $y = w^T x + b$, as illustrated in Figure 1 (left). $y$ is the prediction, $x = [x_1, x_2, ..., x_d]$ is a vector of d features, $w = [w_1, w_2, ..., w_d]$ are the model parameters and $b$ is the bias. The feature set includes raw input features and transformed features (e.g., cross-product features).
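As a minimal sketch (a hypothetical helper, not code from the paper or the Paddle demo), a cross-product transformation over binary features fires only when every one of its constituent features is 1, e.g. AND(gender=female, language=en):

def cross_product(x, feature_names):
    '''x: dict mapping a binary feature name to 0/1; feature_names: the subset defining this cross feature.'''
    return int(all(x.get(name, 0) == 1 for name in feature_names))

x = {'gender=female': 1, 'language=en': 1, 'language=zh': 0}
print(cross_product(x, ['gender=female', 'language=en']))  # 1
print(cross_product(x, ['gender=female', 'language=zh']))  # 0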

The Deep Component

The deep component is a feed-forward neural network, as shown in Figure 1 (right). For categorical features, the original inputs are feature strings (e.g., "language=en"). Each of these sparse, high-dimensional categorical features is first converted into a low-dimensional and dense real-valued vector, often referred to as an embedding vector (there are two common ways to do this: embed each field's features separately to get one vector per field, or one-hot encode all fields' features together and pass them through a single embedding layer to get one vector; the latter is used here). The dimensionality of the embeddings is usually on the order of O(10) to O(100). The embedding vectors are initialized randomly and then the values are trained to minimize the final loss function during model training. These low-dimensional dense embedding vectors are then fed into the hidden layers of a neural network in the forward pass. Specifically, each hidden layer performs the following computation:

$a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})$

where $l$ is the layer number and $f$ is the activation function, often rectified linear units (ReLUs). $a^{(l)}$, $b^{(l)}$, and $W^{(l)}$ are the activations, bias, and model weights at the $l$-th layer.
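A toy numpy sketch (assumed shapes and values, not the paper's or Paddle's code) of this path: a sparse categorical feature is looked up in a randomly initialized embedding table, and the resulting dense vector is passed through one ReLU hidden layer exactly as in the equation above:

import numpy as np

rng = np.random.RandomState(0)

# embedding lookup: a sparse categorical feature becomes a dense O(10)-O(100) dim vector
vocab_size, emb_dim = 1000, 32
embedding_table = rng.randn(vocab_size, emb_dim) * 0.01   # random init, trained against the loss
feature_index = 17                                        # assumed index of e.g. "language=en"
a_0 = embedding_table[feature_index]                      # a^(0), input to the hidden layers

# one hidden-layer step: a^(l+1) = f(W^(l) a^(l) + b^(l)), with f = ReLU
W_0 = rng.randn(64, emb_dim) * 0.1
b_0 = np.zeros(64)
a_1 = np.maximum(W_0 @ a_0 + b_0, 0.0)
print(a_1.shape)                                          # (64,)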

Joint Training of Wide & Deep Model

The wide component and deep component are combined using a weighted sum of their output log odds as the prediction, which is then fed to one common logistic loss function for joint training (the wide part and the deep part each produce a score; their weighted sum, with learned weights, goes straight into the log loss, which is essentially logistic regression). (In the Paddle implementation: the LR and DNN outputs are concatenated into a two-dimensional vector and passed through a fully connected layer with a sigmoid activation to produce the final probability, which amounts to the same thing.) Note that there is a distinction between joint training and ensemble. In an ensemble, individual models are trained separately without knowing each other, and their predictions are combined only at inference time but not at training time. In contrast, joint training optimizes all parameters simultaneously by taking both the wide and deep part as well as the weights of their sum into account at training time. For joint training the wide part only needs to complement the weaknesses of the deep part with a small number of cross-product feature transformations, rather than a full-size wide model.

Joint training of a Wide & Deep Model is done by backpropagating the gradients from the output to both the wide and deep part of the model simultaneously using mini-batch stochastic optimization. In the experiments, we used the Follow-the-regularized-leader (FTRL) algorithm [3] with L1 regularization as the optimizer for the wide part of the model, and AdaGrad [1] for the deep part (the wide part uses the FTRL optimizer, i.e., SGD with L1 regularization, and the deep part uses AdaGrad; note, however, that in Paddle only one optimization method can be specified per model).

The combined model is illustrated in Figure 1 (center). For a logistic regression problem, the model’s prediction is:

$P(Y=1|x) = \sigma(w_{wide}^T [x, \phi(x)] + w_{deep}^T a^{(l_f)} + b)$

where Y is the binary class label, σ(·) is the sigmoid function, φ(x) are the cross product transformations of the original features x, and b is the bias term. $w_{wide}$ is the vector of all wide model weights, and $w_{deep}$ are the weights applied on the final activations $a^{(l_f)}$.
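As a toy illustration with assumed numbers (not taken from the paper), the prediction can be computed directly from these definitions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# P(Y=1|x) = sigma( w_wide^T [x, phi(x)] + w_deep^T a^(l_f) + b )
x_and_phi = np.array([1.0, 0.0, 1.0, 1.0])   # raw features x concatenated with phi(x)
w_wide = np.array([0.2, -0.1, 0.4, 0.3])     # wide model weights
a_lf = np.array([0.5, 1.2, 0.0])             # final activations of the deep part
w_deep = np.array([0.1, -0.3, 0.7])          # weights on the final activations
b = -0.2
p = sigmoid(w_wide @ x_and_phi + w_deep @ a_lf + b)
print(p)                                     # a probability between 0 and 1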

In a real project, the input to the deep part consists of categorical features, which are one-hot encoded. The wide part is essentially an LR model fed with statistical features, CVR features, etc.; the statistical features are one-hot encoded, and the CVR features are discretized first and then one-hot encoded.
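A minimal preprocessing sketch with assumed bucket boundaries (the helper and the boundaries are hypothetical, not from this project): a continuous CVR feature is discretized into a bucket id and emitted as the (index, value) pair that a sparse input layer expects:

import bisect

cvr_boundaries = [0.01, 0.05, 0.1, 0.2]       # assumed bucket boundaries

def cvr_to_sparse(cvr, offset=0):
    # bucket id in [0, len(cvr_boundaries)], one-hot encoded as a single (index, value) pair
    bucket = bisect.bisect_right(cvr_boundaries, cvr)
    return [(offset + bucket, 1.0)]

print(cvr_to_sparse(0.07))    # [(2, 1.0)]: 0.07 falls into the (0.05, 0.1] bucket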

Building the network:

# Imports assumed for the PaddlePaddle v2 API used below (as in the original
# network_conf.py of the Paddle CTR demo; ModelType is a small helper from that demo).
import paddle.v2 as paddle
from paddle.v2 import layer
from paddle.v2 import data_type as dtype
from utils import ModelType


class CTRmodel(object):
    '''
    A CTR model which implements wide && deep learning model.
    '''

    def __init__(self,
                 dnn_layer_dims,
                 dnn_input_dim,
                 lr_input_dim,
                 model_type=ModelType.create_classification(),
                 is_infer=False):
        '''
        @dnn_layer_dims: list of integer
            dims of each layer in dnn
        @dnn_input_dim: int
            size of dnn's input layer
        @lr_input_dim: int
            size of lr's input layer
        @is_infer: bool
            whether to build an inference model
        '''
        self.dnn_layer_dims = dnn_layer_dims
        self.dnn_input_dim = dnn_input_dim
        self.lr_input_dim = lr_input_dim
        self.model_type = model_type
        self.is_infer = is_infer

        self._declare_input_layers()

        self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
        self.lr = self._build_lr_submodel_()

        # model's prediction
        # TODO(superjom) rename it to prediction
        if self.model_type.is_classification():
            self.model = self._build_classification_model(self.dnn, self.lr)
        if self.model_type.is_regression():
            self.model = self._build_regression_model(self.dnn, self.lr)

    # layer.data: define DataLayer For NeuralNetwork.
    def _declare_input_layers(self):
        # Input of the deep part: categorical features, one-hot encoded.
        # Sparse binary vector: the input feature is a sparse vector and every
        # element in this vector is either zero or one.
        # A sparse_binary_vector input is the list of indices of the non-zero features.
        self.dnn_merged_input = layer.data(
            name='dnn_input',
            type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))

        # Input of the wide part: statistical features, CVR features, etc.
        # Statistical features are one-hot encoded; CVR features are discretized
        # first and then one-hot encoded.
        # Sparse vector: the input feature is a sparse vector. Most of the elements
        # in this vector are zero, the others can be any float value.
        # A sparse_vector input is a list of (index, value) pairs.
        self.lr_merged_input = layer.data(
            name='lr_input',
            type=paddle.data_type.sparse_vector(self.lr_input_dim))

        # Label for the binary classification model.
        # Dense vector: the input feature is a dense float vector.
        if not self.is_infer:
            self.click = paddle.layer.data(
                name='click', type=dtype.dense_vector(1))

    # The deep part is a standard multi-layer feed-forward DNN. The input features
    # are all one-hot encoded and then embedded as a whole through a single embedding
    # layer, giving one dense vector.
    # (DeepFM, by contrast, embeds each field separately with multiple embedding layers.)
    # Note: dnn_layer_dims = [128, 64, 32, 1] is used below.
    def _build_dnn_submodel_(self, dnn_layer_dims):
        '''
        build DNN submodel.
        '''
        dnn_embedding = layer.fc(
            input=self.dnn_merged_input, size=dnn_layer_dims[0])
        _input_layer = dnn_embedding
        for i, dim in enumerate(dnn_layer_dims[1:]):
            fc = layer.fc(
                input=_input_layer,
                size=dim,
                act=paddle.activation.Relu(),
                name='dnn-fc-%d' % i)
            _input_layer = fc
        return _input_layer

    # The wide part is simply an LR model; ReLU is used as the activation to speed it up.
    def _build_lr_submodel_(self):
        '''
        config LR submodel
        '''
        # size is the layer dimension, i.e. the number of neurons in the layer.
        fc = layer.fc(
            input=self.lr_merged_input, size=1, act=paddle.activation.Relu())
        return fc

    # Fuse the wide and deep parts.
    def _build_classification_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        # the sigmoid outputs a probability
        self.output = layer.fc(
            input=merge_layer,
            size=1,
            # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
            act=paddle.activation.Sigmoid())

        if not self.is_infer:
            # multi_binary_label_cross_entropy_cost: a loss layer for multi binary label cross entropy
            # cross-entropy loss for the classification problem
            self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
                input=self.output, label=self.click)
        return self.output

    def _build_regression_model(self, dnn, lr):
        merge_layer = layer.concat(input=[dnn, lr])
        self.output = layer.fc(
            input=merge_layer, size=1, act=paddle.activation.Sigmoid())
        if not self.is_infer:
            # MSE loss for the regression problem
            self.train_cost = paddle.layer.mse_cost(
                input=self.output, label=self.click)
        return self.output

Training the model:

# Imports assumed for this training script (as in the demo's train.py; parse_args
# is the demo's command-line parser, defined elsewhere in the same script).
import gzip

import paddle.v2 as paddle

import reader
from network_conf import CTRmodel
from utils import logger, ModelType

dnn_layer_dims = [128, 64, 32, 1]

# ==============================================================================
#                   cost and train period
# ==============================================================================


def train():
    args = parse_args()
    args.model_type = ModelType(args.model_type)
    paddle.init(use_gpu=False, trainer_count=1)
    dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)

    # create ctr model.
    model = CTRmodel(
        dnn_layer_dims,
        dnn_input_dim,
        lr_input_dim,
        model_type=args.model_type,
        is_infer=False)
    # `params` is a dictionary containing Paddle's parameters; the input is the network's cost layer.
    params = paddle.parameters.create(model.train_cost)
    optimizer = paddle.optimizer.AdaGrad()

    trainer = paddle.trainer.SGD(
        cost=model.train_cost, parameters=params, update_equation=optimizer)

    dataset = reader.Dataset()

    def __event_handler__(event):
        if isinstance(event, paddle.event.EndIteration):
            num_samples = event.batch_id * args.batch_size
            if event.batch_id % 100 == 0:
                logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
                    event.pass_id, num_samples, event.cost, event.metrics))

            if event.batch_id % 1000 == 0:
                if args.test_data_path:
                    result = trainer.test(
                        reader=paddle.batch(
                            dataset.test(args.test_data_path),
                            batch_size=args.batch_size),
                        feeding=reader.feeding_index)
                    logger.warning("Test %d-%d, Cost %f, %s" %
                                   (event.pass_id, event.batch_id, result.cost,
                                    result.metrics))

                path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
                    args.model_output_prefix, event.pass_id, event.batch_id,
                    result.cost)
                with gzip.open(path, 'w') as f:
                    params.to_tar(f)

    trainer.train(
        # shuffle: reads buf_size training samples into a buffer, shuffles them,
        # and then yields them one by one
        # a batched reader yields one minibatch at a time
        # num_passes: the total number of training passes
        # feeding (dict|list): a map from network input name to the index of the
        # corresponding field in what the reader returns
        reader=paddle.batch(
            paddle.reader.shuffle(
                dataset.train(args.train_data_path), buf_size=500),
            batch_size=args.batch_size),
        feeding=reader.feeding_index,
        event_handler=__event_handler__,
        num_passes=args.num_passes)

 

Hyperparameter Tuning

Initializing parameters

By default, PaddlePaddle initializes parameters with mean 0 and standard deviation $\frac{1}{\sqrt{d}}$, where $d$ is the width of the parameter matrix. This initialization usually gives reasonable results. If you want to customize the initialization, PaddlePaddle currently provides two ways to initialize parameters, set via param_attr when defining a layer (see the sketch after the list below):

  • Gaussian distribution: set param_attr to param_attr=ParamAttr(initial_mean=0.0, initial_std=1.0)
  • Uniform distribution: set param_attr to param_attr=ParamAttr(initial_max=1.0, initial_min=-1.0)
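A minimal sketch of both options applied to an fc layer (assuming ParamAttr and the layer module are imported as in the network definition above; some_input stands for any preceding layer):

fc_gaussian = layer.fc(
    input=some_input,
    size=128,
    act=paddle.activation.Relu(),
    param_attr=ParamAttr(initial_mean=0.0, initial_std=1.0))   # Gaussian initialization

fc_uniform = layer.fc(
    input=some_input,
    size=128,
    act=paddle.activation.Relu(),
    param_attr=ParamAttr(initial_max=1.0, initial_min=-1.0))   # uniform initialization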

 The fc layer in Paddle takes parameters such as input, size, act, name, param_attr and bias_attr; param_attr is where the initialization above is specified.

 

Adjusting the learning rate

When defining the optimizer above,

optimizer = paddle.optimizer.AdaGrad()

you can specify the relevant parameters, for example:

    adam_optimizer = paddle.optimizer.Adam(
        learning_rate=1e-3,
        regularization=paddle.optimizer.L2Regularization(rate=1e-3),
        model_average=paddle.optimizer.ModelAverage(average_window=0.5))

 

 

http://www.datakit.cn/blog/2016/08/21/wdnn.html

 

Using pre-trained word vectors in embedding layer:

https://github.com/PaddlePaddle/Paddle/issues/490

 

Recommender system implementation in Paddle:

http://book.paddlepaddle.org/index.cn.html
