Notes on "DRN: A Deep Reinforcement Learning Framework for News Recommendation": reinforcement learning for recommender systems

Date: 2019-12-05


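Abstract

In news recommendation, news features are highly dynamic (the "dynamic nature of news features"). Some existing online models already take this into account, but they have three problems:

1. They only model the current reward (e.g., CTR).

2. Few of them use user feedback other than click / no click labels (e.g., how frequently a user returns).

3. They tend to keep recommending news with similar content, and users get bored after seeing too much of it.

To address these problems, the paper proposes DRN, a framework based on Deep Q-Learning that explicitly models future reward. To capture more user feedback, it uses the user return pattern as a supplement to click labels.

It also includes an exploration strategy to discover users' new interests.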

1 INTRODUCTION

There is far too much news for anyone to read all of it, so news recommendation is essential. Existing news recommendation techniques fall into the following groups:

content based methods [19, 22, 33]

collaborative filtering based methods [11, 28, 34]

hybrid methods [12, 24, 25]

deep learning models [8, 45, 52]

However, these methods face three challenges.

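1. The dynamic changes in news recommendations are difficult to handle.

1- News becomes outdated quickly, so the set of news candidates changes fast.

2- Users' interest in different news also keeps changing, so the model has to be updated periodically. Although some online models can capture these dynamic features, they only optimize the current reward (e.g., CTR) and ignore the effect that the current recommendation has on the future.

2. Current recommendation methods [23, 35, 36, 43] usually only consider the click / no click labels or ratings as users' feedback.

3. Current methods tend to keep recommending similar items to users, which might decrease users' interest in similar topics.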

Some methods add randomness to the recommendation, for example a simple ϵ-greedy strategy [31] or Upper Confidence Bound (UCB) [23, 43] (mainly for Multi-Armed Bandit methods). Both have drawbacks:

ϵ-greedy strategy may recommend the customer with totally unrelated items

UCB can not get a relatively accurate reward estimation for an item until this item has been tried several times
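The two drawbacks are easier to see in code. The following is a minimal sketch of textbook ϵ-greedy and UCB1 arm selection, not the implementations used in the cited work; the function names and the UCB1 bonus term are assumptions chosen for illustration.

```python
import math
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick a uniformly random arm with probability epsilon, else the best-looking arm."""
    if rng.random() < epsilon:
        # The random pick ignores relevance entirely, which is why a user
        # can be shown a totally unrelated item.
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])


def ucb1(estimates, counts, total_pulls):
    """Pick the arm with the highest mean-plus-confidence-bonus score (UCB1)."""
    # An arm that was never tried has no usable estimate, so it is tried first.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        estimates[i] + math.sqrt(2.0 * math.log(total_pulls) / counts[i])
        for i in range(len(estimates))
    ]
    return max(range(len(scores)), key=lambda i: scores[i])
```

With `epsilon = 0` the first function is purely greedy. In the second, the bonus term only shrinks after an item has been shown several times, so a fresh news item needs repeated trials before its reward estimate becomes reliable.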

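Therefore, the paper proposes its own model. First, it uses a Deep Q-Network (DQN), which can take both the current reward and future rewards into account.

Second, we consider user return as another form of user feedback information, by maintaining an activeness score for each user.

Rather than looking only at the most recent return, the activeness score takes the user's historical return information into account.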

Third, we propose to apply a Dueling Bandit Gradient Descent (DBGD) method [16, 17, 49] for exploration, by choosing random item candidates in the neighborhood of the current recommender.

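The environment consists of users and news (environment: user + news).

The state is defined as the feature representation of users, and the action is defined as the feature representation of news.

At each time step, when a user requests news, a state (i.e., features of users) and a set of actions (i.e., features of news candidates) are passed to the agent.

The agent selects the best action (i.e., recommending a list of news to the user) and receives the user's feedback as the reward. Specifically, the reward is composed of the click labels and an estimation of user activeness.

All recommendations and feedback are stored in memory, and the algorithm is updated every hour.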

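The contributions are as follows:

1. A DQN-based framework for news recommendation that considers both the current reward and future rewards.

2. User activeness is used as additional feedback to improve recommendation accuracy.

3. A more effective exploration mechanism, Dueling Bandit Gradient Descent.

4. The system has been deployed online in an application, where it performed well.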

2 RELATED WORK

2.1 News recommendation algorithms

Conventional news recommendation methods can be divided into three categories.

1. Content-based methods [19, 22, 33] use news term frequency features (e.g., TF-IDF) and user profiles (based on historical news), and then recommend the news that is most similar to the user profile.

2. Collaborative filtering methods [11] usually make rating predictions utilizing the past ratings of the current user or similar users [28, 34], or the combination of these two [11].

3. To combine the advantages of the former two groups of methods, hybrid methods [12, 24, 25] are further proposed to improve the user profile modeling.

Beyond these three categories, deep learning models [8, 45, 52] have shown much better performance than the previous three categories of models due to their capability of modeling complex user-item relationships.

2.2 Reinforcement learning in recommendation

2.2.1 Contextual Multi-Armed Bandit models

The context includes both user features and item features. Paper [23] assumes that the reward is a function of the context.

2.2.2 Markov Decision Process models

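MDP models capture not only the reward of the current iteration, but also the potential reward in future iterations.

Earlier models of this kind use discrete representations and are hard to train, whereas the DRN model uses continuous ones.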

4.1 Model framework

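Offline: four kinds of features are extracted (from news and users), and a DQN predicts the reward (i.e., a combination of the user-news click label and user activeness). The network is trained on offline click logs.

Online:

(1) Push

At each timestamp (t1, t2, t3, t4, t5, ...), a user sends a news request. Taking the user features and the news candidates as input, the agent G generates a top-k list of news L, which combines items chosen by the current model with items chosen for exploration.

(2) Feedback

The user who received L returns feedback B in the form of clicks on the recommended news.

(3) Minor update

After each timestamp, using the previous user u, the recommendation list L, and the feedback B, the agent G updates its parameters by comparing the performance of the current network Q and the exploration network Q˜.

If Q˜ gives a better recommendation result, the current network is updated towards Q˜. Otherwise, Q is kept unchanged.

(4) Major update

After a period TR, the agent G uses the user feedback and the activeness records stored in memory to update the network Q with experience replay.

(5) Repeat steps (1)-(4).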
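Steps (1) to (3) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact interleaving procedure: the fair coin flip, the `"Q"` / `"Q_tilde"` source tags, and the "more clicks wins" rule in `explorer_wins` are all hypothetical choices made to keep the example short.

```python
import random


def interleave(list_q, list_explore, k, rng):
    """Merge two ranked lists into one top-k list, tagging each item with its source."""
    merged, seen = [], set()
    i = j = 0
    while len(merged) < k and (i < len(list_q) or j < len(list_explore)):
        # Flip a fair coin to decide which list contributes next, and fall
        # back to the other list when the chosen one is exhausted.
        take_q = rng.random() < 0.5
        if take_q and i >= len(list_q):
            take_q = False
        elif not take_q and j >= len(list_explore):
            take_q = True
        if take_q:
            item, source = list_q[i], "Q"
            i += 1
        else:
            item, source = list_explore[j], "Q_tilde"
            j += 1
        # Skip items that the other network already contributed.
        if item not in seen:
            seen.add(item)
            merged.append((item, source))
    return merged


def explorer_wins(merged, clicked):
    """Return True when items from Q_tilde drew more clicks than items from Q."""
    q_clicks = sum(1 for item, src in merged if src == "Q" and item in clicked)
    e_clicks = sum(1 for item, src in merged if src == "Q_tilde" and item in clicked)
    return e_clicks > q_clicks
```

For example, `interleave(["a", "b", "c"], ["c", "d", "e"], 4, random.Random(0))` yields four distinct items, each tagged with the network that supplied it, and `explorer_wins` then decides whether the minor update should move Q towards Q˜.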

4.2 Feature construction

• News features. These 417-dimensional one-hot features describe whether a certain property appears in a piece of news, including headline, provider, ranking, entity name, category, topic category, and click counts in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year respectively.

• User features. These mainly describe the features (i.e., headline, provider, ranking, entity name, category, and topic category) of the news that the user clicked in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year respectively. There is also a total click count for each time granularity. In total there are 413 × 5 = 2065 dimensions.

• User news features. These 25-dimensional features describe the interaction between the user and one certain piece of news, i.e., the frequency for the entity (also category, topic category and provider) to appear in the history of the user's readings.

• Context features. These 32-dimensional features describe the context when a news request happens, including time, weekday, and the freshness of the news (the gap between request time and news publish time).

Of these four feature groups, user features and context features represent the current state, while news features and user news (interaction) features represent one action.
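The split between state and action can be sketched as plain vector concatenation. The dimensions come from the text above; the function names, the concatenation order, and the idea of feeding one joined vector to the network are assumptions made for illustration.

```python
NEWS_DIM = 417          # news features
USER_DIM = 413 * 5      # user features over five time windows (2065)
USER_NEWS_DIM = 25      # user-news interaction features
CONTEXT_DIM = 32        # context features


def _check(name, vec, dim):
    """Raise if a feature vector does not have the expected length."""
    if len(vec) != dim:
        raise ValueError(f"{name}: expected {dim} values, got {len(vec)}")


def build_state(user, context):
    """State = user features followed by context features."""
    _check("user", user, USER_DIM)
    _check("context", context, CONTEXT_DIM)
    return list(user) + list(context)


def build_action(news, user_news):
    """Action = news features followed by user-news interaction features."""
    _check("news", news, NEWS_DIM)
    _check("user_news", user_news, USER_NEWS_DIM)
    return list(news) + list(user_news)


def q_input(state, action):
    """Hypothetical network input: state and action joined into one vector."""
    return list(state) + list(action)
```

Under these assumptions a state has 2065 + 32 = 2097 values, an action has 417 + 25 = 442 values, and one (state, action) pair has 2539 values.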

4.3 Deep Reinforcement Recommendation

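The deep reinforcement learning model used here is a Dueling Double DQN.

User features and context features represent the current state, and news features and user news features represent one action. Given a state and an action, the model outputs the predicted Q value of taking that action in that state.

The target Q value has two parts: the reward obtained immediately and the discounted reward obtained in the future.

r_immediate is the immediate reward, i.e., whether the user clicked on this piece of news.

The future reward is estimated with the Double DQN (DDQN) target.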
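The standard Double DQN target has the form y = r + γ · Q(s', argmax over a' of Q(s', a'; W); W'), where W are the online network's weights and W' the target network's weights. The sketch below computes that target for a single transition, assuming this standard form; the function and argument names are hypothetical, and the Q values are passed in as plain lists rather than computed by a network.

```python
def ddqn_target(reward, gamma, next_q_online, next_q_target):
    """Double DQN target for one transition.

    next_q_online[i]: online-network Q value of candidate action i in the next state.
    next_q_target[i]: target-network Q value of the same candidate action.
    """
    if not next_q_online:
        # No candidate action in the next state, so there is no future term.
        return reward
    # The online network selects the action ...
    best = max(range(len(next_q_online)), key=lambda i: next_q_online[i])
    # ... and the target network evaluates it, which reduces overestimation.
    return reward + gamma * next_q_target[best]
```

For instance, with `reward=1.0`, `gamma=0.5`, online values `[0.2, 0.9, 0.4]` and target values `[5.0, 2.0, 7.0]`, the online network picks index 1 and the target is 1.0 + 0.5 × 2.0 = 2.0, not the 4.5 that taking the maximum of the target values would give.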

4.4 User Activeness

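Traditional methods only consider CTR, but user activeness is also very important. It is a new metric, proposed in this paper, that can serve as feedback on recommendation results. User activeness can be understood as how frequently the user uses the app: good recommendations increase that frequency, so it can be used as a feedback signal.

If the user has no clicks for a period of time, activeness decreases; once a click happens, activeness goes up.

Once both clicks and activeness are taken into account, the immediate reward mentioned earlier becomes a combination of the click reward and the user activeness reward.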
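A minimal sketch of this behaviour follows. The text only states that activeness drops without clicks and rises on a click; the exponential decay, the cap at 1.0, the weighted sum with coefficient `beta`, and every parameter value used in the examples are assumptions for illustration.

```python
import math


def decay(activeness, seconds_idle, hazard):
    """Activeness after a period with no clicks (assumed exponential decay)."""
    return activeness * math.exp(-hazard * seconds_idle)


def on_click(activeness, bump):
    """Activeness after a click: raised by a fixed amount, capped at 1.0."""
    return min(1.0, activeness + bump)


def total_reward(clicked, activeness, beta):
    """Combined reward: click reward plus beta times the activeness reward."""
    click_reward = 1.0 if clicked else 0.0
    return click_reward + beta * activeness
```

With an illustrative hazard of `1e-5` per second, a user at activeness 0.5 who is idle for a day drops to `decay(0.5, 86400, 1e-5)`, roughly 0.21, and a click with `bump=0.3` raises it again. The coefficient `beta` controls how much the activeness term counts relative to a click.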

4.5 Explore

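Exploration in this paper uses the Dueling Bandit Gradient Descent algorithm.

On top of the DQN network there is an additional exploration network Q˜, whose parameters are obtained by adding a certain amount of noise to the parameters of the current network Q.

When a user request arrives, both networks generate a top-K news list. The two lists are then interleaved to some degree, and the merged list is shown to the user to collect feedback. If the exploration network Q˜ performs better, the parameters of the current network Q are updated in the direction of the parameters of Q˜.

Otherwise, the parameters of the current network Q stay unchanged.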
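The perturb-and-update cycle can be sketched on a flat list of weights. This is a sketch under assumptions, not the paper's exact equations: the noise is taken to be proportional to each weight, drawn uniformly from (-1, 1) and scaled by a coefficient `alpha`, and the update is taken to move Q a fraction `eta` of the way towards Q˜. All names are hypothetical.

```python
import random


def perturb(weights, alpha, rng):
    """Exploration weights: each weight plus alpha * uniform(-1, 1) * weight."""
    return [w + alpha * rng.uniform(-1.0, 1.0) * w for w in weights]


def minor_update(weights, explore_weights, eta, explorer_won):
    """Move the current weights towards the exploration weights if they won."""
    if not explorer_won:
        # Q˜ did not do better, so Q stays unchanged.
        return list(weights)
    return [w + eta * (w_t - w) for w, w_t in zip(weights, explore_weights)]
```

A small `alpha` keeps Q˜ in the neighborhood of Q, so the explored items stay close to what the current recommender would pick instead of being fully random, and `eta` limits how far a single comparison can move the network.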

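In summary, the framework applies deep reinforcement learning to recommendation while also accounting for user activeness and exploring for diverse recommendations, which makes it a fairly complete recommendation framework.

References: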

https://blog.csdn.net/r3ee9y2oefcu40/article/details/82880302

http://www.personal.psu.edu/~gjz5038/paper/www2018_reinforceRec/www2018_reinforceRec.pdf
