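1. In news recommendation, news features are highly dynamic (the dynamic nature of news features); some existing models already take this into account.
2. Some models also use user feedback other than click / no click labels (e.g., how frequently the user returns).
To address the problems above, we propose a DQN-based framework that explicitly models future rewards. To capture more user feedback, we use the user return pattern as a supplement to user clicks.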
1. The dynamic changes in news recommendation are difficult to handle.
   News becomes outdated quickly, and the set of news candidates changes fast.
2. Current recommendation methods [23, 35, 36, 43] usually consider only click / no click labels or ratings as user feedback.
3. Current methods tend to keep recommending similar items to users, which might decrease users' interest in similar topics.
Some methods add randomness to counter this, for example a simple ϵ-greedy strategy [31] or Upper Confidence Bound (UCB) [23, 43] (mainly for Multi-Armed Bandit methods). Both have drawbacks:
The ϵ-greedy strategy may recommend totally unrelated items to the customer.
UCB cannot get a relatively accurate reward estimate for an item until the item has been tried several times.
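The two exploration strategies mentioned above can be sketched as follows. This is a minimal textbook-style illustration, not code from the paper: the function names, the argument layout, and the choice of the UCB1 bonus term are assumptions made for the example.

```python
import math
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick a random arm with probability epsilon, else the best-estimated arm."""
    if rng.random() < epsilon:
        # Exploration step: any arm may be chosen, even an unrelated one.
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])


def ucb1(estimates, counts, total_pulls):
    """Pick the arm with the highest upper confidence bound (UCB1 variant)."""
    # An arm that was never tried has no usable estimate, so try it first.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(estimates)),
        key=lambda i: estimates[i]
        + math.sqrt(2.0 * math.log(total_pulls) / counts[i]),
    )
```

The sketch makes both drawbacks visible: the random branch of `epsilon_greedy` ignores the estimates entirely, and `ucb1` has to spend pulls on every arm before its estimates become meaningful.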
Therefore, we propose our own model. First, a DQN can take both the current reward and future rewards into account.
Second, we consider user return as another form of user feedback information, by maintaining an activeness score for each user.
Third, we propose to apply a Dueling Bandit Gradient Descent (DBGD) method [16, 17, 49] for exploration, by choosing random item candidates in the neighborhood of the current recommender.
Environment: users + news.
At each time step, when a user requests news, a state (i.e., features of users) and a set of actions (i.e., features of news candidates) are passed to the agent.
The agent selects the best action (i.e., recommending a list of news to the user) and receives the user's feedback as the reward. Specifically, the reward consists of the click label and an estimation of user activeness.
3. A more effective exploration mechanism: Dueling Bandit Gradient Descent.
Conventional news recommendation methods can be divided into three categories.
1. Content-based methods [19, 22, 33] use news term frequency features (e.g., TF-IDF) and user profiles (based on historical news), and then recommend the news most similar to the user profile.
2. Collaborative filtering methods [11] usually make rating predictions using the past ratings of the current user or of similar users [28, 34], or a combination of the two [11].
3. To combine the advantages of the former two groups of methods, hybrid methods [12, 24, 25] are further proposed to improve the user profile modeling.
4. Deep learning models [8, 45, 52] have shown much better performance than the previous three categories of models, due to their capability of modeling complex user-item relationships.
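In contextual Multi-Armed Bandit models, the context includes user features and item features; [23] assumes that the reward is a function of the context.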
2.2.2 Markov Decision Process models
These models capture not only the reward of the current iteration, but also the potential reward in future iterations.
4.1 Model framework
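Offline: four kinds of features are extracted (from news and users), and a DQN is used to predict the reward (i.e., a combination of the user-news click label and user activeness). The network is trained on offline click logs.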
(1) Push:
At each timestamp (t1, t2, t3, t4, t5, ...), a user sends a request, and the agent uses the input (user features and news candidates) to generate a top-k list of news L (produced by the model plus exploration).
(2) Feedback:
The user returns click feedback B on the list L.
(3) Minor update:
At each timestamp, given the previous user u, the recommendation list L, and the feedback B, agent G updates its parameters by comparing the performance of Q and Q̃.
If Q̃ gives a better recommendation result, the current network is updated towards Q̃. Otherwise, Q is kept unchanged.
(4) Major update
(5) Repeat steps (1)-(4).
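The control flow of steps (1)-(5) can be sketched as a simple loop. Everything in this block is hypothetical scaffolding: `ToyAgent`, its methods, the simulated clicks, and the `major_every` interval are illustrative assumptions, and the body of the major update is left as a placeholder because these notes do not describe it.

```python
import random


class ToyAgent:
    """Hypothetical stand-in for agent G; it only records what the loop does."""

    def __init__(self):
        self.memory = []        # logged (user, list, feedback) tuples
        self.minor_updates = 0
        self.major_updates = 0

    def push(self, user, candidates, k):
        # Placeholder ranking: a real agent would rank by Q plus exploration.
        return candidates[:k]

    def minor_update(self, user, rec_list, feedback):
        # Placeholder: a real agent would compare Q and Q-tilde here.
        self.memory.append((user, rec_list, feedback))
        self.minor_updates += 1

    def major_update(self):
        # Placeholder: assumed to be a heavier, less frequent update.
        self.major_updates += 1


def run(agent, requests, k=2, major_every=3, rng=None):
    """Drive the push / feedback / minor update / major update cycle."""
    rng = rng or random.Random(0)
    for step, (user, candidates) in enumerate(requests, start=1):
        rec_list = agent.push(user, candidates, k)           # (1) push
        feedback = [rng.random() < 0.5 for _ in rec_list]    # (2) simulated clicks
        agent.minor_update(user, rec_list, feedback)         # (3) minor update
        if step % major_every == 0:
            agent.major_update()                             # (4) major update
    return agent                                             # (5) loop repeats per request
```

For example, `run(ToyAgent(), [("u1", ["a", "b", "c"])] * 7)` performs seven minor updates and, with the assumed interval of three, two major updates.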
• News features include 417-dimensional one-hot features that describe whether a certain property appears in this piece of news,
including headline, provider, ranking, entity name, category, topic category, and click counts in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year respectively.
• User features mainly describe the features (i.e., headline, provider, ranking, entity name, category, and topic category) of the news that the user clicked in the last 1 hour, 6 hours, 24 hours,
1 week, and 1 year respectively. There is also a total click count for each time granularity. Therefore, there are 413 × 5 = 2065 dimensions in total.
• User news features. These 25-dimensional features describe the interaction between user and one certain piece of news,
i.e., the frequency for the entity (also category, topic category and provider) to appear in the history of the user’s readings.
• Context features. These 32-dimensional features describe the context when a news request happens, including time, weekday,
and the freshness of the news (the gap between request time and news publish time).
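As a quick check of the dimensions listed above, the four feature groups can be tallied in a few lines. The dictionary keys are illustrative names, and summing the groups assumes they are all concatenated into one input vector, which these notes do not state.

```python
# Dimensions as stated in the feature list; the key names are illustrative.
FEATURE_DIMS = {
    "news": 417,
    "user": 413 * 5,   # 413 dimensions for each of the 5 time granularities
    "user_news": 25,
    "context": 32,
}


def total_input_dims(dims):
    """Sum the sizes of all feature groups."""
    return sum(dims.values())
```

Under that assumption the combined input would have 417 + 2065 + 25 + 32 = 2539 dimensions.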
4.3 Deep Reinforcement Recommendation
r_immediate denotes the immediate reward (whether the user clicked this piece of news).
The future reward is estimated with the DDQN (Double DQN) formula.
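The formula itself is not reproduced in these notes. As a reminder of the standard Double DQN target (not necessarily in the paper's exact notation), the online network selects the next action and a separate target network evaluates it. The function and argument names below are assumptions made for the example.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma):
    """Standard Double DQN target for a single transition.

    next_q_online: Q-values of the next state's actions from the online network.
    next_q_target: Q-values of the same actions from the target network.
    """
    # The online network chooses which next action looks best ...
    best_next = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... and the target network supplies the value of that action.
    return reward + gamma * next_q_target[best_next]
```

Separating action selection from action evaluation is what distinguishes Double DQN from plain DQN, where a single network does both and tends to overestimate values.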
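Conventional methods consider only CTR. User activeness is also important; it is a new metric proposed in this paper that can serve as feedback on recommendation results. User activeness can be understood as how frequently a user uses the app: good recommendations increase that frequency, so it can be used as a feedback signal.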
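One simple way to turn this idea into numbers is sketched below. It is an illustration under stated assumptions, not the paper's formulation, which these notes do not give: the exponential decay, the `decay_rate`, `bump`, and `beta` parameters, and all function names are hypothetical.

```python
import math


def decay_activeness(score, hours_elapsed, decay_rate=0.01):
    """Let the activeness score decay exponentially while the user is away."""
    return score * math.exp(-decay_rate * hours_elapsed)


def on_user_return(score, bump=0.3):
    """Raise the score when the user comes back, capped at 1.0."""
    return min(1.0, score + bump)


def total_reward(clicked, activeness, beta=0.05):
    """Combine the click label with the activeness estimate."""
    click_reward = 1.0 if clicked else 0.0
    return click_reward + beta * activeness
```

With this shape, a user who keeps returning holds a high activeness score, so recommendations that bring users back earn more reward than clicks alone would give.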
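For exploration, the paper adopts the Dueling Bandit Gradient Descent algorithm.
On top of the DQN network there is an additional exploration network Q̃, whose parameters are obtained by adding some noise to the parameters of the current Q network.
When a user request arrives, both networks generate a top-K news list, the two lists are mixed to some degree, and the merged list is shown to the user to collect feedback. If the exploration network Q̃ performs better, the parameters of the current Q network are updated in the direction of the parameters of Q̃.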
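The exploration procedure described above can be sketched on flat weight lists. This is a minimal sketch under assumptions: the noise is taken to be uniform and proportional to each weight, the interleaving is a simple random merge, and the update moves a fraction `eta` of the way toward the exploration weights. The exact formulas are not given in these notes, and all names here are illustrative.

```python
import random


def perturb(weights, alpha, rng):
    """Build exploration weights by adding noise scaled to each weight."""
    return [w + alpha * rng.uniform(-1.0, 1.0) * w for w in weights]


def interleave(list_q, list_q_tilde, k, rng):
    """Randomly mix two ranked lists into one list of up to k distinct items."""
    merged, a, b = [], list(list_q), list(list_q_tilde)
    while len(merged) < k and (a or b):
        # Draw from a random non-empty list, keeping each list's own order.
        source = a if (a and (not b or rng.random() < 0.5)) else b
        item = source.pop(0)
        if item not in merged:
            merged.append(item)
    return merged


def minor_update(weights, explore_weights, explore_won, eta):
    """Move Q toward Q-tilde only if the exploration network did better."""
    if not explore_won:
        return list(weights)
    return [w + eta * (we - w) for w, we in zip(weights, explore_weights)]
```

Because the perturbation is small relative to each weight, the exploration network stays in the neighborhood of the current recommender, which avoids the unrelated recommendations that purely random exploration can produce.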