[Paper Reading] Deep Recurrent Q-Network

Date: 2019-11-20


Preface

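This post introduces the Deep Recurrent Q-Network (DRQN), a network architecture that builds on the Deep Q-Network (DQN). DRQN adds a long short-term memory (LSTM) layer to the network, which gives it memory, so it can reach a reasonable level of play even when it receives only a single frame as input. DRQN can also cope, to some extent, with environments where the game state is only partially observable (similar to the fog of war in StarCraft). Current top game AIs such as AlphaStar and OpenAI Five both use LSTM in their networks, and this paper may help explain why they do.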

Paper information

Title : Deep Recurrent Q-Learning for Partially Observable MDPs
Authors : Hausknecht Matthew ; Stone Peter
Year : 2015
APA Ref : Hausknecht, M., & Stone, P. (2015, September). Deep recurrent q-learning for partially observable mdps. In 2015 AAAI Fall Symposium Series.
In-text citation : Hausknecht & Stone, 2015

Abstract :

Deep Reinforcement Learning has yielded proficient controllers for complex tasks. However, these controllers have limited memory and rely on being able to perceive the complete game screen at each decision point. To address these shortcomings, this article investigates the effects of adding recurrency to a Deep Q-Network (DQN) by replacing the first post-convolutional fully-connected layer with a recurrent LSTM. The resulting Deep Recurrent Q-Network (DRQN), although capable of seeing only a single frame at each timestep, successfully integrates information through time and replicates DQN's performance on standard Atari games and partially observed equivalents featuring flickering game screens. Additionally, when trained with partial observations and evaluated with incrementally more complete observations, DRQN's performance scales as a function of observability. Conversely, when trained with full observations and evaluated with partial observations, DRQN's performance degrades less than DQN's. Thus, given the same length of history, recurrency is a viable alternative to stacking a history of frames in the DQN's input layer and while recurrency confers no systematic advantage when learning to play the game, the recurrent net can better adapt at evaluation time if the quality of observations changes.

Core idea

Memory in neural networks

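The DRQN proposed in the paper uses an LSTM to give the agent memory. In the DQN paper, the game state the agent receives is four consecutive frames. It does not have to be exactly four; the paper notes that three or five consecutive frames also work. It does have to be several consecutive frames, though, because a single frame sometimes cannot express all of the game's information. Take the Atari game Pong as an example.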

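The figure above shows Pong. The game is simple: each side controls a white rectangular paddle, and a ball moves between the two. When a paddle touches the ball, the ball bounces back. The side that fails to return the ball loses.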

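For this game, a single frame reveals only the positions of the two paddles and the ball. The direction and speed of the ball, and the direction and speed of the paddles, cannot be determined. That information is available only from several consecutive frames. When decisions have to be made from incomplete game information, the decision process is called a Partially-Observable Markov Decision Process (POMDP). If only a single frame is used, Pong becomes a POMDP for DQN, and DQN does not handle POMDPs well. In addition, a fixed number of frames is a limitation: other kinds of problems may need more frames to recover the relevant information. The authors therefore added an LSTM to DQN, giving the network memory so that it can keep a record of earlier game states, with the LSTM deciding what to "remember" and what to "forget".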
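The single-frame ambiguity can be made concrete with a toy example. The following is a minimal sketch, not code from the paper: the function name `estimate_velocity` and the ball coordinates are made up for illustration. It shows two trajectories that end on the same frame yet move in opposite directions, so the current frame alone cannot tell them apart.

```python
from typing import Optional, Tuple

Position = Tuple[int, int]  # (x, y) of the ball in one frame; hypothetical representation


def estimate_velocity(prev: Optional[Position], curr: Position) -> Optional[Tuple[int, int]]:
    """Return the per-frame displacement of the ball, or None if only one frame is known."""
    if prev is None:
        # A single frame carries position only; direction and speed are unobservable.
        return None
    return (curr[0] - prev[0], curr[1] - prev[1])


# Two made-up trajectories that share the same current frame (12, 21).
moving_right = [(10, 20), (12, 21)]
moving_left = [(14, 22), (12, 21)]

print(estimate_velocity(None, moving_right[-1]))  # None
print(estimate_velocity(*moving_right))           # (2, 1)
print(estimate_velocity(*moving_left))            # (-2, -1)
```

Stacking frames hands the network both positions at once. A recurrent layer instead has to carry the earlier position forward in its hidden state.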

Deep Recurrent Q-network

Network architecture

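DRQN's network architecture is similar to DQN's. The only difference is that the first fully connected layer after the convolutional layers is replaced with an LSTM. The specific architecture parameters are shown in the figure.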
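The layer layout can be traced as a small shape calculation. This is a rough sketch under stated assumptions rather than the authors' implementation: the 84x84 single-frame input, the three unpadded convolutions (32 filters of 8x8 with stride 4, 64 of 4x4 with stride 2, 64 of 3x3 with stride 1), the 512 LSTM units and the 18 actions are taken here as the commonly cited DQN/DRQN configuration, and `describe_drqn` is a hypothetical helper name.

```python
from typing import List, Tuple


def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Output width/height of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1


# (out_channels, kernel, stride); assumed values, see the lead-in above.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def describe_drqn(input_size: int = 84, lstm_units: int = 512,
                  num_actions: int = 18) -> List[Tuple[str, Tuple[int, ...]]]:
    """List (layer name, output shape) pairs for a DRQN-style network."""
    size, channels = input_size, 1  # one grayscale frame per timestep
    layers: List[Tuple[str, Tuple[int, ...]]] = [("input", (channels, size, size))]
    for out_channels, kernel, stride in CONV_LAYERS:
        size = conv_output_size(size, kernel, stride)
        channels = out_channels
        layers.append(("conv", (channels, size, size)))
    layers.append(("flatten", (channels * size * size,)))
    # DQN has a fully connected layer here; DRQN uses an LSTM in its place.
    layers.append(("lstm", (lstm_units,)))
    layers.append(("linear_q_values", (num_actions,)))
    return layers


for name, shape in describe_drqn():
    print(name, shape)
# input (1, 84, 84)
# conv (32, 20, 20)
# conv (64, 9, 9)
# conv (64, 7, 7)
# flatten (3136,)
# lstm (512,)
# linear_q_values (18,)
```

Under these assumptions the input has one channel because the network sees a single frame per timestep. The history that DQN gets from frame stacking lives in the LSTM state instead.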

Sampling based on complete episodes

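DQN updates the network by sampling from the replay buffer, with each sampled state made up of four consecutive frames. DRQN has two ways to sample for updates. The first samples a complete episode and updates the network using inputs starting from the initial state. The second samples a complete episode and updates the network starting from a random time point in it, treating that point as the first state. The experiments show that both are effective, but the second is the more reasonable of the two, because the first violates DQN's principle of random sampling.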
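The two sampling strategies can be sketched over a toy episode buffer. This is an illustrative sketch, not the paper's code: the function names, the integer stand-ins for transitions, and the `unroll` window length are assumptions introduced here.

```python
import random
from typing import List, Sequence

Episode = List[int]  # stand-in: each int represents one stored transition


def sample_from_start(episodes: Sequence[Episode], rng: random.Random) -> Episode:
    """Strategy 1: pick an episode and replay it from its initial state."""
    return list(rng.choice(episodes))


def sample_from_random_point(episodes: Sequence[Episode], unroll: int,
                             rng: random.Random) -> Episode:
    """Strategy 2: pick an episode, start at a random timestep, take up to `unroll` steps."""
    episode = rng.choice(episodes)
    start = rng.randrange(len(episode))
    # Earlier steps are not replayed, so the recurrent state has no history
    # at `start` and would have to begin from a fresh (e.g. zero) state.
    return list(episode[start:start + unroll])


buffer = [[0, 1, 2, 3, 4], [10, 11, 12]]
rng = random.Random(0)
print(sample_from_start(buffer, rng))
print(sample_from_random_point(buffer, unroll=3, rng=rng))
```

In this sketch the first strategy always returns transitions in episode order from the beginning. The second draws its starting point at random, which is closer to how DQN samples from its replay buffer.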