Understanding the Deep Reinforcement Learning Algorithm A3C (Actor-Critic Algorithm) in One Article

Date: 2019-12-18

2017-12-25 16:29:19

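　　I have long felt that I only half understood the A3C algorithm, so I am sorting it out and writing it down here, also as a reference for anyone else who wants to learn it.

　　To understand this algorithm clearly, you need a fairly solid grasp of DRL algorithms; I recommend getting familiar with Deep Q-learning and Policy Gradient first.

　　DRL algorithms can be roughly divided into two categories, value-based and policy-based, whose classic algorithms are Q-learning and the Policy Gradient Method, respectively.

　　The A3C algorithm covered in this article combines the policy and value-function approaches. The advantages and disadvantages of policy-based methods can be summarized as follows: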

　　Advantages:
　　　　1. Better convergence properties
　　　　2. Effective in high-dimensional or continuous action spaces
　　　　3. Can learn stochastic policies
　　Disadvantages:
　　　　1. Typically converge to a local rather than global optimum
　　　　2. Evaluating a policy is typically inefficient and high variance

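　　Let us first briefly introduce some background:

　　The basic RL setting involves an agent, an environment, actions, states, and rewards. The agent interacts with the environment and produces a trajectory: executing an action changes the state of the environment, s -> s', and the environment then gives the agent a reward (positive or negative) for the chosen action. By repeating this interaction the agent accumulates more and more experience and uses it to update its policy, which closes the loop. For simplicity, we consider only a deterministic environment, i.e., choosing action a in state s always leads to the same state s'.

　　For clarity, let us first define some notation: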

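　　1. The stochastic policy $\pi(s)$ determines the agent's action. Its output is not a single action but a probability distribution over actions, which sums to 1.

　　2. $\pi(a|s)$ denotes the probability of choosing action a in state s.

　　The policy $\pi$ that we want to learn is therefore a function of the state s that returns the probabilities of all actions.

　　The agent's goal is to maximize the reward it obtains, which we express as the expected reward. Under a probability distribution P, the expectation of a value X is:

　　$$E_{P}[X] = \sum_i P_i X_i$$

　　where $X_i$ are all possible values of X and $P_i$ is the probability of each value occurring. The expectation can be viewed as the average of the values $X_i$ weighted by $P_i$.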

　　An important point here: if we had a pool of values X, ratio of which was given by P, and we randomly picked a number of these, we would expect the mean of them to be $E_{P}[X]$. And the mean would get closer to $E_{P}[X]$ as the number of samples rises.
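
　　The relationship between the expectation and the sample mean can be checked numerically. The following is a minimal sketch using only the Python standard library; the values and probabilities are made up for illustration, and the function names are hypothetical.

```python
import random

def expectation(values, probs):
    """Weighted average of the values X_i with weights P_i."""
    return sum(x * p for x, p in zip(values, probs))

def sample_mean(values, probs, n, seed=0):
    """Mean of n values drawn at random according to probs."""
    rng = random.Random(seed)
    draws = rng.choices(values, weights=probs, k=n)
    return sum(draws) / n

values = [1.0, 2.0, 10.0]  # hypothetical X_i
probs = [0.5, 0.3, 0.2]    # hypothetical P_i, summing to 1

exact = expectation(values, probs)  # 0.5*1 + 0.3*2 + 0.2*10
for n in (10, 1000, 100000):
    # The sample mean should drift toward the exact expectation as n grows.
    print(n, sample_mean(values, probs, n), "vs exact", exact)
```

　　With more samples, the printed sample mean should settle closer to the exact value, which is the property the policy gradient estimate relies on later.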

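　　Next we define the value function V(s) of the policy $\pi$ as the expected discounted return, which can be written recursively:

　　$$V(s) = E_{\pi(s)}[\, r + \gamma V(s') \,]$$

　　This says that the return obtainable from the current state s is the reward r received during the transition plus the discounted return obtainable from the next state s'.

　　There is also the action-value function Q(s, a), which is closely related to the value function:

　　$$Q(s, a) = r + \gamma V(s')$$

　　We can now define a new function A(s, a), called the advantage function:

　　$$A(s, a) = Q(s, a) - V(s)$$

　　It expresses how good it is to choose action a in state s. If action a is better than average, the advantage function is positive; otherwise it is negative.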
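
　　As a small worked example of how V, Q and A relate, consider a single state with two actions in a deterministic environment. Everything in this sketch (state names, rewards, the values of the next states, and the policy) is a made-up assumption for illustration.

```python
GAMMA = 0.9  # discount factor (assumed)

# Hypothetical deterministic environment: from the current state, each action
# leads to a fixed next state and yields a fixed reward.
transitions = {"left": ("s_left", 1.0), "right": ("s_right", 3.0)}
v_next = {"s_left": 0.0, "s_right": 10.0}  # assumed known V(s')
policy = {"left": 0.5, "right": 0.5}       # pi(a|s)

# Q(s, a) = r + gamma * V(s')
q = {a: r + GAMMA * v_next[s2] for a, (s2, r) in transitions.items()}
# V(s) is the expectation of Q(s, a) under the policy.
v = sum(policy[a] * q[a] for a in policy)
# A(s, a) = Q(s, a) - V(s)
adv = {a: q[a] - v for a in q}

print("Q:", q)
print("V:", v)
print("A:", adv)
```

　　Here "right" is better than the policy's average and gets a positive advantage, while "left" gets a negative one; weighted by the policy, the advantages sum to zero.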

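　　Policy Gradient:

　　When building a DQN agent, we use an NN to estimate the function Q(s, a). Here we take a different approach: since the policy $\pi$ is a function of the state $s$, we can estimate the policy directly from the state input.

　　The input of our NN is the state s, and its output is an action probability distribution $\pi_\theta$.

　　At execution time, we can either sample an action from this distribution or directly pick the action with the highest probability.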
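
　　The two ways of choosing an action can be sketched as follows. This assumes the network's last layer produces raw scores (logits) that are turned into probabilities with a softmax, which is a common choice but not something the text above specifies; the logits are made up.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw an action index according to the policy distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def greedy_action(probs):
    """Pick the action index with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

logits = [0.1, 2.0, -1.0]  # hypothetical network output for one state
probs = softmax(logits)
rng = random.Random(0)
print("probabilities:", probs)
print("sampled action:", sample_action(probs, rng))
print("greedy action:", greedy_action(probs))
```

　　Sampling keeps some exploration during training, while the greedy choice is typically used when only the best known behaviour is wanted.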

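　　To obtain a better policy, however, we must update it. How do we optimize this? We need some metric that measures how good a policy is.

　　We define a function $J(\pi)$ as the discounted reward a policy can obtain, averaged over all possible starting states $s_0$:

　　$$J(\pi) = E_{\rho^{s_0}}[V(s_0)]$$

　　where $\rho^{s_0}$ is the distribution of starting states. This function expresses well how good a policy is. The problem is that it is hard to estimate; the good news is that we don't have to.

　　All we need to care about is how to improve it. If we know the gradient of this function, that becomes trivial.

　　There is a simple way to compute the gradient of this function:

　　$$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\, a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$$

　　The step from the objective function to this gradient is rather abrupt. We skip it for now and simply assume the result; a more detailed derivation is given later.

　　For reference, see the original Policy Gradient paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation

　　or David Silver's YouTube course: https://www.youtube.com/watch?v=KHZVXao4qXs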

　　In short, the expectation contains two terms:

　　The first term is the advantage function, i.e., the advantage of choosing that action. It is negative when the action is worse than average and positive when it is better than average; it is a scalar.

　　The second term tells us the direction in which the log-probability of the chosen action increases.

　　Multiplying the two, we find that the likelihood of actions that are better than average is increased, and the likelihood of actions worse than average is decreased.

　　Fortunately, running an episode with a policy π yields samples distributed exactly as we need. States encountered and actions taken are indeed an unbiased sample from the $\rho^\pi$ and π(s) distributions. That’s great news. We can simply let our agent run in the environment and record the (s, a, r, s’) samples. When we gather enough of them, we use the formula above to find a good approximation of the gradient $\nabla_\theta\;J(\pi)$ . We can then use any of the existing techniques based on gradient descent to improve our policy.
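
　　To make the sampled gradient estimate concrete, here is a minimal sketch for a single state whose policy is a softmax over two logits, so the parameters $\theta$ are simply the logits. For a softmax, the gradient of log π(a) with respect to the logits is the one-hot vector of a minus the probabilities. The samples, advantages and learning rate are made up, and the update is written as gradient ascent on $J(\pi)$ (equivalently, descent on its negative).

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient(logits, samples):
    """Average of A(s, a) * grad log pi(a|s) over (action, advantage) samples."""
    grad = [0.0] * len(logits)
    for action, advantage in samples:
        g = grad_log_prob(logits, action)
        for i, g_i in enumerate(g):
            grad[i] += advantage * g_i / len(samples)
    return grad

theta = [0.0, 0.0]  # start from a uniform policy
# Hypothetical samples: action 1 turned out better than average, action 0 worse.
samples = [(0, -1.0), (1, 1.0), (1, 1.0), (0, -1.0)]
grad = policy_gradient(theta, samples)
learning_rate = 0.5  # assumed step size
theta_new = [t + learning_rate * g for t, g in zip(theta, grad)]  # ascent step
print(softmax(theta), "->", softmax(theta_new))
```

　　After one step the probability of action 1 rises above 0.5 and that of action 0 falls, matching the description above.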

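　　Actor-Critic:

　　First we need to compute the advantage function A(s, a). Expanding it:

　　$$A(s, a) = Q(s, a) - V(s) = r + \gamma V(s') - V(s)$$

　　A sample from a single run gives us an unbiased estimate of Q(s, a), so we only need to know V(s) to compute A(s, a).

　　This value function is easy to approximate with an NN, just like the action-value function in DQN. It is in fact simpler, because each state has only one value.

　　We can predict the policy and the value function jointly, with a single network that outputs both $\pi(s)$ and V(s).

　　There are two things to optimize here: the actor and the critic.

　　actor: optimizes the policy so that it performs better and better;

　　critic: tries to estimate the value function more accurately;

　　These come from the Policy Gradient Theorem:

　　$$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\, a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$$

　　Simply put, the actor takes an action, and the critic then evaluates whether that choice of action was good or bad.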

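　　Parallel agents:

　　If only a single agent collects samples, the samples are very likely to be highly correlated, which causes problems for a machine learning model, because such models assume that samples are independent and identically distributed rather than highly correlated. DQN overcomes this with experience replay. But that makes learning offline, because you sample first, store the samples, and only then update the parameters.

　　So the question is: can we learn online and still break this strong correlation?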

　　We can run several agents in parallel, each with its own copy of the environment, and use their samples as they arrive.

　　1. Different agents will likely experience different states and transitions, thus avoiding the correlation.

　　2. Another benefit is that this approach needs much less memory, because we don’t need to store the samples.
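
　　The idea of several agents feeding samples to a learner as they arrive can be sketched with threads and a queue. This is only an illustration of the data flow, not the A3C implementation referenced in this article: the environment is a made-up one-dimensional random walk, the agents act randomly, and the "learner" merely collects the samples where a real one would update the shared network.

```python
import queue
import random
import threading

def worker(worker_id, n_steps, out_queue):
    """One agent with its own copy of a toy environment (a random walk)."""
    rng = random.Random(worker_id)  # separate randomness per agent
    state = 0
    for _ in range(n_steps):
        action = rng.choice([-1, 1])
        next_state = state + action
        reward = 1.0 if action == 1 else 0.0  # hypothetical reward rule
        out_queue.put((worker_id, state, action, reward, next_state))
        state = next_state

def run_parallel(n_workers=4, n_steps=5):
    """Start the agents and consume their samples as they arrive."""
    samples = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, n_steps, samples))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    collected = []
    for _ in range(n_workers * n_steps):
        collected.append(samples.get())  # a real learner would update here
    for t in threads:
        t.join()
    return collected

collected = run_parallel()
print(len(collected), "samples from agents", sorted({s[0] for s in collected}))
```

　　Because the agents run independently, consecutive samples seen by the learner tend to come from different trajectories, and nothing has to be kept in a replay memory.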

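　　Another very important concept is the N-step return.

　　Usually, when we compute Q(s, a), V(s) or A(s, a), we use only the 1-step return:

　　$$V(s_0) \rightarrow r_0 + \gamma V(s_1)$$

　　In this case we use the immediate reward from the sample (s0, a0, r0, s1), and the function's own prediction of the next state's value supplies the approximation of the rest. But we can use more steps to obtain another estimate:

　　$$V(s_0) \rightarrow r_0 + \gamma r_1 + \gamma^2 V(s_2)$$

　　or the n-step return:

　　$$V(s_0) \rightarrow r_0 + \gamma r_1 + \cdots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n)$$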

　　The n-step return has an advantage that changes in the approximated function get propagated much more quickly. Let’s say that the agent experienced a transition with unexpected reward. In 1-step return scenario, the value function would only change slowly one step backwards with each iteration. In n-step return however, the change is propagated n steps backwards each iteration, thus much quicker.

　　N-step return has its drawbacks. It’s higher variance because the value depends on a chain of actions which can lead into many different states. This might endanger the convergence.
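
　　The n-step return is straightforward to compute by working backwards from the bootstrapped value of the last state. The following is a minimal sketch; the function name and the numbers are illustrative assumptions.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * V(s_n).

    rewards: the n rewards observed along the trajectory.
    bootstrap_value: the current estimate V(s_n) of the state reached after n steps.
    """
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret  # fold in one more step of discounting
    return ret

# 1-step return: r_0 + gamma * V(s_1)
one_step = n_step_return([1.0], bootstrap_value=10.0, gamma=0.5)
# 3-step return: r_0 + gamma*r_1 + gamma^2*r_2 + gamma^3 * V(s_3)
three_step = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
print(one_step, three_step)
```

　　With n = 1 the function reduces to the usual 1-step target, so the same helper covers both cases.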

　　Together, this is the Asynchronous Advantage Actor-Critic algorithm (A3C).

　　That covers the algorithmic side of A3C. Now let us look at it from a coding perspective.

　　For an implementation based on Python + Keras + gym, see this GitHub link: https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py

　　The most important part is the definition of the loss function:

　　$$L = L_{\pi} + c_v L_v + c_{reg} L_{reg}$$

　　where $L_{\pi}$ is the loss of the policy, $L_v$ is the value error and $L_{reg}$ is a regularization term. These parts are multiplied by constants $c_v$ and $c_{reg}$, which determine what part we stress more.

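　　The three parts are described in turn below:

　　1. Policy Loss:

　　We define the objective function $J(\pi)$ as follows:

　　$$J(\pi) = E_{\rho^{s_0}}[V(s_0)]$$

　　This is the total reward an agent can achieve under policy $\pi$, averaged over all starting states.

　　By the Policy Gradient Theorem, the gradient of this function is:

　　$$\nabla_\theta J(\pi) = E_{s \sim \rho^\pi,\, a \sim \pi(s)}[\, A(s, a) \cdot \nabla_\theta \log \pi(a|s) \,]$$

　　We want to maximize this function, so the corresponding loss is its negative:

　　$$L_\pi = -J(\pi)$$

　　Treating A(s, a) as a constant, we can rewrite the loss for a single sample in the following form, whose gradient is the negative of the term inside the expectation above:

　　$$L_\pi = -A(s, a) \cdot \log \pi(a|s)$$

　　We then average over all samples in the minibatch to approximate the expectation. The final loss is:

　　$$L_\pi = -\frac{1}{n} \sum_{i=1}^{n} A(s_i, a_i) \cdot \log \pi(a_i|s_i)$$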
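
　　On a minibatch, the policy loss is the negated average of advantage times log-probability of the action actually taken. The following is a minimal sketch with made-up numbers; in a real implementation the advantages are treated as constants (no gradient flows through them) and the differentiation is left to the deep learning framework.

```python
import math

def policy_loss(advantages, action_probs):
    """-(1/n) * sum_i A(s_i, a_i) * log pi(a_i|s_i).

    advantages: A(s_i, a_i) for each sample, treated as constants.
    action_probs: pi(a_i|s_i), the probability the policy gave to the taken action.
    """
    n = len(advantages)
    return -sum(a * math.log(p) for a, p in zip(advantages, action_probs)) / n

# A better-than-average action taken with probability 0.5 gives a positive loss,
# which shrinks as the policy assigns that action a higher probability.
print(policy_loss([2.0], [0.5]))
print(policy_loss([2.0], [0.9]))
```

　　Minimizing this loss therefore pushes up the probability of actions with positive advantage and pushes down those with negative advantage.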

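　　2. Value Loss:

　　The true value function V(s) should satisfy the Bellman equation:

　　$$V(s_0) = r_0 + \gamma r_1 + \cdots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n)$$

　　Our estimated V(s) should converge to it, so from the equation above we can compute the error:

　　$$e = r_0 + \gamma r_1 + \cdots + \gamma^{n-1} r_{n-1} + \gamma^n V(s_n) - V(s_0)$$

　　This may seem unclear at first, and it confused me too: where does the ground truth come from?

　　In fact, the target is built from the sampled transitions: we compare the network's current estimate $V(s_0)$ with the n-step estimate formed from the sampled rewards and the estimate $V(s_n)$, and take the difference between the two.

　　We therefore define $L_v$ as the mean squared error over all samples:

　　$$L_v = \frac{1}{n} \sum_{i=1}^{n} e_i^2$$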
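
　　The value loss can be sketched in a few lines. Here the targets stand for n-step returns computed from sampled rewards and the predictions stand for the network's current estimates of V(s); both lists are made-up numbers and the function name is hypothetical.

```python
def value_loss(targets, predictions):
    """Mean squared error between n-step targets and predicted values V(s)."""
    errors = [t - v for t, v in zip(targets, predictions)]
    return sum(e * e for e in errors) / len(errors)

targets = [3.0, 6.0]      # hypothetical n-step returns
predictions = [2.0, 8.0]  # hypothetical network outputs V(s_0)
print(value_loss(targets, predictions))  # errors are 1 and -2
```

　　The loss is zero only when every prediction agrees with its sampled target.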

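　　3. Regularization with Policy Entropy:

　　Why add this term? We want to balance exploration and exploitation while the agent interacts with the environment: the agent should try other actions with some probability, so that the collected samples do not become too concentrated. The entropy term is introduced to keep the output distribution more balanced. The entropy of the policy in state s is:

　　$$H(\pi(s)) = -\sum_{k} \pi(s)_k \log \pi(s)_k$$

　　For example, the fully deterministic policy [1, 0, 0, 0] has entropy 0, while the totally uniform policy [0.25, 0.25, 0.25, 0.25] has the largest entropy possible for a distribution over four values.

　　To make the output distribution more balanced we want to maximize this entropy, which means minimizing the negative entropy:

　　$$L_{reg} = -\frac{1}{n} \sum_{i=1}^{n} H(\pi(s_i))$$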
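
　　The two example policies can be checked directly. This sketch uses the natural logarithm and treats 0 · log 0 as 0, which is the usual convention; the function names are hypothetical.

```python
import math

def entropy(probs):
    """Entropy of a discrete distribution, with 0 * log(0) taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_loss(batch_of_probs):
    """Negative mean entropy over a batch of policy outputs."""
    return -sum(entropy(p) for p in batch_of_probs) / len(batch_of_probs)

deterministic = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]
print(entropy(deterministic))  # no uncertainty at all
print(entropy(uniform))        # log(4), the maximum for four actions
print(entropy_loss([deterministic, uniform]))
```

　　Any other distribution over four actions has an entropy between these two values, so minimizing the negative entropy nudges the policy toward the uniform end.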

　　In short, we can rely on an existing deep learning framework to minimize this total loss and thereby optimize the network parameters.
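
　　Putting the three parts together, the total loss for a minibatch can be sketched as below. This is plain Python for illustration only: the batch layout, the advantage computed as n-step target minus predicted value, and the constants c_v = 0.5 and c_reg = 0.01 are assumptions, and a real implementation would build the same expression from framework tensors so that gradients can be computed automatically.

```python
import math

def total_loss(batch, c_v=0.5, c_reg=0.01):
    """L = L_pi + c_v * L_v + c_reg * L_reg, averaged over the batch.

    Each sample is a dict with:
      'probs'  - policy output pi(s), a list of action probabilities
      'action' - index of the action that was taken
      'value'  - predicted V(s)
      'target' - n-step return used as the target for V(s)
    """
    n = len(batch)
    loss_pi = loss_v = loss_reg = 0.0
    for sample in batch:
        advantage = sample["target"] - sample["value"]  # treated as a constant
        loss_pi += -advantage * math.log(sample["probs"][sample["action"]])
        loss_v += advantage ** 2
        # Negative entropy of the policy output for this state.
        loss_reg += sum(p * math.log(p) for p in sample["probs"] if p > 0.0)
    return (loss_pi + c_v * loss_v + c_reg * loss_reg) / n

batch = [{"probs": [0.5, 0.5], "action": 0, "value": 1.0, "target": 2.0}]
print(total_loss(batch))
```

　　Raising c_v puts more weight on an accurate critic, while raising c_reg keeps the policy closer to uniform for longer.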

　　Reference：

　　1. https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py

　　2. https://jaromiru.com/2017/03/26/lets-make-an-a3c-implementation/

　　3. https://www.youtube.com/watch?v=KHZVXao4qXs

　　4. https://github.com/ikostrikov/pytorch-a3c

　　======================================================

　　 Policy Gradient Method: Computing the Gradient of the Objective Function

　　======================================================

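　　Reference paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation (NIPS 2000, MIT Press)

　　Many earlier algorithms were based on value functions. Although they made great progress, this approach has two limitations:
　　First, these methods aim at finding a deterministic policy, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities;

　　Second, an arbitrarily small change in the estimated value of an action can determine whether or not that action is selected. Such discontinuous changes have been widely identified as a key obstacle to establishing convergence guarantees.

　　Policy gradient methods look at the problem from a different angle. Our goal is just to learn a policy that maps states to actions, so do we really have to learn a value function first? We can feed a state directly into an NN that outputs a distribution over actions, and treat the parameters of the NN as the adjustable parameters of the policy. Let $\rho$ denote the actual performance of the policy, i.e., the averaged reward per step. We can take the partial derivative of $\rho$ with respect to the parameters $\theta$ and update the parameters accordingly in order to learn:

　　$$\Delta\theta \approx \alpha \frac{\partial \rho}{\partial \theta}$$

　　where $\alpha$ is a positive step size. If the above can be achieved, $\theta$ can usually be guaranteed to converge to a locally optimal policy. The paper provides an unbiased estimate of this gradient, obtained from experience with the help of an approximate value function that satisfies certain properties.

　　1. Policy Gradient Theorem

　　The setting is the standard reinforcement learning framework: an agent interacts with an environment, and the process satisfies the Markov property.

　　The state, action, and reward at each time $t \in \{0, 1, 2, \ldots\}$ are denoted $s_t$, $a_t$, $r_t$. The dynamics of the environment are characterized by the state transition probabilities.

　　The notation for each concept and its meaning can be seen from the above.

　　====>> 　　To be continued...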

　　======================================================

　　　　　　　　Pytorch for A3C

　　======================================================

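　　This part continues with the PyTorch framework and looks at a concrete implementation at the code level. The code used here comes from: https://github.com/ikostrikov/pytorch-a3c

　　The code is organized as follows:

　　Let us look at a few core pieces of code: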