Reinforcement Learning Explained, with a Code Implementation

Date: 2019-12-23

This article is the author's original work. Please credit the source when reposting: http://www.javashuo.com/article/p-dybcgrip-e.html

1. Introduction

2. How Reinforcement Learning Works

2.1 Definition of Reinforcement Learning (RL)

2.2 Markov Decision Process (MDP)

2.3 Bellman Equation

2.4 Q-Learning

3. Code Implementation and Notes (Python 3.5)

4. Results

5. References

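1. Introduction

Most readers will still remember AlphaGo, the program developed by Google DeepMind, defeating the Korean Go master Lee Sedol in 2016. At the time I was both shaken and frightened by the outcome of that man-versus-machine match. Shaken, because machines outsmarting humans was something seen only in science-fiction films, and yet here it was happening in the real world. Frightened, because fear is a natural reaction to the unknown, and it is precisely this fear that gives us the instinct to explore the unknown and lift the veil on the technology behind AlphaGo. You have probably already guessed what a key part of that technology is: reinforcement learning.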

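2. How Reinforcement Learning Works

2.1 Definition of Reinforcement Learning

Reinforcement learning is a goal-directed method of learning through interaction, whose aim is to find the optimal policy for a sequence of decisions over time. That definition is rather abstract (frankly, abstract statements are concise and precise, but also hard to grasp), so here is an easier example:

Two roads lie in front of you, leading in two different directions. Only one of them reaches your destination, and you do not know which one.

It feels like groping in the dark with nothing to go on. Indeed, in the situation as stated there is nothing you can do. But if you are given the chance to try both directions, you will find out which one is correct.

A core idea of reinforcement learning is trial and error: only by trying can the agent discover which actions maximise reward. The current action may affect not only the immediate reward but also the next reward and every reward after it, because a goal is reached through a chain of actions taken one step at a time.

The scenario above involves the main elements of reinforcement learning: agent, environment, state, action, reward and policy.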

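Agent: the learner or decision maker at the centre of reinforcement learning. In the scenario above, that is us.

Environment: everything outside the agent, described mainly by a set of states.

State: a piece of data that represents the environment. The state set is the set of all possible states of the environment. For example, taking one step brings you to a new state.

Action: something the agent can do. The action set is the set of all actions available to the agent. For example, the act of taking one step is an action.

Reward: the positive or negative feedback signal the agent receives after performing an action. The reward set is the set of all feedback the agent can receive. A correct move is rewarded, a wrong one is penalised.

Policy: reinforcement learning learns a mapping from environment states to actions, and this mapping is called the policy. Informally, the policy is the way the agent decides which action to choose.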

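Learning proceeds in three steps.

Step 1: the agent tries an action and the environment moves to a new state. For this new state the environment returns a reward or a penalty.

Step 2: based on the new state and the reward or penalty it received, the agent takes another action, and so on until it reaches the goal.

Step 3: the agent picks the policy with the largest reward as the best policy for reaching the goal, and then follows it.

Note that the agent has to try every possible action on its way to the goal. The end result is a table that maps every possible state to every possible action (the Q-table).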
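The two-road scenario can be sketched in a few lines. This is only an illustration under stated assumptions: the reward values (+1 for the correct road, -1 for the wrong one), the function names and the number of trials are all made up here and do not come from the article's program.

```python
import random

# Hypothetical two-road environment: action 0 = first road, action 1 = second road.
# Assumption for illustration: only the second road reaches the destination.
REWARDS = {0: -1.0, 1: 1.0}


def step(action):
    """Return the reward the environment gives for taking `action`."""
    return REWARDS[action]


def learn_by_trial(num_trials=20, seed=0):
    rng = random.Random(seed)
    totals = {0: 0.0, 1: 0.0}  # summed reward per action
    counts = {0: 0, 1: 0}      # how often each action was tried
    for t in range(num_trials):
        # Try each action once first, then keep exploring at random.
        action = t if t < 2 else rng.choice([0, 1])
        totals[action] += step(action)
        counts[action] += 1
    # The learned policy picks the action with the highest average reward.
    averages = {a: totals[a] / counts[a] for a in totals}
    best_action = max(averages, key=averages.get)
    return averages, best_action


averages, best_action = learn_by_trial()
print(averages, best_action)
```

The agent knows nothing about `REWARDS` in advance; it only sees the feedback returned by `step`, which is the point of the trial-and-error loop described above.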

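Here I borrow a figure from Zhihu that shows how the elements of reinforcement learning relate to one another (https://www.zhihu.com/topic/20039099/intro).

Now that the idea is clear, let us see how it is abstracted and expressed in mathematical formulas.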

2.2 Markov Decision Process (MDP)

A Markov decision process consists of five elements:

S: the set of states

A: the set of actions

P: the state transition probabilities

R: the immediate reward

$γ$: the discount factor, $0 \le γ \le 1$, which weights future rewards against immediate ones
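To make the five elements concrete, here is a minimal sketch of an MDP as a plain data structure. The container, its field names and the toy two-state example are assumptions for illustration only; they are not taken from the code later in this article.

```python
from collections import namedtuple

# Container for the five elements (S, A, P, R, gamma); field names are illustrative.
MDP = namedtuple("MDP", ["states", "actions", "transitions", "rewards", "gamma"])

toy_mdp = MDP(
    states=["start", "goal"],
    actions=["stay", "go"],
    # transitions[(s, a)] maps each possible next state to its probability (P)
    transitions={
        ("start", "stay"): {"start": 1.0},
        ("start", "go"): {"goal": 0.8, "start": 0.2},
        ("goal", "stay"): {"goal": 1.0},
        ("goal", "go"): {"goal": 1.0},
    },
    # rewards[(s, a)] is the immediate reward (R)
    rewards={
        ("start", "stay"): -1.0,
        ("start", "go"): -1.0,
        ("goal", "stay"): 0.0,
        ("goal", "go"): 0.0,
    },
    gamma=0.9,
)


def is_valid(mdp):
    """Check that P and R cover every (state, action) pair and each distribution sums to 1."""
    for s in mdp.states:
        for a in mdp.actions:
            if (s, a) not in mdp.rewards:
                return False
            if abs(sum(mdp.transitions[(s, a)].values()) - 1.0) > 1e-9:
                return False
    return 0.0 <= mdp.gamma <= 1.0


print(is_valid(toy_mdp))
```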

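State value function (the formula that scores how much reward a state is worth):

$$V_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \dots \mid S_t = s \right]$$

It is the expected reward that can be obtained from state $s$ at time $t$.

Optimal value function (the largest expected reward over all policies):

$$V_*(s) = \max_\pi V_\pi(s)$$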
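The quantity whose expectation a state value function takes is usually the discounted sum of the rewards that follow. A minimal sketch, assuming a finite list of rewards and a discount factor `gamma` (the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Sum rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ..."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps with a penalty of -1 each and no discounting.
print(discounted_return([-1.0, -1.0, -1.0], gamma=1.0))
# The same idea with discounting: later rewards count for less.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With `gamma=1.0` every reward counts fully, which is the setting used by the grid example later in the article.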

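2.3 Bellman Equation

The Bellman equation is a more general expression of the state value function: the value of the current state is made up of the immediate reward and the value of the next state. A figure borrowed from another author illustrates this:

Assume here that each branch is taken with probability 0.5. The state with value 7.4 has two possible next states: (the red circle) and (0).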

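The red circle in turn leads to three states (the ones with values -1.3, 2.7 and 7.4 in the calculation below), while the state with value 0 has no next state.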

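Value of the state = immediate reward + value of the next state

The calculation shown in the figure is worded somewhat loosely: it does not separate the immediate reward from the value of the next state, which makes it easy to misread.

So the value of the state is 0.5 * (10 + 1) + 0.5 * (0.2 * (-1.3) + 0.4 * 2.7 + 0.4 * 7.4) = 7.39 ≈ 7.4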
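The arithmetic above can be checked directly. The sketch below assumes, as an interpretation of the missing figure, that the branch with reward 10 leads to the state valued 0 and the branch with reward 1 leads to the three-way split, and that the discount factor is 1. The variable names are illustrative.

```python
branch_prob = 0.5  # probability of each of the two branches

# Branch A (assumed): immediate reward 10, next state has value 0.
reward_a, next_value_a = 10.0, 0.0

# Branch B (assumed): immediate reward 1, then three possible next states.
reward_b = 1.0
next_value_b = 0.2 * (-1.3) + 0.4 * 2.7 + 0.4 * 7.4

# Keep the immediate reward and the next-state value separate, then add them.
immediate = branch_prob * (reward_a + reward_b)
future = branch_prob * (next_value_a + next_value_b)
state_value = immediate + future

print(round(state_value, 2))  # 7.39
```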

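Bellman optimality equation:

$$V_*(s) = \max_{a \in A} \left( R(s, a) + γ \sum_{s' \in S} P(s' \mid s, a)\, V_*(s') \right)$$

$V_*(s)$ is the value of the optimal value function in state $s$, that is, the largest cumulative reward the agent can obtain from that state.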
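A single Bellman optimality backup can be sketched as follows. The three-state corridor, its rewards and the function name are assumptions made up for this illustration: state 2 is the goal, every move from another state costs -1, and the discount factor is 1.

```python
def optimal_backup(state, values, transitions, gamma):
    """Largest expected (reward + gamma * value of next state) over all actions."""
    return max(
        sum(prob * (reward + gamma * values[next_state])
            for prob, next_state, reward in outcomes)
        for outcomes in transitions[state].values()
    )


# transitions[state][action] = [(probability, next_state, reward), ...]
corridor = {
    0: {"left": [(1.0, 0, -1.0)], "right": [(1.0, 1, -1.0)]},
    1: {"left": [(1.0, 0, -1.0)], "right": [(1.0, 2, -1.0)]},
    2: {"left": [(1.0, 2, 0.0)], "right": [(1.0, 2, 0.0)]},  # goal: stay, no cost
}

values = [0.0, 0.0, 0.0]
# Repeating the backup over all states drives the values to the optimum.
for _ in range(10):
    for s in corridor:
        values[s] = optimal_backup(s, values, corridor, gamma=1.0)

print(values)
```

Each value ends up as minus the number of steps needed to reach the goal, which is the largest cumulative reward available from that state.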

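2.4 Q-Learning

Q-learning learns the return obtained by taking a particular action in a given state. Going through all possible actions in this way yields the returns for all states, stored in a table Q (the Q-table).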

In fact, each […] == […]. The algorithm that builds the Q-table is:

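1. Initialise the Q-table so that the return for every state (s) is 0;

2. Pick a random state (s) as the starting point;

3. In the current state (s), go through all possible actions (A) in order, one action (a) at a time;

4. Move to the next state;

5. In the new state, select the action (a1) with the largest Q value;

6. Use the Bellman equation to update the value of the corresponding state-action pair, Q(s, a), in the Q-table. Continue with the remaining actions from step 3, repeating steps 3 to 6;

7. Make the new state the current state, then repeat steps 2 to 6 until the goal state is reached;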

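Note that there are two nested loops:

Outer loop: over all states;

Inner loop: over all possible actions in each state;

As an example, the left panel shows the Q-learning process (R: reward), the middle panel shows the Q-table after the update for state = 1, action = 5, and the right panel shows the final Q-table.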
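The two nested loops can be sketched on a small example. Everything specific here is an assumption for illustration: a four-state corridor with the goal at state 3, a reward of 100 for entering the goal, and a discount factor of 0.8. The update shown is the deterministic form Q(s, a) = R + γ · max Q(s', ·), swept over every state-action pair as the steps above describe; Q-learning as usually presented samples transitions while acting and blends each update in with a learning rate.

```python
GAMMA = 0.8        # discount factor (assumed)
N_STATES = 4       # corridor states 0..3
ACTIONS = (0, 1)   # 0 = left, 1 = right
GOAL = 3


def step(state, action):
    """Deterministic move; entering the goal pays 100, everything else pays 0."""
    if state == GOAL:
        return state, 0.0
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 100.0 if next_state == GOAL else 0.0
    return next_state, reward


def build_q_table(sweeps=50):
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]  # step 1: all zeros
    for _ in range(sweeps):
        for s in range(N_STATES):      # outer loop: every state
            for a in ACTIONS:          # inner loop: every action in that state
                next_state, reward = step(s, a)
                # Bellman update: immediate reward + discounted best value afterwards.
                q[s][a] = reward + GAMMA * max(q[next_state])
    return q


q_table = build_q_table()
greedy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES)]
for s, row in enumerate(q_table):
    print(s, [round(v, 1) for v in row])
print(greedy)
```

Reading the largest entry in each row of the finished table gives the action to take in that state, which is how the Q-table turns into a policy.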

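3. Code Implementation and Notes (Python 3.5)

Here is an example to deepen the understanding of how reinforcement learning works. The rules of the game are as follows: the two grey cells are the target exits; the grid has 16 states (including the exits); each state has 4 possible directions (up, down, left, right); the task is to find the best direction from any state towards an exit (see the figure below).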

gridworld.py

import io
import numpy as np
import sys
from gym.envs.toy_text import discrete

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.

    For example, a 4x4 grid looks as follows:

    T  o  o  o
    o  x  o  o
    o  o  o  o
    o  o  o  T

    x is your position and T are the two terminal states.

    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4,4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape

        nS = np.prod(shape)
        nA = 4

        MAX_Y = shape[0]
        MAX_X = shape[1]

        P = {}
        grid = np.arange(nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])

        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            # P[s][a] = (prob, next_state, reward, is_done)
            P[s] = {a : [] for a in range(nA)}

            is_done = lambda s: s == 0 or s == (nS - 1)
            reward = 0.0 if is_done(s) else -1.0

            # We're stuck in a terminal state
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            it.iternext()

        # Initial state distribution is uniform
        isd = np.ones(nS) / nS

        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P

        super(GridworldEnv, self).__init__(nS, nA, P, isd)

    def _render(self, mode='human', close=False):
        """ Renders the current gridworld layout

         For example, a 4x4 grid with the mode="human" looks like:
            T  o  o  o
            o  x  o  o
            o  o  o  o
            o  o  o  T
        where x is your position and T are the two terminal states.
        """
        if close:
            return

        outfile = io.StringIO() if mode == 'ansi' else sys.stdout

        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            if self.s == s:
                output = " x "
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                output = " o "

            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()

            outfile.write(output)

            if x == self.shape[1] - 1:
                outfile.write("\n")

            it.iternext()

ValueIteration.py

import numpy as np
import gridworld as gw


env = gw.GridworldEnv()


# The V table holds, for each state, the largest return over the possible next
# moves (compute the return for each direction, then take the maximum).
# A Q table would hold the return of every possible move in every state.
def value_iteration(env, theta=0.0001, discount_factor=1.0):
    """
    Value Iteration Algorithm.

    Args:
        env: OpenAI environment. env.P represents the transition probabilities of the environment.
        theta: Stopping threshold. If the value of all states changes less than theta
            in one iteration we are done.
        discount_factor: Gamma discount factor.

    Returns:
        A tuple (policy, V) of the optimal policy and the optimal value function.
    """

    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value for all action in a given state.

        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, Vector of length env.nS

        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        # Rules of the game: states 0 and 15 are the exits; for each of the
        # 16 states, find the move that reaches an exit fastest.
        #  0  o  o  o
        #  o  x  o  o
        #  o  o  o  o
        #  o  o  o  15
        #
        # env.P is the transition table: the terminal states 0 and 15 have
        # reward 0, every move from any other state is penalised with -1.
        # Return of each action (up, right, down, left) =
        #     probability * (immediate reward + discount factor * value of the state reached)
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * V[next_state])
        return A

    # Each sweep recomputes the value V of all 16 states.
    # The loop stops once the largest change of any state's value within one
    # sweep is smaller than the hyperparameter theta.
    # At that point V[s] is the largest return achievable from state s.
    V = np.zeros(env.nS)
    while True:
        # Stopping condition
        delta = 0
        # Update each state...
        # There are 16 states and every sweep visits all of them.
        for s in range(env.nS):
            # Do a one-step lookahead to find the best action
            A = one_step_lookahead(s, V)
            best_action_value = np.max(A)
            # Calculate delta across all states seen so far
            delta = max(delta, np.abs(best_action_value - V[s]))
            # Update the value function
            V[s] = best_action_value
        # Check if we can stop
        if delta < theta:
            break

    # Create a deterministic policy using the optimal value function.
    # The policy has shape (16, 4): one row per state, one column per action.
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        # One step lookahead to find the best action for this state
        A = one_step_lookahead(s, V)
        best_action = np.argmax(A)  # index of the largest value
        # Always take the best action
        policy[s, best_action] = 1.0

    return policy, V

policy, v = value_iteration(env)

print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
# argmax over axis=1 returns, for each state, the index of the chosen action
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")
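The listing above depends on `gym.envs.toy_text.discrete`, which may not be present in every Gym release. As a cross-check, here is a dependency-light sketch that rebuilds the same transition table with plain Python and runs the same value iteration on it. The helper names `build_transitions` and `run_value_iteration` are made up for this sketch; the rules (4x4 grid, exits at states 0 and 15, -1 per move, discount factor 1) follow the listing above.

```python
import numpy as np

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3


def build_transitions(shape=(4, 4)):
    """Build P[s][a] = [(prob, next_state, reward, done)] for the grid."""
    max_y, max_x = shape
    n_states = max_y * max_x

    def is_done(s):
        return s == 0 or s == n_states - 1

    P = {}
    for s in range(n_states):
        y, x = divmod(s, max_x)
        if is_done(s):
            # Terminal states loop onto themselves at no cost.
            P[s] = {a: [(1.0, s, 0.0, True)] for a in range(4)}
            continue
        # Moves that would leave the grid keep the agent where it is.
        next_states = {
            UP: s if y == 0 else s - max_x,
            RIGHT: s if x == max_x - 1 else s + 1,
            DOWN: s if y == max_y - 1 else s + max_x,
            LEFT: s if x == 0 else s - 1,
        }
        P[s] = {a: [(1.0, ns, -1.0, is_done(ns))] for a, ns in next_states.items()}
    return P


def run_value_iteration(P, n_states, n_actions=4, theta=1e-4, gamma=1.0):
    V = np.zeros(n_states)

    def lookahead(s):
        # Expected return of each action from state s under the current V.
        return np.array([
            sum(p * (r + gamma * V[ns]) for p, ns, r, _ in P[s][a])
            for a in range(n_actions)
        ])

    while True:
        delta = 0.0
        for s in range(n_states):
            best = lookahead(s).max()
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    greedy = np.array([int(np.argmax(lookahead(s))) for s in range(n_states)])
    return greedy, V


P = build_transitions()
greedy, V = run_value_iteration(P, n_states=16)
print(V.reshape(4, 4))
print(greedy.reshape(4, 4))
```

Each value comes out as minus the number of moves to the nearest exit. Where several directions are equally good, `np.argmax` reports the one with the lowest index.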