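You probably still remember AlphaGo, the program developed by Google, defeating the South Korean Go master Lee Sedol in 2016. At the time, I too was both deeply shaken and frightened by the outcome of that human-versus-machine match. Shaken, because machine intelligence surpassing humans was something we had only seen in science-fiction films, and now that story had really happened in our world. Frightened, because fear is a natural reaction to the unknown; yet it is exactly this fear that gives us the instinct to explore the unknown and to lift the veil on the technology behind AlphaGo. You may have already guessed what that technology is: reinforcement learning (RL).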
Decision process: starting from its initial state, the agent selects an action.
S_t denotes the state at time t.
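Assume here that every probability is 0.5. The state whose value is 7.4 has two successors: the node marked with the red circle and the state whose value is 0.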
Value of the current state = immediate reward + value of the next state
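The way the figure above computes the value is slightly misleading: it does not separate the immediate reward from the value of the next state, which easily causes confusion.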
So the value of that state is
0.5 * (10 + 1) + 0.5 * (0.2 * (-1.3) + 0.4 * 2.7 + 0.4 * 7.4) = 7.39 ≈ 7.4
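The arithmetic can be checked with a few lines of Python. This is only a minimal sketch: it reads the first term as the expected immediate reward and the second as the expected value of the next state, and all variable names are illustrative rather than taken from the original post.

```python
# Probability of each of the two choices (0.5 each, as assumed in the text).
p_choice = 0.5

# Immediate rewards of the two choices (the 10 and the 1 in the formula).
immediate_rewards = [10.0, 1.0]

# Next-state values and the probabilities of reaching them (from the formula).
transition_probs = [0.2, 0.4, 0.4]
next_values = [-1.3, 2.7, 7.4]

# Expected immediate reward: 0.5 * (10 + 1)
expected_reward = p_choice * sum(immediate_rewards)

# Expected next-state value: 0.5 * (0.2 * -1.3 + 0.4 * 2.7 + 0.4 * 7.4)
expected_next_value = p_choice * sum(
    p * v for p, v in zip(transition_probs, next_values)
)

state_value = expected_reward + expected_next_value
print(round(state_value, 2))  # prints 7.39
```

Keeping the two terms in separate variables mirrors the point made in the text: the immediate reward and the next-state value are computed separately and only then added.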
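Q-Learning learns the return obtained by taking a particular action in a given state, then goes through every possible action to obtain the returns for all states, which are stored in the Q-Table.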
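1. Initialize the Q-Table so that the return for every state (s) is 0;
2. Randomly pick a state (s) as the starting point;
3. From all possible actions (A) in the current state (s), go through each action (a) in order;
4. Move to the next state (s');
5. In the new state, pick the action (a1) with the largest Q value;
6. Use the Bellman equation to update the value Q(s, a) of the corresponding state-action pair in the Q-Table. Continue with the remaining actions from step 3 in order, repeating steps 3 - 6;
7. Set the new state as the current state, then repeat steps 3 - 6 until the goal state is reached.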
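The steps above can be sketched in Python roughly as follows. This is a hedged illustration, not the original author's code: the five-state corridor, the reward of 100 for entering the goal, the discount factor of 0.8, and the names `step` and `q_learning` are all assumptions made for the example. The steps do not say which action the agent finally follows, so the sketch moves on with a randomly chosen one.

```python
import numpy as np

N_STATES = 5        # hypothetical corridor with states 0..4
GOAL = 4            # hypothetical goal state
ACTIONS = (-1, +1)  # action 0 = move left, action 1 = move right
GAMMA = 0.8         # assumed discount factor


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    reward = 100.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    q_table = np.zeros((N_STATES, len(ACTIONS)))      # step 1
    for _ in range(episodes):
        state = int(rng.integers(N_STATES))           # step 2
        while state != GOAL:
            for action in range(len(ACTIONS)):        # step 3
                next_state, reward = step(state, action)   # step 4
                best_next = q_table[next_state].max()      # step 5
                # Step 6: Bellman update for a deterministic environment.
                q_table[state, action] = reward + GAMMA * best_next
            # Step 7: follow a randomly chosen action (an assumption).
            state, _ = step(state, int(rng.integers(len(ACTIONS))))
    return q_table


q_table = q_learning()
greedy_actions = q_table.argmax(axis=1)
print(q_table)
print(greedy_actions)
```

Because the environment is deterministic, the update overwrites the old entry instead of blending it with a learning rate; a stochastic environment would normally need one.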
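As an example, the left panel shows the Q-Learning process (R: reward), the middle panel shows the Q-Table produced for state = 1, action = 5, and the right panel shows the final Q-Table.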
gridworld.py (the environment):

    import io
    import sys

    import numpy as np
    # Note: this module only exists in older gym releases.
    from gym.envs.toy_text import discrete

    UP = 0
    RIGHT = 1
    DOWN = 2
    LEFT = 3


    class GridworldEnv(discrete.DiscreteEnv):
        """
        Grid World environment from Sutton's Reinforcement Learning book chapter 4.
        You are an agent on an MxN grid and your goal is to reach the terminal
        state at the top left or the bottom right corner.

        For example, a 4x4 grid looks as follows:

        T  o  o  o
        o  x  o  o
        o  o  o  o
        o  o  o  T

        x is your position and T are the two terminal states.

        You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
        Actions going off the edge leave you in your current state.
        You receive a reward of -1 at each step until you reach a terminal state.
        """

        metadata = {'render.modes': ['human', 'ansi']}

        def __init__(self, shape=[4, 4]):
            if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
                raise ValueError('shape argument must be a list/tuple of length 2')

            self.shape = shape

            nS = np.prod(shape)
            nA = 4

            MAX_Y = shape[0]
            MAX_X = shape[1]

            P = {}
            grid = np.arange(nS).reshape(shape)
            it = np.nditer(grid, flags=['multi_index'])

            while not it.finished:
                s = it.iterindex
                y, x = it.multi_index

                # P[s][a] = (prob, next_state, reward, is_done)
                P[s] = {a: [] for a in range(nA)}

                is_done = lambda s: s == 0 or s == (nS - 1)
                reward = 0.0 if is_done(s) else -1.0

                # We're stuck in a terminal state
                if is_done(s):
                    P[s][UP] = [(1.0, s, reward, True)]
                    P[s][RIGHT] = [(1.0, s, reward, True)]
                    P[s][DOWN] = [(1.0, s, reward, True)]
                    P[s][LEFT] = [(1.0, s, reward, True)]
                # Not a terminal state
                else:
                    ns_up = s if y == 0 else s - MAX_X
                    ns_right = s if x == (MAX_X - 1) else s + 1
                    ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                    ns_left = s if x == 0 else s - 1
                    P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                    P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                    P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                    P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

                it.iternext()

            # Initial state distribution is uniform
            isd = np.ones(nS) / nS

            # We expose the model of the environment for educational purposes
            # This should not be used in any model-free learning algorithm
            self.P = P

            super(GridworldEnv, self).__init__(nS, nA, P, isd)

        def _render(self, mode='human', close=False):
            """ Renders the current gridworld layout

             For example, a 4x4 grid with the mode="human" looks like:
                T  o  o  o
                o  x  o  o
                o  o  o  o
                o  o  o  T
            where x is your position and T are the two terminal states.
            """
            if close:
                return

            outfile = io.StringIO() if mode == 'ansi' else sys.stdout

            grid = np.arange(self.nS).reshape(self.shape)
            it = np.nditer(grid, flags=['multi_index'])
            while not it.finished:
                s = it.iterindex
                y, x = it.multi_index

                if self.s == s:
                    output = " x "
                elif s == 0 or s == self.nS - 1:
                    output = " T "
                else:
                    output = " o "

                if x == 0:
                    output = output.lstrip()
                if x == self.shape[1] - 1:
                    output = output.rstrip()

                outfile.write(output)

                if x == self.shape[1] - 1:
                    outfile.write("\n")

                it.iternext()
Value iteration on this environment:

    import numpy as np
    import gridworld as gw


    env = gw.GridworldEnv()

    # The V table holds, for each state, the maximum return over the next step
    # (compute the return of each direction first, then take the maximum).
    # The Q table holds the return of every possible move in every state.
    def value_iteration(env, theta=0.0001, discount_factor=1.0):
        """
        Value Iteration Algorithm.

        Args:
            env: OpenAI environment. env.P represents the transition probabilities of the environment.
            theta: Stopping threshold. If the value of all states changes less than theta
                in one iteration we are done.
            discount_factor: gamma, the time discount factor.

        Returns:
            A tuple (policy, V) of the optimal policy and the optimal value function.
        """

        def one_step_lookahead(state, V):
            """
            Helper function to calculate the value for all action in a given state.

            Args:
                state: The state to consider (int)
                V: The value to use as an estimator, Vector of length env.nS

            Returns:
                A vector of length env.nA containing the expected value of each action.
            """
            # Game rules: states 0 and 15 are the exits; for each of the
            # 16 states, find the fastest way to reach an exit.
            #   0  o  o  o
            #   o  x  o  o
            #   o  o  o  o
            #   o  o  o  15
            # env.P is the transition table: states 0 and 15 have reward 0,
            # every other state has a penalty of -1.
            # Return of each action (up, right, down, left) =
            #   probability * (immediate reward + discount factor * return of
            #   the next state reached by that action)
            A = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_state, reward, done in env.P[state][a]:
                    A[a] += prob * (reward + discount_factor * V[next_state])
            return A

        # Each sweep updates the return V of all 16 states.
        # The loop ends once the largest difference between a state's best
        # action value and its previous value is below the hyperparameter theta.
        # At that point V holds the best achievable return from each state.
        V = np.zeros(env.nS)
        while True:
            # Stopping condition
            delta = 0
            # Update each state...
            # There are 16 states and all of them are visited in every sweep
            for s in range(env.nS):
                # Do a one-step lookahead to find the best action
                A = one_step_lookahead(s, V)
                best_action_value = np.max(A)
                # Calculate delta across all states seen so far
                delta = max(delta, np.abs(best_action_value - V[s]))
                # Update the value function
                V[s] = best_action_value
            # Check if we can stop
            if delta < theta:
                break

        # The policy has shape (16, 4)
        # Create a deterministic policy using the optimal value function
        policy = np.zeros([env.nS, env.nA])
        for s in range(env.nS):
            # One step lookahead to find the best action for this state
            A = one_step_lookahead(s, V)
            best_action = np.argmax(A)  # index of the maximum value
            # Always take the best action
            policy[s, best_action] = 1.0

        return policy, V

    policy, v = value_iteration(env)

    print("Policy Probability Distribution:")
    print(policy)
    print("")

    print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
    # argmax along axis=1 returns the index of the best action for each state
    print(np.reshape(np.argmax(policy, axis=1), env.shape))
    print("")
The game has 16 states, and each state has 4 possible action directions, so the Q-Table has shape (16, 4). The generated table is shown in the figure below:
Reshaping the best action direction of each state gives a result of shape (4, 4).
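To reproduce these shapes without installing gym, the same 4x4 grid rules can be rebuilt in plain NumPy. The following is a minimal sketch under stated assumptions: the helper names (`next_state`, `reward`, `q_from_v`) are hypothetical, and it updates all states at once per sweep instead of in place as the code above does. On this grid both variants settle on the same values.

```python
import numpy as np

SHAPE = (4, 4)
N_STATES = SHAPE[0] * SHAPE[1]
N_ACTIONS = 4                    # UP=0, RIGHT=1, DOWN=2, LEFT=3
TERMINALS = (0, N_STATES - 1)    # top-left and bottom-right corners


def next_state(s, a):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    if s in TERMINALS:
        return s
    y, x = divmod(s, SHAPE[1])
    if a == 0:
        y = max(y - 1, 0)
    elif a == 1:
        x = min(x + 1, SHAPE[1] - 1)
    elif a == 2:
        y = min(y + 1, SHAPE[0] - 1)
    else:
        x = max(x - 1, 0)
    return y * SHAPE[1] + x


def reward(s):
    """0 in a terminal state, -1 for every step taken elsewhere."""
    return 0.0 if s in TERMINALS else -1.0


def q_from_v(V, gamma=1.0):
    """Q-Table of shape (16, 4): one-step lookahead for every state and action."""
    Q = np.zeros((N_STATES, N_ACTIONS))
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            Q[s, a] = reward(s) + gamma * V[next_state(s, a)]
    return Q


def value_iteration(theta=1e-4, gamma=1.0):
    V = np.zeros(N_STATES)
    while True:
        new_V = q_from_v(V, gamma).max(axis=1)   # best action value per state
        delta = np.abs(new_V - V).max()
        V = new_V
        if delta < theta:
            return V


V = value_iteration()
Q = q_from_v(V)
policy_grid = Q.argmax(axis=1).reshape(SHAPE)

print(Q.shape)            # (16, 4)
print(V.reshape(SHAPE))
print(policy_grid)
```

Each entry of `policy_grid` is an action index (0=up, 1=right, 2=down, 3=left). Where several actions tie, `argmax` returns the lowest index, so other equally good policies exist.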