In the previous article we covered Q-learning. Sarsa and Q-learning are exactly the same when it comes to choosing actions; the difference lies in how they learn.
This time we will use OpenAI Gym's Taxi environment for the demonstration.
Taxi is a taxi-driving game: delivering the passenger to the destination earns +20, every step costs -1, and dropping the passenger off illegally along the way costs -10.
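As a quick sanity check of the environment, here is a minimal sketch (assuming gym with the 'Taxi-v2' environment is installed, as in the code below) that resets it and takes one random step:

import gym

env = gym.make('Taxi-v2')
obervation = env.reset()       # the state is a single integer index
env.render()                   # prints the 5x5 grid with passenger and destination
obervation, reward, done, info = env.step(env.action_space.sample())
print 'reward for one random step: %d' % reward   # usually -1, or -10 for an illegal drop-off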
Sarsa is an online learning algorithm, i.e. on-policy: every Sarsa update is based on the action that has actually been chosen for the next step, whereas Q-learning has not committed to one yet at update time.
Sarsa is comparatively conservative; every one of its updates is based on the next Q(s', a').
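To make that difference concrete, here is a small sketch of the two TD targets side by side (for illustration only; the actual update functions appear later in the article):

import numpy as np

def q_learning_target(q_table, reward, obervation_, gamma):
    # off-policy: pretend the greedy action will be taken in the next state
    return reward + gamma * np.max(q_table[obervation_])

def sarsa_target(q_table, reward, obervation_, action_, gamma):
    # on-policy: use the action action_ that the policy has actually chosen
    return reward + gamma * q_table[obervation_, action_]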
Let's look at the Sarsa algorithm itself.
Looks familiar, doesn't it? Indeed, the difference from Q-learning is tiny.
In its update Q-learning always takes the maximum over action', while Sarsa actually selects the next action and updates with it. In our code this corresponds to:
obervation_, reward, done, info=env.step(action)
action_=choise(obervation_)
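For comparison, a Q-learning episode would only choose the next action after learning, so its learn() never needs action_ (a sketch reusing env, MAX_STEP, choise and the Q-learning version of learn defined further down in this article):

obervation = env.reset()
for i in xrange(MAX_STEP):
    action = choise(obervation)                        # choose only when needed
    obervation_, reward, done, info = env.step(action)
    learn(obervation, action, reward, obervation_)     # no action_ argument here
    obervation = obervation_
    if done:
        break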
First we initialize the game environment.
import gym
import numpy as np

env=gym.make('Taxi-v2')
env.seed(1995)
MAX_STEP=env.spec.timestep_limit
ALPHA=0.01
EPS=1
GAMMA=0.8
TRACE_DACAY=0.9
q_table=np.zeros([env.observation_space.n,env.action_space.n],dtype=np.float32)
eligibility_trace=np.zeros([env.observation_space.n,env.action_space.n],dtype=np.float32)
That's right, Sarsa still needs a Q table to store its experience. Attentive readers will have noticed the extra eligibility_trace variable. What is it for? It records every step taken within an episode and is cleared to zero at the start of each new episode.
Sarsa's action selection is still the same as Q-learning's.
def choise(obervation):
    if np.random.uniform()<EPS:
        action=env.action_space.sample()
    else:
        action=np.argmax(q_table[obervation])
    return action
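Because the training loop below shrinks EPS by 0.001 per episode, exploration essentially stops after about 1000 episodes; after training you can evaluate the learned Q table with a purely greedy rollout, for example:

obervation = env.reset()
total_reward = 0
for i in xrange(MAX_STEP):
    action = np.argmax(q_table[obervation])   # no exploration at evaluation time
    obervation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        break
print 'greedy episode return: %d' % total_reward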
Next comes the core part: the learning itself ^_^
# This is the Q-learning update
def learn(state,action,reward,obervation_):
    # Q-learning: the TD target bootstraps from the best action in the next state
    q_table[state,action]+=ALPHA*(reward+GAMMA*np.max(q_table[obervation_])-q_table[state,action])
# This is the Sarsa update
def learn(state,action,reward,obervation_,action_):
    global q_table,eligibility_trace
    # Sarsa TD error: bootstrap from the action_ that will actually be taken next
    error=reward + GAMMA * q_table[obervation_,action_] - q_table[state, action]
    # mark the state-action pair just visited (clear the row, set this entry to 1)
    eligibility_trace[state]*=0
    eligibility_trace[state][action]=1
    # update every recently visited pair in proportion to its trace
    q_table+=ALPHA*error*eligibility_trace
    # decay all traces before the next step
    eligibility_trace*=GAMMA*TRACE_DACAY
Ta-da! You've surely spotted the differences.
The two lines that reset eligibility_trace[state] and then set eligibility_trace[state][action]=1 mean that every step we go through has its trace set to 1, marking it as an indispensable step on the way to obtaining the reward.
The last line,
eligibility_trace*=GAMMA*TRACE_DACAY
decays the values in eligibility_trace over time: the farther a step is from the one that obtains the reward, the less credit it receives.
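Concretely, with GAMMA=0.8 and TRACE_DACAY=0.9 each later step multiplies a trace by 0.72, so the credit a step receives falls off geometrically (a standalone sketch):

GAMMA = 0.8
TRACE_DACAY = 0.9
trace = 1.0
for k in xrange(1, 6):
    trace *= GAMMA * TRACE_DACAY
    print 'after %d more steps the trace is %.3f' % (k, trace)
# 0.720, 0.518, 0.373, 0.269, 0.193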
Now let's put it all to work!
Below is the complete code; give it a run.
import gym
import numpy as np

env=gym.make('Taxi-v2')
env.seed(1995)
MAX_STEP=env.spec.timestep_limit
ALPHA=0.01
EPS=1
GAMMA=0.8
TRACE_DACAY=0.9

q_table=np.zeros([env.observation_space.n,env.action_space.n],dtype=np.float32)
eligibility_trace=np.zeros([env.observation_space.n,env.action_space.n],dtype=np.float32)

def choise(obervation):
    # epsilon-greedy action selection
    if np.random.uniform()<EPS:
        action=env.action_space.sample()
    else:
        action=np.argmax(q_table[obervation])
    return action

def learn(state,action,reward,obervation_,action_):
    # Sarsa update with eligibility traces
    global q_table,eligibility_trace
    error=reward + GAMMA * q_table[obervation_,action_] - q_table[state, action]
    eligibility_trace[state]*=0
    eligibility_trace[state][action]=1
    q_table+=ALPHA*error*eligibility_trace
    eligibility_trace*=GAMMA*TRACE_DACAY

SCORE=0
for exp in xrange(50000):
    obervation=env.reset()
    EPS-=0.001                  # decay exploration each episode
    action=choise(obervation)
    eligibility_trace*=0        # clear the traces at the start of every episode
    for i in xrange(MAX_STEP):
        # env.render()
        obervation_, reward, done, info=env.step(action)
        action_=choise(obervation_)
        learn(obervation,action,reward,obervation_,action_)
        obervation=obervation_
        action=action_
        SCORE+=reward
        if done:
            break
    if exp % 1000 == 0:
        print 'esp,score (%d,%d)' % (exp, SCORE)
        SCORE = 0
print 'fenshu is %d'%SCORE
Everyone is welcome to learn along ^_^
Finally, here is a plot of the results.
The performance improves noticeably ^_^