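This post trains an AI on top of the FlappyBird implementation from birdbot. That implementation wraps the game in a thin layer, which makes it easy to read the game state from the algorithm. It can display the game window, which helps with debugging and lets you watch what the algorithm is doing. You can also turn off the window and the sound and the game keeps running normally. This mode is meant for the training phase, because it reduces CPU usage.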
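The implementation is based on SarvagyaVaish's Flappy Bird RL.
Q-Learning is a value-based reinforcement learning algorithm.
Q refers to Q(s, a): the expected return of taking action a (a ∈ A) in state s (s ∈ S) at a given moment. The environment responds to each of the agent's actions with a reward. The core idea of the algorithm is therefore to build a Q-table indexed by state and action that stores the Q values, and then to pick, in each state, the action with the largest Q value, that is, the action expected to yield the largest return.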
Q-Table | a1 | a2 |
---|---|---|
s1 | q(s1,a1) | q(s1,a2) |
s2 | q(s2,a1) | q(s2,a2) |
s3 | q(s3,a1) | q(s3,a2) |
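As a minimal sketch, the table above can be held in a nested Python dict, and acting greedily means taking the action with the largest Q value in the current row. The names `Q` and `greedy_action` and all the numbers are hypothetical, chosen only for illustration.

```python
# A Q-table as a nested dict: state -> {action: Q value}.
# States s1..s3 and actions a1, a2 mirror the table above; the values are invented.
Q = {
    "s1": {"a1": 0.5, "a2": -0.2},
    "s2": {"a1": -1.0, "a2": 0.3},
    "s3": {"a1": 0.0, "a2": 0.0},
}


def greedy_action(q_table, state):
    """Return the action with the largest Q value in `state`.

    On a tie, max() keeps the first action in the row's insertion order.
    """
    row = q_table[state]
    return max(row, key=row.get)


for s in sorted(Q):
    print(s, greedy_action(Q, s))
```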
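The update introduces a learning rate α (0 < α ≤ 1), which controls how much of the difference between the old Q value and the newly estimated value is applied: α = 0 keeps the old value unchanged, while α = 1 replaces it entirely.
γ is the discount factor, 0 ≤ γ < 1. With γ = 0 only the immediate reward counts; as γ approaches 1, future rewards weigh more and more. In other words, γ determines how strongly the delay before a reward affects its value.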
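A small sketch of what α and γ do numerically. The helper names `blend` and `discounted_return` are hypothetical; the reward sequence reuses the bot's values (+1 per step alive, -1000 on death) purely as an example.

```python
def blend(old_q, target, alpha):
    """Move old_q toward target by the fraction alpha (0 <= alpha <= 1)."""
    return (1 - alpha) * old_q + alpha * target


def discounted_return(rewards, gamma):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a list of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# alpha = 0 keeps the old value, alpha = 1 replaces it with the new target
print(blend(2.0, 10.0, 0.0))
print(blend(2.0, 10.0, 1.0))
print(blend(2.0, 10.0, 0.7))

# gamma = 0 counts only the first reward; a gamma near 1 lets the later crash dominate
rewards = [1, 1, 1, -1000]
print(discounted_return(rewards, 0.0))
print(discounted_return(rewards, 0.9))
```

With γ = 0.9 the distant -1000 still outweighs the three +1 rewards, which is what lets an agent learn to avoid a crash several steps ahead.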
For a detailed walkthrough of the Q-Learning process, see the following article:
A Painless Q-learning Tutorial
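In each state the bird has two possible actions: tap (flap upward) or do nothing.
The reward scheme depends entirely on whether the bird survives: +1 for every step it stays alive and -1000 when it dies.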
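The action set and reward scheme can be written down directly. This sketch copies the values from the bot's `plan()` method; the names `ACTIONS`, `ALIVE_REWARD`, `DEATH_REWARD`, `reward` and `new_q_row` are hypothetical.

```python
# Action set and reward values as used in the bot's plan() method.
ACTIONS = ("tap", "do_nothing")
ALIVE_REWARD = 1        # every step the bird survives
DEATH_REWARD = -1000    # the step in which the bird dies


def reward(bird_status):
    """Return the reward for a bird status string ('alive' or 'dead')."""
    return DEATH_REWARD if bird_status == "dead" else ALIVE_REWARD


def new_q_row():
    """A fresh Q-table row: both actions start at 0, like the bot's setdefault calls."""
    return {a: 0 for a in ACTIONS}


# a few steps of survival followed by a crash
statuses = ["alive", "alive", "alive", "dead"]
print([reward(s) for s in statuses])
print(new_q_row())
```

The penalty is three orders of magnitude larger than the per-step reward, so a single crash quickly drags down the Q value of the state and action that led to it.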
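Pseudocode:

```
initialize Q = {}
while Q has not converged:
    initialize the bird's position S and start a new game
    while S != dead state:
        use policy π to get action a = π(S)
        play action a, observe the bird's new position S' and the reward R(S, a)
        Q[S, a] ← (1 - α) * Q[S, a] + α * (R(S, a) + γ * max_a' Q[S', a'])   // update Q
        S ← S'
```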
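The pseudocode can be tried end to end on a toy problem. The sketch below is NOT the Flappy Bird bot: it uses a hypothetical 1-D corridor (positions 0 to 5, where 0 plays the role of "death" and 5 is the goal), replaces "until Q converges" with a fixed episode budget, and assumes ε-greedy as the policy π, which the pseudocode leaves unspecified.

```python
import random

# Hypothetical toy environment: positions 0..N. Reaching 0 ends the episode
# with a penalty, reaching N ends it with a bonus, every other step costs 1.
N = 5
ACTIONS = ("left", "right")


def step(s, a):
    """Return (next_state, reward, done) for taking action a in state s."""
    s2 = s - 1 if a == "left" else s + 1
    if s2 == 0:
        return s2, -10.0, True   # "dead" state: episode over
    if s2 == N:
        return s2, 10.0, True    # goal reached: episode over
    return s2, -1.0, False


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {}  # Q[(state, action)] -> value; missing entries count as 0
    for _ in range(episodes):           # stands in for "while Q has not converged"
        s = rng.randint(1, N - 1)       # start a new episode at a random position
        done = False
        while not done:                 # "while S != dead state"
            # epsilon-greedy policy: explore with probability epsilon
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q.get((s, b), 0.0))
            s2, r, done = step(s, a)
            # a terminal state has no future value
            next_max = 0.0 if done else max(Q.get((s2, b), 0.0) for b in ACTIONS)
            old = Q.get((s, a), 0.0)
            # Q[S, a] <- (1 - alpha) * Q[S, a] + alpha * (R + gamma * max Q[S', a'])
            Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * next_max)
            s = s2
    return Q


Q = train()
policy = {s: max(ACTIONS, key=lambda b: Q.get((s, b), 0.0)) for s in range(1, N)}
print(policy)
```

After training, the greedy policy moves right from every interior position, because the discounted bonus at the goal outweighs the step costs.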
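Rearranging the update gives the equivalent incremental form, where V(s') is the value of the best action in the next state:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma V(s') - Q(s,a)\right), \qquad V(s') = \max_{a'} Q(s',a')$$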
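The blended form in the pseudocode and the incremental form are the same update, since (1 - α)Q + α(r + γV) = Q + α(r + γV - Q). A quick numerical check, with hypothetical function names:

```python
import random


def update_blend(q, r, next_v, alpha, gamma):
    # pseudocode form: (1 - alpha) * Q + alpha * (r + gamma * V(s'))
    return (1 - alpha) * q + alpha * (r + gamma * next_v)


def update_incremental(q, r, next_v, alpha, gamma):
    # incremental form: Q + alpha * (r + gamma * V(s') - Q), a step along the TD error
    return q + alpha * (r + gamma * next_v - q)


rng = random.Random(42)
for _ in range(1000):
    args = (rng.uniform(-100, 100), rng.uniform(-1000, 1),
            rng.uniform(-100, 100), rng.random(), rng.random())
    assert abs(update_blend(*args) - update_incremental(*args)) < 1e-9
print("both forms agree")
```

The incremental form is the one the bot's `plan()` method uses, with no explicit γ, which amounts to γ = 1.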
```python
from __future__ import print_function  # keeps print() behaving the same under Python 2

import atexit
import os
import pickle
import random

import pyglet

from pybird.game import Game


class Bot:
    def __init__(self, game):
        self.game = game
        # constants
        self.WINDOW_HEIGHT = Game.WINDOW_HEIGHT
        self.PIPE_WIDTH = Game.PIPE_WIDTH
        # this flag is used to make sure at most one tap during
        # every call of run()
        self.tapped = False

        self.game.play()

        # variables for plan
        self.Q = {}                # Q-table: (h, v) -> {'tap': value, 'do_nothing': value}
        self.alpha = 0.7           # learning rate
        self.explore = 100         # upper bound of the exploration dice roll in plan()
        self.pre_s = (9999, 9999)  # previous state (placeholder before the first step)
        self.pre_a = 'do_nothing'  # previous action

        # the Q-table and training counters are stored in a file next to this script
        self.absolute_path = os.path.split(os.path.realpath(__file__))[0]
        self.memo = os.path.join(self.absolute_path, 'memo')

        if os.path.isfile(self.memo):
            with open(self.memo, 'rb') as f:  # pickle data must be read in binary mode
                _dict = pickle.load(f)
            self.Q = _dict["Q"]
            self.game.record.iters = _dict.get("iters", 0)
            self.game.record.best_iter = _dict.get("best_iter", 0)

        def do_at_exit():
            _dict = {"Q": self.Q,
                     "iters": self.game.record.iters,
                     "best_iter": self.game.record.best_iter}
            with open(self.memo, 'wb') as f:
                pickle.dump(_dict, f)

        # save the learned Q-table when the program exits
        atexit.register(do_at_exit)

    # called from update(), which pyglet schedules every Game.TIME_INTERVAL seconds
    def run(self):
        if self.game.state == 'PLAY':
            self.tapped = False
            # call plan() to execute your plan
            self.plan(self.get_state())
        else:
            # game over: pass one final "dead" state to plan() so the penalty is learned
            state = self.get_state()
            bird_state = list(state['bird'])
            bird_state[2] = 'dead'
            state['bird'] = bird_state
            # do NOT allow tap
            self.tapped = True
            self.plan(state)
            # restart game
            print('iters:', self.game.record.iters,
                  ' score:', self.game.record.get(),
                  'best: ', self.game.record.best_score)
            self.game.record.inc_iters()
            self.game.restart()
            self.game.play()

    # get the state that robot needed
    def get_state(self):
        state = {}
        # bird's position and status (dead or alive)
        state['bird'] = (int(round(self.game.bird.x)),
                         int(round(self.game.bird.y)), 'alive')
        state['pipes'] = []
        # pipes' position
        for i in range(1, len(self.game.pipes), 2):
            p = self.game.pipes[i]
            if p.x < Game.WINDOW_WIDTH:
                # this pair of pipes shows on screen
                x = int(round(p.x))
                y = int(round(p.y))
                state['pipes'].append((x, y))
                state['pipes'].append((x, y - Game.PIPE_HEIGHT_INTERVAL))
        return state

    # simulate the click action, bird will fly higher when tapped
    # It can be called only once every time slice (every execution cycle of plan())
    def tap(self):
        if not self.tapped:
            self.game.bird.jump()
            self.tapped = True

    # That's where the robot actually works
    def plan(self, state):
        x = state['bird'][0]
        y = state['bird'][1]
        # no pipe on screen yet: tap whenever the bird drops below mid-height
        if len(state['pipes']) == 0:
            if y < self.WINDOW_HEIGHT / 2:
                self.tap()
            return
        h, v = 9999, 9999
        # +1 for every step alive, -1000 when the bird dies
        reward = -1000 if state['bird'][2] == 'dead' else 1
        # find the first pipe whose right edge is not yet behind the bird
        for i in range(1, len(state['pipes']), 2):
            p = state['pipes'][i]
            if x <= p[0] + self.PIPE_WIDTH:
                h = p[0] + self.PIPE_WIDTH - x  # horizontal distance to the pipe's right edge
                v = p[1] - y                    # vertical offset to the pipe
                break
        # bucket the distances into 10-pixel cells; // keeps them integers,
        # matching what / did on ints under Python 2
        scale = 10
        h //= scale
        v //= scale
        self.Q.setdefault((h, v), {'tap': 0, 'do_nothing': 0})
        self.Q.setdefault(self.pre_s, {'tap': 0, 'do_nothing': 0})
        tap_v = self.Q[(h, v)]['tap']
        nothing_v = self.Q[(h, v)]['do_nothing']
        # Q-learning update of the previous (state, action) pair; no discount factor, i.e. gamma = 1
        self.Q[self.pre_s][self.pre_a] += self.alpha * (
            reward + max(tap_v, nothing_v) - self.Q[self.pre_s][self.pre_a])
        self.pre_s = (h, v)
        # random.randint(0, 100) never exceeds 100, so with explore = 100
        # this random branch never runs and the bot always acts greedily
        if random.randint(0, self.explore) > 100:
            self.pre_a = "do_nothing" if random.randint(0, 1) else "tap"
        else:
            tap_v = self.Q[self.pre_s]['tap']
            nothing_v = self.Q[self.pre_s]['do_nothing']
            # greedy choice; ties go to do_nothing
            self.pre_a = "do_nothing" if tap_v <= nothing_v else "tap"
        if self.pre_a == 'tap':
            self.tap()


if __name__ == '__main__':
    show_window = True    # set to False to train without rendering (lower CPU usage)
    enable_sound = False
    game = Game()
    game.set_sound(enable_sound)
    bot = Bot(game)

    def update(dt):
        game.update(dt)
        bot.run()

    pyglet.clock.schedule_interval(update, Game.TIME_INTERVAL)

    if show_window:
        window = pyglet.window.Window(Game.WINDOW_WIDTH, Game.WINDOW_HEIGHT, vsync=False)

        @window.event
        def on_draw():
            window.clear()
            game.draw()

    pyglet.app.run()
```
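The state used as the Q-table key in `plan()` is the pair (h, v), bucketed with integer division by `scale = 10`. The standalone helper below is a hypothetical extraction of that step, written so it can be tested without the game; the pixel coordinates are invented.

```python
def discretize(bird_x, bird_y, pipe_x, pipe_y, pipe_width, scale=10):
    """Map raw pixel coordinates to an (h, v) Q-table key, as plan() does.

    h: horizontal distance from the bird to the right edge of the next pipe
    v: vertical offset from the bird to that pipe's y coordinate
    Python's // floors (also for negative values), so every position inside
    the same scale x scale cell maps to the same key.
    """
    h = pipe_x + pipe_width - bird_x
    v = pipe_y - bird_y
    return h // scale, v // scale


# two nearby bird positions share one Q-table row
print(discretize(55, 205, 100, 150, 50))
print(discretize(52, 209, 100, 150, 50))
# a few pixels away, v lands in a different bucket
print(discretize(60, 200, 100, 150, 50))
```

A larger `scale` makes the Q-table smaller and faster to fill, at the cost of merging positions that may need different actions.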
The full code is available in the GitHub repository.