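[PaddlePaddle Developer Talk] Wang Zirui is an undergraduate in the class of 2018 majoring in Automation at the College of Electrical Engineering, Sichuan University. He is a PaddlePaddle Developer Expert (PPDE), a member of the RoboMaster Sichuan University Hotpot team (川大火鍋戰隊), and a reinforcement learning enthusiast.

Super Mario Bros. is a childhood memory shared by several generations, and it grew up alongside us. Today, with the development of deep reinforcement learning, more and more games have been conquered by AI. In this article we take Super Mario Bros. as an example and show how deep reinforcement learning can be used to try to clear the game.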

Download and installation commands

## Install command for the CPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## Install command for the GPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

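Introduction to the Mario game environment

The game environment gives only 3 lives, so the player or the AI has to get through all 32 stages of the game within those 3 lives. The environment provides three control modes of different difficulty: RIGHT_ONLY, SIMPLE_MOVEMENT and COMPLEX_MOVEMENT. To control Mario, you only need to feed the environment the integer that stands for each action.

Mario game environment link:

https://pypi.org/project/gym-super-mario-bros/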

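Introduction to the PPO algorithm

If you know reinforcement learning, you have almost certainly heard of Proximal Policy Optimization (PPO). PPO is a newer Policy Gradient algorithm. Policy Gradient methods are very sensitive to the step size: if a suitable step size is not chosen during training, the new policy can differ too much from the old one, which hurts convergence. PPO proposes a new objective function that allows small updates over multiple training steps, which addresses the difficulty of choosing the step size in Policy Gradient methods.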
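To make the "small updates" idea concrete, here is a minimal pure-Python sketch of the per-sample clipped objective from the PPO paper, min(r * A, clip(r, 1 - eps, 1 + eps) * A). The function name `clipped_surrogate` and the default `epsilon=0.2` are illustrative assumptions, not values taken from this article's code.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective.

    ratio     : pi_new(a|s) / pi_old(a|s)
    advantage : advantage estimate for the sampled action
    epsilon   : clipping range (0.2 is only an illustrative default)
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum keeps the more pessimistic of the two terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: pushing the ratio above 1 + epsilon earns nothing extra.
print(clipped_surrogate(1.5, 2.0))
# Negative advantage: the clipped term is selected once the ratio falls below 1 - epsilon.
print(clipped_surrogate(0.5, -1.0))
```

Once the ratio leaves the interval [1 - epsilon, 1 + epsilon] in the direction that would improve the objective, the result no longer depends on the ratio, so there is no incentive to move the policy further in a single update.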

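PPO paper link:

https://arxiv.org/abs/1707.06347

Implementing PPO with PaddlePaddle 2.0

Before that, let us look at the model structure. The model is an Actor-Critic network, but we simplified it slightly: the Actor and the Critic differ only in the output layer. Because the model processes image input, we put convolutional layers in front of the fully connected layers. Now let us implement PPO with PaddlePaddle 2.0!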

import paddle
from paddle.nn import Conv2D, Layer, Linear, ReLU


class MARIO(Layer):
    """Actor-Critic network: shared conv and FC trunk, separate output heads."""

    def __init__(self, input_num, actions):
        super(MARIO, self).__init__()
        self.num_input = input_num
        self.channels = 32
        self.kernel = 3
        self.stride = 2
        self.padding = 1
        # Flattened size of the last conv feature map: 32 channels x 6 x 6.
        self.fc = 32 * 6 * 6
        self.conv0 = Conv2D(in_channels=input_num,
                            out_channels=self.channels,
                            kernel_size=self.kernel,
                            stride=self.stride,
                            padding=self.padding,
                            dilation=[1, 1],
                            groups=1)
        self.relu0 = ReLU()
        self.conv1 = Conv2D(in_channels=self.channels,
                            out_channels=self.channels,
                            kernel_size=self.kernel,
                            stride=self.stride,
                            padding=self.padding,
                            dilation=[1, 1],
                            groups=1)
        self.relu1 = ReLU()
        self.conv2 = Conv2D(in_channels=self.channels,
                            out_channels=self.channels,
                            kernel_size=self.kernel,
                            stride=self.stride,
                            padding=self.padding,
                            dilation=[1, 1],
                            groups=1)
        self.relu2 = ReLU()
        self.conv3 = Conv2D(in_channels=self.channels,
                            out_channels=self.channels,
                            kernel_size=self.kernel,
                            stride=self.stride,
                            padding=self.padding,
                            dilation=[1, 1],
                            groups=1)
        self.relu3 = ReLU()
        # Shared fully connected layer, followed by the two heads.
        self.linear0 = Linear(in_features=int(self.fc), out_features=512)
        self.linear1 = Linear(in_features=512, out_features=actions)  # Actor head: action logits
        self.linear2 = Linear(in_features=512, out_features=1)  # Critic head: state value

    def forward(self, x):
        x = paddle.to_tensor(data=x)
        x = self.conv0(x)
        x = self.relu0(x)
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.conv3(x)
        x = self.relu3(x)
        # Flatten the feature map to [batch, features].
        x = paddle.reshape(x, [x.shape[0], -1])
        x = self.linear0(x)
        logits = self.linear1(x)
        value = self.linear2(x)
        return logits, value
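The constant `32 * 6 * 6` in the model can be checked with the standard convolution output-size formula. The sketch below assumes an 84 x 84 input frame, which is a common choice but is not stated in this article; the helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output height/width of one convolution layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 84  # assumed input height and width, not given in the article
for layer in range(4):  # the model stacks four identical conv layers
    size = conv_out(size)
    print("after conv%d: %d x %d" % (layer, size, size))

flat_features = 32 * size * size  # 32 output channels
print("flattened features:", flat_features)
```

Under that assumption the spatial size shrinks 84, 42, 21, 11, 6, which matches the `in_features` of `linear0`. A different input resolution would need a different value for `self.fc`.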

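The PPO in this article learns online and can be roughly divided into the following three modules:

Collecting action trajectories
Computing the advantage function
Sampling data and updating the model parameters

Because PPO is a Policy Gradient algorithm, the agent needs to produce a categorical distribution, that is, a vector holding the probability of each action. An action is then chosen according to those probabilities. Finally, the agent interacts with the environment and stores the returned state information and rewards in lists for later use. Part of the code for the trajectory-collection module is shown below: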

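import paddle.nn.functional as F
from paddle.distribution import Categorical

# Excerpt from the training loop: model, envs, curr_states, values and
# old_log_policies are defined in the full source.
for _ in range(num_local_steps):
    logits, value = model(curr_states)
    values.append(value.squeeze())
    policy = F.softmax(logits, axis=1)
    old_m = Categorical(policy)  # build the categorical distribution
    action = old_m.sample([1])  # sample
    old_log_policy = old_m.log_prob(action)
    old_log_policies.append(old_log_policy)
    # Send each sampled action to its environment process.
    for agent_conn, act in zip(envs.agent_conns, action.numpy().astype("int8")):
        agent_conn.send(("step", act))
    # Collect the transition returned by every environment.
    state, reward, done, info = zip(*[agent_conn.recv() for agent_conn in envs.agent_conns])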
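For readers who want to see the sampling step without any framework, the following pure-Python sketch turns logits into probabilities with a softmax and samples one action from the resulting categorical distribution. The helper names `softmax` and `sample_action` are hypothetical; the article itself relies on `F.softmax` and `Categorical` from Paddle.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Sample an action index and return it with its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


rng = random.Random(0)
action, log_prob = sample_action([2.0, 0.5, -1.0], rng)
print(action, log_prob)
```

The log-probability is stored together with the action because the update step later compares it with the log-probability under the new policy.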

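Continuing from the code above, we need to compute the advantage function. Specifically, the advantage is the difference between the return of taking action a in the current state s and the average return of state s. The larger the advantage, the more rewarding action a is, and the higher the probability of taking that action in the same state should be.

Here we use Generalized Advantage Estimation (GAE), a technique used by almost all state-of-the-art Policy Gradient implementations. It is mainly used to correct the values provided by the Critic so that the advantage estimate has low variance while its bias stays small.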

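# gae and R are assumed to start as 0 and [] in the elided setup code.
for value, reward, done in list(zip(values, rewards, dones))[::-1]:
    gae = gae * gamma * tau
    gae = gae + reward + gamma * next_value.detach().numpy() * (1.0 - done) - value.detach().numpy()
    next_value = value
    R.append(paddle.to_tensor(gae + value.detach().numpy()))
# R was filled from the last step to the first, so restore time order before
# concatenating (values is assumed to be concatenated the same way).
R = paddle.concat(R[::-1]).detach()
advantages = R - values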
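The same backward recursion can be written in plain Python for a single environment, which makes it easy to check by hand. This is a sketch that mirrors the loop above under simplifying assumptions (scalar values, one trajectory); the function name `compute_gae` is hypothetical, while `gamma` and `tau` follow the variable names used in this article.

```python
def compute_gae(rewards, values, dones, next_value, gamma, tau):
    """Backward GAE recursion for one trajectory of scalar values.

    Returns (returns, advantages), both in time order.
    """
    gae = 0.0
    returns = []
    for value, reward, done in reversed(list(zip(values, rewards, dones))):
        gae = gae * gamma * tau
        # TD error: reward + discounted next value (zero after a terminal step) - value
        gae = gae + reward + gamma * next_value * (1.0 - done) - value
        next_value = value
        returns.append(gae + value)
    returns.reverse()  # the loop ran from the last step to the first
    advantages = [r - v for r, v in zip(returns, values)]
    return returns, advantages


returns, advantages = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0], dones=[0.0, 0.0],
    next_value=0.0, gamma=0.5, tau=1.0)
print(returns, advantages)
```

With all values set to zero and tau equal to 1, the returns reduce to the ordinary discounted sum of rewards, which is a quick sanity check for the recursion.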

Finally, we update the model parameters with PPO. No KL divergence is computed here; instead, small updates are enforced by clipping.

# Build the optimizer once so that Adam keeps its moment estimates across updates.
clip_grad = paddle.nn.ClipGradByNorm(clip_norm=0.25)
optimizer = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters(), grad_clip=clip_grad)
total_steps = num_local_steps * num_processes
for i in range(num_epochs):
    indice = paddle.randperm(total_steps)
    # Note: batch_size is used here as the number of mini-batches per epoch.
    for j in range(batch_size):
        batch_indices = indice[int(j * (total_steps / batch_size)):
                               int((j + 1) * (total_steps / batch_size))]
        logits, value = model(paddle.gather(states, batch_indices, axis=0))
        new_policy = F.softmax(logits, axis=1)
        new_m = Categorical(new_policy)
        new_log_policy = new_m.log_prob(paddle.gather(actions, batch_indices, axis=0))
        ratio = paddle.exp(new_log_policy - paddle.gather(old_log_policies, batch_indices, axis=0))
        # Use a separate name so the full advantages tensor is not overwritten.
        batch_advantages = paddle.gather(advantages, batch_indices, axis=0)
        # Clipped surrogate objective, kept as tensors so gradients reach the Actor.
        surrogate = paddle.minimum(ratio * batch_advantages,
                                   paddle.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * batch_advantages)
        actor_loss = -paddle.mean(surrogate)
        critic_loss = F.smooth_l1_loss(paddle.gather(R, batch_indices), value)
        entropy_loss = paddle.mean(new_m.entropy())
        total_loss = actor_loss + critic_loss - beta * entropy_loss
        optimizer.clear_grad()
        total_loss.backward()
        optimizer.step()
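The index arithmetic in the update loop is easier to follow in isolation. The sketch below shuffles the collected step indices and cuts them into contiguous chunks using the same slicing formula as the loop above. The function name `minibatch_indices` and the parameter name `num_batches` are hypothetical; `num_batches` plays the role that `batch_size` plays in the article's code.

```python
import random


def minibatch_indices(total, num_batches, rng):
    """Yield num_batches shuffled, non-overlapping index chunks covering range(total)."""
    indices = list(range(total))
    rng.shuffle(indices)
    for j in range(num_batches):
        start = int(j * (total / num_batches))
        end = int((j + 1) * (total / num_batches))
        yield indices[start:end]


batches = list(minibatch_indices(total=10, num_batches=4, rng=random.Random(0)))
print([len(b) for b in batches])
```

When the total is not divisible by the number of chunks, the chunk sizes differ by one, but every collected step is still used exactly once per epoch.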

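Because space is limited, this section only presents the basic idea and some abridged code. Interested readers can look at the source code directly.

Tips for clearing the game

There are many tricks for clearing Mario. Here we mainly offer ideas in three directions:

Preprocessing the raw input images
Redesigning the reward function
Parallel (multi-process) training

Preprocessing the raw input images: simplify the image features and stack 4 consecutive frames as the input, which lets the model capture the dynamics of the game environment.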
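One simple way to stack consecutive frames is a fixed-length queue. The class below is a minimal sketch with a hypothetical name, `FrameStack`; it is not the preprocessing code of this project, which may also resize, convert to grayscale or skip frames.

```python
from collections import deque


class FrameStack:
    """Keep the most recent num_frames frames as one stacked observation."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # At episode start, fill the stack with copies of the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(num_frames=4)
print(stack.reset("frame0"))
print(stack.push("frame1"))
```

Strings stand in for image arrays here. With real frames, the returned list would be stacked along the channel axis, which is why the first convolution layer of the model takes `input_num` channels.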

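Redesigning the reward function: different reward functions encourage different behaviors. For example, raising the reward for stomping enemies makes the model more inclined to stomp them. This article reassigns the weights of the various rewards and also gives a more generous bonus for clearing the stage.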

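# Method of the environment wrapper class (excerpt).
def step(self, action):
    state, reward, done, info = self.env.step(action)
    if self.monitor:
        self.monitor.record(state)
    state = process_frame(state)
    # Add the in-game score gained during this step, scaled down by 40.
    reward += (info["score"] - self.curr_score) / 40.
    self.curr_score = info["score"]
    if done:
        # Bonus for reaching the flag, penalty for any other ending.
        if info["flag_get"]:
            reward += 50
        else:
            reward -= 50
        self.env.reset()
    # Scale the final reward down by 10.
    return state, reward / 10., done, info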
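The reward arithmetic in `step` can be pulled out into a pure function to see what the agent actually receives. The function name `shape_reward` and its parameters are hypothetical; the constants 40, 50 and 10 are the ones used in the code above.

```python
def shape_reward(reward, score, prev_score, done, flag_get):
    """Reproduce the reward shaping of the wrapper's step method."""
    reward += (score - prev_score) / 40.0  # score gained during this step
    if done:
        reward += 50 if flag_get else -50  # flag bonus or failure penalty
    return reward / 10.0


# An ordinary step that gained 100 points of in-game score.
print(shape_reward(1.0, score=200, prev_score=100, done=False, flag_get=False))
# The episode ends at the flag.
print(shape_reward(0.0, score=0, prev_score=0, done=True, flag_get=True))
# The episode ends without reaching the flag.
print(shape_reward(0.0, score=0, prev_score=0, done=True, flag_get=False))
```

After scaling, finishing the stage is worth +5 and failing is worth -5, which is large compared with a typical per-step reward and therefore pushes the policy strongly toward reaching the flag.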

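Parallel (multi-process) training: parallelization can effectively improve training efficiency, and it is also one of the current trends in reinforcement learning. This article implements parallelization with Python's multiprocessing module.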

import multiprocessing as mp

from gym_super_mario_bros.actions import COMPLEX_MOVEMENT, RIGHT_ONLY, SIMPLE_MOVEMENT

# create_train_env is a helper defined in the project source.


class MultipleEnvironments:
    def __init__(self, world, stage, action_type, num_envs, output_path=None):
        # One Pipe per environment: the agent end stays in the main process,
        # the env end is served by a worker process.
        self.agent_conns, self.env_conns = zip(*[mp.Pipe() for _ in range(num_envs)])
        # Select the control mode.
        if action_type == "right":
            actions = RIGHT_ONLY
        elif action_type == "simple":
            actions = SIMPLE_MOVEMENT
        else:
            actions = COMPLEX_MOVEMENT
        # Create the environments.
        self.envs = [create_train_env(world, stage, actions, output_path=output_path) for _ in range(num_envs)]
        self.num_states = self.envs[0].observation_space.shape[0]
        self.num_actions = len(actions)

        # Create one process per environment.
        for index in range(num_envs):
            process = mp.Process(target=self.run, args=(index,))
            process.start()
            self.env_conns[index].close()

    def run(self, index):
        # Worker loop: answer "step" and "reset" requests from the main process.
        self.agent_conns[index].close()
        while True:
            request, action = self.env_conns[index].recv()
            if request == "step":
                self.env_conns[index].send(self.envs[index].step(int(action)))
            elif request == "reset":
                self.env_conns[index].send(self.envs[index].reset())
            else:
                raise NotImplementedError

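Results: this article uses stage 1-1 as the example. The Mario trained in parallel has now officially passed testing. With 8 processes, the trend of the reward Mario obtains during training is shown in the figure below.

Finally, you are warmly invited to watch the complete run through Super Mario Bros. 1-1:

Recap

In this article we first briefly introduced the Super Mario Bros. game environment, then covered some background on the PPO algorithm, and implemented the algorithm with Paddle 2.0.

In addition, this article summarized some tips for clearing the game and, taking the first stage as the example, showed the training process. Finally, we showed the result after training was complete.

Training code project link:

https://aistudio.baidu.com/aistudio/projectdetail/1434971

Walkthrough demo project link:

https://aistudio.baidu.com/aistudio/projectdetail/1434950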


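On February 3, 2021, the author gave a talk in the 飛槳 PaddlePaddle live room on Bilibili (bilibili.com), titled "Easily Clearing Mario with Deep Reinforcement Learning" (《用深度強化學習輕鬆通關馬里奧》). You are welcome to watch.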

This article is also published on the blog "飛槳 PaddlePaddle" (CSDN).
