In the 18th century, Buffon posed the following problem: suppose we have a floor made of parallel, equally spaced strips of wood (as in the figure), and we randomly toss a needle whose length is shorter than the spacing between the strips. What is the probability that the needle crosses one of the lines between strips?
Based on this probability, Buffon proposed a way to compute the value of π: the random needle-throwing method. This is Buffon's needle problem (also translated as the "Buffon needle problem").
Let's look at the steps of the needle-throwing algorithm.
Random simulation (or statistical simulation) has a much cooler alias: the Monte Carlo method (Monte Carlo simulation). The method was developed in the 1940s, closely tied to the Manhattan Project that built the atomic bomb. Several of the top scientists of the time, including Ulam, von Neumann, Fermi, Feynman, and Nicholas Metropolis, began using statistical simulation while studying neutron chain reactions in fissile material at the Los Alamos National Laboratory, and implemented it on the earliest computers.
The modern statistical simulation method was first proposed by the mathematician Ulam and was named the Monte Carlo method by Metropolis. Monte Carlo is a famous casino, and gambling has always been closely tied to statistics, so the name is witty and apt and was quickly and widely adopted.
In the previous section we used Buffon's needle as an example. Throwing the needle amounts to running a large number of random experiments; by observing and recording the outcomes (the visible surface), we infer an unknown random quantity (the hidden truth). In essence, the Buffon needle experiment is a Monte Carlo method.
Methods that, like the needle experiment, estimate an unknown quantity from probability estimates obtained through random experiments are collectively called Monte Carlo methods.
The real world is full of complex processes that contain uncertain random factors and are hard to analyze or describe with a deterministic model. In such situations data scientists struggle to do quantitative analysis: either no analytical result exists, or one exists but its computational cost is too high to be usable. This is when the Monte Carlo method is worth considering.
A typical example: suppose we need to approximate the area of an arbitrary curved region, say the area of a pond. How should we go about it?
Clearly we cannot write down an exact mathematical formula that models a pond of arbitrary shape. We should give up the pursuit of exactness and turn to randomness instead; the world is complex by nature, and perhaps randomness is closer to its true face.
A feasible measurement method is described below.
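A common version of the method (a hedged sketch; the bounding rectangle and the in_region test below are our own illustrative choices): enclose the pond in a rectangle of known area, scatter points uniformly at random over the rectangle, and estimate the pond's area as the rectangle's area times the fraction of points that land inside the pond.

import numpy as np

def estimate_area(in_region, x_range, y_range, n_points=100_000):
    """Estimate the area of an arbitrary region by uniform random sampling.
    in_region(x, y) -> bool decides whether a point falls inside the region."""
    xs = np.random.uniform(x_range[0], x_range[1], n_points)
    ys = np.random.uniform(y_range[0], y_range[1], n_points)
    inside = sum(in_region(x, y) for x, y in zip(xs, ys))
    rect_area = (x_range[1] - x_range[0]) * (y_range[1] - y_range[0])
    return rect_area * inside / n_points

# example: a circular "pond" of radius 1 centred at the origin, true area = pi
print(estimate_area(lambda x, y: x * x + y * y <= 1.0, (-1.0, 1.0), (-1.0, 1.0)))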
Relevant Link:
https://baike.baidu.com/item/%E8%92%B2%E4%B8%B0%E6%8A%95%E9%92%88%E9%97%AE%E9%A2%98/10876943?fr=aladdin
https://cosx.org/2013/01/lda-math-mcmc-and-gibbs-sampling/
Monte Carlo tree search (MCTS) is a Monte Carlo method built on a tree structure. So-called Monte Carlo tree search uses the Monte Carlo method to search heuristically through the full 2^N space (where N is the number of decisions, i.e. the tree depth), and uses feedback to find the optimal path through the tree (a feasible solution). In a nutshell, MCTS is a rule-driven heuristic random search algorithm.
That sentence actually spells out the five core components of MCTS:
The core idea of the algorithm can be summed up in one sentence: find the best balance between convergence in a definite direction (the accuracy of tree search) and randomness (the generality of random simulation). This embodies the spirit of a Nash equilibrium between the two.
As we know, MCTS search is the process of building up a tree. Monte Carlo tree search can roughly be divided into four steps: Selection, Expansion, Simulation, and Backpropagation. Let's analyse them one by one.
At the very beginning the search tree has only a single node, the root. Each node in the search tree carries three basic pieces of information:
In the selection phase, starting from a parent node (the first selection starts from the root), i.e. the position R for which a decision has to be made, we walk downward and pick out the node N that most urgently needs to be expanded, in other words we decide toward which child the tree should grow.
For the position being examined, there are three possibilities:
The visit count of every examined node is incremented during each selection phase.
When the selection phase ends, we have found the node N that most urgently needs expanding, together with one of its as yet untried actions A. A new node Nn is created in the search tree as a new child of N; the position of Nn is the position reached by playing action A from node N.
To give the new node Nn an initial score, we let the game play out at random from Nn until a final result is reached, and that result becomes Nn's initial score. Usually win/loss is used as the score, so it is simply 1 or 0.
Once the simulation from Nn finishes, its parent N and every node on the path from the root down to N adjust their accumulated scores according to the result of this simulation. Note that if the selection phase happened to reach a finished game directly, the scores are updated according to that result instead.
Each iteration expands the search tree, so the tree keeps growing as the iterations accumulate. Once a given number of iterations or a time budget is reached, the search stops and the best child of the root node is chosen as the decision for this move.
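To make the four phases concrete, here is a small self-contained UCT sketch on a toy game (Nim: players alternately take 1 to 3 counters, whoever takes the last counter wins). This is our own illustrative example and is independent of the Gomoku-style code given later in the article.

import math
import random

# Toy game: state = (remaining_counters, player_to_move); a move takes 1-3 counters.
def legal_moves(state):
    remaining, _ = state
    return list(range(1, min(3, remaining) + 1))

def do_move(state, move):
    remaining, player = state
    return (remaining - move, 1 - player)

def winner(state):
    # Only called on terminal states (remaining == 0): the player who just
    # moved (i.e. NOT the player to move) took the last counter and wins.
    return 1 - state[1]

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.untried = [], legal_moves(state)
        self.visits, self.wins = 0, 0.0   # wins counted for the player who moved here

    def ucb_child(self, c=1.41):
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, n_iter=2000):
    root = Node(root_state)
    for _ in range(n_iter):
        node, state = root, root_state
        # 1. Selection: descend by UCB while the node is fully expanded
        while not node.untried and node.children:
            node = node.ucb_child()
            state = do_move(state, node.move)
        # 2. Expansion: add one untried action as a new child
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            state = do_move(state, move)
            child = Node(state, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout until the game ends
        while legal_moves(state):
            state = do_move(state, random.choice(legal_moves(state)))
        win_player = winner(state)
        # 4. Backpropagation: update visit counts and scores up to the root
        while node is not None:
            node.visits += 1
            if node.parent is not None and win_player == node.parent.state[1]:
                node.wins += 1   # credit the player who made the move into this node
            node = node.parent
    # the decision: the most-visited move at the root
    return max(root.children, key=lambda ch: ch.visits).move

print(mcts((10, 0)))   # taking 2 (leaving a multiple of 4) is optimal; MCTS should usually find it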
The selection phase scores nodes with the UCB (Upper Confidence Bound) formula $\mathrm{UCB}_i = v_i + C\sqrt{\ln N / n_i}$, where $v_i$ is the estimated value of the node, $n_i$ is the number of times the node has been visited, $N$ is the total number of times its parent has been visited, and $C$ is a tunable parameter.
The UCB formula reinforces convergence on nodes that are already known to pay off, while encouraging tentative exploration of nodes that have rarely been visited. It is a dynamically balanced formula.
Each node's value estimate is continually updated through random simulation, so a node has to be visited a number of times before its estimate becomes trustworthy; this is simply what random statistics demand (with a large number of samples, frequency approximates probability).
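Written as code, the UCB score of a child node might look like this (a small illustrative helper following the formula above, not taken from any particular library):

import math

def ucb_score(v_i, n_i, N, C=1.41):
    """v_i: estimated value of the node, n_i: visits of the node,
    N: visits of its parent, C: exploration constant."""
    if n_i == 0:
        return float("inf")   # unvisited children are always tried first
    return v_i + C * math.sqrt(math.log(N) / n_i)

print(ucb_score(0.6, 10, 100))   # value term plus a shrinking exploration bonus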
In theory, MCTS estimates are not very reliable at the beginning of a search, but given sufficient time they converge to more reliable estimates, and with unlimited time they reach the optimal estimate.
This makes MCTS especially well suited to games with very large branching factors, such as 19×19 Go. Such a huge combinatorial space causes problems for standard depth-first or breadth-first search methods, whereas MCTS selectively searches deeply in promising directions while selectively abandoning directions that are clearly hopeless.
In this section we look at the code of a simple MCTS game-playing implementation. I will first describe each of the main modules, and give the complete runnable code at the end.
class TreeNode(object):
    """A node in the MCTS tree. Each node keeps track of its own value Q,
    prior probability P, and its visit-count-adjusted prior score u.
    """

    def __init__(self, parent, prior_p):
        self._parent = parent
        self._children = {}  # a map from action to TreeNode
        self._n_visits = 0
        self._Q = 0
        self._u = 0
        self._P = prior_p
The TreeNode class initializes a few values: the parent node, the children, the node's visit count, the Q value and the u value, and the prior probability. It also defines the selection/evaluation functions that decide which child node to grow toward next:
    def select(self, c_puct):
        return max(self._children.items(),
                   key=lambda act_node: act_node[1].get_value(c_puct))

    def get_value(self, c_puct):
        self._u = c_puct * self._P * np.sqrt(self._parent._n_visits) / (1 + self._n_visits)
        return self._Q + self._u
The select function uses each action's (i.e. each child node's) UCB-style score and picks the best action as the direction in which to grow the next child node.
The input parameter action_priors of expand() is a list of all legal actions, i.e. the places where a stone may currently be played. This function expands the current node with child nodes.
    def expand(self, action_priors):
        """Expand tree by creating new children.
        action_priors -- output from policy function - a list of tuples of
        actions and their prior probability according to the policy function.
        """
        for action, prob in action_priors:
            if action not in self._children:
                self._children[action] = TreeNode(self, prob)
Next, the basic game-playing search class is implemented:
class MCTS(object):
    def __init__(self, policy_value_fn, c_puct=5, n_playout=10000):
        self._root = TreeNode(None, 1.0)
        self._policy = policy_value_fn
        self._c_puct = c_puct
        self._n_playout = n_playout

    def _playout(self, state):
        node = self._root
        while True:
            if node.is_leaf():
                break
            # Greedily select next move.
            action, node = node.select(self._c_puct)
            state.do_move(action)
        # Evaluate the leaf using a network which outputs a list of
        # (action, probability) tuples p and also a score v in [-1, 1]
        # for the current player.
        action_probs, leaf_value = self._policy(state)
        # Check for end of game.
        end, winner = state.game_end()
        if not end:
            node.expand(action_probs)
        else:
            # for end state, return the "true" leaf_value
            if winner == -1:  # tie
                leaf_value = 0.0
            else:
                leaf_value = 1.0 if winner == state.get_current_player() else -1.0
        # Update value and visit count of nodes in this traversal.
        node.update_recursive(-leaf_value)

    def get_move_probs(self, state, temp=1e-3):
        for n in range(self._n_playout):
            state_copy = copy.deepcopy(state)
            self._playout(state_copy)
        act_visits = [(act, node._n_visits)
                      for act, node in self._root._children.items()]
        acts, visits = zip(*act_visits)
        act_probs = softmax(1.0 / temp * np.log(visits))
        return acts, act_probs

    def update_with_move(self, last_move):
        if last_move in self._root._children:
            self._root = self._root._children[last_move]
            self._root._parent = None
        else:
            self._root = TreeNode(None, 1.0)

    def __str__(self):
        return "MCTS"
The MCTS class takes three initial input parameters: policy_value_fn (a function that, given a board state, returns (action, probability) pairs and a score in [-1, 1] for the current player), c_puct (a constant controlling the trade-off between the value Q and the prior probability P), and n_playout (the number of playouts run per move).
_playout(self, state):
This function has one input parameter, state, which represents the current state.
Its job is the simulation step: it plays the game from the current state, greedily following a single path all the way down to a leaf node, and then checks whether the game has ended. If it has not, the node is expanded; otherwise the result is backed up, updating the values of the leaf and all of its ancestors.
get_move_probs(self, state, temp):
Its job is to obtain, starting from the current state, all feasible moves together with their probabilities. In other words, given the board state and the code introduced above, it tells you the computed winning rate of playing at each point on the board. With it, we can let the computer learn to play.
update_with_move(self, last_move):
During self-play, the MCTS subtree is stepped forward after every move (so the accumulated statistics are reused).
When playing against a human player, the tree is reset at every turn.
The evaluation of a child node is backpropagated to update its parents; with each further step of propagation, the influence of the value coming from the original child node gradually weakens.
    def update(self, leaf_value):
        # Count visit.
        self._n_visits += 1
        # Update Q, a running average of values for all visits.
        self._Q += 1.0 * (leaf_value - self._Q) / self._n_visits

    def update_recursive(self, leaf_value):
        # If it is not root, this node's parent should be updated first.
        if self._parent:
            self._parent.update_recursive(-leaf_value)
        self.update(leaf_value)
update_recursive() performs the backup: starting from this node it recursively updates all of its ancestors, updating each parent before the node itself and flipping the sign of the value at every level because the players alternate.
class MCTSPlayer(object):
    """AI player based on MCTS"""

    def __init__(self, policy_value_function, c_puct=5, n_playout=2000, is_selfplay=0):
        self.mcts = MCTS(policy_value_function, c_puct, n_playout)
        self._is_selfplay = is_selfplay

    def set_player_ind(self, p):
        self.player = p

    def reset_player(self):
        self.mcts.update_with_move(-1)

    def get_action(self, board, temp=1e-3, return_prob=0):
        sensible_moves = board.availables
        # the pi vector returned by MCTS as in the AlphaGo Zero paper
        move_probs = np.zeros(board.width * board.height)
        if len(sensible_moves) > 0:
            acts, probs = self.mcts.get_move_probs(board, temp)
            move_probs[list(acts)] = probs
            if self._is_selfplay:
                # add Dirichlet noise for exploration (needed for self-play training)
                move = np.random.choice(
                    acts,
                    p=0.75 * probs + 0.25 * np.random.dirichlet(0.3 * np.ones(len(probs)))
                )
                # update the root node and reuse the search tree
                self.mcts.update_with_move(move)
            else:
                # with the default temp=1e-3, this is almost equivalent to
                # choosing the move with the highest prob
                move = np.random.choice(acts, p=probs)
                # reset the root node
                self.mcts.update_with_move(-1)
            if return_prob:
                return move, move_probs
            else:
                return move
        else:
            print("WARNING: the board is full")
The main functionality of the MCTSPlayer class is implemented in get_action(self, board, temp=1e-3, return_prob=0). During self-play a certain amount of exploration noise is added, which is used for training; when playing against a human, it always chooses the optimal move.
# -*- coding: utf-8 -*-
import numpy as np
import copy


def softmax(x):
    probs = np.exp(x - np.max(x))
    probs /= np.sum(probs)
    return probs


class TreeNode(object):
    """A node in the MCTS tree. Each node keeps track of its own value Q,
    prior probability P, and its visit-count-adjusted prior score u.
    """

    def __init__(self, parent, prior_p):
        self._parent = parent
        self._children = {}  # a map from action to TreeNode
        self._n_visits = 0
        self._Q = 0
        self._u = 0
        self._P = prior_p

    def expand(self, action_priors):
        """Expand tree by creating new children.
        action_priors -- output from policy function - a list of tuples of
        actions and their prior probability according to the policy function.
        """
        for action, prob in action_priors:
            if action not in self._children:
                self._children[action] = TreeNode(self, prob)

    def select(self, c_puct):
        """Select action among children that gives maximum action value, Q
        plus bonus u(P).
        Returns:
        A tuple of (action, next_node)
        """
        return max(self._children.items(),
                   key=lambda act_node: act_node[1].get_value(c_puct))

    def update(self, leaf_value):
        """Update node values from leaf evaluation."""
        # Count visit.
        self._n_visits += 1
        # Update Q, a running average of values for all visits.
        self._Q += 1.0 * (leaf_value - self._Q) / self._n_visits

    def update_recursive(self, leaf_value):
        """Like a call to update(), but applied recursively for all ancestors."""
        # If it is not root, this node's parent should be updated first.
        if self._parent:
            self._parent.update_recursive(-leaf_value)
        self.update(leaf_value)

    def get_value(self, c_puct):
        """Calculate and return the value for this node: a combination of
        leaf evaluations, Q, and this node's prior adjusted for its visit
        count, u.
        c_puct -- a number in (0, inf) controlling the relative impact of
        values, Q, and prior probability, P, on this node's score.
        """
        self._u = c_puct * self._P * np.sqrt(self._parent._n_visits) / (1 + self._n_visits)
        return self._Q + self._u

    def is_leaf(self):
        """Check if leaf node (i.e. no nodes below this have been expanded)."""
        return self._children == {}

    def is_root(self):
        return self._parent is None


class MCTS(object):
    """A simple implementation of Monte Carlo Tree Search."""

    def __init__(self, policy_value_fn, c_puct=5, n_playout=10000):
        """Arguments:
        policy_value_fn -- a function that takes in a board state and outputs
            a list of (action, probability) tuples and also a score in [-1, 1]
            (i.e. the expected value of the end game score from the current
            player's perspective) for the current player.
        c_puct -- a number in (0, inf) that controls how quickly exploration
            converges to the maximum-value policy, where a higher value means
            relying on the prior more.
        """
        self._root = TreeNode(None, 1.0)
        self._policy = policy_value_fn
        self._c_puct = c_puct
        self._n_playout = n_playout

    def _playout(self, state):
        """Run a single playout from the root to the leaf, getting a value at
        the leaf and propagating it back through its parents. State is
        modified in-place, so a copy must be provided.
        Arguments:
        state -- a copy of the state.
        """
        node = self._root
        while True:
            if node.is_leaf():
                break
            # Greedily select next move.
            action, node = node.select(self._c_puct)
            state.do_move(action)
        # Evaluate the leaf using a network which outputs a list of
        # (action, probability) tuples p and also a score v in [-1, 1]
        # for the current player.
        action_probs, leaf_value = self._policy(state)
        # Check for end of game.
        end, winner = state.game_end()
        if not end:
            node.expand(action_probs)
        else:
            # for end state, return the "true" leaf_value
            if winner == -1:  # tie
                leaf_value = 0.0
            else:
                leaf_value = 1.0 if winner == state.get_current_player() else -1.0
        # Update value and visit count of nodes in this traversal.
        node.update_recursive(-leaf_value)

    def get_move_probs(self, state, temp=1e-3):
        """Runs all playouts sequentially and returns the available actions
        and their corresponding probabilities.
        Arguments:
        state -- the current state, including both game state and the current player.
        temp -- temperature parameter in (0, 1] that controls the level of exploration
        Returns:
        the available actions and the corresponding probabilities
        """
        for n in range(self._n_playout):
            state_copy = copy.deepcopy(state)
            self._playout(state_copy)
        # calc the move probabilities based on the visit counts at the root node
        act_visits = [(act, node._n_visits)
                      for act, node in self._root._children.items()]
        acts, visits = zip(*act_visits)
        act_probs = softmax(1.0 / temp * np.log(visits))
        return acts, act_probs

    def update_with_move(self, last_move):
        """Step forward in the tree, keeping everything we already know about
        the subtree.
        """
        if last_move in self._root._children:
            self._root = self._root._children[last_move]
            self._root._parent = None
        else:
            self._root = TreeNode(None, 1.0)

    def __str__(self):
        return "MCTS"


class MCTSPlayer(object):
    """AI player based on MCTS"""

    def __init__(self, policy_value_function, c_puct=5, n_playout=2000, is_selfplay=0):
        self.mcts = MCTS(policy_value_function, c_puct, n_playout)
        self._is_selfplay = is_selfplay

    def set_player_ind(self, p):
        self.player = p

    def reset_player(self):
        self.mcts.update_with_move(-1)

    def get_action(self, board, temp=1e-3, return_prob=0):
        sensible_moves = board.availables
        # the pi vector returned by MCTS as in the AlphaGo Zero paper
        move_probs = np.zeros(board.width * board.height)
        if len(sensible_moves) > 0:
            acts, probs = self.mcts.get_move_probs(board, temp)
            move_probs[list(acts)] = probs
            if self._is_selfplay:
                # add Dirichlet noise for exploration (needed for
                # self-play training)
                move = np.random.choice(
                    acts,
                    p=0.75 * probs + 0.25 * np.random.dirichlet(0.3 * np.ones(len(probs)))
                )
                # update the root node and reuse the search tree
                self.mcts.update_with_move(move)
            else:
                # with the default temp=1e-3, it is almost equivalent
                # to choosing the move with the highest prob
                move = np.random.choice(acts, p=probs)
                # reset the root node
                self.mcts.update_with_move(-1)
            # location = board.move_to_location(move)
            # print("AI move: %d,%d\n" % (location[0], location[1]))
            if return_prob:
                return move, move_probs
            else:
                return move
        else:
            print("WARNING: the board is full")
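As a rough idea of how these classes are wired together, here is a hypothetical driver. The Board class is not defined in the code above; it stands for whatever game-state object the environment supplies, and is assumed to expose availables, width, height, do_move(), game_end() and get_current_player():

# A hypothetical driver sketch (Board is a stand-in, not part of the code above).

def uniform_policy_value_fn(board):
    """Placeholder policy-value function: uniform priors over legal moves
    and a neutral value of 0 for the position."""
    probs = np.ones(len(board.availables)) / len(board.availables)
    return zip(board.availables, probs), 0.0

board = Board(width=8, height=8)                      # hypothetical game-state object
player = MCTSPlayer(uniform_policy_value_fn, c_puct=5, n_playout=400)

while True:
    move = player.get_action(board)                   # runs n_playout simulations
    board.do_move(move)
    end, winner = board.game_end()
    if end:
        break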
Relevant Link:
https://www.zhihu.com/question/39916945
https://zhuanlan.zhihu.com/p/30316076?group_id=904839486052737024
[1] Browne C B, Powley E, Whitehouse D, et al. A Survey of Monte Carlo Tree Search Methods[J]. IEEE Transactions on Computational Intelligence and AI in Games, 2012, 4(1): 1-43.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem," Mach. Learn., vol. 47, no. 2, pp. 235-256, 2002.
https://blog.csdn.net/windowsyun/article/details/88770799
https://blog.csdn.net/weixin_39878297/article/details/85235694
As we know, playing a game like Go is essentially a Markov decision process (MDP): based on the current board position, decide the next move. The problem is: which move should we play so that the probability of eventually winning stays high?
For this question, the human Go world has evolved many schools of play, for example:
The "wild brawling" style: representative player Gu Li.
The "zombie" style: representative player nicknamed "Shitou" ("the Stone"), who made his name in a Toyota Cup battle against Chang Hao.
The orthodox Korean school: it traces back to Seo Bong-soo in the early 1990s. When someone asked Kato Masao which player he thought was the best at fighting, he answered Seo Bong-soo; asked why not Cho Hunhyun or Yoo Changhyuk, he replied that Seo's game was the purest, without a trace of impurity, and best represented the typical Korean style.
The "chase-and-kill" style: Choi Cheolhan. Its forerunner was Yoo Changhyuk, but Choi went even further. The style's trademark is to raise the big stick from the opening and hunt stones all over the board; opponents who get agitated and fight back head-on play right into its hands, the textbook case being "Li Ma", which made Choi the first flag-bearer of the chase-and-kill school.
The "mian-mian" (stingy) style: representative player Chang Hao. As the name suggests, it favours penny-pinching, small-scale play, the stereotypical Shanghainese manner.
The "rooting pig" style: representative player Luo Xihe. The style digs three feet into the ground; at the Samsung Cup, Luo rooted his way forward, head down and eyes averted, playing only along the lower lines, and it worked remarkably well and left the Korean players quite uncomfortable. It should be said, though, that this school is also an orthodox Chinese one.
The "cosmic" style: its founder was Takemiya Masaki, with "Mumu" and Li Zhe as successors. This is an old school, named by Takemiya Masaki himself; typical representatives named here are Kobayashi Koichi and Cho Chikun, crawling along the two parallel edges of the board, a vivid image.
These schools are steeped in domain apriorism: they are guiding methodologies that individual domain experts distilled from their own long years of practical play.
Now let's switch perspective and try to tackle Go with a modern computational mindset. The most obvious idea is to enumerate every possible continuation, compute the winning probability of each move, and simply pick the move with the highest probability:
But for Go, the state space is simply too large to enumerate exhaustively.
This is where Monte Carlo tree search comes in with heuristic search. "Heuristic" here means an optimization strategy built on small, tentative probes: move in a random direction, keep going if it works out, and back off if it does not.
To put it colloquially: "just take a couple of steps and see."
Relevant Link:
https://link.zhihu.com/?target=https%3A//mp.weixin.qq.com/s%3F__biz%3DMzA5MDE2MjQ0OQ%3D%3D%26mid%3D2652786766%26idx%3D1%26sn%3Dbf6f3189e4a16b9f71f985392c9dc70b%26chksm%3D8be52430bc92ad2644838a9728d808d000286fb9ca7ced056392f1210300286f63bd991bde84%23rd
https://zhuanlan.zhihu.com/p/25345778