Reinforcement Learning Basics: Monte Carlo and Temporal Difference

Date: 2019-12-08

Tags: reinforcement learning, basics, temporal difference

Problem 1: estimate the state-value function $v_{\pi}$ corresponding to a policy $\pi$
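- First-visit MC estimates $v_{\pi}(s)$ as the average of the returns following only the first visit to $s$ in each episode
- Every-visit MC estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$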
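
The difference between the two estimators can be illustrated with a minimal sketch. The function name `mc_prediction` and the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) are assumptions made for this example, not something the original post specifies. Keying the tables by `(state, action)` instead of `state` would give the corresponding estimate of $q_{\pi}(s,a)$.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate v_pi(s) by averaging sampled returns.

    episodes: list of episodes; each episode is a list of (state, reward)
    pairs (hypothetical format chosen for this sketch).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit MC ignores later visits to this state
            seen.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

For a single episode that visits A, B, A with rewards 1, 0, 2 and `gamma=1.0`, the first-visit estimate of the value of A is 3, while the every-visit estimate is 2.5, because the second visit contributes its own return of 2.
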
Problem 2 (right figure): estimate the action-value function $q_{\pi}$
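- First-visit MC estimates $q_{\pi}(s,a)$ as the average of the returns following only the first visit to the pair $(s,a)$ in each episode
- Every-visit MC estimates $q_{\pi}(s,a)$ as the average of the returns following all visits to the pair $(s,a)$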

Problem 3 (left figure): get the optimal policy $\pi_*$
- relationship between the mean and individual return: $\bar{Q}_k=\frac{\sum_{i=1}^kG_i}{k}=\bar{Q}_{k-1}+\frac{1}{k}(G_k-\bar{Q}_{k-1})$
- $\epsilon$-greedy: Exploration vs Exploitation
  - with probability $1-\epsilon$, select the greedy action ${\pi}(s)=\arg \max _{a \in \mathcal{A}(s)} Q(s, a)$ (Exploitation)
  - with probability $\epsilon$, select an action (uniformly) at random ${\pi}(a|s)=\frac{1}{|\mathcal{A}(s)|}$ (Exploration)
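
The incremental mean and the $\epsilon$-greedy rule above can be sketched in a few lines. The function names and the representation of $Q(s,\cdot)$ as a plain list indexed by action are assumptions made for illustration only.

```python
import random


def incremental_mean(old_mean, k, new_return):
    """Update the mean of k-1 returns with the k-th return G_k."""
    return old_mean + (new_return - old_mean) / k


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q(s, a) values for one state."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the largest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Note that the random branch can also pick the greedy action, so under this scheme the greedy action is selected with total probability $1-\epsilon+\frac{\epsilon}{|\mathcal{A}(s)|}$.
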
Problem 4 (right figure): modify the algorithm to put more weight on the most recent returns
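
One common way to weight recent returns more heavily is to replace the $\frac{1}{k}$ step size of the running mean with a constant step size $\alpha$. The sketch below assumes that this is the modification intended; the function names are hypothetical.

```python
def constant_alpha_update(q, new_return, alpha):
    """Move the estimate a fixed fraction alpha toward the latest return."""
    return q + alpha * (new_return - q)


def run_updates(returns, alpha, q0=0.0):
    """Apply the constant-alpha update to a sequence of returns in order."""
    q = q0
    for g in returns:
        q = constant_alpha_update(q, g, alpha)
    return q
```

With `alpha=0.5` and an initial estimate of 0, the returns `[0, 0, 10]` give an estimate of 5, whereas `[10, 0, 0]` give 1.25: the same return counts for more when it arrives last, because the weight on each older return shrinks by a factor of $1-\alpha$ at every update.
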

Solution method: Temporal-Difference learning

Whereas Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value function estimate, temporal-difference (TD) methods update the value function after every time step.

Problem 1 (left figure): estimate the state-value function $v_{\pi}$ (the estimation of $q_{\pi}$ is similar)
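
A minimal sketch of the one-step TD update for $v_{\pi}$, usually called TD(0), is given here under the assumption that the value function is stored as a dictionary. The update moves $V(S_t)$ toward the target $R_{t+1}+\gamma V(S_{t+1})$; the function name, argument names, and default step size are hypothetical.

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """Apply one TD(0) update to the value table V (a dict) in place."""
    # The value of a terminal next state is taken to be zero.
    next_value = 0.0 if done else V.get(next_state, 0.0)
    target = reward + gamma * next_value
    current = V.get(state, 0.0)
    V[state] = current + alpha * (target - current)
    return V[state]
```

Unlike the MC estimate, this update needs only a single transition `(state, reward, next_state)`, so it can be applied after every time step.
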
Problem 2 (right figure): get the optimal action-value function $q_*$
- On-policy: the agent interacts with the environment by following the same policy $\pi$ that it seeks to evaluate (or improve)
- Sarsa(0) is an on-policy method
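
The Sarsa(0) update can be sketched as follows, assuming a dictionary keyed by `(state, action)` pairs. The target uses the action `next_action` that the agent actually selects in the next state under its current policy (for example an $\epsilon$-greedy one), which is what makes the method on-policy. All names and defaults here are hypothetical.

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, done,
                 alpha=0.1, gamma=1.0):
    """Apply one Sarsa(0) update to the action-value table Q in place."""
    # Bootstrap from the action that will really be taken next.
    next_q = 0.0 if done else Q.get((next_state, next_action), 0.0)
    target = reward + gamma * next_q
    current = Q.get((state, action), 0.0)
    Q[(state, action)] = current + alpha * (target - current)
    return Q[(state, action)]
```
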

Problem 3: modify the algorithm to get the optimal action-value function $q_*$
- Off-policy: the agent interacts with the environment by following a policy $b$ that differs from the policy $\pi$ that it seeks to evaluate (or improve)
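
A standard off-policy TD control update is Q-learning (also called Sarsamax). Whether this is exactly the modified algorithm the post had in mind is an assumption, since the accompanying figure is not available. In this sketch the target takes the maximum over next actions, so it evaluates the greedy policy $\pi$ regardless of which behaviour policy $b$ generated the transition. The function name and the `actions` argument are hypothetical.

```python
def q_learning_update(Q, state, action, reward, next_state, actions, done,
                      alpha=0.1, gamma=1.0):
    """Apply one Q-learning update to the action-value table Q in place."""
    # Bootstrap from the best next action, not the one actually taken.
    if done:
        best_next = 0.0
    else:
        best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    current = Q.get((state, action), 0.0)
    Q[(state, action)] = current + alpha * (target - current)
    return Q[(state, action)]
```
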