Problem Description
Goal: maximize the expected cumulative reward
- Episodic Task: ends at some final time step $T$: $S_0,A_0,R_1,S_1,A_1,R_2,S_2,\ldots,A_{T-1},R_T,S_T$, where $S_T$ is a terminal state
- Continuing Task: no explicit termination signal: $S_0,A_0,R_1,S_1,A_1,R_2,S_2,\cdots$
- The discounted return (cumulative reward) at time step $t$: $G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\ldots,\ 0\leq\gamma\leq 1$
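The return can be accumulated backwards using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. Below is a minimal sketch of that computation; the reward values and the function name `discounted_return` are illustrative assumptions, not part of the original notes.

```python
# Minimal sketch: compute the discounted return G_t from a list of future
# rewards [R_{t+1}, R_{t+2}, ...] using G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # Accumulate from the last reward backwards.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1 + 0.9**3 * 5 = 4.645
```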
Markov Decision Process (MDP)
- $\mathcal{S}$: the set of all (nonterminal) states; $\mathcal{S}^+ = \mathcal{S}\cup\{\text{terminal states}\}$ (episodic tasks only)
- $\mathcal{A}$: the set of possible actions; $\mathcal{A}(s)$: the set of possible actions available in state $s\in\mathcal{S}$
- $\mathcal{R}$: the set of rewards; $\gamma$: discount rate, $0\leq\gamma\leq{1}$
- the one-step dynamics: $p\left(s^{\prime}, r | s, a\right) = \mathbb{P}\left(S_{t+1}=s^{\prime}, R_{t+1}=r | S_{t}=s, A_{t}=a\right)\text{ for each possible }s^{\prime}, r, s, \text { and } a$
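One possible in-memory representation of the one-step dynamics $p(s^{\prime}, r | s, a)$ is sketched below for a tiny made-up two-state MDP; the dict layout `dynamics[s][a] = [(prob, next_state, reward), ...]` and all state/action names are assumptions for illustration, and the same layout is reused in the sketches further down.

```python
# Assumed representation: dynamics[s][a] is a list of (probability, next_state, reward)
# triples, i.e., the support of p(s', r | s, a) for the pair (s, a).
dynamics = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "go":   [(1.0, "s0", 2.0)],
    },
}

# Sanity check: for each (s, a), the outcome probabilities must sum to 1.
for s, actions in dynamics.items():
    for a, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```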
Solving the Problem
- Deterministic Policy $\pi$: a mapping $\mathcal{S} \rightarrow \mathcal{A}$; Stochastic Policy $\pi$: a mapping $\mathcal{S} \times \mathcal{A} \rightarrow[0,1], \text{ i.e., }\pi(a | s)=\mathbb{P}\left(A_{t}=a | S_{t}=s\right)$
- State-value function $v_{\pi}(s)=\mathbb{E}_{\pi}\left[G_{t} | S_{t}=s\right]$ and action-value function $q_{\pi}(s, a)=\mathbb{E}_{\pi}\left[G_{t} | S_{t}=s, A_{t}=a\right]$
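Both value functions satisfy the standard Bellman expectation equations, which express them through the one-step dynamics $p(s^{\prime}, r | s, a)$ and underpin the dynamic programming methods below:

$$v_{\pi}(s)=\sum_{a}\pi(a | s)\sum_{s^{\prime}, r}p(s^{\prime}, r | s, a)\left[r+\gamma v_{\pi}(s^{\prime})\right]$$

$$q_{\pi}(s, a)=\sum_{s^{\prime}, r}p(s^{\prime}, r | s, a)\left[r+\gamma v_{\pi}(s^{\prime})\right]$$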
Solution Method: Dynamic Programming
Assumption: the agent knows the Markov Decision Process (MDP) in advance and does not need to learn it gradually from interaction with the environment
Method 1: Policy Iteration
- Problem 1 (left figure): estimate the state-value function $v_{\pi}$ corresponding to a policy $\pi$ (policy evaluation)
- Problem 2 (right figure): obtain the action-value function $q_{\pi}$ from the state-value function $v_{\pi}$
- Problem 3 (left figure): take an estimate of the state-value function $v_{\pi}$ corresponding to a policy $\pi$ and return a new policy $\pi^{\prime}\geq\pi$ (policy improvement)
- Problem 4 (right figure): solve an MDP in the dynamic programming setting by alternating evaluation and improvement (a sketch of all four steps follows this list)
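The sketch below ties the four problems together under the dict-based dynamics format assumed earlier: `policy_evaluation` (Problem 1), `q_from_v` (Problem 2), `policy_improvement` (Problem 3), and `policy_iteration` (Problem 4). The two-state MDP and all function names are illustrative assumptions, not a reference implementation.

```python
GAMMA = 0.9

# Made-up two-state MDP in the assumed format dynamics[s][a] = [(prob, next_state, reward), ...]
dynamics = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "go":   [(1.0, "s0", 2.0)]},
}

def q_from_v(v, s, a):
    """Problem 2: one-step lookahead, q_pi(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in dynamics[s][a])

def policy_evaluation(policy, theta=1e-8):
    """Problem 1: sweep the Bellman expectation update until the values stop changing."""
    v = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s in dynamics:
            new_v = sum(prob * q_from_v(v, s, a) for a, prob in policy[s].items())
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def policy_improvement(v):
    """Problem 3: act greedily with respect to q_pi derived from v_pi."""
    policy = {}
    for s in dynamics:
        best_a = max(dynamics[s], key=lambda a: q_from_v(v, s, a))
        policy[s] = {a: (1.0 if a == best_a else 0.0) for a in dynamics[s]}
    return policy

def policy_iteration():
    """Problem 4: alternate evaluation and improvement until the policy is stable."""
    policy = {s: {a: 1.0 / len(dynamics[s]) for a in dynamics[s]} for s in dynamics}
    while True:
        v = policy_evaluation(policy)
        new_policy = policy_improvement(v)
        if new_policy == policy:
            return new_policy, v
        policy = new_policy

if __name__ == "__main__":
    print(policy_iteration())
```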
Method 2: Truncated Policy Iteration
In this approach, the evaluation step is stopped after a fixed number of sweeps through the state space rather than being run to convergence.
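One possible way to express this truncation, reusing the dynamics/policy format assumed in the policy-iteration sketch above: the convergence test is simply replaced by a fixed sweep count `max_sweeps` (an illustrative parameter name).

```python
def truncated_policy_evaluation(dynamics, policy, v, gamma=0.9, max_sweeps=3):
    """Same Bellman expectation sweep as before, but stopped after max_sweeps sweeps."""
    for _ in range(max_sweeps):  # fixed number of sweeps instead of a convergence test
        for s in dynamics:
            v[s] = sum(prob * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[s][a])
                       for a, prob in policy[s].items())
    return v
```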
Method 3: Value Iteration
In this approach, each sweep over the state space simultaneously performs policy evaluation and policy improvement.
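A minimal sketch under the same assumed dynamics format: each sweep applies the Bellman optimality backup $v(s) \leftarrow \max_a \sum_{s^{\prime},r} p(s^{\prime},r|s,a)\left[r+\gamma v(s^{\prime})\right]$, and the greedy (deterministic) policy is read off once the values have converged.

```python
def value_iteration(dynamics, gamma=0.9, theta=1e-8):
    """Sweep the Bellman optimality backup, then extract the greedy policy."""
    v = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s in dynamics:
            new_v = max(sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[s][a])
                        for a in dynamics[s])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # Greedy policy: a deterministic mapping from each state to its best action.
    policy = {s: max(dynamics[s],
                     key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[s][a]))
              for s in dynamics}
    return policy, v
```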