強化學習實例2:MDP

紅色塊移動到黃色,黑色爲障礙物python 馬爾科夫鏈,算法 預測最好的路徑,值函數爲回報r(reward)和the discounted value of the ending statecanvas SARSA表明state, action, reward, next state和next action。it is known as an own policy Reinforcement Le
相關文章
相關標籤/搜索