Why do policy gradient methods have high variance?

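Policy gradient methods

In policy gradient methods, the objective is to maximize the expected discounted reward collected over a whole episode:

$$\underset{\theta}{\text{maximize}} \; \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$

Since:

$$\nabla_\theta \mathbb{E}[f(x)] = \nabla_\theta \int p_\theta(x) f(x)\,dx = \int p_\theta(x) \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)} f(x)\,dx = \int p_\theta(x) \nabla_\theta \log p_\theta(x) f(x)\,dx = \mathbb{E}\left[f(x) \nabla_\theta \log p_\theta(x)\right]$$

and:

$$\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \log\Big(\mu(s_0) \prod_{t=0}^{T-1}$$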
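The identity $\nabla_\theta \mathbb{E}[f(x)] = \mathbb{E}[f(x) \nabla_\theta \log p_\theta(x)]$ can be checked numerically. The following is a minimal sketch, not a policy gradient implementation: it assumes a toy distribution $p_\theta = \mathcal{N}(\theta, 1)$ and an illustrative function $f(x) = x^2$, neither of which comes from the original text. The helper name `score_function_grad_samples` is likewise hypothetical. Under these assumptions $\mathbb{E}[f(x)] = \theta^2 + 1$, so the exact gradient is $2\theta$.

```python
import numpy as np


def score_function_grad_samples(theta, n_samples, rng):
    """Per-sample estimates f(x) * d/dtheta log p_theta(x), with x ~ N(theta, 1).

    Assumptions for this sketch only: f(x) = x**2 and a unit-variance Gaussian.
    """
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)
    f = x ** 2             # illustrative f(x)
    score = x - theta      # d/dtheta log N(x; theta, 1)
    return f * score


rng = np.random.default_rng(0)
theta = 1.5
samples = score_function_grad_samples(theta, 200_000, rng)

print("estimated gradient: ", samples.mean())
print("analytic gradient:  ", 2 * theta)  # d/dtheta (theta**2 + 1)
print("per-sample variance:", samples.var())
```

In this toy setting the sample mean should land close to the analytic gradient, while the variance of a single-sample estimate is much larger than the squared gradient itself. That gap is the kind of noise the title asks about: the estimator is unbiased, but each individual sample is a poor estimate.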