Whether robots can really be applied to services ultimately comes down to how much those two legs cost, and whether they can truly do "service" work when interacting with people depends on how well the two arms perform. Brain-like intelligence is still very far away; getting the sensors and effectors right first is what really matters.
On reinforcement learning: depending on how much initiative the agent has over its policy, it is divided into active reinforcement learning (the agent learns a policy and must decide for itself which action to take) and passive reinforcement learning (a fixed policy determines its behaviour; this is evaluative learning, i.e. the agent learns a utility function from success and failure, reward and punishment). A brief sketch contrasting the two settings follows the links below.
Passive reinforcement learning: EnforceLearning – Passive Reinforcement Learning
Active reinforcement learning: EnforceLearning – Active Reinforcement Learning
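The following is a minimal sketch of the distinction above (my own illustration, not code from the linked posts): passive RL evaluates a fixed policy by learning a utility function U(s) with TD(0) from the trajectories that policy produces, while active RL must also choose its own actions, e.g. epsilon-greedily over a Q table. The trajectory format and function names are assumptions made for the example.

```python
import random
from collections import defaultdict

def passive_td_utilities(trajectories, alpha=0.1, gamma=0.99):
    """Passive RL: the policy is fixed; learn U(s) with TD(0) from its trajectories.

    `trajectories` is a list of episodes, each a list of (state, reward) pairs,
    where the reward is the one received on entering that state while following
    the fixed policy.
    """
    U = defaultdict(float)
    for episode in trajectories:
        for (s, _), (s_next, r_next) in zip(episode, episode[1:]):
            # TD(0): move U(s) toward the reward received on the transition
            # plus the discounted utility of the successor state.
            U[s] += alpha * (r_next + gamma * U[s_next] - U[s])
    return U

def active_epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Active RL: the agent itself decides what to do, trading exploration for exploitation."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example: utilities of a 3-state chain observed 100 times under a fixed policy.
U = passive_td_utilities([[("s0", 0.0), ("s1", 0.0), ("goal", 1.0)]] * 100)
```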
Article: SAC-X, a new paradigm for training robots on grasping tasks
DeepMind proposes Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of reinforcement learning (RL). SAC-X can learn complex behaviours from scratch in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks, which it tries to learn from simultaneously via off-policy reinforcement learning.
The formalization and optimization of this long reward vector are the highlights of the paper.
In this paper, we introduce a new method dubbed Scheduled Auxiliary Control (SAC-X), as a first step towards such an approach. It is based on four main principles:
1. Every state-action pair is paired with a vector of rewards, consisting of (typically sparse) externally provided rewards and (typically sparse) internal auxiliary rewards.
2. Each reward entry has an assigned policy, called intention in the following, which is trained to maximize its corresponding cumulative reward.
3. There is a high-level scheduler which selects and executes the individual intentions with the goal of improving performance of the agent on the external tasks.
4. Learning is performed off-policy (and asynchronously from policy execution) and the experience between intentions is shared – to use information effectively. Although the approach proposed in this paper is generally applicable to a wider range of problems, we discuss our method in the light of a typical robotics manipulation application with sparse rewards: stacking various objects and cleaning a table.
In summary, the method rests on four basic principles: each state-action pair is paired with a vector of (typically sparse) rewards – one long sparse vector; each reward entry is assigned its own policy, called an intention, which is trained to maximize its corresponding cumulative reward; a high-level scheduler selects and executes individual intentions so as to improve the agent's performance on the external task; and learning is off-policy (the Q-value updates use a policy other than the behaviour policy that generated the data), with experience shared between intentions for efficiency. The overall approach is generally applicable; here it is demonstrated on typical robot manipulation tasks. A rough sketch of this structure is given below.
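The sketch below assumes a toy chain environment and tabular Q-learners; the names (ToyEnv, Intention) and the uniform-random scheduling are illustrative choices, not the paper's implementation. Each entry of the reward vector gets its own "intention", a scheduler picks which intention controls behaviour for an episode, and every intention learns off-policy from the shared replay data.

```python
import random
from collections import defaultdict, deque

class ToyEnv:
    """Toy chain task that emits a 2-entry reward vector: [external, auxiliary]."""
    def __init__(self, length=6):
        self.length = length
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                                    # action in {-1, +1}
        self.pos = max(0, min(self.length, self.pos + action))
        external = 1.0 if self.pos == self.length else 0.0     # sparse main reward
        auxiliary = 0.1 if action == +1 else 0.0                # dense auxiliary reward
        return self.pos, [external, auxiliary], self.pos == self.length

class Intention:
    """One policy per reward entry; here a simple tabular epsilon-greedy Q-learner."""
    def __init__(self, reward_index, actions=(-1, +1), alpha=0.2, gamma=0.95):
        self.i, self.actions, self.alpha, self.gamma = reward_index, actions, alpha, gamma
        self.Q = defaultdict(float)
    def act(self, s, epsilon=0.2):
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])
    def update(self, batch):
        # Off-policy Q-learning using only this intention's own entry of the reward vector.
        for s, a, r_vec, s2, done in batch:
            target = r_vec[self.i] + (0.0 if done else self.gamma * max(self.Q[(s2, b)] for b in self.actions))
            self.Q[(s, a)] += self.alpha * (target - self.Q[(s, a)])

env = ToyEnv()
replay = deque(maxlen=10_000)                  # shared experience, reused by every intention
intentions = [Intention(0), Intention(1)]      # one intention per reward entry

for episode in range(200):
    active = random.choice(intentions)         # scheduler: uniform-random choice of intention
    s, done = env.reset(), False
    for _ in range(50):                        # cap episode length
        a = active.act(s)
        s2, r_vec, done = env.step(a)
        replay.append((s, a, r_vec, s2, done))
        s = s2
        if done:
            break
    batch = random.sample(list(replay), min(64, len(replay)))
    for intention in intentions:               # all intentions learn from the same shared data
        intention.update(batch)
```

The point of the toy: the dense auxiliary reward makes its intention reach the goal often, and the shared replay then gives the sparse external intention enough successful experience to learn from – which is the intuition behind scheduling auxiliary tasks in SAC-X.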
On the benefits of off-policy learning: https://www.zhihu.com/question/57159315

Paper: Learning by Playing – Solving Sparse Reward Tasks from Scratch