1804.03235-Large scale distributed neural network training through online distillation.md

Date: 2019-12-10


Existing approaches to distributed model training

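Distributed SGD
- Synchronous (parallel) SGD: in large-scale training, the duration of each step is set by the slowest machine.
- Asynchronous SGD: updates computed from out-of-sync data can push the weights in an unknown direction.
Parallel multi-model training: several clusters train different models, which are then combined into the final model, but this raises the cost at inference time.
Distillation: the pipeline is complicated.
- Choosing the training set for the student:
  - unlabeled data
  - the original data
  - held-out data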
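
To make the straggler point about synchronous SGD concrete, here is a minimal simulation sketch. The timing model (a base cost of 1.0 plus exponential jitter) and the function names are my own assumptions for illustration, not something taken from the paper.

```python
import random

def sync_step_time(worker_times):
    """A synchronous SGD step finishes only when the slowest worker does."""
    return max(worker_times)

def simulate(num_workers, steps, seed=0):
    """Average step time under a hypothetical timing model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        # Assumed model: base cost 1.0 plus exponential jitter with mean 0.1.
        times = [1.0 + rng.expovariate(10.0) for _ in range(num_workers)]
        total += sync_step_time(times)
    return total / steps

if __name__ == "__main__":
    for n in (1, 8, 64, 512):
        # The average step time grows with the number of workers.
        print(n, round(simulate(n, steps=2000), 3))
```

Even though every worker has the same average speed, the expected maximum grows as workers are added, which is why stragglers dominate at scale.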

Codistillation

using the same architecture for all the models;
using the same dataset to train all the models; and
using the distillation loss during training before any model has fully converged.
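
A minimal sketch of what such a loss could look like for one example: the usual cross-entropy against the hard label, plus a distillation term toward the averaged predictions of the other models. The names `codistillation_loss`, `peer_probs` and `distill_weight` are hypothetical, and cross-entropy as the distillation term is my assumption; the paper may weight or define the term differently.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def codistillation_loss(logits, label, peer_probs, distill_weight=1.0):
    """Hard-label loss plus a distillation term toward the averaged peer prediction."""
    probs = softmax(logits)
    n_classes = len(logits)
    one_hot = [1.0 if k == label else 0.0 for k in range(n_classes)]
    # Average the predictions of the other models (the "teacher" signal).
    avg_peer = [sum(p[k] for p in peer_probs) / len(peer_probs)
                for k in range(n_classes)]
    return (cross_entropy(one_hot, probs)
            + distill_weight * cross_entropy(avg_peer, probs))
```

With `distill_weight=0` this reduces to ordinary training, which is how the period before distillation is switched on can be expressed.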

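Characteristics
- Even when the teacher and the student use exactly the same model configuration, there is still a real gain as long as the two models themselves are sufficiently different.
- Even when the models have not yet converged, there is still a benefit.
- Dropping the teacher/student distinction and letting the models train each other also helps.
- Models that are not synchronized with each other also work.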

The algorithm is easy to understand, and its steps do not look very complicated.
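
As a rough illustration of those steps, here is a toy sketch with two linear softmax classifiers that train on the same data and, after a burn-in period, distill from a stale copy of each other. This is not the paper's implementation: the dataset, the model, and the parameters `burn_in`, `refresh` and `distill_weight` are all assumptions chosen for the example.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(w, x):
    """Linear softmax classifier; each row of w holds class weights, bias last."""
    xb = x + [1.0]
    return softmax([sum(wi * xi for wi, xi in zip(row, xb)) for row in w])

def sgd_step(w, x, label, teacher_probs, lr, distill_weight):
    """One SGD step on cross-entropy plus an optional distillation term."""
    p = predict(w, x)
    xb = x + [1.0]
    for k, row in enumerate(w):
        # Gradient w.r.t. logit k: (p - one_hot) + distill_weight * (p - teacher).
        g = p[k] - (1.0 if k == label else 0.0)
        if teacher_probs is not None:
            g += distill_weight * (p[k] - teacher_probs[k])
        for j in range(len(row)):
            row[j] -= lr * g * xb[j]

def make_data(n, rng):
    """Two Gaussian blobs, a hypothetical toy dataset."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        center = 1.5 if label == 1 else -1.5
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return data

def codistill(data, steps=2000, burn_in=500, refresh=100, lr=0.05, seed=0):
    rng = random.Random(seed)
    models = [[[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(2)]
              for _ in range(2)]
    stale = [None, None]  # each worker's stale copy of the other model
    for step in range(steps):
        for i in (0, 1):
            if step >= burn_in and step % refresh == 0:
                # Occasionally reload the peer's weights; they go stale in between.
                stale[i] = [row[:] for row in models[1 - i]]
            x, y = data[rng.randrange(len(data))]
            teacher = predict(stale[i], x) if stale[i] is not None else None
            sgd_step(models[i], x, y, teacher, lr, distill_weight=1.0)
    return models

def accuracy(w, data):
    hits = 0
    for x, y in data:
        p = predict(w, x)
        hits += int(p.index(max(p)) == y)
    return hits / len(data)
```

Calling `codistill(make_data(200, random.Random(1)))` returns both trained models. Note that each model only ever reads the other's predictions, never its gradients.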

The paper's explanation for why using stale (out-of-date) model weights works:

every change in weights leads to a change in gradients, but as training progresses towards convergence, weight updates should substantially change only the predictions on a small subset of the training data;
weights (and gradients) are not statistically identifiable as different copies of the weights might have arbitrary scaling differences, permuted hidden units, or otherwise rotated or transformed hidden layer feature space so that averaging gradients does not make sense unless models are extremely similar;
sufficiently out-of-sync copies of the weights will have completely arbitrary differences that change the meaning of individual directions in feature space that are not distinguishable by measuring the loss on the training set;
in contrast, output units have a clear and consistent meaning enforced by the loss function and the training data.
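
The point about permuted hidden units can be checked on a tiny network. In this sketch (the weights and input are made up for illustration), two copies compute exactly the same function with their hidden units swapped; averaging their predictions is harmless, while averaging their weights gives a different function.

```python
import math

def mlp(x, w1, w2):
    """Tiny one-hidden-layer network: tanh hidden units, linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(v * h for v, h in zip(w2, hidden))

# Hypothetical weights for a network with 2 inputs and 2 hidden units.
w1_a = [[1.0, -2.0], [0.5, 3.0]]
w2_a = [2.0, -1.0]

# Copy B: the same function with the two hidden units swapped.
w1_b = [w1_a[1], w1_a[0]]
w2_b = [w2_a[1], w2_a[0]]

# Parameter averaging of A and B.
w1_avg = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(w1_a, w1_b)]
w2_avg = [(a + b) / 2 for a, b in zip(w2_a, w2_b)]

x = [0.3, -0.7]
out_a = mlp(x, w1_a, w2_a)
out_b = mlp(x, w1_b, w2_b)                   # same as out_a
out_avg_weights = mlp(x, w1_avg, w2_avg)     # a different function entirely
out_avg_predictions = (out_a + out_b) / 2    # still the same as out_a
```

This is why exchanging predictions tolerates staleness better than exchanging gradients or weights: only the outputs have a meaning that is shared between copies.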

So this seems to be saying that the randomness itself is beneficial?

A suggested practical framework design:

Each worker trains an independent version of the model on a locally available subset of the training data.
Occasionally, workers checkpoint their parameters.
Once this happens, other workers can load the freshest available checkpoints into memory and perform codistillation.
On top of that, distributed SGD can still be used within each of the smaller clusters.
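
A minimal sketch of the checkpoint exchange described above, using a shared directory of JSON files. The file naming scheme and the functions `save_checkpoint` and `load_freshest` are hypothetical; a real system would use its framework's own checkpoint format and storage.

```python
import json
import os
import tempfile

def save_checkpoint(directory, worker_id, step, weights):
    """Write a checkpoint atomically so readers never see a half-written file."""
    path = os.path.join(directory, f"worker{worker_id}_step{step:08d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"worker": worker_id, "step": step, "weights": weights}, f)
    os.replace(tmp, path)  # atomic rename
    return path

def load_freshest(directory, own_id):
    """Return the newest checkpoint of every other worker, keyed by worker id."""
    freshest = {}
    for name in os.listdir(directory):
        if not name.endswith(".json"):
            continue  # skip temporary files still being written
        with open(os.path.join(directory, name)) as f:
            ckpt = json.load(f)
        wid = ckpt["worker"]
        if wid == own_id:
            continue
        if wid not in freshest or ckpt["step"] > freshest[wid]["step"]:
            freshest[wid] = ckpt
    return freshest

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        save_checkpoint(d, 0, 100, [0.1])
        save_checkpoint(d, 1, 100, [0.2])
        save_checkpoint(d, 1, 300, [0.3])
        print(load_freshest(d, own_id=0))  # worker 1's step-300 checkpoint
```

Workers never block on each other here: a reader simply takes whatever is newest at the moment it looks.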

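The paper also points out that, compared with sending gradients and weights at every step, this approach only needs to load a checkpoint occasionally, and the model clusters are computationally completely independent of one another. That really does reduce some problems.
But what if one of the models collapses and never converges at all?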

Also, I do not see what is simple about this framework; managing models and checkpoints is not a trivial matter.

Experimental conclusions

20 TB of data: with that kind of money you can do as you please.

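The paper notes that more machines do not always give a better final model. Somewhere around 32-128 seems to be a suitable range; beyond that, convergence speed and final quality do not improve, and sometimes they even get worse.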

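In result 2a of the paper, the best setup is still two models in parallel, followed by codistillation. The worst is unigram smoothing at 0.9, and label smoothing at 0.99 performs about the same as plain training; after all, it is only random noise.
In addition, comparing codistillation on the same data (2b) with codistillation on random data, the experiments find that random data actually gives the models better performance.
Result 3, the ImageNet experiment, shows much the same outcome as 2a.
Result 4 shows that although the very latest model is not strictly required, codistillation with checkpoints that are too old still noticeably reduces training efficiency.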
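
For reference, a minimal sketch of how the smoothed targets in the baselines of result 2a could be built. I am assuming that 0.9 and 0.99 are the weight kept on the true label, and the class frequencies below are made up; the paper's exact convention may differ.

```python
def smooth_targets(label, prior, keep):
    """Mix a one-hot target with a prior distribution.

    keep is the weight left on the true label (an assumed convention);
    the remaining 1 - keep is spread according to prior.
    """
    return [keep * (1.0 if k == label else 0.0) + (1.0 - keep) * q
            for k, q in enumerate(prior)]

uniform = [0.25] * 4
unigram = [0.7, 0.1, 0.1, 0.1]  # hypothetical class frequencies

uniform_target = smooth_targets(2, uniform, keep=0.99)
unigram_target = smooth_targets(2, unigram, keep=0.9)
```

Unlike a codistillation target, neither of these depends on the input example, which fits the observation that they add little beyond plain training.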

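Underfit models are useful, but overfit models may not be worth much in distillation.
Codistillation converges faster than two-step distillation and is more efficient.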

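Section 3.5 describes a problem that comes up often: because of differences in initialization, in the parameters of the training process, and so on, two training runs can produce models whose outputs differ considerably. With a classifier, for example, the previous run may be accurate on certain classes while the current run is not. Model averaging or distillation can effectively avoid this problem.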
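
One simple way to quantify how much two training runs disagree is sketched below. The two metrics and the sample predictions are my own illustration; the paper's exact measurement may differ.

```python
def mean_abs_difference(probs_a, probs_b):
    """Mean absolute difference between two runs' predicted probabilities."""
    total, count = 0.0, 0
    for pa, pb in zip(probs_a, probs_b):
        for a, b in zip(pa, pb):
            total += abs(a - b)
            count += 1
    return total / count

def label_disagreement(probs_a, probs_b):
    """Fraction of examples on which the two runs predict different classes."""
    differ = 0
    for pa, pb in zip(probs_a, probs_b):
        differ += int(pa.index(max(pa)) != pb.index(max(pb)))
    return differ / len(probs_a)

# Hypothetical predictions of two retrains on the same three examples.
run_a = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
run_b = [[0.8, 0.2], [0.7, 0.3], [0.55, 0.45]]
```

Here the two runs flip their decision on the second example only, so one third of the labels change even though both runs may have similar overall accuracy.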