Knowledge Distillation Basics and an Introduction to Implementation Libraries

1 Introduction

Knowledge distillation aims to let a small model learn the knowledge of a large model; put plainly, we want the student model's output to approximate (fit) the teacher model's output. The key word is therefore "fit": we need to define a way to measure how close the student model is to the teacher model, which is simply the loss function.

Why do we need knowledge distillation? Because large models are too slow at inference to deploy in industry, while a small model trained directly from scratch performs noticeably worse.

Below I introduce four popular distillation papers. I have hands-on experience with all four and hope this helps.

2 The Seminal Work on Knowledge Distillation

Hinton proposed knowledge distillation in the paper Distilling the Knowledge in a Neural Network. There is already plenty of material about it online, so I will just give a brief summary.
Loss function: $$Loss = aL_{soft} + (1-a)L_{hard}$$
Here \(L_{soft}\) is the cross entropy between the StudentModel's and the TeacherModel's outputs, and \(L_{hard}\) is the cross entropy between the StudentModel's output and the ground-truth labels.
A bit more on \(L_{soft}\). The TeacherModel's output goes through a softmax, and the exponential stretches the gaps between classes, so the final output looks very much like a one-hot vector, which is not helpful for the StudentModel's learning. We therefore want the output to be softer, so we modify the softmax function:

\[ q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} \]

Clearly, the larger T is, the softer the output. After this change, compared with the original softmax, the gradient is effectively multiplied by \(1/T^2\), so \(L_{soft}\) must be multiplied by \(T^2\) to stay on the same scale as \(L_{hard}\).

The overall framework of the algorithm is shown below (image from https://blog.csdn.net/nature553863/article/details/80568658).
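
To make the two terms concrete, here is a minimal PyTorch sketch of this loss (my own illustration, not code from the paper; the function name and the default values of T and a are my assumptions):

import torch.nn.functional as F

def hinton_distillation_loss(student_logits, teacher_logits, labels, T=4.0, a=0.9):
    # L_soft: cross entropy between the softened teacher and student distributions,
    # multiplied by T^2 so it stays on the same scale as L_hard
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    l_soft = -(soft_targets * log_probs).sum(dim=-1).mean() * (T ** 2)
    # L_hard: ordinary cross entropy against the true labels (computed at T = 1)
    l_hard = F.cross_entropy(student_logits, labels)
    return a * l_soft + (1 - a) * l_hard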

3 TinyBert

3.1 Basic Idea

When it comes to distilling BERT, the first idea that comes to mind is to use a fine-tuned BERT as the TeacherModel to train a StudentModel, and that is exactly what TinyBERT does. The next question is which model to use as the StudentModel. Several options have been tried, for example BiLSTM, but most people stick with BERT, just a smaller one than the original. In TinyBERT, the StudentModel is a small BERT with reduced embedding size, hidden size and number of hidden layers.

How do we initialize the StudentModel, then? The simplest way is random initialization, but a pretrained model works better, so we need a pretrained StudentModel. TinyBERT's approach is to distill a pretrained StudentModel from a pretrained BERT.

OK, that is basically TinyBERT. To summarize, TinyBERT proceeds in two steps:

  1. Use the pretrained BERT to distill a pretrained TinyBERT
  2. Use the fine-tuned BERT to distill a fine-tuned TinyBERT (initialized with the pretrained TinyBERT from step 1)

3.2 Loss Function

Now for TinyBERT's loss function.

The formula is as follows (reconstructed here from the definitions below):

\[ L = \sum_{m=0}^{M+1} L_{layer}\big(S_m, T_{g(m)}\big), \qquad
L_{layer} = \begin{cases} L_{embd}(S_0, T_0) & m = 0 \\ L_{hidden} + L_{attn} & 0 < m \le M \\ L_{pred} & m = M+1 \end{cases} \]

An explanation of this formula:

  • \(m\): an integer between 0 and the number of layers of the StudentModel
  • \(S_m\): the output of the StudentModel's m-th layer
  • \(g(m)\): the mapping function; in practice it means the StudentModel's m-th layer learns from the output of the TeacherModel's g(m)-th layer
  • \(T_{g(m)}\): the output of the TeacherModel's g(m)-th layer
  • \(M\): the number of hidden layers of the StudentModel, so the StudentModel's (M+1)-th layer is the prediction-layer output (logits)
  • \(L_{embd}(S_0,T_0)\): the loss on the word embedding layer, which is MSE
  • \(L_{hidden}\) and \(L_{attn}\): the losses on the hidden layers and the attention layers, both MSE
  • \(L_{pred}\): the prediction-layer loss, which is cross entropy; other papers use KL divergence instead, which behaves the same during backpropagation.

One more note: during distillation, the hidden layers are distilled first (i.e. m <= M), and only then is the m = M+1 step distilled.
To sum up: when distilling, TinyBERT makes the StudentModel learn not only the TeacherModel's final-layer output but also the outputs of several intermediate layers. In other words, a given hidden layer of the StudentModel can learn from the outputs of several of the TeacherModel's hidden layers. The distillation granularity is quite fine; I would call it layer-based distillation.
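
As a concrete illustration, here is a minimal sketch of this layer-based objective (my own code, not the official TinyBERT implementation). It assumes the model outputs are in the HuggingFace tuple format used elsewhere in this post, that the teacher's and student's hidden sizes and number of attention heads already match (the TinyBERT paper inserts a linear projection when the hidden sizes differ), and that layer_map encodes the mapping g(m):

import torch.nn.functional as F

def tinybert_hidden_losses(student_out, teacher_out, layer_map):
    # student_out / teacher_out: (last_hidden_state, pooler_output, hidden_states, attentions)
    s_hidden, s_attn = student_out[2], student_out[3]
    t_hidden, t_attn = teacher_out[2], teacher_out[3]
    # m = 0: word embedding layer, MSE
    loss = F.mse_loss(s_hidden[0], t_hidden[0])
    # 0 < m <= M: hidden-state and attention losses, both MSE
    for m, g_m in layer_map.items():  # e.g. {1: 4, 2: 8, 3: 12} for a 3-layer student
        loss = loss + F.mse_loss(s_hidden[m], t_hidden[g_m])
        loss = loss + F.mse_loss(s_attn[m - 1], t_attn[g_m - 1])
    return loss

def tinybert_pred_loss(student_logits, teacher_logits, T=1.0):
    # m = M + 1: soft cross entropy between the two prediction layers
    t_probs = F.softmax(teacher_logits / T, dim=-1)
    s_log_probs = F.log_softmax(student_logits / T, dim=-1)
    return -(t_probs * s_log_probs).sum(dim=-1).mean()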

3.3 Practical Experience

  1. With limited hardware and data it is hard to distill a pretrained model, but you can borrow TinyBERT's idea and do task-specific distillation directly. As for how to initialize the StudentModel, I have two suggestions: either initialize it directly from the original Teacher model, or initialize it from a small pretrained model released by someone else. I initialized directly from the RBT3 model and it worked very well.
  2. After distilling the StudentModel, always test the StudentModel's generalization ability.
  3. Stay flexible. There is no single unified recipe for distillation yet, and there are many places you can tweak and experiment with yourself.

4 DistilBert

4.1 Basic Idea

After TinyBERT, I would like to talk about DistilBERT. DistilBERT is much simpler than TinyBERT, so I will keep it short. DistilBERT uses a pretrained BERT as the TeacherModel to train a StudentModel, where the StudentModel is simply a BERT with fewer layers. Note that the resulting DistilBERT is itself still a pretrained model, so for a specific downstream task you still need to fine-tune it on task data; that is plain fine-tuning, with no further distillation involved. HuggingFace has already released several distilled pretrained models, which you can use directly as drop-in replacements for BERT.

4.2 Loss Function

DistilBERT's loss function: \(L_{ce} + L_{mlm} + L_{cos}\) (a code sketch follows the list below)

  • \(L_{ce}\): the cross entropy between the StudentModel's and the TeacherModel's logits
  • \(L_{mlm}\): the StudentModel's masked language model loss
  • \(L_{cos}\): the cosine loss between the StudentModel's and the TeacherModel's hidden states. Note that TinyBERT uses MSE here instead, and additionally has an MSE term on the attentions.
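
For illustration, here is a minimal sketch of this three-part loss (my own code, not the HuggingFace training script; the equal weighting, the temperature value and the use of -100 as the ignored label are my assumptions):

import torch.nn.functional as F

def distilbert_loss(s_logits, t_logits, s_hidden, t_hidden, mlm_labels, T=2.0):
    # L_ce: soft cross entropy between student and teacher logits, scaled by T^2
    l_ce = -(F.softmax(t_logits / T, dim=-1) *
             F.log_softmax(s_logits / T, dim=-1)).sum(dim=-1).mean() * (T ** 2)
    # L_mlm: the student's own masked-language-model loss (-100 marks unmasked tokens)
    l_mlm = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)),
                            mlm_labels.view(-1), ignore_index=-100)
    # L_cos: cosine loss aligning the directions of the two models' last hidden states
    target = s_hidden.new_ones(s_hidden.size(0) * s_hidden.size(1))
    l_cos = F.cosine_embedding_loss(s_hidden.reshape(-1, s_hidden.size(-1)),
                                    t_hidden.reshape(-1, t_hidden.size(-1)), target)
    return l_ce + l_mlm + l_cos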

5 BERT-of-Theseus

這個準確的來講不是知識蒸餾,可是它確實減少了模型體積,並且思路和TinyBERT、DistillBERT都有相似,所以就放到這裏講了。這個思路很是優雅,它經過隨機使用小模型的一層替換大模型中若干層,來完成訓練。我來舉一個例子:假設大模型是input->tfc1->tfc2->tfc3->tfc4->tfc5->tfc6->output,而後再定義一個小模型input->sfc1->sfc2->sfc3->output。再訓練過程當中仍是要訓練大模型,只是在每一步中,會隨機的將(tfc1,tfc2),(tfc3,tfc4),(tfc5,tfc6)替換爲sfc1,sfc2,sfc3,並且隨着訓練的進行,替換的機率不斷變大,所以最後就是在訓練一個小模型。
The figure in the original paper illustrates this replacement process.

The approach is elegant and the authors provide source code; I strongly recommend trying it out.
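
To make the replacement mechanism concrete, here is a toy sketch of the idea (my own simplification, not the authors' implementation; the class and parameter names are made up):

import random
import torch.nn as nn

class TheseusEncoder(nn.Module):
    """Toy sketch: each successor (small-model) layer randomly stands in for a block of predecessor (large-model) layers."""

    def __init__(self, predecessor_layers, successor_layers, replace_prob=0.5):
        super().__init__()
        assert len(predecessor_layers) % len(successor_layers) == 0
        self.prd = nn.ModuleList(predecessor_layers)  # e.g. tfc1..tfc6, kept frozen
        self.scc = nn.ModuleList(successor_layers)    # e.g. sfc1..sfc3, being trained
        self.k = len(predecessor_layers) // len(successor_layers)
        self.replace_prob = replace_prob              # scheduled to grow during training

    def forward(self, hidden):
        for i, scc_layer in enumerate(self.scc):
            if self.training and random.random() < self.replace_prob:
                hidden = scc_layer(hidden)            # the successor replaces this block
            else:
                for prd_layer in self.prd[i * self.k:(i + 1) * self.k]:
                    hidden = prd_layer(hidden)        # keep the predecessor block
        return hidden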

6 MiniLM

A recently published paper, also on BERT distillation. Let me briefly summarize its three main ideas:

  1. First use the TeacherModel to distill a medium-sized model, then use the medium model to distill the smaller StudentModel. This is only done when the StudentModel is very small.
  2. Distillation is applied only to the last hidden layer. The authors argue this gives the StudentModel more freedom and also relaxes the requirements on the StudentModel's architecture.
  3. For that last layer, the StudentModel mainly learns the attention weights; see the paper for details (a rough code sketch follows below).

The figure in the paper illustrates this setup.
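
As a rough sketch of what "learning the attention weights of the last layer" can look like (my own simplification: the actual MiniLM objective uses KL divergence over both attention distributions and value relations, it assumes the teacher and student have the same number of attention heads, and padding positions are not excluded here for brevity):

def last_layer_attention_loss(teacher_attn, student_attn, eps=1e-12):
    # teacher_attn / student_attn: (batch, num_heads, seq_len, seq_len),
    # the already-softmaxed attention maps of each model's last layer
    t = teacher_attn.clamp_min(eps)
    s = student_attn.clamp_min(eps)
    # KL(teacher || student), averaged over batch, heads and query positions
    return (t * (t.log() - s.log())).sum(dim=-1).mean()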

7 General Knowledge Distillation Frameworks

7.1 The KnowledgeDistillation Library

I implemented a PyTorch-based knowledge distillation framework myself; those interested are welcome to try it. The framework abstracts the distillation of multi-layer models as far as possible and can implement algorithms such as TinyBERT and DistilBERT. While maintaining it, I found that knowledge distillation is still not mature and new distillation algorithms keep appearing, so there is no way to design one unified framework that covers every algorithm. I therefore slightly restructured the library into two parts:

  1. A distillation framework for multi-layer models: good for newcomers who want to read the source code and get started (no longer maintained)
  2. examples: sample code for various new knowledge distillation algorithms (still maintained)

Everyone is welcome to contribute example code for new distillation algorithms. Examples should be as concise and readable as possible and easy to run, ideally for algorithms whose authors have not released source code. Project links:
PyPI: https://pypi.org/project/KnowledgeDistillation/
Github: https://github.com/DunZhang/KnowledgeDistillation

Below is example code that uses the multi-layer-model distillation framework to distill a 12-layer BERT into a 3-layer BERT with TinyBERT's loss functions. The code is complete, can be run directly, and requires no external data:

# import packages  
import torch  
import logging  
import numpy as np  
from transformers import BertModel, BertConfig  
from torch.utils.data import DataLoader, RandomSampler, TensorDataset  
  
from knowledge_distillation import KnowledgeDistiller, MultiLayerBasedDistillationLoss  
from knowledge_distillation import MultiLayerBasedDistillationEvaluator  
  
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')  
# Some global variables  
train_batch_size = 40  
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  
learning_rate = 1e-5  
num_epoch = 10  
  
# define student and teacher model  
# Teacher Model  
bert_config = BertConfig(num_hidden_layers=12, hidden_size=60, intermediate_size=60, output_hidden_states=True,  
                         output_attentions=True)  
teacher_model = BertModel(bert_config)  
# Student Model  
bert_config = BertConfig(num_hidden_layers=3, hidden_size=60, intermediate_size=60, output_hidden_states=True,  
                         output_attentions=True)  
student_model = BertModel(bert_config)  
  
### Train data loader  
input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 50)))  
attention_mask = torch.LongTensor(np.ones((100000, 50)))  
token_type_ids = torch.LongTensor(np.zeros((100000, 50)))  
train_data = TensorDataset(input_ids, attention_mask, token_type_ids)  
train_sampler = RandomSampler(train_data)  
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)  
  
  
### Train data adaptor  
### It is a function that turns batch_data (from train_dataloader) into the inputs of teacher_model and student_model
### You can define your own train_data_adaptor. Its inputs must be device and batch_data.
### The output can be either a dict or a tuple, but it must be consistent with your model's input
def train_data_adaptor(device, batch_data):  
    batch_data = tuple(t.to(device) for t in batch_data)  
    batch_data_dict = {"input_ids": batch_data[0],  
                       "attention_mask": batch_data[1],  
                       "token_type_ids": batch_data[2], }  
    # In this case, the teacher and student use the same input  
    return batch_data_dict, batch_data_dict
  
  
### The loss model is the key component of this framework.
### We have already provided a general loss model for distilling multiple BERT layers
### In most cases, you can use this model directly.
#### First, we define a distill_config which indicates how to compute the loss between teacher and student.
#### distill_config is a list; each item describes how to calculate one loss term.
#### It also defines which output of which layer the loss is computed on.
#### It should be consistent with your output_adaptor
distill_config = [
    # compute a loss between the teacher's and the student's embedding_layer embeddings
    {"teacher_layer_name": "embedding_layer", "teacher_layer_output_name": "embedding",
     "student_layer_name": "embedding_layer", "student_layer_output_name": "embedding",
     "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
     },
    # compute a loss between the teacher's bert_layer12 hidden_states and the student's bert_layer3 hidden_states
    {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "hidden_states",
     "student_layer_name": "bert_layer3", "student_layer_output_name": "hidden_states",
     "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
     },
    {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "attention",
     "student_layer_name": "bert_layer3", "student_layer_output_name": "attention",
     "loss": {"loss_function": "attention_mse_with_mask", "args": {}}, "weight": 1.0
     },
    {"teacher_layer_name": "pred_layer", "teacher_layer_output_name": "pooler_output",
     "student_layer_name": "pred_layer", "student_layer_output_name": "pooler_output",
     "loss": {"loss_function": "mse", "args": {}}, "weight": 1.0
     },
]
  
  
### teacher_output_adaptor and student_output_adaptor  
### In most cases, a model's output is a tuple. However, our package needs the output to be a dict
### like: { "layer_name": {"output_name": value}, ... }
### Hence, the output adaptor turns your model's output into such a dict
### In my case, the teacher and the student can share one adaptor
def output_adaptor(model_output):  
    last_hidden_state, pooler_output, hidden_states, attentions = model_output  
    output = {"embedding_layer": {"embedding": hidden_states[0]}}  
    for idx in range(len(attentions)):  
        output["bert_layer" + str(idx + 1)] = {"hidden_states": hidden_states[idx + 1],  
                                               "attention": attentions[idx]}  
    output["pred_layer"] = {"pooler_output": pooler_output}  
    return output  
  
  
# loss_model  
loss_model = MultiLayerBasedDistillationLoss(distill_config=distill_config,  
                                             teacher_output_adaptor=output_adaptor,  
                                             student_output_adaptor=output_adaptor)  
# optimizer  
param_optimizer = list(student_model.named_parameters())  
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']  
optimizer_grouped_parameters = [  
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},  
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}  
]  
optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=learning_rate)  
# evaluator  
# This is a basic evaluator; it prints the loss value and saves the models
# You can define your own evaluator class that implements the IEvaluator interface
  
evaluator = MultiLayerBasedDistillationEvaluator(save_dir="save_model", save_step=1000, print_loss_step=20)  
# Get a KnowledgeDistiller  
distiller = KnowledgeDistiller(teacher_model=teacher_model, student_model=student_model,  
                               train_dataloader=train_dataloader, dev_dataloader=None,  
                               train_data_adaptor=train_data_adaptor, dev_data_adaptor=None,  
                               device=device, loss_model=loss_model, optimizer=optimizer,  
                               evaluator=evaluator, num_epoch=num_epoch)  
# start distillation
distiller.distillate()

7.2 The TextBrewer Library

Let me also introduce another knowledge distillation library, TextBrewer, developed by HIT (Harbin Institute of Technology). Compared with my library it implements more algorithms and runs more stably, so I recommend it.
Github: https://github.com/airaria/TextBrewer

Here, likewise, is a complete, runnable example that needs no external data:

import torch
import numpy as np
import pickle
import textbrewer
from textbrewer import GeneralDistiller
from textbrewer import TrainingConfig, DistillationConfig
from transformers import BertConfig, BertModel
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
## define the teacher and student models
bert_config = BertConfig(num_hidden_layers=12, output_hidden_states=True, output_attentions=True)
teacher_model = BertModel(bert_config).to(device)
bert_config = BertConfig(num_hidden_layers=3, output_hidden_states=True, output_attentions=True)
student_model = BertModel(bert_config).to(device)

# optimizer
param_optimizer = list(student_model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=2e-5)

### data
input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 64)))
attention_mask = torch.LongTensor(np.ones((100000, 64)))
token_type_ids = torch.LongTensor(np.zeros((100000, 64)))
train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=16)


# Define an adaptor for translating the model inputs and outputs
# Reshape the model inputs/outputs into the format the distiller expects
# (the dict keys appear to be fixed names required by TextBrewer)

def bert_adaptor(batch, model_outputs):
    last_hidden_state, pooler_output, hidden_states, attentions = model_outputs
    hidden_states = list(hidden_states)
    hidden_states.append(pooler_output)
    output = {"inputs_mask": batch[1],
              "attention": attentions,
              "hidden": hidden_states}
    return output


# Training configuration
train_config = TrainingConfig(gradient_accumulation_steps=1,
                              ckpt_frequency=10,
                              ckpt_epoch_frequency=1,
                              log_dir='logs',
                              output_dir='saved_models',
                              device='cuda')
# Distillation configuration
# Matching different layers of the student and the teacher
# Important: this defines how the distillation is done
# Custom loss functions are not supported
# A CLS loss is not supported directly, but it can be squeezed in as a hidden loss
distill_config = DistillationConfig(
    intermediate_matches=[
        {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # embedding loss
        {'layer_T': 4, 'layer_S': 1, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # hidden loss
        {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
        {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
        {'layer_T': 3, 'layer_S': 0, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},  # attention loss
        {'layer_T': 7, 'layer_S': 1, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
        {'layer_T': 11, 'layer_S': 2, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
        {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # effectively the CLS loss
    ]

)

# Build distiller
distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher_model, model_S=student_model,
    adaptor_T=bert_adaptor, adaptor_S=bert_adaptor)

# Start!
# A callback can be used to evaluate on a dev set
# Note that what gets saved is the state_dict
with distiller:
    distiller.train(optimizer=optimizer, scheduler=None, dataloader=train_dataloader, num_epochs=10, callback=None)

8 Other Ways to Speed Up BERT

There are many other ways to speed up BERT that I will not go into in detail; if you are interested, look into:

  1. Better hardware; currently the RTX 30-series GPUs look like the best value for money
  2. Upgrading your deep learning framework will reliably speed up training and inference; for example, newer versions of TensorFlow support MKL-DNN and the AVX instruction set.
  3. ONNX Runtime (mainly useful for inference)
  4. Quantization of BERT
  5. Structured dropout is worth a look, but it is best applied during pretraining, otherwise the results are poor; see the ICLR 2020 paper Reducing Transformer Depth on Demand with Structured Dropout

This article may be reproduced, but please credit the source.
