Knowledge distillation aims to let a small model learn the knowledge of a large model; in plain terms, we want the student model's output to approximate (fit) the teacher model's output. The key word in knowledge distillation is therefore "fit": we need to define a way to measure how close the student model is to the teacher model, which is, simply put, the loss function.
Why do we need knowledge distillation? Because large models are too slow at inference to be deployed in industry, while a small model trained directly on the task performs noticeably worse.
Below I introduce four fairly popular distillation papers. I have hands-on experience with all four, and I hope this is helpful to you.
Hinton proposed knowledge distillation in the paper Distilling the Knowledge in a Neural Network. There is already a great deal of material about it online, so I will just summarize it briefly.
Loss function: $$Loss = aL_{soft} + (1-a)L_{hard}$$
where \(L_{soft}\) is the cross entropy between the student model's output and the teacher model's output, and \(L_{hard}\) is the cross entropy between the student model's output and the ground-truth labels.
A few more words about \(L_{soft}\). The teacher model's output has gone through a softmax, and the exponential widens the gaps between classes, so the final output looks very much like a one-hot vector. This is not helpful for the student model's learning, so we want the output to be softer, which means modifying the softmax function:
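The standard modification, from Hinton's paper, is to divide the logits \(z_i\) by a temperature \(T\) before the softmax:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$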
Clearly, the larger \(T\) is, the softer the output. After this change, compared with the original softmax, the gradients are effectively scaled by \(1/T^2\), so \(L_{soft}\) needs to be multiplied by \(T^2\) to stay on the same order of magnitude as \(L_{hard}\).
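To make the loss concrete, here is a minimal PyTorch sketch of the combined objective (my own illustration, not the paper's code; the function name and the default values of `T` and `a` are arbitrary choices):

```python
import torch
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, T=4.0, a=0.9):
    # L_soft: cross entropy between the temperature-softened teacher and student distributions
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    l_soft = -(soft_targets * F.log_softmax(student_logits / T, dim=-1)).sum(dim=-1).mean()
    # L_hard: ordinary cross entropy against the ground-truth labels
    l_hard = F.cross_entropy(student_logits, labels)
    # L_soft is scaled by T^2 so its gradient magnitude matches L_hard
    return a * (T ** 2) * l_soft + (1 - a) * l_hard
```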
The overall framework of the algorithm is shown below (image from https://blog.csdn.net/nature553863/article/details/80568658).
When it comes to distilling BERT, the first idea that comes to mind is to use a fine-tuned BERT as the teacher model to train a student model, and this is exactly what TinyBERT does. The next question is what model to use as the student. There have been various attempts; some people use a BiLSTM, but most still stick with BERT, just a smaller one than the original. In TinyBERT, the student model is a small BERT with reduced embedding size, hidden size, and number of hidden layers.
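For example, with the transformers library such a student can be built simply by shrinking the BertConfig; the sizes below roughly follow the 4-layer TinyBERT described in the paper, so treat them as illustrative:

```python
from transformers import BertConfig, BertModel

# a 4-layer student with much smaller hidden and feed-forward sizes than BERT-base
student_config = BertConfig(num_hidden_layers=4, hidden_size=312,
                            intermediate_size=1200, num_attention_heads=12)
student_model = BertModel(student_config)
```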
How should the student model be initialized, then? The simplest option is to initialize its parameters randomly, but a better one is to start from a pre-trained model, so we need a pre-trained student. TinyBERT's approach is to use a pre-trained BERT to distill a pre-trained student model.
OK, that basically covers TinyBERT. To briefly summarize, TinyBERT consists of two steps: general distillation, where the pre-trained BERT distills a small pre-trained student on general-domain data, and task-specific distillation, where the fine-tuned BERT distills that student on the downstream task's data.
Now let's talk about TinyBERT's loss function.
The formula is as follows:
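The overall objective, as given in the TinyBERT paper (reproduced here in roughly the paper's notation, with \(S\) the student, \(T\) the teacher, and \(g(\cdot)\) the layer-mapping function), is:

$$\mathcal{L}_{model} = \sum_{m=0}^{M+1} \lambda_m \, \mathcal{L}_{layer}\big(f_m^S(x), f_{g(m)}^T(x)\big)$$

$$\mathcal{L}_{layer} = \begin{cases} \mathcal{L}_{embd}, & m = 0 \\ \mathcal{L}_{hidn} + \mathcal{L}_{attn}, & M \ge m > 0 \\ \mathcal{L}_{pred}, & m = M + 1 \end{cases}$$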
To explain the formula: \(m\) indexes the student's layers, \(M\) is the number of Transformer layers in the student, and \(g(m)\) tells student layer \(m\) which teacher layer to learn from; \(m=0\) is the embedding layer, \(1 \le m \le M\) are the Transformer layers (whose hidden states and attention matrices are both fitted), \(m=M+1\) is the prediction layer, and \(\lambda_m\) weights each layer's loss.
One more note: during distillation, the hidden-layer distillation (i.e. \(m \le M\)) is performed first, and only afterwards the distillation for \(m = M+1\).
To sum up, and to help with intuition: when TinyBERT distills, the student model not only learns the teacher's final-layer output but also the outputs of several intermediate layers. In other words, a given hidden layer of the student can learn from the outputs of some of the teacher's hidden layers. The distillation granularity is rather fine; I feel it could be called LayerBasedDistillation.
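To make the per-layer terms concrete, here is a minimal sketch (my own, not the official TinyBERT code) of the hidden-state and attention losses for one matched layer pair; as in the paper, the student's hidden states are projected to the teacher's width with a learnable matrix \(W_h\) before the MSE, and the default sizes below are just examples:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBertLayerLoss(nn.Module):
    """L_hidn + L_attn for one (student layer, teacher layer) pair."""
    def __init__(self, student_hidden=312, teacher_hidden=768):
        super().__init__()
        # W_h: maps the student's hidden states into the teacher's hidden space
        self.proj = nn.Linear(student_hidden, teacher_hidden, bias=False)

    def forward(self, h_student, h_teacher, attn_student, attn_teacher):
        hidden_loss = F.mse_loss(self.proj(h_student), h_teacher)  # L_hidn
        attn_loss = F.mse_loss(attn_student, attn_teacher)         # L_attn, averaged over heads
        return hidden_loss + attn_loss
```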
With TinyBERT covered, let me also talk about DistilBERT. DistilBERT is quite a bit simpler than TinyBERT, so I will keep this short. DistilBERT uses a pre-trained BERT as the teacher model to train a student model, where the student is simply a BERT with fewer layers. Note that the resulting DistilBERT is still essentially a pre-trained model, so for a concrete downstream task you still need to fine-tune it on task-specific data; this is plain fine-tuning, with no further distillation involved. HuggingFace already provides several distilled pre-trained models, and you can simply use them as drop-in replacements for BERT.
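For example, one of the ready-made checkpoints can be loaded with transformers like this (the model name below is the standard English DistilBERT; swap in whichever checkpoint you need):

```python
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Knowledge distillation makes BERT smaller.", return_tensors="pt")
outputs = model(**inputs)  # use it just like a regular BERT encoder
```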
DistilBERT's loss function is \(L_{ce} + L_{mlm} + L_{cos}\): a distillation (soft-target) loss against the teacher's output distribution, the usual masked-language-modeling loss, and a cosine loss that aligns the student's hidden states with the teacher's.
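A rough, simplified sketch of the three terms (my own illustration, not the authors' code; the argument names and the temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def distilbert_loss(student_logits, teacher_logits, mlm_labels,
                    student_hidden, teacher_hidden, T=2.0):
    vocab_size = student_logits.size(-1)
    # L_ce: match the teacher's temperature-softened distribution over the vocabulary
    l_ce = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    # L_mlm: standard masked-language-modeling cross entropy against the true tokens
    l_mlm = F.cross_entropy(student_logits.view(-1, vocab_size),
                            mlm_labels.view(-1), ignore_index=-100)
    # L_cos: cosine loss pulling the student's hidden states toward the teacher's
    s = student_hidden.view(-1, student_hidden.size(-1))
    t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    l_cos = F.cosine_embedding_loss(s, t, torch.ones(s.size(0), device=s.device))
    return l_ce + l_mlm + l_cos
```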
Strictly speaking this one is not knowledge distillation, but it does shrink the model, and the idea has a lot in common with TinyBERT and DistilBERT, so I will cover it here. The idea is very elegant: training proceeds by randomly replacing groups of the large model's layers with single layers of the small model. An example: suppose the large model is input->tfc1->tfc2->tfc3->tfc4->tfc5->tfc6->output, and we define a small model input->sfc1->sfc2->sfc3->output. During training we are still training the large model, but at each step (tfc1,tfc2), (tfc3,tfc4), (tfc5,tfc6) are randomly replaced by sfc1, sfc2, sfc3 respectively, and the replacement probability keeps growing as training proceeds, so by the end we are effectively training a small model.
Here is a figure to aid understanding.

The approach is elegant, and the authors provide source code; I strongly recommend giving it a try.
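To illustrate the mechanism, here is a minimal sketch of the module-replacement idea (my own, not the authors' code; it assumes each small-model block stands in for a fixed group of large-model blocks, and `replace_prob` would be raised by a scheduler as training proceeds):

```python
import random
import torch.nn as nn

class ModuleReplacingEncoder(nn.Module):
    def __init__(self, teacher_blocks, student_blocks, replace_prob=0.1):
        super().__init__()
        assert len(teacher_blocks) % len(student_blocks) == 0
        self.teacher_blocks = nn.ModuleList(teacher_blocks)  # e.g. tfc1..tfc6 (kept frozen)
        self.student_blocks = nn.ModuleList(student_blocks)  # e.g. sfc1..sfc3 (being trained)
        self.group = len(teacher_blocks) // len(student_blocks)
        self.replace_prob = replace_prob                     # increased over the course of training

    def forward(self, x):
        for i, student_block in enumerate(self.student_blocks):
            if self.training and random.random() < self.replace_prob:
                x = student_block(x)                         # use the small model's block
            else:
                for teacher_block in self.teacher_blocks[i * self.group:(i + 1) * self.group]:
                    x = teacher_block(x)                     # use the original large-model blocks
        return x
```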
A newly published paper, also about BERT distillation. Let me briefly summarize its three innovations:
Here is a figure to aid understanding:
I have implemented a PyTorch-based knowledge distillation framework; anyone interested is welcome to try it. The framework abstracts distillation between multi-layer models as much as possible and can implement algorithms such as TinyBERT and DistilBERT. While maintaining it, I found that knowledge distillation is not yet mature and new distillation algorithms keep appearing, so there is no way to design a single unified framework that covers them all. I have therefore adjusted the library slightly and split it into two parts:
You are welcome to contribute example code for new knowledge distillation algorithms. The examples should be as concise, readable, and easy to run as possible, ideally for algorithms whose authors have not released source code. Project links:
Pypi:https://pypi.org/project/KnowledgeDistillation/
Github:https://github.com/DunZhang/KnowledgeDistillation
Here is an example of using the multi-layer-model distillation framework: a 12-layer BERT distills a 3-layer BERT using TinyBERT's loss function. The code is complete and can be run directly, without any external data:
# import packages
import torch
import logging
import numpy as np
from transformers import BertModel, BertConfig
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from knowledge_distillation import KnowledgeDistiller, MultiLayerBasedDistillationLoss
from knowledge_distillation import MultiLayerBasedDistillationEvaluator

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# Some global variables
train_batch_size = 40
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
learning_rate = 1e-5
num_epoch = 10

# define student and teacher model
# Teacher Model
bert_config = BertConfig(num_hidden_layers=12, hidden_size=60, intermediate_size=60,
                         output_hidden_states=True, output_attentions=True)
teacher_model = BertModel(bert_config)
# Student Model
bert_config = BertConfig(num_hidden_layers=3, hidden_size=60, intermediate_size=60,
                         output_hidden_states=True, output_attentions=True)
student_model = BertModel(bert_config)

### Train data loader
input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 50)))
attention_mask = torch.LongTensor(np.ones((100000, 50)))
token_type_ids = torch.LongTensor(np.zeros((100000, 50)))
train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)

### Train data adaptor
### It is a function that turns batch_data (from train_dataloader) into the inputs of teacher_model and student_model
### You can define your own train_data_adaptor. Remember the inputs must be device and batch_data.
### The output is either dict or tuple, but must be consistent with your model's input
def train_data_adaptor(device, batch_data):
    batch_data = tuple(t.to(device) for t in batch_data)
    batch_data_dict = {"input_ids": batch_data[0],
                       "attention_mask": batch_data[1],
                       "token_type_ids": batch_data[2], }
    # In this case, the teacher and student use the same input
    return batch_data_dict, batch_data_dict

### The loss model is the key part of this package.
### We have already provided a general loss model for distilling multiple bert layers
### In most cases, you can directly use this model.
#### First, we should define a distill_config which indicates how to compute the loss between teacher and student.
#### distill_config is a list-object, each item indicates how to calculate a loss.
#### It also defines which output of which layer is used to calculate the loss.
#### It should be consistent with your output_adaptor
distill_config = [
    # compute a loss between the teacher's and the student's embedding_layer embeddings
    {"teacher_layer_name": "embedding_layer", "teacher_layer_output_name": "embedding",
     "student_layer_name": "embedding_layer", "student_layer_output_name": "embedding",
     "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
     },
    # compute a loss between the teacher's bert_layer12 hidden_states and the student's bert_layer3 hidden_states
    {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "hidden_states",
     "student_layer_name": "bert_layer3", "student_layer_output_name": "hidden_states",
     "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
     },
    {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "attention",
     "student_layer_name": "bert_layer3", "student_layer_output_name": "attention",
     "loss": {"loss_function": "attention_mse_with_mask", "args": {}}, "weight": 1.0
     },
    {"teacher_layer_name": "pred_layer", "teacher_layer_output_name": "pooler_output",
     "student_layer_name": "pred_layer", "student_layer_output_name": "pooler_output",
     "loss": {"loss_function": "mse", "args": {}}, "weight": 1.0
     },
]

### teacher_output_adaptor and student_output_adaptor
### In most cases, a model's output is a tuple-object. However, in our package, we need the output to be a dict-object,
### like: { "layer_name":{"output_name":value} .... }
### Hence, the output adaptor turns your model's output into a dict-object output
### In my case, teacher and student can use one adaptor
def output_adaptor(model_output):
    last_hidden_state, pooler_output, hidden_states, attentions = model_output
    output = {"embedding_layer": {"embedding": hidden_states[0]}}
    for idx in range(len(attentions)):
        output["bert_layer" + str(idx + 1)] = {"hidden_states": hidden_states[idx + 1],
                                               "attention": attentions[idx]}
    output["pred_layer"] = {"pooler_output": pooler_output}
    return output

# loss_model
loss_model = MultiLayerBasedDistillationLoss(distill_config=distill_config,
                                             teacher_output_adaptor=output_adaptor,
                                             student_output_adaptor=output_adaptor)
# optimizer
param_optimizer = list(student_model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=learning_rate)

# evaluator
# this is a basic evaluator, it can print the loss value and save models
# You can define your own evaluator class that implements the interface IEvaluator
evaluator = MultiLayerBasedDistillationEvaluator(save_dir="save_model", save_step=1000, print_loss_step=20)

# Get a KnowledgeDistiller
distiller = KnowledgeDistiller(teacher_model=teacher_model, student_model=student_model,
                               train_dataloader=train_dataloader, dev_dataloader=None,
                               train_data_adaptor=train_data_adaptor, dev_data_adaptor=None,
                               device=device, loss_model=loss_model, optimizer=optimizer,
                               evaluator=evaluator, num_epoch=num_epoch)
# start distillation
distiller.distillate()
Let me also introduce another knowledge distillation library, TextBrewer, implemented by Harbin Institute of Technology. Compared with my library it implements more algorithms and runs more stably; I recommend it.
GitHub: https://github.com/airaria/TextBrewer
Here, likewise, is a complete, runnable example that does not need any external data:
import torch
import numpy as np
import pickle
import textbrewer
from textbrewer import GeneralDistiller
from textbrewer import TrainingConfig, DistillationConfig
from transformers import BertConfig, BertModel
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

## define the teacher and student models
bert_config = BertConfig(num_hidden_layers=12, output_hidden_states=True, output_attentions=True)
teacher_model = BertModel(bert_config).to(device)
bert_config = BertConfig(num_hidden_layers=3, output_hidden_states=True, output_attentions=True)
student_model = BertModel(bert_config).to(device)

# optimizer
param_optimizer = list(student_model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=2e-5)

### data
input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 64)))
attention_mask = torch.LongTensor(np.ones((100000, 64)))
token_type_ids = torch.LongTensor(np.zeros((100000, 64)))
train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=16)

# Define an adaptor for translating the model inputs and outputs
# It reshapes the outputs into the format the distiller needs
# The dict keys appear to be fixed names expected by TextBrewer
def bert_adaptor(batch, model_outputs):
    last_hidden_state, pooler_output, hidden_states, attentions = model_outputs
    hidden_states = list(hidden_states)
    hidden_states.append(pooler_output)
    output = {"inputs_mask": batch[1],
              "attention": attentions,
              "hidden": hidden_states}
    return output

# Training configuration
train_config = TrainingConfig(gradient_accumulation_steps=1,
                              ckpt_frequency=10,
                              ckpt_epoch_frequency=1,
                              log_dir='logs',
                              output_dir='saved_models',
                              device=device)

# Distillation configuration
# Matching different layers of the student and the teacher
# Important: this is where the distillation behaviour is defined
# Custom loss functions are not supported
# There is no dedicated CLS loss, but it can be squeezed in as a hidden loss
distill_config = DistillationConfig(
    intermediate_matches=[
        {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # embedding loss
        {'layer_T': 4, 'layer_S': 1, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # hidden loss
        {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
        {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
        {'layer_T': 3, 'layer_S': 0, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},  # attention loss
        {'layer_T': 7, 'layer_S': 1, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
        {'layer_T': 11, 'layer_S': 2, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
        {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # effectively serves as the CLS loss
    ]
)

# Build distiller
distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher_model, model_S=student_model,
    adaptor_T=bert_adaptor, adaptor_S=bert_adaptor)

# Start!
# A callback can be used to evaluate on a dev set
# Note: what gets saved is the state_dict
with distiller:
    distiller.train(optimizer=optimizer, scheduler=None, dataloader=train_dataloader,
                    num_epochs=10, callback=None)
There are many other ways to speed up BERT that I will not go into here; if you are interested, you can look into them:
This article may be reposted, but please credit the source: