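Hinton introduced knowledge distillation in the paper "Distilling the Knowledge in a Neural Network". There is already plenty of material about it online, so I will only give a brief summary here.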
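Loss function: $$Loss = aL_{soft} + (1-a)L_{hard}$$
Here \(L_{soft}\) is the loss of the student against the teacher's temperature-softened output distribution, \(L_{hard}\) is the loss of the student against the ground-truth labels, and \(a\) is the weight that balances the two.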
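To make the formula concrete, here is a minimal pure-Python sketch of the combined loss for a single example. The function names, the default temperature of 2.0 and the default `a` of 0.5 are my own illustrative choices, not something taken from the paper or from any library.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs):
    """H(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, label, a=0.5, temperature=2.0):
    """Loss = a * L_soft + (1 - a) * L_hard for one training example."""
    # L_soft: student vs. teacher, both softened with the same temperature.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    l_soft = cross_entropy(soft_teacher, soft_student)
    # L_hard: student at temperature 1 vs. the one-hot ground-truth label.
    l_hard = -math.log(softmax(student_logits)[label])
    return a * l_soft + (1 - a) * l_hard


if __name__ == "__main__":
    print(distillation_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2], label=0))
```

Hinton's paper also suggests multiplying the soft term by the square of the temperature so that its gradients stay on a comparable scale; I left that out to keep the sketch identical to the formula above.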
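When it comes to distilling BERT, the first approach that comes to mind is to use a fine-tuned BERT as the teacher model to train a student model, which is exactly what TinyBERT does. The next question is which model to use as the student. There have been several attempts: some people use a BiLSTM, for example, but most keep using BERT, only a smaller one than the original. In TinyBERT, the student is a small BERT with a reduced embedding size, hidden size and number of hidden layers.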
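To get a feeling for how much these three knobs shrink the model, here is a rough parameter-count sketch. It assumes a standard BERT encoder with a pooler and no task head, an embedding size equal to the hidden size, and the bert-base-uncased vocabulary of 30522 tokens. The student setting (hidden size 312, intermediate size 1200, 4 layers) is the 4-layer configuration reported in the TinyBERT paper; the helper name `bert_param_count` is my own.

```python
def bert_param_count(vocab_size, hidden_size, intermediate_size, num_layers,
                     max_positions=512, type_vocab_size=2):
    """Count the parameters of a BERT encoder with pooler (no task head)."""
    layer_norm = 2 * hidden_size  # scale + bias
    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm
    # Q, K, V and output projections (weights + biases), followed by one LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two feed-forward projections (weights + biases), followed by one LayerNorm.
    ffn = (hidden_size * intermediate_size + intermediate_size
           + intermediate_size * hidden_size + hidden_size
           + layer_norm)
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + num_layers * (attention + ffn) + pooler


if __name__ == "__main__":
    teacher = bert_param_count(30522, 768, 3072, 12)  # BERT-base sized teacher
    student = bert_param_count(30522, 312, 1200, 4)   # TinyBERT-like student
    print(teacher, student, round(teacher / student, 1))
```

Note that at this size the embedding table makes up most of the student, which is why reducing the embedding size matters as much as dropping layers.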
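DistilBERT's loss function: \(L_{ce} + L_{mlm} + L_{cos}\). Here \(L_{ce}\) is the distillation loss against the teacher's soft targets, \(L_{mlm}\) is the masked language modeling loss, and \(L_{cos}\) is a cosine loss that aligns the directions of the student's and teacher's hidden states.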
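As a rough illustration of how the three terms fit together, here is a pure-Python sketch for a single masked position. It is not DistilBERT's training code: the unweighted sum simply follows the formula above, and the function names and the temperature of 2.0 are assumptions of mine.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_loss(student_hidden, teacher_hidden):
    """L_cos = 1 - cos(h_s, h_t); zero when the two vectors point the same way."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def distilbert_style_loss(student_logits, teacher_logits, masked_token_id,
                          student_hidden, teacher_hidden, temperature=2.0):
    """L_ce + L_mlm + L_cos for one masked token position."""
    # L_ce: cross-entropy between the softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    l_ce = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    # L_mlm: negative log-likelihood of the true masked token under the student.
    l_mlm = -math.log(softmax(student_logits)[masked_token_id])
    # L_cos: align the directions of the student and teacher hidden vectors.
    l_cos = cosine_loss(student_hidden, teacher_hidden)
    return l_ce + l_mlm + l_cos


if __name__ == "__main__":
    print(distilbert_style_loss([1.5, 0.2, -0.3], [2.0, 0.1, -1.0], 0,
                                [0.1, 0.4, -0.2], [0.2, 0.5, -0.1]))
```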
A newly published paper also deals with BERT distillation. I will briefly summarize its three innovations:

    # import packages
    import torch
    import logging
    import numpy as np
    from transformers import BertModel, BertConfig
    from torch.utils.data import DataLoader, RandomSampler, TensorDataset
    from knowledge_distillation import KnowledgeDistiller, MultiLayerBasedDistillationLoss
    from knowledge_distillation import MultiLayerBasedDistillationEvaluator

    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

    # Some global variables
    train_batch_size = 40
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    learning_rate = 1e-5
    num_epoch = 10

    # Define the teacher and student models.
    # return_dict=False keeps tuple outputs on transformers 4.x, which the
    # output adaptor below unpacks; older versions simply ignore it.
    # Teacher model
    bert_config = BertConfig(num_hidden_layers=12, hidden_size=60, intermediate_size=60,
                             output_hidden_states=True, output_attentions=True,
                             return_dict=False)
    teacher_model = BertModel(bert_config)
    # Student model
    bert_config = BertConfig(num_hidden_layers=3, hidden_size=60, intermediate_size=60,
                             output_hidden_states=True, output_attentions=True,
                             return_dict=False)
    student_model = BertModel(bert_config)

    # Train data loader (random token ids, just to exercise the pipeline).
    # dtype=np.int64 makes sure the arrays match torch.LongTensor.
    input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 50), dtype=np.int64))
    attention_mask = torch.LongTensor(np.ones((100000, 50), dtype=np.int64))
    token_type_ids = torch.LongTensor(np.zeros((100000, 50), dtype=np.int64))
    train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)


    # Train data adaptor.
    # It turns batch_data (from train_dataloader) into the inputs of teacher_model
    # and student_model. You can define your own train_data_adaptor; its arguments
    # must be device and batch_data. The output is either a dict or a tuple, but it
    # must be consistent with your model's input.
    def train_data_adaptor(device, batch_data):
        batch_data = tuple(t.to(device) for t in batch_data)
        batch_data_dict = {"input_ids": batch_data[0],
                           "attention_mask": batch_data[1],
                           "token_type_ids": batch_data[2]}
        # In this case, the teacher and the student use the same input.
        return batch_data_dict, batch_data_dict


    # The loss model is the core of the distillation.
    # The package already provides a general loss model for distilling multiple
    # BERT layers; in most cases you can use it directly.
    # First, define a distill_config that says how to compute the loss between
    # teacher and student. It is a list; each item describes one loss term and
    # which output of which layer it is computed on. It must be consistent with
    # your output adaptor.
    distill_config = [
        # Loss between the embedding layers' embeddings.
        {"teacher_layer_name": "embedding_layer", "teacher_layer_output_name": "embedding",
         "student_layer_name": "embedding_layer", "student_layer_output_name": "embedding",
         "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0},
        # Loss between the teacher's bert_layer12 hidden_states and the student's
        # bert_layer3 hidden_states.
        {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "hidden_states",
         "student_layer_name": "bert_layer3", "student_layer_output_name": "hidden_states",
         "loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0},
        # Loss between the attention maps of the same pair of layers.
        {"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "attention",
         "student_layer_name": "bert_layer3", "student_layer_output_name": "attention",
         "loss": {"loss_function": "attention_mse_with_mask", "args": {}}, "weight": 1.0},
        # Loss between the pooler outputs.
        {"teacher_layer_name": "pred_layer", "teacher_layer_output_name": "pooler_output",
         "student_layer_name": "pred_layer", "student_layer_output_name": "pooler_output",
         "loss": {"loss_function": "mse", "args": {}}, "weight": 1.0},
    ]


    # teacher_output_adaptor and student_output_adaptor.
    # In most cases the model's output is a tuple, but this package needs a dict
    # of the form {"layer_name": {"output_name": value}, ...}.
    # The output adaptor converts the model's output into that dict.
    # Here the teacher and the student can share one adaptor.
    def output_adaptor(model_output):
        last_hidden_state, pooler_output, hidden_states, attentions = model_output
        output = {"embedding_layer": {"embedding": hidden_states[0]}}
        for idx in range(len(attentions)):
            output["bert_layer" + str(idx + 1)] = {"hidden_states": hidden_states[idx + 1],
                                                   "attention": attentions[idx]}
        output["pred_layer"] = {"pooler_output": pooler_output}
        return output


    # loss_model
    loss_model = MultiLayerBasedDistillationLoss(distill_config=distill_config,
                                                 teacher_output_adaptor=output_adaptor,
                                                 student_output_adaptor=output_adaptor)

    # optimizer: no weight decay for biases and LayerNorm parameters
    param_optimizer = list(student_model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]
    optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=learning_rate)

    # evaluator
    # This is a basic evaluator; it can print the loss value and save models.
    # You can define your own evaluator class that implements the IEvaluator interface.
    evaluator = MultiLayerBasedDistillationEvaluator(save_dir="save_model", save_step=1000,
                                                     print_loss_step=20)

    # Get a KnowledgeDistiller
    distiller = KnowledgeDistiller(teacher_model=teacher_model, student_model=student_model,
                                   train_dataloader=train_dataloader, dev_dataloader=None,
                                   train_data_adaptor=train_data_adaptor, dev_data_adaptor=None,
                                   device=device, loss_model=loss_model, optimizer=optimizer,
                                   evaluator=evaluator, num_epoch=num_epoch)

    # Start distillation
    distiller.distillate()

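The config above refers to a loss called "mse_with_mask" without showing what it computes. The sketch below is only my guess at the idea: a mean squared error that skips padded positions. The package's real implementation is not shown here and may normalize differently, and the name `masked_mse` is hypothetical.

```python
def masked_mse(student_states, teacher_states, mask):
    """Mean squared error over the token positions where mask == 1.

    student_states / teacher_states: one vector (list of floats) per token.
    mask: one 0/1 entry per token; 0 marks padding that should be ignored.
    """
    total, count = 0.0, 0
    for s_vec, t_vec, m in zip(student_states, teacher_states, mask):
        if not m:
            continue  # padded position: contributes nothing to the loss
        total += sum((s - t) ** 2 for s, t in zip(s_vec, t_vec))
        count += len(s_vec)
    return total / count if count else 0.0


if __name__ == "__main__":
    student = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
    teacher = [[1.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
    print(masked_mse(student, teacher, mask=[1, 1, 0]))
```

The point of the mask is that the third token, although it differs a lot between student and teacher, is padding and therefore must not influence the gradient.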

    import torch
    import numpy as np
    from textbrewer import GeneralDistiller
    from textbrewer import TrainingConfig, DistillationConfig
    from transformers import BertConfig, BertModel
    from torch.utils.data import DataLoader, RandomSampler, TensorDataset

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    # Define the models.
    # return_dict=False keeps tuple outputs on transformers 4.x, which
    # bert_adaptor below unpacks; older versions simply ignore it.
    bert_config = BertConfig(num_hidden_layers=12, output_hidden_states=True,
                             output_attentions=True, return_dict=False)
    teacher_model = BertModel(bert_config).to(device)
    bert_config = BertConfig(num_hidden_layers=3, output_hidden_states=True,
                             output_attentions=True, return_dict=False)
    student_model = BertModel(bert_config).to(device)

    # optimizer: no weight decay for biases and LayerNorm parameters
    param_optimizer = list(student_model.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]
    optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=2e-5)

    # data (random token ids, just to exercise the pipeline);
    # dtype=np.int64 makes sure the arrays match torch.LongTensor
    input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 64), dtype=np.int64))
    attention_mask = torch.LongTensor(np.ones((100000, 64), dtype=np.int64))
    token_type_ids = torch.LongTensor(np.zeros((100000, 64), dtype=np.int64))
    train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=16)


    # Define an adaptor for translating the model inputs and outputs.
    # It packs them into the data format the distiller needs.
    # Do the keys have to be fixed???
    def bert_adaptor(batch, model_outputs):
        last_hidden_state, pooler_output, hidden_states, attentions = model_outputs
        hidden_states = list(hidden_states)
        hidden_states.append(pooler_output)
        output = {"inputs_mask": batch[1],
                  "attention": attentions,
                  "hidden": hidden_states}
        return output


    # Training configuration (use the device selected above instead of hard-coding 'cuda')
    train_config = TrainingConfig(gradient_accumulation_steps=1,
                                  ckpt_frequency=10,
                                  ckpt_epoch_frequency=1,
                                  log_dir='logs',
                                  output_dir='saved_models',
                                  device=device)

    # Distillation configuration: matching different layers of the student and the teacher.
    # Important: this defines how the distillation is done.
    # Custom loss functions are not supported.
    # A CLS loss is not supported either, but it can be forced into the hidden loss.
    distill_config = DistillationConfig(
        intermediate_matches=[
            {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # embedding loss
            {'layer_T': 4, 'layer_S': 1, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # hidden loss
            {'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
            {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
            {'layer_T': 3, 'layer_S': 0, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},  # attention loss
            {'layer_T': 7, 'layer_S': 1, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
            {'layer_T': 11, 'layer_S': 2, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
            # Meant to be the CLS loss. Note: as written, indices 12 and 3 repeat the
            # last hidden-layer match above; the pooler_output appended in bert_adaptor
            # sits at index 13 for the teacher and index 4 for the student.
            {'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
        ]
    )

    # Build distiller
    distiller = GeneralDistiller(train_config=train_config,
                                 distill_config=distill_config,
                                 model_T=teacher_model,
                                 model_S=student_model,
                                 adaptor_T=bert_adaptor,
                                 adaptor_S=bert_adaptor)

    # Start!
    # A callback can be used to evaluate on the dev set.
    # Note that what gets saved is the state_dict.
    with distiller:
        distiller.train(optimizer=optimizer,
                        scheduler=None,
                        dataloader=train_dataloader,
                        num_epochs=10,
                        callback=None)

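The layer indices in intermediate_matches are easy to get wrong by hand, so here is a small sketch that generates the same uniform mapping (teacher layers 4, 8, 12 onto student layers 1, 2, 3). The helper `uniform_matches` is hypothetical and not part of TextBrewer; it only builds plain dicts with the same keys and loss names used in the config above.

```python
def uniform_matches(num_teacher_layers, num_student_layers):
    """Build an intermediate_matches list with a uniform layer mapping."""
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    step = num_teacher_layers // num_student_layers

    # Hidden states: index 0 is the embedding output, index k is layer k.
    matches = [{'layer_T': 0, 'layer_S': 0, 'feature': 'hidden',
                'loss': 'hidden_mse', 'weight': 1}]
    for s in range(1, num_student_layers + 1):
        matches.append({'layer_T': s * step, 'layer_S': s, 'feature': 'hidden',
                        'loss': 'hidden_mse', 'weight': 1})

    # Attention maps: there is no embedding entry, so layer k sits at index k - 1.
    for s in range(num_student_layers):
        matches.append({'layer_T': (s + 1) * step - 1, 'layer_S': s, 'feature': 'attention',
                        'loss': 'attention_mse', 'weight': 1})
    return matches


if __name__ == "__main__":
    for match in uniform_matches(12, 3):
        print(match)
```

The off-by-one between the two features is the part worth remembering: the hidden-state tuple starts with the embedding output, while the attention tuple starts with the first Transformer layer.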
This article may be reposted, but please credit the source: