According to the official introduction, XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that optimizes TensorFlow computations: it can improve execution speed on servers and mobile platforms, as well as memory usage and portability. The XLA framework is experimental and still under active development.
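For context, there are two common ways to switch on XLA in TensorFlow 2: globally via the optimizer options, or per function. A minimal sketch (note that in older 2.x releases the per-function flag was named `experimental_compile` rather than `jit_compile`):

```python
import tensorflow as tf

# Global: let TensorFlow auto-cluster ops and compile them with XLA.
tf.config.optimizer.set_jit(True)

# Per function: compile just this computation with XLA.
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)
```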
So I wanted to see how much XLA speeds up BERT. I picked the pretrained Chinese BERT model and tested it on a sentiment classification task.
```python
import tensorflow as tf
from transformers import BertTokenizer, BertConfig, TFBertForSequenceClassification

from band.dataset import ChnSentiCorp
from band.progress import classification_convert_examples_to_features

USE_XLA = False
USE_AMP = False

EPOCHS = 5
BATCH_SIZE = 16
EVAL_BATCH_SIZE = 16
TEST_BATCH_SIZE = 1
MAX_SEQ_LEN = 128
LEARNING_RATE = 3e-5

# Toggle XLA JIT compilation and automatic mixed precision here.
tf.config.optimizer.set_jit(USE_XLA)
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})

# Load the ChnSentiCorp sentiment classification dataset.
dataset = ChnSentiCorp(save_path="/tmp/band")
data, label = dataset.data, dataset.label
dataset.dataset_information()
train_number, eval_number, test_number = (dataset.train_examples_num,
                                          dataset.eval_examples_num,
                                          dataset.test_examples_num)

# Convert raw examples into BERT input features.
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
train_dataset = classification_convert_examples_to_features(
    data['train'], tokenizer, max_length=MAX_SEQ_LEN,
    label_list=label, output_mode="classification")
valid_dataset = classification_convert_examples_to_features(
    data['validation'], tokenizer, max_length=MAX_SEQ_LEN,
    label_list=label, output_mode="classification")

train_dataset = train_dataset.shuffle(100).batch(BATCH_SIZE, drop_remainder=True).repeat(EPOCHS)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)
valid_dataset = valid_dataset.batch(EVAL_BATCH_SIZE)
valid_dataset = valid_dataset.prefetch(tf.data.experimental.AUTOTUNE)

# Build the classifier on top of the pretrained Chinese BERT checkpoint.
config = BertConfig.from_pretrained("bert-base-chinese", num_labels=dataset.num_labels)
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese', config=config)

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, epsilon=1e-08)
if USE_AMP:
    # Dynamic loss scaling keeps fp16 gradients from underflowing.
    optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, 'dynamic')

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

history = model.fit(train_dataset, epochs=EPOCHS,
                    steps_per_epoch=train_number // BATCH_SIZE,
                    validation_data=valid_dataset,
                    validation_steps=eval_number // EVAL_BATCH_SIZE)
```
Here, band is a BERT library I am writing myself, still under development. Toggling XLA only requires setting USE_XLA. The results of the runs are as follows:
Without XLA:
```
Epoch 1/5
600/600 [==============================] - 355s 592ms/step - loss: 0.2685 - accuracy: 0.8976 - val_loss: 0.2427 - val_accuracy: 0.9142
Epoch 2/5
600/600 [==============================] - 332s 554ms/step - loss: 0.1707 - accuracy: 0.9420 - val_loss: 0.1824 - val_accuracy: 0.9258
Epoch 3/5
600/600 [==============================] - 332s 554ms/step - loss: 0.0934 - accuracy: 0.9686 - val_loss: 0.1995 - val_accuracy: 0.9383
Epoch 4/5
600/600 [==============================] - 333s 554ms/step - loss: 0.0768 - accuracy: 0.9747 - val_loss: 0.2288 - val_accuracy: 0.9442
Epoch 5/5
600/600 [==============================] - 333s 555ms/step - loss: 0.0564 - accuracy: 0.9807 - val_loss: 0.2247 - val_accuracy: 0.9408
```
With XLA:
```
Epoch 1/5
600/600 [==============================] - 573s 955ms/step - loss: 0.2824 - accuracy: 0.8940 - val_loss: 0.2162 - val_accuracy: 0.9192
Epoch 2/5
600/600 [==============================] - 309s 515ms/step - loss: 0.1577 - accuracy: 0.9444 - val_loss: 0.2361 - val_accuracy: 0.9233
Epoch 3/5
600/600 [==============================] - 309s 514ms/step - loss: 0.0993 - accuracy: 0.9678 - val_loss: 0.2270 - val_accuracy: 0.9333
Epoch 4/5
600/600 [==============================] - 307s 512ms/step - loss: 0.0702 - accuracy: 0.9780 - val_loss: 0.2492 - val_accuracy: 0.9300
Epoch 5/5
600/600 [==============================] - 310s 516ms/step - loss: 0.0572 - accuracy: 0.9815 - val_loss: 0.2675 - val_accuracy: 0.9300
```
The per-epoch times are summarized in the table below:
| Run | Epoch 1 | Epochs 2~5 |
| :----------: | :------: | :---------------------------: |
| Without XLA | 355s | 332s |
| With XLA | 573s | 309s |
The explanation for the baseline run is that the GPU spends part of the first epoch on one-off initialization (think of it as a warm-up); only from the second epoch on does the timing reflect normal running speed.
XLA is a compiler, so its first epoch additionally spends time compiling the model, which makes it even slower.
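If you want to verify this warm-up effect yourself, per-epoch wall time can be logged with a small Keras callback. A minimal sketch (the `EpochTimer` name is my own illustration, not part of band):

```python
import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Record wall-clock seconds per epoch to separate compile/warm-up cost."""

    def __init__(self):
        super().__init__()
        self.epoch_times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.time() - self._start)

# Usage: model.fit(..., callbacks=[EpochTimer()]);
# epoch_times[0] includes XLA compilation, the rest are steady state.
```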
I did not set a random seed; since XLA is only a compilation step, it should not have any effect on the results of the run.
So to summarize: XLA spends the first epoch compiling the code, which is why that epoch takes so much longer; after the first epoch it runs stably and faster than the plain run. In this experiment the steady-state epochs are about 7% faster (309s vs 332s).
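A quick back-of-the-envelope calculation from the table (my own arithmetic, not part of the original run) shows when the compile overhead pays for itself:

```python
extra_first_epoch = 573 - 355   # one-off XLA compile/warm-up cost: 218 s
saving_per_epoch = 332 - 309    # steady-state saving per epoch: 23 s

# Epochs of steady-state training needed to recoup the compile cost.
print(extra_first_epoch / saving_per_epoch)  # ~9.5, so XLA wins for runs of ~10+ epochs
```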
The official docs also claim XLA helps reduce resource usage; that is harder to compare here, so for now I will take their word for it.