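[Hands-on Paddle 2.0 Series] A Brief Look at Mixed Precision Training
Hello everyone. This tutorial shows how to enable mixed precision training in Paddle 2.0 and tests it on a model.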
Download and install commands
## CPU version install command
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle
## GPU version install command
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu
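1 Mixed precision training
Mixed precision training was first proposed jointly by Baidu and NVIDIA. The paper Mixed Precision Training describes the technique in detail and explains how it is implemented; interested readers may want to read it.
1.1 Half precision and single precision
Half precision (also called FP16) lowers a neural network's GPU memory footprint compared with the higher-precision FP32 and FP64, so larger networks can be trained and deployed. FP16 data also takes less time to transfer than FP32 or FP64 data.
Single precision (also called 32-bit) is the general-purpose floating-point format (float in C-family languages); the 64-bit format is called double precision (double).
As the figure shows, it is easy to see that half precision takes half the storage space of single precision.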
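The storage sizes above can be checked with Python's standard library alone. The sketch below is an illustration added for this purpose: it relies on the `struct` module, whose format codes `e`, `f`, and `d` denote IEEE 754 half, single, and double precision. The helper name `to_half` is made up for this example.

```python
import struct

# Bytes used by each IEEE 754 format: half (e), single (f), double (d).
SIZES = {code: struct.calcsize(code) for code in ("e", "f", "d")}


def to_half(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]


if __name__ == "__main__":
    print(SIZES)          # half uses 2 bytes, single 4, double 8
    print(to_half(0.1))   # 0.1 is not exactly representable in FP16
    print(to_half(1.0))   # small powers of two are exact
```

Besides the smaller size, the round trip through `to_half` shows the price of FP16: fewer mantissa bits, so values such as 0.1 are stored with a visibly larger rounding error than in FP32.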
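1.2 Why use mixed precision training
Mixed precision training means training with a mix of single precision (float) and half precision (float16).
Where does float16 win over float? It comes down to two reasons: lower memory use and faster computation.
Lower memory use: this one is easy to see. A typical model stored in fp16 needs only half the memory it needed before. Halving the memory bandwidth brings these benefits:
The model takes less memory, so training can use a larger batch size.
During training, the communication volume drops sharply (especially with multiple GPUs, or multiple machines with multiple GPUs), which greatly reduces waiting time and speeds up data movement.
Faster computation: many current GPUs are optimized for fp16 computation. The paper notes that on recent GPUs, half-precision throughput can be 2 to 8 times that of single precision.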
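To make the memory claim concrete, here is a small back-of-the-envelope sketch. It only counts raw storage for one dense tensor; the function name `tensor_bytes` and the lookup table are hypothetical helpers, and the shape matches the batch size (16) and image size (224 x 224) used in the training setup of this tutorial.

```python
from math import prod

# Storage per element for the floating-point types discussed above.
BYTES_PER_ELEMENT = {"float16": 2, "float32": 4, "float64": 8}


def tensor_bytes(shape, dtype):
    """Return the raw storage size in bytes of a dense tensor."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]


if __name__ == "__main__":
    batch = (16, 3, 224, 224)  # one batch of RGB images, NCHW layout
    fp32 = tensor_bytes(batch, "float32")
    fp16 = tensor_bytes(batch, "float16")
    print(f"float32: {fp32 / 2**20:.2f} MiB")
    print(f"float16: {fp16 / 2**20:.2f} MiB")
```

The same halving applies to every activation and gradient stored in fp16, which is where the room for a larger batch size comes from.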
How loss scaling works:
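As a rough illustration of the idea: a gradient that is too small for FP16 underflows to zero, but if the loss (and therefore every gradient) is multiplied by a constant before the backward pass, the value survives in FP16 and can be divided by the same constant in higher precision before the weight update. The sketch below uses only the standard library; `to_half` is a made-up helper, the gradient value is hypothetical, and the scale of 1024 mirrors the `init_loss_scaling` value used in the AMP training loop of this tutorial.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest FP16 value (IEEE 754 half)."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALE = 1024.0   # same value as init_loss_scaling in the AMP training loop
tiny_grad = 1e-8      # hypothetical gradient, below the smallest positive FP16 value

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled = to_half(tiny_grad)

# With scaling, the FP16 value is non-zero; dividing by the scale in
# higher precision recovers a value close to the original gradient.
scaled = to_half(tiny_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

if __name__ == "__main__":
    print("FP16 without scaling:", unscaled)
    print("FP16 with scaling:   ", scaled)
    print("recovered gradient:  ", recovered)
```

The recovered value is not bit-exact, since the scaled gradient is still rounded to FP16, but it is far better than losing the gradient entirely.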
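2 Experiment design
This experiment compares the two training modes in two respects: accuracy and speed. ResNet-18 is the test model, and the dataset is a food dataset with five classes.
# Unzip the dataset
!cd data/data64280/ && unzip -q trainset.zip
2.1 Dataset preprocessing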
    import pandas as pd
    import numpy as np
    import os

    all_file_dir = 'data/data64280/trainset'

    img_list = []
    label_list = []

    label_id = 0

    # Each sub-directory is one class
    class_list = [c for c in os.listdir(all_file_dir) if os.path.isdir(os.path.join(all_file_dir, c))]

    for class_dir in class_list:
        image_path_pre = os.path.join(all_file_dir, class_dir)
        for img in os.listdir(image_path_pre):
            img_list.append(os.path.join(image_path_pre, img))
            label_list.append(label_id)
        label_id += 1

    img_df = pd.DataFrame(img_list)
    label_df = pd.DataFrame(label_list)

    img_df.columns = ['images']
    label_df.columns = ['label']

    df = pd.concat([img_df, label_df], axis=1)

    # Shuffle the rows before saving
    df = df.reindex(np.random.permutation(df.index))

    df.to_csv('food_data.csv', index=False)
    import pandas as pd

    # Read the data
    df = pd.read_csv('food_data.csv')
    image_path_list = df['images'].values
    label_list = df['label'].values

    # Split into training and validation sets
    all_size = len(image_path_list)
    train_size = int(all_size * 0.8)
    train_image_path_list = image_path_list[:train_size]
    train_label_list = label_list[:train_size]
    val_image_path_list = image_path_list[train_size:]
    val_label_list = label_list[train_size:]
2.2 Custom dataset
    import numpy as np
    from PIL import Image
    from paddle.io import Dataset
    import paddle.vision.transforms as T


    class MyDataset(Dataset):
        """
        Step 1: subclass paddle.io.Dataset
        """

        def __init__(self, image, label, transform=None):
            """
            Step 2: implement the constructor; store the image paths, labels, and transform
            """
            super(MyDataset, self).__init__()
            self.imgs = image
            self.labels = label
            self.transform = transform

        def __getitem__(self, index):
            # Required: return the sample stored at the given index
            # Read the image from its path
            img = Image.open(self.imgs[index])
            img = img.convert("RGB")
            img = np.array(img).astype('float32') / 255.0
            label = np.array([self.labels[index]]).astype(dtype='int64')
            if self.transform is not None:
                img = self.transform(img)
            # Whatever is returned here is what each batch yields during training
            return img, label

        def __len__(self):
            # Required: return the dataset length, i.e. the number of images
            # (not to be confused with the length of the loader)
            return len(self.imgs)
2.3 Training preparation
    import paddle
    from paddle.metric import Accuracy
    import warnings
    warnings.filterwarnings("ignore")
    import paddle.vision.transforms as T

    transform = T.Compose([
        T.Resize([224, 224]),
        T.ToTensor(),
        # T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
        # T.Transpose(),
    ])

    train_dataset = MyDataset(image=train_image_path_list, label=train_label_list, transform=transform)

    train_loader = paddle.io.DataLoader(train_dataset, places=paddle.CPUPlace(), batch_size=16, shuffle=True)
    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function

    import numpy as np
    import paddle
    from paddle import ParamAttr
    import paddle.nn as nn
    import paddle.nn.functional as F
    from paddle.nn import Conv2D, BatchNorm, Linear, Dropout
    from paddle.nn import AdaptiveAvgPool2D, MaxPool2D, AvgPool2D
    from paddle.nn.initializer import Uniform

    import math

    # Only the ResNet18 factory is defined in this cell
    __all__ = ["ResNet18"]


    class ConvBNLayer(nn.Layer):
        # Convolution followed by batch normalization (with optional activation)
        def __init__(self,
                     num_channels,
                     num_filters,
                     filter_size,
                     stride=1,
                     groups=1,
                     act=None,
                     name=None,
                     data_format="NCHW"):
            super(ConvBNLayer, self).__init__()
            self._conv = Conv2D(
                in_channels=num_channels,
                out_channels=num_filters,
                kernel_size=filter_size,
                stride=stride,
                padding=(filter_size - 1) // 2,
                groups=groups,
                weight_attr=ParamAttr(name=name + "_weights"),
                bias_attr=False,
                data_format=data_format)
            if name == "conv1":
                bn_name = "bn_" + name
            else:
                bn_name = "bn" + name[3:]
            self._batch_norm = BatchNorm(
                num_filters,
                act=act,
                param_attr=ParamAttr(name=bn_name + "_scale"),
                bias_attr=ParamAttr(bn_name + "_offset"),
                moving_mean_name=bn_name + "_mean",
                moving_variance_name=bn_name + "_variance",
                data_layout=data_format)

        def forward(self, inputs):
            y = self._conv(inputs)
            y = self._batch_norm(y)
            return y


    class BottleneckBlock(nn.Layer):
        # 1x1 -> 3x3 -> 1x1 residual block used by ResNet-50 and deeper
        def __init__(self,
                     num_channels,
                     num_filters,
                     stride,
                     shortcut=True,
                     name=None,
                     data_format="NCHW"):
            super(BottleneckBlock, self).__init__()
            self.conv0 = ConvBNLayer(
                num_channels=num_channels,
                num_filters=num_filters,
                filter_size=1,
                act="relu",
                name=name + "_branch2a",
                data_format=data_format)
            self.conv1 = ConvBNLayer(
                num_channels=num_filters,
                num_filters=num_filters,
                filter_size=3,
                stride=stride,
                act="relu",
                name=name + "_branch2b",
                data_format=data_format)
            self.conv2 = ConvBNLayer(
                num_channels=num_filters,
                num_filters=num_filters * 4,
                filter_size=1,
                act=None,
                name=name + "_branch2c",
                data_format=data_format)

            if not shortcut:
                self.short = ConvBNLayer(
                    num_channels=num_channels,
                    num_filters=num_filters * 4,
                    filter_size=1,
                    stride=stride,
                    name=name + "_branch1",
                    data_format=data_format)

            self.shortcut = shortcut
            self._num_channels_out = num_filters * 4

        def forward(self, inputs):
            y = self.conv0(inputs)
            conv1 = self.conv1(y)
            conv2 = self.conv2(conv1)

            if self.shortcut:
                short = inputs
            else:
                short = self.short(inputs)

            y = paddle.add(x=short, y=conv2)
            y = F.relu(y)
            return y


    class BasicBlock(nn.Layer):
        # Two 3x3 convolutions with a residual connection, used by ResNet-18/34
        def __init__(self,
                     num_channels,
                     num_filters,
                     stride,
                     shortcut=True,
                     name=None,
                     data_format="NCHW"):
            super(BasicBlock, self).__init__()
            self.stride = stride
            self.conv0 = ConvBNLayer(
                num_channels=num_channels,
                num_filters=num_filters,
                filter_size=3,
                stride=stride,
                act="relu",
                name=name + "_branch2a",
                data_format=data_format)
            self.conv1 = ConvBNLayer(
                num_channels=num_filters,
                num_filters=num_filters,
                filter_size=3,
                act=None,
                name=name + "_branch2b",
                data_format=data_format)

            if not shortcut:
                self.short = ConvBNLayer(
                    num_channels=num_channels,
                    num_filters=num_filters,
                    filter_size=1,
                    stride=stride,
                    name=name + "_branch1",
                    data_format=data_format)

            self.shortcut = shortcut

        def forward(self, inputs):
            y = self.conv0(inputs)
            conv1 = self.conv1(y)

            if self.shortcut:
                short = inputs
            else:
                short = self.short(inputs)

            y = paddle.add(x=short, y=conv1)
            y = F.relu(y)
            return y


    class ResNet(nn.Layer):
        def __init__(self,
                     layers=50,
                     class_dim=1000,
                     input_image_channel=3,
                     data_format="NCHW"):
            super(ResNet, self).__init__()

            self.layers = layers
            self.data_format = data_format
            self.input_image_channel = input_image_channel

            supported_layers = [18, 34, 50, 101, 152]
            assert layers in supported_layers, \
                "supported layers are {} but input layer is {}".format(
                    supported_layers, layers)

            # Number of residual blocks in each of the four stages
            if layers == 18:
                depth = [2, 2, 2, 2]
            elif layers == 34 or layers == 50:
                depth = [3, 4, 6, 3]
            elif layers == 101:
                depth = [3, 4, 23, 3]
            elif layers == 152:
                depth = [3, 8, 36, 3]
            num_channels = [64, 256, 512,
                            1024] if layers >= 50 else [64, 64, 128, 256]
            num_filters = [64, 128, 256, 512]

            self.conv = ConvBNLayer(
                num_channels=self.input_image_channel,
                num_filters=64,
                filter_size=7,
                stride=2,
                act="relu",
                name="conv1",
                data_format=self.data_format)
            self.pool2d_max = MaxPool2D(
                kernel_size=3,
                stride=2,
                padding=1,
                data_format=self.data_format)

            self.block_list = []
            if layers >= 50:
                for block in range(len(depth)):
                    shortcut = False
                    for i in range(depth[block]):
                        if layers in [101, 152] and block == 2:
                            if i == 0:
                                conv_name = "res" + str(block + 2) + "a"
                            else:
                                conv_name = "res" + str(block + 2) + "b" + str(i)
                        else:
                            conv_name = "res" + str(block + 2) + chr(97 + i)
                        bottleneck_block = self.add_sublayer(
                            conv_name,
                            BottleneckBlock(
                                num_channels=num_channels[block]
                                if i == 0 else num_filters[block] * 4,
                                num_filters=num_filters[block],
                                stride=2 if i == 0 and block != 0 else 1,
                                shortcut=shortcut,
                                name=conv_name,
                                data_format=self.data_format))
                        self.block_list.append(bottleneck_block)
                        shortcut = True
            else:
                for block in range(len(depth)):
                    shortcut = False
                    for i in range(depth[block]):
                        conv_name = "res" + str(block + 2) + chr(97 + i)
                        basic_block = self.add_sublayer(
                            conv_name,
                            BasicBlock(
                                num_channels=num_channels[block]
                                if i == 0 else num_filters[block],
                                num_filters=num_filters[block],
                                stride=2 if i == 0 and block != 0 else 1,
                                shortcut=shortcut,
                                name=conv_name,
                                data_format=self.data_format))
                        self.block_list.append(basic_block)
                        shortcut = True

            self.pool2d_avg = AdaptiveAvgPool2D(1, data_format=self.data_format)

            self.pool2d_avg_channels = num_channels[-1] * 2

            stdv = 1.0 / math.sqrt(self.pool2d_avg_channels * 1.0)

            self.out = Linear(
                self.pool2d_avg_channels,
                class_dim,
                weight_attr=ParamAttr(
                    initializer=Uniform(-stdv, stdv), name="fc_0.w_0"),
                bias_attr=ParamAttr(name="fc_0.b_0"))

        def forward(self, inputs):
            y = self.conv(inputs)
            y = self.pool2d_max(y)
            for block in self.block_list:
                y = block(y)
            y = self.pool2d_avg(y)
            y = paddle.reshape(y, shape=[-1, self.pool2d_avg_channels])
            y = self.out(y)
            return y


    def ResNet18(**args):
        model = ResNet(layers=18, **args)
        return model
2.4 Defining the training procedure
    import paddle
    import numpy
    import paddle.nn.functional as F
    import time


    def train(model):
        model.train()
        epochs = 5
        # Use Adam as the optimizer
        optim = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())
        for epoch in range(epochs):
            for batch_id, data in enumerate(train_loader()):
                x_data = data[0]
                y_data = data[1]
                # print(y_data)
                predicts = model(x_data)
                # Compute the loss
                loss = F.cross_entropy(predicts, y_data)
                acc = paddle.metric.accuracy(predicts, y_data, k=2)
                loss.backward()
                if batch_id % 10 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), acc.numpy()))
                optim.step()
                optim.clear_grad()
    import paddle
    import numpy
    import paddle.nn.functional as F
    import time


    def train_amp(model):
        model.train()
        epochs = 5
        # Use Adam as the optimizer
        optim = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())
        # Create the loss scaler once so that its state persists across batches
        scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
        for epoch in range(epochs):
            for batch_id, data in enumerate(train_loader()):
                # Inputs stay float32; auto_cast picks the precision per operator
                x_data = data[0]
                y_data = data[1]
                with paddle.amp.auto_cast():
                    predicts = model(x_data)
                    loss = F.cross_entropy(predicts, y_data)
                scaled = scaler.scale(loss)  # scale the loss
                scaled.backward()  # do backward
                acc = paddle.metric.accuracy(predicts, y_data, k=2)
                if batch_id % 10 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), acc.numpy()))
                # Unscale the gradients and update the parameters
                scaler.minimize(optim, scaled)
                optim.clear_grad()
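A fixed scale such as 1024 can be too small or too large for a given model, so loss scalers commonly adjust it during training. The toy class below sketches one such scheme in plain Python: shrink the scale and skip the update when a gradient overflows, and grow the scale after a run of finite steps. The class name, method name, and defaults are hypothetical, and the logic is a simplification for illustration, not Paddle's `GradScaler` implementation.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative only, not Paddle's implementation)."""

    def __init__(self, init_scale=1024.0, incr_ratio=2.0, decr_ratio=0.5,
                 incr_every_n_steps=1000):
        self.scale = init_scale
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self.incr_every_n_steps = incr_every_n_steps
        self._good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied for these gradients."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow: skip this step and shrink the scale.
            self.scale *= self.decr_ratio
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.incr_every_n_steps:
            # A long enough run of finite gradients: try a larger scale.
            self.scale *= self.incr_ratio
            self._good_steps = 0
        return True


if __name__ == "__main__":
    scaler = DynamicLossScaler(init_scale=1024.0, incr_every_n_steps=2)
    print(scaler.update([0.1, 0.2]), scaler.scale)
    print(scaler.update([float("inf")]), scaler.scale)
```

Skipping the update on overflow matters because a single inf or NaN gradient would otherwise corrupt the weights; losing one step is the cheaper outcome.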
2.5 Start training
This part compares the two training approaches, focusing mainly on training speed.
    model = ResNet18(class_dim=5)  # the food dataset has five classes
    start = time.time()
    train(model)
    end = time.time()
    print('no_amp:', end - start)

    epoch: 0, batch_id: 0, loss is: [0.21116894], acc is: [1.]
    epoch: 1, batch_id: 0, loss is: [0.00010776], acc is: [1.]
    epoch: 2, batch_id: 0, loss is: [2.5868081e-05], acc is: [1.]
    epoch: 3, batch_id: 0, loss is: [1.442422e-05], acc is: [1.]
    epoch: 4, batch_id: 0, loss is: [1.1086402e-05], acc is: [1.]
    no_amp: 740.6813971996307
    start1 = time.time()
    train_amp(model)
    end1 = time.time()
    print('with amp:', end1 - start1)

    epoch: 0, batch_id: 0, loss is: [0.512834], acc is: [1.]
    epoch: 1, batch_id: 0, loss is: [0.00025519], acc is: [1.]
    epoch: 2, batch_id: 0, loss is: [5.9364465e-05], acc is: [1.]
    epoch: 3, batch_id: 0, loss is: [3.2305197e-05], acc is: [1.]
    epoch: 4, batch_id: 0, loss is: [2.4556812e-05], acc is: [1.]
    with amp: 740.9603228569031
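Summary
Because this experiment ran for only 5 epochs, the speed advantage did not show up (about 740.7 s without AMP versus 741.0 s with AMP). Interested readers can increase the number of epochs or switch to a deeper network and test again.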
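When repeating the comparison, a single wall-clock measurement can be dominated by start-up costs. One possible approach, sketched below under the assumption that each training variant can be wrapped in a callable, is to discard a warm-up run and average several timed runs. The helper name `time_per_run` and its parameters are hypothetical, and the workload in the demo is a stand-in rather than real training.

```python
import time


def time_per_run(fn, *args, warmup=1, repeats=3):
    """Call fn(*args) several times and return the mean wall-clock seconds per call."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are executed but not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # Stand-in workload; in the experiment this would be one training epoch.
    def busy_work(n):
        return sum(i * i for i in range(n))

    print(f"{time_per_run(busy_work, 100_000):.6f} s per run")
```

`time.perf_counter` is used instead of `time.time` because it is a monotonic, high-resolution clock intended for measuring intervals.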
Judging from the training results, the loss with mixed precision training is higher than the loss of the model trained without it.
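This introduction to mixed precision training is not very detailed. Interested readers can read the paper closely, and I will share more once I understand these topics better.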
This article is also shared on the blog "Mowglee" (CSDN).