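PyTorch Series | How to Speed Up Your Model Training

Date: 2019-11-06

Tags: pytorch, series, how to speed up model training

Original title | Speed Up your Algorithms Part 1 — PyTorch

Author | Puneet Grover

Translator | kbsc13 (author of the WeChat public account "算法猿的成長")

Original article | towardsdatascience.com/speed-up-yo…

Notice | This translation is for the purpose of exchange and learning. Reposting is welcome, but please retain the attribution for this article, and do not use it for commercial or illegal purposes.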

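Preface

This article mainly introduces how to use cuda and pycuda to check and initialize GPU devices and make your algorithms run faster.

PyTorch is the Python version of torch. It is a deep learning framework developed and open-sourced by Facebook's AI research group, and it is currently very popular, especially among researchers; in just a few years it has shown signs of catching up with Tensorflow. This is mainly due to its simplicity and its dynamic computation graphs.

pycuda is a third-party Python library for working with Nvidia's CUDA parallel computing API.

The contents of this article are as follows:

How to check whether cuda is available?
How to get more information about cuda devices?
How to store Tensors and run models on the GPU
How to select and use GPUs when there are several
Data parallelism
Comparison of data parallelism
torch.multiprocessing

The code for this article is in a Jupyter notebook, available on Github at: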

nbviewer.jupyter.org/github/Pune…

1. How to check whether cuda is available?

The code to check whether cuda is available is very simple:

import torch
torch.cuda.is_available()
# True
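
In practice the result of this check is usually used to pick a device once and reuse it everywhere. A minimal sketch, assuming you want to fall back to the CPU when no GPU is present (the names `device` and `x` are illustrative and not from the original article):

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors created with device=... live on that device from the start.
x = torch.ones(2, 3, device=device)
print(x.device.type)  # "cuda" on a GPU machine, "cpu" otherwise
```

Code written this way runs unchanged on machines with and without a GPU.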

2. How to get more information about cuda devices?

Basic device information is available through torch.cuda, but for more detailed information you need pycuda.

The code is as follows:

import torch
import pycuda.driver as cuda
cuda.init()
# Get the id of the default device
torch.cuda.current_device()
# 0
cuda.Device(0).name() # '0' is the id of your GPU
# Tesla K80

Or, alternatively:

torch.cuda.get_device_name(0) # Get the name of the device with id '0'
# 'Tesla K80'

Here is a simple class for retrieving cuda information:

# A simple class to know about your cuda devices
import pycuda.driver as cuda
import pycuda.autoinit # Initializes the driver and creates a context; needed for the calls below

class aboutCudaDevices():
    def __init__(self):
        pass

    def num_devices(self):
        """Return the number of cuda devices."""
        return cuda.Device.count()

    def devices(self):
        """Print the names of all available devices."""
        num = cuda.Device.count()
        print("%d device(s) found:"%num)
        for i in range(num):
            print(cuda.Device(i).name(), "(Id: %d)"%i)

    def mem_info(self):
        """Print the available and total memory of the current device."""
        available, total = cuda.mem_get_info()
        print("Available: %.2f GB\nTotal: %.2f GB"%(available/1e9, total/1e9))

    def attributes(self, device_id=0):
        """Return the attributes of the device with the given id."""
        return cuda.Device(device_id).get_attributes()

    def __repr__(self):
        """Return the number of devices together with their ids and memory."""
        num = cuda.Device.count()
        string = ""
        string += ("%d device(s) found:\n"%num)
        for i in range(num):
            string += ( " %d) %s (Id: %d)\n"%((i+1),cuda.Device(i).name(),i))
            string += (" Memory: %.2f GB\n"%(cuda.Device(i).total_memory()/1e9))
        return string

# You can print output just by typing its name (__repr__):
aboutCudaDevices()
# 1 device(s) found:
# 1) Tesla K80 (Id: 0)
# Memory: 12.00 GB
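
If pycuda is not installed, a similar summary can be put together from torch.cuda alone. The sketch below is an alternative to the class above, not part of the original article; the helper name `describe_cuda_devices` and the dictionary keys are made up for illustration, while `torch.cuda.get_device_properties` is the PyTorch call doing the work:

```python
import torch

def describe_cuda_devices():
    """Return one summary dict per CUDA device visible to PyTorch."""
    summaries = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        summaries.append({
            "id": i,
            "name": props.name,
            "total_memory_gb": props.total_memory / 1e9,
            "compute_capability": (props.major, props.minor),
            "multiprocessors": props.multi_processor_count,
        })
    return summaries

devices = describe_cuda_devices()
print(f"{len(devices)} device(s) found")
for d in devices:
    print(d)
```

On a machine without a GPU the list is simply empty, so the code does not need a separate availability check.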

If you want to know the current memory usage, you can query it as follows:

import torch
# Returns the current GPU memory usage by
# tensors in bytes for a given device
torch.cuda.memory_allocated()
# Returns the current GPU memory managed by the
# caching allocator in bytes for a given device
# (this function was named torch.cuda.memory_cached() in older PyTorch releases)
torch.cuda.memory_reserved()

The code to clear the cuda cache is as follows:

# Releases all unoccupied cached memory currently held by
# the caching allocator so that those can be used in other
# GPU application and visible in nvidia-smi
torch.cuda.empty_cache()

Note, however, that this function does not free GPU memory occupied by tensors, so it does not increase the amount of GPU memory available to PyTorch.
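
To see that behaviour for yourself, here is a small sketch (the helper `cuda_memory_after_free` is a made-up name, and the tensor size is arbitrary). It records `torch.cuda.memory_allocated()` while a tensor is alive, after calling `empty_cache()` with the tensor still referenced, and after the tensor has been deleted:

```python
import torch

def cuda_memory_after_free():
    """Record allocated GPU memory around empty_cache() and del.

    Returns None when no GPU is available.
    """
    if not torch.cuda.is_available():
        return None
    x = torch.empty(1024, 1024, device="cuda")  # about 4 MB of float32
    while_alive = torch.cuda.memory_allocated()
    torch.cuda.empty_cache()                    # x is still referenced: nothing is freed
    still_alive = torch.cuda.memory_allocated()
    del x                                       # drop the last reference to the tensor
    torch.cuda.empty_cache()                    # hand the now-unused cached block back
    after_free = torch.cuda.memory_allocated()
    return while_alive, still_alive, after_free

print(cuda_memory_after_free())
```

The first two numbers should match, and only the third should be smaller: memory held by a live tensor is released when the tensor itself goes away, not when the cache is emptied.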

3. How to store Tensors and run models on the GPU

To store a variable on the cpu, you can write:

a = torch.DoubleTensor([1., 2.])

The variable a stays on the cpu, and all operations on it run on the cpu. To move it to the gpu, use .cuda; there are two ways to do this:

# Method 1
a = torch.FloatTensor([1., 2.]).cuda()
# Method 2
a = torch.cuda.FloatTensor([1., 2.])

This selects the default (first) GPU, which you can check in two ways:

# Method 1
torch.cuda.current_device()
# 0

# Method 2
a.get_device()
# 0

You can also run a model on the GPU. As an example, define a simple model with nn.Sequential:

import torch.nn as nn

sq = nn.Sequential(
         nn.Linear(20, 20),
         nn.ReLU(),
         nn.Linear(20, 4),
         nn.Softmax(dim=1)  # normalize over the class dimension
)

Then move it to the GPU:

model = sq.cuda()

To tell whether the model is running on the GPU, check whether its parameters are on the GPU:

# From the discussion at https://discuss.pytorch.org/t/how-to-check-if-model-is-on-cuda/180

next(model.parameters()).is_cuda
# True
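
Putting the pieces of this section together, the sketch below moves both the model and a batch of inputs to the same device and runs a forward pass. It uses `.to(device)` rather than `.cuda()` so that it also runs on a CPU-only machine; the batch size of 8 and the variable names are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Linear(20, 20),
    nn.ReLU(),
    nn.Linear(20, 4),
    nn.Softmax(dim=1),
).to(device)                                 # moves all parameters to `device`

batch = torch.randn(8, 20, device=device)    # inputs must be on the same device
with torch.no_grad():
    probs = model(batch)

print(probs.shape, probs.device)
```

If the model and its inputs end up on different devices, PyTorch raises an error at the first layer, so it is worth creating the inputs directly on the target device as done here.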

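4. How to select and use GPUs when there are several

Suppose there are 3 GPUs. We can initialize tensors on, and assign them to, any specific one of them. There are 3 ways to place a tensor on a given GPU:

Pass the device argument when creating the tensor
.to(cuda_id)
.cuda(cuda_id)

cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')
cuda2 = torch.device('cuda:2')

# Calling .cuda() with no argument places the tensor on the current device, cuda:0 by default
# The 3 approaches:
x = torch.tensor([1., 2.], device=cuda1)  # torch.tensor (lowercase) accepts a cuda device; the torch.Tensor constructor does not
# Or
x = torch.Tensor([1., 2.]).to(cuda1)
# Or
x = torch.Tensor([1., 2.]).cuda(cuda1)

# Change the default device by passing the id of the device you want
torch.cuda.set_device(2)
# The CUDA_VISIBLE_DEVICES environment variable controls how many and which GPUs are used.
# Set it before CUDA is initialized in the process, otherwise it has no effect.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"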
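
Hard-coding `cuda:1` or `cuda:2` fails on a machine with fewer GPUs. One way around this, sketched below under the assumption that falling back to the CPU is acceptable, is a small helper (`pick_device` is a made-up name) that checks `torch.cuda.device_count()` first. The sketch also shows `torch.cuda.device`, a context manager that changes the current GPU only inside the `with` block:

```python
import torch

def pick_device(index):
    """Return cuda:<index> if that GPU exists, otherwise the CPU."""
    if torch.cuda.is_available() and index < torch.cuda.device_count():
        return torch.device(f"cuda:{index}")
    return torch.device("cpu")

target = pick_device(1)
x = torch.tensor([1., 2.], device=target)

# Inside the `with` block the current GPU is `target`, so a bare .cuda()
# lands there instead of on cuda:0. The block is skipped without a GPU.
if target.type == "cuda":
    with torch.cuda.device(target):
        y = torch.tensor([3., 4.]).cuda()
else:
    y = torch.tensor([3., 4.])

print(x.device, y.device)
```

Unlike `torch.cuda.set_device`, the context manager restores the previous current device when the block ends.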

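When you have multiple GPUs, you can split the application's work among them, at the cost of communication between them. If they do not need to exchange information often, this cost can be ignored.

There is actually another issue. In PyTorch, all GPU operations are asynchronous by default. Copying data between the CPU and a GPU, or between two GPUs, does perform the necessary synchronization, but when you create your own stream with torch.cuda.Stream(), you must take care of synchronization yourself.

Below is an example of incorrect usage from the official documentation:

cuda = torch.device('cuda')
# Create a stream
s = torch.cuda.Stream()
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)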
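
One way to repair the example above is to make the new stream wait for the work already queued on the current stream before it starts. The sketch below is not from the original article; it wraps the logic in a hypothetical helper, `sum_on_side_stream`, and falls back to a plain CPU computation when there is no GPU, since streams only exist on CUDA devices:

```python
import torch

def sum_on_side_stream(n=100):
    """Sum a random matrix on a separate stream with explicit synchronization."""
    if not torch.cuda.is_available():
        # No GPU: there are no streams to synchronize, so compute on the CPU.
        return torch.sum(torch.empty((n, n)).normal_(0.0, 1.0))

    cuda = torch.device("cuda")
    s = torch.cuda.Stream()
    A = torch.empty((n, n), device=cuda).normal_(0.0, 1.0)
    # Make the side stream wait for work already queued on the current stream,
    # so that normal_() has finished before sum() starts.
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        B = torch.sum(A)
    # Tell the caching allocator that A is also in use on stream s.
    A.record_stream(s)
    # Make the current stream wait for the side stream before B is used on it.
    torch.cuda.current_stream().wait_stream(s)
    return B

B = sum_on_side_stream()
print(B.item())
```

Calling `torch.cuda.synchronize()` before entering the stream would also work, but it blocks the CPU until the whole device is idle, while `wait_stream` only orders the two streams relative to each other.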

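To make full use of multiple GPUs, you should:

Use the GPUs for different tasks or applications;
With multiple models, run one model on each GPU, each with its own copy of the fully preprocessed data;
Give each GPU a slice of the input and a copy of the model; each GPU computes its result separately, and all results are sent to a single GPU for further computation.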

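5. Data parallelism

Data parallelism requires splitting the data into several parts and sending them to multiple GPUs to be processed in parallel.

In PyTorch, data parallelism is implemented with torch.nn.DataParallel.

Below is a simple example. The first way to implement data parallelism is to use several functions from nn.parallel, which do the following:

Replicate: copy the model onto multiple GPUs;
Scatter: split the input data along its first dimension (usually the batch dimension) and send the parts to multiple GPUs;
Gather: concatenate the data sent back from the GPUs;
parallel_apply: apply the scattered inputs from the Scatter step to the model copies from the Replicate step.

The code is as follows: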

# Replicate module to devices in device_ids
replicas = nn.parallel.replicate(module, device_ids)
# Distribute input to devices in device_ids
inputs = nn.parallel.scatter(input, device_ids)
# Apply the models to corresponding inputs
outputs = nn.parallel.parallel_apply(replicas, inputs)
# Gather result from all devices to output_device
result = nn.parallel.gather(outputs, output_device)
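
The nn.parallel calls above need several GPUs to run. To make the data flow concrete, the sketch below mimics scatter and gather on a single device with `torch.chunk` and `torch.cat`. This is a stand-in for illustration, not what nn.parallel does internally: there are no real replicas here, and the chunks are processed one after another in a loop. The module, batch size, and `num_devices` value are all arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
module = nn.Linear(20, 4)
batch = torch.randn(8, 20)

# "Scatter": split the batch along dim 0 into one chunk per (pretend) device.
num_devices = 2
chunks = torch.chunk(batch, num_devices, dim=0)

# "Replicate" + "parallel_apply": the same module is simply reused for each
# chunk; nn.parallel would run real copies on separate GPUs at the same time.
with torch.no_grad():
    partial_outputs = [module(chunk) for chunk in chunks]
    # "Gather": concatenate the per-chunk results back along dim 0.
    result = torch.cat(partial_outputs, dim=0)
    reference = module(batch)

print(result.shape)                          # torch.Size([8, 4])
print(torch.allclose(result, reference, atol=1e-5))
```

Because every sample is processed independently by this model, splitting the batch and concatenating the outputs gives the same result as one pass over the full batch, which is what makes data parallelism possible.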

In fact, there is a simpler and more commonly used approach, in which wrapping the model takes just one line of code:

model = nn.DataParallel(model, device_ids=device_ids)
result = model(input)
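
A slightly fuller sketch of the same idea, assuming you want the script to keep working on machines with one GPU or none. The model and tensor sizes are placeholders; when `device_ids` is not given, `nn.DataParallel` uses all visible GPUs:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(20, 4)

# Wrap only when more than one GPU is visible; otherwise use the plain module.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # device_ids defaults to all visible GPUs
model = model.to(device)

inputs = torch.randn(16, 20, device=device)
outputs = model(inputs)              # with DataParallel, gathered on device_ids[0]
print(outputs.shape)                 # torch.Size([16, 4])
```

The batch of 16 is split along its first dimension across the GPUs, so a batch size that divides evenly by the number of GPUs keeps the load balanced.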

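6. Comparison of data parallelism

The article medium.com/@iliakarman… and the Github repository github.com/ilkarman/De… compare the speed of different frameworks on a single GPU and on 4 GPUs.

The comparison shows that, despite the communication between GPUs, data parallelism still gives a clear speedup. PyTorch's speed is second only to Chainer's, but its data parallelism is very simple to use: one line of code is enough.

7. torch.multiprocessing

torch.multiprocessing is a wrapper around Python's multiprocessing module and is 100% compatible with the original module, so you can use the original module's Queue, Pipe, Array, and so on. In addition, to speed things up, it adds a new method, share_memory_(), which puts data into a special state in which any process can use it directly without making a copy.

With this method you can share Tensors and model parameters, and you can share them on the CPU or the GPU.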
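
A quick way to see what share_memory_() does is to check `is_shared()` before and after the call. The sketch below covers only the CPU case, and the tensor and model sizes are arbitrary:

```python
import torch
import torch.nn as nn

t = torch.zeros(3)
print(t.is_shared())        # False: an ordinary CPU tensor
t.share_memory_()           # moves the storage to shared memory, in place
print(t.is_shared())        # True

model = nn.Linear(4, 2)
model.share_memory()        # calls share_memory_() on every parameter and buffer
print(all(p.is_shared() for p in model.parameters()))
```

Note the naming: tensors use `share_memory_()` with a trailing underscore, following the PyTorch convention for in-place operations, while modules use `share_memory()`.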

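Below is an example of training a model with multiple processes:

# Training a model using multiple processes:
import torch
import torch.nn as nn
import torch.multiprocessing as mp

n_in, n_h1, n_out = 10, 16, 2  # placeholder sizes; not specified in the original

def train(model):
    # Placeholder optimizer, loss and random data; substitute your own
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    data_loader = [(torch.randn(8, n_in), torch.randn(8, n_out)) for _ in range(10)]
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters

if __name__ == '__main__':  # Guard needed when processes are started with 'spawn'
    model = nn.Sequential(nn.Linear(n_in, n_h1),
                          nn.ReLU(),
                          nn.Linear(n_h1, n_out))
    model.share_memory() # Required for 'fork' method to work
    processes = []
    for i in range(4): # No. of processes
        p = mp.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()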