apex in practice

Date: 2019-11-30

Tags: apex, practice


apex is a module open-sourced by NVIDIA for mixed precision training under the PyTorch framework. It makes FP16 training straightforward to set up.

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.

Its API documentation is at nvidia.github.io/apex

Installation pitfalls

I ran into several problems while building and installing apex, and resolved them by searching the project's issues.

Segmentation fault at runtime

Try gcc 5, which can be installed with conda install -c psi4 gcc-5. See github.com/NVIDIA/apex…

If you hit "GLIBCXX_3.4.20' not found"

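Try locating path_to_anaconda3/lib/libstdc++.so.6 and symlinking it into the directory that apex loads the library from, or add that lib directory to the library search path yourself.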
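Before symlinking, it can help to confirm that the Anaconda copy of libstdc++ actually provides the missing version. The following is a minimal sketch, not something from the original issue: the helper names glibcxx_versions and provides_glibcxx are hypothetical, and it simply scans the binary for embedded GLIBCXX_x.y.z tags, roughly what strings libstdc++.so.6 | grep GLIBCXX would show.

```python
import re
from pathlib import Path


def glibcxx_versions(lib_path):
    """Return the sorted GLIBCXX version tags embedded in a libstdc++ binary."""
    data = Path(lib_path).read_bytes()
    # Version tags are stored as plain ASCII strings such as GLIBCXX_3.4.20.
    tags = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", data))
    # Sort numerically so that 3.4.9 comes before 3.4.20.
    return sorted(
        (tag.decode("ascii") for tag in tags),
        key=lambda v: tuple(int(part) for part in v.split(".")),
    )


def provides_glibcxx(lib_path, required="3.4.20"):
    """True if the library at lib_path advertises the required GLIBCXX version."""
    return required in glibcxx_versions(lib_path)
```

For example, provides_glibcxx("path_to_anaconda3/lib/libstdc++.so.6") should return True for a library worth linking to, where path_to_anaconda3 is the same placeholder as above. If it returns False, linking that file will not fix the error.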

If you hit errors related to FusedLayerNorm

This is probably because the CUDA extension was not installed. The suggestion from the issue is:

"Try a full pip uninstall apex, then cd apex_repo_dir; rm -rf build; python setup.py install --cuda_ext --cpp_ext and see if the segfault persists."

See https://github.com/huggingface/pytorch-pretrained-BERT/issues/284

Usage pitfalls

AttributeError: 'NoneType' object has no attribute 'contiguous'

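The model contains unused layers (weights) (example: github.com/FDecaYed/py…). After the backward pass the gradients of these weights are None, which raises "AttributeError: 'NoneType' object has no attribute 'contiguous'". See https://github.com/NVIDIA/apex/issues/131

Solutions: 1. patch the apex source so that it checks whether a gradient is None; 2. change the model and remove the unused weights. The second approach is the better one. Otherwise, wait for an apex update.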
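To apply the second solution you first need to know which weights are unused. Here is a minimal sketch of one way to list them. The helper name params_without_grad is hypothetical and it is not part of apex. It assumes you pass it (name, parameter) pairs such as those from model.named_parameters(), after one ordinary backward pass.

```python
def params_without_grad(named_params):
    """Return the names of trainable parameters whose .grad is None.

    named_params is an iterable of (name, param) pairs, for example
    model.named_parameters() after loss.backward() has run.
    """
    return [
        name
        for name, param in named_params
        # Frozen parameters legitimately have no gradient, so skip them.
        if getattr(param, "requires_grad", True) and param.grad is None
    ]
```

Running one forward and backward pass in plain FP32 and then calling params_without_grad(model.named_parameters()) should print the layers that never contribute to the loss. Those are the candidates to remove from the model.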

p.type().is_cuda() ASSERT FAILED at csrc/fused_adam_cuda.cpp:12

This one was my own mistake: model.cuda() must be called before FusedAdam is constructed, otherwise this error is raised.
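A small pre-flight check can turn the C++ assertion into a readable Python error. This is only a sketch under my own assumptions: the two helper names are hypothetical, they are not part of apex, and they rely only on each parameter having the is_cuda attribute that PyTorch tensors provide.

```python
def params_not_on_cuda(named_params):
    """Return the names of parameters that are not CUDA tensors."""
    return [name for name, param in named_params if not param.is_cuda]


def check_ready_for_fused_optimizer(named_params):
    """Raise early, naming the parameters, instead of hitting the C++ assert."""
    offenders = params_not_on_cuda(named_params)
    if offenders:
        raise RuntimeError(
            "call model.cuda() before constructing the optimizer; "
            "parameters still on CPU: " + ", ".join(offenders)
        )
```

The intended order is: call model.cuda(), then check_ready_for_fused_optimizer(model.named_parameters()), and only then construct FusedAdam from model.parameters().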

cuda runtime error (77) : an illegal memory access

I resolved this problem using the approach from this issue: github.com/NVIDIA/apex… I have not yet found the root cause.