Fixing PyTorch multi-GPU hangs on AMD CPUs

DataParallel not working with NVIDIA GPUs on AMD CPUs

 
 
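Problem:
 
When running on multiple GPUs, the network hangs and never makes progress.
The system has an AMD Ryzen 5 1600X and two TITAN Xp cards.
Previously, with a 2070 plus a TITAN Xp, multi-GPU runs worked; the only difference was that the two cards had different amounts of memory...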
 
Checking the logs, the dmesg output was full of the following errors:

[1118468.873266] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea13a000 flags=0x0020]
[1118468.942145] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea139068 flags=0x0020]
[1118468.942189] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000040 flags=0x0020]
[1118468.942227] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00007c0 flags=0x0020]
[1118468.942265] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001040 flags=0x0020]
[1118468.942303] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000f40 flags=0x0020]
[1118468.942340] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00016c0 flags=0x0020]
[1118468.942377] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0002040 flags=0x0020]
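
To check whether a machine is hitting the same fault, the dmesg output can be scanned for these events. The following is only a rough sketch: the function name count_io_page_faults is made up for illustration, and the pattern is modeled on the log lines quoted above, so other kernel versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modeled on the log lines quoted in this post (an assumption:
# other kernels or drivers may word the message differently).
FAULT_RE = re.compile(
    r"nvidia (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT .*?address=(?P<addr>0x[0-9a-f]+)"
)


def count_io_page_faults(dmesg_text):
    """Count AMD-Vi IO_PAGE_FAULT events per PCI device address.

    Hypothetical helper: takes the text of a dmesg dump and returns a
    Counter mapping e.g. '0000:0a:00.0' to the number of matching lines.
    """
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts
```

The text to pass in can come from a saved log, for example the output of dmesg redirected to a file. A non-empty result for the GPU's PCI address suggests the same symptom as described here.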

 

After some searching, this appears to be a bug . . .
 
Temporary workaround:
 
Edit /etc/default/grub:
 
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX="iommu=soft" # note: this is the line to change ...
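
For anyone who would rather script the change than edit the file by hand, here is a minimal sketch of the same edit as a pure text transformation. The helper name add_kernel_param is hypothetical, it assumes the GRUB_CMDLINE_LINUX value is double-quoted as in the listing above, and it only returns the new text; writing it back to /etc/default/grub (with a backup, as root) is left to the reader.

```python
import re


def add_kernel_param(grub_text, param="iommu=soft"):
    """Return grub_text with `param` added to GRUB_CMDLINE_LINUX.

    Hypothetical helper. Existing options on that line are kept, the
    parameter is not duplicated, and GRUB_CMDLINE_LINUX_DEFAULT is not
    touched because the pattern requires '=' right after the name.
    """
    def patch(match):
        options = match.group(2).split()
        if param not in options:
            options.append(param)
        return match.group(1) + '"' + " ".join(options) + '"'

    return re.sub(
        r'^(GRUB_CMDLINE_LINUX=)"([^"]*)"',
        patch,
        grub_text,
        flags=re.MULTILINE,
    )
```

Running the function twice on the same text gives the same result, so it is safe to apply to a file that has already been edited.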

 

Then run
sudo update-grub
and finally reboot.
 
After that, multi-GPU runs work normally.
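
To confirm after the reboot that the kernel was actually started with the new option, the running kernel's command line can be read from /proc/cmdline. This is a small sketch under that assumption; the function names has_kernel_param and running_with_soft_iommu are made up for illustration.

```python
def has_kernel_param(cmdline, param="iommu=soft"):
    """Return True if `param` appears as a whole word in a kernel command line."""
    return param in cmdline.split()


def running_with_soft_iommu(path="/proc/cmdline"):
    """Check the command line of the currently running Linux kernel."""
    with open(path) as f:
        return has_kernel_param(f.read())
```

If running_with_soft_iommu() returns False after rebooting, the GRUB configuration was probably not regenerated, or a different boot entry was used.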