ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
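This error shows up when training code runs inside a Docker container on a server and the batch size is set too large: the DataLoader workers run out of shared memory, because Docker limits shm (the default /dev/shm in a container is only 64 MB).
According to the PyTorch README: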
Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.
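1. The README passage shows that PyTorch's IPC relies on shared memory, so the shared memory segment must be large enough. Raise the limit when starting the container with docker run --shm-size (for example, --shm-size=8g).
2. Start the container with --ipc=host, so that it uses the host's shared memory instead of its own limited segment.
3. Set the DataLoader's num_workers to 0. This avoids inter-process sharing entirely, but training will be slower because data loading then runs in the main process.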
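
Before choosing among the three fixes, it can help to check how much shared memory the container actually has. The following is a minimal sketch in standard-library Python, not part of PyTorch: the helper names shm_free_bytes and pick_num_workers, the batch_bytes estimate, and the capping heuristic are all assumptions made for illustration. It assumes each worker keeps roughly prefetch_factor batches in /dev/shm at a time, which is only a rough approximation.

```python
import os
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Return free bytes on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def pick_num_workers(requested, batch_bytes, prefetch_factor=2, path="/dev/shm"):
    """Hypothetical heuristic: cap the worker count so that the batches
    prefetched by all workers roughly fit in the free shared memory."""
    free = shm_free_bytes(path)
    if free is None:
        # Cannot measure shared memory here; leave the setting unchanged.
        return requested
    per_worker = batch_bytes * prefetch_factor
    return max(0, min(requested, free // per_worker))


if __name__ == "__main__":
    # Assumed example: batches of about 256 MB each, 8 workers requested.
    print("free shm bytes:", shm_free_bytes())
    print("suggested num_workers:", pick_num_workers(8, 256 * 1024 ** 2))
```

The suggested value can then be passed as num_workers to torch.utils.data.DataLoader. If it comes out as 0 or much lower than requested, raising --shm-size or using --ipc=host is probably the better fix, since it keeps data loading parallel.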