1. After installing the GPU in the host, check the device:
$ nvidia-smi
Tue Dec  5 10:36:43 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
|  0%   34C    P8     8W / 200W |    284MiB /  8112MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1008    G   /usr/lib/xorg/Xorg                             117MiB |
|    0      1614    G   compiz                                         155MiB |
|    0      1886    G   fcitx-qimpanel                                   9MiB |
+-----------------------------------------------------------------------------+
The system has detected the GeForce GTX 1080.
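To read the GPU name and driver version from a script instead of the full table, nvidia-smi's query mode can be used. This is a minimal sketch, assuming Python 3.5+ and an nvidia-smi that accepts `--query-gpu` with `--format=csv,noheader`; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import csv
import io
import subprocess

# Field names accepted by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
FIELDS = ("index", "name", "driver_version")


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        rows.append({key: value.strip() for key, value in zip(FIELDS, record)})
    return rows


def query_gpus():
    """Run nvidia-smi on this host and return the parsed rows."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout
    return parse_gpu_csv(out)
```

On the machine above, `query_gpus()` would be expected to return a single entry naming the GeForce GTX 1080 with driver 375.66.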
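This machine previously had a GTX 1060, and the output above shows that its driver, NVIDIA 375.66, is still installed. For the GTX 1080, the corresponding driver to install is NVIDIA 367.27. First add the graphics-drivers PPA: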
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt-get update
When prompted with Y/n along the way, just press Enter to continue.
Then install the nvidia-367 driver:
$ sudo apt-get install nvidia-367
At this step, the installation fails because of a conflict with the previous nvidia-375 driver:
Building initial module for 4.10.0-32-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-384.0.crash'
Error! Bad return status for module build on kernel: 4.10.0-32-generic (x86_64)
Consult /var/lib/dkms/nvidia-384/384.98/build/make.log for more information.
dpkg: error processing package nvidia-384 (--configure):
 subprocess installed post-installation script returned error exit status 10
dpkg: dependency problems prevent configuration of libcuda1-384:
 libcuda1-384 depends on nvidia-384 (>= 384.98); however:
  Package nvidia-384 is not configured yet.
dpkg: error processing package libcuda1-384 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of nvidia-367:
 nvidia-367 depends on nvidia-384; however:
  Package nvidia-384 is not configured yet.
dpkg: error processing package nvidia-367 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of nvidia-opencl-icd-384:
 nvidia-opencl-icd-384 depends on nvidia-384 (>= 384.98); however:
  Package nvidia-384 is not configured yet.
dpkg: error processing package nvidia-opencl-icd-384 (--configure):
 dependency problems - leaving unconfigured
Setting up nvidia-prime (0.8.2) ...
No apport report written because the error message indicates its a followup error from a previous failure.
No apport report written because the error message indicates its a followup error from a previous failure.
No apport report written because MaxReports is reached already
Processing triggers for libc-bin (2.23-0ubuntu9) ...
Processing triggers for initramfs-tools (0.122ubuntu8.8) ...
update-initramfs: Generating /boot/initrd.img-4.10.0-32-generic
Errors were encountered while processing:
 nvidia-384
 nvidia-375
 libcuda1-384
 libcuda1-375
 nvidia-367
 nvidia-opencl-icd-384
 nvidia-opencl-icd-375
E: Sub-process /usr/bin/dpkg returned an error code (1)
To fix this, first remove the old driver:
$ sudo apt-get remove --purge nvidia-375
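The failure above comes from several driver branches (375, 384 and the 367 package) being present at once. A small sketch for listing which `nvidia-<branch>` packages dpkg still has on disk, assuming Debian/Ubuntu's `dpkg-query` with the `${db:Status-Abbrev}` format field; `driver_branches` and `installed_driver_branches` are hypothetical names. It only looks at package names, so nvidia-367 shows up under its own number even though, per the log, it depends on nvidia-384.

```python
import re
import subprocess


def driver_branches(dpkg_output):
    """Return sorted branch numbers from '${db:Status-Abbrev} ${Package}' lines."""
    branches = set()
    for line in dpkg_output.splitlines():
        parts = line.split()
        if len(parts) < 2 or len(parts[0]) < 2:
            continue
        status, package = parts[0], parts[1]
        if status[1] in "nc":
            continue  # not installed, or only config files left behind
        match = re.fullmatch(r"nvidia-(\d+)", package)  # skips nvidia-prime, nvidia-384-dev, ...
        if match:
            branches.add(int(match.group(1)))
    return sorted(branches)


def installed_driver_branches():
    """Ask dpkg about nvidia-* packages on this host (Debian/Ubuntu only)."""
    out = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${db:Status-Abbrev} ${Package}\n", "nvidia-*"],
        stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout
    return driver_branches(out)
```

If `installed_driver_branches()` still lists more than one branch after the purge, the leftover packages are candidates for removal before retrying the install.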
Then check the log to see why the kernel module build failed:
$ vim /var/lib/dkms/nvidia-384/384.98/build/make.log
......
CONFTEST: drm_atomic_available
CONFTEST: drm_atomic_modeset_nonblocking_commit_available
CONFTEST: is_export_symbol_gpl_refcount_inc
CONFTEST: is_export_symbol_gpl_refcount_dec_and_test
  CC [M]  /var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-instance.o
  CC [M]  /var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-gpu-numa.o
cc: error: unrecognized command line option ‘-fstack-protector-strong’
scripts/Makefile.build:294: recipe for target '/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-instance.o' failed
make[2]: *** [/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-instance.o] Error 1
make[2]: *** Waiting for unfinished jobs....
  CC [M]  /var/lib/dkms/nvidia-384/384.98/build/nvidia/nv.o
  CC [M]  /var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-frontend.o
cc: error: unrecognized command line option ‘-fstack-protector-strong’
scripts/Makefile.build:294: recipe for target '/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-gpu-numa.o' failed
make[2]: *** [/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-gpu-numa.o] Error 1
cc: error: unrecognized command line option ‘-fstack-protector-strong’
cc: error: unrecognized command line option ‘-fstack-protector-strong’
scripts/Makefile.build:294: recipe for target '/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-frontend.o' failed
make[2]: *** [/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-frontend.o] Error 1
scripts/Makefile.build:294: recipe for target '/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv.o' failed
make[2]: *** [/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv.o] Error 1
Makefile:1524: recipe for target '_module_/var/lib/dkms/nvidia-384/384.98/build' failed
make[1]: *** [_module_/var/lib/dkms/nvidia-384/384.98/build] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-4.10.0-32-generic'
Makefile:81: recipe for target 'modules' failed
make: *** [modules] Error 2
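In a long DKMS make.log the decisive line is easy to miss. A minimal sketch that collects every flag the compiler rejected, assuming GCC's "unrecognized command line option" wording (older GCC) or "unrecognized command-line option" (newer GCC), with typographic or ASCII quotes around the flag; `unrecognized_options` is a hypothetical name.

```python
import re

# Matches e.g.  cc: error: unrecognized command line option ‘-fstack-protector-strong’
PATTERN = re.compile(r"unrecognized command[- ]line option\s+\W?(-[\w=+.,-]+)")


def unrecognized_options(log_text):
    """Return each rejected compiler flag once, in first-seen order."""
    seen = []
    for flag in PATTERN.findall(log_text):
        if flag not in seen:
            seen.append(flag)
    return seen
```

For example, `unrecognized_options(open("/var/lib/dkms/nvidia-384/384.98/build/make.log", encoding="utf-8", errors="replace").read())` on the log above should reduce the four error lines to the single flag `-fstack-protector-strong`.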
A quick search shows that the '-fstack-protector-strong' option was only added in GCC 4.9, so GCC 4.9 or later is needed for the module to compile.
Running gcc -v shows that the machine has GCC 4.8, which confirms the compiler version is the problem, so upgrade GCC to 4.9:
$ sudo apt-get install gcc-4.9
$ cd /usr/bin/
$ sudo ln -sf /usr/bin/gcc-4.9 /usr/bin/gcc
$ gcc -v
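To confirm the switch took effect, each compiler driver can be asked for its version and compared against 4.9, the first release with -fstack-protector-strong. Since the make.log above shows the module being compiled by `cc` rather than `gcc`, this sketch checks both names. It assumes Python 3.5+ and a compiler that supports `-dumpversion`; `version_tuple`, `supports_stack_protector_strong` and `compiler_version` are hypothetical helpers.

```python
import subprocess


def version_tuple(text):
    """Turn '4.8.4' (or just '7', as newer GCCs print) into a comparable tuple of ints."""
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def supports_stack_protector_strong(version):
    # -fstack-protector-strong was introduced in GCC 4.9.
    return version_tuple(version) >= (4, 9)


def compiler_version(compiler):
    """Return the bare version string printed by `<compiler> -dumpversion`."""
    return subprocess.run(
        [compiler, "-dumpversion"],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout.strip()


if __name__ == "__main__":
    for name in ("gcc", "cc"):
        try:
            version = compiler_version(name)
        except (OSError, subprocess.CalledProcessError) as exc:
            print("{}: could not run ({})".format(name, exc))
            continue
        print("{}: {} -> -fstack-protector-strong supported: {}".format(
            name, version, supports_stack_protector_strong(version)))
```

If `gcc` reports 4.9 but `cc` still reports 4.8, the `cc` link (on Ubuntu usually managed through /etc/alternatives) would still need attention.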
Then continue with the driver installation:
$ sudo apt-get install nvidia-367
$ sudo apt-get install mesa-common-dev
$ sudo apt-get install freeglut3-dev
Then reboot so that the GTX 1080 driver takes effect.
2. Downloading and installing CUDA 8 (which supports the GTX 1080)
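(CUDA was already installed on this machine, so I go straight to testing here. When I have time, I will set up a machine from scratch, work through the pitfalls, and update this section.)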
3. Testing
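nvidia-smi shows that the driver is now nvidia-384. (Some people see nvidia-367 instead. The version shown here differs, but the installation log shows that nvidia-367 depends on nvidia-384, and the tests and later use work fine, so this makes no difference.)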
$ nvidia-smi
Tue Dec  5 15:27:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
| 33%   62C    P2   139W / 200W |   7898MiB /  8112MiB |     57%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1008    G   /usr/lib/xorg/Xorg                             188MiB |
|    0      1508    G   compiz                                         110MiB |
|    0      4491    C   python                                        7587MiB |
+-----------------------------------------------------------------------------+
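Because the package that was installed (nvidia-367) and the version nvidia-smi reports (384.98) differ, it can be clearer to read the version of the kernel module that is actually loaded. A minimal sketch, assuming the loaded NVIDIA module exposes /proc/driver/nvidia/version with the usual "NVRM version: ... Kernel Module  <version>" line; `loaded_module_version` and `read_loaded_module_version` are hypothetical helpers.

```python
import re

# The first line typically reads "NVRM version: NVIDIA UNIX x86_64 Kernel Module  <version>  <date>".
NVRM_PATTERN = re.compile(r"Kernel Module\s+(\d+(?:\.\d+)+)")


def loaded_module_version(text):
    """Extract the driver version from the contents of /proc/driver/nvidia/version."""
    match = NVRM_PATTERN.search(text)
    return match.group(1) if match else None


def read_loaded_module_version(path="/proc/driver/nvidia/version"):
    """Return the loaded module's version, or None if the file is unavailable."""
    try:
        with open(path) as handle:
            return loaded_module_version(handle.read())
    except OSError:
        return None  # module not loaded, or not a Linux host with the NVIDIA driver
```

On the machine above this would be expected to agree with the 384.98 shown by nvidia-smi.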
Sample test 1:
$ cd NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          9.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8113 MBytes (8506769408 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1848 MHz (1.85 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
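Two summary figures in this output can be cross-checked: 20 multiprocessors × 128 cores/MP = 2560 CUDA cores, and the CUDA runtime needs a driver whose supported CUDA version is at least the runtime's, which holds here (9.0 vs 8.0). A sketch that extracts both from captured deviceQuery output, assuming the line formats shown above; `core_count` and `driver_covers_runtime` are hypothetical names.

```python
import re


def core_count(text):
    """Return (SMs, cores per SM, total cores) from deviceQuery output."""
    match = re.search(r"\((\d+)\) Multiprocessors, \((\d+)\) CUDA Cores/MP", text)
    if match is None:
        raise ValueError("Multiprocessors line not found")
    sms, per_sm = int(match.group(1)), int(match.group(2))
    return sms, per_sm, sms * per_sm


def driver_covers_runtime(text):
    """True when the driver's CUDA version is >= the runtime version reported."""
    match = re.search(
        r"CUDA Driver Version / Runtime Version\s+(\d+)\.(\d+) / (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("version line not found")
    driver = (int(match.group(1)), int(match.group(2)))
    runtime = (int(match.group(3)), int(match.group(4)))
    return driver >= runtime
```

Fed the output above, `core_count` should give (20, 128, 2560), matching the reported 2560 CUDA Cores, and `driver_covers_runtime` should return True.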
Sample test 2:
$ cd NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
$ make
$ ./nbody -benchmark -numbodies=256000 -device=0
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
    -fullscreen       (run n-body simulation in fullscreen mode)
    -fp64             (use double precision floating point values for simulation)
    -hostmem          (stores simulation data in host memory)
    -benchmark        (run benchmark to measure performance)
    -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
    -device=<d>       (where d=0,1,2.... for the CUDA device to use)
    -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
    -compare          (compares simulation results running once on the default GPU and once on the CPU)
    -cpu              (run n-body simulation on the CPU)
    -tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "GeForce GTX 1080
> Compute 6.1 CUDA device: [GeForce GTX 1080]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 2981.761 ms
= 219.790 billion interactions per second
= 4395.792 single-precision GFLOP/s at 20 flops per interaction
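The benchmark's two throughput lines follow from simple arithmetic: each iteration computes all N² body-body interactions, so 256000² × 10 interactions in 2.981761 s is about 219.79 billion interactions per second, and at 20 flops per interaction about 4395.79 GFLOP/s. The sketch below reproduces that calculation; `nbody_rates` is a hypothetical helper, and the N² count is an assumption that happens to match the printed numbers.

```python
def nbody_rates(num_bodies, iterations, total_ms, flops_per_interaction=20):
    """Return (billion interactions per second, GFLOP/s) for an all-pairs n-body run."""
    interactions = num_bodies * num_bodies * iterations  # all pairs, every iteration
    per_second = interactions / (total_ms / 1000.0)
    billions = per_second / 1e9
    return billions, billions * flops_per_interaction
```

`nbody_rates(256000, 10, 2981.761)` agrees with the benchmark's 219.790 and 4395.792 to three decimal places, which is a handy sanity check when comparing runs with a different `-numbodies`.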
4. Checking GPU status
Just run the nvidia-smi command.
To refresh the display periodically, for example to show the GPU status every 10 s:
$ watch -n 10 nvidia-smi
The most important values are the temperature, memory usage, and GPU utilization, marked with red boxes in the figure below.
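`watch` reruns the full table each time. If only the three values mentioned above are needed, for example for logging, nvidia-smi's query mode can return them as CSV. A minimal polling sketch, assuming an nvidia-smi that accepts `--query-gpu` with the fields `temperature.gpu`, `memory.used`, `memory.total`, `utilization.gpu` and `--format=csv,noheader,nounits`; `parse_status` and `poll` are hypothetical names, and a GPU that reports `[Not Supported]` for a field would need extra handling.

```python
import csv
import io
import subprocess
import time

QUERY_FIELDS = ("index", "temperature.gpu", "memory.used", "memory.total", "utilization.gpu")
KEYS = ("index", "temp_c", "mem_used_mib", "mem_total_mib", "util_pct")


def parse_status(text):
    """Turn `csv,noheader,nounits` output into one dict of ints per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if record:  # skip blank lines
            rows.append({key: int(value.strip()) for key, value in zip(KEYS, record)})
    return rows


def poll(interval_s=10, count=None):
    """Print one status line per GPU every `interval_s` seconds (forever if count is None)."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    taken = 0
    while count is None or taken < count:
        if taken:
            time.sleep(interval_s)
        out = subprocess.run(cmd, check=True, stdout=subprocess.PIPE,
                             universal_newlines=True).stdout
        for gpu in parse_status(out):
            print("GPU {index}: {temp_c} C, {mem_used_mib}/{mem_total_mib} MiB, "
                  "{util_pct}% util".format(**gpu))
        taken += 1
```

`poll(interval_s=10)` mirrors `watch -n 10 nvidia-smi` but prints only those fields; `poll(count=3)` stops after three samples.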
======================================================================================
Addendum, 2018.2.3
A problem I ran into recently after installing a GTX 1060 in another server:
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
Cause: the machine had CUDA 8.0 and the matching cuDNN for CUDA 8.0 installed, but tensorflow-gpu 1.5 requires CUDA 9.0.
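The ImportError means the dynamic loader could not find a CUDA 9.0 cuBLAS library. Whether a particular library version is visible can be checked from Python before installing TensorFlow. This is a minimal sketch, assuming a Linux host where `ctypes.CDLL` (which calls dlopen) searches roughly the same paths, such as LD_LIBRARY_PATH and the ldconfig cache, that TensorFlow's native extension uses; `can_load` is a hypothetical helper.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can locate and open `soname`."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    for lib in ("libcublas.so.8.0", "libcublas.so.9.0"):
        print(lib, "found" if can_load(lib) else "missing")
```

On the server described here one would expect libcublas.so.8.0 to be found and libcublas.so.9.0 to be missing, matching the error above.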
Solution: roll tensorflow-gpu back to version 1.4:
pip install tensorflow-gpu==1.4 -i https://pypi.tuna.tsinghua.edu.cn/simple
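The rollback rule can be written as a lookup from the installed CUDA version to a tensorflow-gpu release built against it. A minimal sketch: the table holds only the two releases discussed here, using the CUDA/cuDNN pairs from TensorFlow's tested build configurations for Linux GPU (1.4: CUDA 8, cuDNN 6; 1.5: CUDA 9, cuDNN 7). `TESTED_BUILDS` and `newest_release_for` are hypothetical names.

```python
# Hypothetical lookup table: tensorflow-gpu release -> (CUDA, cuDNN) it was built against.
TESTED_BUILDS = {
    "1.4": ("8.0", "6"),
    "1.5": ("9.0", "7"),
}


def _key(version):
    """Sort key so that '1.10' ranks above '1.9'."""
    return tuple(int(part) for part in version.split("."))


def newest_release_for(cuda_version, table=TESTED_BUILDS):
    """Return the newest listed release built against `cuda_version`, or None."""
    candidates = [tf for tf, (cuda, _cudnn) in table.items() if cuda == cuda_version]
    return max(candidates, key=_key) if candidates else None
```

With CUDA 8.0 installed, `newest_release_for("8.0")` returns "1.4", which matches the `tensorflow-gpu==1.4` pin above.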
References: