Installing the GeForce GTX 1080 driver on Ubuntu 16.04: problems encountered and how to fix them

1. After plugging the GPU into the host, check the device:

$ nvidia-smi
Tue Dec  5 10:36:43 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
|  0%   34C    P8     8W / 200W |    284MiB /  8112MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1008    G   /usr/lib/xorg/Xorg                             117MiB |
|    0      1614    G   compiz                                         155MiB |
|    0      1886    G   fcitx-qimpanel                                   9MiB |
+-----------------------------------------------------------------------------+

 

As shown, the system has detected the GeForce GTX 1080.
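For scripting, the same facts can be read without the table layout. This is a minimal sketch, assuming nvidia-smi is on the PATH and supports the --query-gpu option; the helper name gpu_driver_summary is made up for illustration.

```shell
# Hypothetical helper: turn "name, driver_version" CSV lines into a short summary.
# It reads CSV on stdin, so it can be tried on a machine without a GPU.
gpu_driver_summary() {
    awk -F', ' '{ printf "GPU %d: %s (driver %s)\n", NR - 1, $1, $2 }'
}

# Only query the real device when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | gpu_driver_summary
fi
```

The GPU number printed is simply the line position in the CSV output, which is assumed here to match the device index.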

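Also, this machine previously had a GTX 1060 installed, and the nvidia-smi output above shows that its driver, NVIDIA 375.66, is still present. For the GTX 1080, the driver to install is NVIDIA 367.27: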

 

$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt-get update

When a confirmation prompt (Y/n) appears along the way, just press Enter to continue.

 

Then install the nvidia-367 driver:

$ sudo apt-get install nvidia-367

This step fails because of a conflict with the previously installed nvidia-375 driver:

Building initial module for 4.10.0-32-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-384.0.crash'
Error! Bad return status for module build on kernel: 4.10.0-32-generic (x86_64)
Consult /var/lib/dkms/nvidia-384/384.98/build/make.log for more information.
dpkg: error processing package nvidia-384 (--configure):
 subprocess installed post-installation script returned error exit status 10
dpkg: dependency problems prevent configuration of libcuda1-384:
 libcuda1-384 depends on nvidia-384 (>= 384.98); however:
  Package nvidia-384 is not configured yet.

dpkg: error processing package libcuda1-384 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of nvidia-367:
 nvidia-367 depends on nvidia-384; however:
  Package nvidia-384 is not configured yet.

dpkg: error processing package nvidia-367 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of nvidia-opencl-icd-384:
 nvidia-opencl-icd-384 depends on nvidia-384 (>= 384.98); however:
  Package nvidia-384 is not configured yet.

dpkg: error processing package nvidia-opencl-icd-384 (--configure):
 dependency problems - leaving unconfigured
Setting up nvidia-prime (0.8.2) ...
No apport report written because the error message indicates its a followup error from a previous failure.
No apport report written because the error message indicates its a followup error from a previous failure.
No apport report written because MaxReports is reached already
Processing triggers for libc-bin (2.23-0ubuntu9) ...
Processing triggers for initramfs-tools (0.122ubuntu8.8) ...
update-initramfs: Generating /boot/initrd.img-4.10.0-32-generic
Errors were encountered while processing:
 nvidia-384
 nvidia-375
 libcuda1-384
 libcuda1-375
 nvidia-367
 nvidia-opencl-icd-384
 nvidia-opencl-icd-375
E: Sub-process /usr/bin/dpkg returned an error code (1)

 

To deal with this, first remove the previous driver:

$ sudo apt-get remove --purge nvidia-375
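Before or after purging, it can help to see exactly which driver series are installed side by side. This is a sketch, assuming the package names follow the nvidia-<series> pattern seen in the log above; installed_nvidia_series is a hypothetical helper that reads `dpkg -l` output on stdin.

```shell
# Hypothetical helper: list installed driver packages named nvidia-<digits>.
installed_nvidia_series() {
    # dpkg -l lines look like: "ii  nvidia-375  375.66-...  amd64  description"
    # Keep packages whose status starts with "i" (installed or half-configured)
    # and strip an optional ":arch" suffix from the package name.
    awk '$1 ~ /^i/ && $2 ~ /^nvidia-[0-9]+(:.*)?$/ { sub(/:.*/, "", $2); print $2 }' | sort -u
}

if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | installed_nvidia_series
fi
```

More than one line of output means several driver series are installed at once, which is the situation that produced the errors above.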

 

Then look at the log file to see why the kernel module build failed:

$ vim /var/lib/dkms/nvidia-384/384.98/build/make.log
......
 CONFTEST: drm_atomic_available
 CONFTEST: drm_atomic_modeset_nonblocking_commit_available
 CONFTEST: is_export_symbol_gpl_refcount_inc
 CONFTEST: is_export_symbol_gpl_refcount_dec_and_test
  CC [M]  /var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-instance.o
  CC [M]  /var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-gpu-numa.o
cc: error: unrecognized command line option ‘-fstack-protector-strong’
scripts/Makefile.build:294: recipe for target '/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-instance.o' failed
make[2]: *** [/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-instance.o] Error 1
make[2]: *** Waiting for unfinished jobs....
  CC [M]  /var/lib/dkms/nvidia-384/384.98/build/nvidia/nv.o
  CC [M]  /var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-frontend.o
cc: error: unrecognized command line option ‘-fstack-protector-strong’
scripts/Makefile.build:294: recipe for target '/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-gpu-numa.o' failed
make[2]: *** [/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-gpu-numa.o] Error 1
cc: error: unrecognized command line option ‘-fstack-protector-strong’
cc: error: unrecognized command line option ‘-fstack-protector-strong’
scripts/Makefile.build:294: recipe for target '/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-frontend.o' failed
make[2]: *** [/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv-frontend.o] Error 1
scripts/Makefile.build:294: recipe for target '/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv.o' failed
make[2]: *** [/var/lib/dkms/nvidia-384/384.98/build/nvidia/nv.o] Error 1
Makefile:1524: recipe for target '_module_/var/lib/dkms/nvidia-384/384.98/build' failed
make[1]: *** [_module_/var/lib/dkms/nvidia-384/384.98/build] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-4.10.0-32-generic'
Makefile:81: recipe for target 'modules' failed
make: *** [modules] Error 2
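The log is long, and the same failure repeats for every object file. A small sketch that reduces a build log to its distinct compiler error messages follows; unique_build_errors is a hypothetical helper, and the path is the one named in the dpkg output above.

```shell
# Hypothetical helper: print each distinct "error:" line once, with its count.
unique_build_errors() {
    grep 'error:' "$1" | sort | uniq -c | sort -rn
}

log=/var/lib/dkms/nvidia-384/384.98/build/make.log
if [ -r "$log" ]; then
    unique_build_errors "$log"
fi
```

For the log shown here, this leaves a single message: the unrecognized -fstack-protector-strong option.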

A quick search showed that the -fstack-protector-strong option was only added in GCC 4.9, so GCC 4.9 or later is needed for the build to succeed.

Running gcc -v showed that the GCC on this machine was version 4.8, which confirmed the compiler version as the cause, so I upgraded GCC to 4.9:

$ sudo apt-get install gcc-4.9
$ sudo ln -sf /usr/bin/gcc-4.9 /usr/bin/gcc
$ gcc -v
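Rather than inferring support from the version number, the compiler can be asked directly whether it accepts the flag. This is a sketch with cc_accepts_flag as a hypothetical helper; it compiles a trivial program from stdin and discards the result.

```shell
# Hypothetical helper: $1 = compiler command, $2 = flag to test.
# Returns 0 if the compiler builds a trivial program with the flag.
cc_accepts_flag() {
    echo 'int main(void) { return 0; }' |
        "$1" "$2" -x c -c - -o /dev/null >/dev/null 2>&1
}

if cc_accepts_flag gcc -fstack-protector-strong; then
    echo "gcc accepts -fstack-protector-strong"
else
    echo "gcc rejects -fstack-protector-strong (or gcc is not installed)"
fi
```

The same call with gcc-4.8 or gcc-4.9 as the first argument checks a specific installed version without touching the /usr/bin/gcc symlink.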

 

Then continue with the driver installation:

$ sudo apt-get install nvidia-367
$ sudo apt-get install mesa-common-dev
$ sudo apt-get install freeglut3-dev

After that, reboot the system so the GTX 1080 driver takes effect.

 

2. Downloading and installing CUDA 8 (which supports the GTX 1080)

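(CUDA was already installed on this machine, so I go straight to testing here. I will update this section when I have time to set up a machine from scratch and work through the pitfalls.)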

 

3. Testing

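nvidia-smi now reports driver 384.98. (Some people see 367 here instead. The number differs, but the install log above shows that nvidia-367 depends on nvidia-384, and the tests and later use all work fine, so it makes no difference.)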

$ nvidia-smi
Tue Dec  5 15:27:51 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
| 33%   62C    P2   139W / 200W |   7898MiB /  8112MiB |     57%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1008      G   /usr/lib/xorg/Xorg                           188MiB |
|    0      1508      G   compiz                                       110MiB |
|    0      4491      C   python                                      7587MiB |
+-----------------------------------------------------------------------------+

 

Sample test 1:

$ cd NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          9.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8113 MBytes (8506769408 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1848 MHz (1.85 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
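The pair to check in this output is the driver and runtime version: the CUDA version supported by the driver (9.0 here) must be at least the runtime version the program was built against (8.0 here). This is a sketch of that check on the summary line; driver_covers_runtime is a hypothetical helper that reads deviceQuery output on stdin.

```shell
# Hypothetical helper: succeed if CUDA Driver Version >= CUDA Runtime Version
# on the "deviceQuery, ..." summary line.
driver_covers_runtime() {
    awk -F', ' '/^deviceQuery, / {
        for (i = 1; i <= NF; i++) {
            split($i, kv, " = ")
            if (kv[1] == "CUDA Driver Version")  drv = kv[2]
            if (kv[1] == "CUDA Runtime Version") rt  = kv[2]
        }
    }
    END {
        if (drv == "" || rt == "") exit 1
        # Compare major, then minor, as integers.
        split(drv, d, "."); split(rt, r, ".")
        ok = (d[1] + 0 > r[1] + 0) || (d[1] + 0 == r[1] + 0 && d[2] + 0 >= r[2] + 0)
        exit !ok
    }'
}
```

Usage, from the sample directory: `./deviceQuery | driver_covers_runtime && echo "driver is new enough"`.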

 

Sample test 2:

$ cd NVIDIA_CUDA-8.0_Samples/5_Simulations/nbody
$ make
$ ./nbody -benchmark -numbodies=256000 -device=0
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
    -fullscreen       (run n-body simulation in fullscreen mode)
    -fp64             (use double precision floating point values for simulation)
    -hostmem          (stores simulation data in host memory)
    -benchmark        (run benchmark to measure performance) 
    -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
    -device=<d>       (where d=0,1,2.... for the CUDA device to use)
    -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
    -compare          (compares simulation results running once on the default GPU and once on the CPU)
    -cpu              (run n-body simulation on the CPU)
    -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "GeForce GTX 1080
> Compute 6.1 CUDA device: [GeForce GTX 1080]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 2981.761 ms
= 219.790 billion interactions per second
= 4395.792 single-precision GFLOP/s at 20 flops per interaction
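The two throughput figures follow from the timing line: each iteration evaluates N × N body-to-body interactions, so the rate is N² × iterations / time, and the GFLOP/s figure multiplies that by the 20 flops per interaction stated in the output. This is a sketch of the arithmetic with awk, using the numbers printed above; nbody_rates is a hypothetical helper.

```shell
# Hypothetical helper: $1 = number of bodies, $2 = iterations, $3 = total time in ms
nbody_rates() {
    awk -v n="$1" -v it="$2" -v ms="$3" 'BEGIN {
        per_sec = n * n * it / (ms / 1000)      # interactions per second
        printf "%.3f billion interactions per second\n", per_sec / 1e9
        printf "%.3f GFLOP/s at 20 flops per interaction\n", per_sec * 20 / 1e9
    }'
}

nbody_rates 256000 10 2981.761
```

With these inputs the result agrees with the 219.790 and 4395.792 figures reported by the sample.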

 

4. Checking GPU status

The nvidia-smi command is all you need.

To refresh the display periodically, for example to show the GPU status every 10 seconds:

$ watch -n 10 nvidia-smi

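The values that matter most are the temperature, memory usage, and GPU utilization (the Temp, Memory-Usage, and GPU-Util columns of the table).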
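When only those three values matter, nvidia-smi can print them as CSV instead of the full table. This is a sketch, assuming an nvidia-smi that supports --query-gpu; the 80 °C threshold and the flag_hot_gpus helper are illustrative assumptions, not values taken from this setup.

```shell
# Hypothetical helper.
# stdin: "temperature, memory.used, memory.total, utilization" per GPU, no units.
# $1: temperature threshold in degrees C (an arbitrary example value below).
flag_hot_gpus() {
    awk -F', ' -v limit="$1" '{
        status = ($1 + 0 >= limit + 0) ? "HOT" : "ok"
        printf "temp=%sC mem=%s/%sMiB util=%s%% %s\n", $1, $2, $3, $4, status
    }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=temperature.gpu,memory.used,memory.total,utilization.gpu \
               --format=csv,noheader,nounits | flag_hot_gpus 80
fi
```

Adding `-l 10` to the nvidia-smi call repeats the query every 10 seconds, similar to the watch command above.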

See also: an explanation of the nvidia-smi command

 

======================================================================================

Addendum, 2018-02-03

A problem I ran into recently after installing a GTX 1060 in another server:

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

Cause: the machine had CUDA 8.0 and the cuDNN build for CUDA 8.0 installed, whereas tensorflow-gpu 1.5 requires CUDA 9.0.
Fix: roll tensorflow-gpu back to version 1.4.

$ pip install tensorflow-gpu==1.4 -i https://pypi.tuna.tsinghua.edu.cn/simple
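To confirm which CUDA version the installed libraries actually provide before choosing a TensorFlow version, the dynamic linker cache can be inspected. This is a sketch; cublas_versions is a hypothetical helper that reads `ldconfig -p` output on stdin, and it only sees libraries registered with ldconfig (a CUDA directory reachable only through LD_LIBRARY_PATH will not appear).

```shell
# Hypothetical helper: print the version suffixes of libcublas found in ldconfig output.
cublas_versions() {
    # ldconfig -p lines look like:
    #   libcublas.so.8.0 (libc6,x86-64) => /usr/local/cuda-8.0/lib64/libcublas.so.8.0
    sed -n 's/^[[:space:]]*libcublas\.so\.\([0-9][0-9.]*\) .*/\1/p' | sort -u
}

if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | cublas_versions
fi
```

If 9.0 is missing from the output, a TensorFlow build that links against libcublas.so.9.0 will fail with the ImportError shown above.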

 

References:

Deep learning host environment setup: Ubuntu16.04+Nvidia GTX 1080+CUDA8.0

ubuntu 16.04: updating gcc/g++ to 4.9.2

Monitoring NVIDIA GPU usage on Linux

Commands for viewing real-time GPU status

http://blog.csdn.net/w5688414/article/details/79187499
