Yes, this is a rather fancy, high-end technology. I finally get to do what Lao Cui did back in the day; life is full of surprises.
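1. CUDA
Released by NVIDIA as an interface for GPU programming specifically on NVIDIA cards. The documentation is thorough, and it works on nearly all NVIDIA cards.
The GPU programming techniques covered in this column are all based on this interface.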
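2. OpenCL
An open GPU programming standard with the widest reach; it works on nearly all graphics cards.
Compared with CUDA it is somewhat harder to master. It is better to learn CUDA first, after which picking up OpenCL becomes quite easy.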
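3. DirectCompute
A GPU programming interface developed by Microsoft. It is powerful and the easiest of the three to learn, but it only runs on Windows, so it cannot be used on the many high-end servers that run UNIX.
In summary, each of these interfaces has its strengths and weaknesses, and the choice depends on the situation at hand. They are used in very similar ways, though, so once you have mastered one, learning the other two is easy.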
Ref: http://www.javashuo.com/article/p-nidqtawy-ep.html
For GPU parallel programming, there are systematic courses at North American universities.
CME 213 Introduction to parallel computing using MPI, openMP, and CUDA
Eric Darve, Stanford University
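My impression is that MPI and OpenMP have largely been displaced by MapReduce, while the CUDA part is still valuable.
Caltech Computing + Mathematical Sciences, 2018 course [recommended]
Driver installation: Installing Ubuntu 16.04 with CUDA 9.0 and cuDNN 7.3 for deep learning [fairly detailed]
Reference blog post: How to Setup Ubuntu 16.04 with CUDA, GPU, and other requirements for Deep Learning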
X server issue: How to install NVIDIA.run?
How to recover if the system keeps bouncing back to the login screen: https://www.jianshu.com/p/34236a9c4a2f
sudo apt-get remove --purge 'nvidia-*'
sudo apt-get install ubuntu-desktop
sudo rm /etc/X11/xorg.conf
echo 'nouveau' | sudo tee -a /etc/modules
# Reboot the system
sudo reboot
sudo apt-get install nvidia-cuda-toolkit
cpu one thread : Time cost: 30.723241 sec, data[100] is -0.207107
gpu full threads : Time cost: 0.107630 sec, data[100] is -0.207107
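This test was run on a laptop. The GPU run is about 285 times faster than a single CPU thread; even with all four CPU threads running flat out, that still leaves roughly a 70x speedup.
There does not yet seem to be one complete solution. Instead, different algorithms appear to have different libraries that provide GPU-accelerated versions. In other words, the field is for now in a fragmented "Three Kingdoms" phase.
That's right, it all follows a set routine: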
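The speedup can be checked with a few lines of arithmetic. This is only a sketch: it assumes ideal (linear) scaling across four CPU threads, which real code rarely achieves, so the four-thread figure is an optimistic bound for the CPU. The variable names are my own.

```python
# Timings copied from the benchmark output above (seconds).
cpu_one_thread = 30.723241
gpu_full_threads = 0.107630

# Speedup of the GPU run over the single-threaded CPU run.
speedup_vs_one_thread = cpu_one_thread / gpu_full_threads

# Assumption: four CPU threads scale perfectly, i.e. run 4x faster.
cpu_threads = 4
speedup_vs_four_threads = speedup_vs_one_thread / cpu_threads

print(f"GPU vs 1 CPU thread : {speedup_vs_one_thread:.0f}x")
print(f"GPU vs 4 CPU threads: {speedup_vs_four_threads:.0f}x (ideal scaling assumed)")
```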
• Setup inputs on the host (CPU-accessible memory)
• Allocate memory for outputs on the host
• Allocate memory for inputs on the GPU
• Allocate memory for outputs on the GPU
• Copy inputs from host to GPU
• Start the GPU kernel (a function that executes on the GPU)
• Copy output from GPU to host
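The steps above can be walked through in a CPU-only mock. No real GPU or CUDA call is involved here: the "device" buffers are just separate Python lists, and `device_alloc`, `copy`, `launch` and `sine_kernel` are hypothetical stand-ins I made up to show the order of operations, not any library's API.

```python
import math

# CPU-only mock: "device" memory is a separate Python list, and the
# "kernel" is an ordinary function called once per index.

def device_alloc(n):
    """Hypothetical stand-in for a GPU allocation: a separate zeroed buffer."""
    return [0.0] * n

def copy(dst, src):
    """Hypothetical stand-in for a host<->device memory copy."""
    dst[:] = src

def sine_kernel(i, d_in, d_out):
    """Work done by one 'thread': each index is independent of the others."""
    d_out[i] = math.sin(d_in[i])

def launch(kernel, n, *args):
    """Run the kernel for every index; a real GPU would run these in parallel."""
    for i in range(n):
        kernel(i, *args)

n = 8
h_in = [i * 0.5 for i in range(n)]   # 1. set up inputs on the host
h_out = [0.0] * n                    # 2. allocate outputs on the host
d_in = device_alloc(n)               # 3. allocate inputs on the "GPU"
d_out = device_alloc(n)              # 4. allocate outputs on the "GPU"
copy(d_in, h_in)                     # 5. copy inputs host -> "GPU"
launch(sine_kernel, n, d_in, d_out)  # 6. start the kernel
copy(h_out, d_out)                   # 7. copy outputs "GPU" -> host

print(h_out)
```

In real CUDA code the two copies are usually the expensive part, which is why the kernel has to do enough work per element to pay for them.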
Link: jcjohnson/cnn-benchmarks
Ref: Build a super fast deep learning machine for under $1,000
Perhaps the most important attribute to look at for deep learning is the available RAM on the card. If TensorFlow can’t fit the model and the current batch of training data into the GPU’s RAM it will fail over to the CPU—making the GPU pointless.
Get things running stably on the CPU first, then worry about the GPU.
Another key consideration is the architecture of the graphics card. The last few architectures NVIDIA has put out have been called "Kepler," "Maxwell," and "Pascal," in that order. The difference between the architectures really matters for speed; for example, the Pascal Titan X is twice the speed of a Maxwell Titan X according to this benchmark.
GPUs are critical: The Pascal Titan X with cuDNN is 49x to 74x faster than dual Xeon E5-2630 v3 CPUs.
Most of the papers on machine learning use the TITAN X card, which is fantastic but costs at least $1,000, even for an older version. Most people doing machine learning without infinite budget use the NVIDIA GTX 900 series (Maxwell) or the NVIDIA GTX 1000 series (Pascal).
To figure out the architecture of a card, you can look at the spectacularly confusing naming conventions of NVIDIA: the 9XX cards use the Maxwell architecture while the 10XX cards use the Pascal architecture.
But a 980 card is still probably significantly faster than a 1060 due to higher clock speed and more RAM.
You will have to set different flags for NVIDIA cards based on the architecture of the GPU you get. But the most important thing is any 9XX or 10XX card will be an order of magnitude faster than your laptop.
Don’t be paralyzed by the options; if you haven’t worked with a GPU, they will all be much better than what you have now.
I went with the GeForce GTX 1060 3GB for $195, and it runs models about 20 times faster than my MacBook, but it occasionally runs out of memory for some applications, so I probably should have gotten the GeForce GTX 1060 6GB for an additional $60.
Ref: Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning
2017-04-09
Generally, I would recommend the GTX 1080 Ti, GTX 1080 or GTX 1070.
They are all excellent cards and if you have the money for a GTX 1080 Ti you should go ahead with that.
The GTX 1070 is a bit cheaper and still faster than a regular GTX Titan X (Maxwell).
The GTX 1080 was a bit less cost efficient than the GTX 1070, but since the GTX 1080 Ti was introduced the price fell significantly and now the GTX 1080 is able to compete with the GTX 1070.
All these three cards should be preferred over the GTX 980 Ti due to their increased memory of 11GB and 8GB (instead of 6GB).
【GPU顯存最好大於6GB】
I personally would go with multiple GTX 1070 or GTX 1080 for research. I would rather run a few more experiments which are a bit slower than run just one experiment which is faster.
In NLP the memory constraints are not as tight as in computer vision and so a GTX 1070/GTX 1080 is just fine for me. The tasks I work on and how I run my experiments determines the best choice for me, which is either a GTX 1070 or GTX 1080.
Best GPU overall (by a small margin): Titan Xp
Cost efficient but expensive: GTX 1080 Ti, GTX 1070, GTX 1080
Cost efficient and cheap: GTX 1060 (6GB)
I work with data sets > 250GB: GTX Titan X (Maxwell), NVIDIA Titan X Pascal, or NVIDIA Titan Xp
I have little money: GTX 1060 (6GB)
I have almost no money: GTX 1050 Ti (4GB)
I do Kaggle: GTX 1060 (6GB) for any "normal" competition, or GTX 1080 Ti for "deep learning competitions"
I am a competitive computer vision researcher: NVIDIA Titan Xp; do not upgrade from existing Titan X (Pascal or Maxwell)
I am a researcher: GTX 1080 Ti. In some cases, like natural language processing, a GTX 1070 or GTX 1080 might also be a solid choice — check the memory requirements of your current models
I want to build a GPU cluster: This is really complicated, you can get some ideas here
I started deep learning and I am serious about it: Start with a GTX 1060 (6GB). Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GTX 1060 and buy something more appropriate
I want to try deep learning, but I am not serious about it: GTX 1050 Ti (4 or 2GB)
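Get a good power supply, which makes it easier to add graphics cards later.
16GB of RAM is recommended.
CPU: Intel i5
From: Using a specified GPU and GPU memory in TensorFlow
(1) Selecting the GPU when running a program from the terminal
If the machine has more than one GPU, TensorFlow uses all of them by default.
To use only some of the GPUs, set CUDA_VISIBLE_DEVICES. When invoking a Python program, you can use (from Franck Dernoncourt's answer in the Link):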
CUDA_VISIBLE_DEVICES=1 python my_script.py
Environment Variable Syntax    Results
CUDA_VISIBLE_DEVICES=1         Only device 1 will be seen
CUDA_VISIBLE_DEVICES=0,1       Devices 0 and 1 will be visible
CUDA_VISIBLE_DEVICES="0,1"     Same as above, quotation marks are optional
CUDA_VISIBLE_DEVICES=0,2,3     Devices 0, 2, 3 will be visible; device 1 is masked
CUDA_VISIBLE_DEVICES=""        No GPU will be visible
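The same per-command scoping can be done from a Python launcher script. The sketch below uses only the standard library; `run_with_visible_devices` is a hypothetical helper name, and the child command is a one-liner standing in for `python my_script.py` so that the example runs without a GPU.

```python
import os
import subprocess
import sys

def run_with_visible_devices(devices, argv):
    """Run argv in a child process that only sees the given CUDA devices.

    `devices` is a string such as "1" or "0,2,3"; "" hides every GPU.
    """
    env = dict(os.environ)                  # copy, so the parent is untouched
    env["CUDA_VISIBLE_DEVICES"] = devices
    return subprocess.run(argv, env=env, capture_output=True,
                          text=True, check=True)

# Stand-in for `python my_script.py`: a one-liner that reports what it sees.
child = [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_with_visible_devices("0,2,3", child)
print(result.stdout.strip())
```

Because the variable is set only in the copied environment, other jobs started from the same launcher can each be pinned to a different GPU.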
(2) Selecting the GPU in Python code
To select the GPU from within Python code (for example, when debugging in PyCharm), use the following code (from Yaroslav Bulatov's answer in the Link):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
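Two details are easy to miss with this approach: the variable has to be set before the GPU framework initializes CUDA (in practice, before it is imported), and the visible devices are renumbered from 0 inside the process. The sketch below illustrates both; `select_gpus` and `logical_to_physical` are hypothetical helpers of my own, not part of TensorFlow or CUDA.

```python
import os

def select_gpus(*indices):
    """Hypothetical helper: expose only the given physical GPU indices."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

def logical_to_physical(value):
    """Map the renumbered (logical) device ids back to the physical ones."""
    physical = [int(p) for p in value.split(",") if p != ""]
    return dict(enumerate(physical))

visible = select_gpus(2, 3)
print(visible)
print(logical_to_physical(visible))
# Import the GPU framework only after this point, e.g. `import tensorflow as tf`.
```

With physical GPUs 2 and 3 selected, the framework addresses them as device 0 and device 1.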
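(3) Setting how much GPU memory TensorFlow uses
<3.1> Fixed memory fraction
By default TensorFlow claims as much GPU memory as it can. The amount can be capped as follows (TensorFlow 1.x API):
import tensorflow as tf
# Allow this process to use at most 0.7 x the GPU's physical memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))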
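To get a feel for what a fraction of 0.7 means in practice, the following sketch multiplies it out for the nominal card sizes mentioned earlier in this post. It is plain arithmetic, not a TensorFlow call: `memory_budget_gb` is a hypothetical helper, and the memory actually usable on a card is somewhat less than its nominal size.

```python
def memory_budget_gb(total_gb, fraction):
    """Memory one process may claim for a given per-process fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return total_gb * fraction

# Nominal sizes from this post: 3GB and 6GB GTX 1060, 8GB GTX 1070/1080,
# 11GB GTX 1080 Ti.
for total in (3, 6, 8, 11):
    budget = memory_budget_gb(total, 0.7)
    print(f"{total:>2} GB card, fraction 0.7 -> {budget:.1f} GB")
```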
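<3.2> On-demand memory allocation
The approach above can only set a fixed size. To allocate memory on demand instead, use the allow_growth parameter (reference: http://blog.csdn.net/cq361106306/article/details/52950081):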
gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
Android GPU: https://blog.csdn.net/u011723240/article/details/30109763
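Training syllabus: https://blog.csdn.net/PCb4jR/article/details/78890915
Sobel operator comparison; the table lists some measured results:
These results show that, on this experimental platform, the speedup from parallelization becomes more pronounced as the image size grows (that is, as there is more data to process).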
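For reference, here is a plain-Python CPU version of the Sobel operator. It is a sketch for illustration only, not the code that produced the measurements. It shows why the operator parallelizes well: each output pixel depends only on its own 3x3 neighborhood, so every pixel could be handled by a separate GPU thread. One plausible reason the speedup grows with image size is that more pixels mean more independent work over which to spread the fixed launch and transfer overhead.

```python
# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_pixel(img, y, x):
    """Gradient magnitude at one interior pixel; independent of all others."""
    gx = gy = 0
    for dy in range(3):
        for dx in range(3):
            p = img[y + dy - 1][x + dx - 1]
            gx += GX[dy][dx] * p
            gy += GY[dy][dx] * p
    return (gx * gx + gy * gy) ** 0.5

def sobel(img):
    """Apply the operator to every interior pixel; borders are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):        # on a GPU, each (y, x) pair would be
        for x in range(1, w - 1):    # handled by its own thread
            out[y][x] = sobel_pixel(img, y, x)
    return out

# 5x5 test image with a vertical edge between columns 1 and 2.
image = [[0, 0, 10, 10, 10] for _ in range(5)]
edges = sobel(image)
for row in edges:
    print(row)
```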
End.