Debugging Full GPU Memory Usage with GPU-Util at 0

Date: 2019-11-19

Recently I used an open-source project from GitHub to train a CNN-based translation model. Running THEANO_FLAGS='floatX=float32,device=gpu2,lib.cnmem=1' python run_nnet.py -w data/exp1/ failed with the error "The image and the kernel must have the same type. inputs(float64), kerns(float32)". Switching to THEANO_FLAGS='floatX=float64,device=gpu2,lib.cnmem=1' python run_nnet.py -w data/exp1/ ran successfully, but a few hundred training examples took more than ten minutes, which is extremely slow.

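Checking the GPU with nvidia-smi -l showed that GPU memory usage was full while GPU-Util was 0, and top showed the CPU at 1600% (a 16-core machine). This is very different from other jobs, where GPU-Util is close to 100% and CPU stays under 100%. It looked as if the CPU was saturated while the GPU sat idle, with the computation running entirely on the CPU. Googling the problem turned up no satisfactory answer, so I went back to the official documentation.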

First, I tried assert_no_cpu_op=raise:

THEANO_FLAGS="floatX=float64,device=gpu2,force_device=True,mode=FAST_RUN,lib.cnmem=1,assert_no_cpu_op=raise" python run_nnet.py

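According to the official documentation, with this flag set, any operation executed on the CPU should raise an exception. In practice, no exception was raised.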
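The flag strings in these commands get long and are easy to mistype. As a small convenience sketch, the same comma-separated key=value string can be assembled in Python. The helper name build_theano_flags is hypothetical and not part of Theano. The sketch assumes the variable is set before theano is imported, because the environment variable has to be in place when the library reads its configuration.

```python
import os


def build_theano_flags(options):
    """Join option pairs into the comma-separated THEANO_FLAGS format.

    Hypothetical helper for illustration; it only formats the string.
    """
    return ",".join("{}={}".format(key, value) for key, value in options.items())


# Same options as the command above, written as a dict for readability.
flags = build_theano_flags({
    "floatX": "float64",
    "device": "gpu2",
    "force_device": "True",
    "mode": "FAST_RUN",
    "lib.cnmem": "1",
    "assert_no_cpu_op": "raise",
})

# Set the variable before any `import theano` in the same process.
os.environ["THEANO_FLAGS"] = flags
print(flags)
```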

Reading the documentation more carefully, I found that the Theano FAQ (refer: http://deeplearning.net/software/theano/faq.html) says:

"It should be noted that using float32 and int{32, 64} together inside a function would provide float64 as output.

Since the GPU can't compute this kind of output, it would be preferable not to use those dtypes together.

To help you find where float64 are created, see the warn_float64 Theano flag."

In other words, the GPU cannot compute float64, so the work falls back to the CPU. Further, the documentation on using the GPU (refer: http://deeplearning.net/software/theano/tutorial/using_gpu.html) says:

"Only computations with float32 data-type can be accelerated. Better support for float64 is expected in upcoming hardware but float64 computations are still relatively slow (Jan 2010).

Prefer constructors like matrix, vector and scalar to dmatrix, dvector and dscalar because the former will give you float32 variables when floatX=float32.
Ensure that your output variables have a float32 dtype and not float64. The more float32 variables are in your graph, the more work the GPU can do for you."
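The promotion rule quoted from the FAQ can be observed with NumPy alone, which is a quick way to see how a float64 sneaks in. This is only a NumPy illustration, on the assumption that Theano's symbolic variables promote the same way the FAQ describes. The array names are made up for the example.

```python
import numpy as np

f32 = np.ones(3, dtype=np.float32)
i16 = np.ones(3, dtype=np.int16)
i32 = np.ones(3, dtype=np.int32)
i64 = np.ones(3, dtype=np.int64)

# Mixing float32 with int32 or int64 arrays promotes the result to float64.
mixed32 = f32 * i32
mixed64 = f32 * i64

# A narrower integer type fits into float32, so no promotion happens.
mixed16 = f32 * i16

# Casting the integer operand first keeps the whole expression in float32.
casted = f32 * i32.astype(np.float32)

print(mixed32.dtype, mixed64.dtype, mixed16.dtype, casted.dtype)
```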

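So the cause is that some inputs in the code are float64. As the documentation suggests, the warn_float64 config flag can help locate where float64 values come from, so I ran: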

THEANO_FLAGS="floatX=float64,device=gpu2,force_device=True,mode=FAST_RUN,lib.cnmem=1,warn_float64=raise" python run_nnet.py -w data/exp1/

The traceback was:

Traceback (most recent call last):
  File "run_nnet.py", line 570, in <module>
    main()
  File "run_nnet.py", line 208, in main
    nnet_q.set_input((x_q, x_q_overlap))
  File ".../nn_layers.py", line 65, in set_input
    self.output = self.output_func(input)
  File ".../nn_layers.py", line 89, in output_func
    layer.set_input(cur_input)

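This traceback only shows that a float64 is present when the network's set_input is called. The float64 variable itself was created long before that point, so the traceback still does not locate the source of the problem. In this case the flag was of little practical help.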
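Since the flag does not point at where the float64 data originates, one alternative is to inspect the dtypes of the arrays directly before they reach the network. The following is a minimal sketch, not code from the project. The helper find_float64 is hypothetical, and the example arrays only borrow names from the traceback as stand-ins for whatever the script actually loads.

```python
import numpy as np


def find_float64(named_arrays):
    """Return the names of the arrays whose dtype is float64."""
    return [name for name, arr in named_arrays.items()
            if np.asarray(arr).dtype == np.float64]


# Hypothetical inputs; the real script loads its own data.
inputs = {
    "x_q": np.zeros((4, 10), dtype=np.int32),
    "x_q_overlap": np.ones((4, 10)),  # no dtype given, so float64
    "embeddings": np.ones((100, 50), dtype=np.float32),
}

offenders = find_float64(inputs)
print(offenders)
```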

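Almost all inputs in the code are generated or loaded by numpy. Checking the numpy documentation, I found that array-creation routines produce float64 by default, for example numpy.ones, numpy.zeros and numpy.random.RandomState.randn. Theano's config (refer: http://deeplearning.net/software/theano/library/config.html) has a cast_policy option. According to the documentation, with floatX=float32 and cast_policy=numpy+floatX, arrays produced by numpy are automatically converted to float32 during execution. So I ran: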

THEANO_FLAGS="floatX=float32,device=gpu2,force_device=True,mode=FAST_RUN,lib.cnmem=1,cast_policy=numpy+floatX" python run_nnet.py -w data/exp1/

The result: still an error, "NotImplementedError: The image and the kernel must have the same type.inputs(float64), kerns(float32)"

The documentation for this option notes:

"Note that 'numpy+floatX' is not currently behaving exactly as planned (it is a work-in-progress), and thus you should consider it as experimental."

So it really is still experimental, and some cases are not handled.

That left the last resort: carefully check every numpy call and explicitly specify dtype=numpy.float32 wherever an array is created. With all of them changed, I ran:

THEANO_FLAGS="floatX=float32,device=gpu2,force_device=True,mode=FAST_RUN,lib.cnmem=1" python run_nnet.py -w data/exp1/

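This time GPU-Util was fully utilized, CPU usage dropped to 100%, and training on a few hundred examples went from more than ten minutes down to a matter of seconds!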
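The kind of change involved can be sketched as follows. This is an illustration of the pattern, not the project's actual code, and the variable names and shapes are made up. Note that RandomState.randn takes no dtype argument, so its result has to be cast afterwards.

```python
import numpy as np

rng = np.random.RandomState(1234)

# Before: no dtype given, so the array is float64.
default_ones = np.ones((2, 3))

# After: pass dtype explicitly where the constructor accepts it.
w_ones = np.ones((2, 3), dtype=np.float32)
w_zeros = np.zeros((2, 3), dtype=np.float32)

# randn has no dtype parameter, so cast its float64 result.
w_rand = rng.randn(2, 3).astype(np.float32)

# Data coming from lists or loaded files can be converted the same way.
loaded = np.asarray([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32)

print(default_ones.dtype, w_ones.dtype, w_zeros.dtype, w_rand.dtype, loaded.dtype)
```

In a Theano script, passing dtype=theano.config.floatX instead of a hard-coded numpy.float32 is a common way to keep the arrays in step with the floatX flag, assuming theano is imported there.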

Lessons learned:

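Reading the official documentation is a very effective way to approach a problem. Common issues are often explained clearly there, which helps you understand exactly what is wrong and then find a solution.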