A Full Hardware Guide to Deep Learning

Source: https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/

Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.

 

Over the years, I have built a total of 7 different deep learning workstations and, despite careful research and reasoning, I made my fair share of mistakes in selecting hardware parts. In this guide, I want to share the experience that I gained over the years so that you do not make the same mistakes that I did before.

The blog post is ordered by mistake severity. This means the mistakes where people usually waste the most money come first.

GPU

This blog post assumes that you will use a GPU for deep learning. If you are building or upgrading your system for deep learning, it is not sensible to leave out the GPU. The GPU is just the heart of deep learning applications – the improvement in processing speed is just too huge to ignore.

I talked at length about GPU choice in my GPU recommendations blog post, and the choice of your GPU is probably the most critical choice for your deep learning system. There are three main mistakes that you can make when choosing a GPU: (1) bad cost/performance, (2) not enough memory, (3) poor cooling.

For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).
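
To make the 16-bit point concrete, here is a minimal sketch of mixed-precision training in PyTorch, using the torch.cuda.amp API from versions newer than this post (the tiny model and random data are placeholders, and a CUDA GPU is assumed):

```python
import torch
import torch.nn.functional as F

# Placeholder model and data -- stand-ins for your own network and batches.
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss so fp16 gradients do not underflow

inputs = torch.randn(32, 512, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():        # runs the forward pass in 16-bit where it is safe
    loss = F.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```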

Be careful about the memory requirements when you pick your GPU. RTX cards, which can run in 16-bit, can train models that are twice as big with the same amount of memory compared to GTX cards. As such, RTX cards have a memory advantage, and picking RTX cards and learning how to use 16-bit models effectively will carry you a long way. In general, the requirements for memory are roughly the following:

  • Research that is hunting state-of-the-art scores: >=11 GB
  • Research that is hunting for interesting architectures: >=8 GB
  • Any other research: 8 GB
  • Kaggle: 4 – 8 GB
  • Startups: 8 GB (but check the specific application area for model sizes)
  • Companies: 8 GB for prototyping, >=11 GB for training
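
To see where your own card falls on this list, you can query its memory from PyTorch; a minimal sketch, assuming a CUDA-capable machine:

```python
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # e.g. an RTX 2080 Ti reports ~11 GiB, enough for state-of-the-art work
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```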

Another problem to watch out for, especially if you buy multiple RTX cards, is cooling. If you want to stick GPUs into PCIe slots which are next to each other, you should make sure that you get GPUs with a blower-style fan. Otherwise you might run into temperature issues, and your GPUs will be slower (about 30%) and die faster.

Suspect line-up
Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?

RAM

The main mistake with RAM is to buy RAM with too high a clock rate. The second mistake is to buy too little RAM to have a smooth prototyping experience.

Needed RAM Clock Rate

RAM clock rates are marketing stunts where RAM companies lure you into buying "faster" RAM which actually yields little to no performance gain. This is best explained by the "Does RAM speed REALLY matter?" video by Linus Tech Tips.

Furthermore, it is important to know that RAM speed is pretty much irrelevant for fast CPU RAM->GPU RAM transfers. This is so because (1) if you use pinned memory, your mini-batches will be transferred to the GPU without involvement from the CPU, and (2) if you do not use pinned memory, the performance gain of fast vs. slow RAM is about 0-3% — spend your money elsewhere!

RAM Size

RAM size does not affect deep learning performance. However, it might hinder you from executing your GPU code comfortably (without swapping to disk). You should have enough RAM to work comfortably with your GPU. This means you should have at least the amount of RAM that matches your biggest GPU. For example, if you have a Titan RTX with 24 GB of memory, you should have at least 24 GB of RAM. However, if you have more GPUs you do not necessarily need more RAM.
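
A quick sanity check of this rule of thumb, sketched here with the third-party psutil package (an assumption on my part; any tool that reports system RAM works just as well):

```python
import torch
import psutil  # pip install psutil

system_ram = psutil.virtual_memory().total
if torch.cuda.is_available():
    largest_gpu = max(
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    )
    print(f"System RAM: {system_ram / 1024**3:.0f} GiB, "
          f"largest GPU: {largest_gpu / 1024**3:.0f} GiB")
    if system_ram < largest_gpu:
        print("Consider adding RAM to at least match your largest GPU.")
```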

The problem with this "match largest GPU memory in RAM" strategy is that you might still fall short of RAM if you are processing large datasets. The best strategy here is to match your GPU, and if you feel that you do not have enough RAM, just buy some more.

A different strategy is influenced by psychology: Psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that allows you to conserve your concentration resource for more difficult programming problems. Rather than spending lots of time on circumnavigating RAM bottlenecks, you can invest your concentration on more pressing matters if you have more RAM.  With a lot of RAM you can avoid those bottlenecks, save time and increase productivity on more pressing problems. Especially in Kaggle competitions, I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing then additional RAM might be a good choice. So with this strategy, you want to have more, cheap RAM now rather than later.

CPU

The main mistake that people make is to pay too much attention to the PCIe lanes of a CPU. You should not care much about PCIe lanes. Instead, just look up whether your CPU and motherboard combination supports the number of GPUs that you want to run. The second most common mistake is to get a CPU which is too powerful.

CPU and PCI-Express

People go crazy about PCIe lanes! However, the thing is that they have almost no effect on deep learning performance. If you have a single GPU, PCIe lanes are only needed to transfer data from your CPU RAM to your GPU RAM quickly. An ImageNet batch of 32 images (32x225x225x3) in 32-bit needs 1.1 milliseconds with 16 lanes, 2.3 milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretical numbers, and in practice you often see PCIe be twice as slow — but this is still lightning fast! PCIe lanes often have a latency in the nanosecond range, and thus latency can be ignored.
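
The back-of-the-envelope arithmetic behind these numbers is easy to reproduce; a sketch, assuming roughly 1 GB/s of usable bandwidth per PCIe 3.0 lane (the exact constant varies by platform, so this only reproduces the ballpark):

```python
# One ImageNet mini-batch: 32 images of 225x225x3 values stored as 32-bit floats.
batch_bytes = 32 * 225 * 225 * 3 * 4      # ~19.4 MB
lane_bandwidth = 0.985e9                  # assumed usable bytes/s per PCIe 3.0 lane

for lanes in (16, 8, 4):
    ms = batch_bytes / (lanes * lane_bandwidth) * 1000
    print(f"{lanes:2d} lanes: ~{ms:.1f} ms")  # ~1.2 / 2.5 / 4.9 ms -- same ballpark as above
```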

Putting this together, for an ImageNet mini-batch of 32 images and a ResNet-152 we have the following timings:

  • Forward and backward pass: 216 milliseconds (ms)
  • 16 PCIe lanes CPU->GPU transfer: About 2 ms (1.1 ms theoretical)
  • 8 PCIe lanes CPU->GPU transfer: About 5 ms (2.3 ms)
  • 4 PCIe lanes CPU->GPU transfer: About 9 ms (4.5 ms)

Thus going from 4 to 16 PCIe lanes will give you a performance increase of roughly 3.2%. However, if you use PyTorch’s data loader with pinned memory you gain exactly 0% performance. So do not waste your money on PCIe lanes if you are using a single GPU!
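
For reference, a minimal sketch of what pinned memory looks like in PyTorch; the random tensors are placeholders for a real dataset, and a CUDA GPU is assumed:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for real ImageNet-style data.
dataset = TensorDataset(torch.randn(64, 3, 225, 225),
                        torch.randint(0, 1000, (64,)))
# pin_memory=True puts batches in page-locked RAM so they can be copied
# to the GPU by DMA, without the CPU shuffling bytes around.
loader = DataLoader(dataset, batch_size=32, pin_memory=True)

for x, y in loader:
    # non_blocking=True lets the copy overlap with GPU compute, which is
    # why the PCIe lane count stops mattering for a single GPU.
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
```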

When you select CPU PCIe lanes and motherboard PCIe lanes make sure that you select a combination which supports the desired number of GPUs. If you buy a motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make sure that you buy a CPU that supports 2 GPUs, but do not necessarily look at PCIe lanes.

PCIe Lanes and Multi-GPU Parallelism

Are PCIe lanes important if you train networks on multiple GPUs with data parallelism? I have published a paper on this at ICLR 2016, and I can tell you that if you have 96 GPUs then PCIe lanes are really important. However, if you have 4 or fewer GPUs this does not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe lanes. With 4 GPUs, I would make sure that each GPU is supported by 8 PCIe lanes (32 PCIe lanes in total). Since almost nobody runs a system with more than 4 GPUs, as a rule of thumb: do not spend extra money to get more PCIe lanes per GPU — it does not matter!

Needed CPU Cores

To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU. Mostly it (1) initiates GPU function calls, (2) executes CPU functions.

By far the most useful application for your CPU is data preprocessing. There are two different common data processing strategies which have different CPU needs.

The first strategy is preprocessing while you train:

Loop:

  1. Load mini-batch
  2. Preprocess mini-batch
  3. Train on mini-batch

The second strategy is preprocessing before any training:

  1. Preprocess data
  2. Loop:
    1. Load preprocessed mini-batch
    2. Train on mini-batch

For the first strategy, a good CPU with many cores can boost performance significantly. For the second strategy, you do not need a very good CPU. For the first strategy, I recommend a minimum of 4 threads per GPU — that is usually two cores per GPU. I have not done hard tests for this, but you should gain about 0-5% additional performance per additional core/GPU.

For the second strategy, I recommend a minimum of 2 threads per GPU — that is usually one core per GPU. You will not see significant gains in performance when you have more cores if you are using the second strategy.
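
In PyTorch terms these threads map loosely onto DataLoader worker processes; a sketch of the rule of thumb with illustrative counts (and a placeholder dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 225, 225),
                        torch.randint(0, 1000, (256,)))

num_gpus = 2                         # illustrative
workers_strategy_1 = 2 * num_gpus    # preprocess while training: ~2 cores per GPU
workers_strategy_2 = 1 * num_gpus    # preprocess up front: ~1 core per GPU is plenty

# Guard with `if __name__ == "__main__":` on platforms that spawn worker processes.
loader = DataLoader(dataset, batch_size=32,
                    num_workers=workers_strategy_1, pin_memory=True)
```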

Needed CPU Clock Rate (Frequency)

When people think about fast CPUs they usually first think about the clock rate. 4 GHz is better than 3.5 GHz, or is it? This is generally true for comparing processors with the same architecture, e.g. "Ivy Bridge", but it compares poorly between processors of different architectures. Also, it is not always the best measure of performance.

In the case of deep learning there is very little computation to be done by the CPU: increment a few variables here, evaluate some Boolean expressions there, make some function calls on the GPU or within the program – all these depend on the CPU core clock rate.

While this reasoning seems sensible, there is the fact that the CPU has 100% usage when I run deep learning programs, so what is the issue here? I did some CPU core rate underclocking experiments to find out.

CPU underclocking on MNIST and ImageNet: Performance is measured as time taken on 200 epochs MNIST or a quarter epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as the baseline for each CPU. For comparison: upgrading from a GTX 680 to a GTX Titan is about +15% performance; from GTX Titan to GTX 980 another +20% performance; GPU overclocking yields about +5% performance for any GPU.

Note that these experiments were run on dated hardware; however, the results should still be the same for modern CPUs/GPUs.

Hard drive/SSD

The hard drive is not usually a bottleneck for deep learning. However, if you do stupid things it will hurt you: if you read your data from disk when it is needed (blocking wait), then a 100 MB/s hard drive will cost you about 185 milliseconds for an ImageNet mini-batch of size 32 — ouch! However, if you asynchronously fetch the data before it is used (for example, torchvision loaders), then you will have loaded the mini-batch in those 185 milliseconds while the compute time for most deep neural networks on ImageNet is about 200 milliseconds. Thus you will not face any performance penalty, since you load the next mini-batch while the current one is still computing.
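
The asynchronous fetching described here can be sketched with a background thread and a small queue; real loaders such as torchvision's are more sophisticated, and the timings below are simulated with sleeps:

```python
import queue
import threading
import time

def read_minibatch(i):
    time.sleep(0.185)                  # simulate a ~185 ms blocking disk read
    return f"mini-batch {i}"

def prefetch(q, n):
    for i in range(n):
        q.put(read_minibatch(i))       # reads ahead while the main loop "trains"

q = queue.Queue(maxsize=2)             # keep up to 2 batches ready
threading.Thread(target=prefetch, args=(q, 10), daemon=True).start()

for _ in range(10):
    batch = q.get()                    # usually returns immediately
    time.sleep(0.200)                  # simulate ~200 ms of GPU compute per batch
```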

However, I recommend an SSD for comfort and productivity: Programs start and respond more quickly, and pre-processing with large files is quite a bit faster. If you buy an NVMe SSD you will have an even smoother experience when compared to a regular SSD.

Thus the ideal setup is to have a large and slow hard drive for datasets and an SSD for productivity and comfort.

Power supply unit (PSU)

Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time; so while other components will need to be replaced, a PSU should last a long while so a good PSU is a good investment.

You can calculate the required watts by adding up the watts of your CPU and GPUs with an additional 10% for other components and as a buffer for power spikes. For example, if you have 4 GPUs with 250 watts TDP each and a CPU with 150 watts TDP, then you will need a PSU with a minimum of 4×250 + 150 + 100 = 1250 watts. I would usually add another 10% just to be sure everything works out, which in this case would result in a total of 1375 watts. I would round up in this case and get a 1400 watt PSU.
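
The same calculation as a small helper; the two 10% steps follow the rule of thumb above, so it lands near the worked example (differences come only from rounding):

```python
def required_psu_watts(gpu_tdps, cpu_tdp):
    base = sum(gpu_tdps) + cpu_tdp     # raw component TDPs
    buffered = base * 1.10             # +10% for other components and power spikes
    return buffered * 1.10             # +10% more "just to be sure"

# 4 x 250 W GPUs and a 150 W CPU -> ~1392 W, so round up to a 1400 W PSU.
print(required_psu_watts([250] * 4, 150))
```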

One important part to be aware of is that even if a PSU has the required wattage, it might not have enough PCIe 8-pin or 6-pin connectors. Make sure you have enough connectors on the PSU to support all your GPUs!

Another important thing is to buy a PSU with a high power efficiency rating – especially if you run many GPUs and will run them for a long time.

Running a 4 GPU system on full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany – with rather high power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price assumes 100% efficiency, then training such a net with an 80% efficient power supply would increase the costs by an additional 18-26€ – ouch! This is much less for a single GPU, but the point still holds – spending a bit more money on an efficient power supply makes good sense.
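
Spelling out the cost arithmetic (using the text's German price of 0.20 EUR/kWh; note that an 80%-efficient PSU draws 25% more from the wall for the same load):

```python
watts, hours = 1250, 24 * 14        # a full-power 4 GPU rig running for two weeks
kwh = watts * hours / 1000          # = 420 kWh, inside the 300-500 kWh range above
cost_ideal = kwh * 0.20             # ~84 EUR with a (hypothetical) 100% efficient PSU
cost_80 = kwh / 0.80 * 0.20         # ~105 EUR at 80% efficiency -> ~21 EUR extra
print(f"{kwh:.0f} kWh -> {cost_ideal:.0f} EUR vs {cost_80:.0f} EUR")
```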

Using a couple of GPUs around the clock will significantly increase your carbon footprint and it will overshadow transportation (mainly airplane) and other factors that contribute to your footprint. If you want to be responsible, please consider going carbon neutral like the NYU Machine Learning for Language Group (ML2) — it is easy to do, cheap, and should be standard for deep learning researchers.

CPU and GPU Cooling

Cooling is important, and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink or an all-in-one (AIO) water cooling solution for your CPU, but for your GPU you will need to make special considerations.

Air Cooling GPUs

Air cooling is safe and solid for a single GPU, or if you have multiple GPUs with space between them (2 GPUs in a 3-4 GPU case). However, one of the biggest mistakes is made when you try to cool 3-4 GPUs, and you need to think carefully about your options in this case.

Modern GPUs will increase their speed – and thus power consumption – up to their maximum when they run an algorithm, but as soon as the GPU hits a temperature barrier – often 80 °C – the GPU will decrease the speed so that the temperature threshold is not breached. This enables the best performance while keeping your GPU safe from overheating.

However, typical pre-programmed fan schedules are badly designed for deep learning programs, so that this temperature threshold is reached within seconds after starting a deep learning program. The result is decreased performance (0-10%), which can be significant for multiple GPUs (10-25%) where the GPUs heat each other up.

Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for Windows. You can change the fan schedule with a few clicks in Windows, but not so in Linux, and as most deep learning libraries are written for Linux this is a problem.

The only option under Linux is to set a configuration for your Xorg server (Ubuntu) where you set the "coolbits" option. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings – I could never get it running properly on headless GPUs.

The most important point of consideration if you run 3-4 GPUs on air cooling is to pay attention to the fan design. The "blower" fan design pushes the air out the back of the case so that fresh, cooler air is pushed into the GPU. Non-blower fans suck in air from the vicinity of the GPU and cool the GPU. However, if you have multiple GPUs next to each other then there is no cool air around, and GPUs with non-blower fans will heat up more and more until they throttle themselves down to reach cooler temperatures. Avoid non-blower fans in 3-4 GPU setups at all costs.

Water Cooling GPUs For Multiple GPUs

Another, more costly and craftier option is to use water cooling. I do not recommend water cooling if you have a single GPU or if you have space between your two GPUs (2 GPUs in a 3-4 GPU board). However, water cooling makes sure that even the beefiest GPUs stay cool in a 4 GPU setup, which is not possible when you cool with air. Another advantage of water cooling is that it operates much more silently, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU plus some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.

A Big Case for Cooling?

I bought large towers for my deep learning cluster because they have additional fans for the GPU area, but I found this to be largely irrelevant: about a 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU — do not select an expensive case for its GPU cooling capability. Go cheap here. The case should fit your GPUs, but that's it!

Conclusion Cooling

So in the end it is simple: for 1 GPU, air cooling is best. For multiple GPUs, you should either get blower-style air cooling and accept a tiny performance penalty (10-15%), or pay extra for water cooling, which is more difficult to set up correctly but has no performance penalty. Air and water cooling are both reasonable choices in certain situations. I would, however, recommend air cooling for simplicity in general — get a blower-style GPU if you run multiple GPUs. If you want to use water cooling, try to find all-in-one (AIO) water cooling solutions for GPUs.

Motherboard

Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so buy a motherboard that has enough space between PCIe slots if you intend to use multiple GPUs. Make sure your motherboard not only has the PCIe slots, but actually supports the GPU setup that you want to run. You can usually find this information if you search for your motherboard of choice on Newegg and look at the PCIe section of the specification page.

Computer Case

When you select a case, you should make sure that it supports full-length GPUs that sit on top of your motherboard. Most cases support full-length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a Google image search of that model and see if you find pictures with GPUs in them.

If you use custom water cooling, make sure your case has enough space for the radiators. This is especially true if you use water cooling for your GPUs. The radiator of each GPU will need some space — make sure your setup actually fits into the case.

Monitors

I first thought it would be silly to write about monitors also, but they make such a huge difference and are so important that I just have to write about them.

The money I spent on my three 27-inch monitors is probably the best money I have ever spent. Productivity goes up by a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?

Typical monitor layout when I do deep learning: left: papers, Google searches, Gmail, StackOverflow; middle: code; right: output windows, R, folders, system monitors, GPU monitors, to-do list, and other small applications.

Some words on building a PC

Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple, as components that do not belong together do not fit together. The motherboard manual is often very specific about how to assemble everything, and there are tons of guides and step-by-step videos which walk you through the process if you have no experience.

The great thing about building a computer is that you know everything there is to know about building a computer once you have done it, because all computers are built in the very same way – so building a computer becomes a life skill that you will be able to apply again and again. So there is no reason to hold back!

Conclusion / TL;DR

GPU: RTX 2070 or RTX 2080 Ti. GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are good too!
CPU: 1-2 cores per GPU depending on how you preprocess data. > 2 GHz; the CPU should support the number of GPUs that you want to run. PCIe lanes do not matter.

RAM:
– Clock rates do not matter — buy the cheapest RAM.
– Buy at least enough CPU RAM to match the RAM of your largest GPU.
– Buy more RAM only when needed.
– More RAM can be useful if you frequently work with large datasets.

Hard drive/SSD:
– Hard drive for data (>= 3TB)
– Use SSD for comfort and preprocessing small datasets.

PSU:
– Add up the watts of GPUs + CPU. Then multiply the total by 110% for the required wattage.
– Get a high efficiency rating if you use multiple GPUs.
– Make sure the PSU has enough PCIe connectors (6-pin and 8-pin)

Cooling:
– CPU: get a standard CPU cooler or an all-in-one (AIO) water cooling solution
– GPU:
– Use air cooling
– Get GPUs with "blower-style" fans if you buy multiple GPUs
– Set the coolbits flag in your Xorg config to control fan speeds

Motherboard:
– Get as many PCIe slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)

Monitors:
– An additional monitor might make you more productive than an additional GPU.

 
