Ever since Google's TPU (Tensor Processing Unit) came out, people have been speculating about its architecture and performance. Google's paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" finally gives us a chance to see for ourselves.
First, let's look at the abstract:
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC—called a Tensor Processing Unit (TPU)—deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, …) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU’s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
This abstract is packed with information; I have highlighted the key points. First, the TPU chip targets datacenter inference applications. Its heart is a matrix multiply unit built from 65,536 8-bit MACs, with a peak throughput of 92 TeraOps/second (TOPS), plus a large on-chip memory totaling 28 MiB. It supports common NN workloads such as MLPs, CNNs, and LSTMs, as well as the TensorFlow framework. The abstract also tells us that the TPU omits all the techniques used by conventional CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching), because its target applications fit a deterministic execution model; this is also why it can be so efficient. On average its performance (TOPS) is 15 to 30 times that of a contemporary CPU or GPU, and its energy efficiency (TOPS/Watt) is 30 to 80 times higher. If the TPU used the GPU's GDDR5 memory, achieved TOPS would triple, and TOPS/Watt would rise to nearly 70 times the GPU and 200 times the CPU.
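As a quick sanity check on the headline 92 TOPS (using the 700 MHz clock reported in the body of the paper, not in the abstract), the number falls straight out of the MAC count:

```python
# Peak-throughput sanity check for the 92 TOPS figure.
# Assumes the 700 MHz clock given in the body of the paper.
macs = 256 * 256          # 65,536 8-bit MACs in the matrix multiply unit
clock_hz = 700e6          # 700 MHz
ops_per_mac = 2           # each MAC does one multiply and one add per cycle
peak_tops = macs * clock_hz * ops_per_mac / 1e12
print(f"{peak_tops:.1f} TOPS")  # 91.8, i.e. the ~92 TOPS quoted above
```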
From this we can already see that Google's TPU is a dedicated processor, a hardware-accelerator architecture, which is far more radical than the previously imagined approach of tweaking a GPU architecture; naturally, it is also far more efficient.
Overall Architecture
The figure above is the TPU's architecture block diagram. In the paper's words, the goal of the chip was:
"The goal was to run whole inference models in the TPU to reduce interactions with the host CPU and to be flexible enough to match the NN needs of 2015 and beyond, instead of just what was required for 2013 NNs."
The architecture details are as follows:
"TPU instructions are sent from the host over the PCIe Gen3 x16 bus into an instruction buffer. The internal blocks are typically connected together by 256-byte-wide paths. Starting in the upper-right corner, the Matrix Multiply Unit is the heart of the TPU. It contains 256x256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers. The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit; the 4 MiB represents 4096 256-element, 32-bit accumulators. The matrix unit produces one 256-element partial sum per clock cycle. We picked 4096 by first noting that about 1350 operations per byte are needed to reach peak performance, rounding that up to 2048, and then duplicating it so that the compiler could use double buffering while running at peak speed."
"When using a mix of 8-bit weights and 16-bit activations (or vice versa), the Matrix Unit computes at half speed, and it computes at a quarter of the speed when both are 16 bits. It reads and writes 256 values per clock cycle and can perform either a matrix multiply or a convolution. The matrix unit holds one 64 KiB tile of weights and uses double buffering (so that a new tile can be loaded while the current one is being processed). The unit is designed for dense matrices; architectural support for sparsity was omitted for deployment-schedule reasons. Sparsity will have high priority in future designs."
"The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM. Since this is an inference chip, the weights are read-only; 8 GiB can hold many models at once. The Weight FIFO is four tiles deep. The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as input to the Matrix Unit. A programmable DMA engine transfers data between CPU host memory and the Unified Buffer."
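To make these pieces concrete, here is a minimal functional sketch of the dataflow just described: a block of 8-bit activations from the Unified Buffer is multiplied against 256x256 weight tiles streamed through the Weight FIFO, and the 256-element partial sums collect in 32-bit accumulators. This models only the arithmetic, not the hardware; the tile and accumulator widths come from the quote, everything else is illustrative.

```python
import numpy as np

TILE = 256  # the matrix unit is 256x256 MACs

def tpu_matmul(acts, weights):
    """Functional sketch of the dataflow described above (not the hardware).

    acts:    (B, K) int8 activations from the Unified Buffer, K a multiple of 256.
    weights: (K, 256) int8 weight matrix, streamed through the Weight FIFO
             as 256x256 (64 KiB) tiles.
    Partial sums accumulate in 32-bit accumulators (the 4 MiB memory above).
    """
    B, K = acts.shape
    acc = np.zeros((B, TILE), dtype=np.int32)        # 32-bit accumulators
    for k in range(0, K, TILE):                      # one weight tile per step
        a = acts[:, k:k+TILE].astype(np.int32)       # widen the 8-bit operands
        w = weights[k:k+TILE, :].astype(np.int32)
        acc += a @ w                                 # 256-element partial sums
    return acc

acts = np.random.randint(-128, 128, (8, 512), dtype=np.int8)
weights = np.random.randint(-128, 128, (512, 256), dtype=np.int8)
print(tpu_matmul(acts, weights).shape)  # (8, 256)
```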
The figure below shows the floorplan of the TPU die, which gives a rough sense of how much area each block takes up.
Instruction Set
According to the authors, the TPU's instruction set follows the CISC tradition, with an average of 10 to 20 clock cycles per instruction (CPI). There are only about a dozen instructions in total; the most important ones are:
1. Read_Host_Memory reads data from the CPU host memory into the Unified Buffer (UB).
2. Read_Weights reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit.
3. MatrixMultiply/Convolve causes the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators. A matrix operation takes a variable-sized B*256 input, multiplies it by a 256x256 constant weight input, and produces a B*256 output, taking B pipelined cycles to complete.
4. Activate performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on. Its inputs are the Accumulators, and its output is the Unified Buffer. It can also perform the pooling operations needed for convolutions using the dedicated hardware on the die, as it is connected to nonlinear function logic.
5. Write_Host_Memory writes data from the Unified Buffer into the CPU host memory.
The other instructions include:
The other instructions are alternate host memory read/write, set configuration, two versions of synchronization, interrupt host, debug-tag, nop, and halt. The CISC MatrixMultiply instruction is 12 bytes, of which 3 are Unified Buffer address; 2 are accumulator address; 4 are length (sometimes 2 dimensions for convolutions); and the rest are opcode and flags.
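The field widths in that last sentence add up to the 12 bytes exactly (3 + 2 + 4, plus 3 for opcode and flags), which is easy to mock up. The sketch below is purely illustrative: only the field sizes come from the paper, while the field order, opcode value, and endianness are my own guesses.

```python
def encode_matrix_multiply(ub_addr, acc_addr, length, opcode=0x2, flags=0):
    """Hypothetical packing of the 12-byte MatrixMultiply instruction.

    Field sizes follow the paper (3B Unified Buffer address, 2B accumulator
    address, 4B length, 3B opcode + flags); the layout, opcode value, and
    endianness are invented for illustration.
    """
    assert ub_addr < 1 << 24 and acc_addr < 1 << 16 and length < 1 << 32
    return (opcode.to_bytes(1, "little") +
            flags.to_bytes(2, "little") +
            ub_addr.to_bytes(3, "little") +    # Unified Buffer address
            acc_addr.to_bytes(2, "little") +   # accumulator address
            length.to_bytes(4, "little"))      # length (or 2 conv dimensions)

insn = encode_matrix_multiply(ub_addr=0x1000, acc_addr=0x40, length=128)
assert len(insn) == 12
```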
This is a very specialized instruction set, consisting almost entirely of compute and memory-access instructions; from another angle, it shows just how domain-specific a processor the TPU is.
Microarchitecture
First, the most important sentence: "The philosophy of the TPU microarchitecture is to keep the matrix unit busy." I suspect this is also the philosophy of everyone designing NN accelerators.
"The TPU uses a 4-stage pipeline for these CISC instructions, where each instruction executes in a separate stage. The plan is to hide the execution of the other instructions by overlapping them with the MatrixMultiply instruction. Toward that end, the Read_Weights instruction follows the decoupled-access/execute philosophy: it can complete after sending its address but before the weights have been fetched from Weight Memory. The matrix unit will stall if the input activation or weight data is not ready."
"Because a TPU CISC instruction can occupy a pipeline stage for thousands of clock cycles, unlike the traditional RISC pipeline with one clock cycle per stage, the TPU does not have clean pipeline overlap diagrams. An interesting case occurs when the activations for one layer of an NN must complete before the matrix multiplications of the next layer can begin; we see a "delay slot," where the matrix unit waits for an explicit synchronization signal before it can safely read from the Unified Buffer."
Because reading a large SRAM costs much more energy than arithmetic, the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer. The figure below shows data flowing in from the left and weights being loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined, which gives the illusion that the 256 inputs are read all at once and instantly update one location in each of the 256 accumulators. From a correctness standpoint, software is unaware of the systolic nature of the matrix unit; for performance, however, the unit's latency is something software has to take into account.
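To see why the skewed inputs and preloaded weights still produce the right answer, here is a toy cycle-level model of a small weight-stationary systolic array: activations enter from the left one diagonal at a time, partial sums flow downward, and each output column drops into its accumulator as the wavefront reaches the bottom row. Only the wavefront idea is faithful to the TPU; the real unit is 256x256 and fully pipelined.

```python
import numpy as np

def systolic_matvec(W, x):
    """Toy model of the diagonal wavefront described above.

    W (N x N) is preloaded into the cells and stays put; x enters from the
    left edge, skewed one cycle per row. Activations move right, partial
    sums move down. Returns x @ W, i.e. out[j] = sum_i x[i] * W[i, j].
    """
    N = W.shape[0]
    act = np.zeros((N, N), dtype=np.int64)   # activation latched in each cell
    psum = np.zeros((N, N), dtype=np.int64)  # partial sum latched in each cell
    out = np.zeros(N, dtype=np.int64)        # accumulators below the array
    for cycle in range(2 * N):               # enough cycles to drain the wave
        new_act = np.zeros_like(act)
        new_psum = np.zeros_like(psum)
        for i in range(N):
            for j in range(N):
                # left edge feeds x[i] at cycle i; interior cells read neighbors
                a_in = act[i, j - 1] if j > 0 else (x[i] if cycle == i else 0)
                p_in = psum[i - 1, j] if i > 0 else 0
                new_act[i, j] = a_in                     # activation moves right
                new_psum[i, j] = p_in + W[i, j] * a_in   # partial sum moves down
        act, psum = new_act, new_psum
        for j in range(N):
            if cycle == N - 1 + j:           # column j completes on this cycle
                out[j] = psum[N - 1, j]
    return out

W = np.arange(16).reshape(4, 4)
x = np.array([1, 2, 3, 4])
assert np.array_equal(systolic_matvec(W, x), x @ W)
```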
Software
"The TPU software stack had to be compatible with those developed for CPUs and GPUs so that applications could be ported to the TPU quickly. The portion of the application run on the TPU is typically written in TensorFlow and is compiled into an API that can run on GPUs or TPUs. Like GPUs, the TPU stack is split into a user space driver and a kernel driver. The kernel driver is lightweight and handles only memory management and interrupts; it is designed for long-term stability. The user space driver changes frequently: it sets up and controls TPU execution, reformats data into TPU order, and translates API calls into TPU instructions, turning them into an application binary. The user space driver compiles a model the first time it is evaluated, caching the program image and writing the weight image into the TPU's weight memory; the second and following evaluations run at full speed. The TPU runs most models completely from inputs to outputs, maximizing the ratio of TPU compute time to I/O time. Computation is often done one layer at a time, with overlapped execution allowing the matrix multiply unit to hide most non-critical-path operations."
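The compile-once, run-many behavior of the user space driver is worth a tiny illustration. Everything below is invented for the sketch; the real driver API is not public.

```python
# Hypothetical sketch of the compile-once, run-many behavior described
# above. Every name here is invented; the real driver API is not public.

def compile_to_tpu_program(model):
    """Stand-in for "translates API calls into TPU instructions"."""
    return ("program-image-for", model["name"])

class TpuUserDriver:
    def __init__(self):
        self.program_cache = {}   # model name -> cached program image
        self.weight_memory = {}   # stand-in for the 8 GiB Weight Memory

    def evaluate(self, model, inputs):
        name = model["name"]
        if name not in self.program_cache:
            # First evaluation: compile and cache the program image, and
            # write the weight image into the TPU's weight memory.
            self.program_cache[name] = compile_to_tpu_program(model)
            self.weight_memory[name] = model["weights"]
        # Second and later evaluations reuse the cached image at full speed.
        program = self.program_cache[name]
        return f"ran {program[1]} on {len(inputs)} inputs"

driver = TpuUserDriver()
model = {"name": "mlp0", "weights": [0.1, 0.2]}
print(driver.evaluate(model, [1, 2, 3]))  # compiles, then runs
print(driver.evaluate(model, [4, 5]))     # cache hit: runs immediately
```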
The latter part of the paper is largely devoted to performance comparisons between the TPU, CPUs, and GPUs; it is worth reading carefully, and perhaps a topic for another day. Honestly, the fact that Google was already deploying an ASIC tensor processor like this in 2015 commands real respect.
T.S.
References:
1. Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," accepted by ISCA 2017.