Introduction to New AI Chips (3): Tenstorrent - Zhihu

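Today we look at Tenstorrent's chip. Tenstorrent is a fairly new startup and has not published a proper paper, but we can piece together the details of the chip from information gathered in various places.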

https://www.tenstorrent.com/wp-content/uploads/2020/04/Tenstorrent-Scales-AI-Performance.pdf

https://www.youtube.com/watch?v=ME-6uxSoVm0

Tenstorrent's main products are built on these architectures:

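The biggest difference between Tenstorrent and other architectures is the number of MAC cores. Tenstorrent has a full 120 cores, and each of them is much smaller than the cores in the TPU, Hanguang, or Groq chips we looked at earlier. The architecture looks roughly like this: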

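The purple "CPU" in this figure is not the kind of CPU found in our computers, but a very small RISC core. Small cores have one big advantage: conditional computation. Compared with other players, this chip also has a lower TDP.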

At a peak rate of 368 TOPS, the chip runs on just 65W

Each small core contains roughly a thousand int8 MACs in total (for example 32*32), though fp16 and bf16 are supported as well.

Tenstorrent withheld further details, but to achieve the 3-TOPS rating, the tensor engine likely contains about a thousand 8-bit MAC units
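As a rough sanity check, the quoted figures can be tied together with a little arithmetic. The sketch below assumes that each MAC counts as two operations (a multiply and an add) and that a core holds exactly 32*32 MACs. Both are assumptions for illustration, and the clock frequency it derives is implied by them rather than stated by Tenstorrent.

```python
# Back-of-the-envelope check of the quoted throughput figures.
# Assumptions (not confirmed by Tenstorrent): 1 MAC = 2 ops, 32*32 MACs per core.

PEAK_TOPS = 368          # quoted peak rate for the whole chip
TDP_WATTS = 65           # quoted power
NUM_CORES = 120          # number of small cores
MACS_PER_CORE = 32 * 32  # assumed: "about a thousand" 8-bit MACs
OPS_PER_MAC = 2          # assumed: one multiply plus one accumulate


def tops_per_core(peak_tops, cores):
    """Split the chip-level peak evenly across the cores."""
    return peak_tops / cores


def implied_clock_ghz(core_tops, macs, ops_per_mac):
    """Clock rate needed for one core to reach core_tops, in GHz."""
    ops_per_cycle = macs * ops_per_mac
    return core_tops * 1e12 / ops_per_cycle / 1e9


per_core = tops_per_core(PEAK_TOPS, NUM_CORES)
clock_ghz = implied_clock_ghz(per_core, MACS_PER_CORE, OPS_PER_MAC)
efficiency = PEAK_TOPS / TDP_WATTS

print(f"TOPS per core:  {per_core:.2f}")
print(f"Implied clock:  {clock_ghz:.2f} GHz")
print(f"TOPS per watt:  {efficiency:.2f}")
```

Under these assumptions the per-core figure lands close to the 3-TOPS rating in the quote, which suggests that "about a thousand MACs" and "120 cores" are consistent with the 368 TOPS peak.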

They also support something similar to rowwise quantization, in which a group of numbers shares one scale.

To save memory space, the design implements a block FP format in which groups of 16 values share the same 8-bit exponent. Tensix defines block FP formats with 8-, 4-, or 2- bit mantissas, trading off throughput for precision. Once the core loads values from memory, it expands them to FP16 before any computation.
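To make the block FP idea concrete, here is a minimal sketch of a format in which 16 values share one exponent. The rounding rule, the use of one mantissa bit for the sign, and the function names are all assumptions for illustration; the quote does not describe the actual Tensix encoding. Decoding to Python floats stands in for the expansion to FP16 mentioned in the quote.

```python
import math

BLOCK_SIZE = 16    # values per shared exponent, as in the quoted description
EXPONENT_BITS = 8  # shared exponent width, as in the quoted description


def _scale(shared_exp, mantissa_bits):
    # Assumed layout: one mantissa bit holds the sign, the rest the magnitude.
    levels = 2 ** (mantissa_bits - 1) - 1
    return levels, 2.0 ** shared_exp / levels


def encode_block(values, mantissa_bits):
    """Quantize one block into (shared_exponent, integer mantissas)."""
    assert len(values) == BLOCK_SIZE
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * BLOCK_SIZE
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so every |v| < 2**e.
    _, shared_exp = math.frexp(max_abs)
    levels, scale = _scale(shared_exp, mantissa_bits)
    # The clamp only guards against floating-point edge cases.
    mantissas = [max(-levels, min(levels, round(v / scale))) for v in values]
    return shared_exp, mantissas


def decode_block(shared_exp, mantissas, mantissa_bits):
    """Expand a block back to ordinary floats before computing on it."""
    _, scale = _scale(shared_exp, mantissa_bits)
    return [m * scale for m in mantissas]


def bits_per_value(mantissa_bits):
    """Storage cost per value with the exponent amortized over the block."""
    return mantissa_bits + EXPONENT_BITS / BLOCK_SIZE


block = [i / 8 - 1 for i in range(BLOCK_SIZE)]
for bits in (8, 4, 2):
    exp, mants = encode_block(block, bits)
    decoded = decode_block(exp, mants, bits)
    worst = max(abs(a - b) for a, b in zip(block, decoded))
    print(f"{bits}-bit mantissa: {bits_per_value(bits)} bits/value, "
          f"max error {worst:.4f}")
```

The trade-off shows up directly: narrower mantissas cost fewer bits per value but give a larger worst-case error within the block.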

Once we have these working small cores, we can chain them together.

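Four Synopsys ARC CPUs organize the work of the 120 small cores, with 16G of DRAM in total and 16x PCIe. Note that this memory is LPDDR, which certainly cannot compete with the HBM on the TPU or Volta. Communication between the small cores is also handled by a dedicated module.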

Although the compute unit can operate only on the local memory, each core can easily access data in other cores using the network-on-a-chip (NoC) interconnect.

The rough logic of the NoC is shown in the figure below:

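So this NoC is similar to the TPU's ICI, but as far as I can tell it focuses mainly on communication inside the chip and should be much smaller. My guess is that it lacks the particularly complex routing logic of something like the TPU and cannot handle communication between different chips. The NoC is also responsible for compression; judging from the description, the scheme should be something like varint.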

The packet engine implements hardware data compression. It compresses data before transferring it across the NoC. Depending on the number of zeroes in the data, this compression typically shrinks the data by 50–75%, but the percentage can be even greater on sparse data.
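The quote does not say which compression scheme the packet engine uses. To show why the savings depend on the number of zeroes, here is a minimal sketch of one simple zero-suppression scheme: every group of 8 bytes is replaced by a 1-byte bitmask plus the nonzero bytes only. This scheme and the function names are assumptions for illustration, not Tenstorrent's actual format.

```python
GROUP = 8  # bytes covered by one mask byte (an assumed parameter)


def compress(data: bytes) -> bytes:
    """Emit a mask byte per group, followed by that group's nonzero bytes."""
    out = bytearray()
    for start in range(0, len(data), GROUP):
        chunk = data[start:start + GROUP]
        mask = 0
        nonzero = bytearray()
        for i, b in enumerate(chunk):
            if b != 0:
                mask |= 1 << i
                nonzero.append(b)
        out.append(mask)
        out += nonzero
    return bytes(out)


def decompress(blob: bytes, original_len: int) -> bytes:
    """Rebuild the original bytes, reinserting zeroes where the mask bit is 0."""
    out = bytearray()
    pos = 0
    while len(out) < original_len:
        mask = blob[pos]
        pos += 1
        for i in range(min(GROUP, original_len - len(out))):
            if mask & (1 << i):
                out.append(blob[pos])
                pos += 1
            else:
                out.append(0)
    return bytes(out)


def savings(data: bytes) -> float:
    """Fraction of bytes saved; negative if the data grew."""
    return 1 - len(compress(data)) / len(data)


sparse = bytes([5, 0, 0, 0] * 16)  # 75% zeroes
dense = bytes(range(1, 65))        # no zeroes at all
assert decompress(compress(sparse), len(sparse)) == sparse
print(f"75% zeroes: {savings(sparse):.1%} saved")
print(f"no zeroes:  {savings(dense):.1%} saved")
```

With this scheme, data without zeroes actually grows slightly because of the mask bytes, so the benefit comes entirely from sparsity, which matches the spirit of the quote.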

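The interconnect topology is a 2D torus, which I introduced earlier in the TPU post. The difference is that here the 2D torus connects small cores to each other, whereas on the TPU it connects cards.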
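For readers who have not met a 2D torus before, the sketch below shows its defining property: each node has four neighbors, and the edges wrap around, which shortens the worst-case hop count. The 10 x 12 grid is an assumed layout chosen only because it gives 120 cores; the sources cited here do not state the actual arrangement.

```python
ROWS, COLS = 10, 12  # assumed layout; 10 * 12 = 120 cores


def neighbors(row, col, rows=ROWS, cols=COLS):
    """Return the up, down, left, and right neighbors with wrap-around."""
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]


def hop_distance(a, b, rows=ROWS, cols=COLS):
    """Minimum number of hops between two cores on the torus."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    # On each axis, go whichever way around the ring is shorter.
    return min(dr, rows - dr) + min(dc, cols - dc)


print(neighbors(0, 0))               # corner core still has four neighbors
print(hop_distance((0, 0), (9, 11)))  # opposite corner is close thanks to wrap
print(hop_distance((0, 0), (5, 6)))   # farthest core from (0, 0)
```

On a plain mesh of the same size, opposite corners would be 20 hops apart; with wrap-around links the farthest core is 11 hops away.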

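On the parallelism side, the logic Tenstorrent needs is much more complex than on other chips. In a many-core setting like this, the upper bound on parallel processing is certainly higher than with a few large cores, but it also requires much more sophisticated cooperation from the compiler. The video gives an example: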

On the software side, either ONNX or PyTorch can serve as the front end.

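Here you can see the sophisticated compiler cooperation I mentioned. The slides look very polished for now, but I expect the compiler code to be extremely complex, since it needs a variety of execution plans to optimize different models.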

A comparison of Tenstorrent with Groq and Titan:

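IPS/Watt looks roughly like this. Hanguang is still the highest, but Tenstorrent does quite well by comparison. Also, as I recall, Hanguang has special optimizations for convolution and is not purely a systolic array.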

As for when it will go on sale, the article says around 2021.

Personal thoughts

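Tenstorrent makes this many-core architecture sound very appealing, but the TPU paper already discussed the trade-offs between large-core and small-core designs. If you are not familiar with it, see my earlier summary: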

陳宇飛: Introduction to New AI Chips (2): TPUv2/v3 (zhuanlan.zhihu.com)

The TPUv3 paper mentions:

Sixteen 64x64 MXUs would have a little higher utilization (38%–52%) but would need more area. The reason is the MXU area is determined either by the logic for the multipliers or by the wires on its perimeter for the inputs, outputs, and control. In our technology, for 128x128 and larger the MXU's area is limited by the multipliers but area for 64x64 and smaller MXUs is limited by the I/O and control wires.
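One intuition for why smaller arrays can reach higher utilization is padding: a matrix that does not divide evenly into array-sized tiles leaves part of the array idle. The toy model below captures only that effect. It is an assumption-laden illustration, it ignores the area and wiring costs the quote describes, and it does not reproduce the paper's 38%–52% figures.

```python
import math


def padded_utilization(k, n, array_size):
    """Fraction of MACs doing useful work when a k x n weight matrix
    is tiled onto square arrays of the given size (toy model)."""
    tiles_k = math.ceil(k / array_size)
    tiles_n = math.ceil(n / array_size)
    padded = tiles_k * tiles_n * array_size * array_size
    return (k * n) / padded


# Sixteen 64x64 arrays hold the same number of MACs as four 128x128 arrays.
assert 16 * 64 * 64 == 4 * 128 * 128

for size in (128, 64):
    util = padded_utilization(192, 192, size)
    print(f"192x192 weights on {size}x{size} arrays: {util:.1%} utilized")
```

In this example a 192x192 matrix wastes a large share of a 128x128 array but fits 64x64 arrays exactly. Other shapes favor the larger array equally well, which is consistent with the point that core size has to be judged case by case.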

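So smaller cores are not automatically better; it depends on the specific problem. To me, the advantage of very small cores is not necessarily obvious, especially for recommendation systems or image processing at large companies, because when the batch size is too small, separate requests can be coalesced together. That assumption may not hold in a car, of course, but it does hold in the cloud.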

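Another concern with small cores is that, because parallel optimization matters so much, it is unclear whether they can adapt to many kinds of models. That remains to be validated by the market.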

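Still, the advantages are clear: conditional computation as a feature is definitely an excellent innovation. Conditional execution is not very common in today's models, and I think a big reason is that hardware supports it poorly, so it saves very little. The most common conditional-style logic is still Mixture of Experts, but its granularity is much coarser. The more hardware of this kind that supports conditional computation appears, the more it will help model builders innovate. I look forward to seeing what surprises Tenstorrent brings once the chip goes on sale.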
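To illustrate what conditional computation means at the Mixture of Experts granularity mentioned above, here is a minimal top-1 routing sketch in plain Python. A small gate scores the experts and only the winning expert is evaluated. The function names and layer shapes are hypothetical, and this says nothing about how Tenstorrent exposes the feature in hardware or software.

```python
def dense_layer(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def top1_moe(x, gate_weights, experts):
    """Run the gate, then evaluate only the highest-scoring expert."""
    scores = dense_layer(gate_weights, x)
    chosen = max(range(len(scores)), key=scores.__getitem__)
    # The other experts are skipped entirely: this is the conditional part.
    return chosen, dense_layer(experts[chosen], x)


def conditional_macs(num_experts, d_in, d_out):
    """MACs for the gate plus a single expert."""
    return num_experts * d_in + d_out * d_in


def dense_macs(num_experts, d_in, d_out):
    """MACs if every expert were evaluated."""
    return num_experts * d_out * d_in


gate = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    [[2.0, 0.0], [0.0, 2.0]],  # expert 0 doubles the input
    [[3.0, 0.0], [0.0, 3.0]],  # expert 1 triples the input
]
print(top1_moe([1.0, 0.0], gate, experts))
print(top1_moe([0.0, 1.0], gate, experts))
print(conditional_macs(8, 1024, 1024), "vs", dense_macs(8, 1024, 1024))
```

The MAC counts show the potential saving, but it is only realized if the hardware can actually skip the unused experts, which is the point made above about hardware support.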