Learning WebGPU (5): A Survey of Modern Graphics API Techniques and WebGPU Support Status

Date: 2019-12-23


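Hello everyone. This article summarizes the key techniques of modern graphics APIs, with a focus on parallelism and the GPU Driven Render Pipeline, and surveys how well WebGPU currently supports each of them.

In addition, this article briefly analyzes real-time ray tracing. It is a direction I am very interested in, and one of the directions in which computer graphics is heading. Later articles in this series will explore it further and implement related demos.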

Previous post:
Learning WebGPU (4): Alpha To Coverage

Next post:
Learning WebGPU (6): Studying the "rotatingCube" Sample

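Contents

Prerequisites

Which APIs count as modern graphics APIs?
DX12, Vulkan, and Metal.
What is the MVP?
It is the Minimum Viable Product of WebGPU.
An MVP version will be released before version 1.0.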

Key techniques

Modern graphics APIs involve a number of key techniques, which are analyzed in turn below.

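Parallelism

To improve utilization of multi-core CPUs and of the GPU, modern graphics APIs provide extensive support for parallelism.

Parallelism involves the following key techniques: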

Multiple Queues

Introduction

To improve GPU utilization, command buffers for different kinds of work can be submitted to three kinds of queues:
graphics queue
copy queue
compute queue

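Work in different queues can execute in parallel on the GPU, which enables Async Compute and improves utilization.

References
Multi-engine synchronization

WebGPU support status

According to the Multiple Queues skeleton proposal, the MVP supports only a single queue: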

what single queue is exposed in the MVP

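Synchronization

Introduction

Three techniques provide synchronization between the CPU and GPU and within the GPU:

semaphores

I have not studied them in depth; in Vulkan they are used to synchronize work between queue submissions.

memory barrier

It orders GPU work that has resource dependencies (so that a read sees the result of an earlier write) without making independent work wait, and it helps avoid race conditions between the CPU and GPU.

Modern graphics APIs are lower level: synchronization work that used to be handled for us by the driver is now handed to the graphics programmer. This is more flexible, but it also increases the programmer's burden.

References
Breaking Down Barriers

fence

It synchronizes the CPU and the GPU.

For how these three techniques relate to each other, see Vulkan Multi-Threading.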

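WebGPU support status

semaphores

Since only a single queue is supported at the moment, semaphores are not needed.

memory barrier

Everyone agreed that memory barriers are hard to get right, so WebGPU handles barriers for us (see Memory barriers investigations, Memory Barriers portability, and The case for passes -> Synchronization and validation). We only need to give WebGPU some hints, such as specifying a buffer's usage.

fence

Fences are supported in a counter-based form.

References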
TimelineFences
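
To make the counter-based fence idea concrete, here is a minimal sketch in plain JavaScript that models it without touching the GPU. The `TimelineFence` class and its method names are hypothetical, chosen for illustration only; this is not the WebGPU API. The point is that the fence holds a monotonically increasing value, and waiters are released once the value they asked for has been reached.

```javascript
// A toy model of a counter-based (timeline) fence. Not a real GPU fence.
class TimelineFence {
  constructor(initialValue = 0) {
    this.completedValue = initialValue;
    this.waiters = []; // pending { value, resolve } entries
  }

  getCompletedValue() {
    return this.completedValue;
  }

  // Resolves once the fence has reached `value`.
  onCompletion(value) {
    if (value <= this.completedValue) return Promise.resolve();
    return new Promise((resolve) => this.waiters.push({ value, resolve }));
  }

  // Called when the (simulated) GPU has finished work up to `value`.
  // Returns how many waiters were released.
  signal(value) {
    if (value <= this.completedValue) {
      throw new Error("fence values must strictly increase");
    }
    this.completedValue = value;
    const ready = this.waiters.filter((w) => w.value <= value);
    this.waiters = this.waiters.filter((w) => w.value > value);
    ready.forEach((w) => w.resolve());
    return ready.length;
  }
}

const fence = new TimelineFence(0);
fence.onCompletion(2).then(() => console.log("work up to value 2 has finished"));
fence.signal(2);
```

A counter is convenient because one fence can track many submissions: each submission signals the next value, and the CPU waits for whichever value it cares about.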

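Multithreading

Introduction

Rendering work that uses a modern graphics API can run on threads:

Update resources on a thread
e.g. update a buffer
Compile shaders in parallel
Create pipeline states in parallel
Create command buffers on threads

See Vulkan Multi-Threading.

WebGPU support status

There are two ways to implement multithreading:

Use the OffscreenCanvas API to separate the main thread from the rendering thread

According to Rendering to OffscreenCanvas on non-yielding workers:
WebGPU supports the OffscreenCanvas API, but it cannot be used in Chrome at the moment.

Create workers and run WebGPU rendering work in them

Create a proposal for multi-worker describes how WebGPU could run rendering work in workers: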

1.Asynchronous texture & buffer uploads
2.Asynchronous shader compilation
3.Asynchronous pipeline state creation
4.Using MTLParallelRenderEncoder
5.Each thread in a thread pool records into its own command buffer

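According to Minutes for GPU Web meeting 2019-08-05 -> Multi threading:
items 1, 2, and 3 are being implemented;
items 4 and 5 will be implemented eventually (no timeline was given).

Based on my investigation so far:
1. Shader compilation and pipeline state creation are currently synchronous, not yet asynchronous.
2. In the WebGPU spec, GPUTexture, GPUBuffer, GPUDevice, GPUComputePipeline, GPURenderPipeline, and GPUShaderModule are Serializable, which means they can be passed to a worker.
Does that mean they can already be used in workers today to achieve items 1, 2, and 3? This needs further verification!

Further reading

How engines wrap multithreading: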
Parallelizing the Naughty Dog Engine using Fibers
Destiny’s Multi-threaded Renderer Architecture

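Memory management

Introduction

As with memory barriers, modern graphics APIs require programmers to manage GPU resources themselves.

See Memory Management in Vulkan™ and DX12 for an overview.

References
Memory Management in Vulkan™ and DX12

WebGPU support status

According to WebGPU as low level graphics API: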

WebGPU compares closest to Metal (probably since Apple is the one that originally proposed it)--both don't require manual memory management while DX12 and Vulkan do

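There is no need to manage memory manually; WebGPU manages it for us.

Deferred rendering

Deferred shading

It consists of two steps:
The first pass iterates over the gameObjects and builds the G-buffer;
the second pass iterates over the lights and uses the G-buffer to compute lighting.

Compared with forward rendering, its advantage is that shading is computed only for the pixels that actually appear on screen, which lowers the complexity from O(M * N) to O(M) + O(N), where M is the number of objects and N the number of lights.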
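
The difference in cost can be illustrated with a toy cost model in JavaScript. This is only a sketch that counts work items; it is not a renderer, and the function names and the scene numbers are made up for illustration. It assumes forward rendering lights every rasterized fragment (including overdrawn ones) with every light, while deferred shading writes each fragment to the G-buffer once and then lights only the visible pixels.

```javascript
// fragmentsPerObject[i] = number of fragments rasterized for object i (overdraw included).

// Forward rendering: every rasterized fragment is lit by every light.
function forwardCost(fragmentsPerObject, lightCount) {
  const fragments = fragmentsPerObject.reduce((sum, n) => sum + n, 0);
  return fragments * lightCount;
}

// Deferred shading: one G-buffer write per fragment, then lighting per visible pixel.
function deferredCost(fragmentsPerObject, lightCount, visiblePixels) {
  const gbufferWrites = fragmentsPerObject.reduce((sum, n) => sum + n, 0);
  return gbufferWrites + visiblePixels * lightCount;
}

// Three objects that cover the same 1000 pixels, lit by 10 lights.
const fragments = [1000, 1000, 1000];
console.log(forwardCost(fragments, 10));        // 30000
console.log(deferredCost(fragments, 10, 1000)); // 13000
```

The gap grows with overdraw and with the number of lights, which is why deferred shading pays off in scenes with many lights.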

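WebGPU support status

Because MRT (multiple render targets) is supported, deferred rendering is supported.

Two optimization directions are worth mentioning:

Optimizing memory access

Investigation: Managing on-chip memory mentions the following:
after the first pass builds the G-buffer, the G-buffer data is moved from on-chip memory to main memory;
when the second pass reads the G-buffer, the data is moved from main memory back to on-chip memory.

Moving the G-buffer data back and forth costs performance.
For this reason, Add render sub-passes proposes adding render sub-passes and reading the G-buffer inside a sub-pass, so that the G-buffer data stays in on-chip memory the whole time between being written and being read.

Minutes for GPU Web meeting 2019-10-28 also discusses this point.

WebGPU may support this optimization in an extension.

For tile-based deferred shading, use a compute shader to cull lights in the second pass, so that only the remaining lights take part in the lighting computation

As DirectX 11 Rendering in Battlefield 3 puts it: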

Hybrid Graphics/Compute shading pipeline:
› Graphics pipeline rasterizes gbuffers for opaque surfaces
› Compute pipeline uses gbuffers, culls lights, computes lighting &
combines with shading
› Graphics pipeline renders transparent surfaces on top
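
The light-culling step can be sketched on the CPU in JavaScript to show the logic that a compute shader would run per tile. This is a simplified illustration, not the Battlefield 3 implementation: it assumes lights have already been projected to screen space as circles (`x`, `y`, `radius` in pixels, all hypothetical field names) and ignores depth bounds.

```javascript
// Returns, for every screen tile, the indices of the lights whose circle overlaps it.
function cullLightsPerTile(lights, width, height, tileSize) {
  const tilesX = Math.ceil(width / tileSize);
  const tilesY = Math.ceil(height / tileSize);
  const tiles = [];
  for (let ty = 0; ty < tilesY; ty++) {
    for (let tx = 0; tx < tilesX; tx++) {
      const minX = tx * tileSize;
      const minY = ty * tileSize;
      const maxX = minX + tileSize;
      const maxY = minY + tileSize;
      const visible = [];
      lights.forEach((light, index) => {
        // Closest point of the tile rectangle to the light center.
        const cx = Math.max(minX, Math.min(light.x, maxX));
        const cy = Math.max(minY, Math.min(light.y, maxY));
        const dx = light.x - cx;
        const dy = light.y - cy;
        if (dx * dx + dy * dy <= light.radius * light.radius) {
          visible.push(index);
        }
      });
      tiles.push(visible); // stored row by row: tile index = ty * tilesX + tx
    }
  }
  return { tilesX, tilesY, tiles };
}
```

During shading, each pixel then loops only over the light list of its own tile instead of over every light in the scene.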

References

Deferred Shading
Optimizing tile-based light culling
DirectX 11 Rendering in Battlefield 3

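Textureless deferred rendering

Introduction

In the first pass of deferred shading, we store the gameObject's geometry data (such as Position and Normal) and material texture data (such as the diffuse color sampled from the diffuse map) in the G-buffer.

With bindless texture support, this can be optimized:

The G-buffer no longer stores material texture data; it stores the UV and the material id instead. In the second pass, the shader uses them to fetch the data from the corresponding material textures

The advantages are:
1. The G-buffer is smaller
2. Textures are sampled only for visible pixels, which reduces the number of samples

This approach also has some problems, although all of them can be solved.
For details, see What is deferred material shading? Will it become popular in the future?:

1. How do you do deferred shading with multiple materials? You cannot run a dynamic branch on every pixel and test the materials one by one. Some have proposed tiling to merge blocks of pixels and then dispatching once, which performs much better. VGPR, SGPR, and LDS occupancy all have to be considered together; leaning too far in any one direction hurts performance.
2. In the end, post effects such as SSAO and SSR still need G-buffer information such as normal and roughness. In practice the trade-offs still have to be weighed.

See also Deferred Texturing: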

What about mip levels, or derivatives?

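The G-buffer does not store geometry data; it stores the primitive ID instead. In the second pass, the vertex data is fetched and the vertex shader is run for every visible pixel

For details, see Deferred Texturing -> Defer All The Things: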

It stores only primitive IDs in its G-buffer; then in a later pass, it fetches vertex data, re-runs the vertex shader per pixel (!), finds the barycentric coordinates of each fragment within its triangle, interpolates the vertex attributes, then finally samples all the textures and does the shading work.

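WebGPU support status

According to the bindless texture analysis later in this article, WebGPU does not currently support bindless textures.
A texture 2D array could perhaps stand in for bindless textures, which would make textureless deferred rendering possible in WebGPU.

References

Deferred Texturing
What is deferred material shading? Will it become popular in the future?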
BINDLESS TEXTURING FOR DEFERRED RENDERING AND DECALS
Modern textureless deferred rendering techniques

GPU Driven Render Pipeline

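Introduction

This technique appears to have been introduced in [Siggraph15] GPU-Driven Rendering Pipelines. The idea is to move rendering work from the CPU to the GPU, reducing CPU-GPU synchronization and data transfer, so that the whole scene can be rendered with a single draw call and GPU utilization improves.

Advantages

Finer-grained visibility determination on the GPU
No need to pass data back and forth between the CPU and GPU

Use cases

Drawing large numbers of static objects
Drawing crowds
Drawing modular, semi-automatically generated content

Main steps

Offline processing
1. Split each gameObject's mesh into multiple clusters

See GPU Driven Pipeline: Toolchain and Advanced Rendering

CPU
1. Perform coarse-grained frustum culling on the gameObjects

2. Use persistent mapped buffers to prepare the data for the GPU

Multiple mapped buffers can be created according to the type of data (for example, one buffer for crowd data and another for all static objects)

3. Use virtual textures to handle textures

Prepare all texture data up front and bind the texture only once

4. Issue a multi draw call through indirect draw, submitting the mapped buffers

WebGPU does not currently support multi draw, so several draw calls have to be issued, each of which uses indirect draw to submit its own mapped buffer

GPU
1. Frustum-cull and occlusion-cull the gameObjects

2. Frustum-cull and occlusion-cull the clusters of the gameObjects

3. Modify the index buffer to generate new index data

According to Proposal: Run all index buffers through a compute shader validator: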

I'm inclined to propose that WebGPU MVP doesn't support index buffers changed on the GPU, since this is quite a bit of headache, but eventually we can do that.
...
In an actual 1.0 release we'll absolutely need to support GPU-generated indices, there is no question here.

The WebGPU MVP will not support modifying index buffers on the GPU; version 1.0 will.

4. Multi draw call

According to ExecuteIndirect investigation:

In order to issue draw calls on the CPU, there must be a synchronization point where the CPU waits for the GPU update to complete. This is particularly devastating for WebGPU, where if the CPU has to wait for the GPU, you miss your implicit present and now you're a frame late. Being able to issue these commands on the GPU directly means the rendering and update steps can be in sync.

Issuing draw calls on the GPU removes the cost of synchronizing the CPU with the GPU.

However, making it an extension seems valuable.

This feature may be supported in a WebGPU extension.

Summary

The GPU Driven Render Pipeline can fetch all mesh data at once and, through virtual textures, all textures, which means the whole scene needs only one draw call

References

[Siggraph15] GPU-Driven Rendering Pipelines
[GDC16] Optimizing the Graphics Pipeline with Compute
Article 1 on GPU Driven Rendering Pipelines by MaxwellGeng on Zhihu
Article 2 on GPU Driven Rendering Pipelines by MaxwellGeng on Zhihu

Now let's go through the concepts and key techniques related to the GPU Driven Render Pipeline:

Approaching zero driver overhead

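This concept (AZDO for short) comes from approaching-zero-driver-overhead, which analyzes how OpenGL can use the GPU to drive CPU-side overhead toward zero. It covers the following aspects:

persistent map buffer

Introduction

This technique reduces the cost of transferring data from the CPU to the GPU.
It consists of the following steps:
1. Map a GPU buffer to the CPU
2. Modify the data of the mapped buffer on the CPU (the mapped buffer lives in shared memory that both the CPU and GPU can access, so a fence is needed to make sure the GPU is not operating on the buffer)
3. Submit the command that modifies the buffer data
4. The GPU executes the command and updates the buffer data

With these steps, new buffer data no longer has to be transferred from the CPU to the GPU, which reduces overhead

References:
Persistent mapped buffers
Persistent Mapped Buffers in OpenGL

WebGPU support status

There are two ways to transfer data from the CPU to the GPU:

1. Call the GPUBuffer->setSubData method
This method performs poorly because the data has to be transferred from the CPU to the GPU (the WebGPU spec does not define this method, but Chrome's WebGPU implementation currently has it)

2. Use the persistent map buffer technique
A few points about this method:
1) No fence is needed
WebGPU provides the GPUBuffer->unmap method, which puts the buffer into the unmapped state so that the GPU can use it.

WebGPU presumably does the fence synchronization for us inside this method.

2) How is a mapped buffer created?
There are two ways:
a) Call GPUDevice->createBufferMapped to create a mapped buffer
Make it easier to upload data into buffers correctly points out:
a buffer created by createBufferMapped increases memory usage, so it needs to be destroyed.

b) Call GPUBuffer->mapReadAsync or mapWriteAsync to turn a buffer into a mapped buffer
Make it easier to upload data into buffers correctly points out that using mapWriteAsync causes some problems: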

in WebGPU, have an implicit present after rAF() returns
...
Using mapWriteAsync() requires you to wait on a promise, so if you do the naive thing and just wait on the promise inside rAF(), you’ll miss your present
...
Could we replace mapWriteAsync returning a Promise with it taking a callback that is guaranteed to execute before any submitted queue bundles are executed?

Here "rAF" means "requestAnimationFrame".

Let's use sample code to illustrate the problem:

function frame(time){
    ...

    const vertexBuffer = device.createBuffer({
        ...
        usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
    });

    vertexBuffer.mapWriteAsync().then((vertexBufferData) => {
        // fill in vertexBufferData

        vertexBuffer.unmap();

        // submit the command that updates the buffer data to the queue
        
        ...
    });

    window.requestAnimationFrame(frame);
}

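Because mapWriteAsync is asynchronous while the frame function itself runs synchronously, frame may already have run several more times (several frames may have passed) by the time unmap is called.
During those frames, other commands may have been submitted to the queue, WebGPU may have sent the queued commands to the GPU, and the GPU may already have executed some of them.

When unmap is called, we expect that the GPU has not yet executed those other commands, but in fact it may have. This leads to out-of-sync errors.

To solve this, one could perhaps use the await keyword so that the rest of frame runs only after mapWriteAsync has resolved.
Sample code: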

async function frame(time){
    ...

    var vertexBufferData = await vertexBuffer.mapWriteAsync();
    
    // fill in vertexBufferData

    vertexBuffer.unmap();

    // submit the command that updates the buffer data to the queue

    ...
}

Here is reference code for the persistent map buffer technique (from Buffer operations).
(The reference code creates the mapped buffer by calling GPUDevice->createBufferMapped):

//Updating data to an existing buffer(destBuffer)
function bufferSubData(device, destBuffer, destOffset, srcArrayBuffer) {
    const byteCount = srcArrayBuffer.byteLength;
    const [srcBuffer, arrayBuffer] = device.createBufferMapped({
        size: byteCount,
        usage: GPUBufferUsage.COPY_SRC
    });
    new Uint8Array(arrayBuffer).set(new Uint8Array(srcArrayBuffer)); // memcpy
    srcBuffer.unmap();

    const encoder = device.createCommandEncoder();
    encoder.copyBufferToBuffer(srcBuffer, 0, destBuffer, destOffset, byteCount);
    const commandBuffer = encoder.finish();
    const queue = device.getQueue();
    queue.submit([commandBuffer]);

    srcBuffer.destroy();
}

References
Make it easier to upload data into buffers correctly
Buffer operations
Minutes for GPU Web meeting 2019-10-21

indirect draw

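Introduction

Taking WebGPU as an example, the draw method has to be given the vertex count, the instance count, and so on, and each call draws only one gameObject (several instances of it can be drawn in a batch):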

void draw(unsigned long vertexCount, unsigned long instanceCount, unsigned long firstVertex, unsigned long firstInstance);

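Indirect draw reads these parameters from a buffer instead. The buffer can hold the vertex count and related data of every gameObject, so with multi draw support many gameObjects (and many instances) can be drawn in one batch; in WebGPU, each drawIndirect call reads one set of parameters starting at indirectOffset: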

void drawIndirect(GPUBuffer indirectBuffer, GPUBufferSize indirectOffset);

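Advantages
1. The buffer data can be modified in a compute shader, which enables GPU culling
2. Fewer submissions are needed to draw the gameObjects
3. Less synchronization overhead between the CPU and GPU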
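
As a rough illustration of what such a buffer contains, the sketch below packs the parameters of several draws into one typed array in JavaScript. It assumes the layout that mirrors the arguments of draw: four consecutive 32-bit unsigned integers per draw (vertexCount, instanceCount, firstVertex, firstInstance). The helper names `packIndirectDraws` and `indirectOffsetOf` are hypothetical.

```javascript
// One draw = 4 x uint32 = 16 bytes.
const DRAW_INDIRECT_STRIDE = 16;

// Packs an array of draw descriptions into the data for an indirect buffer.
function packIndirectDraws(draws) {
  const data = new Uint32Array(draws.length * 4);
  draws.forEach((d, i) => {
    data.set([d.vertexCount, d.instanceCount, d.firstVertex, d.firstInstance], i * 4);
  });
  return data;
}

// Byte offset of the i-th draw, i.e. the value to pass as indirectOffset.
function indirectOffsetOf(drawIndex) {
  return drawIndex * DRAW_INDIRECT_STRIDE;
}

const indirectData = packIndirectDraws([
  { vertexCount: 36, instanceCount: 1, firstVertex: 0, firstInstance: 0 },
  { vertexCount: 6, instanceCount: 100, firstVertex: 36, firstInstance: 0 },
]);
// After uploading indirectData into a GPU buffer (say indirectBuffer), each draw
// would be issued as: passEncoder.drawIndirect(indirectBuffer, indirectOffsetOf(i));
```

Because the parameters live in a buffer, a compute shader can rewrite them (for example, set instanceCount to 0 for a culled object) without any CPU involvement.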

WebGPU support status
Indirect draw/dispatch is supported. For the related discussion, see Indirect draw/dispatch commands investigation

References
What are the advantage of using indirect rendering in OpenGL?
vulkan Indirect drawing
INDIRECT RENDERING : 「A WAY TO A MILLION DRAW CALLS」
Surviving without gl_DrawID

bindless texture and virtual texture

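Bindless textures and virtual textures can be combined so that textures are bound only once.

See the following sections later in this article:
Other -> Bindless Texture
Other -> Virtual Texture

GPU Cull

Culling performed on the GPU.

Implementation approach

1. Create a persistent mapped buffer and draw from it with indirect draw
2. Run the culling in a compute shader and write the draw call data (such as the vertex count) of the surviving gameObjects into that buffer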
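
The two steps above can be sketched on the CPU in JavaScript to show the logic a culling compute shader would carry out. This is an illustration under stated assumptions, not a WebGPU implementation: objects are approximated by bounding spheres, the frustum is given as planes `[nx, ny, nz, d]` whose normals point inward, and the object fields (`center`, `radius`, `vertexCount`, `firstVertex`) are hypothetical.

```javascript
// A sphere is kept if it is not completely behind any frustum plane.
// A point p is in front of a plane when nx*p.x + ny*p.y + nz*p.z + d >= 0.
function sphereInFrustum(center, radius, planes) {
  return planes.every(([nx, ny, nz, d]) =>
    nx * center[0] + ny * center[1] + nz * center[2] + d >= -radius);
}

// Writes the draw arguments of the surviving objects, tightly packed, into `args`
// (4 x uint32 per draw) and returns the number of surviving draws.
function cullToIndirectArgs(objects, planes, args) {
  let count = 0;
  for (const obj of objects) {
    if (!sphereInFrustum(obj.center, obj.radius, planes)) continue;
    // vertexCount, instanceCount, firstVertex, firstInstance
    args.set([obj.vertexCount, 1, obj.firstVertex, 0], count * 4);
    count++;
  }
  return count;
}
```

On the GPU, each thread would test one object and append to the buffer with an atomic counter; the sequential loop here plays that role.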

GPU Lod

LOD performed on the GPU.

I have not studied this closely; readers can consult the following material:
Google search results
GPU based dynamic geometry LOD

Hybrid Render For Real-time Ray Tracing

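Introduction

In the past, ray tracing was used only in offline rendering (for example in CG films, which typically use path tracing to speed up convergence). Now, with the release of DXR (DirectX Raytracing), which adds a ray tracing pipeline and shaders designed specifically for ray tracing, and with new denoising methods (such as the SVGF denoising algorithm or NVIDIA's AI-based denoising SDK), real-time ray tracing has become achievable!

Hybrid rendering

Rendering entirely with ray tracing is too expensive, so the industry currently uses hybrid approaches to achieve real-time ray tracing:
if DXR is supported, use "rasterization pipeline + ray tracing pipeline";
if DXR is not supported, use "rasterization pipeline + compute pipeline (that is, compute shaders)".

Rendering can be decomposed into separate stages, each handled by the pipeline that suits it (for example, rasterization for the G-buffer and direct lighting, ray tracing for shadows and reflections).

WebGPU support status

According to Is there some plan for Ray Tracing?: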

There are not plan for ray-tracing for the forseeable future because WebGPU is meant to be extremely portable and ray-tracing isn't mature yet and is implemented only by a single hardware vendor for now.

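WebGPU does not currently support a ray tracing pipeline, so hybrid rendering can only be implemented with "rasterization pipeline + compute pipeline (that is, compute shaders)".

How to learn and implement ray tracing with WebGPU

The following steps can be followed:
1. Collect material widely to get a first overview of the whole field (see "Learning resources" below)
2. Following Ray Tracing in One Weekend, Ray Tracing: The Next Week, and the accompanying walkthrough, implement ray tracing from scratch in a fragment shader.
For now it is enough to render spheres or cubes; models are not needed.
3. Implement ray tracing in a compute shader
4. Use hybrid rendering (for example, rasterization for the G-buffer and direct lighting, ray tracing for shadows and reflections)
5. Implement a denoising algorithm
Implementing SVGF directly is quite hard. Its sub-steps (such as temporal anti-aliasing, tone mapping, and Edge-Avoiding À-Trous) can be implemented first and then assembled into SVGF
6. Render models
This requires implementing a BVH
7. Research and implement further, exploring directions such as path tracing, better sampling, ray sorting and coherence, and support for more materials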
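
As a starting point for step 2, the core of a sphere-only ray tracer is the ray-sphere intersection test. The sketch below shows it in plain JavaScript so it can be checked on the CPU before porting it to a shader; it follows the usual quadratic-equation formulation, and the helper names are hypothetical.

```javascript
const sub = (a, b) => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const dot = (a, b) => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];

// Returns the distance t along the ray (origin + t * direction) to the nearest
// intersection in front of the origin, or null if the ray misses the sphere.
// The case where the origin lies inside the sphere is deliberately not handled.
function hitSphere(origin, direction, center, radius) {
  const oc = sub(origin, center);
  const a = dot(direction, direction);
  const halfB = dot(oc, direction);
  const c = dot(oc, oc) - radius * radius;
  const discriminant = halfB * halfB - a * c;
  if (discriminant < 0) return null; // no real root: the ray misses
  const t = (-halfB - Math.sqrt(discriminant)) / a;
  return t >= 0 ? t : null;
}

// A ray looking down -Z at a unit sphere 5 units away hits it at distance 4.
console.log(hitSphere([0, 0, 0], [0, 0, -1], [0, 0, -5], 1)); // 4
```

The hit point is origin + t * direction, and the surface normal there is the hit point minus the sphere center, divided by the radius.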

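Learning resources

An Introduction to Ray Tracing
Ray Tracing and the Future of Real-Time Rendering
Introduction to NVIDIA RTX and DirectX Ray Tracing
What do you think of Microsoft's DXR (DirectX Raytracing)?
Daily Pathtracer! Recommending some good pathtracer learning material
Ray Tracing in One Weekend
Ray Tracing: The Next Week
A detailed walkthrough of Ray Tracing in One Weekend and Ray Tracing: The Next Week
GPU Ray Tracing Based on OpenGL
Real-Time Ray Tracing with PBR in WebGL
Spatiotemporal Variance-Guided Filter: A Step Toward Real-Time Ray Tracing
Material for studying ray tracing systematically: Ray Tracing Gems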

Other

Bindless Texture

What is a bound texture in WebGPU?

Investigation: Bindless resources states:

Currently, in WebGPU, if a draw/dispatch call wants to use a resource, that resource must be part of a pre-baked "bind group" and then associated with the draw call inside the current render/compute pass. This means that all the resources that the draw/dispatch call could possibly access are explicitly listed by the programmer at the draw/dispatch site.

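In other words, we have to declare each texture's binding in the shader and then bind that texture every time we submit commands.

Let's look at the textureCube sample.
A bound texture needs a binding declared in the shader:

// Declare binding 2 in the fragment shader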
  const fragmentShaderGLSL = `#version 450
  ...
  layout(set = 0, binding = 2) uniform texture2D myTexture;

In the bind group, set up the data for binding 2:

const bindGroupLayout = device.createBindGroupLayout({
    bindings: [
    ...
    {
      // Texture view
      binding: 2,
      visibility: GPUShaderStage.FRAGMENT,
      type: "sampled-texture"
    }]
  });
  
  ...
  
  const uniformBindGroup = device.createBindGroup({
    layout: bindGroupLayout,
    bindings: [
    ...
    {
      binding: 2,
      resource: cubeTexture.createView(),
    }],
  });

Put the bind group layout into the pipeline:

const pipelineLayout = device.createPipelineLayout({ bindGroupLayouts: [bindGroupLayout] });
  const pipeline = device.createRenderPipeline({
    layout: pipelineLayout,
    ...
  });

When submitting commands, set the pipeline and the bind group:

const passEncoder = commandEncoder.beginRenderPass(renderPassDescriptor);
    passEncoder.setPipeline(pipeline);
    passEncoder.setBindGroup(0, uniformBindGroup);
    ...
    passEncoder.endPass();

What is a bindless texture in WebGPU?

Investigation: Bindless resources states:

"Bindless" is a model where the programmer doesn't explicitly list all of the available resources at the draw/dispatch site. Instead, a large swath of resources are made available to the GPU ahead of time (e.g. during application launch) and then shaders can access any/all of them at runtime.

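All textures can be placed in one buffer and handed to the GPU, and shaders can then access any of them at runtime.

The benefit is that we do not have to bind a specific texture every time we submit commands; one binding is enough.

If bindless textures are not supported, a texture 2D array can be used instead

According to page 36 of approaching-zero-driver-overhead, a texture 2D array can stand in for bindless textures: the texture 2D array is bound once, and no specific texture has to be bound on each command submission.

For the advantages of texture 2D arrays, see:
Why emphasize the use of Texture2DArray for terrain?

The drawback is that every texture in a texture array must have the same size and format, whereas bindless textures have no such requirement.

To work around this drawback, we can split the textures into groups by size and format, with one texture 2D array per group.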
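
The grouping can be sketched in JavaScript as follows. This is only the bookkeeping side, written under the assumption that each texture is described by a hypothetical record with `name`, `width`, `height`, and `format`; creating the actual GPU texture arrays is left out.

```javascript
// Groups textures by (size, format). Each group becomes one texture 2D array,
// and every texture gets the layer index it occupies inside its array.
function groupTexturesIntoArrays(textures) {
  const arrays = new Map(); // "WxH:format" -> { width, height, format, layers: [names] }
  const lookup = {};        // texture name -> { arrayKey, layer }
  for (const t of textures) {
    const key = `${t.width}x${t.height}:${t.format}`;
    if (!arrays.has(key)) {
      arrays.set(key, { width: t.width, height: t.height, format: t.format, layers: [] });
    }
    const group = arrays.get(key);
    lookup[t.name] = { arrayKey: key, layer: group.layers.length };
    group.layers.push(t.name);
  }
  return { arrays, lookup };
}

const { arrays, lookup } = groupTexturesIntoArrays([
  { name: "grass", width: 512, height: 512, format: "rgba8unorm" },
  { name: "rock", width: 512, height: 512, format: "rgba8unorm" },
  { name: "sky", width: 1024, height: 1024, format: "rgba8unorm" },
]);
console.log(arrays.size);  // 2
console.log(lookup.rock);  // { arrayKey: '512x512:rgba8unorm', layer: 1 }
```

A material would then store the array it belongs to and the layer index, and the shader would sample the array with that layer instead of binding an individual texture.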

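WebGPU support status

According to Minutes for GPU Web meeting 2019-08-12, it has not yet been decided when bindless textures will be implemented. They may be implemented as an extension, and possibly only after version 1.0.

So for now, a texture 2D array can be considered as a substitute

References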

OPENGL AZDO : BINDLESS TEXTURES : BATCHING PROBLEM SOLVED

Virtual Texture

Idea

Stitch all the textures that will be used into one physical texture;
then, through an index, load only the textures currently in use into memory.
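
The indexing step can be sketched as a page table in JavaScript. This is a deliberately simplified model written for illustration: the virtual texture is divided into a grid of pages, only some pages are resident, and a lookup translates a virtual UV into a slot of the physical texture plus a local UV inside that page. The `PageTable` class and the notion of a numbered "slot" are assumptions of this sketch, not part of any specific API.

```javascript
class PageTable {
  constructor(pagesX, pagesY) {
    this.pagesX = pagesX;
    this.pagesY = pagesY;
    // One entry per virtual page; -1 means the page is not resident.
    this.slots = new Int32Array(pagesX * pagesY).fill(-1);
  }

  // Marks a virtual page as loaded into the given physical slot.
  load(pageX, pageY, slot) {
    this.slots[pageY * this.pagesX + pageX] = slot;
  }

  // Marks a virtual page as no longer resident.
  evict(pageX, pageY) {
    this.slots[pageY * this.pagesX + pageX] = -1;
  }

  // Maps a virtual UV in [0, 1] to { slot, u, v } within the resident page,
  // or returns null when the page still has to be loaded.
  lookup(u, v) {
    const px = Math.min(Math.floor(u * this.pagesX), this.pagesX - 1);
    const py = Math.min(Math.floor(v * this.pagesY), this.pagesY - 1);
    const slot = this.slots[py * this.pagesX + px];
    if (slot < 0) return null;
    return { slot, u: u * this.pagesX - px, v: v * this.pagesY - py };
  }
}
```

A null result corresponds to a page miss: the renderer would request that page from disk and, in the meantime, fall back to a lower-resolution page that is already resident.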

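Advantages

1. The texture is bound only once
2. The sub-textures that make up the physical texture may differ in format, mipmaps, and so on
3. Lower memory usage (only the textures currently in use are in memory)

Disadvantages

Because textures are constantly loaded into and unloaded from memory, I/O overhead increases

Use cases

Terrain textures

WebGPU support status

Someone filed Investigation: Sparse Resources, hoping that WebGPU would add an API for working with heaps. There has been no response so far.

I am currently not sure whether virtual textures can be implemented in WebGPU

References

approaching-zero-driver-overhead -> Sparse Texture
Zhihu -> Virtual Texture Tools & Practices
A Basic Understanding of Virtual Texture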

Tessellation

According to Investigation: Tessellation:

Let's wait until after the release of a MVP

WebGPU will probably consider adding tessellation shaders after the MVP

Mesh Shader

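Introduction

NVIDIA introduced a new pipeline with the Turing architecture as an alternative to the traditional geometry stages of the rasterization pipeline. Of the existing programmable stages, the new pipeline keeps only the Pixel Shader (that is, the fragment shader) and adds the Task Shader and the Mesh Shader.

The new pipeline is a better fit for the ideas behind the GPU Driven Render Pipeline. Its features include:
like the compute pipeline (compute shaders), it offers powerful general-purpose computation;
it splits a Mesh into Meshlets (similar to the Clusters mentioned for the GPU Driven Render Pipeline), which gives better support for cluster culling.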