vulkan asynchronous compute

https://www.youtube.com/watch?v=XOGIDMJThto

https://www.khronos.org/assets/uploads/developers/library/2016-vulkan-devday-uk/9-Asynchonous-compute.pdf

 

https://docs.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization

https://gpuopen.com/concurrent-execution-asynchronous-queues/

Running work in parallel across multiple queues increases the GPU's overall parallelism.

 

Concurrency

The Radeon™ Fury X GPU consists of 64 Compute Units (CUs), each containing 4 Single-Instruction-Multiple-Data (SIMD) units, and each SIMD executes blocks of 64 threads, which we call a "wavefront".

Since latency for memory access can cause significant stalls in shader execution, up to 10 wavefronts can be scheduled on each SIMD simultaneously to hide this latency.

The GPU has 64 CUs.

Each CU has 4 SIMDs.

Each SIMD executes blocks of 64 threads: one wavefront.

Pixel shader work runs inside these wavefronts.

Raising the GPU's concurrency reduces GPU idle time.
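The figures above multiply out to the GPU's maximum resident thread count; a quick back-of-the-envelope check (plain arithmetic, not API code):

```cpp
// Maximum threads the GPU can keep resident at once:
// CUs * SIMDs per CU * wavefronts per SIMD * threads per wavefront
int maxResidentThreads(int cus, int simdsPerCU, int wavesPerSIMD, int threadsPerWave) {
    return cus * simdsPerCU * wavesPerSIMD * threadsPerWave;
}
// Fury X: maxResidentThreads(64, 4, 10, 64) == 163840 threads in flight
```

That head count of in-flight threads is what lets the scheduler swap wavefronts in when one stalls on memory.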

async compute

  • Copy Queue (DirectX 12) / Transfer Queue (Vulkan): DMA transfers of data over the PCIe bus
  • Compute Queue (DirectX 12 and Vulkan): execute compute shaders or copy data, preferably within local memory
  • Direct Queue (DirectX 12) / Graphics Queue (Vulkan): this queue can do anything, so it is similar to the main device in legacy APIs
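In Vulkan these roles map to queue families reported by the device. A minimal sketch of picking a dedicated async compute family (the bit values match Vulkan's `VkQueueFlagBits`, but the struct is a simplified stand-in so the sketch stays self-contained; real code would fill it from `vkGetPhysicalDeviceQueueFamilyProperties`):

```cpp
#include <cstdint>
#include <vector>

// Same numeric values as VK_QUEUE_GRAPHICS_BIT / _COMPUTE_BIT / _TRANSFER_BIT
constexpr uint32_t GRAPHICS_BIT = 0x1;
constexpr uint32_t COMPUTE_BIT  = 0x2;
constexpr uint32_t TRANSFER_BIT = 0x4;

// Simplified stand-in for VkQueueFamilyProperties
struct QueueFamily {
    uint32_t flags;      // capabilities of this family
    uint32_t queueCount; // how many queues it exposes
};

// Prefer a compute-capable family WITHOUT graphics: that is the
// "async compute" queue that can run alongside the graphics queue.
int findAsyncComputeFamily(const std::vector<QueueFamily>& families) {
    for (int i = 0; i < (int)families.size(); ++i) {
        const QueueFamily& f = families[i];
        if ((f.flags & COMPUTE_BIT) && !(f.flags & GRAPHICS_BIT))
            return i;
    }
    // Fall back to any compute-capable family (shares the graphics engine)
    for (int i = 0; i < (int)families.size(); ++i)
        if (families[i].flags & COMPUTE_BIT) return i;
    return -1; // no compute support at all
}
```

On a typical AMD device the families look roughly like {GRAPHICS|COMPUTE|TRANSFER, COMPUTE|TRANSFER, TRANSFER}, so the dedicated compute family is index 1.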

These three queue types correspond to the three encoder types in Metal; they exist to increase the concurrency described above.

 

This kind of low-level control over the GPU is exposed through these queues.

Vulkan limits how many queues each family exposes; the count can be queried.

DX12 has no such limit on the number of queues you can create.
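A sketch of respecting that Vulkan limit (in real code the cap comes from `VkQueueFamilyProperties::queueCount`; the helper name here is mine):

```cpp
#include <algorithm>
#include <cstdint>

// Vulkan: each queue family reports how many queues it exposes
// (VkQueueFamilyProperties::queueCount) and you may not create more,
// so clamp the number you want against the queried count.
// DX12 has no equivalent per-type cap.
uint32_t queuesToCreate(uint32_t wanted, uint32_t familyQueueCount) {
    return std::min(wanted, familyQueueCount);
}
```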

Move more parts of the frame into compute shaders so they can run as async compute.

See the diagram (a skill point I have not unlocked yet).

 

Troubleshooting

  • If resources are located in system memory, accessing them from the Graphics or Compute queues will have an impact on DMA queue performance, and vice versa.
  • Graphics and Compute queues accessing local memory (e.g. fetching texture data, writing to UAVs or performing rasterization-heavy tasks) can affect each other due to bandwidth limitations (bandwidth-limited; keep data on-chip where possible)
  • Threads sharing the same CU will share GPRs and LDS, so tasks that use all available resources may prevent asynchronous workloads from executing on the same CU
  • Different queues share their caches. If multiple queues utilize the same caches this can result in more cache thrashing and reduce performance

Due to the reasons above it is recommended to determine bottlenecks for each pass and place passes with complementary bottlenecks next to each other:

  • Compute shaders which make heavy use of LDS and ALU are usually good candidates for the asynchronous compute queue
  • Depth only rendering passes are usually good candidates to have some compute tasks run next to it
  • A common solution for efficient asynchronous compute usage can be to overlap the post processing of frame N with shadow map rendering of frame N+1
  • Porting as much of the frame as possible to compute results in more flexibility when experimenting with which tasks can be scheduled next to each other
  • Splitting tasks into sub-tasks and interleaving them can reduce barriers and create opportunities for efficient async compute usage (e.g. instead of "for each light: clear shadow map, render shadow, compute VSM" do "clear all shadow maps, render all shadow maps, compute VSM for all shadow maps")
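The last bullet's reordering can be sketched as plain scheduling logic (hypothetical task names; the point is the phase ordering, not any real API):

```cpp
#include <string>
#include <vector>

// Interleaved version: "for each light: clear, render, compute VSM".
// Every step depends on the previous one for that light, forcing a
// barrier between each pair of adjacent tasks.
std::vector<std::string> perLightSchedule(int lights) {
    std::vector<std::string> s;
    for (int i = 0; i < lights; ++i) {
        s.push_back("clear "      + std::to_string(i));
        s.push_back("render "     + std::to_string(i));
        s.push_back("computeVSM " + std::to_string(i));
    }
    return s;
}

// Batched version: all clears, then all renders, then all VSM passes.
// Within each phase the lights are independent, so one barrier per
// phase suffices, and the whole VSM phase is a natural candidate for
// the async compute queue.
std::vector<std::string> batchedSchedule(int lights) {
    std::vector<std::string> s;
    for (std::string phase : {"clear", "render", "computeVSM"})
        for (int i = 0; i < lights; ++i)
            s.push_back(phase + " " + std::to_string(i));
    return s;
}
```

Both schedules contain the same tasks; only the ordering changes, which is what shrinks the barrier count.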

Then put the async compute feature behind a toggle.

Judging from this, Vulkan seems to lack something like Metal 2's persistent thread groups, which can keep data on-tile while it is passed between compute and pixel shaders.
