vulkan asynchronous compute

https://www.youtube.com/watch?v=XOGIDMJThto

https://www.khronos.org/assets/uploads/developers/library/2016-vulkan-devday-uk/9-Asynchonous-compute.pdf

 

https://docs.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization

https://gpuopen.com/concurrent-execution-asynchronous-queues/

Running work in parallel across multiple queues increases the GPU's overall parallelism.

 

Concurrency

The Radeon™ Fury X GPU consists of 64 Compute Units (CUs), each containing 4 Single-Instruction-Multiple-Data (SIMD) units, and each SIMD executes blocks of 64 threads, which we call a "wavefront".

Since latency for memory access can cause significant stalls in shader execution, up to 10 wavefronts can be scheduled on each SIMD simultaneously to hide this latency.

The GPU has 64 CUs.

Each CU has 4 SIMDs.

Each SIMD executes blocks of 64 threads: one wavefront.

Pixel shader work runs inside these wavefronts.

Raising the GPU's concurrency reduces GPU idle time.
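The figures above multiply out to the GPU's maximum resident thread count; a quick back-of-the-envelope check (plain arithmetic, not API code):

```cpp
// Maximum threads the GPU can keep resident at once:
// CUs * SIMDs per CU * wavefronts per SIMD * threads per wavefront
int maxResidentThreads(int cus, int simdsPerCU, int wavesPerSIMD, int threadsPerWave) {
    return cus * simdsPerCU * wavesPerSIMD * threadsPerWave;
}
// Fury X: maxResidentThreads(64, 4, 10, 64) == 163840 threads in flight
```

That head count of in-flight threads is what lets the scheduler swap wavefronts in when one stalls on memory.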

async compute

  • Copy Queue (DirectX 12) / Transfer Queue (Vulkan): DMA transfers of data over the PCIe bus
  • Compute Queue (DirectX 12 and Vulkan): execute compute shaders or copy data, preferably within local memory
  • Direct Queue (DirectX 12) / Graphics Queue (Vulkan): this queue can do anything, so it is similar to the main device in legacy APIs
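In Vulkan these roles map to queue families reported by the device. A minimal sketch of picking a dedicated async compute family (the bit values match Vulkan's `VkQueueFlagBits`, but the struct is a simplified stand-in so the sketch stays self-contained; real code would fill it from `vkGetPhysicalDeviceQueueFamilyProperties`):

```cpp
#include <cstdint>
#include <vector>

// Same numeric values as VK_QUEUE_GRAPHICS_BIT / _COMPUTE_BIT / _TRANSFER_BIT
constexpr uint32_t GRAPHICS_BIT = 0x1;
constexpr uint32_t COMPUTE_BIT  = 0x2;
constexpr uint32_t TRANSFER_BIT = 0x4;

// Simplified stand-in for VkQueueFamilyProperties
struct QueueFamily {
    uint32_t flags;      // capabilities of this family
    uint32_t queueCount; // how many queues it exposes
};

// Prefer a compute-capable family WITHOUT graphics: that is the
// "async compute" queue that can run alongside the graphics queue.
int findAsyncComputeFamily(const std::vector<QueueFamily>& families) {
    for (int i = 0; i < (int)families.size(); ++i) {
        const QueueFamily& f = families[i];
        if ((f.flags & COMPUTE_BIT) && !(f.flags & GRAPHICS_BIT))
            return i;
    }
    // Fall back to any compute-capable family (shares the graphics engine)
    for (int i = 0; i < (int)families.size(); ++i)
        if (families[i].flags & COMPUTE_BIT) return i;
    return -1; // no compute support at all
}
```

On a typical AMD device the families look roughly like {GRAPHICS|COMPUTE|TRANSFER, COMPUTE|TRANSFER, TRANSFER}, so the dedicated compute family is index 1.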

These three queue types correspond to the three encoder types in Metal; they exist to increase the concurrency described above.

 

This kind of low-level control over the GPU is exposed through these queues.

Vulkan limits how many queues each family exposes; the count can be queried.

DX12 has no such limit on the number of queues you can create.
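A sketch of respecting that Vulkan limit (in real code the cap comes from `VkQueueFamilyProperties::queueCount`; the helper name here is mine):

```cpp
#include <algorithm>
#include <cstdint>

// Vulkan: each queue family reports how many queues it exposes
// (VkQueueFamilyProperties::queueCount) and you may not create more,
// so clamp the number you want against the queried count.
// DX12 has no equivalent per-type cap.
uint32_t queuesToCreate(uint32_t wanted, uint32_t familyQueueCount) {
    return std::min(wanted, familyQueueCount);
}
```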

Move more parts of the frame into compute shaders so they can run as async compute.

See the diagram (a skill point I have not unlocked yet).

 

Troubleshooting

  • If resources are located in system memory, accessing them from the Graphics or Compute queues will have an impact on DMA queue performance, and vice versa.
  • Graphics and Compute queues accessing local memory (e.g. fetching texture data, writing to UAVs or performing rasterization-heavy tasks) can affect each other due to bandwidth limitations (bandwidth-limited; keep data on-chip where possible)
  • Threads sharing the same CU will share GPRs and LDS, so tasks that use all available resources may prevent asynchronous workloads from executing on the same CU
  • Different queues share their caches. If multiple queues utilize the same caches this can result in more cache thrashing and reduce performance

Due to the reasons above it is recommended to determine bottlenecks for each pass and place passes with complementary bottlenecks next to each other:

  • Compute shaders which make heavy use of LDS and ALU are usually good candidates for the asynchronous compute queue
  • Depth only rendering passes are usually good candidates to have some compute tasks run next to it
  • A common solution for efficient asynchronous compute usage can be to overlap the post processing of frame N with shadow map rendering of frame N+1
  • Porting as much of the frame as possible to compute results in more flexibility when experimenting with which tasks can be scheduled next to each other
  • Splitting tasks into sub-tasks and interleaving them can reduce barriers and create opportunities for efficient async compute usage (e.g. instead of "for each light: clear shadow map, render shadow, compute VSM" do "clear all shadow maps, render all shadow maps, compute VSM for all shadow maps")
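The last bullet's reordering can be sketched as plain scheduling logic (hypothetical task names; the point is the phase ordering, not any real API):

```cpp
#include <string>
#include <vector>

// Interleaved version: "for each light: clear, render, compute VSM".
// Every step depends on the previous one for that light, forcing a
// barrier between each pair of adjacent tasks.
std::vector<std::string> perLightSchedule(int lights) {
    std::vector<std::string> s;
    for (int i = 0; i < lights; ++i) {
        s.push_back("clear "      + std::to_string(i));
        s.push_back("render "     + std::to_string(i));
        s.push_back("computeVSM " + std::to_string(i));
    }
    return s;
}

// Batched version: all clears, then all renders, then all VSM passes.
// Within each phase the lights are independent, so one barrier per
// phase suffices, and the whole VSM phase is a natural candidate for
// the async compute queue.
std::vector<std::string> batchedSchedule(int lights) {
    std::vector<std::string> s;
    for (std::string phase : {"clear", "render", "computeVSM"})
        for (int i = 0; i < lights; ++i)
            s.push_back(phase + " " + std::to_string(i));
    return s;
}
```

Both schedules contain the same tasks; only the ordering changes, which is what shrinks the barrier count.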

Then put the async compute feature behind a toggle.

Judging from this, Vulkan seems to lack something like Metal 2's persistent thread groups, which can keep data on-tile while it is passed between compute and pixel shaders.
