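Recently a colleague posted in our company chat group about a UE4 optimization for Mask materials. In scenes with large areas of grass and trees, it can improve performance considerably. It works by exploiting a GPU feature called Early Z, but the way UE4 uses it differed from my initial understanding: Early Z is implemented in hardware, and every vendor implements it differently. After reading up on the topic and running some experiments, I want to demystify Early Z here. First I explain what Early Z is, then how UE4 uses it to address the overdraw problem of grass and trees, then how Early Z has evolved, and finally I use experimental data to verify how Early Z actually behaves.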
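What is Early Z
In the traditional rendering pipeline, the depth test happens after the Pixel/Fragment Shader, as shown in the figure below:
If we think about it, though, the depth of every fragment is already known at rasterization time. If the depth test could run at that point, the expensive Pixel/Fragment Shader work could be skipped for fragments that fail it. Hardware vendors naturally thought of this too, and each implemented Early Z in their own hardware. I found some of their documentation online; let's take a brief look.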
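The difference between testing depth after the shader and before it can be illustrated with a toy model. This is only a sketch under simplifying assumptions: fragments are `(pixel, depth)` pairs, the depth test is LESS, and `render` and its counters are hypothetical names, not any real GPU or API behavior.

```python
def render(fragments, early_z):
    """Process (pixel, depth) fragments with a LESS depth test.

    Returns the number of pixel shader invocations and the final depth buffer.
    """
    depth_buffer = {}            # pixel -> nearest depth written so far
    shader_invocations = 0
    for pixel, depth in fragments:
        stored = depth_buffer.get(pixel, float("inf"))
        if early_z and not depth < stored:
            continue             # rejected before the shader runs
        shader_invocations += 1  # the (expensive) pixel shader runs here
        if depth < stored:       # depth test and depth write
            depth_buffer[pixel] = depth
    return shader_invocations, depth_buffer


# Three layers covering the same 4 pixels, submitted front to back.
fragments = [(p, z) for z in (0.2, 0.5, 0.8) for p in range(4)]

late_count, late_depth = render(fragments, early_z=False)
early_count, early_depth = render(fragments, early_z=True)
print(late_count, early_count)  # 12 4
```

Both paths end with the same depth buffer; only the number of shader invocations differs. The saving depends on submission order: drawn back to front, every fragment would pass the test and early Z would reject nothing.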
nVidia
nVidia's GPU Programming Guide includes optimization advice for Early Z and mentions some details about how to use it.
GPU Programming Guide Version 2.5.0 (GeForce 7 and earlier GPUs):
Early-z optimization (sometimes called "z-cull") improves performance by avoiding the rendering of occluded surfaces. If the occluded surfaces have expensive shaders applied to them, z-cull can save a large amount of computation time. To take advantage of z-cull, follow these guidelines:
Violating these rules can invalidate the data the GPU uses for early optimization, and can disable z-cull until the depth buffer is cleared again.
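In short: do not use alpha test or texkill (clip/discard), and do not modify depth; only the depth interpolated by the rasterizer may be used. Violating these rules disables the GPU's Early Z optimization until the depth buffer is next cleared. Those restrictions reflected the hardware of the time. Do they still apply to current GPUs? The experiment later in this article addresses that question.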
ZCULL and EarlyZ: Coarse and Fine-grained Z and Stencil Culling
NVIDIA GeForce 6 series and later GPUs can perform a coarse level Z and Stencil culling. Thanks to this optimization large blocks of pixels will not be scheduled for pixel shading if they are determined to be definitely occluded. In addition, GeForce 8 series and later GPUs can also perform fine-grained Z and Stencil culling, which allow the GPU to skip the shading of occluded pixels. These hardware optimizations are automatically enabled when possible, so they are mostly transparent to developers. However, it is good to know when they cannot be enabled or when they can underperform to ensure that you are taking advantage of them.
Coarse Z/Stencil culling (also known as ZCULL) will not be able to cull any pixels in the following cases:
1. If you don't use Clears (instead of fullscreen quads that write depth) to clear the depth-stencil buffer.
2. If the pixel shader writes depth.
3. If you change the direction of the depth test while writing depth. ZCULL will not cull any pixels until the next depth buffer Clear.
4. If stencil writes are enabled while doing stencil testing (no stencil culling)
5. On GeForce 8 series, if the DepthStencilView has Texture2D[MS]Array dimension
Also note that ZCULL will perform less efficiently in the following circumstances
1. If the depth buffer was written using a different depth test direction than that used for testing
2. If the depth of the scene contains a lot of high frequency information (i.e.: the depth varies a lot within a few pixels)
3. If you allocate too many large depth buffers.
4. If using DXGI_FORMAT_D32_FLOAT format
Similarly, fine-grained Z/Stencil culling (also known as EarlyZ) is disabled in the following cases:
1. If the pixel shader outputs depth
2. If the pixel shader uses the .z component of an input attribute with the SV_Position semantic (only on GeForce 8 series in D3D10)
3. If Depth or Stencil writes are enabled, or Occlusion Queries are enabled, and one of the following is true:
• Alpha-test is enabled
• Pixel Shader kills pixels (clip(), texkil, discard)
• Alpha To Coverage is enabled
• SampleMask is not 0xFFFFFFFF (SampleMask is set in D3D10 using OMSetBlendState and in D3D9 setting the D3DRS_MULTISAMPLEMASK renderstate)
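The text above comes from the GPU Programming Guide for the GeForce 8 and 9 Series. It additionally covers ZCULL (i.e. Hierarchical Z), which has its own caveats, but it does not say clearly whether enabling alpha test disables Early Z for all subsequent draws.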
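To make the coarse level concrete, here is a minimal sketch of tile-based rejection. It assumes a one-dimensional framebuffer, a LESS depth test, and one stored "farthest depth" per tile; `TILE`, `draw_span` and `tile_max` are hypothetical names for illustration and do not describe how any vendor actually implements ZCULL or HiZ.

```python
TILE = 4  # hypothetical tile width in pixels


def draw_span(depth_buffer, tile_max, start, end, depth):
    """Draw a constant-depth span [start, end) with a LESS depth test.

    tile_max[t] holds the farthest depth stored in tile t. If the incoming
    depth is not nearer than that value, every pixel in the tile would fail,
    so the whole tile is skipped without any per-pixel work.
    Returns (tiles_rejected, per_pixel_tests).
    """
    tiles_rejected = 0
    pixel_tests = 0
    for tile in range(start // TILE, (end - 1) // TILE + 1):
        if depth >= tile_max[tile]:
            tiles_rejected += 1      # coarse rejection of the whole tile
            continue
        lo = max(start, tile * TILE)
        hi = min(end, (tile + 1) * TILE)
        for x in range(lo, hi):      # fine-grained per-pixel test
            pixel_tests += 1
            if depth < depth_buffer[x]:
                depth_buffer[x] = depth
        # Keep the coarse buffer in sync with the fine one.
        tile_max[tile] = max(depth_buffer[tile * TILE:(tile + 1) * TILE])
    return tiles_rejected, pixel_tests


depth_buffer = [1.0] * 16            # cleared to the far plane
tile_max = [1.0] * 4
near = draw_span(depth_buffer, tile_max, 0, 8, 0.3)   # near occluder, left half
far = draw_span(depth_buffer, tile_max, 0, 16, 0.7)   # farther span, full width
print(near, far)  # (0, 8) (2, 8)
```

The second draw skips the two tiles already covered by the nearer occluder and only runs per-pixel tests on the remaining eight pixels.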
AMD
Emil Persson's Depth in Depth gives a fairly thorough explanation of Early Z.
Hierarchical Z, or HiZ for short, allows tiles of pixels to be rejected in a hierarchical fashion. This allows for faster rejection of occluded pixels and offers some bandwidth saving by doing a rough depth test using lower resolution buffers first instead of reading individual depth samples. Tiles that can safely be discarded are eliminated and thus the fragment shader will not be executed for those pixels. Tiles that cannot safely be discarded are passed on to the Early Z stage, which will be discussed later on.
The Early Z component operates on a pixel level and allows fragments to be rejected before executing the fragment shader. This means that if a certain fragment is found to be occluded by the current contents of the depth buffer, the fragment shader doesn't have to run for that pixel. Early Z can also reject fragments before shading based on the stencil test. On hardware prior to the Radeon HD 2000 series, early Z was a monolithic top-of-the-pipe operation, which means that the entire read-modify-write cycle is executed before the fragment shader. As a result this impacts other functionality that kills fragments such as alpha test and texkill (called "clip" in HLSL and "discard" in GLSL). If Early Z would be left on and the alpha test kills a fragment, the depth- and/or stencil-buffer would have been incorrectly updated for the killed fragments. Therefore, Early Z is disabled for these cases. However, if depth and stencil writes are disabled there are no updates to the depth-stencil buffer anyway, so in this case Early Z will be enabled. On the Radeon HD 2000 series, Early Z works in all cases.
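The hazard described in the quote can be reproduced in a small simulation. This is a sketch under stated assumptions: a single pixel, a LESS depth test, and a hypothetical `draw` function whose `monolithic_early_z` flag models a test-and-write that happens before the shader. It is not a model of any specific Radeon or GeForce part.

```python
def draw(fragments, monolithic_early_z):
    """fragments: list of (pixel, depth, color, killed_by_shader)."""
    depth = {0: 1.0}                 # cleared to the far plane
    color = {0: "background"}
    for pixel, z, shade, killed in fragments:
        if monolithic_early_z:
            # Depth test AND depth write happen before the shader runs.
            if not z < depth[pixel]:
                continue
            depth[pixel] = z
            if killed:
                continue             # too late: depth was already written
            color[pixel] = shade
        else:
            # Late Z: the shader runs first, so a killed fragment
            # never reaches the depth buffer.
            if killed:
                continue
            if z < depth[pixel]:
                depth[pixel] = z
                color[pixel] = shade
    return depth[0], color[0]


# A leaf fragment whose alpha hole kills it, followed by a wall behind it.
scene = [(0, 0.2, "leaf", True), (0, 0.6, "wall", False)]

late_result = draw(scene, monolithic_early_z=False)
early_result = draw(scene, monolithic_early_z=True)
print(late_result)   # (0.6, 'wall')
print(early_result)  # (0.2, 'background')
```

With the monolithic early path, the killed leaf fragment leaves its depth behind and wrongly occludes the wall. With depth writes disabled there is nothing to corrupt, which matches the exception noted at the end of the quote.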
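Finally, the author also provides a reference table listing the conditions under which Early Z is disabled, as shown in the figure below:
Summary
Both documents above are fairly old, and after reading them it may still be unclear exactly when Early Z gets disabled. These restrictions also change as hardware evolves, so later we run some experiments to check.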
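UE4's Early Z Optimization for Mask Materials
Having covered what Early Z is, let's look at how UE4 deals with the overdraw caused by Mask materials.
It requires turning on a setting called Mask Material Only in Early-Z pass
That is only the switch, so how is it implemented in code? I will not paste the code here and will only describe the steps it takes; for the details, see the UE4 Pre Pass code.
This is how UE4 improves the rendering efficiency of Mask materials. The precondition is that the Mask materials in your scene are expensive; only then is the gain significant. But wait: this implementation contradicts some of the articles we just read, and other documents are unclear on the point. Since UE4 has implemented this feature and it does improve performance, the earlier articles must only apply to the GPUs of their time, and hardware has since become smarter and able to handle more cases. To verify this, let's run some experiments.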
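The idea behind this kind of optimization can be sketched as a two-pass scheme. This is an assumption-laden illustration, not UE4's code: a cheap depth-only pass evaluates the mask and writes depth, and the expensive pass then runs with an Equal depth test, depth writes off, and no clip, so the depth test can reject fragments before shading. All names (`single_pass`, `two_pass`, the fragment tuples) are hypothetical.

```python
INF = float("inf")


def single_pass(fragments):
    """Masked draw with clip and depth write: in this conservative model
    the expensive shader runs for every fragment."""
    return len(fragments)


def two_pass(fragments):
    """fragments: list of (pixel, depth, opaque), where opaque is the mask result."""
    depth_buffer = {}
    cheap = 0
    expensive = 0
    # Pass 1: depth-only prepass, LESS test with depth write.
    # The shader only evaluates the mask and clips.
    for pixel, depth, opaque in fragments:
        cheap += 1
        if not opaque:
            continue                      # clip()
        if depth < depth_buffer.get(pixel, INF):
            depth_buffer[pixel] = depth
    # Pass 2: base pass, EQUAL test, no depth write, no clip.
    # The depth test can now run before the expensive shader.
    shaded = {}
    for pixel, depth, opaque in fragments:
        if depth_buffer.get(pixel) != depth:
            continue                      # rejected before shading
        expensive += 1
        shaded[pixel] = depth
    return cheap, expensive, shaded


# Three overlapping grass cards over 4 pixels; each tuple is (pixel, depth, opaque).
grass = (
    [(p, 0.2, p in (0, 1)) for p in range(4)]
    + [(p, 0.5, p in (1, 2)) for p in range(4)]
    + [(p, 0.8, True) for p in range(4)]
)

baseline = single_pass(grass)
cheap, expensive, shaded = two_pass(grass)
print(baseline, cheap, expensive)  # 12 12 4
```

The cheap shader still runs for every fragment, which is why the gain is only worthwhile when the full material is much more expensive than the mask evaluation.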
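Unveiling Early Z
To test the assumptions above, I ran a simple experiment. The demo is based on the Texturing tutorial from rastertek's Direct3D 11 series, which renders a single textured triangle on screen, as shown in the figure below:
I modified its code to draw four triangles at the same position. The first triangle is rendered with a Mask. The second modifies depth in the PS. The third is rendered with a Mask. The fourth is also rendered with a Mask but, as in UE4, with depth writes turned off, the depth test changed to Equal, and clip disabled. The test GPU is an nVidia GTX 570. I then used GPA (Intel Graphics Performance Analyzers) to capture the number of PS invocations and the number of pixels rendered for each draw, as shown in the table below: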
| Draw | Depth | Clip | PS Invocations | Pixels Rendered |
|---|---|---|---|---|
| 1 Mask | Less, Write | Yes | 10.4k | 6548 |
| 2 Modify depth | Less, Write | No | 10.4k | 3820 |
| 3 Mask | Less, Write | Yes | 10.4k | 0 |
| 4 Mask | Equal, no Write (whether depth is written does not affect the result because the test is Equal; writes are turned off to save bandwidth) | No | 6548 | 6548 |
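The table shows that both Modify depth and Clip affect Early Z only for the current draw call and not for the draw calls that follow. Draws 1 to 3 each ran the PS about 10.4k times, while draw 4 ran it only 6548 times, exactly the number of pixels it rendered, so the remaining fragments were rejected before shading even though the earlier draws used clip and modified depth. As hardware has evolved, Early Z (including Hierarchical Z) has become smarter and can handle more cases.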
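Summary
From this brief analysis of Early Z and the experiment, we arrive at a useful conclusion:
As hardware evolves, many of its original restrictions are lifted, so we need to keep learning in order to optimize our engines and games correctly.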