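An In-Depth Look at MSAA

This article takes an in-depth look at MSAA (multisample anti-aliasing), covering the basic principle and comparing implementations across platforms (mainly PC and mobile). I wrote it to build a better understanding of MSAA. It will inevitably contain mistakes; if you spot one, please point it out so that others are not misled. With that out of the way, let's get started.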

 

How MSAA Works

Aliasing

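Before describing how MSAA works, a brief introduction to aliasing is in order. In signal processing and related fields, aliasing is the effect that makes different signals indistinguishable once they are sampled. It also refers to the artifacts that appear when a signal reconstructed from samples does not match the original signal. It comes in two kinds: temporal aliasing (for example in digital music, or wheels that appear to spin backwards in films) and spatial aliasing (moiré patterns). We will not go into further detail here.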
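As a minimal sketch of the first definition, the Python below samples a 1 Hz and a 9 Hz sine wave at 8 Hz. The frequencies and the sampling rate are assumed values chosen for illustration. Because 9 Hz lies above half the sampling rate, the two signals should produce the same samples.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, count):
    """Sample sin(2*pi*f*t) at t = k / sample_rate for k = 0..count-1."""
    return [math.sin(2 * math.pi * freq_hz * k / sample_rate_hz)
            for k in range(count)]

# Assumed values: 9 Hz is above the Nyquist limit (4 Hz) of an 8 Hz sampler,
# so it aliases onto 9 - 8 = 1 Hz.
low = sample_sine(1, 8, 16)
high = sample_sine(9, 8, 16)

# True when no sample differs by more than floating-point noise.
indistinguishable = all(abs(a - b) < 1e-9 for a, b in zip(low, high))
```

Once sampled, nothing in the data tells the two signals apart, which is why aliasing cannot be repaired after the fact and has to be addressed by sampling more densely or by filtering before sampling.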

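In real-time rendering specifically, aliasing takes the following three forms:

  1. Geometric aliasing (jagged edges on geometry), caused by undersampling geometric edges.

  2. Shading aliasing, caused by undersampling the shading function (the rendering equation) evaluated in the shader. The most visible symptom is flickering specular highlights.

    The upper image shows the shading aliasing produced by undersampling a high-frequency specular BRDF combined with a high-frequency normal map. The lower image shows the result of 4x supersampling.

  3. Temporal aliasing, caused mainly by undersampling fast-moving objects, for example an animation in a game that visibly jumps between poses.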

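    SSAA (Supersampling Anti-Aliasing)

    As the name suggests, supersampling renders the scene at a higher resolution and then filters neighboring pixel values (for example by averaging) to produce the final image (the resolve). Because it raises the sampling rate, it is effective against both geometric and shading aliasing. As the figure below shows, n subsamples are first taken for each pixel, shading is computed for every subsample, and the final image is then composed from the subsample values.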
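The two SSAA steps, shading on a denser grid and then averaging, can be sketched in a few lines of Python. The `shade` callback, the regular subsample grid, and the half-plane scene are all assumptions made for illustration, not part of any real renderer.

```python
def render_supersampled(width, height, factor, shade):
    """Shade every subsample on a grid `factor` times denser per axis."""
    return [[shade((x + 0.5) / factor, (y + 0.5) / factor)
             for x in range(width * factor)]
            for y in range(height * factor)]

def resolve_box(image, factor):
    """Average each factor x factor block into one output pixel."""
    height, width = len(image) // factor, len(image[0]) // factor
    out = []
    for py in range(height):
        row = []
        for px in range(width):
            block = [image[py * factor + sy][px * factor + sx]
                     for sy in range(factor) for sx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# Hypothetical scene: a vertical edge, white (1.0) where x < 1.5.
def edge(x, y):
    return 1.0 if x < 1.5 else 0.0

hi_res = render_supersampled(2, 1, 2, edge)  # 2x2 subsamples per pixel
final = resolve_box(hi_res, 2)               # [[1.0, 0.5]]
```

The second pixel straddles the edge, so half of its subsamples are white and it resolves to 0.5. Note that `edge` is called for every subsample, which is exactly the shading cost that makes SSAA expensive.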

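    Although SSAA handles geometric and shading aliasing well, it needs more video memory and more shading work (lighting is computed at every subsample), so it is rarely used. Following the same line of thought: if we shade once per final pixel rather than once per subsample, the memory footprint stays the same but far fewer shading computations are needed, so efficiency improves considerably. This is MSAA, the main topic of this article.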

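    MSAA (Multisample Anti-Aliasing)

    With SSAA, every subsample is shaded separately, which is still expensive when the fragment (pixel) shader is complex. Could we instead compute the color once per pixel, and compute only coverage and occlusion information per subsample to decide which subsamples receive that color? The target image is then produced by downsampling the subsample colors through a reconstruction filter. That is the principle of MSAA. One point is very important here: every subsample has its own color, depth, and stencil values, and every subsample must pass the depth and stencil tests before the pixel's color is written to it; a coverage test alone is not enough to write a color. Several sources quoted later in this article back this up. For now, let's finish explaining how MSAA works.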

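    Coverage and Occlusion

    A GPU that supports D3D11 can render points, lines, and triangles through rasterization. The rasterization pipeline takes the primitive's vertices as input, with positions in homogeneous clip space after the perspective transform. These positions determine which pixels of the current render target the triangle lands on. Whether a pixel is visible is decided by two factors: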

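  • Coverage: determined by testing whether a primitive overlaps a given pixel. On the GPU, this is done by testing whether the primitive overlaps a sample point located at the center of the pixel. The following image illustrates the process.

    Coverage for a single triangle. The blue dots are sample points, each at the center of a pixel. The red dots are the sample points covered by the triangle.

  • Occlusion: tells us whether a pixel covered by a primitive is also covered by other primitives. This is the familiar z-buffer depth test.

    Together, coverage and occlusion determine the visibility of a primitive.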

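    As far as rasterization is concerned, MSAA works much like SSAA: coverage and occlusion are both evaluated at a higher resolution. For coverage, the hardware generates n subsample points within each pixel according to a sampling pattern. The following image shows the subsample positions of a rotated-grid pattern.

    The triangle is tested for coverage against each subsample point of the pixel, producing a binary coverage mask that records which part of the pixel the triangle covers. For the occlusion test, the triangle's depth is interpolated at each covered subsample and compared against the depth stored in the z-buffer. Because the depth test runs per subsample rather than per pixel, the depth buffer must grow to hold the extra depth values. In practice, this means the depth buffer is n times the size of the non-MSAA one.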
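To make the coverage mask concrete, here is a minimal Python sketch that tests a triangle against four subsample positions per pixel using edge functions. The sample offsets are an assumed rotated-grid pattern, and the sketch ignores the tie-breaking fill rules that real rasterizers apply to samples lying exactly on an edge.

```python
# Assumed 4x rotated-grid offsets inside a unit pixel (illustrative values).
SAMPLE_OFFSETS = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def edge_function(a, b, p):
    """Signed area term; its sign tells which side of edge a->b point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def inside(tri, p):
    """True when p lies inside the triangle (either winding order)."""
    e = [edge_function(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
    return all(v >= 0 for v in e) or all(v <= 0 for v in e)

def coverage_mask(tri, px, py):
    """Bit i is set when subsample i of pixel (px, py) is inside the triangle."""
    mask = 0
    for i, (ox, oy) in enumerate(SAMPLE_OFFSETS):
        if inside(tri, (px + ox, py + oy)):
            mask |= 1 << i
    return mask

triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
full = coverage_mask(triangle, 0, 0)     # 0b1111: pixel well inside
partial = coverage_mask(triangle, 1, 2)  # 0b0101: pixel crossed by the hypotenuse
empty = coverage_mask(triangle, 3, 3)    # 0b0000: pixel outside
```

A hardware rasterizer would follow this with a per-subsample depth test, so the set of subsamples actually written can be smaller than the coverage mask.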

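    MSAA differs from SSAA in that SSAA shades every subsample, whereas MSAA runs the fragment shader only once for each pixel whose coverage mask is non-zero, with vertex attributes interpolated at the pixel center. This is MSAA's biggest advantage over SSAA.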
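The difference in shading work can be counted directly from coverage masks. The following Python sketch is a simplified model for a single triangle; the mask values are hypothetical.

```python
def shading_invocations(coverage_masks):
    """Return (ssaa, msaa) fragment shader run counts for one triangle.

    coverage_masks holds one bitmask per pixel in the triangle's bounding box.
    SSAA shades every covered subsample; MSAA shades once per touched pixel.
    """
    ssaa = sum(bin(mask).count("1") for mask in coverage_masks)
    msaa = sum(1 for mask in coverage_masks if mask != 0)
    return ssaa, msaa

# Hypothetical 4x masks: two interior pixels, one edge pixel, one missed pixel.
counts = shading_invocations([0b1111, 0b1111, 0b0101, 0b0000])  # (10, 3)
```

For interior pixels the saving approaches a factor of n, while edge pixels save less because only some of their subsamples were covered in the first place.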

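    Although shading runs only once per pixel, that does not mean a single color value per pixel is enough. A color must be stored for every subsample, so extra space is needed and the color buffer is also n times the size of the non-MSAA one. When the fragment shader outputs a value, only the subsamples that passed both the coverage test and the occlusion test are written. So if a triangle covers half of the subsamples of a 4x pattern, half of the subsamples receive the new value, and if all subsamples are covered, all of them receive it. The following image illustrates the idea: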
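The n-times growth of the color and depth buffers is simple arithmetic. This Python sketch assumes 4 bytes of color and 4 bytes of depth/stencil per sample and ignores any compression the hardware may apply; the resolution is an example value.

```python
def msaa_buffer_bytes(width, height, samples, color_bytes=4, depth_bytes=4):
    """Bytes needed for multisampled color plus depth/stencil storage."""
    per_sample = color_bytes + depth_bytes
    return width * height * samples * per_sample

single = msaa_buffer_bytes(1920, 1080, 1)  # 16,588,800 bytes
four_x = msaa_buffer_bytes(1920, 1080, 4)  # 66,355,200 bytes, 4 times as much
```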

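    Because the coverage mask decides which subsamples are updated, a pixel can end up holding n different values from n triangles that each cover part of its subsamples. The following image shows 4x MSAA rasterization.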
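The per-subsample write can be sketched as follows. This is a simplified Python model, not a description of any particular GPU: the dictionary layout, the less-than depth comparison, and the example values are assumptions, and the stencil test is omitted.

```python
def write_fragment(pixel, mask, frag_color, frag_depths):
    """Update one MSAA pixel with a fragment that was shaded once.

    pixel: {"color": [...], "depth": [...]} with one entry per subsample.
    mask: coverage bitmask; frag_depths: depth interpolated at each subsample.
    """
    for i in range(len(pixel["color"])):
        covered = (mask >> i) & 1
        if covered and frag_depths[i] < pixel["depth"][i]:  # less-than depth test
            pixel["color"][i] = frag_color  # same shaded color for every sample
            pixel["depth"][i] = frag_depths[i]

pixel = {"color": [0.0] * 4, "depth": [1.0] * 4}           # cleared 4x pixel
write_fragment(pixel, 0b0011, 1.0, [0.5, 0.5, 0.5, 0.5])   # near triangle, half coverage
write_fragment(pixel, 0b1111, 0.25, [0.8, 0.8, 0.8, 0.8])  # far triangle, full coverage
```

After both writes the pixel holds `[1.0, 1.0, 0.25, 0.25]`: the far triangle covers every subsample but fails the depth test on the two that the near triangle already owns.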

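    MSAA Resolve

    As with supersampling, the oversampled signal must be resampled down to the output resolution before it can be displayed.

    This process is called resolving. In its earliest incarnations, the resolve was done by fixed-function hardware on the GPU, usually with a one-pixel-wide box filter. For fully covered pixels, this filter gives the same result as rendering without MSAA. Whether that is good or bad depends on your point of view (good because no detail is lost to blurring, bad because a box filter introduces postaliasing). For pixels on triangle edges, you get a characteristic gradient of color values, with the number of steps equal to the number of subsamples. The following image shows this: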
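A one-pixel-wide box filter is just the average of a pixel's subsamples. The short Python sketch below uses assumed grayscale values (a white triangle over a black background) to show the edge gradient that results with 4x MSAA.

```python
def resolve_pixel(sample_colors):
    """One-pixel-wide box filter: the plain average of the subsamples."""
    return sum(sample_colors) / len(sample_colors)

# White triangle (1.0) over a black background (0.0), 4x MSAA.
# Covering k of the 4 subsamples moves the result up in steps of 1/4.
shades = [resolve_pixel([1.0] * k + [0.0] * (4 - k)) for k in range(5)]
# shades == [0.0, 0.25, 0.5, 0.75, 1.0]
```

A fully covered pixel averages n identical values, which is why the box filter returns the same color as rendering without MSAA in that case.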

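    Different hardware vendors may of course use different algorithms, for example NVIDIA's "Quincunx" AA. As GPUs have evolved, it has become possible to perform the MSAA resolve in a custom shader.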

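    Summary

    As the explanation above shows, MSAA is not completed within the rasterization stage alone; that stage only generates the coverage information. The pixel color is then computed, and coverage and depth information determine which subsamples it is written to. Once everything has been rendered, a filter downsamples the result into the final image. The overall flow is shown below: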

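    PC vs. Mobile

    We have covered the basic principle of MSAA. How do implementations differ across GPU vendors and platforms? Let's make a few simple comparisons. Since the algorithm itself is fixed, the differences mostly come down to details and to differences in GPU architecture.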

     

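    Comparison by API version. Three properties are compared: whether MSAA is supported, whether a custom shader resolve is available, and whether larger color and depth buffers are required.

    Direct3D 9
      Larger color/depth buffers required: yes

    Direct3D 11
      Larger color/depth buffers required: yes

    Direct3D 12
      Larger color/depth buffers required: yes

    OpenGL ES 2.0
      MSAA support: (Multisample rasterization cannot be enabled or disabled after a GL context is created. It is enabled if the value of SAMPLE_BUFFERS is one, and disabled otherwise)
        Multisample texture: through the GL_EXT_multisampled_render_to_texture extension.
        Apple: APPLE_framebuffer_multisample.
        Android: through EGL.
      Larger color/depth buffers required: depends on the GPU architecture.
        TBR (Mali, Qualcomm Adreno before the 300 series) and TBDR (PowerVR): no.
        IMR (NVIDIA Tegra; Qualcomm Adreno 300 series and later, which can switch between IMR and TBR): yes.
        With GL_EXT_multisampled_render_to_texture: also yes, depending on the hardware implementation (see "Enabling MSAA the right way in OpenGL ES").

    OpenGL ES 3.0
      MSAA support: (The technique is to sample all primitives multiple times at each pixel. The color sample values are resolved to a single, displayable color. For window system-provided framebuffers, this occurs each time a pixel is updated, so the antialiasing appears to be automatic at the application level. For application-created framebuffers, this must be requested by calling the BlitFramebuffer command (see section 4.3.3).)
        When rendering textures, emphasis is placed on multisample anti-aliasing (MSAA), which earlier hardware generations could only run against the framebuffer. OpenGL ES 3.0 can presently support MSAA-type rendering for a texture.
      Larger color/depth buffers required: for a system-provided framebuffer, same as OpenGL ES 2.0. For a user-created framebuffer, extra video memory is required (this may depend on the hardware implementation).

    OpenGL ES 3.1
      Custom shader resolve: yes (sampler2DMS)
      Larger color/depth buffers required: for a system-provided framebuffer, same as OpenGL ES 2.0. For a user-created framebuffer, extra video memory is required (this may depend on the hardware implementation).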

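    IMR vs. TBR vs. TBDR

    IMR (Immediate Mode Rendering)

    PCs today almost all use immediate mode rendering: the CPU submits render data and commands, and the GPU starts executing them. Processing depends very little on what has already been drawn or what will be drawn later (early Z being the exception). The flow is shown below:

    TBR (Tile-Based Rendering)

    TBR divides the screen into a series of small tiles and processes each one separately, which allows them to be processed in parallel. Because the GPU needs only part of the scene's data at any time, that data (color, depth, and so on) is small enough to fit in on-chip memory, which effectively reduces the number of accesses to system memory. The benefits are lower power consumption and lower bandwidth use, which in turn yield higher performance.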
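A rough sense of scale helps here. The Python sketch below computes how many tiles cover a render target and how much on-chip storage one tile needs. The 16x16 tile size and the 4 + 4 bytes per pixel are assumed values for illustration; actual tile sizes vary by GPU.

```python
import math

def tile_grid(width, height, tile_w, tile_h):
    """Number of tile columns and rows needed to cover the render target."""
    return math.ceil(width / tile_w), math.ceil(height / tile_h)

def tile_bytes(tile_w, tile_h, color_bytes=4, depth_bytes=4):
    """On-chip storage for one tile's color and depth/stencil data."""
    return tile_w * tile_h * (color_bytes + depth_bytes)

cols, rows = tile_grid(1920, 1080, 16, 16)  # 120 x 68 tiles
on_chip = tile_bytes(16, 16)                # 2048 bytes per tile
```

Under these assumptions a full-HD frame needs about 16 MB of color and depth storage in total, yet only 2 KB has to be resident on-chip at a time, which is what makes keeping the working set off the external memory bus feasible.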

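    Tiling

    TBDR (Tile-Based Deferred Rendering)

    TBDR is similar to TBR: it also tiles the screen and keeps data (color, depth, and so on) in on-chip memory. In addition, it uses a deferral technique called Hidden Surface Removal (HSR), which postpones texturing and shading until the visibility of every pixel in the tile has been determined, so only the pixels that are finally visible consume processing resources. Unnecessary work on hidden pixels is eliminated, which keeps bandwidth use and processing cycles per frame as low as possible, giving higher performance and lower power consumption.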
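The effect of deferring shading until visibility is known can be modeled by counting shader runs. This Python sketch is a deliberately simplified model for opaque geometry in one tile; the fragment list is hypothetical, and real hardware details such as early Z and blending are ignored.

```python
def shaded_fragments(fragments, deferred):
    """Count fragment shader runs for opaque fragments within one tile.

    fragments: list of (pixel_index, depth) in submission order.
    deferred=False models an immediate depth test with no sorting;
    deferred=True models shading after visibility is fully resolved.
    """
    if deferred:
        return len({pixel for pixel, _ in fragments})  # one run per visible pixel
    nearest = {}
    runs = 0
    for pixel, depth in fragments:
        if depth < nearest.get(pixel, float("inf")):   # passes the depth test
            nearest[pixel] = depth
            runs += 1                                  # shaded, maybe overdrawn later
    return runs

# Hypothetical tile: pixel 0 is drawn back to front three times, pixel 1 once.
frags = [(0, 0.9), (0, 0.5), (0, 0.1), (1, 0.4)]
immediate_runs = shaded_fragments(frags, deferred=False)  # 4
deferred_runs = shaded_fragments(frags, deferred=True)    # 2
```

Back-to-front submission is the worst case for the immediate model, since every fragment passes the depth test and is shaded before being overwritten.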

    A simple comparison of a traditional GPU and TBDR

     

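    MSAA on Mobile Platforms

    With this brief overview of mobile GPU architectures in hand, let's look at how MSAA is handled on mobile platforms, as shown below:

     

    As you can see, implementing MSAA on TBR or TBDR is much cheaper than on an IMR GPU, because much of the work is done directly on-chip. There are still two costs: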

 

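  • 4x MSAA needs four times the tile buffer memory. Because on-chip tile buffer memory is very expensive, the GPU works around this by reducing the tile size. A smaller tile size does affect performance, but halving the tile size does not halve performance; applications bottlenecked on the fragment shader see only a small impact.

 

  • The second cost is that more fragments are generated along object edges, which also happens on IMR. Each polygon covers more pixels, as shown in the figure below. In addition, where background and foreground primitives both contribute to the same edge pixel, both fragments must be shaded, so the hardware's hidden surface removal culls fewer pixels. The cost of these extra fragments depends on how many edges the scene contains, but 10% is a reasonable guess.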
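The first cost can be illustrated with a small calculation. This Python sketch assumes a fixed on-chip budget sized for a 16x16 tile at 8 bytes per sample; both numbers are illustrative and do not describe a specific GPU.

```python
def tile_pixels_for_budget(budget_bytes, samples, bytes_per_sample=8):
    """How many pixels fit in a fixed on-chip tile buffer."""
    return budget_bytes // (samples * bytes_per_sample)

BUDGET = 16 * 16 * 8  # assumed: sized for a 16x16 tile without MSAA

no_msaa = tile_pixels_for_budget(BUDGET, 1)  # 256 pixels, e.g. 16x16
msaa_4x = tile_pixels_for_budget(BUDGET, 4)  # 64 pixels, e.g. 8x8
```

With the budget held constant, 4x MSAA quarters the tile area, so four times as many tiles must be processed to cover the same render target.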

Implementation Details of Mainstream Mobile GPUs

Mali:

JUST22 - Multisampled resolve on-tile is supported in hardware with no bandwidth hit. Mali GPUs support resolving multisampled framebuffers on-tile. Combined with tile-buffer support for full throughput in 4x MSAA, this makes 4x MSAA a very compelling way of improving quality with minimal speed hit.

 

In GLES on Mali GPUs, the simplest case for 4xMSAA would be to render directly to the window surface (FB0), having set EGL_SAMPLES to 4. This will do all multisampling and resolving in the GPU registers, and will only flush the resolved buffer to memory. This is the most efficient way to implement MSAA on a Mali GPU, and comes at almost no performance cost compared to rendering to a normal window surface. Note that this does not expose the sample buffers themselves to you, and does not require an explicit resolve.
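The bandwidth argument can be put into rough numbers. The Python sketch below is a back-of-the-envelope model, not a measurement: it assumes 4 bytes of color per sample, no overdraw, no caching or compression, and an off-chip path that writes every sample, reads them back for the resolve, and writes the resolved image.

```python
def color_traffic_bytes(width, height, samples, on_tile_resolve, bytes_per_pixel=4):
    """Rough external-memory color traffic for one frame (no overdraw)."""
    pixels = width * height
    if on_tile_resolve:
        return pixels * bytes_per_pixel  # only resolved pixels leave the chip
    sample_bytes = pixels * samples * bytes_per_pixel
    # write all samples, read them back for the resolve, write the result
    return sample_bytes * 2 + pixels * bytes_per_pixel

off_chip = color_traffic_bytes(1920, 1080, 4, on_tile_resolve=False)  # 74,649,600
on_tile = color_traffic_bytes(1920, 1080, 4, on_tile_resolve=True)    # 8,294,400
```

Under these assumptions the on-tile resolve moves one ninth of the color data at 4x MSAA, and its traffic is the same as rendering without MSAA.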

 

Qualcomm Adreno:

Anti-aliasing is an important technique for improving the quality of generated images. It reduces the visual artifacts of rendering into discrete pixels. Among the various techniques for reducing aliasing effects, multisampling is efficiently supported by Adreno 4xx. Multisampling divides every pixel into a set of samples, each of which is treated like a "mini-pixel" during rasterization. Each sample has its own color, depth, and stencil value. And those values are preserved until the image is ready for display. When it is time to compose the final image, the samples are resolved into the final pixel color. Adreno 4xx supports the use of two or four samples per pixel.

PowerVR:

Another benefit of the SGX and SGX-MP architecture is the ability to perform efficient 4x Multi-Sample Anti-Aliasing (MSAA). MSAA is performed entirely on-chip, which keeps performance high without introducing a system memory bandwidth overhead (as would be seen when performing anti-aliasing in some other architectures). To achieve this, the tile size is effectively quartered and 4 sample positions are taken for each fragment (e.g., if the tile size is 16x16, an 8x8 tile will be processed when MSAA is enabled). The reduction in tile size ensures the hardware has sufficient memory to process and store colour, depth and stencil data for all of the sample positions. When the ISP operates on each tile, HSR and depth tests are performed for all sample positions. Additionally, the ISP uses a 1 bit flag to indicate if a fragment contains an edge. This flag is used to optimize blending operations later in the render. When the subsamples are submitted to the TSP, texturing and shading operations are executed on a per-fragment basis, and the resultant colour is set for all visible subsamples. This means that the fragment workload will only slightly increase when MSAA is enabled, as the subsamples within a given fragment may be coloured by different primitives when the fragment contains an edge. When performing blending, the edge flag set by the ISP indicates if the standard blend path needs to be taken, or if the optimized path can be used. If the destination fragment contains an edge, then the blend needs to be performed individually for each visible subsample to give the correct resultant colour (standard blend). If the destination fragment does not contain an edge, then the blend operation is performed once and the colour is set for all visible subsamples (optimized blend). Once a tile has been rendered, the Pixel Back End (PBE) combines the subsample colours for each fragment into a single colour value that can be written to the frame buffer in system memory. 
As this combination is done on the hardware before the colour data is sent, the system memory bandwidth required for the tile flush is identical to the amount that would be required when MSAA is not enabled.

 

 

On PowerVR hardware Multi-Sampled Anti-Aliasing (MSAA) can be performed directly in on-chip memory before being written out to system memory, which saves valuable memory bandwidth. In general, MSAA is considered to cost relatively little performance. This is true for typical games and UIs, which have low geometry counts but very complex shaders. The complex shaders typically hide the cost of MSAA and have a reduced blend workload. 2x MSAA is virtually free on most PowerVR graphics cores (Rogue onwards), while 4x MSAA+ will noticeably impact performance. This is partly due to the increased on-chip memory footprint, which results in a reduction in tile dimensions (for instance 32 x 32 -> 32 x 16 -> 16 x 16 pixels) as the number of samples taken increases. This in turn results in an increased number of tiles that need to be processed by the tile accelerator hardware, which then increases the vertex stages overall processing cost. The concept of "good enough" should be followed in determining how much anti-aliasing is enough. An application may only require 2x MSAA to look "good enough", while performing comfortably at a consistent 60 FPS. In some cases there may be no need for anti-aliasing to be used at all e.g. when the target device's display has high PPI (pixels per-inch). Performing MSAA becomes more costly when there is an alpha blended edge, resulting in the graphics core marking the pixels on the edge to "on edge blend". On edge blend is a costly operation, as the blending is performed for each sample by a shader (i.e. in software). In contrast, on opaque edge is performed by dedicated hardware, and is a much cheaper operation as a result. On edge blend is also "sticky", which means that once an on-screen pixel is marked, all subsequent blended pixels are blended by a shader, rather than by dedicated hardware. In order to mitigate these costs, submit all opaque geometry first, which keeps the pixels "off edge" for as long as possible. 
Also, developers should be extremely reserved with the use of blending, as blending has lots of performance implications, not just for MSAA.

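Conclusion

This article explained how MSAA works and how architectural differences between PC and mobile platforms lead to different implementation details. MSAA affects the rasterization, fragment shading, and raster operations stages of the GPU pipeline. Every subsample has its own color and depth storage, and every subsample goes through the depth test. On mobile platforms, whether extra space is needed to store color and depth depends on the OpenGL ES version and the specific hardware implementation. In the common case (no extra space for color and depth, subsamples computed entirely on-chip and resolved directly to the framebuffer), MSAA is more efficient on mobile than on PC, because the heavy bandwidth cost is avoided. Hardware implementations vary widely, though, so measure on real devices. My knowledge is limited and errors are inevitable; if you find one, please point it out so that others are not misled.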

 

References

  1. https://en.wikipedia.org/wiki/Aliasing
  2. https://en.wikipedia.org/wiki/Moir%C3%A9_pattern
  3. https://mynameismjp.wordpress.com/2012/10/21/applying-sampling-theory-to-real-time-graphics/
  4. https://en.wikipedia.org/wiki/Supersampling
  5. https://mynameismjp.wordpress.com/2012/10/24/msaa-overview/
  6. https://mynameismjp.wordpress.com/2012/10/28/msaa-resolve-filters/
  7. http://graphics.stanford.edu/courses/cs248-07/lectures/2007.10.11%20CS248-06%20Multisample%20Antialiasing/2007.10.11%20CS248-06%20Multisample%20Antialiasing.ppt
  8. https://msdn.microsoft.com/en-us/library/windows/desktop/cc627092(v=vs.85).aspx
  9. https://www.khronos.org/registry/OpenGL/specs/es/2.0/es_full_spec_2.0.pdf
  10. https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_multisampled_render_to_texture.txt
  11. https://developer.apple.com/library/content/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/WorkingwithEAGLContexts/WorkingwithEAGLContexts.html#//apple_ref/doc/uid/TP40008793-CH103-SW4
  12. https://stackoverflow.com/questions/27035893/antialiasing-in-opengl-es-2-0
  13. https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/
  14. https://www.imgtec.com/blog/understanding-powervr-series5xt-powervr-tbdr-and-architecture-efficiency-part-4/
  15. https://en.wikipedia.org/wiki/Tiled_rendering
  16. https://www.qualcomm.com/media/documents/files/the-rise-of-mobile-gaming-on-android-qualcomm-snapdragon-technology-leadership.pdf
  17. https://static.docs.arm.com/100019/0100/arm_mali_application_developer_best_practices_developer_guide_100019_0100_00_en2.pdf
  18. https://www.imgtec.com/blog/introducing-the-brand-new-opengl-es-3-0/
  19. https://www.khronos.org/assets/uploads/developers/library/2014-gdc/Khronos-OpenGL-ES-GDC-Mar14.pdf
  20. https://android.googlesource.com/platform/external/deqp/+/193f598/modules/gles31/functional/es31fMultisampleShaderRenderCase.cpp
  21. https://www.anandtech.com/show/4686/samsung-galaxy-s-2-international-review-the-best-redefined/15
  22. https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/
  23. http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-TileBasedArchitectures.pdf
  24. https://static.docs.arm.com/100019/0100/arm_mali_application_developer_best_practices_developer_guide_100019_0100_00_en2.pdf
  25. https://community.arm.com/graphics/f/discussions/4426/multisample-antialiasing-using-multisample-fbo
  26. http://cdn.imgtec.com/sdk-documentation/PowerVR+Series5.Architecture+Guide+for+Developers.pdf
  27. http://cdn.imgtec.com/sdk-documentation/PowerVR.Performance+Recommendations.pdf