Stalls Caused by GCD Usage in iOS Apps

Date: 2019-11-10

I have recently been investigating the various kinds of stalls that occur in iOS apps and how to fix them.

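Stalls happen in iOS apps far more often than most people would expect, especially in the flagship apps of large companies. One reason is that features keep piling up: product teams do not coordinate with each other, everyone is busy adding functionality, and system resources become a bottleneck. The other reason is that old devices are replaced slowly. iOS devices are very durable: quite a few iPhone 4S units are still in service, the iPhone 6, a known problem device, is still widely owned, and by some estimates devices older than the iPhone 6s account for as much as 40% of devices in use.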

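So if you add a stall-detection tool to a production app, you will find that stalls occur surprisingly often. Detecting and fixing them is not easy, though, mainly because they are hard to reproduce on development devices.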

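I previously wrote an article about monitoring main-thread stalls. The mainstream approach today seems to be to observe run loop event callbacks, check whether the interval between callbacks exceeds a threshold, and, if it does, record the call stacks of all threads in the app.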

A while ago I came across this call stack in the stall logs reported to our backend:

> 0 libsystem_kernel.dylib __workq_kernreturn
> 1 libsystem_pthread.dylib _pthread_workqueue_addthreads
> 2 libdispatch.dylib _dispatch_queue_wakeup_global_slow
> 3 libdispatch.dylib _dispatch_queue_wakeup_with_qos_slow
> 4 libdispatch.dylib dispatch_async

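In other words, the stall occurred inside dispatch_async. Based on my understanding of GCD at the time, dispatch_async should never stall: its main job is to take a worker thread from the system thread pool and run the block on that thread.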

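Yet the call stack above really did show up, and in quite a few samples. The topmost frame is clearly a kernel call. Judging from the function names, GCD was probably trying to get a thread from the pool, found that all existing threads were busy, and asked the kernel to create a new one. But is the kernel call that creates a thread really slow, slow enough to stall the main thread? With that question in mind I searched through a lot of material, and the most relevant piece I found was this article: http://newosxbook.com/articles/GCD.html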

It contains this passage:

This isn’t due to 10.9’s GCD being different - rather, it demonstrates the true asynchronous nature of GCD: The main thread has yet to return from requesting the worker (which it does by pthread_workqueue_addthreads_np, as I’ll describe later), and already the worker thread has spawned and is mid execution, possibly on another CPU core. The exact state of the main thread with respect to the worker is largely unpredictable.

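The author's view is that the thread GCD obtains may be one that is still busy with another task, and that the main thread has to wait for this busy thread to return before it can continue. I am skeptical of this explanation.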

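With nowhere else to turn, I decided to spend one of my precious TSIs (Technical Support Incidents) and ask Apple's engineers directly. It is worth mentioning that requesting technical support from Apple is a valuable and practical option: every developer account gets two incidents per year, and it is a pity to leave them unused.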

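After I sent the question, I received a reply from an engineer on Apple's kernel team. I am sharing a condensed version of the exchange here in question-and-answer form: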

Q: It looks like even with async dispatching, the main thread still has to wait for the other thread to return, during which time the other thread happens to be in mid-execution of something. This confuses me: what exactly is the main thread waiting for?

Why does the main thread have to wait for dispatch_async to return, and what exactly is it waiting for?

A: It’s hard to say with just a user space backtrace. Frame 0 has clearly sent the current thread into the kernel, and this specific kernel call is /way/ too complex to analyse from outside [1].

No answer can be drawn from a user-space call stack alone; the possible kernel states are too complex.

Q: I know it’s suggested that we create a limited number of serial queues and use target queues properly, but what could happen if we don’t follow that rule?

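Apple has long recommended that when you create your own serial GCD queues you keep their number under control and preferably set a target queue, otherwise problems will follow. I had always been curious what those problems actually are, so I took this opportunity to ask.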

* On macOS, where the system is happier to over commit, you end up with a thread explosion.  That in turn can lead to problems running out of memory, running out of Mach ports, and so on.

* On iOS, which is not happy about over committing, you find that the latency between a block being queued and it running can skyrocket.  This can, in turn, have knock-on effects.  For example, the last time I looked at a problem like this I found that `NSOperationQueue` was dispatching blocks to the global queue for internal maintenance tasks, so when one subsystem within the app consumed all the dispatch worker threads other subsystems would just stall horribly.

Note: In the context of dispatch, an “over commit” is where the system had to allocate more threads to a queue than there are CPU cores.  In theory this should never be necessary because work you dispatch to a queue should never block waiting for resources.  In practice it’s unavoidable because, at a minimum, the work you queue can end up blocking on the VM subsystem.

Despite this, it’s still best to structure your code to avoid the need for over committing, especially when the over commit doesn’t buy you anything.  For example, code like this:

group = dispatch_group_create();
for (url in urlsToFetch) {
    dispatch_group_enter(group);
    dispatch_async(dispatch_get_global_queue(…), ^{
        … fetch `url` synchronously …
        dispatch_group_leave(group);
    });
}
dispatch_group_wait(group, …);

is horrible because it ties up 10 dispatch worker threads for a very long time without any benefit.  And while this is an extreme example — from dispatch’s perspective, networking is /really/ slow — there are less extreme examples that are similarly problematic.  From dispatch’s perspective, even the disk drive is slow (-:

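This reply is interesting. Those who have read the GCD source will know that every queue GCD creates by default has a priority, and that each priority actually maps to two queues: if one is default-priority, the other is default-priority-overcommit. As I understand it, dispatch_async first puts the task on the default-priority queue and, if that queue is full, falls back to default-priority-overcommit.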

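On macOS, GCD allows overcommit, which means dispatch_async may end up creating a new thread each time. Even once the system is overcommitted, the excess threads simply compete for CPU time according to their priority.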

On iOS, GCD limits overcommit: once the queue for a given priority is overcommitted, the tasks queued behind it have to wait. CPU resources on mobile devices are tight, so this design makes sense.

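So if you create too many serial queues on iOS, tasks submitted later may be left waiting for a long time. This is why we need to strictly control the number of queues and their hierarchy. Ideally, each subsystem in the app is given only a fixed number of queues at fixed priorities, which avoids the situation where a thread explosion keeps code from running in time.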

Q: I know the system watchdog can kill an app if the main thread takes too long to respond. I have also heard rumors that there are two other cases that may get your app killed by the watchdog. The first is creating too many new threads, for example by carelessly dispatching work to the global concurrent queues. The second is keeping the CPU too busy, say at 100%, for too long. Does the watchdog kill the app in those cases too?

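I also took the opportunity to ask why the system watchdog force-kills apps, because there have long been rumors that, besides the main thread being unresponsive for too long, creating too many threads or running the CPU at overload for a long time can also get an app killed.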

A: I’m not aware of any specific watchdog check along those lines, but it’s not hard to imagine that the above-mentioned knock-on effects might jam up your app sufficiently for the watchdog to kill it for other reasons. Running the CPU for too long generates a crash report but it doesn’t actually kill the app. It’s essentially a ‘warning’ crash report about the problem.

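Creating too many threads does not directly cause the watchdog to kill the app, but it can keep the main thread from being serviced in time, so the app ends up killed for other reasons. Sustained CPU overload does not lead to a kill either. Instead, the system generates a report to warn the developer. I have indeed seen plenty of these ‘this is not a crash’ crash logs.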

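There were a few more questions and answers that are not directly related to my question, so I have left them out. To finish, here is one more reply that I found interesting. Before reading it, you may want to think about the question yourself: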

dispatch_async(myQueue, ^{
    // line A
});
// line B

Which runs first, line A or line B?

Consider a snippet like this:

dispatch_async(myQueue, ^{
    // line A
});
// line B

there’s clearly a race condition between lines A and B, that is, between the `dispatch_async` returning and the block running on the queue.  This can pan out in multiple ways, including:

* If `myQueue` (which we’re assuming is a serial queue) is busy, A has to wait so B will definitely run before A.

* If `myQueue` is empty, there’s no idle CPU, and `myQueue` has a higher priority than the thread that called `dispatch_async`, you could imagine the kernel switching the CPU to `myQueue` so that it can run A.

* The thread that called `dispatch_async` could run out of its time quantum after scheduling A on `myQueue` but before returning from `dispatch_async`, which again results in A running before B.

* If `myQueue` is empty and there’s an idle CPU, A and B could end up running simultaneously.

The Answer

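In the end I did not get the precise answer I was hoping for. Perhaps, as the reply says, there are many possible situations and they are too complex to infer the kernel's state from a single user-space call stack. Still, a few valuable points became clearer: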

Finding 1

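iOS is itself a system for scheduling and allocating resources. CPU, disk I/O, VM and so on are all scarce, and they affect one another. A main-thread stall may look like a CPU bottleneck, but the kernel may in fact be busy managing other resources: heavy disk reads and writes, or large amounts of memory allocation and reclamation, can each make this simple thread-creation kernel call stall: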

libsystem_kernel.dylib __workq_kernreturn

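So the only way forward is to analyze the call stacks of the individual threads yourself and, given the user scenario, work out which system resources were being consumed at the time. By going through recently committed code, I did find that some very expensive disk I/O tasks had been added (on a background thread, admittedly), and that these were what produced this seemingly unrelated call stack. After the change was reverted, the stall alerts disappeared.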

Finding 2

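Existing stall-detection tools can only dump the call stack once a timeout has occurred, but a timeout may be the combined effect of tasks A, B and C. A and B may be the truly expensive tasks, while C is cheap but happens to be the last one running, so it gets blamed as the culprit and A and B never appear in the reported logs. I have not yet come up with a particularly good solution. Clearly, libsystem_kernel.dylib __workq_kernreturn is exactly such a cheap task C.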