Designing a Thread Deadlock Detection Scheme for Android

Preface

Using multiple threads in a project is very common, but if it is handled badly and the code is poorly written, it can lead to thread deadlocks. Deadlock problems are difficult both to discover and to localize, and if the deadlock happens on a user's device in production it becomes even harder. It is therefore best for the project to have its own thread deadlock detection mechanism that detects and reports deadlocks automatically, so that all we need to do is analyze the uploaded logs.

How WatchDog Works

Before discussing how to design a complete thread deadlock detection scheme, let's first look at how the WatchDog in the Android system is implemented. WatchDog is essentially a deadlock detection thread running in the SystemServer process. Its job is to continuously check whether key services such as AMS and WMS have deadlocked; if they have, it stops the SystemServer process, at which point the Zygote process also kills itself and then restarts, which means the whole phone system restarts.

Next, let's walk through how WatchDog is implemented, via the source code.
  1. After the Zygote process starts the SystemServer process, the static main method of the SystemServer class is invoked:

    /**
      * The main entry point from zygote.
      */
     public static void main(String[] args) {
         new SystemServer().run();
     }
  2. Go straight to the run method:

    private void run() {
        ......
        // Start services.
        try {
            traceBeginAndSlog("StartServices");
            startBootstrapServices();
            startCoreServices();
            startOtherServices();
            SystemServerInitThreadPool.shutdown();
        } catch (Throwable ex) {
            Slog.e("System", "******************************************");
            Slog.e("System", "************ Failure starting system services", ex);
            throw ex;
        } finally {
            traceEnd();
        }
        ......
    }
  3. The startOtherServices method starts many key services, including AMS and WMS; WatchDog is also started there:

    private void startOtherServices() {
        ......
        traceBeginAndSlog("StartWatchdog");
        Watchdog.getInstance().start();
        traceEnd();
    }
  4. As you can see, WatchDog is designed as a singleton. WatchDog extends the Thread class, so the call to start here eventually launches a detection thread, which then executes its run method:

    @Override
     public void run() {
         boolean waitedHalf = false;
         while (true) {
             final List<HandlerChecker> blockedCheckers;
             final String subject;
             final boolean allowRestart;
             int debuggerWasConnected = 0;
             synchronized (this) {
                 long timeout = CHECK_INTERVAL;
                 // Make sure we (re)spin the checkers that have become idle within
                 // this wait-and-check interval
                 for (int i=0; i<mHandlerCheckers.size(); i++) {
                     HandlerChecker hc = mHandlerCheckers.get(i);
                     hc.scheduleCheckLocked();
                 }
    
                 if (debuggerWasConnected > 0) {
                     debuggerWasConnected--;
                 }
    
                 // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                 // wait while asleep. If the device is asleep then the thing that we are waiting
                 // to timeout on is asleep as well and won't have a chance to run, causing a false
                 // positive on when to kill things.
                 long start = SystemClock.uptimeMillis();
                 while (timeout > 0) {
                     if (Debug.isDebuggerConnected()) {
                         debuggerWasConnected = 2;
                     }
                     try {
                         wait(timeout);
                     } catch (InterruptedException e) {
                         Log.wtf(TAG, e);
                     }
                     if (Debug.isDebuggerConnected()) {
                         debuggerWasConnected = 2;
                     }
                     timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                 }
    
                 boolean fdLimitTriggered = false;
                 if (mOpenFdMonitor != null) {
                     fdLimitTriggered = mOpenFdMonitor.monitor();
                 }
    
                 if (!fdLimitTriggered) {
                     final int waitState = evaluateCheckerCompletionLocked();
                     if (waitState == COMPLETED) {
                         // The monitors have returned; reset
                         waitedHalf = false;
                         continue;
                     } else if (waitState == WAITING) {
                         // still waiting but within their configured intervals; back off and recheck
                         continue;
                     } else if (waitState == WAITED_HALF) {
                         if (!waitedHalf) {
                             // We've waited half the deadlock-detection interval.  Pull a stack
                             // trace and wait another half.
                             ArrayList<Integer> pids = new ArrayList<Integer>();
                             pids.add(Process.myPid());
                             ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                 getInterestingNativePids());
                             waitedHalf = true;
                         }
                         continue;
                     }
    
                     // something is overdue!
                     blockedCheckers = getBlockedCheckersLocked();
                     subject = describeCheckersLocked(blockedCheckers);
                 } else {
                     blockedCheckers = Collections.emptyList();
                     subject = "Open FD high water mark reached";
                 }
                 allowRestart = mAllowRestart;
             }
    
             // If we got here, that means that the system is most likely hung.
             // First collect stack traces from all threads of the system process.
             // Then kill this process so that the system will restart.
             EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
    
             ArrayList<Integer> pids = new ArrayList<>();
             pids.add(Process.myPid());
             if (mPhonePid > 0) pids.add(mPhonePid);
             // Pass !waitedHalf so that just in case we somehow wind up here without having
             // dumped the halfway stacks, we properly re-initialize the trace file.
             final File stack = ActivityManagerService.dumpStackTraces(
                     !waitedHalf, pids, null, null, getInterestingNativePids());
    
             // Give some extra time to make sure the stack traces get written.
             // The system's been hanging for a minute, another second or two won't hurt much.
             SystemClock.sleep(2000);
    
             // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
             doSysRq('w');
             doSysRq('l');
    
             // Try to add the error to the dropbox, but assuming that the ActivityManager
             // itself may be deadlocked.  (which has happened, causing this statement to
             // deadlock and the watchdog as a whole to be ineffective)
             Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                     public void run() {
                         mActivity.addErrorToDropBox(
                                 "watchdog", null, "system_server", null, null,
                                 subject, null, stack, null);
                     }
                 };
             dropboxThread.start();
             try {
                 dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
             } catch (InterruptedException ignored) {}
    
             IActivityController controller;
             synchronized (this) {
                 controller = mController;
             }
             if (controller != null) {
                 Slog.i(TAG, "Reporting stuck state to activity controller");
                 try {
                     Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                     // 1 = keep waiting, -1 = kill system
                     int res = controller.systemNotResponding(subject);
                     if (res >= 0) {
                         Slog.i(TAG, "Activity controller requested to coninue to wait");
                         waitedHalf = false;
                         continue;
                     }
                 } catch (RemoteException e) {
                 }
             }
    
             // Only kill the process if the debugger is not attached.
             if (Debug.isDebuggerConnected()) {
                 debuggerWasConnected = 2;
             }
             if (debuggerWasConnected >= 2) {
                 Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
             } else if (debuggerWasConnected > 0) {
                 Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
             } else if (!allowRestart) {
                 Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
             } else {
                 Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                 WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                 Slog.w(TAG, "*** GOODBYE!");
                 Process.killProcess(Process.myPid());
                 System.exit(10);
             }
    
             waitedHalf = false;
         }
     }

    As expected, this run method is an infinite loop that keeps running the detection logic. The detection logic itself can be broken into three parts:

    (1) Detect whether a deadlock has occurred.

    (2) If a deadlock has occurred, collect the information of all threads in SystemServer.

    (3) Kill the SystemServer process.

    Let's analyze how each of these three parts is implemented. First, how does WatchDog detect whether a deadlock has occurred? A few fields are involved, so let's look at them first:

    public class Watchdog extends Thread {
        ......
        final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
        final HandlerChecker mMonitorChecker;
        ......
    }

    HandlerChecker is the checker: one HandlerChecker object checks one thread, and since multiple threads need to be checked, an ArrayList of them is kept here. Internally, a HandlerChecker holds a Handler reference whose Looper is the Looper of the thread being checked, so this Handler can be used to post messages to that thread's Looper. In addition, a HandlerChecker holds a list of Monitors, which are the objects to be checked on that thread; for example, objects such as AMS and WMS on the UI thread are added to the Monitor list of the HandlerChecker that watches the UI thread, and the AMS and WMS classes both implement the Monitor interface. Their relationship is as follows:

    1 WatchDog --> n HandlerCheckers --> n threads being checked

    1 HandlerChecker --> n Monitors --> n objects being checked

    Back to the question of how deadlocks are detected: the HandlerChecker list is traversed, and for each HandlerChecker a task is posted to the head of the message queue via the Handler of the corresponding thread. When the task runs, it traverses the Monitor list and calls the monitor method on each Monitor object; the implementation of monitor is left to each concrete class. For example, AMS's monitor method:

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
     public void monitor() {
         synchronized (this) { }
     }

    WMS's monitor method:

    // Called by the heartbeat to ensure locks are not held indefnitely (for deadlock detection).
     @Override
     public void monitor() {
         synchronized (mWindowMap) { }
     }

    As you can see, their check logic is simply whether the lock can be acquired normally; if the lock is held by another thread the whole time, the call waits until the lock is obtained. Note that the lock used in AMS is this, while the lock used in WMS is mWindowMap. As long as this lock can be acquired normally, the service is fine and in a normal state. Conversely, if the lock cannot be acquired, it means the lock has been held by some thread for a long time for an unknown reason, other methods of the service that need this lock cannot return either, and the service is therefore in an abnormal state.

    Actually, besides waiting on a lock forever, the blocking here can also mean that the monitor method is never called at all: although the task that runs the monitor methods is posted to the front of the Looper's message queue, if an earlier message is stuck, the monitor method never gets a chance to run, and this case is also treated as a deadlock.
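    To make the mechanism concrete, here is a minimal sketch of the HandlerChecker idea (illustrative only, not the AOSP code; SimpleHandlerChecker and its members are made-up names). It posts a task to the front of the watched thread's message queue, the task runs every Monitor, and the watchdog thread can later ask whether the task completed:

    import android.os.Handler;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a per-thread checker, loosely modeled on Watchdog.HandlerChecker.
    public class SimpleHandlerChecker implements Runnable {
        public interface Monitor {
            void monitor();
        }

        private final Handler mHandler;                      // bound to the thread being watched
        private final List<Monitor> mMonitors = new ArrayList<>();
        private volatile boolean mCompleted = true;

        public SimpleHandlerChecker(Handler handler) {
            mHandler = handler;
        }

        public void addMonitor(Monitor monitor) {
            mMonitors.add(monitor);
        }

        // Called by the watchdog thread at the start of each check interval.
        public void scheduleCheck() {
            if (!mCompleted) {
                return; // the previous check never finished; the thread is probably stuck
            }
            mCompleted = false;
            // Post to the FRONT of the queue, so only an already-stuck message can delay us.
            mHandler.postAtFrontOfQueue(this);
        }

        // Runs on the watched thread: each monitor typically just tries to take its own lock.
        @Override
        public void run() {
            for (Monitor monitor : mMonitors) {
                monitor.monitor();
            }
            mCompleted = true;
        }

        // Called by the watchdog thread after waiting: false means the queue is stuck
        // or some monitor is blocked on a lock.
        public boolean isCompleted() {
            return mCompleted;
        }
    }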

    Moving on: after the monitor-execution tasks have been dispatched to all threads, the detection thread starts waiting, here for 30 seconds. When the wait ends, it inspects the execution state of all the monitor tasks: the ones that have finished are considered normal; the ones that never started or are still running indicate that some thread has deadlocked, and at that point it starts collecting the information of all threads:

    // If we got here, that means that the system is most likely hung.
     // First collect stack traces from all threads of the system process.
     // Then kill this process so that the system will restart.
     EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
    
     ArrayList<Integer> pids = new ArrayList<>();
     pids.add(Process.myPid());
     if (mPhonePid > 0) pids.add(mPhonePid);
     // Pass !waitedHalf so that just in case we somehow wind up here without having
     // dumped the halfway stacks, we properly re-initialize the trace file.
     final File stack = ActivityManagerService.dumpStackTraces(
             !waitedHalf, pids, null, null, getInterestingNativePids());
    
     // Give some extra time to make sure the stack traces get written.
     // The system's been hanging for a minute, another second or two won't hurt much.
     SystemClock.sleep(2000);

    The collected thread information is written to the /data/anr/traces.txt file.

    Finally, WatchDog kills SystemServer. This part is simple: it just calls Process's killProcess method.

    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
     WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
     Slog.w(TAG, "*** GOODBYE!");
     Process.killProcess(Process.myPid());
     System.exit(10);
Summary: the core of WatchDog is its deadlock detection logic: a dedicated thread runs an infinite loop checking whether the watched threads are blocked, and if they are, a deadlock is considered to have occurred.

Scheme Design

Designing a thread deadlock detection scheme mainly involves two aspects: the detection logic and the capture of thread information.

  • Detection logic

    • Scheme 1

      This is similar to WatchDog: run a thread in an infinite loop that checks whether the Loopers of the other threads are blocked. "Blocked" here covers two cases: the message queue is stuck, or a lock on a key object cannot be acquired. (A sketch of this scheme is given after this list.)

    • Scheme 2

      The infinite-loop thread periodically (e.g. every 5s) checks the state of all threads. If at least two threads have stayed in the BLOCKED state for too long (e.g. 3 minutes), a deadlock is assumed. This approach does not work for ReentrantLock, because a thread blocked on a ReentrantLock is reported as WAITING rather than BLOCKED. (A sketch of this scheme is also given after this list.)

      For an explanation of the six Java thread states, see: Java Thread States and Life Cycle

  • Thread information capture

    Because the capture happens inside the app, unlike AMS, there may be permission problems. We can directly capture the threads' stack traces, but those do not tell us which thread is holding which lock, which makes analyzing and localizing the problem harder. According to the article 《手Q Android線程死鎖監控與自動化分析實踐》, lock usage can be classified into three kinds: synchronized, wait/notify and ReentrantLock. For the first two, the system's /data/anr/traces.txt file records which thread holds which lock and which thread is waiting for which lock, so the app only needs to send a SIGQUIT signal to the process that is ANR-ing to trigger generation of the traces.txt file. For ReentrantLock, the lock's thread usage can only be recorded manually in code (a sketch of such bookkeeping is given after this list). After reading this, two questions remain:

    1. Some phones do not grant permission to read the /data/anr/traces.txt file, or the file is named something like traces-xxx; how do we find out that name?

      No solution has been found for this yet; if the file cannot be read, we are left with only the threads' stack traces.

    2. How does the app send the SIGQUIT signal to the process that is ANR-ing?

      This one is easy: android.os.Process already provides an API for it:

      Process.sendSignal(Process.myPid(), Process.SIGNAL_QUIT);

    Of course, in most cases the cause of a deadlock can be worked out from the call stacks alone, in which case the lock-holder information is not needed.
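The following is a minimal sketch of Scheme 1, assuming each watched thread is wrapped in a checker like the SimpleHandlerChecker sketched in the WatchDog section above; LooperWatchdog, its 30s interval and reportBlockedThread are made-up names and values for illustration, not a definitive implementation:

    import android.os.SystemClock;
    import java.util.ArrayList;
    import java.util.List;

    // Scheme 1 (sketch): a watchdog thread that pings other threads' Loopers.
    public class LooperWatchdog extends Thread {
        private static final long CHECK_INTERVAL_MS = 30_000;

        private final List<SimpleHandlerChecker> mCheckers = new ArrayList<>();

        public void addChecker(SimpleHandlerChecker checker) {
            mCheckers.add(checker);
        }

        @Override
        public void run() {
            while (true) {
                // 1. Schedule a check on every watched thread.
                for (SimpleHandlerChecker checker : mCheckers) {
                    checker.scheduleCheck();
                }
                // 2. Give the watched threads one interval to respond.
                SystemClock.sleep(CHECK_INTERVAL_MS);
                // 3. Any checker that has not completed means its Looper is stuck
                //    or one of its monitors cannot acquire a lock.
                for (SimpleHandlerChecker checker : mCheckers) {
                    if (!checker.isCompleted()) {
                        reportBlockedThread(checker);
                    }
                }
            }
        }

        private void reportBlockedThread(SimpleHandlerChecker checker) {
            // Dump stack traces / send SIGQUIT / upload a report here.
        }
    }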
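A minimal sketch of Scheme 2, using the example values from the text (a 5s scan interval and a 3-minute BLOCKED threshold); BlockedThreadScanner and onPossibleDeadlock are hypothetical names, and the sketch relies only on Thread.getAllStackTraces and Thread.getState:

    import android.os.SystemClock;
    import java.util.HashMap;
    import java.util.Map;

    // Scheme 2 (sketch): periodically scan all thread states and flag long-lived BLOCKED threads.
    public class BlockedThreadScanner extends Thread {
        private static final long SCAN_INTERVAL_MS = 5_000;          // scan every 5s
        private static final long BLOCKED_THRESHOLD_MS = 3 * 60_000; // 3 minutes

        // thread id -> timestamp when the thread was first seen in the BLOCKED state
        private final Map<Long, Long> mBlockedSince = new HashMap<>();

        @Override
        public void run() {
            while (true) {
                long now = SystemClock.uptimeMillis();
                Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
                int suspicious = 0;
                for (Thread t : stacks.keySet()) {
                    if (t.getState() == Thread.State.BLOCKED) {
                        Long since = mBlockedSince.get(t.getId());
                        if (since == null) {
                            mBlockedSince.put(t.getId(), now);
                        } else if (now - since > BLOCKED_THRESHOLD_MS) {
                            suspicious++;
                        }
                    } else {
                        mBlockedSince.remove(t.getId());
                    }
                }
                // At least two threads blocked for too long -> very likely a deadlock.
                if (suspicious >= 2) {
                    onPossibleDeadlock(stacks);
                }
                SystemClock.sleep(SCAN_INTERVAL_MS);
            }
        }

        private void onPossibleDeadlock(Map<Thread, StackTraceElement[]> stacks) {
            // Dump the captured stacks and/or trigger a SIGQUIT, then upload the report.
        }
    }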
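For the ReentrantLock case mentioned under thread information capture, here is a sketch of the manual bookkeeping, assuming a made-up TrackedLock wrapper (a real implementation would also need to cover tryLock and lockInterruptibly):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Records which thread currently holds each named ReentrantLock,
    // since traces.txt does not show ReentrantLock ownership.
    public class TrackedLock extends ReentrantLock {
        private static final Map<String, String> sOwners = new ConcurrentHashMap<>();

        private final String mName;

        public TrackedLock(String name) {
            mName = name;
        }

        @Override
        public void lock() {
            super.lock();
            // Record the owner only after the lock has actually been acquired.
            sOwners.put(mName, Thread.currentThread().getName());
        }

        @Override
        public void unlock() {
            // Clear the record only when the outermost hold is released (the lock is reentrant).
            if (getHoldCount() == 1) {
                sOwners.remove(mName);
            }
            super.unlock();
        }

        // Attach this snapshot to the deadlock report so we know who held each ReentrantLock.
        public static Map<String, String> snapshotOwners() {
            return new HashMap<>(sOwners);
        }
    }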
