Source Code Analysis of the Android System-Level Watchdog Mechanism

Date: 2019-12-05

Tags: android, system, watchdog, mechanism source code analysis. Category: Android

1: Why do we need a watchdog?

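I first came across the word "watchdog" in a university textbook on microcontrollers, which described the watchdog timer. In the early days of microcontrollers, outside interference could easily send a program off the rails, so the watchdog was introduced as a protection mechanism: the program has to "feed the dog" at regular intervals, and if it fails to do so, the watchdog triggers a reset. Roughly, the watchdog counter is started once the system is running and then counts on its own; if the counter is not cleared within a set time, it overflows, raises a watchdog interrupt, and resets the system.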
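To make the "feed the dog or get reset" idea concrete, here is a minimal sketch of a software watchdog timer in plain Java. The class WatchdogTimer and its methods are hypothetical names chosen for illustration; timestamps are passed in explicitly so the behavior is easy to follow.

```java
// Minimal software watchdog timer; all names here are illustrative.
final class WatchdogTimer {
    private final long timeoutMillis;
    private long lastFedAt;

    WatchdogTimer(long timeoutMillis, long now) {
        this.timeoutMillis = timeoutMillis;
        this.lastFedAt = now;
    }

    // "Feed the dog": restart the countdown from the given timestamp.
    void feed(long now) {
        lastFedAt = now;
    }

    // True once the timeout has elapsed since the last feed,
    // which is the point where a hardware watchdog would reset the system.
    boolean hasExpired(long now) {
        return now - lastFedAt >= timeoutMillis;
    }

    public static void main(String[] args) {
        WatchdogTimer dog = new WatchdogTimer(1000, 0);
        dog.feed(800);
        System.out.println(dog.hasExpired(1500)); // false: only 700 ms since the last feed
        System.out.println(dog.hasExpired(1800)); // true: 1000 ms since the last feed
    }
}
```

A real watchdog would act on expiry (reset or restart) instead of just reporting it.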

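A phone is really just an extremely powerful microcontroller: it runs N times faster, has N times the storage, runs many threads, and has all kinds of hardware and software working together. Better safe than sorry: what if the system deadlocks, or heavy interference sends a program off the rails? Either could end very badly, so we need a watchdog mechanism here too.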

2: The Android system-level watchdog

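Watchdogs come in hardware and software flavors. A hardware watchdog is the timer circuit found on a microcontroller; a software watchdog is a similar mechanism that we implement ourselves. To keep the system stable, Android has its own software watchdog: it monitors many system services to make sure they keep working, triggers a restart when a core service misbehaves, and saves the state at the time of failure.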

Next, let's look at how Android's Watchdog is designed.

Note: this article is based on the Android 6.0 source code.

The Watchdog source lives at: frameworks/base/services/core/java/com/android/server/Watchdog.java

Watchdog is initialized from SystemServer: /frameworks/base/services/java/com/android/server/SystemServer.java

SystemServer initializes Watchdog like this:

            Slog.i(TAG, "Init Watchdog");
            final Watchdog watchdog = Watchdog.getInstance();
            watchdog.init(context, mActivityManagerService);

Watchdog then goes through the following initialization: first the constructor, then init():

    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
    }

    public void init(Context context, ActivityManagerService activity) {
        mResolver = context.getContentResolver();
        mActivity = activity;
        // Register the receiver that handles reboot requests
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

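From the source, however, we can see that Watchdog extends Thread, so it also has to be started somewhere. That is done by the line below, which runs as part of ActivityManagerService's systemReady() flow.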

Watchdog.getInstance().start();

TAG: HandlerChecker

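The code above uses an important class, HandlerChecker. Watchdog uses it to check the foreground thread, the main thread, the UI thread, the I/O thread, and the display thread. It is not long, so here it is in full. The idea is to use the MessageQueue of each Handler's Looper to judge whether that thread is stuck. All of these threads, of course, run inside the SystemServer process.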

public final class HandlerChecker implements Runnable {
88        private final Handler mHandler;
89        private final String mName;
90        private final long mWaitMax;
91        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
92        private boolean mCompleted;
93        private Monitor mCurrentMonitor;
94        private long mStartTime;
95
96        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
97            mHandler = handler;
98            mName = name;
99            mWaitMax = waitMaxMillis;
100            mCompleted = true;
101        }
102
103        public void addMonitor(Monitor monitor) {
104            mMonitors.add(monitor);
105        }
106        // Record the start time of the current check
107        public void scheduleCheckLocked() {
108            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
109                // If the target looper has recently been polling, then
110                // there is no reason to enqueue our checker on it since that
111                // is as good as it not being deadlocked. This avoid having
112                // to do a context switch to check the thread. Note that we
113                // only do this if mCheckReboot is false and we have no
114                // monitors, since those would need to be executed at this point.
115                mCompleted = true;
116                return;
117            }
118
119            if (!mCompleted) {
120                // we already have a check in flight, so no need
121                return;
122            }
123
124            mCompleted = false;
125            mCurrentMonitor = null;
126            mStartTime = SystemClock.uptimeMillis();
127            mHandler.postAtFrontOfQueue(this);
128        }
129
130        public boolean isOverdueLocked() {
131            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
132        }
133        // Get the completion state
134        public int getCompletionStateLocked() {
135            if (mCompleted) {
136                return COMPLETED;
137            } else {
138                long latency = SystemClock.uptimeMillis() - mStartTime;
139                if (latency < mWaitMax/2) {
140                    return WAITING;
141                } else if (latency < mWaitMax) {
142                    return WAITED_HALF;
143                }
144            }
145            return OVERDUE;
146        }
147
148        public Thread getThread() {
149            return mHandler.getLooper().getThread();
150        }
151
152        public String getName() {
153            return mName;
154        }
155
156        public String describeBlockedStateLocked() {
157            if (mCurrentMonitor == null) {
158                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
159            } else {
160                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
161                        + " on " + mName + " (" + getThread().getName() + ")";
162            }
163        }
164
165        @Override
166        public void run() {
167            final int size = mMonitors.size();
168            for (int i = 0 ; i < size ; i++) {
169                synchronized (Watchdog.this) {
170                    mCurrentMonitor = mMonitors.get(i);
171                }
172                mCurrentMonitor.monitor();
173            }
174
175            synchronized (Watchdog.this) {
176                mCompleted = true;
177                mCurrentMonitor = null;
178            }
179        }
180    }

In the code above, one key call is

mHandler.getLooper().getQueue().isPolling()

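This method is implemented in MessageQueue, and I have pasted it here. Its comment says that it returns whether the looper's thread is currently polling for more work, which is a good signal that the loop is still alive. In the HandlerChecker source, if this returns true and the checker has no monitors, scheduleCheckLocked() marks the check as completed and returns right away.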

    /**
     * Returns whether this looper's thread is currently polling for more work to do.
     * This is a good signal that the loop is still alive rather than being stuck
     * handling a callback.  Note that this method is intrinsically racy, since the
     * state of the loop can change before you get the result back.
     *
     * <p>This method is safe to call from any thread.
     *
     * @return True if the looper is currently polling for events.
     * @hide
     */
    public boolean isPolling() {
        synchronized (this) {
            return isPollingLocked();
        }
    }

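Otherwise (the looper is busy, or the checker has monitors to run), the checker posts itself to the front of the queue and sets mCompleted to false, marking that a message has been sent and is waiting to be handled. If the looper is not blocked, the checker's own run() method is called very soon.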

What does run() do? It is the method at line 166 of the HandlerChecker source above: it iterates over its monitors and calls monitor() on each one (monitors are covered below). If any monitor blocks, mCompleted stays false.
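The "post yourself and see whether you get run" trick can be tried outside Android. The sketch below is plain Java under stated assumptions: a single-threaded ExecutorService stands in for a Handler and its Looper thread, and QueueProbe and demo() are hypothetical names, not AOSP code. Unlike postAtFrontOfQueue, the executor appends the probe to the back of the queue, which is enough to show the idea.

```java
// Plain-Java sketch; QueueProbe and demo() are illustrative names, not AOSP code.
final class QueueProbe implements Runnable {
    // Single-threaded executor standing in for a Handler bound to one Looper thread.
    private final java.util.concurrent.ExecutorService target;
    // volatile: written by the worker thread, read by the checking thread.
    private volatile boolean completed = true;

    QueueProbe(java.util.concurrent.ExecutorService target) {
        this.target = target;
    }

    // Call from one checking thread only (the real code holds the Watchdog lock instead).
    void scheduleCheck() {
        if (!completed) {
            return; // a probe is already in flight
        }
        completed = false;
        target.execute(this); // the real code uses mHandler.postAtFrontOfQueue(this)
    }

    boolean isCompleted() {
        return completed;
    }

    @Override
    public void run() {
        completed = true; // reached only if the worker thread got around to us
    }

    // Returns {completed while the worker is stuck, completed after it is released}.
    static boolean[] demo() {
        java.util.concurrent.ExecutorService worker =
                java.util.concurrent.Executors.newSingleThreadExecutor();
        java.util.concurrent.CountDownLatch release =
                new java.util.concurrent.CountDownLatch(1);
        try {
            worker.execute(() -> {
                try {
                    release.await(); // simulate a handler stuck in a callback
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            QueueProbe probe = new QueueProbe(worker);
            probe.scheduleCheck();
            boolean whileStuck = probe.isCompleted();
            release.countDown();
            worker.shutdown();
            worker.awaitTermination(5, java.util.concurrent.TimeUnit.SECONDS);
            return new boolean[] { whileStuck, probe.isCompleted() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("while stuck: " + result[0] + ", after release: " + result[1]);
    }
}
```

While the worker is stuck the probe stays incomplete; once the worker is released the probe runs and flips the flag, which is the signal the watchdog relies on.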

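When the watchdog later asks for the completion state of such a checker, the call falls into the else branch, computes how long the check has been pending, and returns the corresponding state code.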

        // Get the completion state
        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }
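To see the thresholds with actual numbers, here is a small self-contained sketch of the same decision as a pure function. The numeric constant values and the 60-second waitMax are assumptions made for illustration (the article's later comments imply a 60-second timeout); CompletionState.of is a hypothetical helper, not part of Watchdog.

```java
// Pure-function version of the completion-state decision; names and values are illustrative.
final class CompletionState {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    static int of(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;      // less than half the timeout has passed
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;  // between half and the full timeout
        }
        return OVERDUE;          // the full timeout has passed
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // assumed 60 s timeout
        System.out.println(of(false, 10_000, waitMax)); // 1 (WAITING)
        System.out.println(of(false, 30_000, waitMax)); // 2 (WAITED_HALF)
        System.out.println(of(false, 60_000, waitMax)); // 3 (OVERDUE)
        System.out.println(of(true, 90_000, waitMax));  // 0 (COMPLETED)
    }
}
```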

OK, at this point we know how a stuck thread is detected:

MessageQueue.isPolling
Monitor.monitor

TAG: Monitor

    public interface Monitor {
        void monitor();
    }

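Monitor is an interface, and quite a few classes implement it. For example, here is what my search turned up:

[Figure: search results listing the classes that implement Monitor]

Look how many classes implement this interface. We don't even need to guess: they all register themselves with Watchdog. Where do they register? The code below shows it.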

        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }

So any class that implements the interface only needs to call the method above. Let's see how ActivityManagerService does it. Its path is: /frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

2381        Watchdog.getInstance().addMonitor(this);

19655    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
19656    public void monitor() {
19657        synchronized (this) { }
19658    }

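As we can see, AMS implements the interface and registers itself with Watchdog at line 2381. Its monitor() method merely synchronizes on itself to make sure it is not deadlocked. That is not much work, but it is enough: it lets an outside observer find out whether AMS is dead.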
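Why is an empty synchronized block enough? The plain-Java sketch below illustrates it: if some thread is stuck while holding the service's lock, monitor() cannot return, and whoever called it can notice the delay. LockProbe, Service, and returnsWithin are hypothetical names for this illustration; the real Watchdog runs monitors on the foreground thread's handler, not on a helper thread as done here.

```java
// Plain-Java illustration of the "try to take the lock" monitor; not AOSP code.
final class LockProbe {
    interface Monitor {
        void monitor();
    }

    static final class Service implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) { } // returns only once the lock could be acquired
        }
    }

    // Runs monitor() on a helper thread and reports whether it returned within waitMillis.
    static boolean returnsWithin(Monitor m, long waitMillis) {
        Thread probe = new Thread(m::monitor, "probe");
        probe.setDaemon(true);
        probe.start();
        try {
            probe.join(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !probe.isAlive();
    }

    // Returns {monitor returned while the lock was held, monitor returned after release}.
    static boolean[] demo() {
        Service service = new Service();
        boolean whileHeld;
        synchronized (service) { // simulate a thread that is stuck while holding the lock
            whileHeld = returnsWithin(service, 200);
        }
        boolean afterRelease = returnsWithin(service, 2000);
        return new boolean[] { whileHeld, afterRelease };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("while held: " + result[0] + ", after release: " + result[1]);
    }
}
```

The monitor does no real work; the time it takes to return is the measurement.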

Now that we know how a deadlock in another service is detected, let's see how Watchdog's run() method ties the whole mechanism together.

TAG: Watchdog.run

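run() is an infinite loop: it keeps iterating over all HandlerCheckers, schedules a check on each, waits 30 seconds, and then evaluates their state. See the comments below for details: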

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                // Here we iterate over every HandlerChecker and schedule its check,
                // which records the start time
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                // Wait 30 seconds. uptimeMillis() is used so that time spent asleep is not
                // counted; while the device sleeps, the system services sleep too
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                // Evaluate the checkers: this iterates over every HandlerChecker and
                // takes the largest state returned. That state is one of four values.
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // COMPLETED: the message has been handled and the thread is not blocked
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // WAITING: the check has been pending for under 30 seconds; keep going
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    // WAITED_HALF: pending for 30 to 59 seconds; the thread may be blocked,
                    // so the current stack traces are dumped via ActivityManagerService
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }
                // OVERDUE: pending for 60 seconds or more. Reaching this point means the
                // 60-second timeout has expired; everything below handles that timeout
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            // ....... (saving of the various records elided)

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, " at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
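The body of evaluateCheckerCompletionLocked() is not shown in this article; the comments only say that it walks every HandlerChecker and takes the largest state. A minimal sketch of that idea, together with the action run() takes for each state, might look like the following. The class WatchdogVerdict, the methods worstState and action, and the numeric constant values are assumptions made for illustration.

```java
// Sketch of "take the worst state across all checkers"; names and values are illustrative.
final class WatchdogVerdict {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // The overall state is the maximum over the per-checker states.
    static int worstState(int[] checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    // Summary of what the run() loop does for each overall state.
    static String action(int state) {
        switch (state) {
            case COMPLETED:
                return "reset waitedHalf and loop again";
            case WAITING:
                return "keep waiting";
            case WAITED_HALF:
                return "dump stack traces once, then keep waiting";
            default:
                return "log the blocked checkers and kill the system process";
        }
    }

    public static void main(String[] args) {
        int[] states = { COMPLETED, WAITED_HALF, WAITING };
        int worst = worstState(states);
        System.out.println(worst + ": " + action(worst)); // 2: dump stack traces once, then keep waiting
    }
}
```

Taking the maximum means that a single blocked thread is enough to move the whole watchdog toward a restart, no matter how healthy the other threads are.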