[erlang]一次erlcron崩潰引發的事故分析

時間 2019-11-10

標籤 erlang 一次 erlcron 崩潰引發事故分析简体版

原文原文鏈接

事故背景

因爲誤操做在erlcron設置了一個超過3個月後的定時任務。而後次日以後發現天天的daily reset沒有被執行，一些定時任務也沒有被執行。瞬間感受整我的都很差了，怎麼無故端就不執行了呢。html

經過排查日誌，發現瞭如下報錯：git

2016-03-22 16:54:32.014 [error] gen_server ecrn_control terminated with reason: no case clause matching {ok,[<0.14123.1577>,<0.13079.1576>,<0.25254.1569>,<0.13402.1577>,...]} in ecrn_control:internal_cancel/1 line 111
2016-03-22 16:54:32.015 [error] CRASH REPORT Process ecrn_control with 0 neighbours exited with reason: no case clause matching {ok,[<0.14123.1577>,<0.13079.1576>,<0.25254.1569>,<0.13402.1577>,...]} in ecrn_control:internal_cancel/1 line 111 in gen_server:terminate/6 line 744

我擦，ecrn_control都崩了，怎麼回事。github

找到具體出錯的代碼：web

internal_cancel(AlarmRef) ->
    case ecrn_reg:get(AlarmRef) of
        undefined ->
            undefined;
        {ok, [Pid]} ->
            ecrn_agent:cancel(Pid)
    end.

發現調用ecrn_reg:get(AlarmRef)被返回了{ok, List}，並且這個List的數據遠不止一個。明顯在設置那個超過3個月的定時任務的時候，ecrn_reg被註冊進了髒數據。express

事故重現

先設置幾個正常的定時任務

erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).
erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).
erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).

查看observer:start() 能夠看到進程樹以下：canvas

再設置一個4294968秒以後的定時任務

erlcron:cron({{once, 4294968}, {io, fwrite, ["Hello, world!~n"]}}).

結果就gg了，好多崩潰信息是否是：ruby

22:49:16.818 [error] CRASH REPORT Process <0.5822.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.818 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5822.64> exit with reason timeout_value in context child_terminated
22:49:16.819 [error] CRASH REPORT Process <0.5701.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5701.64> exit with reason timeout_value in context child_terminated
22:49:16.821 [error] CRASH REPORT Process <0.6237.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.6237.64> exit with reason timeout_value in context child_terminated
22:49:16.821 [error] CRASH REPORT Process <0.5862.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5862.64> exit with reason timeout_value in context child_terminated

...(總共有25條)

再看一下進程數：markdown

我擦，爲毛原來的 scrn_agent 進程也沒有了。app

能夠發現，erlcron 在嘗試了25次設置這個定時任務以後，也就是 scrn_agent 崩潰了25次以後，原來設置的三個正常的定時任務的scrn_agent 進程也沒有掉了。也就是說，不但我新設置的定時任務沒有成功，並且我原來正常的定時任務也沒有掉了。ide

再看一下崩潰日誌裏面的崩掉的進程號，每個都是不同的。能夠推算其實原來的報錯ecrn_reg:get(AlarmRef)獲取到了多個Pid，其實就是這裏插入失敗的定時任務產生的25個Pid。也就是說，雖然ecrn_agent進程崩潰了，可是ecrn_reg仍是保存了這些Pid。因此在取消這些定時任務的時候，ecrn_reg:get(AlarmRef)返回的內容在internal_cancel(AlarmRef)沒有被匹配到。

爲何是4294968，實際上是2^32

爲何設置了4294968秒後的定時任務就崩潰了。這個數估計不少人很熟悉，2^32=4294967296，而4294968000也就是恰好大於2^32。即，若是設置的定時任務超過了2^32毫秒，在erlcron裏面就不支持了。

查看gen_server:loop的源碼，找到引發崩潰的代碼：

loop(Parent, Name, State, Mod, hibernate, Debug) ->
    proc_lib:hibernate(?MODULE,wake_hib,[Parent, Name, State, Mod, Debug]);
loop(Parent, Name, State, Mod, Time, Debug) ->
    Msg = receive
          Input ->
            Input
      after Time ->
          timeout
      end,
    decode_msg(Msg, Parent, Name, State, Mod, Time, Debug, false).

能夠發現引發崩潰的，358行是一段receive代碼。也就是說receive是不支持超過2^32大小的。

自測了一下，的確若是receive的after後面若是是大於等於2^32的數值就會出現bad receive timeout value的報錯。查看官方解釋，已經明確說明不能大於32位大小。

ExprT is to evaluate to an integer. The highest allowed value is 16#FFFFFFFF, that is, the value must fit in 32 bits. receive..after works exactly as receive, except that if no matching message has arrived within ExprT milliseconds, then BodyT is evaluated instead. The return value of BodyT then becomes the return value of the receive..after expression.

引用自：http://erlang.org/doc/reference_manual/expressions.html

再回到erlcron，在 ecrn_agent:start_link的時候，ecrn_agent:init執行完ecrn_reg:register(JobRef, self())返回{ok, NewState, Millis}到gen_server以後，Millis若是超過2^32在gen_server:loop就會引發gen_server的timeout_value異常退出。

%% @private
init([JobRef, Job]) ->
    State = #state{job=Job,
                   alarm_ref=JobRef},
    {DateTime, Actual} = ecrn_control:datetime(),
    NewState = set_internal_time(State, DateTime, Actual),
    case until_next_milliseconds(NewState, Job) of
        {ok, Millis} when is_integer(Millis) ->
            ecrn_reg:register(JobRef, self()),
            {ok, NewState, Millis};
        {error, _}  ->
            {stop, normal}
    end.

最後

這坑踩的，有點鬱悶。其實這跟erlcron也不要緊，也不是gen_server的問題。而是erlang自身receive不支持2^32引發的。繼續往下查其實能夠發現，再往下是其它語言寫的了。

-module(prim_eval).

%% This module is simply a stub which abstract code gets included in the result
%% of compilation of prim_eval.S, to keep Dialyzer happy.

-export(['receive'/2]).

-spec 'receive'(fun((term()) -> nomatch | T), timeout()) -> T.
'receive'(_, _) ->
    erlang:nif_error(stub).

與君共勉

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。