因爲誤操做在erlcron
設置了一個超過3個月後的定時任務。而後次日以後發現天天的daily reset沒有被執行,一些定時任務也沒有被執行。瞬間感受整我的都很差了,怎麼無故端就不執行了呢。html
經過排查日誌,發現瞭如下報錯:git
2016-03-22 16:54:32.014 [error] gen_server ecrn_control terminated with reason: no case clause matching {ok,[<0.14123.1577>,<0.13079.1576>,<0.25254.1569>,<0.13402.1577>,...]} in ecrn_control:internal_cancel/1 line 111 2016-03-22 16:54:32.015 [error] CRASH REPORT Process ecrn_control with 0 neighbours exited with reason: no case clause matching {ok,[<0.14123.1577>,<0.13079.1576>,<0.25254.1569>,<0.13402.1577>,...]} in ecrn_control:internal_cancel/1 line 111 in gen_server:terminate/6 line 744
我擦,ecrn_control
都崩了,怎麼回事。github
找到具體出錯的代碼:web
internal_cancel(AlarmRef) -> case ecrn_reg:get(AlarmRef) of undefined -> undefined; {ok, [Pid]} -> ecrn_agent:cancel(Pid) end.
發現調用ecrn_reg:get(AlarmRef)
被返回了{ok, List},並且這個List的數據遠不止一個。明顯在設置那個超過3個月的定時任務的時候,ecrn_reg
被註冊進了髒數據。express
erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}). erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}). erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).
查看observer:start()
能夠看到進程樹以下:canvas
erlcron:cron({{once, 4294968}, {io, fwrite, ["Hello, world!~n"]}}).
結果就gg了,好多崩潰信息是否是:ruby
22:49:16.818 [error] CRASH REPORT Process <0.5822.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358 22:49:16.818 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5822.64> exit with reason timeout_value in context child_terminated 22:49:16.819 [error] CRASH REPORT Process <0.5701.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358 22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5701.64> exit with reason timeout_value in context child_terminated 22:49:16.821 [error] CRASH REPORT Process <0.6237.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358 22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.6237.64> exit with reason timeout_value in context child_terminated 22:49:16.821 [error] CRASH REPORT Process <0.5862.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358 22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5862.64> exit with reason timeout_value in context child_terminated ...(總共有25條)
再看一下進程數:markdown
我擦,爲毛原來的 scrn_agent 進程也沒有了。app
能夠發現,erlcron 在嘗試了25次設置 這個定時任務以後,也就是 scrn_agent 崩潰了25次以後,原來設置的三個正常的定時任務的scrn_agent 進程也沒有掉了。 也就是說,不但我新設置的定時任務沒有成功,並且我原來正常的定時任務也沒有掉了。ide
再看一下崩潰日誌裏面的崩掉的進程號,每個都是不同的。能夠推算其實原來的報錯ecrn_reg:get(AlarmRef)
獲取到了多個Pid,其實就是這裏插入失敗的定時任務產生的25個Pid。也就是說,雖然ecrn_agent
進程崩潰了,可是ecrn_reg
仍是保存了這些Pid。因此在取消這些定時任務的時候,ecrn_reg:get(AlarmRef)
返回的內容在internal_cancel(AlarmRef)
沒有被匹配到。
爲何設置了4294968
秒後的定時任務就崩潰了。這個數估計不少人很熟悉,2^32=4294967296
,而4294968000
也就是恰好大於2^32
。即,若是設置的定時任務超過了2^32
毫秒,在erlcron
裏面就不支持了。
查看gen_server:loop
的源碼,找到引發崩潰的代碼:
loop(Parent, Name, State, Mod, hibernate, Debug) -> proc_lib:hibernate(?MODULE,wake_hib,[Parent, Name, State, Mod, Debug]); loop(Parent, Name, State, Mod, Time, Debug) -> Msg = receive Input -> Input after Time -> timeout end, decode_msg(Msg, Parent, Name, State, Mod, Time, Debug, false).
能夠發現引發崩潰的,358行是一段receive
代碼。也就是說receive
是不支持超過2^32
大小的。
自測了一下,的確若是receive
的after
後面若是是大於等於2^32
的數值就會出現bad receive timeout value
的報錯。查看官方解釋,已經明確說明不能大於32位
大小。
ExprT is to evaluate to an integer. The highest allowed value is 16#FFFFFFFF, that is, the value must fit in 32 bits. receive..after works exactly as receive, except that if no matching message has arrived within ExprT milliseconds, then BodyT is evaluated instead. The return value of BodyT then becomes the return value of the receive..after expression.
引用自:http://erlang.org/doc/reference_manual/expressions.html
再回到erlcron
, 在 ecrn_agent:start_link
的時候,ecrn_agent:init
執行完ecrn_reg:register(JobRef, self())
返回{ok, NewState, Millis}
到gen_server
以後,Millis若是超過2^32
在gen_server:loop
就會引發gen_server
的timeout_value
異常退出。
%% @private init([JobRef, Job]) -> State = #state{job=Job, alarm_ref=JobRef}, {DateTime, Actual} = ecrn_control:datetime(), NewState = set_internal_time(State, DateTime, Actual), case until_next_milliseconds(NewState, Job) of {ok, Millis} when is_integer(Millis) -> ecrn_reg:register(JobRef, self()), {ok, NewState, Millis}; {error, _} -> {stop, normal} end.
這坑踩的,有點鬱悶。其實這跟erlcron
也不要緊,也不是gen_server
的問題。而是erlang
自身receive
不支持2^32引發的。繼續往下查其實能夠發現,再往下是其它語言寫的了。
-module(prim_eval). %% This module is simply a stub which abstract code gets included in the result %% of compilation of prim_eval.S, to keep Dialyzer happy. -export(['receive'/2]). -spec 'receive'(fun((term()) -> nomatch | T), timeout()) -> T. 'receive'(_, _) -> erlang:nif_error(stub).
與君共勉