Understanding Erlang/OTP Supervisor

http://www.cnblogs.com/me-sa/archive/2012/01/10/erlang0030.html

Supervisors are used to build an hierarchical process structure called a supervision tree, a nice way to structure a fault tolerant application.
--Erlang/OTP Doc

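     The basic idea of a supervisor is to isolate and manage errors by building a hierarchy; concretely, it keeps its child processes alive by restarting them. If a supervisor is part of a process tree, it is terminated automatically by its own supervisor: when its supervisor tells it to shut down, it terminates all of its children in the reverse of their start order and finally terminates itself. The purpose of a restart is to bring the system back to a stable state; once it is back in a stable state, an operation that fails again can be retried. If initialization itself is unstable, the monitor-and-restart strategy that follows is of little use. In other words, the initialization phase of an application needs reliability guarantees. It may read configuration files or load and recover data from a database, and even if that takes a while, it should run synchronously to completion. If the application depends on a non-local database or an external service, a faster asynchronous start is acceptable, because such services fail regularly during normal operation anyway, so it matters little whether they come up a bit earlier or a bit later.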

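   In "[Erlang 0025] Understanding Erlang/OTP - Application" we used the log4erl project to study Erlang/OTP applications, and noted that the application's start function starts log4erl's top-level supervision tree. Today we follow on from there: we look at how log4erl's supervision tree is built, and run an experiment to see how the supervisor restores service by restarting. This is the process tree after starting with application:start(log4erl). (figure: log4erl process tree)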

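Below is the start_link function from the log4erl_sup file. supervisor:start_link executes synchronously and does not return until all child processes have started. supervisor:start_link uses the callback function init/1.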

start_link(Default_logger) ->
    R = supervisor:start_link({local, ?MODULE}, ?MODULE, []),
    %log4erl:start_link(Default_logger),
    add_logger(Default_logger),
    ?LOG2("Result in supervisor is ~p~n",[R]),
    R.

%% Callback function init/1
init([]) ->
    {ok,
     {
      {one_for_one,3,10},
      []
     }
    }.
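  The initialization of log4erl's top-level supervisor is quite simple: it only defines the restart strategy (RestartStrategy) and the maximum restart frequency: {one_for_one,3,10}.
 {one_for_one,3,10} has the form {How, Max, Within}: how to restart (How, the restart strategy), and how many restarts (Max) are tolerated within how many seconds (Within). The maximum restart frequency exists to avoid an endless restart loop. Once more than Max restarts occur within Within seconds, the supervisor terminates all of its children and then itself, and the exit propagates up the process tree, so the supervisor above takes appropriate action: it either restarts the terminated supervisor or terminates itself as well. Choosing these values can be a struggle, and most references will tell you that "it depends entirely on your application". There are rules of thumb, though: one from production environments is 4 restarts per hour, and you can also look at how open-source projects similar to your application configure them. Writing {one_for_one,0,1} means that no restart is allowed; in the examples below you can see that the YAWS project uses 0, 1 in this way.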
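The {How, Max, Within} rule can be sketched as a small pure function. This is only a simplified illustration of the bookkeeping, not the OTP implementation; the module name restart_window and the function add_restart/4 are made up for this example, and timestamps are assumed to be plain integers in seconds.

```erlang
-module(restart_window).
-export([add_restart/4]).

%% Hypothetical helper (not part of OTP): record a restart at time Now
%% and decide whether a supervisor following the rule should give up.
%% Restarts is the list of earlier restart timestamps; MaxR and MaxT
%% play the roles of Max and Within in {How, Max, Within}.
add_restart(Now, Restarts, MaxR, MaxT) ->
    %% Keep only the restarts that still fall inside the time window.
    Recent = [T || T <- [Now | Restarts], Now - T =< MaxT],
    case length(Recent) > MaxR of
        true  -> {shutdown, Recent};  %% more than MaxR restarts within MaxT
        false -> {ok, Recent}
    end.
```

With {one_for_one,3,10}, three restarts inside ten seconds are tolerated and the fourth returns shutdown; with 0, 1 the very first restart already does.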
 
Here are the init functions of the top-level supervisors of a few open-source projects:
%% rabbit_sup.erl, from the well-known RabbitMQ
init([]) ->
    {ok, {{one_for_all, 0, 1}, []}}.


%% yaws_sup.erl, the Yaws project - Yet Another Web Server
init([]) ->
    ChildSpecs = child_specs(),
    %% 0, 1 means that we never want supervisor restarts
    {ok,{{one_for_all, 0, 1}, ChildSpecs}}.


%% ejabberd_sup, the ejabberd project
init([]) ->
    Hooks =
        {ejabberd_hooks,
         {ejabberd_hooks, start_link, []},
    %% ......................... code omitted
    {ok, {{one_for_one, 10, 1},
          [Hooks,
           GlobalRouter,
           Cluster,
           ..................
           Listener]}}.
 
Restart strategies
 
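 one_for_one: children are treated as independent of each other; when one process fails, the others are not affected by the crash. Only the child that died is restarted.
 one_for_all: if one child terminates, all the other children are terminated too, and then all of them are restarted.
 rest_for_one: if one child terminates, the children started after it are terminated as well. The terminated child and the children shut down along with it are then restarted.
 simple_one_for_one: a simplified version of one_for_one in which all children are dynamically added instances of the same kind of process.

 one_for_one maintains a list of children ordered by start order, whereas simple_one_for_one, because all of its children are the same (same MFA), uses a dict to keep the child information: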
Note: one of the big differences between one_for_one and simple_one_for_one is that one_for_one holds a list of all the children it has (and had, if you don't clear it), started in order, while simple_one_for_one holds a single definition for all its children and works using a dict to hold its data. Basically, when a process crashes, the simple_one_for_one supervisor will be much faster when you have a large number of children.
 
Note: it is important to note that  simple_one_for_one  children are  not respecting this rule with the  Shutdown time. In the case of  simple_one_for_one, the supervisor will just exit and it will be left to each of the workers to terminate on their own, after their supervisor is gone.
 
For the most part, writing a simple_one_for_one supervisor is similar to writing any other type of supervisor, except for one thing. The argument list in the {M,F,A} tuple is not the whole thing, but is going to be appended to what you call it with when you do supervisor:start_child(Sup, Args). That's right, supervisor:start_child/2 changes API. So instead of doing supervisor:start_child(Sup, Spec), which would call erlang:apply(M,F,A), we now have supervisor:start_child(Sup, Args), which calls erlang:apply(M,F,A++Args).
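A minimal sketch of a simple_one_for_one supervisor may make the argument handling concrete. Everything here is invented for illustration (the module sofo_demo, the function start_worker/2, the atoms fixed and extra), and the worker is a bare process rather than a gen_server to keep the example short. Per the OTP documentation, the list given to supervisor:start_child/2 is appended to the fixed arguments from the child specification.

```erlang
-module(sofo_demo).
-behaviour(supervisor).
-export([start_link/0, start_worker/2, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% A single shared child definition; [fixed] is the A in {M, F, A}.
    Child = {worker,
             {?MODULE, start_worker, [fixed]},
             temporary, brutal_kill, worker, [?MODULE]},
    {ok, {{simple_one_for_one, 3, 10}, [Child]}}.

%% Invoked by the supervisor as apply(?MODULE, start_worker, [fixed] ++ [Extra]).
start_worker(Fixed, Extra) ->
    %% The spawn happens in the supervisor process, so the worker is linked to it.
    Pid = proc_lib:spawn_link(fun() -> loop({Fixed, Extra}) end),
    {ok, Pid}.

%% The worker only answers with the arguments it was started with.
loop(Args) ->
    receive
        {args, From} ->
            From ! {args, Args},
            loop(Args)
    end.
```

After sofo_demo:start_link(), calling supervisor:start_child(sofo_demo, [extra]) starts a worker holding {fixed, extra}.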

After the top-level supervisor is started in start_link of log4erl_sup.erl, a default logger is added with add_logger(Default_logger):
add_logger(Name) when is_atom(Name) ->
    N = atom_to_list(Name),
    add_logger(N);
add_logger(Name) when is_list(Name) ->
    C1 = {Name,
          {log_manager, start_link ,[Name]},
          permanent,
          10000,
          worker,
          [log_manager]},

    ?LOG2("Adding ~p to ~p~n",[C1, ?MODULE]),
    supervisor:start_child(?MODULE, C1).
The added logger is a child of log4erl_sup; how a child is started and supervised is specified through its child specification.
 C1 =  {Name,{log_manager, start_link ,[Name]},permanent,10000,worker,[log_manager]}
 
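The six elements of C1 are {ID, Start, Restart, Shutdown, Type, Modules}:
ID: used internally by the supervisor to tell specifications apart, so it only has to be unique among the child specifications.
Start: the start function, {M,F,A}
Restart: whether the process is restarted after it terminates
                permanent: always restarted, whatever the reason it terminated
                temporary: never restarted
                transient: restarted only when it terminates abnormally
Shutdown: how the process is terminated. An integer value, such as the 10000 used here, means the process gets that many milliseconds to clean up and exit on its own before it is killed by force.
              In practice the supervisor sends the child exit(Pid,shutdown) and waits for the exit signal to come back; if none arrives within the given time, it kills the child with exit(Child,kill).
             Another possible value is brutal_kill, which means the process is killed immediately.
             infinity: use infinity when the child is itself a supervisor, so that it has enough time to shut down its own subtree.
Type: there are only two values, supervisor and worker; any process that does not implement the supervisor behaviour is a worker.
                    A hierarchy of supervisors allows finer-grained control over processes. The main purpose of this value is to tell the supervising process whether its child is a supervisor or a worker.
Modules: the modules the process depends on. This information is used only during hot code upgrades, where it marks which modules have to be updated and in what order; usually it is enough to list the main module the process depends on. If the child is a supervisor, gen_server or gen_fsm, Modules is a one-element list whose element is the name of the callback module; if the child is a gen_event, Modules is dynamic. Yu Feng has written a dedicated analysis of the dynamic value, "Erlang supervisor規格的dynamic行爲分析" (an analysis of the dynamic behaviour of supervisor specifications): http://blog.yufeng.info/archives/1455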
 
Modules is a list of one element, the name of the callback module used by the child behavior. The exception to that is when you have callback modules whose identity you do not know beforehand (such as event handlers in an event manager). In this case, the value of  Modules should be  dynamic so that the whole OTP system knows who to contact when using more advanced features, such as  releases.
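For comparison, the same six fields can also be written as a map, a form the supervisor module accepts on OTP 18 and later (an assumption about the OTP version in use; log4erl itself uses the tuple form shown above). The module childspec_demo and its two functions are invented for this sketch.

```erlang
-module(childspec_demo).
-export([tuple_spec/1, map_spec/1]).

%% The tuple form used by log4erl_sup:add_logger/1.
tuple_spec(Name) ->
    {Name,
     {log_manager, start_link, [Name]},
     permanent,
     10000,
     worker,
     [log_manager]}.

%% The same specification as a map; each key names one of the six fields.
map_spec(Name) ->
    #{id       => Name,
      start    => {log_manager, start_link, [Name]},
      restart  => permanent,
      shutdown => 10000,
      type     => worker,
      modules  => [log_manager]}.
```

supervisor:check_childspecs/1 can be used to check that either form is syntactically valid without starting anything.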
 
 
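In real use, several loggers are added to log4erl according to the business logic, and we do not start it directly with application:start(log4erl). Instead we call log4erl:conf("log4erl.conf"), a function that simply wraps the internal logic and actually calls log4erl_conf:conf(File). Here we define a simple log4erl.conf file, start with log4erl:conf("log4erl.conf"). and then look at what the process tree looks like:
%% log4erl.conf file, contents slightly condensed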
%%mod
logger default_logger{
     file_appender default_app{
    dir = "./log", level = debug, file = default_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
     }
}

%%mail mod
logger mail_logger{
     file_appender mail_app{
     dir = "./log", level = debug, file = mail_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
     }
}
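The corresponding process tree looks like this; the red lines between processes denote links: (figure: process tree with default_logger and mail_logger)

Following the call chain, we step through the code: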

%==== File : log4erl_conf =======

%%log4erl_conf:conf(File).
conf(File) ->
    application:start(log4erl), %% start log4erl
    Tree = parse(leex(File)), %% parse the configuration file
    traverse(Tree). %% walk the configuration entries and build the supervision tree

%% Following the traversal: each configuration entry is handled by element/1
traverse([]) ->
    ok;
traverse([H|Tree]) ->
    element(H),
    traverse(Tree).

%% Our custom loggers take the {logger, Logger, Appenders} clause
element({cutoff_level, CutoffLevel}) ->
    log_filter_codegen:set_cutoff_level(CutoffLevel);
element({default_logger, Appenders}) ->
    appenders(Appenders);
element({logger, Logger, Appenders}) ->
    log4erl:add_logger(Logger),
    appenders(Logger, Appenders).

%==== File : log4erl =======
%% Following further, we arrive at log4erl:add_logger/1
add_logger(Logger) ->
    try_msg({add_logger, Logger}).

%% try_msg is a general-purpose wrapper that adds exception handling
try_msg(Msg) ->
    try
        handle_call(Msg)
    catch
        exit:{noproc, _M} ->
            io:format("log4erl has not been initialized yet. To do so, please run~n"),
            io:format("> application:start(log4erl).~n"),
            {error, log4erl_not_started};
        E:M ->
            ?LOG2("Error message received by log4erl is ~p:~p~n",[E, M]),
            {E, M}
    end.

%% Fragment of handle_call
handle_call({add_logger, Logger}) ->
    log_manager:add_logger(Logger);

%==== File : log_manager =======
%% The logic moves on to log_manager's add_logger(Logger)
%% which finally calls log4erl_sup:add_logger(Logger), analyzed above
add_logger(Logger) ->
    log4erl_sup:add_logger(Logger).

%% After adding the logger, element/1 adds its appenders
appenders([]) ->
    ok;
appenders([H|Apps]) ->
    appender(H),
    appenders(Apps).

appenders(_, []) ->
    ok;
appenders(Logger, [H|Apps]) ->
    appender(Logger, H),
    appenders(Logger, Apps).

appender({appender, App, Name, Conf}) ->
    log4erl:add_appender({App, Name}, {conf, Conf}).

appender(Logger, {appender, App, Name, Conf}) ->
    log4erl:add_appender(Logger, {App, Name}, {conf, Conf}).


%==== File : log4erl =======
%% Appender = {Appender, Name}
add_appender(Logger, Appender, Conf) ->
    try_msg({add_appender, Logger, Appender, Conf}).

handle_call({add_appender, Logger, Appender, Conf}) ->
    log_manager:add_appender(Logger, Appender, Conf);

%==== File : log_manager =======
add_appender(Logger, {Appender, Name} , Conf) ->
    ?LOG2("add_appender ~p with name ~p to ~p with Conf ~p ~n",[Appender, Name, Logger, Conf]),
    log4erl_sup:add_guard(Logger, Appender, Name, Conf).

%==== File : log4erl_sup =======
add_guard(Logger, Appender, Name, Conf) ->
    C = {Name,
         {logger_guard, start_link ,[Logger, Appender, Name, Conf]},
         permanent,
         10000,
         worker,
         [logger_guard]},
    ?LOG2("Adding ~p to ~p~n",[C, ?MODULE]),
    supervisor:start_child(?MODULE, C).

%==== File : logger_guard =======
start_link(Logger, Appender, Name, Conf) ->
    %?LOG2("starting guard for logger ~p~n",[Logger]),
    {ok, Pid} = gen_server:start_link(?MODULE, [Appender, Name], []),
    case add_sup_handler(Pid, Logger, Conf) of
        {error, E} ->
            gen_server:call(Pid, stop),
            {error, E};
        _R ->
            {ok, Pid}
    end.

add_sup_handler(G_pid, Logger, Conf) ->
    ?LOG("add_sup()~n"),
    gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).

handle_call({add_sup_handler, Logger, Conf}, _From, [{appender, Appender, Name}] = State) ->
    ?LOG2("Adding handler ~p with name ~p for ~p From ~p~n",[Appender, Name, Logger, _From]),
    try
        Res = gen_event:add_sup_handler(Logger, {Appender, Name}, Conf),
        {reply, Res, State}
    catch
        E:R ->
            {reply, {error, {E,R}}, State}
    end;

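gen_event:add_sup_handler supervises the connection between the event manager and the event handler by linking the event manager to the calling process (here the logger_guard). So let's modify the code, comment this part out, and see what the supervision tree looks like:

add_sup_handler(G_pid, Logger, Conf) ->

%    ?LOG("add_sup()~n"),
%    gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).
    ok.

With this commented out, we can see that the link between logger and guard no longer exists.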

 
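Experiment: killing processes
 
Next we kill processes in various ways and watch how the process tree handles the failures. The steps of the experiment:
1. Send an exit signal with Reason some_reason to default_logger
2. Send an exit signal with Reason kill to default_logger
3. Send an exit signal with Reason some_reason to logger_guard
4. Send an exit signal with Reason some_reason to log4erl_sup
5. Send an exit signal with Reason kill to log4erl_sup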
3> whereis(default_logger).
<0.45.0>
4> exit(whereis(default_logger),some_reason).
true
5> whereis(default_logger).
<0.45.0>
6> exit(whereis(default_logger),some_reason). %% gen_event sets process_flag(trap_exit, true) by default, so the some_reason exit signal does not kill it
true
7> whereis(default_logger).
<0.45.0>
8> exit(whereis(default_logger),kill). %% send the process a forced exit signal
true

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === % first we see log4erl_sup report that a child terminated
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.45.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === % log4erl_sup recreates default_logger; the new process pid is <0.69.0>
supervisor: {local,log4erl_sup}
started: [{pid,<0.69.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === % default_logger's exit is turned into killed and propagated to its linked processes; the corresponding logger_guard terminates
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.46.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf, [{dir,"./log"},{level,debug},{file,default_log},{type,size},
{max,1000000},{suffix,log}, {rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === % logger_guard is recreated
supervisor: {local,log4erl_sup}
started: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug}, {file,default_log},{type,size},
{max,1000000}, {suffix,log},{rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]
9> whereis(default_logger).
<0.69.0>
10> is_process_alive(pid(0,70,0)). % this is the newly started logger_guard process
true
11> exit(pid(0,70,0),some_reason). % send the process an exit signal
true

=SUPERVISOR REPORT==== 10-Jan-2012::11:07:51 ===
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: some_reason
Offender: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

12>
=PROGRESS REPORT==== 10-Jan-2012::11:07:51 ===
supervisor: {local,log4erl_sup}
started: [{pid,<0.76.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},{child_type,worker}]
12> is_process_alive(pid(0,70,0)).
false
13> whereis(default_logger). % the propagated exit signal had no effect on default_logger
<0.69.0>
14> whereis(log4erl_sup).
<0.44.0>
15> exit(whereis(log4erl_sup),some_reason). % a supervisor also sets process_flag(trap_exit, true) when it initializes
true
16> whereis(log4erl_sup).
<0.44.0>
17> exit(whereis(log4erl_sup),kill). % killing log4erl_sup stops the application
true

=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.69.0>
registered_name: default_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.76.0>,killed}]
links: [#Port<0.1891>,#Port<0.1885>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 720
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.47.0>
registered_name: mail_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.48.0>,killed}]
links: [#Port<0.546>]
dictionary: []
trap_exit: true
status: running
heap_size: 377
stack_size: 24
reductions: 411
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:25 ===
crasher:
initial call: application_master:init/4
pid: <0.42.0>
registered_name: []
exception exit: killed
in function application_master:terminate/2
ancestors: [<0.41.0>]
messages: []
links: [<0.6.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 1555
neighbours:
18>
=INFO REPORT==== 10-Jan-2012::13:26:25 ===
application: log4erl
exited: killed
type: temporary
18>
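The difference between some_reason and kill seen in the session above can be reproduced without log4erl. The following is a minimal sketch with a bare process standing in for default_logger; the module name trap_demo and the messages ready and trapped are invented for the example.

```erlang
-module(trap_demo).
-export([run/0]).

%% Returns {TrappedReason, AliveAfterSomeReason, ReasonAfterKill}.
run() ->
    Parent = self(),
    Pid = spawn(fun() ->
                    process_flag(trap_exit, true),
                    Parent ! ready,          %% trap_exit is set from here on
                    loop(Parent)
                end),
    receive ready -> ok end,
    %% An ordinary exit signal is converted into an {'EXIT', From, Reason} message.
    exit(Pid, some_reason),
    Trapped = receive {trapped, R} -> R after 1000 -> timeout end,
    Alive = is_process_alive(Pid),
    %% kill cannot be trapped; the process dies with reason killed.
    Ref = erlang:monitor(process, Pid),
    exit(Pid, kill),
    Down = receive {'DOWN', Ref, process, Pid, Why} -> Why after 1000 -> timeout end,
    {Trapped, Alive, Down}.

loop(Parent) ->
    receive
        {'EXIT', _From, Reason} ->
            Parent ! {trapped, Reason},
            loop(Parent)
    end.
```

Calling trap_demo:run() returns {some_reason, true, killed}: the first signal arrives as a message and the process survives, while kill terminates it with reason killed, the same reason shown in the supervisor reports.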


 Finally, here is the address of the log4erl project once more: http://code.google.com/p/log4erl/ . I recommend downloading the code and running the experiments above yourself.
