sigsuspend() never returns: why was the asynchronous SIGIO signal intercepted?

Keywords: fcntl, fasync, signal, sigsuspend, pthread_sigmask, trace events

 

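This article is mostly a record of the troubleshooting process, so it contains a fair amount of redundancy, but it also shows the methods and reasoning used along the way.

In short, the problem is this: the snap thread is scheduled out between pthread_sigmask() and sigsuspend(), and the interrupt sends SIGIO during that window.

At that moment the snap thread has SIGIO blocked, so the kernel picks another thread in the process to handle the signal.

On the return to user space, AiApp runs the SIGIO handler. snap is never woken and stays in sigsuspend().

The fix is to bind SIGIO delivery to the snap thread itself rather than to the whole process (thread group) that snap belongs to, so that SIGIO is sent only to snap.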

    /* enable asynchronous notification */
    owner_ex.pid = syscall(SYS_gettid);
    owner_ex.type = F_OWNER_TID;
    fcntl(enc->fd_enc, F_SETOWN_EX, &owner_ex);
    //fcntl(enc->fd_enc, F_SETOWN, syscall(SYS_gettid));  /* this thread will receive SIGIO */
    oflags = fcntl(enc->fd_enc, F_GETFL);
    fcntl(enc->fd_enc, F_SETFL, oflags | FASYNC);   /* set ASYNC notification flag */

 

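The rest of the article first describes the problem, then records the investigation and the root-cause analysis, and finally gives the fix.

1. Problem description

A thread named snap is created and waits for SIGIO in sigsuspend(). The signal is sent from the kernel by the handler of interrupt 4 and is meant for the snap thread.

The key APIs are reviewed first, followed by a short description of the observed problem.

1.1 Key APIs

1.1.1 fasync and kill_fasync()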

void kill_fasync(struct fasync_struct **fp, int sig, int band)

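fasync lets a driver send SIGIO to an application asynchronously through kill_fasync(); the application uses fcntl() to register itself as the recipient of SIGIO for the file.

When an interrupt fires or data arrives, the driver calls kill_fasync() to send SIGIO, and the application runs its SIGIO handler once the signal is delivered.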

int fasync_helper(int fd, struct file * filp, int on, struct fasync_struct **fapp)

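fasync_helper() is the kernel helper a driver uses to set up its fasync queue; it allocates the entry and sets its attributes.

In practice the kernel side needs to:

1. Add a struct fasync_struct pointer to the device's private data structure.

2. Implement the fasync member of struct file_operations, usually by calling the kernel's fasync_helper().

3. Call kill_fasync() wherever the application must be notified (for example in the interrupt handler) to send SIGIO.

4. In the release member of struct file_operations, call the driver's fasync method as fasync(-1, filp, 0) to remove the file from the notification list.

The application needs to:

1. Call fcntl(fd, F_SETOWN, getpid()) to make a process the owner of the file, so that the kernel knows which process to send SIGIO to. To target one thread and keep other threads in the process from handling the signal, use fcntl(fd, F_SETOWN_EX, &f_owner_ex) instead.

2. Set the FASYNC flag on the file: fcntl(fd, F_SETFL, f_flags | FASYNC). This makes the kernel call the driver's fasync member of struct file_operations.

3. Call signal() or sigaction() to install a handler for SIGIO.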

 

1.1.2 sigaction()

#include <signal.h>
int sigaction(int sig, const struct sigaction *act, struct sigaction *oldact);

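The sigaction() system call is an alternative to signal() for setting the disposition of a signal.

The sig argument is the number of the signal whose disposition is to be retrieved or changed. It can be any signal except SIGKILL and SIGSTOP.

act points to a structure describing the new disposition. oldact points to a structure of the same type and is used to return the previous disposition.

The sa_handler field of struct sigaction specifies the handler function for the signal.

For details, see The Linux Programming Interface, section 20.13.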

 

1.1.3 sigemptyset()

int sigemptyset(sigset_t *set);
int sigfillset(sigset_t *set);

sigemptyset() initializes a signal set to contain no members.

sigfillset() initializes a signal set to contain all signals.

 

1.1.4 sigaddset()

int sigaddset(sigset_t *set, int sig);
int sigdelset(sigset_t *set, int sig);

sigaddset() and sigdelset() add a single signal to a set or remove one from it.

 

1.1.5 pthread_sigmask()

int pthread_sigmask(int how, const sigset_t *set, sigset_t *oldset);

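A newly created thread inherits a copy of its creator's signal mask.

A thread can use pthread_sigmask() to change and/or retrieve its current signal mask.

Apart from operating on the thread's signal mask, pthread_sigmask() is used in exactly the same way as sigprocmask().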

int sigprocmask(int how, const sigset_t *set, sigset_t *oldset);

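For each process (on Linux, for each thread) the kernel maintains a signal mask: the set of signals whose delivery is currently blocked.

If a blocked signal is sent to a process, its delivery is deferred until the signal is removed from the mask, that is, until it is unblocked.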

 

1.1.6 sigsuspend()

int sigsuspend(const sigset_t *mask);

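sigsuspend() replaces the signal mask with the set pointed to by mask and then suspends execution until a signal is caught and its handler returns.

Once the handler returns, sigsuspend() restores the signal mask to its value before the call.

Calling sigsuspend() is equivalent to performing the following operations atomically:

sigprocmask(SIG_SETMASK, &mask, &prevMask);
pause();
sigprocmask(SIG_SETMASK, &prevMask, NULL);

The atomicity closes the race window between the first sigprocmask() and pause(): without it, a signal arriving in that window would be handled before pause() starts, and pause() could then sleep indefinitely.

For a detailed explanation, see The Linux Programming Interface, section 22.9.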

 

1.2 發現問題

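In the code below, HandleSIGIO() registers the disposition for SIGIO, and sig_handler() is the SIGIO handler.

EWLWaitHwRdy() is what the snap thread uses to synchronize with the interrupt: it waits for SIGIO.

In a normal run the log order is B->C->A->D. When the problem occurs the log stops at B->C->A and D never appears. Log A shows that the signal sent by kill_fasync() in the interrupt was not lost.

So the snap thread is stuck in sigsuspend(), yet log A shows that SIGIO was received and handled. This is the puzzle.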

static volatile sig_atomic_t sig_delivered = 0;

/* SIGIO handler */
static void sig_handler(int signal_number)
{
    sig_delivered++;
//    printf("sig_handler func sig_delivered is %d\n",sig_delivered);  /* log A */
//    fflush(stdout);
}

void HandleSIGIO(hx280ewl_t * enc)
{
    struct sigaction sa;

    /* asynchronous notification handler */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sig_handler;
    sa.sa_flags |= SA_RESTART;  /* restart of system calls */
    sigaction(SIGIO, &sa, NULL);

    /* EWLInit might be called in a separate thread */
    /* we want to register the encoding thread for SIGIO */
    enc->sigio_needed = 1;  /* register for SIGIO in EWLEnableHW */
}

i32 EWLWaitHwRdy(const void *inst, u32 *slicesReady)
{
    hx280ewl_t *enc = (hx280ewl_t *) inst;
    u32 prevSlicesReady = 0;

//    PTRACE("EWLWaitHw: Start\n");
   
    printf("EWLWaitHw: Start\n");
    fflush(stdout);

    /* Check invalid parameters */
    if(enc == NULL)
    {
        assert(0);
        return EWL_HW_WAIT_ERROR;
    }

        sigset_t set, oldset;

        sigemptyset(&set);
        sigaddset(&set, SIGIO);

        if (slicesReady)
        {
...
        }
        else
        {
        
            /* Wait for frame ready signal (SIGIO) */
//            printf("######## sigsuspend() %d, oldset=%08x-%08x, set=%08x-%08x\n",sig_delivered, oldset.__val[1], oldset.__val[0], set.__val[1], set.__val[0]);  /* log B */
//            fflush(stdout);
            pthread_sigmask(SIG_BLOCK, &set, &oldset);
            while(!sig_delivered)
            {
//                printf("Before sigsuspend() %d, oldset=%08x-%08x, set=%08x-%08x\n",sig_delivered, oldset.__val[1], oldset.__val[0], set.__val[1], set.__val[0]);  /* log C */
//                fflush(stdout);
                sigsuspend(&oldset);  /* sleeps here; in the failing case it is never woken correctly */
//                printf("After sigsuspend() %d, oldset=%08x-%08x, set=%08x-%08x\n",sig_delivered, oldset.__val[1], oldset.__val[0], set.__val[1], set.__val[0]);  /* log D */
//                fflush(stdout);
            }
            sig_delivered = 0;
            pthread_sigmask(SIG_UNBLOCK, &set, NULL);
        }

    asic_status = enc->pRegBase[1]; /* update the buffered asic status */
...
    return EWL_OK;
}

 

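2. Investigation

From the above, the problem is roughly located: sigsuspend() does not return, so the snap thread stays blocked and the pipeline stops.

The investigation therefore starts with the relationship between the signal and the thread.

2.1 Relationship between SIGIO and the snap thread

SIGIO has already been sent and its handler has already run.

SIGIO generation and delivery are traced with the signal_generate and signal_deliver events, both filtered with "sig==29".

The snap thread is traced with sched_wakeup and sched_switch, restricted to waking snap, switching to snap, and switching away from snap.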

echo > /sys/kernel/debug/tracing/trace
echo 0 > /sys/kernel/debug/tracing/events/enable
echo "sig==29" > /sys/kernel/debug/tracing/events/signal/signal_deliver/filter
echo "sig==29" > /sys/kernel/debug/tracing/events/signal/signal_generate/filter
echo 1 > /sys/kernel/debug/tracing/events/signal/enable
echo "prev_comm==snap || next_comm==snap" > /sys/kernel/debug/tracing/events/sched/sched_switch/filter
echo "comm==snap" > /sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
cat /sys/kernel/debug/tracing/trace_pipe > temp.txt

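The first part of the trace below is the normal flow: sched_wakeup (wake snap) -> signal_generate (send SIGIO to snap) -> sched_switch (switch to snap) -> signal_deliver.

The second part is the failing flow: there is no sched_wakeup for snap and no sched_switch to snap, and, most importantly, AiApp handles SIGIO. Why?

Because only events involving snap are shown, sched_wakeup events for other threads may be missing here.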

...
coreComm-223 [000] d... 72.754955: sched_switch: prev_comm=coreComm prev_pid=223 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=203 next_prio=120
snap-203 [000] d... 72.755360: sched_switch: prev_comm=snap prev_pid=203 prev_prio=120 prev_state=S ==> next_comm=rtpjitterbuffer next_pid=250 next_prio=120
udpsrc1:src-247 [000] dnh. 72.874125: sched_wakeup: comm=snap pid=203 prio=120 target_cpu=000 ---- wakeup snap thread
udpsrc1:src-247 [000] dnh. 72.874141: signal_generate: sig=29 errno=0 code=128 comm=snap pid=203 grp=1 res=0 ---- generate SIGIO signal
udpsrc1:src-247 [000] d... 72.874161: sched_switch: prev_comm=udpsrc1:src prev_pid=247 prev_prio=120 prev_state=R ==> next_comm=snap next_pid=203 next_prio=120 ---- switch to snap thread
snap-203 [000] d... 72.875114: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b9ff450 sa_flags=10000000 ---- snap thread woken by SIGIO
snap-203 [000] d... 72.875195: sched_switch: prev_comm=snap prev_pid=203 prev_prio=120 prev_state=S ==> next_comm=cat next_pid=219 next_prio=120
IFMS_Open-231 [000] d... 73.065332: sched_wakeup: comm=snap pid=203 prio=120 target_cpu=000
...
AiApp-228 [000] d... 73.221721: sched_switch: prev_comm=AiApp prev_pid=228 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=203 next_prio=120
snap-203 [000] d... 73.222124: sched_switch: prev_comm=snap prev_pid=203 prev_prio=120 prev_state=S ==> next_comm=coreComm next_pid=223 next_prio=120
<idle>-0 [000] dnh. 73.380965: signal_generate: sig=29 errno=0 code=128 comm=snap pid=203 grp=1 res=0 ---- need to enable all sched_wakeup and sched_switch events
AiApp-198 [000] d... 73.381628: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b9ff450 sa_flags=10000000 ---- why did AiApp get SIGIO?
<idle>-0 [000] dns. 73.559623: sched_wakeup: comm=snap pid=203 prio=120 target_cpu=000
<idle>-0 [000] d... 73.559671: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snap next_pid=203 next_prio=120
snap-203 [000] d... 73.559837: sched_switch: prev_comm=snap prev_pid=203 prev_prio=120 prev_state=S ==> next_comm=adapter next_pid=201 next_prio=120
copy-202 [000] d... 93.421913: sched_wakeup: comm=snap pid=203 prio=120 target_cpu=000
adapter-201 [000] d... 93.422067: sched_switch: prev_comm=adapter prev_pid=201 prev_prio=120 prev_state=D ==> next_comm=snap next_pid=203 next_prio=120

Summary: this explains why the SIGIO handler did run although the snap thread was not woken: in the failing case SIGIO was handled by AiApp.

 

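2.2 Relationship between SIGIO, the snap thread, and the interrupt

The relationship between SIGIO and the snap thread is not enough; the interrupt has to be examined as well.

Tracing of irq_handler_entry and irq_handler_exit is added. sched_wakeup and signal_generate both occur inside the interrupt handler.

The symptom matches the previous trace, and the chain from interrupt to signal generation to handling in a thread becomes clear.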

echo > /sys/kernel/debug/tracing/trace
echo 0 > /sys/kernel/debug/tracing/events/enable
echo "sig==29" > /sys/kernel/debug/tracing/events/signal/signal_deliver/filter
echo "sig==29" > /sys/kernel/debug/tracing/events/signal/signal_generate/filter
echo "comm==snap" > /sys/kernel/debug/tracing/events/signal/signal_blocked/filter
echo 1 > /sys/kernel/debug/tracing/events/signal/enable


echo "prev_comm==snap || next_comm==snap" > /sys/kernel/debug/tracing/events/sched/sched_switch/filter
echo "comm==snap" > /sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable

echo "irq==4" > /sys/kernel/debug/tracing/events/irq/irq_handler_entry/filter
echo "irq==4" > /sys/kernel/debug/tracing/events/irq/irq_handler_exit/filter
echo 1 > /sys/kernel/debug/tracing/events/irq/irq_handler_entry/enable
echo 1 > /sys/kernel/debug/tracing/events/irq/irq_handler_exit/enable

cat /sys/kernel/debug/tracing/trace_pipe

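After the interrupt fires, the interrupt handler first does sched_wakeup() on the snap thread and then sends SIGIO to it; once interrupt handling ends, snap is scheduled in and runs the SIGIO handler.

In the failing case, however, the interrupt handler does not put snap on the run queue. Did sched_wakeup() pick another thread? After the interrupt handler finishes, signal_deliver shows the SIGIO handler being run by AiApp.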

           AiApp-228   [000] d...   102.576745: sched_switch: prev_comm=AiApp prev_pid=228 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=209 next_prio=120
            snap-209   [000] d...   102.577042: sched_switch: prev_comm=snap prev_pid=209 prev_prio=120 prev_state=R ==> next_comm=coreComm next_pid=223 next_prio=120
        coreComm-223   [000] d...   102.577104: sched_switch: prev_comm=coreComm prev_pid=223 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=209 next_prio=120
            snap-209   [000] d...   102.577286: sched_switch: prev_comm=snap prev_pid=209 prev_prio=120 prev_state=S ==> next_comm=rtpjitterbuffer next_pid=250 next_prio=120
 rtpjitterbuffer-250   [000] d.h.   102.588617: irq_handler_entry: irq=4 name=hx280enc
 rtpjitterbuffer-250   [000] dnh.   102.773291: sched_wakeup: comm=snap pid=209 prio=120 target_cpu=000 ---- sched_wakeup() on snap inside the interrupt handler: snap was selected
 rtpjitterbuffer-250   [000] dnh.   102.773299: signal_generate: sig=29 errno=0 code=128 comm=snap pid=209 grp=1 res=0
 rtpjitterbuffer-250   [000] dnh.   102.773303: irq_handler_exit: irq=4 ret=handled
 rtpjitterbuffer-250   [000] d...   102.773860: sched_switch: prev_comm=rtpjitterbuffer prev_pid=250 prev_prio=120 prev_state=R ==> next_comm=snap next_pid=209 next_prio=120
            snap-209   [000] d...   102.773898: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b9ff450 sa_flags=10000000
            snap-209   [000] d...   102.773948: sched_switch: prev_comm=snap prev_pid=209 prev_prio=120 prev_state=S ==> next_comm=coreComm next_pid=223 next_prio=120
...
        coreComm-223   [000] d...   102.838285: sched_switch: prev_comm=coreComm prev_pid=223 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=209 next_prio=120
            snap-209   [000] d...   102.839026: sched_switch: prev_comm=snap prev_pid=209 prev_prio=120 prev_state=S ==> next_comm=rtpjitterbuffer next_pid=250 next_prio=120
           AiApp-228   [000] d.h.   102.850469: irq_handler_entry: irq=4 name=hx280enc
           AiApp-228   [000] dnh.   102.856927: signal_generate: sig=29 errno=0 code=128 comm=snap pid=209 grp=1 res=0
           AiApp-228   [000] dnh.   102.856931: irq_handler_exit: irq=4 ret=handled
           AiApp-198   [000] d...   102.857479: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b9ff450 sa_flags=10000000
          <idle>-0     [000] dns.   103.132781: sched_wakeup: comm=snap pid=209 prio=120 target_cpu=000
          <idle>-0     [000] d...   103.132833: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snap next_pid=209 next_prio=120
            snap-209   [000] d...   103.132995: sched_switch: prev_comm=snap prev_pid=209 prev_prio=120 prev_state=S ==> next_comm=rtpjitterbuffer next_pid=250 next_prio=120

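So it is necessary to enable sched_wakeup for all tasks and check whether the interrupt handler did sched_wakeup() on AiApp.

sched_switch produces too many events, so it was left disabled. The key question is which thread the interrupt handler wakes.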

echo > /sys/kernel/debug/tracing/trace
echo 0 > /sys/kernel/debug/tracing/events/enable
echo "sig==29" > /sys/kernel/debug/tracing/events/signal/signal_deliver/filter
echo "sig==29" > /sys/kernel/debug/tracing/events/signal/signal_generate/filter
echo 1 > /sys/kernel/debug/tracing/events/signal/enable

echo > /sys/kernel/debug/tracing/events/sched/sched_wakeup/filter
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable

echo "irq==4" > /sys/kernel/debug/tracing/events/irq/irq_handler_entry/filter
echo "irq==4" > /sys/kernel/debug/tracing/events/irq/irq_handler_exit/filter
echo 1 > /sys/kernel/debug/tracing/events/irq/irq_handler_entry/enable
echo 1 > /sys/kernel/debug/tracing/events/irq/irq_handler_exit/enable

cat /sys/kernel/debug/tracing/trace_pipe > /tmp/temp.txt

In the normal case sched_wakeup() targets the snap thread; in the failing case it targets AiApp, which is why AiApp later runs the SIGIO handler.

...
             cat-218   [000] d.h.    59.216451: sched_wakeup: comm=AiApp pid=236 prio=120 target_cpu=000
    kworker/u2:1-76    [000] d...    59.216815: sched_wakeup: comm=sshd pid=211 prio=120 target_cpu=000
            sshd-211   [000] d.h.    59.217162: irq_handler_entry: irq=4 name=hx280enc
            sshd-211   [000] dnh.    59.217210: sched_wakeup: comm=snap pid=203 prio=120 target_cpu=000
            sshd-211   [000] dnh.    59.217220: signal_generate: sig=29 errno=0 code=128 comm=snap pid=203 grp=1 res=0
            sshd-211   [000] dnh.    59.217224: irq_handler_exit: irq=4 ret=handled
            snap-203   [000] d...    59.217281: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b9ff450 sa_flags=10000000
            snap-203   [000] dnh.    59.217380: sched_wakeup: comm=omx_main pid=248 prio=120 target_cpu=000
        omx_main-248   [000] d...    59.217580: sched_wakeup: comm=omx_g1_output pid=258 prio=120 target_cpu=000
        omx_main-248   [000] d...    59.217637: sched_wakeup: comm=omxdec:src pid=251 prio=120 target_cpu=000
...
             cat-218   [000] d...    59.413571: sched_wakeup: comm=kworker/u2:1 pid=76 prio=120 target_cpu=000
     udpsrc0:src-254   [000] dnh.    59.414333: sched_wakeup: comm=AiApp pid=236 prio=120 target_cpu=000
            sshd-211   [000] dnh.    59.414615: sched_wakeup: comm=coreComm pid=223 prio=120 target_cpu=000
            sshd-211   [000] d.h.    59.414790: irq_handler_entry: irq=4 name=hx280enc
            sshd-211   [000] dnh.    59.533611: sched_wakeup: comm=AiApp pid=198 prio=120 target_cpu=000 ---- AiApp was selected to handle SIGIO
            sshd-211   [000] dnh.    59.533620: signal_generate: sig=29 errno=0 code=128 comm=snap pid=203 grp=1 res=0
            sshd-211   [000] dnh.    59.533623: irq_handler_exit: irq=4 ret=handled
           AiApp-198   [000] d.h.    59.533696: sched_wakeup: comm=cat pid=218 prio=120 target_cpu=000
           AiApp-198   [000] d.h.    59.533719: sched_wakeup: comm=coreComm pid=223 prio=120 target_cpu=000
           AiApp-198   [000] d.h.    59.533726: sched_wakeup: comm=AiApp pid=236 prio=120 target_cpu=000
           AiApp-198   [000] d...    59.534558: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b9ff450 sa_flags=10000000
             cat-218   [000] dn..    59.535430: sched_wakeup: comm=kworker/u2:1 pid=76 prio=120 target_cpu=000
             cat-218   [000] dnh.    59.535769: sched_wakeup: comm=coreComm pid=223 prio=120 target_cpu=000
             cat-218   [000] dn..    59.535786: sched_wakeup: comm=kworker/u2:1 pid=76 prio=120 target_cpu=000
             cat-218   [000] dn..    59.535930: sched_wakeup: comm=kworker/u2:1 pid=76 prio=120 target_cpu=000
             cat-218   [000] dn..    59.536006: sched_wakeup: comm=kworker/u2:1 pid=76 prio=120 target_cpu=000
             cat-218   [000] dn..    59.536059: sched_wakeup: comm=kworker/u2:1 pid=76 prio=120 target_cpu=000
             cat-218   [000] dn..    59.536112: sched_wakeup: comm=kworker/u2:1 pid=76 prio=120 target_cpu=000

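Summary: the trace clearly shows that, in the failing case, the sched_wakeup() issued from the interrupt handler selects AiApp. That is the problem: why AiApp rather than the expected snap thread? It also explains why the standalone test never fails: that test runs a single process with no other threads.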

 

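2.3 Adding a signal_blocked trace event to track blocked.sig[0] in handle_signal(), sigprocmask(), and sigsuspend()

The kernel's hard-IRQ handler calls kill_fasync() to send SIGIO to snap. The call chain is kill_fasync()->kill_fasync_rcu()->send_sigio()->send_sigio_to_task()->do_send_sig_info->send_signal()->__send_signal()->complete_signal()->wants_signal().

A signal_blocked event is added to record task_struct->blocked.sig[0] of the snap thread, and its value is printed on entry to and exit from the interrupt handler.

The purpose of signal_blocked is to appear alongside the other trace events, so that the whole flow is shown on one timeline.

At each point of interest a single call is enough: trace_signal_blocked(snap_task, __func__, __LINE__);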

TRACE_EVENT(signal_blocked,

    TP_PROTO(struct task_struct *tsk, const char *func, unsigned int line),

    TP_ARGS(tsk, func, line),

    TP_STRUCT__entry(
        __array( char,  comm,   TASK_COMM_LEN   )
        __field( pid_t, pid         )
        __field( int,   blocked         )
        __field( const char *,   func    )
        __field( int,   line    )
    ),

    TP_fast_assign(
        memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
        __entry->pid        = tsk->pid;
        __entry->blocked    = tsk->blocked.sig[0];
        __entry->func       = func;
        __entry->line       = line;
    ),

    TP_printk("%s %d comm=%s pid=%d blocked.sig[0]=0x%08X",
            __entry->func, __entry->line,
            __entry->comm, __entry->pid,
            __entry->blocked)
);

 

In the normal case blocked.sig[0] at irq_handler_entry is 0x00000000; in the failing case it is 0x10000000.

0x10000000 means that SIGIO is blocked.
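The value follows from how mask bits are numbered: signal n occupies bit n-1 of the first mask word (the kernel's sigmask() macro is 1UL << (sig - 1)), and the traces show SIGIO as sig=29, so the bit is 1 << 28 = 0x10000000. A minimal sketch of that arithmetic, with sig_bit() as a hypothetical helper name; the number 29 is taken from the traces and differs on some architectures:

```c
#include <assert.h>

/*
 * Bit that signal number sig occupies in the first word of a signal
 * mask: signals 1..N map to bits 0..N-1.
 */
static unsigned long sig_bit(int sig)
{
    assert(sig >= 1 && sig <= (int)(8 * sizeof(unsigned long)));
    return 1UL << (sig - 1);
}
```

For example, sig_bit(29) is 0x10000000, the value of blocked.sig[0] seen while snap has SIGIO blocked.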

            snap-180   [000] ....  2950.211703: signal_blocked: handle_signal 217 comm=snap pid=180 blocked.sig[0]=0x10000000
            snap-180   [000] ....  2951.882924: signal_blocked: sigsuspend 3535 comm=snap pid=180 blocked.sig[0]=0x10000000
            snap-180   [000] ....  2951.882940: signal_blocked: sigsuspend 3537 comm=snap pid=180 blocked.sig[0]=0x00000000
 rtpjitterbuffer-225   [000] d.h.  2951.894413: signal_blocked: __handle_irq_event_percpu 145 comm=snap pid=180 blocked.sig[0]=0x00000000 ---- in this interrupt, SIGIO is not blocked for snap
 rtpjitterbuffer-225   [000] d.h.  2951.894426: irq_handler_entry: irq=4 name=hx280enc
 rtpjitterbuffer-225   [000] d.h.  2951.894464: signal_generate: sig=29 errno=0 code=128 comm=snap pid=180 grp=1 res=0
 rtpjitterbuffer-225   [000] d.h.  2951.894468: irq_handler_exit: irq=4 ret=handled
 rtpjitterbuffer-225   [000] d.h.  2951.894471: signal_blocked: __handle_irq_event_percpu 149 comm=snap pid=180 blocked.sig[0]=0x00000000
            snap-180   [000] ....  2951.894519: signal_blocked: sigsuspend 3543 comm=snap pid=180 blocked.sig[0]=0x00000000
            snap-180   [000] ....  2951.894526: signal_blocked: sigsuspend 3545 comm=snap pid=180 blocked.sig[0]=0x00000000
            snap-180   [000] d...  2951.894546: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b983410 sa_flags=10000000 ---- snap gets to run the SIGIO handler on its way back to user space
            snap-180   [000] ....  2951.894551: signal_blocked: handle_signal 199 comm=snap pid=180 blocked.sig[0]=0x00000000
            snap-180   [000] ....  2951.894563: signal_blocked: handle_signal 209 comm=snap pid=180 blocked.sig[0]=0x00000000
            snap-180   [000] d...  2951.894566: signal_blocked: handle_signal 214 comm=snap pid=180 blocked.sig[0]=0x10000000
            snap-180   [000] ....  2951.894569: signal_blocked: handle_signal 217 comm=snap pid=180 blocked.sig[0]=0x10000000
     udpsrc0:src-222   [000] d.h.  2952.194958: signal_blocked: __handle_irq_event_percpu 145 comm=snap pid=180 blocked.sig[0]=0x10000000 ---- in this interrupt, snap has SIGIO blocked
     udpsrc0:src-222   [000] d.h.  2952.194973: irq_handler_entry: irq=4 name=hx280enc
     udpsrc0:src-222   [000] dnh.  2952.202000: signal_generate: sig=29 errno=0 code=128 comm=snap pid=180 grp=1 res=0
     udpsrc0:src-222   [000] dnh.  2952.202004: irq_handler_exit: irq=4 ret=handled
     udpsrc0:src-222   [000] dnh.  2952.202008: signal_blocked: __handle_irq_event_percpu 149 comm=snap pid=180 blocked.sig[0]=0x10000000
           AiApp-176   [000] d...  2952.202264: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b983410 sa_flags=10000000 ---- the SIGIO handler is run by AiApp
           AiApp-176   [000] ....  2952.202280: signal_blocked: handle_signal 199 comm=snap pid=180 blocked.sig[0]=0x10000000
           AiApp-176   [000] ....  2952.202294: signal_blocked: handle_signal 209 comm=snap pid=180 blocked.sig[0]=0x10000000
           AiApp-176   [000] d...  2952.202297: signal_blocked: handle_signal 214 comm=snap pid=180 blocked.sig[0]=0x10000000
           AiApp-176   [000] ....  2952.202301: signal_blocked: handle_signal 217 comm=snap pid=180 blocked.sig[0]=0x10000000
            snap-180   [000] ....  2952.203748: signal_blocked: sigsuspend 3535 comm=snap pid=180 blocked.sig[0]=0x10000000
            snap-180   [000] ....  2952.203759: signal_blocked: sigsuspend 3537 comm=snap pid=180 blocked.sig[0]=0x00000000 ---- sigsuspend() starts waiting
            snap-180   [000] ....  3243.710836: signal_blocked: sigsuspend 3543 comm=snap pid=180 blocked.sig[0]=0x00000000 ---- some time later the whole process fails and exits
            snap-180   [000] ....  3243.710850: signal_blocked: sigsuspend 3545 comm=snap pid=180 blocked.sig[0]=0x00000000
              sh-150   [000] ....  3244.812561: signal_blocked: handle_signal 199 comm=snap pid=180 blocked.sig[0]=0x00000000
              sh-150   [000] ....  3244.812584: signal_blocked: handle_signal 209 comm=snap pid=180 blocked.sig[0]=0x00000000
              sh-150   [000] d...  3244.812588: signal_blocked: handle_signal 214 comm=snap pid=180 blocked.sig[0]=0x00000000
              sh-150   [000] ....  3244.812591: signal_blocked: handle_signal 217 comm=snap pid=180 blocked.sig[0]=0x00000000

So where exactly does blocked.sig[0] change? The trace above shows nothing for pthread_sigmask(), so signal_blocked is also placed in sigsuspend(), sigprocmask(), and handle_signal() to print every change of blocked.sig[0].

            snap-174   [000] d...   687.919424: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=S ==> next_comm=udpsrc1:src next_pid=209 next_prio=120
          filter-194   [000] d...   688.494547: sched_wakeup: comm=snap pid=174 prio=120 target_cpu=000
          filter-194   [000] d...   688.494657: sched_switch: prev_comm=filter prev_pid=194 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] ....   688.498629: signal_blocked: sigprocmask 2518 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] ....   688.498639: signal_blocked: sigprocmask 2533 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] ....   688.498645: signal_blocked: sigprocmask 2536 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   688.498650: signal_blocked: sigprocmask 2518 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   688.498652: signal_blocked: sigprocmask 2533 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   688.498655: signal_blocked: sigprocmask 2536 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] d...   688.499279: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=R ==> next_comm=cat next_pid=245 next_prio=120
 rtpjitterbuffer-212   [000] d...   688.503295: sched_switch: prev_comm=rtpjitterbuffer prev_pid=212 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] ....   688.503493: signal_blocked: sigprocmask 2518 comm=snap pid=174 blocked.sig[0]=0x00000000 ---- first pthread_sigmask()
            snap-174   [000] ....   688.503497: signal_blocked: sigprocmask 2533 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] ....   688.503503: signal_blocked: sigprocmask 2536 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   688.503548: signal_blocked: sigsuspend 3539 comm=snap pid=174 blocked.sig[0]=0x10000000 ---- sigsuspend() is entered with SIGIO blocked, then unblocks it and sleeps in a wakeable state
            snap-174   [000] ....   688.503551: signal_blocked: sigsuspend 3541 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] d...   688.503562: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=S ==> next_comm=udpsrc1:src next_pid=209 next_prio=120 ---- snap goes to sleep
     udpsrc1:src-209   [000] d.h.   688.503888: signal_blocked: __handle_irq_event_percpu 145 comm=snap pid=174 blocked.sig[0]=0x00000000 ---- interrupt fires
     udpsrc1:src-209   [000] d.h.   688.503898: irq_handler_entry: irq=4 name=hx280enc
     udpsrc1:src-209   [000] d.h.   688.503930: sched_wakeup: comm=snap pid=174 prio=120 target_cpu=000 ---- snap is put on the run queue
     udpsrc1:src-209   [000] d.h.   688.503935: signal_generate: sig=29 errno=0 code=128 comm=snap pid=174 grp=1 res=0
     udpsrc1:src-209   [000] d.h.   688.503939: irq_handler_exit: irq=4 ret=handled
     udpsrc1:src-209   [000] d.h.   688.503942: signal_blocked: __handle_irq_event_percpu 149 comm=snap pid=174 blocked.sig[0]=0x00000000
 rtpjitterbuffer-212   [000] d...   688.504528: sched_switch: prev_comm=rtpjitterbuffer prev_pid=212 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120 ---- snap is woken
            snap-174   [000] ....   688.504542: signal_blocked: sigsuspend 3547 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] ....   688.504546: signal_blocked: sigsuspend 3549 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] d...   688.504564: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b983410 sa_flags=10000000 ---- signal delivered to snap
            snap-174   [000] ....   688.504571: signal_blocked: handle_signal 199 comm=snap pid=174 blocked.sig[0]=0x00000000 ---- SIGIO sig_handler() processing; blocking of SIGIO is restored at the end of handle_signal()
            snap-174   [000] d...   688.504583: signal_blocked: handle_signal 211 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] d...   688.504585: signal_blocked: handle_signal 214 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   688.504588: signal_blocked: handle_signal 217 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   688.504713: signal_blocked: sigprocmask 2518 comm=snap pid=174 blocked.sig[0]=0x10000000 ---- second pthread_sigmask(), which unblocks SIGIO
            snap-174   [000] ....   688.504716: signal_blocked: sigprocmask 2533 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   688.504719: signal_blocked: sigprocmask 2536 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] d...   688.504988: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=R ==> next_comm=NotifySrvlResul next_pid=190 next_prio=120
 rtpjitterbuffer-212   [000] d...   688.506152: sched_switch: prev_comm=rtpjitterbuffer prev_pid=212 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
...
          filter-194   [000] d...   691.058210: sched_switch: prev_comm=filter prev_pid=194 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] d...   691.059290: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=R ==> next_comm=cat next_pid=245 next_prio=120
          filter-194   [000] d...   691.060795: sched_switch: prev_comm=filter prev_pid=194 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] d...   691.061464: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=R ==> next_comm=omx_main next_pid=203 next_prio=120
            sshd-216   [000] d...   691.064415: sched_switch: prev_comm=sshd prev_pid=216 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] d...   691.065170: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=R ==> next_comm=coreComm next_pid=187 next_prio=120
        coreComm-187   [000] d...   691.065249: sched_switch: prev_comm=coreComm prev_pid=187 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] d...   691.067633: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=R ==> next_comm=NotifySrvlResul next_pid=190 next_prio=120
 rtpjitterbuffer-212   [000] d...   691.071517: sched_switch: prev_comm=rtpjitterbuffer prev_pid=212 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] ....   691.071615: signal_blocked: sigprocmask 2518 comm=snap pid=174 blocked.sig[0]=0x00000000 ---- woken by a signal other than SIGIO; sig_delivered is still 0, so it keeps waiting in the while loop
            snap-174   [000] ....   691.071618: signal_blocked: sigprocmask 2533 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] ....   691.071624: signal_blocked: sigprocmask 2536 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   691.071627: signal_blocked: sigprocmask 2518 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   691.071630: signal_blocked: sigprocmask 2533 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] ....   691.071632: signal_blocked: sigprocmask 2536 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] d...   691.071709: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=R ==> next_comm=coreComm next_pid=187 next_prio=120
        coreComm-187   [000] d...   691.071760: sched_switch: prev_comm=coreComm prev_pid=187 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] d...   691.071933: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=R ==> next_comm=NotifySrvlResul next_pid=190 next_prio=120
 NotifySrvlResul-190   [000] d...   691.072021: sched_switch: prev_comm=NotifySrvlResul prev_pid=190 prev_prio=120 prev_state=S ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] ....   691.072510: signal_blocked: sigprocmask 2518 comm=snap pid=174 blocked.sig[0]=0x00000000--------the first pthread_sigmask().
            snap-174   [000] ....   691.072514: signal_blocked: sigprocmask 2533 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] ....   691.072518: signal_blocked: sigprocmask 2536 comm=snap pid=174 blocked.sig[0]=0x10000000
            snap-174   [000] d...   691.072567: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=S ==> next_comm=udpsrc1:src next_pid=209 next_prio=120--------snap is switched out here. This is the key point: SIGIO is blocked in snap at the moment it is switched out.
     udpsrc1:src-209   [000] d.h.   691.084054: signal_blocked: __handle_irq_event_percpu 145 comm=snap pid=174 blocked.sig[0]=0x10000000--------the interrupt fires, but blocked.sig[0] is 0x10000000, so SIGIO is masked.
     udpsrc1:src-209   [000] d.h.   691.084069: irq_handler_entry: irq=4 name=hx280enc
     udpsrc1:src-209   [000] dnh.   691.102600: signal_generate: sig=29 errno=0 code=128 comm=snap pid=174 grp=1 res=0
     udpsrc1:src-209   [000] dnh.   691.102605: irq_handler_exit: irq=4 ret=handled
     udpsrc1:src-209   [000] dnh.   691.102608: signal_blocked: __handle_irq_event_percpu 149 comm=snap pid=174 blocked.sig[0]=0x10000000
           AiApp-170   [000] d...   691.102845: signal_deliver: sig=29 errno=0 code=128 sa_handler=2b983410 sa_flags=10000000--------SIGIO is handled in AiApp.
           AiApp-170   [000] ....   691.102859: signal_blocked: handle_signal 199 comm=snap pid=174 blocked.sig[0]=0x10000000--------the SIGIO sig_handler() runs.
           AiApp-170   [000] d...   691.102871: signal_blocked: handle_signal 211 comm=snap pid=174 blocked.sig[0]=0x10000000
           AiApp-170   [000] d...   691.102874: signal_blocked: handle_signal 214 comm=snap pid=174 blocked.sig[0]=0x10000000
           AiApp-170   [000] ....   691.102877: signal_blocked: handle_signal 217 comm=snap pid=174 blocked.sig[0]=0x10000000
          <idle>-0     [000] dns.   691.303189: sched_wakeup: comm=snap pid=174 prio=120 target_cpu=000
          <idle>-0     [000] d...   691.303267: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] ....   691.303552: signal_blocked: sigsuspend 3539 comm=snap pid=174 blocked.sig[0]=0x10000000--------the second sigsuspend(). The while loop exits only after the SIGIO handler sets sig_delivered. The flow is stuck here: no encode request is issued, so there is no further interrupt and no SIGIO.
            snap-174   [000] ....   691.303564: signal_blocked: sigsuspend 3541 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] d...   691.303584: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=S ==> next_comm=rtpjitterbuffer next_pid=212 next_prio=120
            copy-173   [000] dn..   988.790892: sched_wakeup: comm=snap pid=174 prio=120 target_cpu=000
         adapter-172   [000] d...   988.791081: sched_switch: prev_comm=adapter prev_pid=172 prev_prio=120 prev_state=D ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] ....   988.791094: signal_blocked: sigsuspend 3547 comm=snap pid=174 blocked.sig[0]=0x00000000--------the process is about to exit abnormally.
            snap-174   [000] ....   988.791098: signal_blocked: sigsuspend 3549 comm=snap pid=174 blocked.sig[0]=0x00000000
            snap-174   [000] d...   988.791111: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=D ==> next_comm=AiApp next_pid=188 next_prio=120
            copy-173   [000] dn..   988.793177: sched_wakeup: comm=snap pid=174 prio=120 target_cpu=000
           AiApp-188   [000] d...   988.794392: sched_switch: prev_comm=AiApp prev_pid=188 prev_prio=120 prev_state=x ==> next_comm=snap next_pid=174 next_prio=120
            snap-174   [000] d...   988.794428: sched_switch: prev_comm=snap prev_pid=174 prev_prio=120 prev_state=x ==> next_comm=adapter next_pid=172 next_prio=120
              sh-144   [000] ....   989.892972: signal_blocked: handle_signal 199 comm=snap pid=174 blocked.sig[0]=0x00000000
              sh-144   [000] d...   989.892997: signal_blocked: handle_signal 211 comm=snap pid=174 blocked.sig[0]=0x00000000
              sh-144   [000] d...   989.893001: signal_blocked: handle_signal 214 comm=snap pid=174 blocked.sig[0]=0x00000000
              sh-144   [000] ....   989.893005: signal_blocked: handle_signal 217 comm=snap pid=174 blocked.sig[0]=0x00000000

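Summary: snap is switched out between sigprocmask() and sigsuspend(). By the time the interrupt handler runs, SIGIO is already blocked in snap. With group set to 1, send_sigio_to_task() has no choice but to pick another suitable thread.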

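3. Root cause analysis

3.1 How a target is chosen when SIGIO is blocked

complete_signal() picks a suitable task to receive SIGIO and prefers the snap thread.

At this moment, however, SIGIO is blocked in snap, so wants_signal() returns 0.

complete_signal() then looks for another suitable thread and picks AiApp.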

static inline int wants_signal(int sig, struct task_struct *p)
{
    /* In the failing case this test is true, so wants_signal() returns 0. */
    if (sigismember(&p->blocked, sig))
        return 0;
    if (p->flags & PF_EXITING)
        return 0;
    if (sig == SIGKILL)
        return 1;
    if (task_is_stopped_or_traced(p))
        return 0;
    return task_curr(p) || !signal_pending(p);
}

static void complete_signal(int sig, struct task_struct *p, int group)
{
    struct signal_struct *signal = p->signal;
    struct task_struct *t;

    /* Per the failing log above, p's blocked.sig[0] is 0x10000000 here, so p is not chosen as t. */
    if (wants_signal(sig, p))
        t = p;
    /* group == 0 means the signal is for p only; if p is the only thread in its group, nobody else can be picked either. */
    else if (!group || thread_group_empty(p))
        return;
    else {
        t = signal->curr_target;
        /* Does thread t have SIGIO blocked? */
        while (!wants_signal(sig, t)) {
            /* Move on to another thread of the same thread_group that does not block it. */
            t = next_thread(t);
            if (t == signal->curr_target)
                return;
        }
        signal->curr_target = t;
    }
...
    /*
     * Set TIF_SIGPENDING on the task to mark that it has a signal waiting,
     * then wake it up through wake_up_state(). When the task is next
     * scheduled, do_notify_resume() handles the signal before the task
     * returns to user space.
     */
    signal_wake_up(t, sig == SIGKILL);
    return;
}
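The selection logic above can be replayed in user space. The following is a minimal sketch, not kernel code: `struct model_task`, `model_wants_signal()` and `model_pick_target()` are hypothetical names, only the blocked-mask test of wants_signal() is modelled, and the thread group is a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIGIO_NR   29                          /* sig=29 in the trace above */
#define SIG_BIT(n) (UINT32_C(1) << ((n) - 1))  /* SIG_BIT(29) == 0x10000000 */

/* Hypothetical, heavily simplified stand-in for task_struct. */
struct model_task {
    const char *comm;     /* thread name */
    uint32_t    blocked;  /* models blocked.sig[0] */
};

/* Simplified wants_signal(): only the blocked-mask test is modelled. */
static int model_wants_signal(int sig, const struct model_task *t)
{
    return (t->blocked & SIG_BIT(sig)) == 0;
}

/*
 * Simplified complete_signal(): returns the index of the thread that would
 * be woken, or -1 if the signal simply stays pending.
 * 'target' is the thread the signal was sent to, 'curr' models
 * signal->curr_target, 'group' has the same meaning as in the kernel.
 */
static int model_pick_target(const struct model_task *grp, size_t n,
                             size_t target, size_t curr, int sig, int group)
{
    assert(n > 0 && target < n && curr < n);

    if (model_wants_signal(sig, &grp[target]))
        return (int)target;
    if (!group || n == 1)
        return -1;

    size_t t = curr;
    while (!model_wants_signal(sig, &grp[t])) {
        t = (t + 1) % n;            /* models next_thread() */
        if (t == curr)
            return -1;              /* every thread blocks the signal */
    }
    return (int)t;
}
```

With a group of { AiApp (nothing blocked), snap (0x10000000 blocked) } and snap as the target, group == 1 selects AiApp, while group == 0 returns -1, meaning the signal stays pending for snap.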

 

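3.2 The race that leaves SIGIO blocked

In the normal case snap is asleep in sigsuspend(). When the interrupt fires, snap is chosen as the target of SIGIO and is woken up. When snap is next scheduled and returns to user space, the SIGIO handler runs and sigsuspend() returns.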

pthread_sigmask()
=========================>A
  sigsuspend()===========>B
=========================>C
pthread_sigmask()

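At A, SIGIO is blocked. If the SIGIO handler ran at B, SIGIO is blocked at C as well.

It is unlikely that snap is scheduled out at C and a SIGIO arrives while it is blocked there.

At A, however, this happens easily when the system is scheduling heavily.

kill_fasync() then ends up choosing another suitable thread in the thread_group.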
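Window A can be reproduced deterministically in a single thread. This is a minimal sketch under stated assumptions: `window_a_demo()` and `on_sigio()` are hypothetical names, raise() stands in for the driver's interrupt, and sigprocmask() is used in place of pthread_sigmask() because the demo has only one thread. Since raise() directs the signal at the calling thread, the signal stays pending and is delivered inside sigsuspend(), which is the behaviour the real snap thread lacked when the signal was sent to the whole thread group.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t sig_delivered;

static void on_sigio(int sig)
{
    (void)sig;
    sig_delivered = 1;
}

/*
 * Returns 1 if a SIGIO raised in window A (after the first mask change,
 * before sigsuspend()) stayed pending and was then delivered at B.
 */
static int window_a_demo(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old, wait_mask, pending;
    int rc, was_pending;

    sa.sa_handler = on_sigio;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGIO, &sa, &old_sa);
    assert(rc == 0);

    sigemptyset(&block);
    sigaddset(&block, SIGIO);
    rc = sigprocmask(SIG_BLOCK, &block, &old);   /* first mask change */
    assert(rc == 0);
    (void)rc;

    sig_delivered = 0;
    raise(SIGIO);                                /* window A: SIGIO is blocked */

    sigpending(&pending);
    was_pending = (sigismember(&pending, SIGIO) == 1) && !sig_delivered;

    wait_mask = old;
    sigdelset(&wait_mask, SIGIO);
    while (!sig_delivered)
        sigsuspend(&wait_mask);                  /* B: handler runs here */

    sigprocmask(SIG_SETMASK, &old, NULL);        /* second mask change */
    sigaction(SIGIO, &old_sa, NULL);
    return was_pending && sig_delivered;
}
```

Calling `window_a_demo()` from main() should return 1: the signal is not lost as long as no other thread is allowed to consume it.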

 

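4. Solutions

The root of the problem is that SIGIO is sent after the first pthread_sigmask() and before sigsuspend(), and is then handled by AiApp.

So there are two directions: keep SIGIO from being in the blocked state, or keep SIGIO from being routed to another thread.

The third approach (4.3) is the better solution: it binds SIGIO to the snap thread instead of blocking SIGIO in the other threads.

4.1 Shrink the window in which SIGIO is blocked

The system is heavily loaded, and the snap thread in particular uses a lot of CPU time; once switched out, it can take a long time before it is scheduled again.

If snap is scheduled out at A and the signal is generated right at that moment, the failure occurs.

Raising the priority of snap only lowers the probability of the failure, so this approach is not acceptable.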

 

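4.2 Block SIGIO in every other thread of the thread group

The main thread can block SIGIO with sigprocmask(). The other threads then no longer handle a pending SIGIO in place of snap.

Because snap inherits the blocked SIGIO, it has to unblock SIGIO itself.

With this in place, even if SIGIO arrives while it is blocked somewhere in the same pthread_sigmask() -> sigsuspend() -> pthread_sigmask() sequence, it merely stays pending until snap handles it.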
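A minimal sketch of this approach follows, with several assumptions: `snap_thread()` is only a stand-in for the real snap thread, kill(getpid(), SIGIO) replaces the driver's process-wide SIGIO, and the stand-in keeps SIGIO blocked everywhere except inside sigsuspend(), so it is permanently in window A. On older glibc versions the program may need to be linked with -lpthread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t sig_delivered;
static pthread_t handler_thread;   /* written by the handler, read after join */

static void on_sigio(int sig)
{
    (void)sig;
    handler_thread = pthread_self();
    sig_delivered = 1;
}

/* Stand-in for snap: SIGIO is blocked (inherited) except inside sigsuspend(). */
static void *snap_thread(void *arg)
{
    sigset_t wait_mask;

    (void)arg;
    pthread_sigmask(SIG_SETMASK, NULL, &wait_mask);  /* read the inherited mask */
    sigdelset(&wait_mask, SIGIO);
    while (!sig_delivered)
        sigsuspend(&wait_mask);
    return NULL;
}

/* Returns 1 if the process-directed SIGIO was handled in the snap stand-in. */
static int block_in_all_threads_demo(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old;
    pthread_t snap;
    int rc, in_snap;

    sa.sa_handler = on_sigio;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGIO, &sa, &old_sa);
    assert(rc == 0);

    /* Block SIGIO in the main thread before any other thread exists. */
    sigemptyset(&block);
    sigaddset(&block, SIGIO);
    rc = pthread_sigmask(SIG_BLOCK, &block, &old);
    assert(rc == 0);

    sig_delivered = 0;
    kill(getpid(), SIGIO);          /* process-directed, like group == 1 */

    rc = pthread_create(&snap, NULL, snap_thread, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(snap, NULL);

    in_snap = sig_delivered && pthread_equal(handler_thread, snap);

    pthread_sigmask(SIG_SETMASK, &old, NULL);
    sigaction(SIGIO, &old_sa, NULL);
    return in_snap;
}
```

Because every thread blocks SIGIO at the time it is sent, the signal waits on the shared pending queue and the only thread that ever opens its mask, the snap stand-in, receives it.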

 

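Below is a temporary workaround: if SIGIO is being sent to the snap thread while snap has SIGIO blocked, group is forced to 0, so the signal is sent to snap only and no other thread is chosen.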

static void send_sigio_to_task(struct task_struct *p,
                   struct fown_struct *fown,
                   int fd, int reason, int group)
{
    /*
     * F_SETSIG can change ->signum lockless in parallel, make
     * sure we read it once and use the same value throughout.
     */
    int signum = ACCESS_ONCE(fown->signum);

    if (!sigio_perm(p, fown, signum))
        return;
    if (!strcmp(p->comm, "snap")) {
        if (p->blocked.sig[0] == 0x10000000)
            group = 0;
    }

    switch (signum) {
        siginfo_t si;
        default:
            si.si_signo = signum;
            si.si_errno = 0;
            si.si_code = reason;
            BUG_ON((reason & __SI_MASK) != __SI_POLL);
            if (reason - POLL_IN >= NSIGPOLL)
                si.si_band = ~0L;
            else
                si.si_band = band_table[reason - POLL_IN];
            si.si_fd = fd;
            if (!do_send_sig_info(signum, &si, p, group))
                break;
        /* fall-through: fall back on the old plain SIGIO signal */
        case 0:
            do_send_sig_info(SIGIO, SEND_SIG_PRIV, p, group);
    }
}

 

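4.3 Set a thread owner with fcntl F_SETOWN_EX

The analysis above shows that SIGIO being in the blocked state cannot be avoided; when that happens with group == 1, another thread is chosen to receive the signal.

With group == 0 the failure does not occur.

The following looks at how group affects the relationship between SIGIO and threads, and how to restrict SIGIO to the intended thread.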

void kill_fasync(struct fasync_struct **fp, int sig, int band)
{
    if (*fp) {
        rcu_read_lock();
        kill_fasync_rcu(rcu_dereference(*fp), sig, band);
        rcu_read_unlock();
    }
}

static void kill_fasync_rcu(struct fasync_struct *fa, int sig, int band)
{
    while (fa) {
        struct fown_struct *fown;
        unsigned long flags;
...
        spin_lock_irqsave(&fa->fa_lock, flags);
        if (fa->fa_file) {
            fown = &fa->fa_file->f_owner;
            if (!(sig == SIGURG && fown->signum == 0))
                send_sigio(fown, fa->fa_fd, band);
        }
        spin_unlock_irqrestore(&fa->fa_lock, flags);
        fa = rcu_dereference(fa->fa_next);
    }
}

void send_sigio(struct fown_struct *fown, int fd, int band)
{
    struct task_struct *p;
    enum pid_type type;
    struct pid *pid;
    int group = 1;
    
    read_lock(&fown->lock);

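    /*
     * Per fown_struct, pid is the process or process-group ID that SIGIO is
     * sent to, and pid_type says which kind of ID it is: PIDTYPE_PID is a
     * single task's PID, PIDTYPE_TGID the PID of a thread group leader,
     * PIDTYPE_PGID the PID of a process group leader, and PIDTYPE_SID the ID
     * of a session leader.
     */
    type = fown->pid_type;
    /* pid_type == PIDTYPE_MAX: group becomes 0, so SIGIO is sent to this pid only. */
    if (type == PIDTYPE_MAX) {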
        group = 0;
        type = PIDTYPE_PID;
    }

    pid = fown->pid;
    if (!pid)
        goto out_unlock_fown;
    
    read_lock(&tasklist_lock);
    do_each_pid_task(pid, type, p) {
        send_sigio_to_task(p, fown, fd, band, group);
    } while_each_pid_task(pid, type, p);
    read_unlock(&tasklist_lock);
 out_unlock_fown:
    read_unlock(&fown->lock);
}

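Whether group is 0 or 1 makes a big difference to how SIGIO is handled.

The workaround in send_sigio_to_task() changes the value of group. So how exactly does group affect which thread responds to SIGIO?

The role of group can be seen in send_sigio_to_task()->do_send_sig_info()->send_signal()->__send_signal().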

static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
            int group, int from_ancestor_ns)
{
    struct sigpending *pending;
    struct sigqueue *q;
    int override_rlimit;
    int ret = 0, result;

    assert_spin_locked(&t->sighand->siglock);

    result = TRACE_SIGNAL_IGNORED;
    if (!prepare_signal(sig, t,
            from_ancestor_ns || (info == SEND_SIG_FORCED)))
        goto ret;

    /*
     * group decides whether SIGIO is queued on t->signal->shared_pending or
     * on t->pending. With group == 0 it goes on this task's private pending
     * queue only, and no other thread will handle it.
     */
    pending = group ? &t->signal->shared_pending : &t->pending;
...
    result = TRACE_SIGNAL_ALREADY_PENDING;
    if (legacy_queue(pending, sig))
        goto ret;

    result = TRACE_SIGNAL_DELIVERED;
...
    if (info == SEND_SIG_FORCED)
        goto out_set;

    if (sig < SIGRTMIN)
        override_rlimit = (is_si_special(info) || info->si_code >= 0);
    else
        override_rlimit = 0;

    q = __sigqueue_alloc(sig, t, GFP_ATOMIC | __GFP_NOTRACK_FALSE_POSITIVE,
        override_rlimit);    /* allocate a sigqueue entry for sig */
    if (q) {
        list_add_tail(&q->list, &pending->list);    /* queue sig on pending->list */
        switch ((unsigned long) info) {
        case (unsigned long) SEND_SIG_NOINFO:
            q->info.si_signo = sig;
            q->info.si_errno = 0;
            q->info.si_code = SI_USER;
            q->info.si_pid = task_tgid_nr_ns(current,
                            task_active_pid_ns(t));
            q->info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
            break;
        case (unsigned long) SEND_SIG_PRIV:
            q->info.si_signo = sig;
            q->info.si_errno = 0;
            q->info.si_code = SI_KERNEL;
            q->info.si_pid = 0;
            q->info.si_uid = 0;
            break;
        default:
            copy_siginfo(&q->info, info);
            if (from_ancestor_ns)
                q->info.si_pid = 0;
            break;
        }

        userns_fixup_signal_uid(&q->info, t);

    } else if (!is_si_special(info)) {
...
    }

out_set:
    signalfd_notify(t, sig);
    sigaddset(&pending->signal, sig);
    complete_signal(sig, t, group);
ret:
    trace_signal_generate(sig, info, t, group, result);    /* emits the signal_generate trace event */
    return ret;
}

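As shown above, when pid_type is PIDTYPE_MAX, group is 0 and the signal cannot be handled by another thread.

So when is pid_type set to PIDTYPE_MAX?

Following the fcntl system call shows that the F_SETOWN_EX command sets PIDTYPE_MAX when the owner type is F_OWNER_TID, that is, when the owner is a thread.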

SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{    
    struct fd f = fdget_raw(fd);
    long err = -EBADF;

    if (!f.file)
        goto out;

    if (unlikely(f.file->f_mode & FMODE_PATH)) {
        if (!check_fcntl_cmd(cmd))
            goto out1;
    }

    err = security_file_fcntl(f.file, cmd, arg);
    if (!err)
        err = do_fcntl(fd, cmd, arg, f.file);

out1:
     fdput(f);
out:
    return err;
}

static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
        struct file *filp)
{
    long err = -EINVAL;

    switch (cmd) {
...
    case F_GETOWN:
        err = f_getown(filp);
        force_successful_syscall_return();
        break;
    case F_SETOWN:
        f_setown(filp, arg, 1);
        err = 0;
        break;
    case F_GETOWN_EX:
        err = f_getown_ex(filp, arg);
        break;
    case F_SETOWN_EX:
        err = f_setown_ex(filp, arg);
        break;
...
    }
    return err;
}

static int f_setown_ex(struct file *filp, unsigned long arg)
{
    struct f_owner_ex __user *owner_p = (void __user *)arg;
    struct f_owner_ex owner;
    struct pid *pid;
    int type;
    int ret;

    ret = copy_from_user(&owner, owner_p, sizeof(owner));
    if (ret)
        return -EFAULT;

    switch (owner.type) {
    case F_OWNER_TID:
        type = PIDTYPE_MAX;
        break;
...
    }

    rcu_read_lock();
    pid = find_vpid(owner.pid);
    if (owner.pid && !pid)
        ret = -ESRCH;
    else
        __f_setown(filp, pid, type, 1);
    rcu_read_unlock();

    return ret;
}

static void f_modown(struct file *filp, struct pid *pid, enum pid_type type,
                     int force)
{
    write_lock_irq(&filp->f_owner.lock);
    if (force || !filp->f_owner.pid) {
        put_pid(filp->f_owner.pid);
        filp->f_owner.pid = get_pid(pid);
        filp->f_owner.pid_type = type;

        if (pid) {
            const struct cred *cred = current_cred();
            filp->f_owner.uid = cred->uid;
            filp->f_owner.euid = cred->euid;
        }
    }
    write_unlock_irq(&filp->f_owner.lock);
}

The thread owner is set with F_SETOWN_EX as follows:

    struct f_owner_ex owner_ex;
    owner_ex.pid = syscall(SYS_gettid);
    owner_ex.type = F_OWNER_TID;
    fcntl(enc->fd_enc, F_SETOWN_EX, &owner_ex);
    //fcntl(enc->fd_enc, F_SETOWN, syscall(SYS_gettid));  /* this thread will receive SIGIO */
    oflags = fcntl(enc->fd_enc, F_GETFL);
    fcntl(enc->fd_enc, F_SETFL, oflags | FASYNC);   /* set ASYNC notification flag */

 

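Compared with F_SETOWN, F_SETOWN_EX carries an extra owner type, which tells the kernel that SIGIO for this file is bound to a thread rather than to a process.

kill_fasync() then sends SIGIO with group 0 (thread scope) instead of 1 (process scope).

In summary, binding SIGIO to the snap thread is the better solution. It is done by setting file->f_owner from within the snap thread.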
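The binding can be tried without the encoder driver. The sketch below is an illustration under assumptions, not the project's code: a UNIX socket pair replaces enc->fd_enc, `owner_tid_demo()` is a hypothetical name, and the thread keeps SIGIO blocked while the data arrives, which corresponds to window A. It assumes Linux with glibc, where _GNU_SOURCE exposes F_SETOWN_EX, struct f_owner_ex and F_OWNER_TID.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* F_SETOWN_EX, struct f_owner_ex, F_OWNER_TID */
#endif
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile sig_atomic_t sigio_seen;

static void on_sigio(int sig)
{
    (void)sig;
    sigio_seen = 1;
}

/*
 * Returns 1 if SIGIO for sv[0] stayed pending for the owning thread while
 * it had SIGIO blocked and was then delivered inside sigsuspend().
 */
static int owner_tid_demo(void)
{
    struct sigaction sa, old_sa;
    struct f_owner_ex owner, readback;
    sigset_t block, old, wait_mask, pending;
    int sv[2], flags, rc, ok = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 0;

    sa.sa_handler = on_sigio;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGIO, &sa, &old_sa);
    assert(rc == 0);
    (void)rc;

    sigemptyset(&block);
    sigaddset(&block, SIGIO);
    sigprocmask(SIG_BLOCK, &block, &old);  /* pthread_sigmask() in a real thread */

    owner.type = F_OWNER_TID;              /* bind to a thread, not the process */
    owner.pid = (pid_t)syscall(SYS_gettid);
    if (fcntl(sv[0], F_SETOWN_EX, &owner) != 0)
        goto out;
    if (fcntl(sv[0], F_GETOWN_EX, &readback) != 0 ||
        readback.type != F_OWNER_TID || readback.pid != owner.pid)
        goto out;

    flags = fcntl(sv[0], F_GETFL);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_ASYNC) != 0)
        goto out;

    sigio_seen = 0;
    if (write(sv[1], "x", 1) != 1)         /* sv[0] becomes readable */
        goto out;

    sigpending(&pending);
    if (sigismember(&pending, SIGIO) != 1)
        goto out;

    wait_mask = old;
    sigdelset(&wait_mask, SIGIO);
    while (!sigio_seen)
        sigsuspend(&wait_mask);
    ok = 1;
out:
    close(sv[0]);                          /* drops the async registration */
    close(sv[1]);
    sigprocmask(SIG_SETMASK, &old, NULL);
    sigaction(SIGIO, &old_sa, NULL);
    return ok;
}
```

The descriptors are closed before the original SIGIO disposition is restored, so a late notification cannot hit the default action, which terminates the process.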

 

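5. When signals are handled

Before returning to user space from a system call, an exception or an interrupt, the kernel checks whether the current task has signals pending.

If a signal is pending, which is indicated by TIF_SIGPENDING, do_notify_resume() is called to handle it.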

asmlinkage void
do_notify_resume(unsigned int thread_flags, struct pt_regs *regs, int syscall)
{
    if (thread_flags & _TIF_SIGPENDING)    /* TIF_SIGPENDING is set: the task has unhandled signals */
        do_signal(regs, syscall);
...
}

static void do_signal(struct pt_regs *regs, int syscall)
{
    unsigned int retval = 0, continue_addr = 0, restart_addr = 0;
    struct ksignal ksig;

    if (!user_mode(regs))
        return;
...
    if (try_to_freeze())
        goto no_signal;

    if (get_signal(&ksig)) {    /* fetch a pending signal from the current task's task_struct->pending or task_struct->signal->shared_pending */
        sigset_t *oldset;

        if (regs->pc == restart_addr) {
            if (retval == -ERESTARTNOHAND
                    || (retval == -ERESTARTSYS
                        && !(ksig.ka.sa.sa_flags & SA_RESTART))) {
                regs->a0 = -EINTR;
                regs->pc = continue_addr;
            }
        }
...
        if (handle_signal(ksig.sig, &ksig.ka, &ksig.info, oldset, regs) == 0) {
            if (test_thread_flag(TIF_RESTORE_SIGMASK))
                clear_thread_flag(TIF_RESTORE_SIGMASK);
        }
        return;
    }
...
}


int get_signal(struct ksignal *ksig)
{
    struct sighand_struct *sighand = current->sighand;
    struct signal_struct *signal = current->signal;
    int signr;
...
    for (;;) {
        struct k_sigaction *ka;
...
        /* Take a signal from task_struct->pending first, then from task_struct->signal->shared_pending. */
        signr = dequeue_signal(current, &current->blocked, &ksig->info);
        if (!signr)
            break; /* will return 0 */

        if (unlikely(current->ptrace) && signr != SIGKILL) {
            signr = ptrace_signal(signr, &ksig->info);
            if (!signr)
                continue;
        }

        ka = &sighand->action[signr-1];

        /* Matches the signal_deliver trace event: the signal has really reached this task and is about to be handled. */
        trace_signal_deliver(signr, &ksig->info, ka);
        if (ka->sa.sa_handler == SIG_IGN) /* Do nothing.  */
            continue;
        if (ka->sa.sa_handler != SIG_DFL) {
            ksig->ka = *ka;

            if (ka->sa.sa_flags & SA_ONESHOT)
                ka->sa.sa_handler = SIG_DFL;

            break; /* will return non-zero "signr" value */
        }
...
        do_group_exit(ksig->info.si_signo);
    }
    spin_unlock_irq(&sighand->siglock);

    ksig->sig = signr;
    return ksig->sig > 0;
}
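The dequeue order explains why AiApp could take the signal away. The following user-space model is only a sketch: `struct model_thread`, `model_next_signal()` and `model_dequeue_signal()` are hypothetical names, each queue is reduced to a 32-bit mask, and the kernel's preference for synchronous signals is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define SIG_BIT(n) (UINT32_C(1) << ((n) - 1))

/* Hypothetical, simplified view of one thread's signal state. */
struct model_thread {
    uint32_t  blocked;         /* models task_struct->blocked.sig[0] */
    uint32_t  pending;         /* models task_struct->pending (private) */
    uint32_t *shared_pending;  /* models task_struct->signal->shared_pending */
};

/* Lowest-numbered signal that is set in 'set' and not blocked, or 0. */
static int model_next_signal(uint32_t set, uint32_t blocked)
{
    uint32_t ready = set & ~blocked;

    for (int sig = 1; sig <= 32; sig++)
        if (ready & SIG_BIT(sig))
            return sig;
    return 0;
}

/* Private queue first, then the shared one; the returned signal is removed. */
static int model_dequeue_signal(struct model_thread *t)
{
    int sig;

    assert(t && t->shared_pending);

    sig = model_next_signal(t->pending, t->blocked);
    if (sig) {
        t->pending &= ~SIG_BIT(sig);
        return sig;
    }
    sig = model_next_signal(*t->shared_pending, t->blocked);
    if (sig)
        *t->shared_pending &= ~SIG_BIT(sig);
    return sig;
}
```

With SIGIO (29) on the shared queue, a snap thread that blocks it dequeues nothing, while an AiApp thread sharing the same queue dequeues 29 and empties it. If the bit sits on snap's private queue instead, it survives until snap unblocks SIGIO.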

static int
handle_signal(int sig, struct k_sigaction *ka, siginfo_t *info,
        sigset_t *oldset, struct pt_regs *regs)
{
    struct task_struct *tsk = current;
    int ret;

    /* set up the stack frame, regardless of SA_SIGINFO, and pass info anyway. */
    ret = setup_rt_frame(sig, ka, info, oldset, regs);    /* prepare the user stack for running the signal handler */
    if (ret != 0) {
        force_sigsegv(sig, tsk);
        return ret;
    }

    spin_lock_irq(&current->sighand->siglock);
    sigorsets(&current->blocked, &current->blocked, &ka->sa.sa_mask);
    if (!(ka->sa.sa_flags & SA_NODEFER))
        sigaddset(&current->blocked, sig);
    recalc_sigpending();
    spin_unlock_irq(&current->sighand->siglock);
    return 0;
}

static int setup_rt_frame (int sig, struct k_sigaction *ka, siginfo_t *info,
        sigset_t *set, struct pt_regs *regs)
{
    struct rt_sigframe *frame;
    int err = 0;

    struct csky_vdso *vdso = current->mm->context.vdso;

    frame = get_sigframe(ka, regs, sizeof(*frame));
...
    /* Set up registers for signal handler */
    regs->usp = (unsigned long) frame;
    regs->pc = (unsigned long) ka->sa.sa_handler;    /* set up pc, the return address and so on */
    regs->lr = (unsigned long)vdso->rt_signal_retcode;

adjust_stack:
    regs->a0 = sig; /* first arg is signo */
    regs->a1 = (unsigned long)(&(frame->info)); /* second arg is (siginfo_t*) */
    regs->a2 = (unsigned long)(&(frame->uc));/* third arg pointer to ucontext */
    return err;

give_sigsegv:
    if (sig == SIGSEGV)
        ka->sa.sa_handler = SIG_DFL;
    force_sig(SIGSEGV, current);
    goto adjust_stack;
}

 

 

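6. Limitations of using signals

Using signals for asynchronous notification between a driver and an application does have some limitations.

6.1 Lost signals

Section 20.13, "Changing Signal Dispositions: sigaction()", of The Linux Programming Interface contains the passage below. It says that if the same signal arrives several times while its handler is running, it is handled only once, so signals can be lost.

The sa_mask field defines a set of signals that are blocked while the handler defined by sa_handler is invoked.

When the signal handler is invoked, any signals in this set that are not currently part of the process signal mask are automatically added to the mask before the handler is called. These signals remain in the process signal mask until the signal handler returns, at which time they are automatically removed.

The sa_mask field lets us specify a set of signals that are not allowed to interrupt execution of this handler.--------sa_mask names the signals masked while the handler runs

In addition, the signal that caused the handler to be invoked is automatically added to the process signal mask.--------the signal itself is blocked automatically while its handler runs

Blocking the signal itself while its handler runs means that the handler will not recursively interrupt itself if a second instance of the same signal arrives during its execution. Since blocked signals are not queued, if any of these signals is generated repeatedly while the handler executes, it is (later) delivered only once.

The struct sigaction structure is as follows: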

struct sigaction {
    unsigned int sa_flags;
    __sighandler_t sa_handler;
    __sigrestore_t sa_restorer;
    sigset_t sa_mask; /* mask last for extensibility */
};
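The loss described in 6.1 is easy to observe. The sketch below is a hedged illustration: `deliveries_after_burst()` and `count_signal()` are hypothetical names, and raise() stands in for repeated notifications from the driver while the signal is blocked.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t handler_runs;

static void count_signal(int sig)
{
    (void)sig;
    handler_runs++;
}

/*
 * Raises 'sig' 'times' times while it is blocked, then unblocks it and
 * returns how many times the handler actually ran.
 */
static int deliveries_after_burst(int sig, int times)
{
    struct sigaction sa, old_sa;
    sigset_t block, old;
    int rc;

    sa.sa_handler = count_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(sig, &sa, &old_sa);
    assert(rc == 0);
    (void)rc;

    sigemptyset(&block);
    sigaddset(&block, sig);
    sigprocmask(SIG_BLOCK, &block, &old);

    handler_runs = 0;
    for (int i = 0; i < times; i++)
        raise(sig);                         /* generated while blocked */

    sigprocmask(SIG_UNBLOCK, &block, NULL); /* pending instances are delivered */
    sigprocmask(SIG_SETMASK, &old, NULL);
    sigaction(sig, &old_sa, NULL);
    return handler_runs;
}
```

`deliveries_after_burst(SIGIO, 3)` should report a single handler run, because a standard signal that is already pending is not queued again. On Linux, passing a real-time signal such as SIGRTMIN would be expected to report three runs.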

 

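6.2 Signal handlers are per-process

In a multithreaded program the signal mask is a per-thread attribute: each thread has its own mask, and a signal can be directed at a specific thread.

The signal handler, however, is a per-process attribute: within one process a signal can have only one handler.

So when different threads install a handler for the same signal, the later one overwrites the earlier one, which makes the behaviour unpredictable.

Hence in a multithreaded program one signal cannot have several handlers, that is, the same signal cannot be used for different purposes.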
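A short check of this point, as a sketch only: `handler_a()`, `handler_b()` and `handler_is_process_wide()` are hypothetical names, and older glibc versions may require linking with -lpthread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

static void handler_a(int sig) { (void)sig; }
static void handler_b(int sig) { (void)sig; }

/* A second thread installs its own handler for the same signal. */
static void *install_b(void *arg)
{
    struct sigaction sa;

    (void)arg;
    sa.sa_handler = handler_b;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGIO, &sa, NULL);
    return NULL;
}

/* Returns 1 if the handler set by the second thread replaced handler_a. */
static int handler_is_process_wide(void)
{
    struct sigaction sa, old_sa, cur;
    pthread_t th;
    int rc, replaced;

    sa.sa_handler = handler_a;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGIO, &sa, &old_sa);   /* main thread installs handler_a */
    assert(rc == 0);

    rc = pthread_create(&th, NULL, install_b, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(th, NULL);

    sigaction(SIGIO, NULL, &cur);          /* query the current disposition */
    replaced = (cur.sa_handler == handler_b);

    sigaction(SIGIO, &old_sa, NULL);
    return replaced;
}
```

After the second thread has run, the main thread sees handler_b as the disposition of SIGIO, which is why two threads cannot each keep a private SIGIO handler.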

 

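On further study, both limitations can be addressed with fcntl(fd, F_SETSIG, sig). First, a specific signal can be chosen for each descriptor and given its own handler, which avoids the problems of sharing the single SIGIO. Second, if a real-time signal is chosen, it is queued and has a defined delivery priority, so instances are not lost unless the queue overflows.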
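A minimal sketch of F_SETSIG follows, under assumptions: a UNIX socket pair again stands in for the driver's file descriptor, SIGRTMIN is an arbitrary choice of real-time signal, and `setsig_demo()` and `on_rt_io()` are hypothetical names. With SA_SIGINFO the handler also receives the descriptor in si_fd, so one handler can tell several descriptors apart.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* F_SETSIG */
#endif
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

static volatile sig_atomic_t rt_runs;
static volatile sig_atomic_t last_fd = -1;

static void on_rt_io(int sig, siginfo_t *si, void *uc)
{
    (void)sig;
    (void)uc;
    last_fd = si->si_fd;    /* descriptor that became ready */
    rt_runs++;
}

/* Returns 1 if readiness of sv[0] was reported via SIGRTMIN with si_fd == sv[0]. */
static int setsig_demo(void)
{
    const int sig = SIGRTMIN;
    struct sigaction sa, old_sa;
    sigset_t block, old, pending;
    int sv[2], flags, rc, ok = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 0;

    sa.sa_sigaction = on_rt_io;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(sig, &sa, &old_sa);
    assert(rc == 0);
    (void)rc;

    sigemptyset(&block);
    sigaddset(&block, sig);
    sigprocmask(SIG_BLOCK, &block, &old);

    if (fcntl(sv[0], F_SETOWN, getpid()) != 0 ||
        fcntl(sv[0], F_SETSIG, sig) != 0)        /* use sig instead of SIGIO */
        goto out;
    flags = fcntl(sv[0], F_GETFL);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_ASYNC) != 0)
        goto out;

    rt_runs = 0;
    if (write(sv[1], "x", 1) != 1)               /* sv[0] becomes readable */
        goto out;

    sigpending(&pending);
    if (sigismember(&pending, sig) != 1)
        goto out;

    sigprocmask(SIG_UNBLOCK, &block, NULL);      /* handler runs here */
    ok = (rt_runs >= 1) && (last_fd == sv[0]);
out:
    close(sv[0]);                                /* drops the async registration */
    close(sv[1]);
    sigprocmask(SIG_SETMASK, &old, NULL);
    sigaction(sig, &old_sa, NULL);
    return ok;
}
```

Choosing a different real-time signal per descriptor, or dispatching on si_fd, gives each consumer its own handler path without touching the process-wide SIGIO disposition.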

 

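References:

1. The Linux Programming Interface, Section 63.3 "Signal-Driven I/O", in particular 63.3.2 "Refining the Use of Signal-Driven I/O".

2. The Linux Programming Interface, Chapter 20 "Signals: Fundamental Concepts", Chapter 21 "Signals: Signal Handlers", and Chapter 22 "Signals: Advanced Features".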
