A Summary of the "Thundering Herd" Problem in Linux Network Programming

Date: 2019-11-06


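1. Preface

I have been doing network development on Linux for nearly four years, yet I still regularly run into problems where I know what happens but not why, which gets awkward when I discuss them with other people. Computers are all multi-core now, and there are many more network programming frameworks to choose from; the three common models I know of are multi-process, multi-threaded, and asynchronous event-driven. The classic example is the Master-Worker multi-process asynchronous event-driven model used by Nginx. Today I want to discuss the "thundering herd" phenomenon in network development. Until now I had only heard of it and looked up the basic concept online; I had never actually run into it at work. It is the weekend, so I will combine my own understanding with material from the web and get to the bottom of it. The following questions need answering:

(1) What is the thundering herd, and what problems does it cause?

(2) How can the thundering herd be reproduced in code?

(3) How is the thundering herd dealt with, and what does the behavior look like afterwards?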

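2. What Is the Thundering Herd?

Modern network programs commonly use a multi-process or multi-threaded model. The general idea is that the parent process creates a socket, calls bind and listen, and then forks several child processes. Each child inherits the parent's socket and calls accept to wait for incoming connections. Several processes are now waiting for the same connection event, and when the event occurs they are all woken at once: that is the thundering herd. What problem does this cause? Waking a process requires the kernel to reschedule it, so every process responds to the one event, but only one of them can handle it successfully; the others fail and go back to sleep (or do something else). The network model is shown in the figure below:

In short, the thundering herd is what happens when multiple processes or threads block waiting on the same event: when the event occurs, all of them are woken, but only one process or thread can actually handle it, and the rest go back to sleep after failing. The wasted work is the thundering herd problem.

3. Reproducing the Thundering Herd in Code

Now that we know what the thundering herd is, let us implement the model in the figure above and see what happens. I use the multi-process model: the parent process creates a listening socket bound to a port and then forks several child processes, each of which loops on the socket (calling accept, for example). The test code is as follows: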

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <assert.h>
#include <sys/wait.h>
#include <string.h>
#include <errno.h>

#define IP   "127.0.0.1"
#define PORT  8888
#define WORKER 4

/* Each child blocks in accept() on the listening socket inherited from the parent. */
int worker(int listenfd, int i)
{
    while (1) {
        printf("I am worker %d, begin to accept connection.\n", i);
        struct sockaddr_in client_addr;
        socklen_t client_addrlen = sizeof(client_addr);
        int connfd = accept(listenfd, (struct sockaddr *)&client_addr, &client_addrlen);
        if (connfd != -1) {
            printf("worker %d accept a connection success.\t", i);
            printf("ip: %s\t", inet_ntoa(client_addr.sin_addr));
            /* sin_port is in network byte order */
            printf("port: %d\n", ntohs(client_addr.sin_port));
            /* the demo only accepts, so release the connection right away */
            close(connfd);
        } else {
            printf("worker %d accept a connection failed, error: %s\n", i, strerror(errno));
        }
    }
    return 0;
}

int main()
{
    int i = 0;
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    inet_pton(AF_INET, IP, &address.sin_addr);
    address.sin_port = htons(PORT);
    int listenfd = socket(PF_INET, SOCK_STREAM, 0);
    assert(listenfd >= 0);

    int ret = bind(listenfd, (struct sockaddr *)&address, sizeof(address));
    assert(ret != -1);

    ret = listen(listenfd, 5);
    assert(ret != -1);

    for (i = 0; i < WORKER; i++) {
        printf("Create worker %d\n", i + 1);
        pid_t pid = fork();
        /* child process */
        if (pid == 0) {
            worker(listenfd, i);
            exit(0);   /* never fall back into the fork loop */
        }

        if (pid < 0) {
            perror("fork");
        }
    }

    /* wait for all child processes */
    while (wait(NULL) > 0)
        ;
    return 0;
}

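Compile and run the program, then test on the local machine with telnet 127.0.0.1 8888. The result is shown below:

If the thundering herd occurred, we would expect all four child processes to be woken for the request, with one accept succeeding and the other three failing. The actual result is different: the parent creates four child processes and each starts waiting in accept, but when the telnet connection arrives only the worker 2 child accepts the request, and the other three processes never see it.

Why is that? Is the thundering herd a myth? I quickly searched Google to find out how the thundering herd actually arises.

It turns out that since Linux 2.6 the kernel has solved the thundering herd problem for accept(). Roughly speaking, when the kernel receives a client connection it wakes only the first process or thread on the wait queue. So if a server blocks in accept, there is no thundering herd problem on current Linux systems.

However, most real-world server programs use select, poll, or epoll. In that case the server blocks not in accept but in select, poll, or epoll_wait, and the thundering herd still has to be considered. The analysis below uses epoll as the example.

The non-blocking epoll implementation is as follows: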

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/wait.h>

#define IP   "127.0.0.1"
#define PORT  8888
#define PROCESS_NUM 4
#define MAXEVENTS 64

/* Create a TCP socket bound to IP:PORT; returns -1 on failure. */
static int create_and_bind ()
{
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd == -1) {
        perror("socket");
        return -1;
    }
    struct sockaddr_in serveraddr;
    memset(&serveraddr, 0, sizeof(serveraddr));
    serveraddr.sin_family = AF_INET;
    inet_pton(AF_INET, IP, &serveraddr.sin_addr);
    serveraddr.sin_port = htons(PORT);
    if (bind(fd, (struct sockaddr *)&serveraddr, sizeof(serveraddr)) == -1) {
        perror("bind");
        close(fd);
        return -1;
    }
    return fd;
}

static int make_socket_non_blocking (int sfd)
{
    int flags, s;
    flags = fcntl (sfd, F_GETFL, 0);
    if (flags == -1) {
        perror ("fcntl");
        return -1;
    }
    flags |= O_NONBLOCK;
    s = fcntl (sfd, F_SETFL, flags);
    if (s == -1) {
        perror ("fcntl");
        return -1;
    }
    return 0;
}

void worker(int sfd, int efd, struct epoll_event *events, int k) {
    /* The event loop */
    while (1) {
        int n, i;
        n = epoll_wait(efd, events, MAXEVENTS, -1);
        printf("worker  %d return from epoll_wait!\n", k);
        for (i = 0; i < n; i++) {
            if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP) || (!(events[i].events & EPOLLIN))) {
                /* An error has occurred on this fd, or the socket is not ready for reading (why were we notified then?) */
                fprintf (stderr, "epoll error\n");
                close (events[i].data.fd);
                continue;
            } else if (sfd == events[i].data.fd) {
                /* We have a notification on the listening socket, which means one or more incoming connections. */
                struct sockaddr in_addr;
                socklen_t in_len;
                int infd;
                in_len = sizeof in_addr;
                infd = accept(sfd, &in_addr, &in_len);
                if (infd == -1) {
                    printf("worker %d accept failed!\n", k);
                    break;
                }
                printf("worker %d accept succeeded!\n", k);
                /* The demo only counts accepts, so close the connection immediately. */
                close(infd);
            }
        }
    }
}

int main (int argc, char *argv[])
{
    int sfd, s;
    int efd;
    struct epoll_event event;
    struct epoll_event *events;
    sfd = create_and_bind();
    if (sfd == -1) {
        abort ();
    }
    s = make_socket_non_blocking (sfd);
    if (s == -1) {
        abort ();
    }
    s = listen(sfd, SOMAXCONN);
    if (s == -1) {
        perror ("listen");
        abort ();
    }
    /* The epoll instance is created before fork(), so all workers share it. */
    efd = epoll_create(MAXEVENTS);
    if (efd == -1) {
        perror("epoll_create");
        abort();
    }
    memset(&event, 0, sizeof event);
    event.data.fd = sfd;
    event.events = EPOLLIN;
    s = epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &event);
    if (s == -1) {
        perror("epoll_ctl");
        abort();
    }

    /* Buffer where events are returned */
    events = calloc(MAXEVENTS, sizeof event);
    int k;
    for (k = 0; k < PROCESS_NUM; k++) {
        printf("Create worker %d\n", k + 1);
        int pid = fork();
        if (pid == 0) {
            worker(sfd, efd, events, k);
            exit(EXIT_SUCCESS);   /* never fall back into the fork loop */
        }
    }
    /* wait for all child processes */
    while (wait(NULL) > 0)
        ;
    free (events);
    close (sfd);
    return EXIT_SUCCESS;
}

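The parent process creates the socket, sets it to non-blocking, and starts listening. It then forks four child processes, and each worker calls epoll_wait and then accepts connections. Testing with telnet gives the following result:

The result is the same as before: only one process receives the connection and the other three do not, so no thundering herd occurred. Why is that this time?

In early Linux versions the kernel also woke every process blocked in epoll_wait, so epoll had a thundering herd problem similar to that of accept. Newer versions likewise wake only the first process or thread on the wait queue, so newer Linux has partially solved the thundering herd problem for epoll. "Partially" means that in certain special scenarios epoll no longer exhibits the thundering herd, but in most scenarios it still does.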
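As a side note on kernel-level mitigation, Linux 4.5 and later also offer an opt-in EPOLLEXCLUSIVE flag for epoll_ctl. It applies when several epoll instances are attached to the same target descriptor, so it suits a layout in which each worker creates its own epoll instance after fork(), which differs from the shared epoll instance used in the test program here. The following is only a minimal sketch under that assumption; make_loopback_listener and worker_epoll_setup are hypothetical helper names, not part of the original code.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: a listening TCP socket on 127.0.0.1.
 * Port 0 asks the kernel to pick any free port. */
static int make_loopback_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int sfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sfd == -1)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(sfd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(sfd, SOMAXCONN) == -1) {
        close(sfd);
        return -1;
    }
    return sfd;
}

/* Hypothetical helper, meant to be called by each worker after fork():
 * creates the worker's own epoll instance and registers the shared
 * listening socket, using EPOLLEXCLUSIVE where it is available.
 * Returns the epoll fd, or -1 on failure. */
static int worker_epoll_setup(int sfd)
{
    struct epoll_event ev;
    int efd;

    assert(sfd >= 0);
    efd = epoll_create1(0);
    if (efd == -1)
        return -1;

    memset(&ev, 0, sizeof(ev));
    ev.data.fd = sfd;
    ev.events = EPOLLIN;
#ifdef EPOLLEXCLUSIVE
    ev.events |= EPOLLEXCLUSIVE;
    if (epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &ev) == 0)
        return efd;
    if (errno != EINVAL) {
        close(efd);
        return -1;
    }
    /* The running kernel rejected the flag (older than 4.5):
     * fall back to a plain registration. */
    ev.events = EPOLLIN;
#endif
    if (epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &ev) == -1) {
        close(efd);
        return -1;
    }
    return efd;
}
```

In such a layout each child would call worker_epoll_setup(sfd) once and then loop on epoll_wait with the returned descriptor. The flag reduces wakeups but does not guarantee that exactly one worker is woken, so accept should still be prepared to fail on a non-blocking socket.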

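epoll still exhibits the thundering herd in the following scenario: if a worker stays busy after being woken instead of handling the event right away, the other workers are woken as well. A likely reason is that the listening socket is registered in the default level-triggered mode, so as long as the pending connection has not been accepted the socket is still reported as ready and the kernel goes on to wake the next waiter. Calling sleep once after epoll_wait is enough to show this. Rewrite the worker function as follows: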

void worker(int sfd, int efd, struct epoll_event *events, int k) {
    /* The event loop */
    while (1) {
        int n, i;
        n = epoll_wait(efd, events, MAXEVENTS, -1);
        /* keep running: simulate a busy worker that does not accept right away */
        sleep(2);
        printf("worker  %d return from epoll_wait!\n", k);
        for (i = 0; i < n; i++) {
            if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP) || (!(events[i].events & EPOLLIN))) {
                /* An error has occurred on this fd, or the socket is not ready for reading (why were we notified then?) */
                fprintf (stderr, "epoll error\n");
                close (events[i].data.fd);
                continue;
            } else if (sfd == events[i].data.fd) {
                /* We have a notification on the listening socket, which means one or more incoming connections. */
                struct sockaddr in_addr;
                socklen_t in_len;
                int infd;
                in_len = sizeof in_addr;
                infd = accept(sfd, &in_addr, &in_len);
                if (infd == -1) {
                    printf("worker %d accept failed, error: %s\n", k, strerror(errno));
                    break;
                }
                printf("worker %d accept succeeded!\n", k);
                /* The demo only counts accepts, so close the connection immediately. */
                close(infd);
            }
        }
    }
}

The test result is shown below:

At last, the thundering herd shows up.
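When the herd does occur on a non-blocking listening socket, the workers that lose the race see accept fail with EAGAIN (or EWOULDBLOCK), which is harmless and should be told apart from a real error. The sketch below shows one way to make that distinction; the names try_accept, accept_result, and make_nonblocking_listener are hypothetical and do not appear in the test program above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical result codes for one accept attempt. */
enum accept_result { ACCEPT_OK, ACCEPT_LOST_RACE, ACCEPT_ERROR };

/* Try to accept one connection from a non-blocking listening socket.
 * ACCEPT_LOST_RACE means no connection was pending, typically because
 * another worker already took it. */
static enum accept_result try_accept(int sfd, int *connfd)
{
    int fd;

    assert(connfd != NULL);
    fd = accept(sfd, NULL, NULL);
    if (fd >= 0) {
        *connfd = fd;
        return ACCEPT_OK;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return ACCEPT_LOST_RACE;
    return ACCEPT_ERROR;
}

/* Hypothetical helper: a non-blocking listener on 127.0.0.1 using a
 * kernel-chosen port, so the sketch does not collide with port 8888. */
static int make_nonblocking_listener(void)
{
    struct sockaddr_in addr;
    int flags;
    int sfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sfd == -1)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);
    flags = fcntl(sfd, F_GETFL, 0);
    if (flags == -1 ||
        fcntl(sfd, F_SETFL, flags | O_NONBLOCK) == -1 ||
        bind(sfd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(sfd, SOMAXCONN) == -1) {
        close(sfd);
        return -1;
    }
    return sfd;
}
```

A worker could call try_accept after epoll_wait returns and simply go back to waiting on ACCEPT_LOST_RACE, logging only ACCEPT_ERROR as a failure.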

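4. Solving the Thundering Herd Problem

Nginx solves this problem with a mutex. Specifically, it uses a global mutex: each child process tries to acquire the lock before calling epoll_wait(); if it gets the lock it goes on to handle events, otherwise it waits. Nginx also applies a load-balancing rule (once a child process's workload reaches 7/8 of its configured total, it stops trying to acquire the lock) to even out the load across processes. I plan to study how Nginx handles the thundering herd in more depth later.

5. References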
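The idea can be sketched in a few lines of C. This is not Nginx's actual code: it is a simplified illustration that assumes a lock flag placed in anonymous shared memory before fork(), and the names accept_mutex, accept_mutex_create, accept_mutex_trylock, accept_mutex_unlock, and worker_should_compete are all hypothetical.

```c
/* Needed for MAP_ANONYMOUS when compiling in a strict -std=c11 mode. */
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif

#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical shared state. It must be mapped before fork() so that
 * the parent and every worker see the same flag. */
struct accept_mutex {
    atomic_flag locked;
};

/* Allocate the mutex in shared anonymous memory; NULL on failure. */
static struct accept_mutex *accept_mutex_create(void)
{
    struct accept_mutex *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;
    atomic_flag_clear(&m->locked);
    return m;
}

/* Non-blocking attempt: returns true only for the caller that flips
 * the flag from clear to set. */
static bool accept_mutex_trylock(struct accept_mutex *m)
{
    assert(m != NULL);
    return !atomic_flag_test_and_set(&m->locked);
}

static void accept_mutex_unlock(struct accept_mutex *m)
{
    assert(m != NULL);
    atomic_flag_clear(&m->locked);
}

/* The 7/8 rule described above: a worker competes for the lock only
 * while it uses fewer than 7/8 of its connection slots. */
static bool worker_should_compete(int active, int limit)
{
    return active * 8 < limit * 7;
}
```

In a worker loop, a process would first check worker_should_compete, then call accept_mutex_trylock; only the holder waits on the listening socket and accepts, and it calls accept_mutex_unlock once the new connections have been taken. A production version would also have to handle a lock holder that dies while holding the lock, which this sketch ignores.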

http://blog.csdn.net/russell_tao/article/details/7204260

http://pureage.info/2015/12/22/thundering-herd.html

http://blog.chinaunix.net/uid-20671208-id-4935141.html