An Illustrated Guide to How epoll Works

Date: 2019-12-06

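This article covers:

How epoll works

This article does not cover:

How to use epoll
The shortcomings of epoll

I am genuinely fond of things like epoll: easy to use, not deep in principle, yet enormously useful, even though it may be rather old by now.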

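Shortcomings of select and poll

For network services that routinely have to handle tens of thousands of connections, epoll was nothing short of revolutionary. For an ordinary local application, select and poll may well be good enough, but in highly concurrent network scenarios such as C10K they quickly fall short.

Take a look at their APIs: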

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
           
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

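They have one thing in common: the user has to pack the set of file descriptors to monitor and pass it in as an argument, and on every call that set is copied from user space into kernel space. The reason is that the kernel keeps no memory of the set between calls. For the vast majority of applications this is pure waste, because the descriptors an application monitors stay largely unchanged most of the time; they may change, but not by much.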
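To make the "no memory" point concrete, here is a minimal user-space sketch using poll() on a pipe. The function name poll_twice is my own, chosen for illustration. Note that the same pollfd array has to be handed to the kernel again on the second call, even though nothing about the watched set has changed.

```c
#include <poll.h>
#include <unistd.h>

/* Polls the read end of a pipe twice, writing one byte in between.
 * Returns the ready count reported by the second poll(), or -1 on error. */
int poll_twice(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    struct pollfd fds[1] = { { .fd = p[0], .events = POLLIN } };

    /* First call: nothing has been written, and timeout 0 returns at once. */
    int first = poll(fds, 1, 0);

    int wrote = (write(p[1], "x", 1) == 1);

    /* Second call: the whole array is passed (and copied) again, because
     * the kernel kept no record of it from the first call. */
    int second = poll(fds, 1, 0);

    close(p[0]);
    close(p[1]);
    return (first == 0 && wrote) ? second : -1;
}
```

With one descriptor the copy is trivial; with tens of thousands of descriptors the same copy happens on every single call.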

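How epoll improves on this

epoll's improvement is also the essence of its implementation. It has to accomplish two things:

Adding descriptors: the kernel records which events on which files the user cares about.
Event occurrence: the kernel records which events on which files have actually happened, so that it can hand the results over when the user comes to collect them.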

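Adding descriptors

Since the kernel needs a memory, it naturally needs a data structure to remember with. In its simplest form this is something like the epoll_instance in the accompanying figure: it has a list head, and the elements on the list, epoll_item, are the ones the user added. Each entry records a descriptor fd and the combination of events of interest, event.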

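When an event occurs

There are many kinds of events; the readable event, denoted by POLLIN, is the one users rely on most. For example:

When a TCP socket receives a segment, it becomes readable;
When a pipe receives data sent by the peer, it becomes readable;
When the timer behind a timerfd expires, it becomes readable.

These readable events now have to be tied to the epoll_instance described above. In Linux, every file descriptor has a corresponding struct file in the kernel. That struct file has a private_data pointer which, depending on the actual type of the file, points to a different data structure.

The most convenient approach I can think of is to add a pointer to the struct file in each epoll_item, and a pointer back to the epoll_item in the struct file.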

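To record which files have events pending, we also need to add a ready list, readylist, to epoll_instance; add a pointer back to the struct file in each of the data structures that private_data points to; and add a link field to epoll_item. When a file becomes readable, its epoll_item is linked onto the readylist of the epoll_instance.

After that, the user can enter the kernel through a system call, read the readylist, and learn which files are ready.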
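The back-of-the-envelope design above can be sketched as a small user-space model. Everything here is hypothetical and named by me for illustration (instance_add, instance_notify, instance_collect, EV_IN); as a further simplification, the notify step looks the item up by fd instead of following back-pointers from struct file.

```c
#include <stddef.h>

#define EV_IN 0x1u   /* stand-in for POLLIN in this model */

struct epoll_item {
    int fd;                        /* descriptor the user registered */
    unsigned events;               /* events the user cares about */
    struct epoll_item *next;       /* link in the list of all items */
    struct epoll_item *ready_next; /* link in the ready list */
    int on_ready;                  /* already queued on the ready list? */
};

struct epoll_instance {
    struct epoll_item *items;      /* everything the user has added */
    struct epoll_item *readylist;  /* items whose events have fired */
};

/* "Adding descriptors": remember fd and the events of interest. */
void instance_add(struct epoll_instance *ep, struct epoll_item *it,
                  int fd, unsigned events)
{
    it->fd = fd;
    it->events = events;
    it->on_ready = 0;
    it->ready_next = NULL;
    it->next = ep->items;
    ep->items = it;
}

/* "Event occurrence": queue the matching item on the ready list, once. */
void instance_notify(struct epoll_instance *ep, int fd, unsigned fired)
{
    for (struct epoll_item *it = ep->items; it; it = it->next) {
        if (it->fd == fd && (it->events & fired) && !it->on_ready) {
            it->on_ready = 1;
            it->ready_next = ep->readylist;
            ep->readylist = it;
        }
    }
}

/* The user comes to collect: drain up to max ready fds into out[]. */
int instance_collect(struct epoll_instance *ep, int *out, int max)
{
    int n = 0;
    while (ep->readylist && n < max) {
        struct epoll_item *it = ep->readylist;
        ep->readylist = it->ready_next;
        it->on_ready = 0;
        out[n++] = it->fd;
    }
    return n;
}
```

The point of the model is that registration happens once, while collecting only touches the items that actually fired.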

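All of the above is just my own off-the-cuff guess at roughly how epoll might work, and it surely has plenty of flaws.

The real epoll implementation, however, follows much the same idea, as described below.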

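Creating an epoll instance

Like the epoll_instance above, the kernel needs a data structure to hold the user's registrations. In the kernel this structure is struct eventpoll. When the user calls epoll_create(2) or epoll_create1(2), the kernel code in fs/eventpoll.c creates one.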

/*
 * Create the internal data structure ("struct eventpoll").
 */
error = ep_alloc(&ep);

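The most important parts of this structure are several lists. They are all empty when the instance is first created; their roles will become clear later.

epoll_create() eventually returns a file descriptor to the user, so that the user can operate on the epoll instance afterwards. Therefore, once the epoll instance has been created, the kernel allocates a file descriptor fd and a corresponding struct file.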

/*
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure and a free file descriptor.
*/
fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));

file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
                           O_RDWR | (flags & O_CLOEXEC));

Finally, these are tied to the epoll instance just created, and fd is returned to the user.

ep->file = file;

fd_install(fd, file);

return fd;

Once this is done, the epoll instance looks like this.
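The anonymous-inode file created above is visible from user space. The sketch below assumes Linux with /proc mounted; the helper name epoll_fd_target is my own. It creates an epoll instance and reads where /proc/self/fd/<fd> points, which should carry the "[eventpoll]" name passed to anon_inode_getfile().

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Creates an epoll instance and copies the link target of
 * /proc/self/fd/<epfd> into buf. Returns 0 on success, -1 on failure. */
int epoll_fd_target(char *buf, size_t len)
{
    if (len == 0)
        return -1;

    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        return -1;

    char path[64];
    snprintf(path, sizeof(path), "/proc/self/fd/%d", epfd);

    /* readlink() does not NUL-terminate, so leave room and do it here. */
    ssize_t n = readlink(path, buf, len - 1);
    close(epfd);
    if (n < 0)
        return -1;

    buf[n] = '\0';
    return 0;
}
```

So the value returned by epoll_create1() is an ordinary descriptor backed by a struct file, exactly as the kernel excerpt suggests.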

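Adding a file descriptor to an epoll instance

The user adds the descriptors to monitor, together with the events of interest, to the epoll instance through epoll_ctl(2). In place of the epoll_item above, the kernel actually creates a structure called struct epitem as the registration entry, as shown in the figure below.

So that lookups stay efficient even with a large number of descriptors, the epoll instance organizes the struct epitem entries in a red-black tree (replacing the linked list in the example above). The ffd field of struct epitem records the associated file, and it also serves as the key under which the entry is inserted into the red-black tree.

rdllink is the link by which the kernel attaches the entry to the ep->rdllist list of the epoll instance once the file behind fd is ready (that is, an event of interest has occurred).
fllink is the link by which the entry is attached to the file->f_tfile_llink list of the file behind fd. This list normally holds at most one element, unless the descriptor has been dup'ed.
pwqlist is a list head used to attach poll wait queue entries. Although it is a list, in practice at most one element is ever attached to it.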
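Since ffd is filled from both the file and the descriptor (see ep_set_ffd(&epi->ffd, tfile, fd) in the initialization code), the tree key is effectively the pair (file, fd). A minimal sketch of such a key comparison follows. The names filefd and cmp_ffd are mine, and struct file is a toy stand-in; this is a model of the idea, not the kernel's code.

```c
#include <stdint.h>

struct file { int unused; };      /* toy stand-in for the kernel's struct file */

struct filefd {
    const struct file *file;      /* which open file */
    int fd;                       /* which descriptor refers to it */
};

/* Orders keys by file pointer first, then by descriptor number.
 * Returns <0, 0 or >0 like strcmp(). */
int cmp_ffd(const struct filefd *a, const struct filefd *b)
{
    uintptr_t fa = (uintptr_t)a->file;
    uintptr_t fb = (uintptr_t)b->file;

    if (fa != fb)
        return fa > fb ? 1 : -1;
    return (a->fd > b->fd) - (a->fd < b->fd);
}
```

Under this ordering, two descriptors that share one open file after dup() compare as different keys, so each gets its own entry in the tree, which is consistent with the remark about f_tfile_llink holding more than one element in that case.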

The code that creates the struct epitem is in ep_insert() in fs/eventpoll.c.

if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
    return -ENOMEM;

The fields are then initialized:

/* Item initialization follow here ... */
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;

Next, the local variable epq is set up:

struct ep_pqueue epq;

epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

The type of epq is struct ep_pqueue, a thin wrapper around poll_table that adds a struct epitem * pointer.

struct ep_pqueue {
    poll_table pt;
    struct epitem *epi;
};

A poll_table holds a function pointer and an event mask.

typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);

typedef struct poll_table_struct {
    poll_queue_proc _qproc;
    unsigned long _key;   //  store the interested event masks
}poll_table;

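Where is this poll_table used? In the poll operation of struct file_operations (which is not the same thing as the select and poll system calls discussed at the start of this article).

struct file_operations {
   // code omitted...
   unsigned int (*poll)(struct file *, struct poll_table_struct *);
   // code omitted...
};

Different kinds of files implement poll differently, but most implementations look roughly like this: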

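static unsigned int XXXX_poll(struct file *file, poll_table *wait)
{
    /* Placeholder type: the file's private data, which holds a wait queue head wqh. */
    struct XXXX_ctx *ctx = file->private_data;
    unsigned int events = 0;

    /* Hand the file's wait queue to whatever function the poll_table carries. */
    poll_wait(file, &ctx->wqh, wait);

    /* Placeholder condition: "the file has become readable". */
    if (XXXX_readable(ctx))
        events |= POLLIN;

    return events;
}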

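They do two main things:

Put XXX on the wait queue in the file's private data (file->private_data usually contains a wait queue head, wait_queue_head_t wqh). What XXX is varies from case to case and is determined by the poll_table argument.
Check whether an event has actually occurred and, if so, return it.

Interested readers can look at the implementations of timerfd_poll() or pipe_poll().

poll_wait itself is simple: it calls the function stored in the poll_table, passing the file's private wait queue as an argument.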

static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
    if (p && p->_qproc && wait_address)
        p->_qproc(filp, wait_address, p);
}
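The indirection is easier to see in a stripped-down user-space model. In the sketch below all types are toy stand-ins that I named myself (wait_queue_head with a counter, item, pqueue, queue_proc, toy_file_poll, register_item); only the shape of poll_wait() follows the kernel excerpt. It also shows why wrapping poll_table in a larger struct is useful: the callback receives only the poll_table pointer, and recovers the surrounding wrapper, and through it the item, with container_of.

```c
#include <stddef.h>

struct file;                                  /* opaque in this model */
struct wait_queue_head { int nr_waiters; };   /* toy wait queue: just a counter */
struct poll_table;

typedef void (*poll_queue_proc)(struct file *, struct wait_queue_head *,
                                struct poll_table *);

struct poll_table { poll_queue_proc qproc; };

/* Same shape as the kernel helper: forward to the installed function. */
static void poll_wait(struct file *filp, struct wait_queue_head *wqh,
                      struct poll_table *p)
{
    if (p && p->qproc && wqh)
        p->qproc(filp, wqh, p);
}

struct item { int fd; int nwait; };           /* plays the role of epitem */

/* Wrapper around poll_table, like ep_pqueue. */
struct pqueue { struct poll_table pt; struct item *it; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The installed callback: find our item again and hook onto the queue. */
static void queue_proc(struct file *filp, struct wait_queue_head *wqh,
                       struct poll_table *pt)
{
    struct pqueue *pq = container_of(pt, struct pqueue, pt);
    (void)filp;
    wqh->nr_waiters++;
    pq->it->nwait++;
}

/* A file's poll method: offer its wait queue, then report current events. */
unsigned toy_file_poll(struct file *filp, struct wait_queue_head *wqh,
                       struct poll_table *wait)
{
    poll_wait(filp, wqh, wait);
    return 0;                                 /* nothing readable yet */
}

/* Registers `it` on the file's wait queue by polling the file once. */
int register_item(struct item *it, struct wait_queue_head *wqh)
{
    struct pqueue pq = { .pt = { .qproc = queue_proc }, .it = it };
    toy_file_poll(NULL, wqh, &pq.pt);
    return it->nwait;
}
```

Calling toy_file_poll() with a NULL table just reports events without registering anything, which is the same guard the kernel's poll_wait() applies.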

Back to ep_insert()

So the function installed in the poll_table here is ep_ptable_queue_proc().

Then:

revents = ep_item_poll(epi, &epq.pt);

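Its implementation shows that it simply calls the file's poll function directly. Let us take a TCP socket as the example (network applications are, after all, the most widespread use).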

/*
 *   ep_item_poll  -> sock_poll -> tcp_poll
 */
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)  
{
    sock_poll_wait(file, sk_sleep(sk), wait);  // will call poll_wait()   
    // code omitted...
}

As you can see, this ends up calling poll_wait(), so the registered ep_ptable_queue_proc() runs:

struct epitem *epi = ep_item_from_epqueue(pt);
struct eppoll_entry *pwq;

pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL);

Here another structure, struct eppoll_entry, is allocated. It corresponds one-to-one with the struct epitem.

Some initialization follows:

init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); //  set func: ep_poll_callback
pwq->whead = whead;
pwq->base = epi;

add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;

The important part is setting pwq->wait.func = ep_poll_callback.

At this point, the relationship between struct epitem and struct eppoll_entry looks like this.

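After the file becomes readable

For a TCP socket, when a segment arrives from the peer, the sk->sk_data_ready function that was installed at initialization is called.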

void sock_init_data(struct socket *sock, struct sock *sk)
{
    // code omitted...
     sk->sk_data_ready  =   sock_def_readable;
    // code omitted...
}

After several layers of calls, execution reaches __wake_up_common, which walks the wait queue attached to socket.wq and invokes the function of each entry.

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
            int nr_exclusive, int wake_flags, void *key)
{
    wait_queue_t *curr, *next;

    list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
        unsigned flags = curr->flags;

        if (curr->func(curr, mode, wake_flags, key) &&
                (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
}
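The loop condition is dense, so here is a user-space model of the same traversal. The types and names (wait_entry, wait_head, wake_up_common, count_call) are simplified stand-ins of my own, using a singly linked list instead of the kernel's list_head. Every entry's function is called in order, and the walk stops once nr_exclusive exclusive entries have reported a successful wake-up.

```c
#include <stddef.h>

#define WQ_FLAG_EXCLUSIVE 0x01u

struct wait_entry {
    unsigned flags;
    int (*func)(struct wait_entry *self);  /* non-zero: a waiter was woken */
    int calls;                             /* how often func has run */
    struct wait_entry *next;
};

struct wait_head { struct wait_entry *first; };

/* Invokes each entry's function; stops after nr_exclusive exclusive wake-ups. */
void wake_up_common(struct wait_head *q, int nr_exclusive)
{
    for (struct wait_entry *curr = q->first, *next; curr; curr = next) {
        next = curr->next;            /* saved first: func may unlink curr */
        unsigned flags = curr->flags;

        if (curr->func(curr) &&
            (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
}

/* Example callback: counts invocations and always reports success. */
int count_call(struct wait_entry *self)
{
    self->calls++;
    return 1;
}
```

For epoll, the entry's function is ep_poll_callback, so "waking up" a socket's wait queue turns into a plain function call into the epoll code.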

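Following the red path in the figure, this invokes the ep_poll_callback we installed. The next step is to let the epoll instance know that a file has become readable.

First, the current entry epi and ep are recovered from the arguments:

struct epitem *epi = ep_item_from_wait(wait);
struct eventpoll *ep = epi->ep;

Then epi is linked onto ep's ready list: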

if (!ep_is_linked(&epi->rdllink)) {
    list_add_tail(&epi->rdllink, &ep->rdllist);
}
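The "link it only if it is not already linked" step can be modelled in user space with a minimal circular doubly linked list. This is a sketch with toy types of my own (item, instance, mark_ready, ready_count); it treats "linked" as "the link node is not pointing at itself", which is how an initialized but unattached list node looks.

```c
struct list_head { struct list_head *next, *prev; };

void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

struct item { struct list_head rdllink; int fd; };   /* plays epitem */
struct instance { struct list_head rdllist; };       /* plays eventpoll */

/* Queues `it` on the ready list unless it is already there.
 * Returns 1 if it was newly queued, 0 if it was already linked. */
int mark_ready(struct instance *ep, struct item *it)
{
    if (!list_empty(&it->rdllink))   /* already linked somewhere */
        return 0;
    list_add_tail(&it->rdllink, &ep->rdllist);
    return 1;
}

/* Counts the entries currently on the ready list. */
int ready_count(const struct instance *ep)
{
    int n = 0;
    for (const struct list_head *p = ep->rdllist.next;
         p != &ep->rdllist; p = p->next)
        n++;
    return n;
}
```

The check matters because the callback may fire many times before the user collects events, and the same item must appear on the ready list only once.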

Next, any user blocked on this epoll instance is woken up.

if (waitqueue_active(&ep->wq))
    wake_up_locked(&ep->wq);

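How the user retrieves events

Who could be blocked on the wait queue of the epoll instance? The user who calls epoll_wait to fetch from the epoll instance the descriptors on which events of interest have occurred, of course.
epoll_wait ends up calling ep_poll().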

if (!ep_events_available(ep)) {
        /*
         * We don't have any available event to return to the caller.
         * We need to sleep here, and we will be wake up by
         * ep_poll_callback() when events will become available.
         */
        init_waitqueue_entry(&wait, current);
        __add_wait_queue_exclusive(&ep->wq, &wait);

If there are no events, we put ourselves on the wait queue of the epoll instance and go to sleep.
If there are events, we return them to the user.
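Usage is outside the scope of this article, but a tiny user-space check makes both branches observable. The sketch below (Linux only; the function name epoll_pipe_demo is mine) registers the read end of a pipe and calls epoll_wait() with a timeout of 0, so it never sleeps: before the write the ready list is empty and 0 events come back, after the write the pipe's entry has been queued and 1 event comes back.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Stores the epoll_wait() results before and after writing one byte
 * to a pipe. Returns 0 on success, -1 on any failure. */
int epoll_pipe_demo(int *before, int *after)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    struct epoll_event out;
    int rc = -1;

    /* Register once; the kernel remembers the descriptor from here on. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        /* Timeout 0: report what is ready right now instead of sleeping. */
        *before = epoll_wait(epfd, &out, 1, 0);

        if (write(p[1], "x", 1) == 1) {
            /* The write made the read end readable, so it is now ready. */
            *after = epoll_wait(epfd, &out, 1, 0);
            rc = 0;
        }
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return rc;
}
```

With a non-zero timeout, the first call would instead take the sleeping branch shown in the ep_poll() excerpt until the write woke it up.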