libevent Source Code Reading Notes (1): How libevent Wraps epoll

Date: 2019-11-11


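I recently started reading the source code of the networking library libevent. Before diving in, I skimmed several blog posts by Zhang Liang (libevent源碼深度剖析, "An In-Depth Analysis of the libevent Source", http://blog.csdn.net/sparkliang/article/details/4957667 ) to get an overall picture of the library, and then began reading the source itself.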

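Rather than working from the top down, I am reading the libevent source starting from its individual parts, on the premise that I already have a rough understanding of the overall framework. This first note covers how libevent wraps epoll.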

libevent's epoll wrapper lives mainly in epoll.c.
It starts with the epollop structure, which holds the epoll file descriptor and the array of epoll_event entries:

struct epollop {
    struct epoll_event *events;   // array of epoll_event entries filled by epoll_wait
    int nevents;    // capacity of the events array
    int epfd;     // the epoll file descriptor
};

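Three static functions are declared for operating on epoll. Their event_base parameter is the structure at the core of the libevent framework, that is, the reactor; every event we register through epoll ultimately ends up in the event_base reactor.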

static void *epoll_init(struct event_base *);
static int epoll_dispatch(struct event_base *, struct timeval *);
static void epoll_dealloc(struct event_base *);

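These functions are exposed through function pointers because libevent wraps every I/O multiplexing mechanism behind one uniform interface. Which backends are available depends on what the system supported when libevent was built, and the event_base then uses the matching set of functions.
That uniform interface is struct eventop. Most of its members are function pointers; the rest are the backend name and a few numeric fields: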
const struct eventop epollops = {
"epoll",
epoll_init,
epoll_nochangelist_add,
epoll_nochangelist_del,
epoll_dispatch,
epoll_dealloc,
1, /* need reinit */
EV_FEATURE_ET|EV_FEATURE_O1,
0
};
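
To make the idea of a uniform backend interface concrete, here is a minimal sketch of a function-pointer table plus a selection loop. Everything in it (backend_ops, pick_backend, the "fast" and "slow" backends) is hypothetical and much smaller than libevent's real struct eventop; it only illustrates the pattern of trying backends in order of preference and keeping the first one whose init succeeds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for a backend table such as eventop. */
struct backend_ops {
    const char *name;
    void *(*init)(void);          /* returns backend state, or NULL on failure */
    int (*dispatch)(void *state); /* would wait for events; a no-op here */
};

static int dummy_state;

/* Pretends the operating system does not support this backend. */
static void *fail_init(void) { return NULL; }
static void *ok_init(void) { return &dummy_state; }
static int noop_dispatch(void *state) { (void)state; return 0; }

static const struct backend_ops fast_ops = { "fast", fail_init, noop_dispatch };
static const struct backend_ops slow_ops = { "slow", ok_init, noop_dispatch };

/* Backends listed in order of preference. */
static const struct backend_ops *const backends[] = { &fast_ops, &slow_ops };

/* Return the first backend whose init() succeeds and store its state. */
static const struct backend_ops *
pick_backend(void **state_out)
{
    size_t i;
    for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
        void *state = backends[i]->init();
        if (state != NULL) {
            *state_out = state;
            return backends[i];
        }
    }
    return NULL;
}
```

The caller never names a concrete backend: it only goes through the returned table, which is the same reason the epoll functions above can stay static inside epoll.c.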
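The implementations are covered below, one at a time.
(1) epoll_init initializes epoll: it creates the epoll file descriptor, allocates the initial events array, and finally initializes signal handling (covered separately in a later note).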

static void *
epoll_init(struct event_base *base)
{
    int epfd;
    struct epollop *epollop;

    /* Initialize the kernel queue.  (The size field is ignored since
     * 2.6.8.) */
    if ((epfd = epoll_create(32000)) == -1) {
        if (errno != ENOSYS)
            event_warn("epoll_create");
        return (NULL);
    }
    // Set close-on-exec: the descriptor is closed in a program started via exec, but stays open in a forked child
    evutil_make_socket_closeonexec(epfd);

    if (!(epollop = mm_calloc(1, sizeof(struct epollop)))) {
        close(epfd);
        return (NULL);
    }

    epollop->epfd = epfd;

    /* Initialize fields */
    epollop->events = mm_calloc(INITIAL_NEVENT, sizeof(struct epoll_event));
    if (epollop->events == NULL) {
        mm_free(epollop);
        close(epfd);
        return (NULL);
    }
    epollop->nevents = INITIAL_NEVENT;
    // Switch to the changelist variant if the flag or the environment variable asks for it
    if ((base->flags & EVENT_BASE_FLAG_EPOLL_USE_CHANGELIST) != 0 ||
        ((base->flags & EVENT_BASE_FLAG_IGNORE_ENV) == 0 &&
        evutil_getenv("EVENT_EPOLL_USE_CHANGELIST") != NULL))
        base->evsel = &epollops_changelist;
    // Initialize signal handling
    evsig_init(base);

    return (epollop);
}
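
The close-on-exec step can be sketched with plain POSIX calls. This is an illustration of the technique, not libevent's own evutil_make_socket_closeonexec; the helper name make_closeonexec is made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: set FD_CLOEXEC on fd while keeping its other
 * descriptor flags. Returns 0 on success, -1 on failure. */
static int
make_closeonexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1)
        return -1;
    return 0;
}
```

With the flag set, a program started through one of the exec functions does not inherit the descriptor, while a child created by fork alone still has it open.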

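(2) epoll_nochangelist_add
Adding an epoll event takes one of two paths depending on whether a changelist is used; the changelist path can save system calls. We start with the path that does not use a changelist, epoll_nochangelist_add. It fills in the change to apply and calls epoll_apply_one_change to carry it out.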

static int
epoll_nochangelist_add(struct event_base *base, evutil_socket_t fd,
    short old, short events, void *p)
{
    struct event_change ch;
    ch.fd = fd;
    ch.old_events = old;
    ch.read_change = ch.write_change = 0;
    // Work out which of the read and write events need to change
    if (events & EV_WRITE)
        ch.write_change = EV_CHANGE_ADD |
            (events & EV_ET);  // EV_ET requests edge-triggered mode
    if (events & EV_READ)
        ch.read_change = EV_CHANGE_ADD |
            (events & EV_ET);
    // Without a changelist, call epoll_apply_one_change directly; it wraps the epoll_ctl system call
    return epoll_apply_one_change(base, base->evbase, &ch);
}

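This uses the event_change structure to describe a modification. It holds the file descriptor, the events that were enabled before the change (old_events), and the desired change to the read event (read_change) and to the write event (write_change):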

struct event_change {
    /** The fd or signal whose events are to be changed */
    evutil_socket_t fd;
    /* The events that were enabled on the fd before any of these changes
       were made.  May include EV_READ or EV_WRITE. */
    short old_events;

    /* The changes that we want to make in reading and writing on this fd.
     * If this is a signal, then read_change has EV_CHANGE_SIGNAL set,
     * and write_change is unused. */
    ev_uint8_t read_change;
    ev_uint8_t write_change;
};

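epoll_apply_one_change, mentioned above, is a wrapper around epoll_ctl. It picks ADD, MOD, or DEL as its first attempt, retries a failed MOD as an ADD (and a failed ADD as a MOD), and in every case ends up calling epoll_ctl to apply the change: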

static int
epoll_apply_one_change(struct event_base *base,
    struct epollop *epollop,
    const struct event_change *ch)
{
    struct epoll_event epev;
    int op, events = 0;

    if (1) {
        /* The logic here is a little tricky.  If we had no events set
           on the fd before, we need to set op="ADD" and set
           events=the events we want to add.  If we had any events set
           on the fd before, and we want any events to remain on the
           fd, we need to say op="MOD" and set events=the events we
           want to remain.  But if we want to delete the last event,
           we say op="DEL" and set events=the remaining events.  What
           fun!
        */

        /* TODO: Turn this into a switch or a table lookup. */

        if ((ch->read_change & EV_CHANGE_ADD) ||
            (ch->write_change & EV_CHANGE_ADD)) {
            /* If we are adding anything at all, we'll want to do
             * either an ADD or a MOD. */
            events = 0;
            op = EPOLL_CTL_ADD;
            // Watch for reads
            if (ch->read_change & EV_CHANGE_ADD) {
                events |= EPOLLIN;
            } else if (ch->read_change & EV_CHANGE_DEL) {
                ;
            } else if (ch->old_events & EV_READ) {
                events |= EPOLLIN;
            }
            // Watch for writes
            if (ch->write_change & EV_CHANGE_ADD) {
                events |= EPOLLOUT;
            } else if (ch->write_change & EV_CHANGE_DEL) {
                ;
            } else if (ch->old_events & EV_WRITE) {
                events |= EPOLLOUT;
            }
            // Edge-triggered?
            if ((ch->read_change|ch->write_change) & EV_ET)
                events |= EPOLLET;

            if (ch->old_events) {
                /* If MOD fails, we retry as an ADD, and if
                 * ADD fails we will retry as a MOD.  So the
                 * only hard part here is to guess which one
                 * will work.  As a heuristic, we'll try
                 * MOD first if we think there were old
                 * events and ADD if we think there were none.
                 *
                 * We can be wrong about the MOD if the file
                 * has in fact been closed and re-opened.
                 *
                 * We can be wrong about the ADD if the
                 * the fd has been re-created with a dup()
                 * of the same file that it was before.
                 */
                op = EPOLL_CTL_MOD;
            }
        } else if ((ch->read_change & EV_CHANGE_DEL) ||
            (ch->write_change & EV_CHANGE_DEL)) {
            /* If we're deleting anything, we'll want to do a MOD
             * or a DEL. */
            op = EPOLL_CTL_DEL;

            if (ch->read_change & EV_CHANGE_DEL) {
                if (ch->write_change & EV_CHANGE_DEL) {
                    events = EPOLLIN|EPOLLOUT;
                } else if (ch->old_events & EV_WRITE) {
                    events = EPOLLOUT;
                    op = EPOLL_CTL_MOD;
                } else {
                    events = EPOLLIN;
                }
            } else if (ch->write_change & EV_CHANGE_DEL) {
                if (ch->old_events & EV_READ) {
                    events = EPOLLIN;
                    op = EPOLL_CTL_MOD;
                } else {
                    events = EPOLLOUT;
                }
            }
        }

        if (!events)
            return 0;

        memset(&epev, 0, sizeof(epev));
        epev.data.fd = ch->fd;
        epev.events = events;
        if (epoll_ctl(epollop->epfd, op, ch->fd, &epev) == -1) {
            if (op == EPOLL_CTL_MOD && errno == ENOENT) {
                /* If a MOD operation fails with ENOENT, the
                 * fd was probably closed and re-opened.  We
                 * should retry the operation as an ADD.
                 */
                if (epoll_ctl(epollop->epfd, EPOLL_CTL_ADD, ch->fd, &epev) == -1) {
                    event_warn("Epoll MOD(%d) on %d retried as ADD; that failed too",
                        (int)epev.events, ch->fd);
                    return -1;
                } else {
                    event_debug(("Epoll MOD(%d) on %d retried as ADD; succeeded.",
                        (int)epev.events,
                        ch->fd));
                }
            } else if (op == EPOLL_CTL_ADD && errno == EEXIST) {
                /* If an ADD operation fails with EEXIST,
                 * either the operation was redundant (as with a
                 * precautionary add), or we ran into a fun
                 * kernel bug where using dup*() to duplicate the
                 * same file into the same fd gives you the same epitem
                 * rather than a fresh one.  For the second case,
                 * we must retry with MOD. */
                if (epoll_ctl(epollop->epfd, EPOLL_CTL_MOD, ch->fd, &epev) == -1) {
                    event_warn("Epoll ADD(%d) on %d retried as MOD; that failed too",
                        (int)epev.events, ch->fd);
                    return -1;
                } else {
                    event_debug(("Epoll ADD(%d) on %d retried as MOD; succeeded.",
                        (int)epev.events,
                        ch->fd));
                }
            } else if (op == EPOLL_CTL_DEL &&
                (errno == ENOENT || errno == EBADF ||
                errno == EPERM)) {
                /* If a delete fails with one of these errors,
                 * that's fine too: we closed the fd before we
                 * got around to calling epoll_dispatch. */
                event_debug(("Epoll DEL(%d) on fd %d gave %s: DEL was unnecessary.",
                    (int)epev.events,
                    ch->fd,
                    strerror(errno)));
            } else {
                event_warn("Epoll %s(%d) on fd %d failed.  Old events were %d; read change was %d (%s); write change was %d (%s)",
                    epoll_op_to_string(op),
                    (int)epev.events,
                    ch->fd,
                    ch->old_events,
                    ch->read_change,
                    change_to_string(ch->read_change),
                    ch->write_change,
                    change_to_string(ch->write_change));
                return -1;
            }
        } else {
            event_debug(("Epoll %s(%d) on fd %d okay. [old events were %d; read change was %d; write change was %d]",
                epoll_op_to_string(op),
                (int)epev.events,
                (int)ch->fd,
                ch->old_events,
                ch->read_change,
                ch->write_change));
        }
    }
    return 0;
}
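
The choice of the first epoll_ctl operation can be condensed into a small pure function. The sketch below reproduces only that decision, under the assumption that I have read the branches above correctly. The X_* and OP_* constants are illustrative stand-ins, not the real EV_*, EV_CHANGE_* or EPOLL_CTL_* values, and the retry logic is left out.

```c
#include <assert.h>

/* Illustrative constants only; the real values come from libevent
 * and <sys/epoll.h>. */
enum { X_READ = 0x02, X_WRITE = 0x04 };          /* old_events bits */
enum { X_ADD = 0x01, X_DEL = 0x02 };             /* change bits */
enum { OP_NONE, OP_ADD, OP_MOD, OP_DEL };        /* first-attempt operation */

/* Decide which operation to try first for one pending change. */
static int
choose_op(int old_events, int read_change, int write_change)
{
    if ((read_change & X_ADD) || (write_change & X_ADD)) {
        /* Adding anything: MOD if the fd is believed to be registered. */
        return old_events ? OP_MOD : OP_ADD;
    }
    if (read_change & X_DEL) {
        /* Keep the fd registered only if writes remain wanted. */
        if (!(write_change & X_DEL) && (old_events & X_WRITE))
            return OP_MOD;
        return OP_DEL;
    }
    if (write_change & X_DEL) {
        /* Keep the fd registered only if reads remain wanted. */
        return (old_events & X_READ) ? OP_MOD : OP_DEL;
    }
    return OP_NONE; /* nothing to change, no system call needed */
}
```

For example, deleting the read event from an fd that was watched for both reads and writes yields OP_MOD, whereas deleting the read event from an fd watched only for reads yields OP_DEL.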

(3) epoll_nochangelist_del
This mirrors epoll_nochangelist_add and removes the corresponding watched events from the reactor:

static int
epoll_nochangelist_del(struct event_base *base, evutil_socket_t fd,
    short old, short events, void *p)
{
    struct event_change ch;
    ch.fd = fd;
    ch.old_events = old;
    ch.read_change = ch.write_change = 0;
    if (events & EV_WRITE)
        ch.write_change = EV_CHANGE_DEL;
    if (events & EV_READ)
        ch.read_change = EV_CHANGE_DEL;

    return epoll_apply_one_change(base, base->evbase, &ch);
}

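(4) epoll_dispatch
epoll_dispatch is a wrapper around epoll_wait. It calls epoll_wait, with a timeout, on the events already registered with the reactor, and puts each event that fired on the reactor's active queue (the reactor then processes the events in that queue):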

static int
epoll_dispatch(struct event_base *base, struct timeval *tv)
{
    struct epollop *epollop = base->evbase;
    struct epoll_event *events = epollop->events;
    int i, res;
    long timeout = -1;

    // Convert the caller's timeout, if one was given
    if (tv != NULL) {
        timeout = evutil_tv_to_msec(tv);  // Convert to milliseconds for use as the epoll_wait timeout
        if (timeout < 0 || timeout > MAX_EPOLL_TIMEOUT_MSEC) {
            /* Linux kernels can wait forever if the timeout is
             * too big; see comment on MAX_EPOLL_TIMEOUT_MSEC. */
            timeout = MAX_EPOLL_TIMEOUT_MSEC;
        }
    }

    // Apply every pending change queued on the event_base (changelist mode only)
    epoll_apply_changes(base);
    // Clear the changelist (changelist mode only)
    event_changelist_remove_all(&base->changelist, base);

    EVBASE_RELEASE_LOCK(base, th_base_lock);

    res = epoll_wait(epollop->epfd, events, epollop->nevents, timeout);

    EVBASE_ACQUIRE_LOCK(base, th_base_lock);

    if (res == -1) {
        if (errno != EINTR) {
            event_warn("epoll_wait");
            return (-1);
        }

        return (0);
    }

    event_debug(("%s: epoll_wait reports %d", __func__, res));
    EVUTIL_ASSERT(res <= epollop->nevents);

    for (i = 0; i < res; i++) {
        int what = events[i].events;
        short ev = 0;
        // EPOLLHUP: the descriptor was hung up; EPOLLERR: an error occurred on the descriptor
        if (what & (EPOLLHUP|EPOLLERR)) {
            ev = EV_READ | EV_WRITE;
        } else {
            if (what & EPOLLIN)
                ev |= EV_READ;
            if (what & EPOLLOUT)
                ev |= EV_WRITE;
        }

        if (!ev)
            continue;
        // Put the event on the active queue
        evmap_io_active(base, events[i].data.fd, ev | EV_ET);
    }

    // Every slot in the events array was filled, so grow the array to hold more events next time
    if (res == epollop->nevents && epollop->nevents < MAX_NEVENT) {
        /* We used all of the event space this time.  We should
           be ready for more events next time. */
        int new_nevents = epollop->nevents * 2;
        struct epoll_event *new_events;

        new_events = mm_realloc(epollop->events,
            new_nevents * sizeof(struct epoll_event));
        if (new_events) {
            epollop->events = new_events;
            epollop->nevents = new_nevents;
        }
    }

    return (0);
}
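
The timeout handling at the top of epoll_dispatch can be sketched on its own. This is an assumption-laden illustration: timeout_for_epoll is a made-up name, the round-up conversion is my own guess at what a timeval-to-milliseconds helper should do, and the 35-minute cap in SKETCH_MAX_TIMEOUT_MSEC is an assumed value rather than one taken from the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Assumed cap (35 minutes); the listing only shows the name
 * MAX_EPOLL_TIMEOUT_MSEC, not its value. */
#define SKETCH_MAX_TIMEOUT_MSEC (35L * 60L * 1000L)

/* Hypothetical helper: turn an optional timeval into an epoll_wait
 * timeout in milliseconds. NULL means "block until an event arrives". */
static long
timeout_for_epoll(const struct timeval *tv)
{
    long msec;

    if (tv == NULL)
        return -1;
    /* Round microseconds up so a short wait never becomes a zero wait. */
    msec = (long)tv->tv_sec * 1000L + ((long)tv->tv_usec + 999L) / 1000L;
    if (msec < 0 || msec > SKETCH_MAX_TIMEOUT_MSEC)
        msec = SKETCH_MAX_TIMEOUT_MSEC;
    return msec;
}
```

Clamping an oversized timeout only means that epoll_wait returns early with no events and the loop comes around again, so the cap costs an extra wakeup rather than correctness.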

(5) epoll_dealloc
This one needs little explanation: it frees the memory the epoll backend holds in the reactor and closes the epoll descriptor.

static void
epoll_dealloc(struct event_base *base)
{
    struct epollop *epollop = base->evbase;

    evsig_dealloc(base);
    if (epollop->events)
        mm_free(epollop->events);
    if (epollop->epfd >= 0)
        close(epollop->epfd);

    memset(epollop, 0, sizeof(struct epollop));
    mm_free(epollop);
}

changelist mode

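The epoll backend can also run in changelist mode, which epoll_init selects when the EVENT_BASE_FLAG_EPOLL_USE_CHANGELIST flag or the EVENT_EPOLL_USE_CHANGELIST environment variable is set. First, the changelist structure: it records the changes made to the watched file descriptors between two consecutive dispatch calls of the reactor, instead of issuing an epoll_ctl system call immediately for every change, which reduces the number of system calls: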

struct event_changelist {
    struct event_change *changes;    // start of the event_change array
    int n_changes;    // number of event_change entries in use
    int changes_size;   // allocated capacity of the event_change array
};

With that structure in mind, we can look at event_changelist_add and event_changelist_del:

int
event_changelist_add(struct event_base *base, evutil_socket_t fd, short old, short events,
    void *p)
{
    // Get the reactor's changelist
    struct event_changelist *changelist = &base->changelist;
    struct event_changelist_fdinfo *fdinfo = p;
    struct event_change *change;

    event_changelist_check(base);
    // Look up the event_change for fd; return it if present, otherwise construct one, append it, and return it
    change = event_changelist_get_or_construct(changelist, fd, old, fdinfo);
    if (!change)
        return -1;

    /* An add replaces any previous delete, but doesn't result in a no-op,
     * since the delete might fail (because the fd had been closed since
     * the last add, for instance). */
    // Note that this is a replacement: if a delete was pending on this fd, it is overwritten with an add
    if (events & (EV_READ|EV_SIGNAL)) {
        change->read_change = EV_CHANGE_ADD |
            (events & (EV_ET|EV_PERSIST|EV_SIGNAL));
    }
    if (events & EV_WRITE) {
        change->write_change = EV_CHANGE_ADD |
            (events & (EV_ET|EV_PERSIST|EV_SIGNAL));
    }

    event_changelist_check(base);
    return (0);

}

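The function worth noting is event_changelist_get_or_construct. It finds the event_change for a given fd without scanning the changelist, because the event_changelist_fdinfo structure stores that fd's index in the changelist plus one (a value of 0 means the changelist holds no event_change for that fd):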

static struct event_change *
event_changelist_get_or_construct(struct event_changelist *changelist,
    evutil_socket_t fd,
    short old_events,
    struct event_changelist_fdinfo *fdinfo)
{
    struct event_change *change;
    // Not present: append a new entry
    if (fdinfo->idxplus1 == 0) {
        int idx;
        EVUTIL_ASSERT(changelist->n_changes <= changelist->changes_size);
        // Out of capacity: grow the array
        if (changelist->n_changes == changelist->changes_size) {
            if (event_changelist_grow(changelist) < 0)
                return NULL;
        }

        idx = changelist->n_changes++;
        change = &changelist->changes[idx];
        fdinfo->idxplus1 = idx + 1;  // idxplus1 is the array index plus 1

        memset(change, 0, sizeof(struct event_change));
        change->fd = fd;
        change->old_events = old_events;
    } else {
        // Already present: return a pointer to the existing change
        change = &changelist->changes[fdinfo->idxplus1 - 1];
        EVUTIL_ASSERT(change->fd == fd);
    }
    return change;
}
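
The index-plus-one trick is easy to try in isolation. The sketch below uses simplified, hypothetical types (change, changelist, fdinfo) and realloc in place of libevent's own growth helper; a zero-initialised fdinfo means "no entry yet", which is why the stored value is the index plus one rather than the index itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical stand-ins for the libevent structures. */
struct change { int fd; int read_change; };
struct changelist { struct change *changes; int n_changes; int changes_size; };
struct fdinfo { int idxplus1; };   /* 0 means "no pending change" */

/* Return the pending change for fd, creating it if needed, without
 * scanning the array. Returns NULL if memory runs out. */
static struct change *
get_or_construct(struct changelist *cl, int fd, struct fdinfo *info)
{
    struct change *c;

    if (info->idxplus1 == 0) {
        int idx;
        if (cl->n_changes == cl->changes_size) {
            /* Grow geometrically, starting from a small capacity. */
            int new_size = cl->changes_size < 8 ? 8 : cl->changes_size * 2;
            struct change *p = realloc(cl->changes,
                (size_t)new_size * sizeof(struct change));
            if (p == NULL)
                return NULL;
            cl->changes = p;
            cl->changes_size = new_size;
        }
        idx = cl->n_changes++;
        c = &cl->changes[idx];
        memset(c, 0, sizeof(*c));
        c->fd = fd;
        info->idxplus1 = idx + 1;
    } else {
        c = &cl->changes[info->idxplus1 - 1];
        assert(c->fd == fd);
    }
    return c;
}
```

Looking up the same fd twice returns the same slot and leaves n_changes unchanged, so repeated add and delete calls on one fd between two dispatches collapse into a single entry.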

event_changelist_del is just as easy to follow. Unlike event_changelist_add, a delete can cancel an add queued earlier on the same fd:

int
event_changelist_del(struct event_base *base, evutil_socket_t fd, short old, short events,
    void *p)
{
    struct event_changelist *changelist = &base->changelist;
    struct event_changelist_fdinfo *fdinfo = p;
    struct event_change *change;

    event_changelist_check(base);
    change = event_changelist_get_or_construct(changelist, fd, old, fdinfo);
    event_changelist_check(base);
    if (!change)
        return -1;

    /* A delete removes any previous add, rather than replacing it:
       on those platforms where "add, delete, dispatch" is not the same
       as "no-op, dispatch", we want the no-op behavior.

       As well as checking the current operation we should also check
       the original set of events to make sure were not ignoring
       the case where the add operation is present on an event that
       was already set.

       If we have a no-op item, we could remove it it from the list
       entirely, but really there's not much point: skipping the no-op
       change when we do the dispatch later is far cheaper than rejuggling
       the array now.

       As this stands, it also lets through deletions of events that are
       not currently set.
     */

    // If an add is pending on an fd that was not registered before, the delete cancels it and leaves a no-op
    if (events & (EV_READ|EV_SIGNAL)) {
        if (!(change->old_events & (EV_READ | EV_SIGNAL)) &&
            (change->read_change & EV_CHANGE_ADD))
            change->read_change = 0;
        else
            change->read_change = EV_CHANGE_DEL;
    }
    if (events & EV_WRITE) {
        if (!(change->old_events & EV_WRITE) &&
            (change->write_change & EV_CHANGE_ADD))
            change->write_change = 0;
        else
            change->write_change = EV_CHANGE_DEL;
    }

    event_changelist_check(base);
    return (0);
}
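
The asymmetry between add and delete can be shown with a tiny model that tracks a single event on a single fd. The names (pending, queue_add, queue_del, CH_*) are hypothetical; was_registered plays the role of the relevant bit in old_events.

```c
#include <assert.h>

enum { CH_NONE = 0, CH_ADD = 1, CH_DEL = 2 };

/* Hypothetical model of one pending change for one event on one fd. */
struct pending {
    int was_registered;  /* was the event enabled before these changes? */
    int change;          /* CH_NONE, CH_ADD or CH_DEL */
};

/* An add always replaces whatever was queued before. */
static void
queue_add(struct pending *p)
{
    p->change = CH_ADD;
}

/* A delete cancels a queued add only if the event was not registered
 * before; otherwise a real delete is still needed. */
static void
queue_del(struct pending *p)
{
    if (!p->was_registered && (p->change & CH_ADD))
        p->change = CH_NONE;
    else
        p->change = CH_DEL;
}
```

In this model, "add, delete" on a fresh fd ends as a no-op and costs no system call at dispatch time, while "delete, add" on a registered fd ends as an add.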