Linux System Programming Study Notes (4): Advanced I/O

1. Scatter/Gather I/O

Scatter/gather I/O lets a single system call read or write data between a single data stream and multiple buffers.
This type of I/O is so named because the data is scattered into, or gathered from, the given vector of buffers.
Compared with standard C I/O, scatter/gather I/O has the following advantages:
(1) A more natural coding pattern
If the data is naturally segmented (e.g., the fields of a predefined struct), scatter/gather I/O is more intuitive.
(2) Efficiency
A single vectored I/O operation can replace multiple linear I/O operations.
(3) Performance
It reduces the number of system calls.
(4) Atomicity
A scatter/gather operation is atomic, whereas multiple linear I/O operations run the risk of having another process's I/O interleaved with them.
 
#include <sys/uio.h>

struct iovec {
       void *iov_base;    /* pointer to start of buffer */
       size_t iov_len;    /* size of buffer in bytes */
};

 

Both functions always operate on the segments in order, starting with iov[0], then iov[1], and so on, through iov[count-1].

 

/* The readv() function reads count segments from the file descriptor fd into the buffers described by iov */
ssize_t readv (int fd, const struct iovec *iov, int count);
/* The writev() function writes at most count segments from the buffers described by iov into the file descriptor fd */
ssize_t writev (int fd, const struct iovec *iov, int count);

Note: during a scatter/gather I/O operation, the kernel must allocate internal data structures to represent each buffer segment. Normally these are allocated dynamically based on the segment count.

When the count is small (generally <= 8), however, the kernel allocates them directly on the kernel stack, which is considerably faster than dynamic allocation on the heap.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>

int main(int argc, char* argv[])
{
    struct iovec iov[3];
    char* buf[] = {
        "The term buccaneer comes from the word boucan.\n",
        "A boucan is a wooden frame used for cooking meat.\n",
        "Buccaneer is the West Indies name for a pirate.\n" };

    int fd = open("wel.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        fprintf(stderr, "open error\n");
        return 1;
    }

    /* fill out three iovec structures */
    for (int i = 0; i < 3; ++i) {
        iov[i].iov_base = buf[i];
        iov[i].iov_len  = strlen(buf[i]) + 1;
    }

    /* with a single call, write them out all */
    ssize_t nwrite = writev(fd, iov, 3);
    if (nwrite == -1) {
        fprintf(stderr, "writev error\n");
        return 1;
    }
    fprintf(stdout, "wrote %ld bytes\n", (long)nwrite);
    if (close(fd)) {
        fprintf(stderr, "close error\n");
        return 1;
    }

    return 0;
}

 

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <sys/stat.h>

int main(int argc, char* argv[])
{
    char foo[48], bar[51], baz[49];
    struct iovec iov[3];
    int fd = open("wel.txt", O_RDONLY);
    if (fd == -1) {
        fprintf(stderr, "open error\n");
        return 1;
    }

    /* set up our iovec structures */
    iov[0].iov_base = foo;
    iov[0].iov_len =  sizeof(foo);
    iov[1].iov_base = bar;
    iov[1].iov_len =  sizeof(bar);
    iov[2].iov_base = baz;
    iov[2].iov_len =  sizeof(baz);

    /* read into the structures with a single call */
    ssize_t nread = readv(fd, iov, 3);
    if (nread == -1) {
        fprintf(stderr, "readv error\n");
        return 1;
    }

    for (int i = 0; i < 3; ++i) {
        fprintf(stdout, "%d: %s", i, (char*)iov[i].iov_base);
    }
    if (close(fd)) {
        fprintf(stderr, "close error\n");
        return 1;
    }

    return 0;
}

 

A simple implementation of writev():

#include <unistd.h>
#include <errno.h>
#include <sys/uio.h>

ssize_t my_writev(int fd, const struct iovec* iov, int count)
{
    ssize_t ret = 0;
    for (int i = 0; i < count; ++i) {
        ssize_t nr = write(fd, iov[i].iov_base, iov[i].iov_len);
        if (nr == -1) {
            if (errno == EINTR) {
                --i;        /* interrupted: retry this segment */
                continue;
            }
            return -1;      /* real error */
        }
        ret += nr;
    }
    return ret;
}

In fact, all I/O inside the Linux kernel is vectored; read() and write() are implemented as vectored I/O with a vector of only one segment.

 

2. epoll

Every call to poll() or select() must examine the entire file descriptor set; as the set grows, this full scan on each call becomes a performance bottleneck.
 
(1) creating a new epoll instance
/* A successful call to epoll_create() instantiates a new epoll instance and returns a file descriptor associated with the instance */
#include <sys/epoll.h>
int epoll_create(int size);

The size parameter provides a hint about the number of file descriptors to be watched;

nowadays the kernel sizes the required data structures dynamically, and the parameter just needs to be greater than zero.

 

(2) controlling epoll

 

/* The epoll_ctl() system call can be used to add file descriptors to and remove file descriptors from a given epoll context */
#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event* event);

struct epoll_event {
    __u32 events;    /* events */
    union {
        void* ptr;
        int fd;
        __u32 u32;
        __u64 u64;
    } data;
};

a. The op parameter

EPOLL_CTL_ADD   // Add a monitor on the file associated with the file descriptor fd to the epoll instance associated with epfd
EPOLL_CTL_DEL  // Remove a monitor on the file associated with the file descriptor fd from the epoll instance associated with epfd
EPOLL_CTL_MOD  //  Modify an existing monitor of fd with the updated events specified by event

b. The event parameter

EPOLLET     // Enables edge-triggered behavior for the monitor of the file; the default behavior is level-triggered
EPOLLIN     // The file is available to be read from without blocking
EPOLLOUT    // The file is available to be written to without blocking

For the data member of struct epoll_event, common practice is to set the fd field of the union to the second argument fd, i.e., event.data.fd = fd.

 

To add a new watch on the file associated with fd to the epoll instance  epfd : 

#include <sys/epoll.h>

struct epoll_event event;
event.data.fd = fd;
event.events = EPOLLIN | EPOLLOUT;

int ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
if (ret) {
    fprintf(stderr, "epoll_ctl error\n");
}

 

To modify an existing event on the file associated with fd on the epoll instance epfd : 

#include <sys/epoll.h>

struct epoll_event event;
event.data.fd = fd;
event.events = EPOLLIN;

int ret = epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
if (ret) {
    fprintf(stderr, "epoll_ctl error\n");
}

 

To remove an existing event on the file associated with fd from the epoll instance epfd : 

#include <sys/epoll.h>

struct epoll_event event;

int ret = epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &event);
if (ret) {
    fprintf(stderr, "epoll_ctl error\n");
}

 

(3) waiting for events with epoll

#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event* events, int maxevents, int timeout);

The return value is the number of events, or −1 on error

#include <sys/epoll.h>

#define MAX_EVENTS 64

struct epoll_event* events = malloc(sizeof(struct epoll_event) * MAX_EVENTS);
if (events == NULL) {
    fprintf(stderr, "malloc error\n");
    return 1;
}

int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);
if (nready < 0) {
    fprintf(stderr, "epoll_wait error\n");
    free(events);
    return 1;
}

for (int i = 0; i < nready; ++i) {
    fprintf(stdout, "event=%u on fd=%d\n", events[i].events, events[i].data.fd);
    /* we now can operate on events[i].data.fd without blocking */
}
free(events);

(4) Level-Triggered vs. Edge-Triggered
Consider a producer and a consumer communicating over a Unix pipe:
a. The producer writes 1 KB of data to the pipe.
b. The consumer calls epoll_wait on the pipe, waiting for the pipe to become readable.
With level-triggered notification, the epoll_wait call in step b returns immediately, indicating the pipe is ready for reading.
With edge-triggered notification, the call does not return until after step a occurs; that is, even if the pipe is readable at the time of the epoll_wait call, the call does not return until the data has been written to the pipe.

The default mode of select, poll, and epoll is level-triggered. Edge-triggered mode is normally used with non-blocking I/O, and the application must carefully check for EAGAIN.

 

3. Mapping Files into Memory

/* A call to mmap() asks the kernel to map len bytes of the object represented by the
   file descriptor fd, starting at offset bytes into the file, into memory */
#include <sys/mman.h>
void* mmap(void* addr, size_t len, int prot, int flags, int fd, off_t offset);

void* ptr = mmap(0, len, PROT_READ, MAP_SHARED, fd, 0);

 

The addr and offset parameters must be multiples of the memory page size.
If the file size is not a multiple of the page size, the remainder of the last page is zero-filled: reads from that region return 0, and writes to it do not affect the backing file.
Because addr and offset are usually 0, this requirement is not overly difficult to meet.
 
int munmap (void *addr, size_t len);

munmap() removes any mappings that contain pages located anywhere in the process address space starting at addr, which must be page-aligned, and continuing for len bytes.

 

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(int argc, char* argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage:%s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        fprintf(stderr, "open error\n");
        return 1;
    }

    struct stat sbuf;
    if (fstat(fd, &sbuf) == -1) {
        fprintf(stderr, "fstat error\n");
        return 1;
    }

    if (!S_ISREG(sbuf.st_mode)) {
        fprintf(stderr, "%s is not a regular file\n", argv[1]);
        return 1;
    }
    char* ptr = mmap(0, sbuf.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        fprintf(stderr, "mmap error\n");
        return 1;
    }

    if (close(fd)) {
        fprintf(stderr, "close error\n");
        return 1;
    }

    for (int i = 0; i < sbuf.st_size; ++i) {
        fputc(ptr[i], stdout);
    }

    if (munmap(ptr, sbuf.st_size) == -1) {
        fprintf(stderr, "munmap error\n");
        return 1;
    }
    return 0;
}

 

Advantages of mmap:
(1) Reading from and writing to a memory-mapped file avoids the copying that the read() and write() system calls perform.
(2) Aside from potential page faults, reading from and writing to a mapped file does not incur the context-switch overhead of the read() and write() system calls.
(3) When multiple processes map the same object, the data is shared among all of them.
(4) Seeking within a mapping requires only ordinary pointer manipulation, not the lseek() system call.

Disadvantages of mmap:
(1) Mappings are always a multiple of the page size, so there is usually slack space between the end of the mapping and the end of the backing file; for example, with 4 KB pages, mapping a 7-byte file wastes 4089 bytes.
(2) The mappings must fit into the process's address space. In a 32-bit address space, a large number of mappings causes fragmentation, making it hard to find large contiguous free regions.
(3) There is overhead in creating and maintaining the mappings and the associated kernel data structures.
 
Synchronizing a file with its mapping:
#include <sys/mman.h>
int msync (void *addr, size_t len, int flags);
msync() flushes back to disk any changes made to a file mapped via mmap(), synchronizing the mapped file with the mapping 
Without invocation of msync(), there is no guarantee that a dirty mapping will be written back to disk until the file is unmapped
 

4. Synchronous and Asynchronous I/O

 
 
  
 

5. I/O調度和I/O性能

a single disk seek can average over 8 milliseconds, 25 million times longer than a single processor cycle!
I/O schedulers perform two basic operations: merging and sorting
Merging is the process of taking two or more adjacent I/O requests and combining them into a single request
Suppose one request reads disk block 5 and another reads blocks 6 and 7; the scheduler can merge them into a single request for blocks 5 through 7, halving the number of I/O operations.
Sorting is the process of arranging pending I/O requests in ascending block order (favoring sequential rather than random access).
 
If an I/O scheduler always sorted new requests into block order, it would be possible to starve requests to
far-off blocks indefinitely (scheduling algorithms must avoid starvation).
Elevator-style scheduling algorithms are used so that no I/O request goes unserviced for too long.
 
Linus Elevator I/O scheduler
The Deadline I/O Scheduler
The Anticipatory I/O Scheduler
The CFQ I/O Scheduler
The Noop I/O Scheduler

 

Sorting by inode number is the most commonly used method for scheduling I/O requests in user space