Linux System Programming Study Notes (2): File I/O

Date: 2019-11-12

1. Every Linux process has a limit on the number of files it may have open at once. By default the limit is 1024.
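The limit that is actually in effect can be checked at run time. The following is a minimal sketch, assuming a hypothetical helper named fd_limits (not part of the original notes), that queries RLIMIT_NOFILE with getrlimit():

```c
#include <sys/resource.h>

/* Hypothetical helper: fetch the per-process open-file limits.
 * *soft is the limit currently enforced, *hard is the ceiling the
 * soft limit may be raised to. Returns 0 on success, -1 on error. */
int fd_limits(rlim_t* soft, rlim_t* hard)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}
```

The soft value is the one that open() runs into. What it is on a given machine depends on the shell and system configuration.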

File descriptors can refer not only to regular files but also to sockets, directories, and pipes (everything is a file).

By default, a child process receives a complete copy of its parent's file table.
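One way to see the inherited file table in action is to open a pipe before fork() and let the child write into it. This is only a sketch; the function name child_inherits_fd is made up for illustration:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child could write through a descriptor that
 * the parent opened before fork(), 0 otherwise. */
int child_inherits_fd(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return 0;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return 0;
    }
    if (pid == 0) {
        /* Child: fds[1] is valid here because the file table was copied. */
        ssize_t n = write(fds[1], "x", 1);
        _exit(n == 1 ? 0 : 1);
    }

    /* Parent: close our write end, then read what the child wrote. */
    close(fds[1]);
    char c = 0;
    ssize_t n = read(fds[0], &c, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 && c == 'x';
}
```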

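2. Opening files

The flags passed to open() must include exactly one of the three access modes: O_RDONLY, O_WRONLY, or O_RDWR.

Note the O_NONBLOCK flag, which opens the file in nonblocking mode.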

int fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0644);
/* creat() is equivalent to the open() call above */
int fd = creat(filename, 0644);

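3. Reading files

A call to read() can result in any of the following:

(1) The return value equals the requested length len; all len bytes are stored in buf.

(2) The return value is less than len but greater than 0. This can happen for several reasons:

a. A signal interrupted the read partway through

b. An error occurred partway through the read

c. More than 0 but fewer than len bytes were available

d. EOF was reached before len bytes were read

(3) The call returns 0, which indicates EOF.

(4) The call blocks because no data is currently available. This does not happen in nonblocking mode.

(5) The call returns -1 and errno is set to EINTR: a signal was received before any bytes were read.

(6) The call returns -1 and errno is set to EAGAIN: the read would have blocked because no data is currently available. This happens only in nonblocking mode.

(7) The call returns -1 and errno is set to a value other than EINTR or EAGAIN: a more serious error occurred.

Reading all the bytes: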

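#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Read up to len bytes, retrying after EINTR and after short reads.
 * Returns the number of bytes actually read, which is less than len
 * only if EOF or an error was hit. */
size_t readn(int fd, void* buf, size_t len)
{
    char*   p = buf;        /* arithmetic on void* is not standard C */
    size_t  total = len;
    ssize_t ret = 0;
    while (len != 0 && (ret = read(fd, p, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR) {
                continue;
            }
            fprintf(stderr, "read error\n");
            break;
        }
        len -= ret;
        p += ret;
    }
    return total - len;
}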

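Nonblocking reads:

Sometimes we do not want read() to block when no data is available. Instead, we want the call to return immediately and report that there is nothing to read. This is nonblocking I/O.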
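As a rough sketch of how this looks in code, the two helpers below (set_nonblocking and try_read are illustrative names, not from the original notes) put a descriptor into nonblocking mode with fcntl() and turn the EAGAIN case into a distinct return value:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch an already-open descriptor to nonblocking mode. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Try to read without blocking.
 * Returns the number of bytes read (> 0), 0 on EOF, -1 on a real error,
 * and -2 if no data is available right now. */
ssize_t try_read(int fd, void* buf, size_t len)
{
    for (;;) {
        ssize_t ret = read(fd, buf, len);
        if (ret >= 0)
            return ret;
        if (errno == EINTR)
            continue;       /* interrupted before any byte was read: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return -2;      /* the call would have blocked */
        return -1;
    }
}
```

A caller that gets -2 is free to do other work and try again later, typically after select() or poll() reports the descriptor as readable.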

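4. Writing files

write() has no EOF condition. For regular files, write() normally writes the entire request unless an error occurs, in which case it returns -1.

For other file types a partial write is possible. The classic case is reading and writing sockets in network programming, where the write should be done in a loop like this: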

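#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Write len bytes, retrying after EINTR and after partial writes.
 * Returns the number of bytes actually written. */
size_t writen(int fd, const void* buf, size_t len)
{
    const char* p = buf;    /* arithmetic on void* is not standard C */
    ssize_t ret = 0;
    size_t  total = len;
    while (len != 0 && (ret = write(fd, p, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR) {
                continue;
            }
            fprintf(stderr, "write error\n");
            break;
        }
        len -= ret;
        p += ret;
    }
    return total - len;
}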

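Append mode (O_APPEND) guarantees that the file position is always at the end of the file when a write happens, and it makes the offset update atomic with the write. This makes the mode very useful when several tasks append to the same file.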
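A small sketch of append mode, assuming a hypothetical helper append_line and a caller-supplied path. Each call opens the file with O_APPEND, so the data lands at the current end of the file no matter what other writers have done in between:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one line to the file at path,
 * creating the file if needed. Returns 0 on success, -1 on failure. */
int append_line(const char* path, const char* line)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    size_t  len = strlen(line);
    ssize_t ret = write(fd, line, len);   /* always lands at end of file */
    int ok = (ret == (ssize_t)len);

    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling append_line(path, "one\n") and then append_line(path, "two\n") on a fresh file leaves the two lines in that order.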

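5. File synchronization

When write() is called, the kernel copies the data from the user buffer into a kernel buffer, but it does not necessarily write the data to its destination right away. Typically the kernel performs a few checks, copies the data into a dirty buffer, later gathers up all the dirty buffers (which contain data newer than what is on disk), and only then writes them back to disk.

Delayed writes do not change POSIX semantics, and they improve read and write performance:

If a read is issued for just-written data that lives in a dirty buffer and is not yet on disk, the request will be satisfied from the buffer and not cause a read from the "stale" data on disk. So the read is satisfied from an in-memory cache without having to go to disk.

Delayed writes improve performance considerably, but sometimes an application needs to control when data reaches the disk. That is when file synchronization is needed.

The fsync() system call ensures that the data of the file associated with fd has been written back to disk: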

int ret = fsync(fd);

Passing the O_SYNC flag to open() requests that all I/O on the file be synchronized:

int fd = open(file, O_WRONLY | O_SYNC);

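O_SYNC adds a large amount of I/O wait time to every write. In general, when data must reach the disk, we call fsync() at the points where that matters instead.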
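To make the fsync() approach concrete, here is a minimal sketch of a write that is only reported as successful once fsync() has returned. The helper name write_durably is an assumption for illustration:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path (replacing any old
 * contents) and report success only after fsync() has completed.
 * Returns 0 on success, -1 on failure. */
int write_durably(const char* path, const void* data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    const char* p = data;
    while (len != 0) {
        ssize_t ret = write(fd, p, len);
        if (ret == -1) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += ret;
        len -= (size_t)ret;
    }

    /* Ask the kernel to flush this file's dirty buffers to disk. */
    if (fsync(fd) == -1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The writes themselves still go through the page cache at full speed; only the single fsync() at the end waits for the disk.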

6. File positioning

Explicit file positioning with lseek():

a. Set the file offset to 1825

off_t ret = lseek(fd, (off_t)1825, SEEK_SET);

b. Set the file offset to the end of the file

off_t ret = lseek(fd, 0, SEEK_END);

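c. Query the current file offset (with SEEK_CUR and an offset of 0 the position is left unchanged; seeking to the start of the file would be lseek(fd, 0, SEEK_SET))

off_t ret = lseek(fd, 0, SEEK_CUR);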

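The file offset may be moved past the end of the file. A write at that position leaves a gap that reads back as zeros. The gap is called a hole, and holes do not occupy physical disk space.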

This implies that the total size of all files on a filesystem can add up to more than the physical size of the disk
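The sketch below creates such a hole by seeking past the start of an empty file and writing a single byte. The name make_sparse_file and its parameters are illustrative assumptions. Whether the gap really takes no disk blocks depends on the filesystem, so the code only reports the logical size:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) the file at path, skip
 * gap bytes, and write one byte after the gap.
 * Returns the resulting file size, or -1 on error. */
off_t make_sparse_file(const char* path, off_t gap)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* Seek past the end of the (empty) file, then write one byte. */
    if (lseek(fd, gap, SEEK_SET) == (off_t)-1 || write(fd, "x", 1) != 1) {
        close(fd);
        return -1;
    }

    struct stat st;
    off_t size = (fstat(fd, &st) == 0) ? st.st_size : (off_t)-1;
    close(fd);
    return size;
}
```

With a gap of 4096 the file size becomes 4097 bytes, and reading anywhere inside the gap returns zero bytes.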

7. Truncating files

int ftruncate(int fd, off_t len);

Truncates the given file to the given length. The length may be smaller than the current file size, or larger (which creates a hole).
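Both directions can be tried with a short sketch. The helper name resize_file is hypothetical; it calls ftruncate() on an existing file and returns the size that fstat() reports afterwards:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set the size of an existing file to len bytes.
 * Returns the new size as reported by fstat(), or -1 on error. */
off_t resize_file(const char* path, off_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    off_t size = (off_t)-1;
    if (ftruncate(fd, len) == 0 && fstat(fd, &st) == 0)
        size = st.st_size;

    close(fd);
    return size;
}
```

Shrinking an 11-byte file to 5 bytes discards the tail. Growing it to 100 bytes afterwards makes bytes 5 through 99 read back as zeros.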

8. Multiplexed I/O

Blocking I/O: if read() is called on a file (a pipe, for example) that has no data available, the process blocks until data arrives. This is inefficient and prevents the process from servicing several files at once.

Multiplexed I/O allows a program to block on multiple files concurrently and to be notified as soon as any one of them becomes readable or writable.

Multiplexed I/O becomes the pivot point for the application, designed similarly to the following activity:
a. Multiplexed I/O: Tell me when any of these file descriptors becomes ready for I/O
b. Nothing ready? Sleep until one or more file descriptors are ready.
c. Woken up! What is ready?
d. Handle all file descriptors ready for I/O, without blocking
e. Go back to step a

9. select

int select(int nfds, fd_set* readfds, fd_set* writefds, fd_set* exceptfds, struct timeval* timeout);
FD_CLR(int fd, fd_set* set);    // removes a fd from a given set
FD_ISSET(int fd, fd_set* set);  // tests whether a fd is part of a given set
FD_SET(int fd, fd_set* set);    // adds a fd to a given set
FD_ZERO(fd_set* set);           // removes all fds from the specified set; should be called before every invocation of select()

Because fd_set is statically sized, it can only hold descriptors below a fixed maximum, FD_SETSIZE. On Linux that value is 1024.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define TIMEOUT 5     /* select timeout in seconds */
#define BUFLEN  1024  /* read buffer in bytes */

int main(int argc, char* argv[])
{
    struct timeval tv;
    tv.tv_sec = TIMEOUT;
    tv.tv_usec = 0;

    /* wait on stdin for input */
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(STDIN_FILENO, &readfds);

    int ret = select(STDIN_FILENO + 1, &readfds, NULL, NULL, &tv);
    if (ret == -1) {
        fprintf(stderr, "select error\n");
        return 1;
    } else if (!ret) {
        fprintf(stderr, "%d seconds elapsed.\n", TIMEOUT);
        return 0;
    }
    if (FD_ISSET(STDIN_FILENO, &readfds)) {
        char buf[BUFLEN + 1];
        ssize_t len = read(STDIN_FILENO, buf, BUFLEN);
        if (len == -1) {
            fprintf(stderr, "read error\n");
            return 1;
        }
        if (len != 0) {
            buf[len] = '\0';    /* terminate after the bytes actually read */
            fprintf(stdout, "read:%s\n", buf);
        }
        return 0;
    } else {
        fprintf(stderr, "This should not happen\n");
        return 1;
    }

}

10. poll

int poll(struct pollfd* fds, nfds_t  nfds, int timeout);

This is a program that uses poll() to check whether a read from stdin and a write to stdout will block.

#include <stdio.h>
#include <unistd.h>
#include <poll.h>

#define TIMEOUT 5

int main(int argc, char* argv[])
{
    struct pollfd fds[2];

    /* watch stdin for input */
    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;

    /* watch stdout for ability to write */
    fds[1].fd = STDOUT_FILENO;
    fds[1].events = POLLOUT;

    int ret = poll(fds, 2, TIMEOUT * 1000);
    if (ret == -1) {
        fprintf(stderr, "poll error\n");
        return 1;
    }

    if (!ret) {
        fprintf(stdout, "%d seconds elapsed.\n", TIMEOUT);
        return 0;
    }

    if (fds[0].revents & POLLIN) {
        fprintf(stdout, "stdin is readable\n");
    }
    if (fds[1].revents & POLLOUT) {
        fprintf(stdout, "stdout is writable\n");
    }
    return 0;
}

poll vs select

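a. poll() does not require the caller to compute and pass a descriptor-number argument (with select() that argument must be the highest descriptor number plus 1).

b. The fd_set used by select() is statically sized and limited to FD_SETSIZE descriptors. poll() has no such limit; the caller only needs to create an array of structures of the right size.

c. select() is more portable; more Unix systems support it.

d. select() supports finer-grained timeouts (microseconds via struct timeval); poll() only supports milliseconds.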
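Points a and d show up even in the smallest use of poll(). The helper below is a sketch with a made-up name (wait_readable): it waits on a single descriptor, passes an element count of 1 rather than a highest-descriptor-plus-one value, and takes its timeout in milliseconds:

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to
 * become readable. Returns 1 if a read would not block (data or EOF),
 * 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int ret = poll(&pfd, 1, timeout_ms);
    if (ret <= 0)
        return ret;     /* 0: timed out, -1: poll() failed */

    if (pfd.revents & (POLLIN | POLLHUP))
        return 1;       /* data available, or the peer closed (read gives EOF) */
    return -1;          /* POLLERR or POLLNVAL */
}
```

A timeout of 0 turns the call into a pure check that returns immediately, and a negative timeout waits indefinitely.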

11. Kernel internals

The Linux kernel relies mainly on three subsystems to provide efficient and powerful I/O: the virtual filesystem, the page cache, and page write-back.

(1) virtual filesystem

The virtual filesystem (also called a virtual file switch) is a mechanism of abstraction that allows the Linux kernel to call filesystem functions and manipulate filesystem data without knowing the specific type of filesystem being used.

So, a single system call can read any filesystem on any medium. All filesystems support the same concepts, the same interfaces, and the same calls.

(2) page cache

The page cache is an in-memory store of recently accessed data from an on-disk filesystem.

Storing requested data in memory allows the kernel to fulfill subsequent requests for the same data from memory, avoiding repeated disk access

The page cache exploits the concept of temporal locality, which says that a resource accessed at one point has a high probability of being accessed again in the near future

Temporal locality:

The page cache is the first place that the kernel looks for filesystem data. The first time any item of data is read, it is transferred from the disk into the page cache, and is returned to the application from the cache.

Spatial locality:

The data is often referenced sequentially. The kernel implements page cache readahead. Readahead is the act of reading extra data off the disk and into the page cache following each read request. In effect, reading a little bit ahead.

(3) page write-back

When a process issues a write request, the data is copied into a buffer, and the buffer is marked dirty, denoting that the in-memory copy is newer than the on-disk copy.

Eventually, the dirty buffers need to be committed to disk, synchronizing the on-disk files with the data in memory.