Why do ls and du report different file sizes?

Date: 2019-12-06

On several occasions I have checked the size of a file with both ls and du and found that the two disagree, for example:

bl@d3:~/test/sparse_file$ ls -l fs.img
-rw-r--r-- 1 bl bl 1073741824 2012-02-17 05:09 fs.img

bl@d3:~/test/sparse_file$ du -sh fs.img
0       fs.img

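Here ls reports the size of fs.img as 1073741824 bytes (1 GiB), while du reports it as 0.

I never looked into this properly before, so today I am finally doing so.

There are two main reasons for the difference:

Sparse files
The sizes reported by ls and du mean different things

Let us look at sparse files first. A sparse file is a file that contains "holes". For example, a file with a hole can be created in C like this: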

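#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    /* O_CREAT requires a mode argument; 0644 gives rw-r--r-- (before umask). */
    int fd = open("sparse.file", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    lseek(fd, 1024, SEEK_CUR);  /* move the offset past the end of the file */
    write(fd, "\0", 1);         /* writing here leaves a 1024-byte hole before it */
    close(fd);
    return 0;
}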

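As the program shows, creating a file with a hole comes down to using lseek to move the file offset past the end of the file and then calling write. The range that was skipped becomes a hole.

A sparse file can also be created from the shell: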

$ dd if=/dev/zero of=sparse_file.img bs=1M seek=1024 count=0
0+0 records in
0+0 records out

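The advantage of sparse files is described as follows (quoted from Wikipedia):

The advantage of sparse files is that storage is only allocated when actually needed: disk space is saved, and large files can be created even if there is insufficient free space on the file system.

In other words, the holes in a sparse file need not occupy any storage.

Now for what the sizes printed by ls and du actually mean (again quoted from Wikipedia):

The du command which prints the occupied space, while ls print the apparent size.

Put differently, ls shows the logical size of a file, while du shows its physical size: the figure du prints is computed from the number of blocks the file occupies on disk. For example: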

bl@d3:~/test/sparse_file$ echo -n 1 > 1B.txt
bl@d3:~/test/sparse_file$ ls -l 1B.txt
-rw-r--r-- 1 bl bl 1 2012-02-19 05:17 1B.txt
bl@d3:~/test/sparse_file$ du -h 1B.txt
4.0K    1B.txt

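Here we first create a file 1B.txt that is one byte long. ls reports its size as 1 byte, whereas on disk the file occupies N blocks, and the figure du prints is computed from N and the size of each block. I write N rather than a concrete number because many details are hidden behind the scenes, such as the fragment size, which we will discuss another time.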

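Of course, all of the above is only the default behaviour of ls and du, and both provide options to change it: for example the -s option of ls (print the allocated size of each file, in blocks) and the --apparent-size option of du (print apparent sizes, rather than disk usage; although the apparent size is usually smaller, it may be larger due to holes in (`sparse') files, internal fragmentation, indirect blocks, and the like).

In addition, when copying a sparse file, cp by default applies an optimisation to speed up the copy. For example: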

strace cp fs.img fs.img.copy >log 2>&1

Opening the log file, we find that cp only issues read and lseek calls, with no write.

stat("fs.img.copy", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("fs.img", {st_mode=S_IFREG|0644, st_size=1073741824, ...}) = 0
stat("fs.img.copy", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
open("fs.img", O_RDONLY)                = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1073741824, ...}) = 0
open("fs.img.copy", O_WRONLY|O_TRUNC)   = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 532480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90df965000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 524288
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 1048576
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 1572864

This is related to cp's sparse-file options. From the cp man page:

By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.

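Looking at the cp source code, I found that after each read, cp checks whether the data it has just read is all zeros; if so, it only calls lseek and does not write.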

Of course, all of this handling of sparse files is transparent to the user.