How rsync checksums and synchronizes files, plus rsync server configuration

Reference: http://rsync.samba.org/how-rsync-works.html

What we care about here is the algorithm rsync uses for sending and receiving checksummed files; the relevant passages from the original document are quoted below:

The Sender

The sender process reads the file index numbers and associated block checksum sets one at a time from the generator.

For each file id the generator sends it will store the block checksums and build a hash index of them for rapid lookup.

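As an illustration of that lookup structure, here is a minimal Python sketch; the bucketing scheme and names are assumptions for illustration, not rsync's actual data structures:

from collections import defaultdict

def build_block_index(checksums):
    """checksums: list of (weak, strong) checksum pairs for the basis file,
    in block order. Returns a table keyed by a cheap 16-bit hash of the
    weak checksum so a candidate block is found with a single probe."""
    table = defaultdict(list)
    for block_no, (weak, strong) in enumerate(checksums):
        bucket = (weak ^ (weak >> 16)) & 0xFFFF  # illustrative 16-bit hash
        table[bucket].append((block_no, weak, strong))
    return table
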
Then the local file is read and a checksum is generated for the block beginning with the first byte of the local file. This block checksum is looked for in the set that was sent by the generator, and if no match is found, the non-matching byte will be appended to the non-matching data and the block starting at the next byte will be compared. This is what is referred to as the "rolling checksum".

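To see what the "rolling checksum" looks like in practice, here is a minimal Python sketch modelled on the weak checksum described in the rsync technical report; the block size, names and the small self-test are illustrative, not rsync's real code:

MOD = 1 << 16  # both halves of the weak checksum are kept modulo 2^16

def weak_checksum(block):
    """Weak checksum of a whole block: a is the plain byte sum, b weights each
    byte by its distance from the end of the block; both are packed into 32 bits."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return a + (b << 16)

def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window forward one byte in O(1): drop out_byte, add in_byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b

if __name__ == "__main__":
    data = b"hello rolling checksum, hello rsync"
    n = 8
    a = sum(data[:n]) % MOD
    b = sum((n - i) * x for i, x in enumerate(data[:n])) % MOD
    for k in range(len(data) - n):
        a, b = roll(a, b, data[k], data[k + n], n)
        assert a + (b << 16) == weak_checksum(data[k + 1:k + 1 + n])
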
If a block checksum match is found it is considered a matching block, and any accumulated non-matching data will be sent to the receiver followed by the offset and length in the receiver's file of the matching block, and the block checksum generator will be advanced to the next byte after the matching block.

Matching blocks can be identified in this way even if the blocks are reordered or at different offsets. This process is the very heart of the rsync algorithm.
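
Putting the pieces together, a simplified sender-side search loop could look like the following Python sketch. It is illustrative only: the token format, block size and helper names are assumptions, the weak checksum is recomputed at every offset instead of being rolled, the index is a plain dict rather than the bucketed table shown earlier, and real rsync streams data instead of holding whole files in memory.

import hashlib
from collections import defaultdict

BLOCK = 700  # illustrative; rsync chooses a block size per file

def weak(block):
    return sum(block) & 0xFFFFFFFF      # stand-in for the rolling checksum

def strong(block):
    return hashlib.md5(block).digest()  # rsync uses MD5 (MD4 in older versions)

def build_index(remote_checksums):
    """remote_checksums: one (weak, strong) pair per block of the basis file."""
    index = defaultdict(list)
    for block_no, (w, s) in enumerate(remote_checksums):
        index[w].append((block_no, s))
    return index

def delta(local, index):
    """Yield ('data', bytes) literals and ('match', block_no) tokens."""
    i, literal = 0, bytearray()
    while i < len(local):
        block = local[i:i + BLOCK]
        hit = next((n for n, s in index.get(weak(block), [])
                    if s == strong(block)), None)
        if hit is not None:
            if literal:                  # flush accumulated non-matching data first
                yield ('data', bytes(literal))
                literal = bytearray()
            yield ('match', hit)         # the receiver copies this block from its basis file
            i += len(block)              # continue after the matched block
        else:
            literal.append(local[i])     # no match here: keep one literal byte
            i += 1                       # and try the block starting at the next byte
    if literal:
        yield ('data', bytes(literal))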

In this way, the sender will give the receiver instructions for how to reconstruct the source file into a new destination file. These instructions detail all the matching data that can be copied from the basis file (if one exists for the transfer), and include any raw data that was not available locally. At the end of each file's processing a whole-file checksum is sent and the sender proceeds with the next file.

Generating the rolling checksums and searching for matches in the checksum set sent by the generator require a good deal of CPU power. Of all the rsync processes it is the sender that is the most CPU intensive.

 

The Receiver

The receiver will read from the sender data for each file identified by the file index number. It will open the local file (called the basis) and will create a temporary file.

The receiver will expect to read non-matched data and/or to match records all in sequence for the final file contents. When non-matched data is read it will be written to the temp-file. When a block match record is received the receiver will seek to the block offset in the basis file and copy the block to the temp-file. In this way the temp-file is built from beginning to end.
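
A matching receiver-side sketch, using the same illustrative ('data', bytes) / ('match', block_no) tokens as above rather than rsync's real wire protocol; note how the whole-file checksum is accumulated while the temp-file is written:

import hashlib

BLOCK = 700  # must match the block size the checksums were generated with

def rebuild(tokens, basis_path, temp_path):
    """Build temp_path from the token stream, copying matched blocks out of the
    basis file, and return the whole-file MD5 for comparison with the sender's."""
    digest = hashlib.md5()
    with open(basis_path, 'rb') as basis, open(temp_path, 'wb') as temp:
        for kind, value in tokens:
            if kind == 'data':           # literal bytes sent by the sender
                chunk = value
            else:                        # matched block: seek into the basis file
                basis.seek(value * BLOCK)
                chunk = basis.read(BLOCK)
            temp.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()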

The file's checksum is generated as the temp-file is built. At the end of the file, this checksum is compared with the file checksum from the sender. If the file checksums do not match the temp-file is deleted. If the file fails once it will be reprocessed in a second phase, and if it fails twice an error is reported.

After the temp-file has been completed, its ownership and permissions and modification time are set. It is then renamed to replace the basis file.
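
In Python that finishing step might look roughly like this (a sketch only; rsync itself handles many more attributes, options and error paths):

import os

def finalize(temp_path, basis_path, src_stat):
    """Copy ownership, permissions and mtime from src_stat onto the temp file,
    then atomically rename it over the basis file."""
    os.chown(temp_path, src_stat.st_uid, src_stat.st_gid)    # needs sufficient privilege
    os.chmod(temp_path, src_stat.st_mode & 0o7777)
    os.utime(temp_path, (src_stat.st_atime, src_stat.st_mtime))
    os.replace(temp_path, basis_path)                         # rename is atomic on the same filesystem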

Copying data from the basis file to the temp-file makes the receiver the most disk intensive of all the rsync processes. Small files may still be in disk cache, mitigating this, but for large files the cache may thrash as the generator has moved on to other files and there is further latency caused by the sender. As data is read, possibly at random, from one file and written to another, if the working set is larger than the disk cache, then what is called a seek storm can occur, further hurting performance.

 

If this still leaves you confused, don't worry: a quick Bing search shows that someone has already done the groundwork, diagrams included:

http://coolshell.cn/articles/7425.html#more-7425

 

Appendix: rsync server configuration:

1. Edit the /etc/rsyncd.conf file

uid = nobody
gid = nobody
use chroot = yes
max connections = 4
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsync.lock
[MYSERVER]
path = /
read only = no
write only = no
list = yes
uid = root
gid = root
auth users = root
secrets file = /root/rsyncd.secrets


2. Create /root/rsyncd.secrets
root:123456

Set the permissions of this file to 400, otherwise an error will be reported:
chmod 400 /root/rsyncd.secrets

3. Edit the /etc/xinetd.d/rsync file

# default: off
# description: The rsync server is a good addition to an ftp server, as it \
# allows crc checksumming etc.
service rsync
{
        disable         = no
        flags           = IPv6
        socket_type     = stream
        wait            = no
        user            = root
        server          = /usr/bin/rsync
        server_args     = --daemon --config=/etc/rsyncd.conf
        log_on_failure  += USERID
}

4. Install xinetd (if it is not already present) and enable the rsync service

yum install xinetd
chkconfig rsync on

5. service xinetd restart

 

Client

1. Create the /tmp/rsync.pass file
123456

Set the permissions of this file to 400, otherwise an error will be reported:
chmod 400 /tmp/rsync.pass

2. Run rsync (for manual testing)
rsync -az --password-file=/tmp/rsync.pass --progress /local/file1 root@192.168.20.221::MYSERVER/remote/

 

Finally, here are the results of a recent POC:

Scenario 1 (transfer when the file content has changed):
command: rsync --password-file=/tmp/rsync.pass --progress scptest2.vmdk root@9.112.224.244::MYSERVER/root/rsync/
result: only the changed data is transferred.

Scenario 2 (transfer resumes after a network break):
command: rsync --password-file=/tmp/rsync.pass --progress scptest2.vmdk root@9.112.224.244::MYSERVER/root/rsync/
result: the rsync client waits until the network recovers and then resumes the transfer.

Scenario 3 (partial transfer):
command: rsync --password-file=/tmp/rsync.pass --progress --partial scptest2.vmdk root@9.112.224.244::MYSERVER/root/rsync/
result: if rsync was started with the --partial option before the transfer broke, it continues transferring the remaining data.

If scenarios 1, 2 and 3 happen together, the rsync client can still resume and finish the transfer, because rsync uses its checksum algorithm to detect which parts of the file have changed.
