Shell script example: batch-comparing whether multiple files have identical contents

Date: 2019-11-06

bash & shell article series: http://www.cnblogs.com/f-ck-need-u/p/7048359.html

To check whether two files have exactly the same contents, you can simply use the diff command. For example:

diff file1 file2 &>/dev/null;echo $?
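The one-liner above relies on diff's exit status. As a minimal sketch, assuming GNU diff (where 0 means no differences, 1 means the files differ, and 2 means trouble), the three cases can be observed on scratch files. The temporary directory and the file names a, b and c are made up for the illustration.

```bash
#!/bin/bash
# Hypothetical scratch files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'same\n'      > "$tmpdir/a"
printf 'same\n'      > "$tmpdir/b"
printf 'different\n' > "$tmpdir/c"

# Identical contents: diff exits with 0.
status_same=0
diff "$tmpdir/a" "$tmpdir/b" &>/dev/null || status_same=$?

# Different contents: diff exits with 1.
status_diff=0
diff "$tmpdir/a" "$tmpdir/c" &>/dev/null || status_diff=$?

# Trouble (here, a missing file): GNU diff exits with 2.
status_error=0
diff "$tmpdir/a" "$tmpdir/missing" &>/dev/null || status_error=$?

echo "same=$status_same diff=$status_diff error=$status_error"
rm -rf "$tmpdir"
```

A script should therefore treat only 0 as "identical"; any non-zero status means either a difference or an error.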

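However, diff accepts only two file operands (a directory is also treated as a file), so it cannot compare many files in one pass, and it is very inefficient when comparing non-text files or very large files.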

In that case md5sum can do the job: compared with diff's line-by-line comparison, md5sum is much faster.

For md5sum usage, see the article "Linux中文件MD5校驗" (MD5 checksum verification of files in Linux).
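As a rough illustration of comparing by digest, the sketch below checks two files by hand. It assumes md5sum from GNU coreutils, which prints the digest followed by the file name. The scratch files a and b are invented for the example.

```bash
#!/bin/bash
# Hypothetical scratch files with identical contents.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/a"
printf 'hello\n' > "$tmpdir/b"

# md5sum prints "<digest>  <file>"; keep only the digest column.
sum_a=$(md5sum "$tmpdir/a" | cut -d' ' -f1)
sum_b=$(md5sum "$tmpdir/b" | cut -d' ' -f1)

# Equal digests are taken to mean equal contents.
if [ "$sum_a" = "$sum_b" ]; then
    result="identical"
else
    result="different"
fi
echo "$result"
rm -rf "$tmpdir"
```

This works for two files, but repeating it by hand for every pair quickly becomes tedious.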

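But md5sum only lets you compare files indirectly, by inspecting their MD5 values. To compare a batch of files automatically, you need to wrap it in a loop. The script is as follows: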

#!/bin/bash
###########################################################
#  description: compare many files at once                #
#  author     : 駿馬金龍                                   #
#  blog       : http://www.cnblogs.com/f-ck-need-u/       #
###########################################################

# filename: md5.sh
# Usage: $0 file1 file2 file3 ...

IFS=$'\n'
declare -A md5_array

# With "md5sum ... | while read", the loop runs in a subshell, so the
# array filled inside it is empty again after the loop. A for loop over
# the command substitution is used instead, and IFS is set to $'\n' so
# that each line of md5sum output becomes one loop item.

# md5sum output format: MD5  /path/to/file
# e.g. 80748c3a55b726226ad51a4bafa1c4aa  /etc/fstab
for line in `md5sum "$@"`
do
    index=${line%% *}
    file=${line##* }
    md5_array[$index]="$file ${md5_array[$index]}"
done

# Traverse the md5_array
for i in ${!md5_array[@]}
do
    echo -e "the same file with md5: $i\n--------------\n`echo ${md5_array[$i]}|tr ' ' '\n'`\n"
done

To test the script, first make several copies of a file and then modify a few of them, for example:

[root@xuexi ~]# for i in `seq -s' ' 6`;do cp -a /etc/fstab /tmp/fs$i;done
[root@xuexi ~]# echo ha >>/tmp/fs4
[root@xuexi ~]# echo haha >>/tmp/fs5

Now /tmp contains six files: fs1, fs2, fs3, fs4, fs5 and fs6. fs4 and fs5 have been modified, and the remaining four files have identical contents.

[root@xuexi tmp]# ./md5.sh /tmp/fs[1-6]
the same file with md5: a612cd5d162e4620b442b0ff3474bf98
--------------------------
/tmp/fs6
/tmp/fs3
/tmp/fs2
/tmp/fs1

the same file with md5: 80748c3a55b726226ad51a4bafa1c4aa
--------------------------
/tmp/fs4

the same file with md5: 30dd43dba10521c1e94267bbd117877b
--------------------------
/tmp/fs5

A more general way to compare: checking same-named files under multiple directories.

[root@xuexi tmp]# find /tmp -type f -name "fs[0-9]" -print0 | xargs -0 ./md5.sh
the same file with md5: a612cd5d162e4620b442b0ff3474bf98
--------------------------
/tmp/fs6
/tmp/fs3
/tmp/fs2
/tmp/fs1

the same file with md5: 80748c3a55b726226ad51a4bafa1c4aa
--------------------------
/tmp/fs4

the same file with md5: 30dd43dba10521c1e94267bbd117877b
--------------------------
/tmp/fs5
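The find and xargs pairing can also be tried without md5.sh. The sketch below feeds md5sum directly and sorts the result, so lines with equal digests end up next to each other. The directory layout (dir1, dir2, dir3) and the file name app.conf are hypothetical, and GNU find, xargs and md5sum are assumed.

```bash
#!/bin/bash
# Hypothetical layout: three directories that each contain app.conf.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/dir1" "$tmpdir/dir2" "$tmpdir/dir3"
printf 'port=80\n'   > "$tmpdir/dir1/app.conf"
printf 'port=80\n'   > "$tmpdir/dir2/app.conf"
printf 'port=8080\n' > "$tmpdir/dir3/app.conf"

# -print0 and -0 pass the file names NUL-separated, so names with
# spaces reach md5sum intact; sort puts equal digests on adjacent lines.
listing=$(find "$tmpdir" -type f -name 'app.conf' -print0 | xargs -0 md5sum | sort)
echo "$listing"

# Count the distinct digests, i.e. the number of distinct contents.
distinct=$(cut -d' ' -f1 <<<"$listing" | sort -u | wc -l)
echo "distinct contents: $distinct"
rm -rf "$tmpdir"
```

Here the three same-named files collapse to two distinct contents, because dir1 and dir2 hold identical copies.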

Notes on the script:

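(1). md5sum prints each result in the form "MD5 /path/to/file". To output both each MD5 value and the files that share it, an array (here an associative array keyed by the MD5 value) is a natural choice.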

(2). At first I used a while loop that read the md5sum result of each file from standard input. The statement was:

md5sum "$@" | while read index file;do
    md5_array[$index]="$file ${md5_array[$index]}"
done

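But because the pipe makes the while statement run in a subshell, the values assigned to md5_array inside the loop are lost once the loop ends. It can therefore be rewritten as: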

while read index file;do
    md5_array[$index]="$file ${md5_array[$index]}"
done <<<"$(md5sum "$@")"
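The difference between the two forms can be checked in isolation. The following sketch uses a made-up array named seen and two hard-coded input lines instead of md5sum output. It assumes bash 4 or later with the default settings of a script (the lastpipe option is off).

```bash
#!/bin/bash
declare -A seen

# Pipeline: the while loop runs in a subshell, so its writes are lost.
printf 'k1 v1\nk2 v2\n' | while read -r key value; do
    seen[$key]=$value
done
after_pipe=${#seen[@]}

# Here-string: the loop runs in the current shell, so its writes persist.
while read -r key value; do
    seen[$key]=$value
done <<<"$(printf 'k1 v1\nk2 v2\n')"
after_herestring=${#seen[@]}

# prints: after pipe: 0, after here-string: 2
echo "after pipe: $after_pipe, after here-string: $after_herestring"
```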

In the end, though, I still went with the more cumbersome for loop:

IFS=$'\n'
for line in `md5sum "$@"`
do
    index=${line%% *}
    file=${line##* }
    md5_array[$index]="$file ${md5_array[$index]}"
done

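However, each line of md5sum output has two columns, and with the default IFS the for loop would split them into two separate values. That is why IFS is also set to $'\n', so that the loop variable receives one whole line per iteration.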
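The effect of IFS on the for loop can be seen by counting iterations. In this sketch the two sample lines only imitate the md5sum layout. The digests aaaa and bbbb and the paths are invented.

```bash
#!/bin/bash
# Two sample lines in md5sum's "<digest>  <file>" layout (made-up values).
sample=$'aaaa  /tmp/f1\nbbbb  /tmp/f2'

# Default IFS (space, tab, newline): every column is a separate item.
count_default=0
for item in $sample; do
    count_default=$((count_default + 1))
done

# IFS limited to newline: one item per line.
old_ifs=$IFS
IFS=$'\n'
count_newline=0
for item in $sample; do
    count_newline=$((count_newline + 1))
done
IFS=$old_ifs

# prints: default IFS: 4 items, newline IFS: 2 items
echo "default IFS: $count_default items, newline IFS: $count_newline items"
```

Saving and restoring IFS, as done here, keeps the change from leaking into later commands in a longer script.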

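(3). The index and file variables split each line of md5sum output into two parts: the MD5 part becomes the array index, and file becomes part of the value stored under that index. The array assignment is therefore: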

md5_array[$index]="$file ${md5_array[$index]}"
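The two parameter expansions that produce index and file can be tried on a single sample line. The sketch also shows a caveat that follows from how `${line##* }` works: a path containing spaces is cut at its last space. The variable names spaced, file_cut and file_full, and the path "/tmp/my file", are made up for the illustration.

```bash
#!/bin/bash
line='80748c3a55b726226ad51a4bafa1c4aa  /etc/fstab'

index=${line%% *}   # drop the longest suffix that starts with a space
file=${line##* }    # drop the longest prefix that ends with a space

echo "index=$index"   # index=80748c3a55b726226ad51a4bafa1c4aa
echo "file=$file"     # file=/etc/fstab

# Caveat: a path that contains spaces is cut at its last space.
spaced='80748c3a55b726226ad51a4bafa1c4aa  /tmp/my file'
file_cut=${spaced##* }     # only "file" survives
file_full=${spaced#*  }    # drop the shortest prefix ending in two spaces
echo "cut=$file_cut full=$file_full"
```

Stripping only up to the two-space separator, as in file_full, is one possible way to keep such paths whole.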

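(4). Once the array has been filled, it is traversed. There are several ways to do this. I iterate over the array's list of indexes, that is, the MD5 values.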

# Traverse the md5_array
for i in ${!md5_array[@]}
do
    echo -e "the same file with md5: $i\n--------------\n`echo ${md5_array[$i]}|tr ' ' '\n'`\n"
done
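For comparison, here is one possible variant of the same idea, not the script above. It reads md5sum output through process substitution and stores the file names newline-separated, so no IFS change or tr call is needed and names with spaces stay intact. The function name group_by_md5 and the scratch files are hypothetical, and bash 4 or later is assumed.

```bash
#!/bin/bash
# Sketch only: group the given files by MD5 digest.
group_by_md5() {
    local -A by_md5
    local digest file
    # Process substitution keeps the loop in the current shell.
    while read -r digest file; do
        by_md5[$digest]+="$file"$'\n'
    done < <(md5sum "$@")

    # Traverse the indexes (the digests) and print each group.
    for digest in "${!by_md5[@]}"; do
        printf 'the same file with md5: %s\n--------------\n%s\n' \
            "$digest" "${by_md5[$digest]}"
    done
}

# Hypothetical scratch files: two identical, one different.
tmpdir=$(mktemp -d)
printf 'x\n' > "$tmpdir/a"
printf 'x\n' > "$tmpdir/b c"
printf 'y\n' > "$tmpdir/d"
report=$(group_by_md5 "$tmpdir"/*)
echo "$report"
rm -rf "$tmpdir"
```

As with any associative array in bash, the order in which the groups are printed is not guaranteed.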