How to identify files with the same content on Linux

Date: 2019-12-05

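Copies of files can sometimes waste a great deal of disk space and cause confusion when you want to update a file. Here are six commands for identifying such files.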

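In a recent post, we looked at how to identify and locate files that are hard links (that is, files that point to the same disk content and share an inode). In this post, we'll look at commands for finding files that have the same content but are not linked to each other.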

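Hard links are helpful because they allow a file to appear in multiple places in the file system without taking up any additional disk space. Copies of files, on the other hand, can waste a great deal of disk space and risk causing confusion when you want to make updates. In this post, we'll look at several ways to identify these files.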

Comparing files with the diff command

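Probably the easiest way to compare two files is to use the diff command. The output shows the differences between the two files. The < and > signs indicate whether the extra lines are in the first (<) or second (>) file given as an argument. In this example, the extra lines are in backup.html.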

$ diff index.html backup.html
2438a2439,2441
> <pre>
> That's all there is to report.
> </pre>

If diff produces no output, the two files are identical.

$ diff home.html index.html
$

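The only drawbacks to diff are that it can compare only two files at a time and that you have to name the files to compare. Some of the commands in this post can find multiple duplicate files for you.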
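
One way around the two-files-at-a-time limit is to loop over every pair yourself. The sketch below is only an illustration: the function name identical_pairs is made up for this example, and it swaps in cmp -s (a standard companion to diff that prints nothing and reports the result through its exit status) rather than diff itself.

```shell
# Hypothetical helper: print every pair of the given files with identical content.
identical_pairs() {
    for first in "$@"; do
        shift    # drop $first so the inner loop only sees the later files
        for second in "$@"; do
            # cmp -s is silent; exit status 0 means the files match byte for byte
            if cmp -s -- "$first" "$second"; then
                printf '%s = %s\n' "$first" "$second"
            fi
        done
    done
}
```

Calling it as identical_pairs *.html would compare each pair once. The number of comparisons grows with the square of the number of files, so this only suits small sets.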

Using checksums

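The cksum (checksum) command computes a checksum for each file. A checksum is a mathematical reduction of a file's contents to a number; cksum prints that number followed by the file's size in bytes (for example, 2819078353 228029). While checksums are not guaranteed to be unique, the chance that files with different content produce the same checksum is very small.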

$ cksum *.html
2819078353 228029 backup.html
4073570409 227985 home.html
4073570409 227985 index.html

In the example above, you can see that the second and third files yield the same checksum and can be assumed to be identical.
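
With more than a handful of files, spotting matching checksums by eye gets tedious. As a rough sketch, the cksum output can be piped through awk so that only files sharing both checksum and size are printed. The function name same_cksum is hypothetical, and the grouping logic is an assumption of this example rather than a feature of cksum.

```shell
# Hypothetical helper: print groups of files whose cksum checksum and size both match.
same_cksum() {
    cksum -- "$@" | awk '
        {
            key = $1 " " $2                      # checksum and byte count
            name = substr($0, length(key) + 2)   # the rest of the line is the file name
            count[key]++
            names[key] = names[key] "  " name "\n"
        }
        END {
            # only report keys that were seen for more than one file
            for (key in count)
                if (count[key] > 1)
                    printf "%s\n%s", key, names[key]
        }'
}
```

Run as same_cksum *.html, this would print each shared checksum once, followed by an indented list of the files that produced it.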

Using the find command

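While the find command doesn't have an option for finding duplicate files, it can be used to search for files by name or type and run the cksum command on them. For example: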

$ find . -name "*.html" -exec cksum {} \;
4073570409 227985 ./home.html
2819078353 228029 ./backup.html
4073570409 227985 ./index.html
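
The find output above still leaves the matching up to you. A common extension, sketched here under the assumption that GNU coreutils are installed (sha256sum, plus the -w and --all-repeated options of GNU uniq), is to sort the digests and print only the repeated ones. The function name find_dupes_sha256 and its two arguments are made up for this example, and SHA-256 is swapped in for cksum because its fixed 64-character digest is easy for uniq to compare.

```shell
# Hypothetical helper: list files under a directory that share a SHA-256 digest.
# Assumes GNU coreutils (sha256sum, and uniq with -w and --all-repeated).
find_dupes_sha256() {
    dir=${1:-.}         # starting directory, defaults to the current one
    pattern=${2:-*}     # file name pattern, defaults to every file
    find "$dir" -type f -name "$pattern" -exec sha256sum {} + |
        sort |
        uniq -w 64 --all-repeated=separate   # compare only the 64-character digest
}
```

For instance, find_dupes_sha256 . "*.html" would print groups of matching files separated by blank lines and print nothing when no duplicates exist.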

Using the fslint command

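The fslint command can be used specifically to find duplicate files. Note that we give it a starting location. The command can take quite some time to complete if it has to run through a large number of files. Note how it lists the duplicate files and also looks for other problems, such as empty directories and bad IDs.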

$ fslint .
-----------------------------------file name lint
-------------------------------Invalid utf8 names
-----------------------------------file case lint
----------------------------------DUPlicate files   <==
home.html
index.html
-----------------------------------Dangling links
--------------------redundant characters in links
------------------------------------suspect links
--------------------------------Empty Directories
./.gnupg
----------------------------------Temporary Files
----------------------duplicate/conflicting Names
------------------------------------------Bad ids
-------------------------Non Stripped executables

You may have to install fslint on your system. You may also need to add it to your command search path:

$ export PATH=$PATH:/usr/share/fslint/fslint

Using the rdfind command

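The rdfind command also looks for duplicate (same content) files. The name stands for "redundant data find", and the command can determine, based on file dates, which files are the originals. That is helpful if you choose to delete the duplicates, because it will remove the newer files.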

$ rdfind ~
Now scanning "/home/shark", found 12 files.
Now have 12 files in total.
Removed 1 files due to nonunique device and inode.
Total size is 699498 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
It seems like you have 2 files that are not unique
Totally, 223 KiB can be reduced.
Now making results file results.txt
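
The output above hints at a staged strategy: files with unique sizes are discarded first, and checksums are computed only for what remains. The following is a rough sketch of that idea and not rdfind's actual implementation. It skips the first-bytes and last-bytes stages, the function name dupes_size_then_sha1 is hypothetical, and it assumes GNU find (-printf), sha1sum, and GNU uniq (-w, --all-repeated).

```shell
# Hypothetical sketch of size-first filtering; not rdfind's actual implementation.
# Assumes GNU find (-printf), sha1sum, and GNU uniq (-w, --all-repeated).
dupes_size_then_sha1() {
    # Step 1: list "size<TAB>path" for every regular file under the directory.
    find "${1:-.}" -type f -printf '%s\t%p\n' |
    # Step 2: keep only paths whose size is shared with at least one other file.
    awk -F '\t' '
        { size[NR] = $1; path[NR] = substr($0, length($1) + 2); seen[$1]++ }
        END { for (i = 1; i <= NR; i++) if (seen[size[i]] > 1) print path[i] }' |
    # Step 3: checksum the remaining candidates and print groups with equal digests.
    while IFS= read -r file; do sha1sum -- "$file"; done |
        sort | uniq -w 40 --all-repeated=separate
}
```

Comparing sizes is far cheaper than reading file contents, which is why the size check comes first. File names containing tabs or newlines would confuse this sketch.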

You can also run this command in dryrun mode (in other words, only reporting the changes that would otherwise be made).

$ rdfind -dryrun true ~
(DRYRUN MODE) Now scanning "/home/shark", found 12 files.
(DRYRUN MODE) Now have 12 files in total.
(DRYRUN MODE) Removed 1 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 699352 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
(DRYRUN MODE) It seems like you have 2 files that are not unique
(DRYRUN MODE) Totally, 223 KiB can be reduced.
(DRYRUN MODE) Now making results file results.txt

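The rdfind command also provides options for things like ignoring empty files (-ignoreempty) and following symbolic links (-followsymlinks). Check the man page for explanations.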

-ignoreempty        ignore empty files
-minsize            ignore files smaller than specified size
-followsymlinks     follow symbolic links
-removeidentinode   remove files referring to identical inode
-checksum           identify checksum type to be used
-deterministic      determines how to sort files
-makesymlinks       turn duplicate files into symbolic links
-makehardlinks      replace duplicate files with hard links
-makeresultsfile    create a results file in the current directory
-outputname         provide name for results file
-deleteduplicates   delete/unlink duplicate files
-sleep              set sleep time between reading files (milliseconds)
-n, -dryrun         display what would have been done, but don't do it

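Note that the rdfind command offers an option to delete duplicate files with the -deleteduplicates true setting. Hopefully the command's little problem with grammar won't irritate you. ;-)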

$ rdfind -deleteduplicates true .
...
Deleted 1 files.    <==

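You will probably have to install the rdfind command on your system. It's a good idea to experiment with it to get comfortable with how it works.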
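
Because -deleteduplicates true removes files as soon as it runs, one cautious pattern is to preview first and delete only after an explicit confirmation. The wrapper below is a minimal sketch: the function name rdfind_confirmed_delete is made up, and it assumes that your version of rdfind accepts -dryrun true together with -deleteduplicates true, which is worth checking against the man page.

```shell
# Hypothetical wrapper: preview with -dryrun, then delete only after a "y" answer.
rdfind_confirmed_delete() {
    dir=${1:-.}
    # Dry run: report what -deleteduplicates would do without touching any file.
    # (Assumption: the two options can be combined in your rdfind version.)
    rdfind -dryrun true -deleteduplicates true "$dir" || return 1
    printf 'Delete the duplicates reported above? [y/N] '
    read -r answer
    if [ "$answer" = y ]; then
        rdfind -deleteduplicates true "$dir"
    else
        echo 'Nothing deleted.'
    fi
}
```

Anything other than a plain y leaves the files untouched.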

Using the fdupes command

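The fdupes command also makes it easy to identify duplicate files. It provides a large number of useful options as well, such as -r for recursion. In this example, it groups the duplicate files together like this: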

$ fdupes ~
/home/shs/UPGRADE
/home/shs/mytwin

/home/shs/lp.txt
/home/shs/lp.man

/home/shs/penguin.png
/home/shs/penguin0.png
/home/shs/hideme.png

Here is an example using recursion. Note that many of the duplicate files are important (users' .bashrc and .profile files) and should not be deleted.

# fdupes -r /home
/home/shark/home.html
/home/shark/index.html

/home/dory/.bashrc
/home/eel/.bashrc

/home/nemo/.profile
/home/dory/.profile
/home/shark/.profile

/home/nemo/tryme
/home/shs/tryme

/home/shs/arrow.png
/home/shs/PNGs/arrow.png

/home/shs/11/files_11.zip
/home/shs/ERIC/file_11.zip

/home/shs/penguin0.jpg
/home/shs/PNGs/penguin.jpg
/home/shs/PNGs/penguin0.jpg

/home/shs/Sandra_rotated.png
/home/shs/PNGs/Sandra_rotated.png
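
Since fdupes separates each set of duplicates with a blank line, its output is easy to post-process. As a small sketch, awk's paragraph mode can count the sets and the redundant copies. The function name summarize_dupe_groups is hypothetical, and it relies only on the blank-line grouping shown in the listings above.

```shell
# Hypothetical helper: summarize fdupes-style output read from standard input.
summarize_dupe_groups() {
    # RS = "" makes awk treat each blank-line-separated block as one record,
    # and FS = "\n" makes each line of the block one field.
    awk 'BEGIN { RS = ""; FS = "\n" }
         { groups++; extra += NF - 1 }   # every file after the first is redundant
         END { printf "%d groups, %d redundant copies\n", groups, extra }'
}
```

A typical call would be fdupes -r /home | summarize_dupe_groups.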

Many of the fdupes command's options are listed below. Use the fdupes -h command or read the man page for more details.

-r --recurse     recurse
-R --recurse:    recurse through specified directories
-s --symlinks    follow symlinked directories
-H --hardlinks   treat hard links as duplicates
-n --noempty     ignore empty files
-f --omitfirst   omit the first file in each set of matches
-A --nohidden    ignore hidden files
-1 --sameline    list matches on a single line
-S --size        show size of duplicate files
-m --summarize   summarize duplicate files information
-q --quiet       hide progress indicator
-d --delete      prompt user for files to preserve
-N --noprompt    when used with --delete, preserve the first file in set
-I --immediate   delete duplicates as they are encountered
-p --permissions don't consider files with different owner/group or permission bits as duplicates
-o --order=WORD  order files according to specification
-i --reverse     reverse order while sorting
-v --version     display fdupes version
-h --help        displays help

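The fdupes command is another one that you will probably have to install and work with for a while before you become familiar with its many options.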