Linux deduplication: sort first, then uniq

Date: 2019-12-02

Tags: linux, sort, uniq. Category: Linux


The help text for uniq shows that the command only filters adjacent duplicate lines.

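To remove every duplicate line, sort the input first, or use sort -u. Note that uniq -u is not a substitute: it prints only the lines that have no adjacent duplicate.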

$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines, one for each group
  -D                    print all duplicate lines
      --all-repeated[=METHOD]  like -D, but allow separating groups
                                 with an empty line;
                                 METHOD={none(default),prepend,separate}
  -f, --skip-fields=N   avoid comparing the first N fields
      --group[=METHOD]  show all items, separating groups with an empty line;
                          METHOD={separate(default),prepend,append,both}
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated     line delimiter is NUL, not newline
  -w, --check-chars=N   compare no more than N characters in lines
      --help     display this help and exit
      --version  output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters.  Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.

$ cat tmp.txt
aa
aa
bb
bb
bb
cc
cc
aa
cc
bb
$ cat tmp.txt | uniq
aa
bb
cc
aa
cc
bb
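
To see why the output above still contains repeats, it can help to count each run of adjacent identical lines with uniq -c. The following is a minimal sketch that recreates the sample file in a temporary location (the temporary path and the variable names are assumptions, not part of the original session):

```shell
# Recreate the sample input from the article in a temporary file.
tmp=$(mktemp)
printf '%s\n' aa aa bb bb bb cc cc aa cc bb > "$tmp"

# uniq -c prefixes every run of adjacent identical lines with its length.
# awk only normalizes the padding that uniq puts in front of the count.
runs=$(uniq -c "$tmp" | awk '{print $1, $2}')
printf '%s\n' "$runs"

rm -f "$tmp"
```

Each value appears once per run rather than once per file, which is why aa, bb and cc each show up twice in the plain uniq output.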

Sorting first and then running uniq removes all duplicates:

$ cat tmp.txt | sort | uniq
aa
bb
cc
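
As the note in the help text suggests, sort -u should give the same result as sort | uniq in a single step. A small sketch under the same assumptions as the sample file (temporary path and variable names are illustrative only):

```shell
# Recreate the sample input from the article in a temporary file.
tmp=$(mktemp)
printf '%s\n' aa aa bb bb bb cc cc aa cc bb > "$tmp"

# Two ways to get one copy of every distinct line.
two_step=$(sort "$tmp" | uniq)
one_step=$(sort -u "$tmp")
printf '%s\n' "$one_step"

# After sorting, uniq -c gives the total number of occurrences per value.
totals=$(sort "$tmp" | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$totals"

rm -f "$tmp"
```

sort | uniq remains useful when an option of uniq such as -c is needed; for plain deduplication sort -u is usually enough.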

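uniq -u also yields a short list here, but it is not a true deduplication: it prints only the lines that are not repeated adjacently, which for this file happen to be the last three lines: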

$ cat tmp.txt | uniq -u
aa
cc
bb
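
The difference between -u, -d and plain deduplication becomes clearer on sorted input. This sketch reuses the sample lines and appends one extra line, dd, that occurs only once; the extra line, the temporary path and the variable names are assumptions added for illustration:

```shell
# Sample input from the article plus a hypothetical line "dd" that occurs once.
tmp=$(mktemp)
printf '%s\n' aa aa bb bb bb cc cc aa cc bb dd > "$tmp"

# Lines that occur exactly once in the whole file.
only_once=$(sort "$tmp" | uniq -u)

# One copy of each line that occurs more than once.
repeated=$(sort "$tmp" | uniq -d)

# One copy of every distinct line (the actual deduplication).
distinct=$(sort -u "$tmp")

printf 'only once: %s\n' "$only_once"
printf 'repeated:  %s\n' "$(printf '%s' "$repeated" | tr '\n' ' ')"
printf 'distinct:  %s\n' "$(printf '%s' "$distinct" | tr '\n' ' ')"

rm -f "$tmp"
```

On sorted input, uniq -u drops every value that has a duplicate, so it answers "which lines are unique?" rather than "what are the distinct lines?".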

However, this approach does not always work, as the following example shows:

$ head info.json -n20 | jq .industry | awk -F '"' '{print $2}' | awk '{if (length > 0) print $0}' | uniq | sort # ==> not deduplicated
商務服務業
建築裝飾和其餘建築業
批發業
批發業
批發業
機動車、電子產品和日用產品修理業
研究和試驗發展
紡織服裝、服飾業
計算機、通訊和其餘電子設備製造業
軟件和信息技術服務業
道路運輸業
零售業
$ head info.json -n20 | jq .industry | awk -F '"' '{print $2}' | awk '{if (length > 0) print $0}' | uniq -u | sort  # ==> deduplication incomplete
商務服務業
建築裝飾和其餘建築業
批發業
批發業
機動車、電子產品和日用產品修理業
研究和試驗發展
紡織服裝、服飾業
計算機、通訊和其餘電子設備製造業
軟件和信息技術服務業
道路運輸業
零售業
$ head info.json -n20 | jq .industry | awk -F '"' '{print $2}' | awk '{if (length > 0) print $0}' | sort | uniq  # ==> deduplicated
商務服務業
建築裝飾和其餘建築業
批發業
機動車、電子產品和日用產品修理業
研究和試驗發展
紡織服裝、服飾業
計算機、通訊和其餘電子設備製造業
軟件和信息技術服務業
道路運輸業
零售業
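
The effect of the pipeline order can be reproduced without info.json. The sketch below uses a few made-up values in place of the industry column (the sample data and variable names are hypothetical):

```shell
# Hypothetical stand-in for the extracted industry column.
data=$(printf '%s\n' retail wholesale retail wholesale wholesale)

# uniq before sort: only adjacent repeats are collapsed, then sort
# brings the surviving copies back together.
wrong=$(printf '%s\n' "$data" | uniq | sort)

# sort before uniq: all copies are adjacent when uniq sees them.
right=$(printf '%s\n' "$data" | sort | uniq)

printf 'uniq | sort:\n%s\n' "$wrong"
printf 'sort | uniq:\n%s\n' "$right"
```

In the first pipeline uniq sees the values before they are grouped, so non-adjacent copies pass through and end up next to each other only after sort has run.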
