Thoughts on Diffing Two Collections at the Million-Element Scale

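Recently I ran into this problem in a project: comparing two data sets with millions of records each. There are broadly two approaches. First, sort both collections, extract the required attributes of each object and concatenate them into a string, compute a digest of the concatenated string, and finally compare the two digests. Second, override hashCode and equals on the objects and compare the two collections directly to find the elements that differ. I ran some comparisons of the two approaches, focusing on elapsed time, memory usage, and the number of GC collections.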

1. Code for the two comparison approaches

Digest comparison. SHA-1 is used here to generate the digest.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Generates a digest.
 *
 * @param content the text to digest
 * @return the SHA-1 digest as an upper-case hexadecimal string
 */
public static String getMessageDigest(String content) {
    StringBuilder sb = new StringBuilder();
    try {
        MessageDigest messageDigest = MessageDigest.getInstance("SHA-1");
        messageDigest.update(content.getBytes(StandardCharsets.UTF_8));
        byte[] hash = messageDigest.digest();
        for (int i = 0; i < hash.length; i++) {
            int v = hash[i] & 0xFF; // treat the byte as unsigned
            if (v < 16) {
                sb.append("0"); // left-pad to two hex digits
            }
            sb.append(Integer.toString(v, 16).toUpperCase());
        }
    } catch (NoSuchAlgorithmException e) {
        // Fail loudly rather than return an error message as if it were a digest
        throw new IllegalStateException("Failed to generate digest", e);
    }
    return sb.toString();
}
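The first approach calls for sorting before digesting, and at this scale the concatenated string is itself a large allocation. The following is a minimal sketch, not the author's code, of one way to do both: it sorts a copy and feeds each element to the digest in turn, so no joined string is built. The class name SortedDigest and the method digestOf are hypothetical, and the sketch assumes elements never contain the ',' delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class SortedDigest {

    /** Hypothetical helper: SHA-1 over the sorted elements, fed to the digest one at a time. */
    static String digestOf(Collection<String> items) {
        List<String> sorted = new ArrayList<>(items); // copy so the caller's order is untouched
        Collections.sort(sorted);
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (String item : sorted) {
                md.update(item.getBytes(StandardCharsets.UTF_8));
                md.update((byte) ','); // delimiter; assumes elements contain no comma
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02X", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        List<String> a = List.of("1x", "2x", "3x");
        List<String> b = List.of("3x", "1x", "2x"); // same elements, different order
        List<String> c = List.of("1x", "2x", "4x"); // one element differs
        System.out.println(digestOf(a).equals(digestOf(b))); // prints true
        System.out.println(digestOf(a).equals(digestOf(c))); // prints false
    }
}
```

Because a delimiter follows every element, the value produced here differs from what getMessageDigest returns for a comma-joined string; the two should not be mixed within one comparison.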

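Full comparison. The JDK's built-in bulk collection methods such as removeAll and containsAll could be used directly, but tracing their source shows that on a List they amount to a nested loop (each element triggers a linear contains scan), so the time complexity is O(N²). We therefore use a HashMap as an intermediary, which brings the comparison down to O(N). The code is as follows: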

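import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

/**
 * Returns the elements that differ between two collections.
 *
 * @param collmax the first collection
 * @param collmin the second collection
 * @return the elements that appear in only one of the two collections
 */
@SuppressWarnings({ "rawtypes", "unchecked" })
public static Collection getDiffent(Collection collmax, Collection collmin) {
    // Use a LinkedList so that a large difference does not trigger element copying on growth
    Collection csReturn = new LinkedList();
    Collection max = collmax;
    Collection min = collmin;
    // Compare sizes first; this reduces the number of if checks against the map later
    if (collmax.size() < collmin.size()) {
        max = collmin;
        min = collmax;
    }
    // Pre-size the map to avoid rehashing; the capacity allows for the default 0.75 load factor
    Map<Object, Integer> map = new HashMap<Object, Integer>((int) (max.size() / 0.75f) + 1);
    for (Object object : max) {
        map.put(object, 1); // 1 = seen only in the larger collection so far
    }
    for (Object object : min) {
        if (map.get(object) == null) {
            csReturn.add(object); // present only in the smaller collection
        } else {
            map.put(object, 2); // 2 = present in both
        }
    }
    for (Map.Entry<Object, Integer> entry : map.entrySet()) {
        if (entry.getValue() == 1) {
            csReturn.add(entry.getKey()); // present only in the larger collection
        }
    }
    return csReturn;
}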
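The second approach depends on the compared objects overriding hashCode and equals, which the String-based test data does not show. The following is a rough, type-safe sketch of that idea, not the author's implementation. The Item class, its id and name fields, and the symmetricDifference method are all hypothetical names chosen for illustration. Unlike getDiffent, it collapses duplicates and holds two sets in memory.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

class TypedDiff {

    /** Hypothetical value class: equality is defined by id and name only. */
    static final class Item {
        final long id;
        final String name;

        Item(long id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Item)) return false;
            Item other = (Item) o;
            return id == other.id && Objects.equals(name, other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, name); // must agree with equals
        }

        @Override
        public String toString() {
            return id + ":" + name;
        }
    }

    /** Elements present in exactly one of the two collections (duplicates collapsed). */
    static <T> List<T> symmetricDifference(Collection<? extends T> a, Collection<? extends T> b) {
        Set<T> inA = new HashSet<>(a);
        Set<T> inB = new HashSet<>(b);
        List<T> diff = new ArrayList<>();
        for (T x : inA) {
            if (!inB.contains(x)) diff.add(x); // only in a
        }
        for (T x : inB) {
            if (!inA.contains(x)) diff.add(x); // only in b
        }
        return diff;
    }

    public static void main(String[] args) {
        List<Item> left = List.of(new Item(1, "a"), new Item(2, "b"));
        List<Item> right = List.of(new Item(2, "b"), new Item(3, "c"));
        System.out.println(symmetricDifference(left, right)); // prints [1:a, 3:c]
    }
}
```

Each contains call on a HashSet is expected constant time, so the whole pass stays linear in the combined size, the same order as the HashMap version above.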

2. Verification

The two ways of comparing lists were introduced above; now let's test them.

package com.example.demo;

import com.example.demo.util.CollectionUtil;
import org.springframework.util.StringUtils;

import java.util.*;

/**
 * Description: compares two large collections.
 *
 * @author liuyao
 * @create 2018-12-18 14:14
 */
public class CompareTest {

    public static void main(String[] args) {
        List<String> list1 = createLiet(1000000);
        List<String> list2 = createLiet2(1000000);
        System.out.println("Start comparing---------");

        // Full comparison
        long nowTime2 = System.currentTimeMillis();
        Collection aa = CollectionUtil.getDiffent(list1, list2);
        System.out.println(aa);
        System.out.println("Full comparison took " + (System.currentTimeMillis() - nowTime2));

        // Digest comparison
        long nowTime1 = System.currentTimeMillis();
        String list1Str = CollectionUtil.getMessageDigest(StringUtils.collectionToDelimitedString(list1, ","));
        String list2Str = CollectionUtil.getMessageDigest(StringUtils.collectionToDelimitedString(list2, ","));
        if (list1Str.equals(list2Str)) {
            System.out.println("The two lists are identical");
        }
        System.out.println("Digest comparison took " + (System.currentTimeMillis() - nowTime1));
    }

    // Builds elements 10 .. count+9, so 10 elements differ from createLiet at each end
    private static List<String> createLiet2(int count) {
        Set<String> set = new HashSet<>(count);
        for (int i = 10; i < count + 10; i++) {
            set.add(i + "測試數據abc"); // sample payload ("test data")
        }
        return new ArrayList<>(set);
    }

    // Builds elements 0 .. count-1
    private static List<String> createLiet(int count) {
        Set<String> set = new HashSet<>(count);
        for (int i = 0; i < count; i++) {
            set.add(i + "測試數據abc"); // sample payload ("test data")
        }
        return new ArrayList<>(set);
    }
}

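Because the data volume is large, garbage collection also has a significant effect on the result, so add the JVM flag -XX:+PrintGCDetails to compare GC behaviour.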

The results are as follows:

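As can be seen, the full comparison takes longer than the digest comparison. But look at the GC log: during the full comparison a full GC occurred, taking 683 ms. A full GC pauses the application threads, so the result is heavily skewed. Next we change the list size to 500,000 and run it again.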
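Since a single full GC can dominate one measurement, it helps to record GC activity alongside the elapsed time of each phase instead of reading it off the log afterwards. The following is a minimal sketch of one way to do that with the standard GarbageCollectorMXBean API; the class GcAwareTimer and its methods are hypothetical names, not part of the original test. Collection time as reported by the JVM is approximate and, for concurrent collectors, is not purely pause time. On JDK 9 and later, -Xlog:gc* replaces the deprecated -XX:+PrintGCDetails flag.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcAwareTimer {

    /** Sum of collection counts over all collectors; undefined (-1) values are skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count;
        }
        return total;
    }

    /** Sum of accumulated collection time in milliseconds; undefined (-1) values are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) total += millis;
        }
        return total;
    }

    /** Runs the task once and returns {elapsedMs, gcCountDelta, gcMsDelta}. */
    static long[] measure(Runnable task) {
        long gcCountBefore = totalGcCount();
        long gcMsBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new long[] {
            elapsedMs,
            totalGcCount() - gcCountBefore,
            totalGcMillis() - gcMsBefore
        };
    }

    public static void main(String[] args) {
        // Stand-in workload; in the test above this would wrap one comparison phase
        long[] r = measure(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1_000_000; i++) {
                sb.append(i).append(',');
            }
        });
        System.out.println("elapsed=" + r[0] + "ms, gcCount=" + r[1] + ", gcTime=" + r[2] + "ms");
    }
}
```

Wrapping each comparison in measure shows at a glance whether a slow run coincided with collections, which is the situation described for the 1,000,000-element run.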

To compare several samples, below is the run with 2,000,000 elements:

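From these runs, the full comparison performs better than the digest comparison, and it also yields the specific differences between the two collections. It triggers fewer GCs as well.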