HBase Compaction Algorithms: ExploringCompactionPolicy

Date: 2019-11-07


In version 0.98, the default compaction algorithm was changed to ExploringCompactionPolicy; previously it was RatioBasedCompactionPolicy.

ExploringCompactionPolicy extends RatioBasedCompactionPolicy and overrides the applyCompactionPolicy method, which implements the file-selection strategy for minor compactions.

The body of applyCompactionPolicy:

public List<StoreFile> applyCompactionPolicy(final List<StoreFile> candidates,
       boolean mightBeStuck, boolean mayUseOffPeak, int minFiles, int maxFiles) {
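    // This ratio is used by the algorithm below. A separate ratio (default: 5.0) can be
    // configured for off-peak hours so that more data gets merged.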
    final double currentRatio = mayUseOffPeak
        ? comConf.getCompactionRatioOffPeak() : comConf.getCompactionRatio();

    // Start off choosing nothing.
    List<StoreFile> bestSelection = new ArrayList<StoreFile>(0);
    List<StoreFile> smallest = mightBeStuck ? new ArrayList<StoreFile>(0) : null;
    long bestSize = 0;
    long smallestSize = Long.MAX_VALUE;

    int opts = 0, optsInRatio = 0, bestStart = -1; // for debug logging
    // Consider every starting place.
    for (int start = 0; start < candidates.size(); start++) {
      // Consider every different sub list permutation in between start and end with min files.
      for (int currentEnd = start + minFiles - 1;
          currentEnd < candidates.size(); currentEnd++) {
        List<StoreFile> potentialMatchFiles = candidates.subList(start, currentEnd + 1);

        // Sanity checks
        if (potentialMatchFiles.size() < minFiles) {
          continue;
        }
        if (potentialMatchFiles.size() > maxFiles) {
          continue;
        }

        // Compute the total size of files that will
        // have to be read if this set of files is compacted.
        long size = getTotalStoreSize(potentialMatchFiles);

        // Store the smallest set of files.  This stored set of files will be used
        // if it looks like the algorithm is stuck.
        if (mightBeStuck && size < smallestSize) {
          smallest = potentialMatchFiles;
          smallestSize = size;
        }

        if (size > comConf.getMaxCompactSize()) {
          continue;
        }

        ++opts;
        if (size >= comConf.getMinCompactSize()
            && !filesInRatio(potentialMatchFiles, currentRatio)) {
          continue;
        }

        ++optsInRatio;
        if (isBetterSelection(bestSelection, bestSize, potentialMatchFiles, size, mightBeStuck)) {
          bestSelection = potentialMatchFiles;
          bestSize = size;
          bestStart = start;
        }
      }
    }
    if (bestSelection.size() == 0 && mightBeStuck) {
      LOG.debug("Exploring compaction algorithm has selected " + smallest.size()
          + " files of size "+ smallestSize + " because the store might be stuck");
      return new ArrayList<StoreFile>(smallest);
    }
    LOG.debug("Exploring compaction algorithm has selected " + bestSelection.size()
        + " files of size " + bestSize + " starting at candidate #" + bestStart +
        " after considering " + opts + " permutations with " + optsInRatio + " in ratio");
    return new ArrayList<StoreFile>(bestSelection);
  }

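From the code, the main algorithm is as follows:

1. Traverse the files from start to end and evaluate every contiguous combination that meets the conditions.
2. The number of files in a combination must be >= minFiles (default: 3).
3. The number of files in a combination must be <= maxFiles (default: 10).
4. Compute the total file size of the combination, size; size must be <= maxCompactSize (configured through hbase.hstore.compaction.max.size; the default is Long.MAX_VALUE, so by default it has no effect. According to the official documentation, this value is worth changing only if you feel compactions happen too often without much benefit).
5. If the combination's size is < minCompactSize, it is accepted; if it is >= minCompactSize, it must also pass the filesInRatio check.

6. The filesInRatio check: FileSize(i) <= ( Sum(0,N,FileSize(_)) - FileSize(i) ) * Ratio. In other words, every single file in the combination must satisfy singleFileSize <= (totalFileSize - singleFileSize) * currentRatio. The purpose of this check is to limit overly large compactions: the selected files should not include one very large file, and small files of similar size should be merged first where possible. The code is as follows: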

private boolean filesInRatio(final List<StoreFile> files, final double currentRatio) {
    if (files.size() < 2) {
      return true;
    }

    long totalFileSize = getTotalStoreSize(files);

    for (StoreFile file : files) {
      long singleFileSize = file.getReader().length();
      long sumAllOtherFileSizes = totalFileSize - singleFileSize;

      if (singleFileSize > sumAllOtherFileSizes * currentRatio) {
        return false;
      }
    }
    return true;
  }
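To make the ratio check concrete, here is a minimal stand-alone sketch. It is not HBase code: the class name RatioCheckExample is hypothetical, plain long values stand in for StoreFile sizes, and the sizes themselves are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-alone version of the ratio check. Plain long sizes
// replace StoreFile objects so the example runs without HBase on the classpath.
final class RatioCheckExample {

    // Returns true when no single file is larger than
    // (sum of all the other files) * ratio.
    static boolean filesInRatio(long[] sizes, double ratio) {
        if (sizes.length < 2) {
            return true;
        }
        long total = Arrays.stream(sizes).sum();
        for (long size : sizes) {
            long others = total - size;
            if (size > others * ratio) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 100 > (20 + 10) * 1.2 = 36, so the large file breaks the ratio.
        System.out.println(filesInRatio(new long[] {100, 20, 10}, 1.2)); // false
        // 30 <= (25 + 20) * 1.2 = 54, and the smaller files pass as well.
        System.out.println(filesInRatio(new long[] {30, 25, 20}, 1.2));  // true
        // With a ratio of 5.0: 100 <= 30 * 5.0 = 150, so the first set now passes.
        System.out.println(filesInRatio(new long[] {100, 20, 10}, 5.0)); // true
    }
}
```

The last call shows why a larger ratio admits more data into one compaction: the same three files are rejected at 1.2 and accepted at 5.0.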

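7. Find the optimal solution: prefer the combination with more files; when the file counts are equal, prefer the one with the smaller total size. The goal is to merge as many files as possible while generating as little IO as possible.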

private boolean isBetterSelection(List<StoreFile> bestSelection,
      long bestSize, List<StoreFile> selection, long size, boolean mightBeStuck) {
    if (mightBeStuck && bestSize > 0 && size > 0) {
      // Keep the selection that removes most files for least size. That penaltizes adding
      // large files to compaction, but not small files, so we don't become totally inefficient
      // (might want to tweak that in future). Also, given the current order of looking at
      // permutations, prefer earlier files and smaller selection if the difference is small.
      final double REPLACE_IF_BETTER_BY = 1.05;
      double thresholdQuality = ((double)bestSelection.size() / bestSize) * REPLACE_IF_BETTER_BY;
      return thresholdQuality < ((double)selection.size() / size);
    }
    // Keep if this gets rid of more files.  Or the same number of files for less io.
    return selection.size() > bestSelection.size()
      || (selection.size() == bestSelection.size() && size < bestSize);
  }
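The two comparison modes can be tried out with a small stand-alone sketch. The class BetterSelectionExample is hypothetical and takes file counts and total sizes directly instead of StoreFile lists; the numbers are invented for illustration.

```java
// Hypothetical stand-alone version of the comparison, using file counts and
// total sizes instead of StoreFile lists.
final class BetterSelectionExample {
    static final double REPLACE_IF_BETTER_BY = 1.05;

    static boolean isBetter(int bestCount, long bestSize,
                            int count, long size, boolean mightBeStuck) {
        if (mightBeStuck && bestSize > 0 && size > 0) {
            // Stuck mode: compare "files removed per byte read"; the new
            // selection must beat the current best by more than 5%.
            double thresholdQuality = ((double) bestCount / bestSize) * REPLACE_IF_BETTER_BY;
            return thresholdQuality < ((double) count / size);
        }
        // Normal mode: more files wins; on a tie, less IO wins.
        return count > bestCount || (count == bestCount && size < bestSize);
    }

    public static void main(String[] args) {
        // Normal mode: 4 files beat 3 files even though far more bytes are read.
        System.out.println(isBetter(3, 100, 4, 500, false)); // true
        // Stuck mode: 4/500 = 0.008 is below 3/100 * 1.05 = 0.0315.
        System.out.println(isBetter(3, 100, 4, 500, true));  // false
        // Stuck mode: 3/97 is about 0.0309, less than 5% better, so the current best stays.
        System.out.println(isBetter(3, 100, 3, 97, true));   // false
        // Stuck mode: 3/90 is about 0.0333, which clears the 0.0315 threshold.
        System.out.println(isBetter(3, 100, 3, 90, true));   // true
    }
}
```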

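That concludes the main algorithm. Some remaining details and optimizations follow.

The ratio in step 6 defaults to 1.2, but when the off-peak optimization is enabled it can take a different value; the off-peak ratio defaults to 5.0. The aim of this optimization is to merge more data while business traffic is low. Currently the off-peak window can only be specified as a range of hours within a day, which is not very flexible.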
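How an hour-of-day window could map to the ratio is sketched below. This is an illustration only, not HBase's implementation: the class OffPeakRatioExample, the start-inclusive and end-exclusive convention, and the treatment of an empty window as "disabled" are all assumptions. In the real policy the mayUseOffPeak flag is decided by the caller and may depend on more than the clock.

```java
// Hypothetical sketch of choosing the compaction ratio from an hour-of-day window.
final class OffPeakRatioExample {
    static final double PEAK_RATIO = 1.2;      // default ratio quoted in the article
    static final double OFF_PEAK_RATIO = 5.0;  // default off-peak ratio quoted in the article

    // Assumption: hours are 0-23, startHour is inclusive, endHour is exclusive,
    // and the window may wrap past midnight (for example 22 to 6).
    static boolean isOffPeakHour(int hour, int startHour, int endHour) {
        if (startHour == endHour) {
            return false; // assumption: an empty window means off-peak is disabled
        }
        if (startHour < endHour) {
            return hour >= startHour && hour < endHour;
        }
        return hour >= startHour || hour < endHour;
    }

    static double currentRatio(int hour, int startHour, int endHour) {
        return isOffPeakHour(hour, startHour, endHour) ? OFF_PEAK_RATIO : PEAK_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(currentRatio(23, 22, 6)); // 5.0, inside the night window
        System.out.println(currentRatio(12, 22, 6)); // 1.2, ordinary daytime hour
    }
}
```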

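The algorithm also contains logic around mightBeStuck. This parameter indicates whether compaction might get stuck. It is true when (number of candidate files - number of files currently being compacted + futureFiles) >= hbase.hstore.blockingStoreFiles, where futureFiles defaults to 0 and is 1 when some files are currently being compacted. (hbase.hstore.blockingStoreFiles defaults to 10; this setting is also used by flush and will be covered further in a later analysis of flush.) When it is true:

1. The file-selection algorithm also looks for a smallest solution: before the size check in step 4 above, it records the combination with the smallest total file size.
2. In isBetterSelection, the comparison becomes (bestSelection.size() / bestSize) * 1.05 < selection.size() / size, so a suitable solution is chosen by the ratio of file count to total size.
3. When returning the result, if there is no suitable optimal solution, the smallest solution is returned.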
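The mightBeStuck condition described above can be written out as a small sketch. The class and method names are hypothetical, and the counts are passed in directly rather than derived from a store.

```java
// Hypothetical stand-alone version of the mightBeStuck condition.
final class MightBeStuckExample {

    static boolean mightBeStuck(int candidateFiles, int filesCompacting,
                                int blockingStoreFiles) {
        // A compaction already in progress will eventually produce one new file.
        int futureFiles = filesCompacting == 0 ? 0 : 1;
        return candidateFiles - filesCompacting + futureFiles >= blockingStoreFiles;
    }

    public static void main(String[] args) {
        // 12 - 0 + 0 = 12 >= 10: the store may be stuck.
        System.out.println(mightBeStuck(12, 0, 10)); // true
        // 12 - 4 + 1 = 9 < 10: a running compaction relieves the pressure.
        System.out.println(mightBeStuck(12, 4, 10)); // false
        // 13 - 4 + 1 = 10 >= 10: back at the blocking threshold.
        System.out.println(mightBeStuck(13, 4, 10)); // true
    }
}
```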

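The mightBeStuck optimization guarantees that, even when a store holds a large number of files, a smallest solution can still be picked for compaction, rather than letting the file count keep growing until a suitable combination appears.

The difference between this algorithm and RatioBasedCompactionPolicy, put simply: RatioBasedCompactionPolicy walks the StoreFile list from start to end and, as soon as it meets a sequence that satisfies the ratio condition, selects it for compaction. ExploringCompactionPolicy also walks the list from start to end, but it records the best selection seen so far and ends up choosing the globally best list.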
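The "explore every window, keep the best" idea can be seen end to end in the following simplified sketch. It covers only the normal (not stuck) path. The class ExploringSelectionSketch is hypothetical, long values stand in for StoreFile sizes, and MIN_COMPACT_SIZE = 50 is a made-up threshold chosen so that the ratio check actually applies to the sample data.

```java
import java.util.Arrays;

// Simplified, hypothetical sketch of the exploring selection (non-stuck path only).
final class ExploringSelectionSketch {
    static final int MIN_FILES = 3;                      // default quoted in the article
    static final int MAX_FILES = 10;                     // default quoted in the article
    static final long MIN_COMPACT_SIZE = 50;             // made-up value for the example
    static final long MAX_COMPACT_SIZE = Long.MAX_VALUE; // default quoted in the article

    static boolean filesInRatio(long[] sizes, double ratio) {
        long total = Arrays.stream(sizes).sum();
        for (long size : sizes) {
            if (size > (total - size) * ratio) {
                return false;
            }
        }
        return true;
    }

    // Returns the sizes of the chosen contiguous window, or an empty array.
    static long[] select(long[] candidates, double ratio) {
        long[] best = new long[0];
        long bestSize = 0;
        for (int start = 0; start < candidates.length; start++) {
            for (int end = start + MIN_FILES - 1; end < candidates.length; end++) {
                long[] window = Arrays.copyOfRange(candidates, start, end + 1);
                if (window.length > MAX_FILES) {
                    continue;
                }
                long size = Arrays.stream(window).sum();
                if (size > MAX_COMPACT_SIZE) {
                    continue;
                }
                if (size >= MIN_COMPACT_SIZE && !filesInRatio(window, ratio)) {
                    continue;
                }
                // More files wins; on a tie, the smaller total size wins.
                if (window.length > best.length
                        || (window.length == best.length && size < bestSize)) {
                    best = window;
                    bestSize = size;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long[] candidates = {1000, 40, 30, 20, 25};
        // Every window containing the 1000-unit file fails the ratio check,
        // so the best remaining window is the four small files.
        System.out.println(Arrays.toString(select(candidates, 1.2))); // [40, 30, 20, 25]
    }
}
```

Note that the three-file window {40, 30, 20} is accepted first and then replaced by the four-file window, which is the "record the current best, then pick the global best" behavior described above.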
