Hadoop's native comparator RawComparator public WritableCom...

Date: 2019-11-19


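Hadoop optimizes serialization. Type comparison matters a great deal for MapReduce, because keys are compared against each other during the sort phase. Hadoop provides a native comparator interface, RawComparator<T>, for comparing serialized bytes. An implementation of this interface can compare records directly in the data stream, without deserializing them into objects. RawComparator is a bare optimization interface: it only adds one method for comparing data in its serialized form, and that is where the saving comes from: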

public interface RawComparator<T> extends Comparator<T> {

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}
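To make the (bytes, start, length) convention concrete, here is a minimal stand-alone sketch, not Hadoop code. It assumes each key is a 4-byte big-endian int and that two keys sit back to back in one buffer, roughly as records do in a sort buffer. The class name RawIntCompareSketch is hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-alone sketch; not part of Hadoop.
class RawIntCompareSketch {

    // Compare two 4-byte big-endian ints in place, using the same
    // (bytes, start, length) parameters as RawComparator.compare.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int x = ByteBuffer.wrap(b1, s1, l1).getInt(); // ByteBuffer is big-endian by default
        int y = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(x, y);
    }

    public static void main(String[] args) {
        // Two serialized records stored back to back in one buffer.
        byte[] buf = ByteBuffer.allocate(8).putInt(7).putInt(42).array();
        // Negative result: the record at offset 0 sorts before the one at offset 4.
        System.out.println(compare(buf, 0, 4, buf, 4, 4));
    }
}
```

No key object is created here. The comparison works on the buffer and two offsets, which is the point of the interface.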

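Most classes do not implement this interface directly. Its concrete implementation is WritableComparator, and in most cases a subclass of WritableComparator is nested inside a class that implements Writable, where it compares that class's serialized bytes.

[Figure: class diagram of the nested-class implementations of RawComparator]

First, let us look at WritableComparator, the concrete implementation of RawComparator:

WritableComparator works like a registry that records the set of all registered Comparator classes.

Its comparators member is a hash table holding the registrations, with key = Class and value = WritableComparator.

WritableComparator mainly provides two things:

1. A default implementation of the raw compare() method

The default implementation first deserializes the bytes into objects and then compares the objects, which carries an overhead.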

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {

try {

buffer.reset(b1, s1, l1); // parse key1

key1.readFields(buffer);

buffer.reset(b2, s2, l2); // parse key2

key2.readFields(buffer);

} catch (IOException e) {

throw new RuntimeException(e);

}

return compare(key1, key2); // compare them

}

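The object-level compare() for the basic data types relies on polymorphism. It delegates to the compareTo method of WritableComparable, so each concrete type supplies its own ordering: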

public int compare(WritableComparable a, WritableComparable b) {

return a.compareTo(b);

}

For example, with IntWritable instances this calls the compareTo method defined in IntWritable:

public int compareTo(Object o) {

int thisValue = this.value;

int thatValue = ((IntWritable)o).value;

return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));

}

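2. A factory for RawComparator instances, backed by the registered Writable implementations

For example, to obtain the comparator for IntWritable, you can call WritableComparator's static get method directly.

WritableComparator:

Key code:

Code 1: the registry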

----------------------------------------------------------------

// registry: records the set of registered WritableComparator instances

private static HashMap<Class, WritableComparator> comparators =

new HashMap<Class, WritableComparator>();

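Code 2: obtaining a WritableComparator instance

Note: HashMap is not thread-safe, so access has to be synchronized. The get method returns the WritableComparator registered for key = Class. If the lookup returns null, it calls the protected constructor to build one. The two protected constructors in turn call newKey() to create the key instances via newInstance.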

public static synchronized WritableComparator get(Class<? extends WritableComparable> c) {
    WritableComparator comparator = comparators.get(c);
    if (comparator == null)
      comparator = new WritableComparator(c, true);
    return comparator;
  }

Code 3: the constructor

---------------------------------------------------------------

new WritableComparator(c, true)

The source of the WritableComparator constructor is as follows:

/*

   * keyClass, key1, key2 and buffer are all initialized by the WritableComparator constructor

   */

  private final Class<? extends WritableComparable> keyClass;

  private final WritableComparable key1;  // WritableComparable interface

  private final WritableComparable key2;

  private final DataInputBuffer buffer;   // input buffer stream

protected WritableComparator(Class<? extends WritableComparable> keyClass,

      boolean createInstances) {

    this.keyClass = keyClass;

    if (createInstances) {

      key1 = newKey();

      key2 = newKey();

      buffer = new DataInputBuffer();

    } else {

      key1 = key2 = null;

      buffer = null;

    }

  }

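Of the fields above, keyClass corresponds to the key used in the HashMap, and all four are initialized in the WritableComparator constructor. As the constructor shows, WritableComparator uses the boolean createInstances to decide whether to instantiate key1, key2 and buffer. key1 and key2 are instances of the class that implements WritableComparable, and the constructor creates them through newKey(). The source of newKey() follows. It uses Hadoop's own reflection utility to instantiate a WritableComparable object.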

 public WritableComparable newKey() { return ReflectionUtils.newInstance(keyClass, null); }
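The reflective step can be sketched with plain java.lang.reflect. This is an illustration only: it does not use ReflectionUtils, and NewKeySketch and IntKey are hypothetical names. It assumes the key class has a no-argument constructor.

```java
// Hypothetical stand-alone sketch; ReflectionUtils itself is not used here.
class NewKeySketch {

    // A stand-in key type with a public no-argument constructor.
    static class IntKey {
        int value;
        public IntKey() { }
    }

    // Create a fresh instance of keyClass through its no-argument constructor.
    static <T> T newKey(Class<T> keyClass) {
        try {
            return keyClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        IntKey key1 = newKey(IntKey.class);
        IntKey key2 = newKey(IntKey.class);
        System.out.println(key1 != key2); // two distinct instances, like key1 and key2
    }
}
```

Creating keys this way only works when the class can be constructed without arguments, which is presumably why key classes are expected to offer such a constructor.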

Code 4: the compare() methods

---------------------------------------------------------------------

1. public int compare(Object a, Object b);

2. public int compare(WritableComparable a, WritableComparable b);

3. public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

Of the three compare() overloads, compare(Object a, Object b) casts its arguments to WritableComparable and calls the second overload, which in turn calls WritableComparable.compareTo(). The source of the last one, compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2), is as follows:

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {

    try {

      buffer.reset(b1, s1, l1);                   // parse key1

      key1.readFields(buffer);

     

      buffer.reset(b2, s2, l2);                   // parse key2

      key2.readFields(buffer);

     

    } catch (IOException e) {

      throw new RuntimeException(e);

    }

   

    return compare(key1, key2);                   // compare them

  }

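This is the default implementation of compare: it deserializes the bytes into the objects key1 and key2 and then compares them.

The buffer acts as a bridge. The byte array is loaded into the buffer, then the deserialization method (readFields) of key1, a WritableComparable, is called on it, and the same is done for key2. Finally key1 and key2 are compared. In other words, this compare method deserializes the two binary streams into objects and then calls the second overload to compare them.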
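The deserialize-then-compare path can be imitated with java.io alone. The following is a sketch under stated assumptions, not the Hadoop implementation. IntKey is a hypothetical stand-in for a WritableComparable, and a new ByteArrayInputStream is created per call where Hadoop resets one shared DataInputBuffer.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical stand-alone sketch of "deserialize, then compare".
class DeserializeCompareSketch {

    // Stand-in for a WritableComparable key holding one int.
    static class IntKey implements Comparable<IntKey> {
        int value;

        void readFields(DataInputStream in) throws IOException {
            value = in.readInt();
        }

        @Override
        public int compareTo(IntKey other) {
            return Integer.compare(value, other.value);
        }
    }

    // Reused across calls, like key1 and key2 in WritableComparator.
    private final IntKey key1 = new IntKey();
    private final IntKey key2 = new IntKey();

    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // Parse key1, then key2, from their slices of the byte arrays.
            key1.readFields(new DataInputStream(new ByteArrayInputStream(b1, s1, l1)));
            key2.readFields(new DataInputStream(new ByteArrayInputStream(b2, s2, l2)));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return key1.compareTo(key2); // compare the deserialized objects
    }

    public static void main(String[] args) {
        byte[] a = {0, 0, 0, 1};
        byte[] b = {0, 0, 0, 2};
        System.out.println(new DeserializeCompareSketch().compare(a, 0, 4, b, 0, 4));
    }
}
```

Every comparison here pays for two readFields calls. A type-specific raw comparator avoids that cost.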

Code 5: the define method

This method registers a WritableComparator object in the registry. Note that it has to be synchronized as well. The code is as follows:

public static synchronized void define(Class c,
                                         WritableComparator comparator) {
    comparators.put(c, comparator);
  }
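Taken together, define and get form a small class-keyed registry. The sketch below reproduces that pattern with java.util.Comparator and no Hadoop types. ComparatorRegistrySketch is a hypothetical name, and the fallback to natural ordering is an assumption made for the example, whereas Hadoop's get constructs a generic WritableComparator.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch of a class-keyed comparator registry.
class ComparatorRegistrySketch {

    // HashMap is not thread-safe, so both accessors are synchronized.
    private static final Map<Class<?>, Comparator<?>> comparators = new HashMap<>();

    // Register an optimized comparator for a key class.
    static synchronized <T> void define(Class<T> c, Comparator<? super T> comparator) {
        comparators.put(c, comparator);
    }

    // Return the registered comparator, or fall back to natural ordering.
    @SuppressWarnings("unchecked")
    static synchronized <T extends Comparable<? super T>> Comparator<T> get(Class<T> c) {
        Comparator<T> comparator = (Comparator<T>) comparators.get(c);
        if (comparator == null) {
            comparator = Comparator.naturalOrder();
        }
        return comparator;
    }

    public static void main(String[] args) {
        // Nothing registered yet: natural ordering is used.
        System.out.println(get(Integer.class).compare(1, 2));
        // After registration the specialized comparator is returned instead.
        define(Integer.class, Comparator.<Integer>reverseOrder());
        System.out.println(get(Integer.class).compare(1, 2));
    }
}
```

A key class typically registers its comparator from a static initializer, so the entry exists as soon as the class is loaded.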

Code 6: the remaining static methods such as readInt

---------------------------------------------------------------------

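These methods support the comparators for the various WritableComparable types. Take IntWritable: its nested Comparator class has to override WritableComparator's compare() for the IntWritable type. In other words, the compare() in WritableComparator is only a default, and the real compare() has to be overridden per type, as IntWritable does. The readInt-style methods in WritableComparator are just low-level helpers, wrapped up so that the nested Comparator classes can call them.

Now let us look closely at how the RawComparator<T> nested in the BooleanWritable class is implemented: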
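As an illustration of what such a helper does, here is a stand-alone sketch of a readInt-style decoder and a raw comparison built on it. It assumes ints are serialized as 4 big-endian bytes, as java.io.DataOutput.writeInt does. ReadIntSketch is a hypothetical class, not the Hadoop source.

```java
// Hypothetical stand-alone sketch of readInt-style helpers.
class ReadIntSketch {

    // Decode a 4-byte big-endian int starting at bytes[start].
    static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             | ((bytes[start + 1] & 0xff) << 16)
             | ((bytes[start + 2] & 0xff) << 8)
             | (bytes[start + 3] & 0xff);
    }

    // A type-specific raw comparison built on the helper: no key objects are created.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int thisValue = readInt(b1, s1);
        int thatValue = readInt(b2, s2);
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }

    public static void main(String[] args) {
        byte[] buf = {0, 0, 1, 0, (byte) 0xff, (byte) 0xff, (byte) 0xff, (byte) 0xfe};
        System.out.println(readInt(buf, 0));               // 256
        System.out.println(readInt(buf, 4));               // -2
        System.out.println(compare(buf, 0, 4, buf, 4, 4)); // 1
    }
}
```

Masking each byte with 0xff matters because Java bytes are signed. Without it, any byte of 0x80 or above would be sign-extended and corrupt the result.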

/** 
   * A Comparator optimized for BooleanWritable. 
   */ 
  public static class Comparator extends WritableComparator {
    public Comparator() { // call the parent constructor to set keyClass = BooleanWritable.class
      super(BooleanWritable.class);
    }
    // Override the parent's serialized-bytes comparison, using the helper methods the parent provides
    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      boolean a = (readInt(b1, s1) == 1) ? true : false;
      boolean b = (readInt(b2, s2) == 1) ? true : false;
      return ((a == b) ? 0 : (a == false) ? -1 : 1);
    }
  }
  // register the comparator
  static {
    WritableComparator.define(BooleanWritable.class, new Comparator());
  }

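Summary:

Hadoop mirrors the Java class library: Java's Comparable interface corresponds to WritableComparable, and Java's Comparator corresponds to RawComparator, which compares data in serialized form. Hadoop's io package already wraps the Java primitive types for serialization and deserialization. A class you write yourself should generally implement the WritableComparable interface and nest a Comparator that extends WritableComparator.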
