A First Look at Lucene's NumericRangeQuery

Date: 2020-02-09

Tags: lucene, numericrangequery

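To understand NumericRangeQuery, you first need to understand how Lucene stores numeric values. This article takes a first look at how Lucene stores the Int and Float types, explains how numeric values are tokenized, and ends with a simple reading of NumericRangeQuery.

Lucene was originally designed for full-text search, that is, for handling strings only. So when it has to handle numbers, Lucene encodes them as strings too.

In Lucene-5.2.0 the class that converts numbers to strings is org.apache.lucene.util.NumericUtils.

It encodes values as follows:

Encoding the Int type: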

public static void main(String[] args){
    BytesRefBuilder act = new BytesRefBuilder();
    NumericUtils.intToPrefixCodedBytes(1, 0, act);

    BytesRef ref = act.get();
    System.out.println(ref.length);
}

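As the output shows, NumericUtils encodes an Int into 6 bytes. The first byte is a header: it stores the shift added to a base constant, and the base constant distinguishes whether the original type is Int or Long: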

SHIFT_START_INT  = 0x60;
SHIFT_START_LONG = 0x20;

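The other 5 bytes represent the number itself. An Int is 32 bits, i.e. 4 bytes, so why are 5 bytes needed here?

Consider another question first: how can an Int be encoded as a string while preserving its order? That is, if two integers x and y are encoded as strings a and b, we need Integer.compare(x,y) = String.compare(a,b). This is shorthand for: comparing a and b byte by byte gives the same result (less than, equal, greater than) as comparing x and y numerically.

First, an Int takes values in the range [-2147483648, 2147483647].

As an aside: ignoring the sign bit, the two's complement of -2147483648 is the same as that of 0, because +2147483648 itself overflows 32 bits. Loosely speaking, -2147483648 occupies the slot of "-0". Printing the encodings of 0 and -2147483648 shows that the two differ only in the sign bit: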

public static void intToBytesRef(){

        BytesRefBuilder act1 = new BytesRefBuilder();
        NumericUtils.intToPrefixCodedBytes(Integer.MIN_VALUE, 0, act1);
        BytesRef ref1 = act1.get();
        System.out.println(ref1);

        BytesRefBuilder act2 = new BytesRefBuilder();
        NumericUtils.intToPrefixCodedBytes(0, 0, act2);
        BytesRef ref2 = act2.get();
        System.out.println(ref2.toString());
}

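OK, back to the question of encoding an Int as a string while preserving order. If we look at the positive and the negative numbers separately, we find:

For positive numbers, the two's complement ranges from 0x00 00 00 01 (1) to 0x7f ff ff ff (2147483647). It is ordered, so Integer.compare(x,y) = String.compare(a,b) holds.

For negative numbers, the two's complement ranges from 0x80 00 00 00 (-2147483648) to 0xff ff ff ff (-1). It is ordered too, so Integer.compare(x,y) = String.compare(a,b) holds.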

Python's struct module makes it easy to inspect the two's complement of an integer (Python 3 shown):

>>> from struct import pack
>>> pack('>i',-2147483648)
b'\x80\x00\x00\x00'
>>> pack('>i',0)
b'\x00\x00\x00\x00'

To view the 32-bit binary string directly:

>>> "".join(format(b, '08b') for b in pack('>i', -2))
'11111111111111111111111111111110'

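One problem remains: taken as a whole, the encodings of negative numbers compare greater than those of positive numbers, which violates Integer.compare(x,y) = String.compare(a,b). How is this handled?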

int sortableBits = val ^ 0x80000000;

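An XOR flips the sign bit, so non-negative integers get a leading 1 and negative integers get a leading 0. After this the encoded strings are ordered across the whole range, which is why the variable is named sortableBits.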
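The effect of the sign-bit flip can be checked with a few lines of Python. This is only a sketch: the helper name sortable_bytes is made up for illustration, and it uses plain 4-byte big-endian output rather than Lucene's 7-bit format.

```python
import struct

def sortable_bytes(val):
    """Flip the sign bit of a 32-bit int and return its 4 big-endian bytes."""
    bits = (val & 0xFFFFFFFF) ^ 0x80000000   # unsigned 32-bit view, then flip the sign bit
    return struct.pack('>I', bits)

values = [-2147483648, -2, -1, 0, 1, 2, 2147483647]
encoded = [sortable_bytes(v) for v in values]

# Python compares bytes objects as unsigned bytes, left to right
print(encoded == sorted(encoded))                      # True
# Without the flip, -1 would sort after 1
print(struct.pack('>i', -1) > struct.pack('>i', 1))    # True
```

After the flip, Integer.MIN_VALUE maps to four zero bytes and 0 maps to 0x80 00 00 00, so byte order and numeric order agree.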

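Now back to why an Int is encoded into 5 bytes. Quoting the Lucene documentation: "For that integer values (32 bit or 64 bit) are made unsigned and the bits are converted to ASCII chars with each 7 bit." In other words, every 7 bits form one unit.

This is because Lucene stores Unicode as UTF-8, and in UTF-8 a character whose code point is 0-127 is encoded in a single byte. A 32-bit int can be viewed as five 7-bit integers (the top group carries only the remaining 4 bits), so its UTF-8 encoding takes just 5 bytes.

With that in mind, the code of NumericUtils.intToPrefixCodedBytes() becomes easy to read.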

  public static void intToPrefixCodedBytes(final int val, final int shift, final BytesRefBuilder bytes) {
    // ensure shift is 0..31
    if ((shift & ~0x1f) != 0) {
      throw new IllegalArgumentException("Illegal shift value, must be 0..31; got shift=" + shift);
    }
    int nChars = (((31-shift)*37)>>8) + 1;    // i/7 is the same as (i*37)>>8 for i in 0..63
    bytes.setLength(nChars+1);   // one extra for the byte that contains the shift info
    bytes.grow(NumericUtils.BUF_SIZE_LONG);  // use the max
    bytes.setByteAt(0, (byte)(SHIFT_START_INT + shift));
    int sortableBits = val ^ 0x80000000;
    sortableBits >>>= shift;
    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.setByteAt(nChars--, (byte)(sortableBits & 0x7f));
      sortableBits >>>= 7;
    }
  }

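As for the shift parameter: this is a prefix encoding (PrefixCodedBytes), and shift controls how many low bits are dropped to form the prefix. It is not needed for the discussion here, so we set it aside for now.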
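To experiment with the encoding without a Lucene installation, the Java method can be transcribed into Python. The following is a sketch of that transcription, not Lucene code: the names int_to_prefix_coded and show are made up, and the output format of show only imitates how a BytesRef prints (hex bytes separated by spaces).

```python
SHIFT_START_INT = 0x60  # same constant as in the Java source

def int_to_prefix_coded(val, shift=0):
    """Python transcription of intToPrefixCodedBytes for a 32-bit int."""
    if not 0 <= shift <= 31:
        raise ValueError("shift must be 0..31")
    n_chars = (31 - shift) // 7 + 1             # number of 7-bit groups needed
    out = bytearray(n_chars + 1)                # one extra byte for the header
    out[0] = SHIFT_START_INT + shift            # header: type marker plus shift
    bits = ((val & 0xFFFFFFFF) ^ 0x80000000) >> shift   # flip sign bit, drop low bits
    for i in range(n_chars, 0, -1):             # fill from the last byte backwards
        out[i] = bits & 0x7F
        bits >>= 7
    return bytes(out)

def show(encoded):
    """Format bytes as space-separated hex values in brackets."""
    return "[" + " ".join(format(b, "x") for b in encoded) + "]"

print(show(int_to_prefix_coded(1)))             # [60 8 0 0 0 1]
print(show(int_to_prefix_coded(-2147483648)))   # [60 0 0 0 0 0]
```

Because every payload byte is at most 0x7f, each one is a single-byte UTF-8 character, and comparing the byte strings gives the same order as comparing the integers.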

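Floating-point numbers (Float/Double) are stored according to the IEEE-754 standard. The two formats commonly used are single precision (float) and double precision (double), which take 4 bytes and 8 bytes respectively. Using Float as the example, let us see how a floating-point number is stored. IEEE 754-1985 divides the storage into three parts, from left to right (most significant to least significant bit): the sign, the exponent, and the fraction. The sign takes 1 bit, the exponent 8 bits, and the fraction 23 bits.

Single precision: 1-8-23 (32 bits); double precision: 1-11-52 (64 bits). For example, the single-precision number 5.5 has this binary representation: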

------------------------------------------------
|   0 |1000 0001 |011 0000 0000 0000 0000 0000 |
------------------------------------------------
|Sign | exponent |        fraction             |
------------------------------------------------

Now think in reverse: how does a binary number like the one above turn into 5.5? Here is the formula first:

v = (-1)^s * 2^E * M

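First the sign bit: s=0, so (-1)^0 = 1.

Next the exponent bits. Taken on their own, their value is

>>> int('10000001',2)
129

2^E = 2^(129-127) = 4. Why subtract 127? The exponent field is a biased exponent: the stored value is the true exponent plus 127, so the stored 129 stands for a true exponent of 2. For normalized numbers the true exponent ranges over [-126, 127]; adding the bias of 127 turns negative exponents into positive stored values as well, which makes it easy to compare floating-point numbers by magnitude.

Finally the fraction bits. The 23-bit fraction is handled differently from the exponent. My rule of thumb: read the exponent as a value, read the fraction bit by bit. That is, going through the 23-bit fraction from left to right,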

bit 1: 2^(-1) = 0.5

bit 2: 2^(-2) = 0.25

...

bit 10: 2^(-10) = 0.0009765625

...

bit 23: 2^(-23) = 1.1920928955078125e-07

So for the fraction 011 0000 0000 0000 0000 0000

f = 1*2^(-2) + 1*2^(-3) = 0.375; 
M = f + 1 = 1.375

Putting it together: 5.5 = 1 * 4 * 1.375
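The same calculation can be scripted. This sketch only covers the normalized case worked through above, and the function name decode_float is made up for illustration.

```python
import struct

def decode_float(x):
    """Split a normalized float32 into (s, e, f) and rebuild its value from them."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]   # raw 32-bit pattern
    s = bits >> 31                    # 1 sign bit
    e = (bits >> 23) & 0xFF           # 8 exponent bits, still biased
    f = (bits & 0x7FFFFF) / 2**23     # 23 fraction bits read as a binary fraction
    value = (-1)**s * 2.0**(e - 127) * (1 + f)
    return s, e, f, value

print(decode_float(5.5))   # (0, 129, 0.375, 5.5)
```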

One can in fact show that the fraction stays below 1 and approaches it: the limit of 2^(-1) + 2^(-2) + ... + 2^(-n) is 1.
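Written out, this is a finite geometric sum. The closed form below is the standard one; n = 23 is the fraction width of a Float, and the symbols f_max and M_max are introduced here only for illustration.

```latex
\sum_{k=1}^{n} 2^{-k} = 1 - 2^{-n},
\qquad
f_{\max} = 1 - 2^{-23} \approx 0.99999988,
\qquad
M_{\max} = 1 + f_{\max} = 2 - 2^{-23}
```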

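How M is computed from the fraction depends on the exponent. There are three cases by the value of the exponent field e: e = 0, e in [1,254], and e = 255. Float's exponent has only 8 bits, so its maximum is 255.

e in [1,254] is the normal case and covers more than 99% of floating-point values. These are called normalized values; here M = 1 + f and E = e - 127.

e = 0 is the first special case, called denormalized values; here M = f and E = -126.

e = 255 is the second special case: if all 23 fraction bits are 0 the value is infinity; otherwise it is NaN (Not a Number).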
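The three cases can be put into one small decoder. This is a sketch under the IEEE-754 single-precision rules; the function name float32_from_bits is made up, and the fixed exponent of -126 for the e = 0 case comes from the standard.

```python
import math
import struct

def float32_from_bits(bits):
    """Decode a 32-bit pattern using the three exponent cases."""
    sign = -1.0 if bits >> 31 else 1.0
    e = (bits >> 23) & 0xFF
    f = (bits & 0x7FFFFF) / 2**23
    if e == 255:                              # all exponent bits set
        return sign * math.inf if f == 0 else math.nan
    if e == 0:                                # denormalized: M = f, E = -126
        return sign * 2.0**-126 * f
    return sign * 2.0**(e - 127) * (1 + f)    # normalized: M = 1 + f

print(float32_from_bits(0x40B00000))   # 5.5
print(float32_from_bits(0x7F800000))   # inf
# Cross-check one denormalized pattern against the struct module
print(float32_from_bits(1) == struct.unpack('>f', struct.pack('>I', 1))[0])   # True
```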

To look at more examples and run more experiments, and so build an intuition for this conversion, here are two small Python functions (Python 3 shown): one converts a float to its binary string, the other converts a binary string back to a float. Thanks to stackoverflow for such a neat implementation.

>>> import struct
>>> def float2bin(num):
...   return ''.join(format(b, '08b') for b in struct.pack('!f', num))
... 
>>> 
>>> def bin2float(bits):
...   return struct.unpack('!f', struct.pack('!I', int(bits, 2)))[0]
... 
>>> float2bin(0.1)
'00111101110011001100110011001101'
>>> float2bin(1.0)
'00111111100000000000000000000000'
>>> float2bin(0.5)
'00111111000000000000000000000000'
>>> float2bin(2.0)
'01000000000000000000000000000000'
>>>

Of course, Java can also print the binary string of a Float (note that Integer.toBinaryString drops leading zeros):

System.out.println(Integer.toBinaryString(Float.floatToIntBits(5.5f)));

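After working through a few examples, the binary storage of Float becomes clear.

With Float's storage understood, Lucene's handling of Float is much easier to follow.

First, a simple example of storing and retrieving a floating-point number: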

package learn.learn;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;

public class NumericRangeQueryDemo {
    static Directory d = null;
    public static void index() throws IOException{
        d = FSDirectory.open(Paths.get("indexfile"));
        IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
        IndexWriter iw = new IndexWriter(d, conf);
        Document doc = new Document();
        doc.add(new FloatField("f2", 2.5f, Field.Store.YES));
        iw.addDocument(doc);
        doc = new Document();
        iw.close();

    }

    public static void search() throws IOException{
        d = FSDirectory.open(Paths.get("indexfile"));
        IndexReader r = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(r);

        BytesRefBuilder act = new BytesRefBuilder();
        NumericUtils.intToPrefixCodedBytes(NumericUtils.floatToSortableInt(2.5f), 0, act);

        TopDocs n = searcher.search(new TermQuery(new Term("f2",act.get())), 2);
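        System.out.println(n.totalHits);
        // print the documents that actually matched instead of assuming doc id 0
        for (org.apache.lucene.search.ScoreDoc sd : n.scoreDocs) {
            System.out.println(searcher.doc(sd.doc));
        }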

    }

    public static void main(String[] args) throws IOException {
        index();
        search();
    }
}

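As described earlier, Lucene handles the Int type by converting the int into an order-preserving 6-byte string. A Float is first converted to an int and then handled the same way as an int. The key is NumericUtils.floatToSortableInt(). As an aside: the key to understanding how Lucene handles numbers is understanding the NumericUtils class.

Float data is analyzed the same way as Int data earlier, by splitting positive from negative. If the float is positive, its bits read as an int are positive too, and because the exponent comes before the fraction, as described earlier, the order is preserved. If it is negative, its bits read as an int are negative too, but the order is reversed. For example: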

 float2bin(-1.0) = '10111111100000000000000000000000'
 float2bin(-2.0) = '11000000000000000000000000000000'

-1.0 > -2.0, yet '10111111100000000000000000000000' < '11000000000000000000000000000000'. NumericUtils.floatToSortableInt() therefore applies a correction, using the following bit operation:

  // Lucene-5.2.0
  public static int sortableFloatBits(int bits) {
    return bits ^ (bits >> 31) & 0x7fffffff;
  }

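By operator precedence, this is evaluated as bits ^ ( (bits >> 31) & 0x7fffffff ). Note that the shift is an arithmetic shift: if bits is negative, shifting it right by 31 yields 0xffffffff.

In other words, the sign bit is left unchanged; positive numbers are kept as they are, and negative numbers have their remaining 31 bits flipped. The binary strings of negative numbers are still greater than those of positive numbers at this point, but NumericUtils.intToPrefixCodedBytes() takes care of that afterwards, so in the end Integer.compare(x,y) = String.compare(a,b) holds.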
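A quick way to convince yourself is to transcribe the expression into Python and check that the resulting ints sort like the floats. This is a sketch: the function name float_to_sortable_int mirrors the Java method but is a stand-in written here, and struct is used in place of Float.floatToIntBits.

```python
import struct

def float_to_sortable_int(x):
    """Transcription of bits ^ ((bits >> 31) & 0x7fffffff) on the float32 bit pattern."""
    bits = struct.unpack('>i', struct.pack('>f', x))[0]   # signed 32-bit view of the float
    # Python's >> on a negative int is arithmetic, like Java's >> on an int
    return bits ^ ((bits >> 31) & 0x7FFFFFFF)

floats = [float('-inf'), -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.5, float('inf')]
ints = [float_to_sortable_int(x) for x in floats]

print(ints == sorted(ints))                                           # True
print(float_to_sortable_int(-1.0) > float_to_sortable_int(-2.0))      # True
```

Positive floats pass through unchanged, so float_to_sortable_int(2.5) is simply the raw bit pattern 0x40200000 read as an int.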

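So far, Lucene's handling of the Int and Float types is as follows: 1. A Float is converted to an Int and then handled as an Int. 2. An Int is converted to a sortable int and then, taking 7 bits as one unit, converted to a byte array of length 6.

The goal of this section is to understand how Lucene tokenizes numeric values. Once that is clear, the way Lucene's numeric queries work, NumericRangeQuery for example, is easy to follow.

Lucene tokenizes English text essentially by splitting on whitespace: "show me the code" becomes ["show", "me", "the", "code"]. Numeric tokenization is different. The integer 1, for example, becomes the byte array [60 8 0 0 0 1] (the values are hexadecimal, not decimal).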

// Lucene-5.2.0
public static void main(String[] args) throws IOException {
    BytesRefBuilder bytes = new BytesRefBuilder();
    NumericUtils.intToPrefixCodedBytes(1, 0, bytes);
    System.out.println(bytes.toBytesRef()); // [60 8 0 0 0 1]
}

For [60 8 0 0 0 1], with precisionStep=8 (the default for 32-bit values), the tokens are:

[60 8 0 0 0 1]
[68 4 0 0 0]
[70 2 0 0]
[78 1 0]

The tokenization code is:

public static void main(String[] args) throws IOException {
    final NumericTokenStream stream= new NumericTokenStream(8).setIntValue(1);
    final TermToBytesRefAttribute bytesAtt = stream.getAttribute(TermToBytesRefAttribute.class);
    final TypeAttribute typeAtt = stream.getAttribute(TypeAttribute.class);
    final NumericTokenStream.NumericTermAttribute numericAtt = stream.getAttribute(NumericTokenStream.NumericTermAttribute.class);
    final BytesRef bytes = bytesAtt.getBytesRef();
    stream.reset();
    for (int shift=0; shift<32; shift+=NumericUtils.PRECISION_STEP_DEFAULT_32) {
      stream.incrementToken();
      bytesAtt.fillBytesRef();
      System.out.println(bytesAtt.getBytesRef());
    }
    stream.end();
    stream.close();

}

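Numeric tokenization is really just splitting a value into prefixes. The results above do not look like prefixes in the usual sense because the shift information is included. If you tokenize several numbers at once and compare the sorted results, the meaning of the prefixes becomes clear.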
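The prefix relationship is easier to see when the tokens of a few neighbouring values are printed side by side. The sketch below re-implements the encoding in Python for that purpose; the names int_to_prefix_coded and tokens are made up, and the bracketed hex format only imitates how a BytesRef prints.

```python
def int_to_prefix_coded(val, shift):
    """7-bit prefix encoding of a 32-bit int with header byte 0x60 + shift."""
    n_chars = (31 - shift) // 7 + 1
    bits = ((val & 0xFFFFFFFF) ^ 0x80000000) >> shift     # sortable bits with low bits dropped
    body = [(bits >> (7 * i)) & 0x7F for i in reversed(range(n_chars))]
    return bytes([0x60 + shift] + body)

def tokens(val, precision_step=8):
    """One term per shift = 0, precision_step, 2 * precision_step, ... below 32."""
    return ["[" + " ".join(format(b, "x") for b in int_to_prefix_coded(val, s)) + "]"
            for s in range(0, 32, precision_step)]

for v in (1, 2, 256):
    print(v, tokens(v))
# 1   ['[60 8 0 0 0 1]', '[68 4 0 0 0]', '[70 2 0 0]', '[78 1 0]']
# 2   ['[60 8 0 0 0 2]', '[68 4 0 0 0]', '[70 2 0 0]', '[78 1 0]']
# 256 ['[60 8 0 0 2 0]', '[68 4 0 0 1]', '[70 2 0 0]', '[78 1 0]']
```

Values 1 and 2 differ only in their full-precision term and share every coarser term, while 256 falls into the next block at shift 8.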

The number of bits in a prefix is determined by precisionStep; in NumericUtils.intToPrefixCodedBytes() this corresponds to the shift parameter.

  public static void intToPrefixCodedBytes(final int val, final int shift, final BytesRefBuilder bytes) {
    // ensure shift is 0..31
    if ((shift & ~0x1f) != 0) {
      throw new IllegalArgumentException("Illegal shift value, must be 0..31; got shift=" + shift);
    }
    int nChars = (((31-shift)*37)>>8) + 1;    // i/7 is the same as (i*37)>>8 for i in 0..63
    bytes.setLength(nChars+1);   // one extra for the byte that contains the shift info
    bytes.grow(NumericUtils.BUF_SIZE_LONG);  // use the max
    bytes.setByteAt(0, (byte)(SHIFT_START_INT + shift));
    int sortableBits = val ^ 0x80000000;
    sortableBits >>>= shift;
    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.setByteAt(nChars--, (byte)(sortableBits & 0x7f));
      sortableBits >>>= 7;
    }
  }

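This is the same code that appeared earlier in the discussion of how Lucene handles Int data. Reading it again now, does it not look much clearer?

What makes prefixes so useful? As a data structure technique, prefixes are a classic trade of space for time: extra storage buys very short query times. If you have studied tries, segment trees, or binary indexed trees, Lucene's approach here may be easier to grasp.

(Note: the figure here comes from the blog http://blog.csdn.net/zhufenglonglove/article/details/51700898, with thanks.)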

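Lucene stores an inverted index, that is, term ---> [docid, docid, ... ]. Suppose we need to find the products whose price is in [421, 448], and the prices are A=423, B=445. With a prefix index, the index structure looks like this:

423 --> [A]
445 --> [B]
42  --> [A]
44  --> [B]
4   --> [A,B]

With only these two products, looking up the prefix 4 already returns both. In general the prefix 4 covers all of 400-499, so an exact query for [421, 448] combines the prefix 43 with the individual terms 421-429 and 440-448.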
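The decimal example can be turned into a toy index to see which terms a range query really needs. This is an illustration of the idea only, not of Lucene's implementation: it works on decimal digits instead of bit groups, the function names are made up, and product C=460 is a hypothetical extra added to show a non-match.

```python
from collections import defaultdict

def build_index(prices, levels=4):
    """Index each price under itself and under each shorter decimal prefix."""
    index = defaultdict(set)
    for doc, price in prices.items():
        for level in range(levels):
            index[(price // 10**level, level)].add(doc)   # key: (prefix, digits dropped)
    return index

def split_range(lo, hi):
    """Cover [lo, hi] with (prefix, level) terms, using coarse prefixes where possible."""
    terms, level = [], 0
    while lo <= hi:
        while lo <= hi and lo % 10 != 0:     # unaligned low end: emit single blocks
            terms.append((lo, level))
            lo += 1
        while lo <= hi and hi % 10 != 9:     # unaligned high end: emit single blocks
            terms.append((hi, level))
            hi -= 1
        if lo > hi:
            break
        lo, hi, level = lo // 10, hi // 10, level + 1     # move up to the coarser level
    return terms

def search(index, lo, hi):
    hits = set()
    for term in split_range(lo, hi):
        hits |= index.get(term, set())
    return hits

prices = {"A": 423, "B": 445, "C": 460}      # C is a hypothetical extra product
index = build_index(prices)
print(len(split_range(421, 448)))            # 19 terms instead of 28 single values
print(sorted(search(index, 421, 448)))       # ['A', 'B']
```

The 19 terms are 421-429, 440-448 and the single prefix 43. Looking up only the prefix 4 would also return C, which lies outside the range.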

To get a better feel for Lucene's prefixes, tokenize a series of integers and look at the result. The code is:

    public static void tokenAnalyzer(Set<String> list , int val) throws IOException{

        final NumericTokenStream stream= new NumericTokenStream(8).setIntValue(val);
        final TermToBytesRefAttribute bytesAtt = stream.getAttribute(TermToBytesRefAttribute.class);
        final TypeAttribute typeAtt = stream.getAttribute(TypeAttribute.class);
        final NumericTokenStream.NumericTermAttribute numericAtt = stream.getAttribute(NumericTokenStream.NumericTermAttribute.class);
        final BytesRef bytes = bytesAtt.getBytesRef();
        stream.reset();
        for (int shift=0; shift<32; shift+=NumericUtils.PRECISION_STEP_DEFAULT_32) {
          stream.incrementToken();
          bytesAtt.fillBytesRef();
          list.add(bytesAtt.getBytesRef().toString());

        }
        stream.end();
        stream.close();
    }

    public static void main(String[] args) throws IOException {
        TreeSet<String> list = new TreeSet<String>();
        for(int i=1;i<512;i++){
            tokenAnalyzer(list, i);
        }
        System.out.println("size of list is "+list.size());
        for(String s: list)System.out.println(s);
    }

The output is:

size of list is 515
[60 8 0 0 0 10]
    ...
[60 8 0 0 3 e]
[60 8 0 0 3 f]
[68 4 0 0 0]
[68 4 0 0 1]
[70 2 0 0]
[78 1 0]

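To query the documents in the range [1,255], looking up the single term [68 4 0 0 0] is enough (it covers 0-255, and 0 was not indexed here). With a plain BooleanQuery and no prefix index, 255 TermQuery clauses would have to be combined. The difference in query performance is easy to imagine.

As mentioned, the drawback of prefixes is the space they consume. This can be tuned at index time with the precisionStep parameter: the smaller precisionStep is, the more space is used; the larger it is, the less space is used. Note that in practice a smaller precisionStep does not always mean better query performance. Where the best balance lies depends on the workload and has to be decided case by case.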
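The space side of the trade-off can be estimated directly from the tokenization loop, which emits one term per shift value below 32. The helper below is a small sketch of that count; the name terms_per_value is made up.

```python
import math

def terms_per_value(precision_step, val_size=32):
    """Terms indexed for one value: one per shift in 0, step, 2*step, ... below val_size."""
    return math.ceil(val_size / precision_step)

for step in (4, 8, 16, 32):
    print(step, terms_per_value(step))
# 4 8
# 8 4
# 16 2
# 32 1
```

With precisionStep=8 each int is indexed as 4 terms, which matches the four tokens shown earlier for the value 1.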

For analyzing NumericRangeQuery, NumericUtils.splitRange() is the core.

The sample search code is:

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NumericRangeQueryDemo {
    static Directory d = null;
    public static void index() throws IOException{
        d = FSDirectory.open(Paths.get("indexfile"));
        IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
        IndexWriter iw = new IndexWriter(d, conf);
        Document doc =null;
        for(int i=0;i<512;i++)
        {
            doc = new Document();
             doc.add(new IntField("f2", i, Field.Store.YES));
             iw.addDocument(doc);
        }

        iw.close();

    }

    public static void search() throws IOException{
        d = FSDirectory.open(Paths.get("indexfile"));
        IndexReader r = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(r);

        Query  query = NumericRangeQuery.newIntRange("f2", 0, 255, true, true);
        TopDocs n = searcher.search(query, 2);
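        System.out.println(n.totalHits);
        // print the documents that actually matched instead of assuming doc id 0
        for (org.apache.lucene.search.ScoreDoc sd : n.scoreDocs) {
            System.out.println(searcher.doc(sd.doc));
        }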

    }

    public static void main(String[] args) throws IOException {
        index();
        search();
    }
}

Leaving the details of splitRange() aside for now, let us use what we have learned to predict, for a given [minBound,maxBound], what splitRange produces in NumericRangeQuery.NumericRangeTermsEnum.rangeBounds.

For example:

precisionStep=8, [minBound,maxBound]=[0, 16777215]: rangeBounds=[[78 1 0], [78 1 0]]
precisionStep=8, [minBound,maxBound]=[0, 65535]: rangeBounds=[[70 2 0 0], [70 2 0 0]]
precisionStep=8, [minBound,maxBound]=[0, 255]: rangeBounds=[[68 4 0 0 0], [68 4 0 0 0]]
precisionStep=8, [minBound,maxBound]=[0, 1023]: rangeBounds=[[68 4 0 0 0], [68 4 0 0 3]]
precisionStep=8, [minBound,maxBound]=[0, 511]: rangeBounds=[[68 4 0 0 0], [68 4 0 0 1]]
precisionStep=8, [minBound,maxBound]=[0, 254]: rangeBounds=[[60 8 0 0 0 0], [60 8 0 0 1 7e]]
precisionStep=8, [minBound,maxBound]=[0, 127]: rangeBounds=[[60 8 0 0 0 0], [60 8 0 0 0 7f]]
precisionStep=8, [minBound,maxBound]=[10, 1023]: rangeBounds=[[60 8 0 0 0 a], [60 8 0 0 1 7f], [68 4 0 0 1], [68 4 0 0 3]]

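After a few cases, the logic of splitRange() starts to make sense. Take [minBound,maxBound]=[2, 1024]:

First [2,255] and [1024,1024] are handled, producing [60 8 0 0 0 2], [60 8 0 0 1 7f], [60 8 0 0 8 0], [60 8 0 0 8 0].

Then [256,768] is handled (addRange widens it to [256,1023]), producing [68 4 0 0 1], [68 4 0 0 3]. So the final result of splitRange is [[60 8 0 0 0 2], [60 8 0 0 1 7f], [60 8 0 0 8 0], [60 8 0 0 8 0], [68 4 0 0 1], [68 4 0 0 3]], and the process ends.

The overall strategy: branches and leaves first, trunk last.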
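The predictions above can be reproduced with a Python transcription of the splitting loop. This is a sketch under stated assumptions: it follows the Java splitRange() line by line, folds addRange's bound adjustment into a local helper, returns plain (min, max, shift) tuples instead of encoded terms, and is only meant for non-negative bounds that fit in 32 bits (Python ints do not overflow, so the two "wrapped" checks never fire here).

```python
def split_range(min_bound, max_bound, precision_step=8, val_size=32):
    """Return the sub-ranges as (min, max, shift) tuples, leaves first, trunk last."""
    if precision_step < 1:
        raise ValueError("precisionStep must be >=1")
    ranges = []

    def add_range(lo, hi, shift):
        hi |= (1 << shift) - 1            # widen the upper bound to the end of its block
        ranges.append((lo, hi, shift))

    if min_bound > max_bound:
        return ranges
    shift = 0
    while True:
        diff = 1 << (shift + precision_step)
        mask = ((1 << precision_step) - 1) << shift
        has_lower = (min_bound & mask) != 0
        has_upper = (max_bound & mask) != mask
        next_min = ((min_bound + diff) if has_lower else min_bound) & ~mask
        next_max = ((max_bound - diff) if has_upper else max_bound) & ~mask
        lower_wrapped = next_min < min_bound
        upper_wrapped = next_max > max_bound
        if (shift + precision_step >= val_size or next_min > next_max
                or lower_wrapped or upper_wrapped):
            add_range(min_bound, max_bound, shift)   # lowest precision reached: emit and stop
            break
        if has_lower:
            add_range(min_bound, min_bound | mask, shift)
        if has_upper:
            add_range(max_bound & ~mask, max_bound, shift)
        min_bound, max_bound = next_min, next_max
        shift += precision_step
    return ranges

print(split_range(2, 1024))   # [(2, 255, 0), (1024, 1024, 0), (256, 1023, 8)]
print(split_range(0, 255))    # [(0, 255, 8)]
```

Each tuple corresponds to one pair of entries in rangeBounds: the lower and upper bound encoded at the given shift.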

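From these cases, together with what we learned about NumericTokenStream, we can see that with precisionStep=8 the interval [0,65535] is organized as follows:

                 [0,65535]

[0,255], [256,511], ... , [65024,65279], [65280,65535]

Once the ranges are determined, if they yield many terms, generally more than 16, a bitset is used; otherwise a BooleanQuery is used. The logic is in the rewrite() method of the ConstantScoreWeight object created by MultiTermQueryConstantScoreWrapper.createWeight().

Finally, read the actual implementation to understand the details of the author's code and the role of each variable.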

  /** This helper does the splitting for both 32 and 64 bit. */
  private static void splitRange(
    final Object builder, final int valSize,
    final int precisionStep, long minBound, long maxBound
  ) {
    if (precisionStep < 1)
      throw new IllegalArgumentException("precisionStep must be >=1");
    if (minBound > maxBound) return;
    for (int shift=0; ; shift += precisionStep) {
      // calculate new bounds for inner precision
      /*
       * diff keeps each round within the current precision. With precisionStep=8:
       * diff=2^8
       * diff=2^16
       * diff=2^24
       * i.e. it grows by 8 bits each round
       * */
      final long diff = 1L << (shift+precisionStep),
        /*
         * mask is a bit mask. With precisionStep=8:
         * mask = 0x00000000000000ff
         * mask = 0x000000000000ff00
         * mask = 0x0000000000ff0000
         * */
        mask = ((1L<<precisionStep) - 1L) << shift;
      /*
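       * hasLower/hasUpper tell whether the current bound is unaligned at this precision,
       * i.e. whether it is a branch to emit separately rather than part of the trunk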
       * */
      final boolean
        hasLower = (minBound & mask) != 0L,
        hasUpper = (maxBound & mask) != mask;
      /*
       * nextMinBound/nextMaxBound can be pictured as marking the cut points
       * */
      final long
        nextMinBound = (hasLower ? (minBound + diff) : minBound) & ~mask,
        nextMaxBound = (hasUpper ? (maxBound - diff) : maxBound) & ~mask;
      final boolean
        lowerWrapped = nextMinBound < minBound,
        upperWrapped = nextMaxBound > maxBound;
      /*
       * The logic below is the actual pruning. Note that addRange re-adjusts maxBound.
       * For example, for the range [0,1024] the split ranges seen here are [0,768], [1024,1024];
       * inside addRange, maxBound |= (1L << shift) - 1L; corrects them to
       * [0,1023], [1024,1024]
       * */
      if (shift+precisionStep>=valSize || nextMinBound>nextMaxBound || lowerWrapped || upperWrapped) {
        // We are in the lowest precision or the next precision is not available.
        addRange(builder, valSize, minBound, maxBound, shift);
        // exit the split recursion loop
        break;
      }

      if (hasLower)
        addRange(builder, valSize, minBound, minBound | mask, shift);
      if (hasUpper)
        addRange(builder, valSize, maxBound & ~mask, maxBound, shift);

      // recurse to next precision
      minBound = nextMinBound;
      maxBound = nextMaxBound;
    }
  }

References:

http://blog.csdn.net/zhufenglonglove/article/details/51700898

http://blog.csdn.net/debiann/article/details/23012699

http://brokendreams.iteye.com/blog/2256239