1. Preparation
1.1 Download the latest source code: https://github.com/apache/lucene-solr
1.2 Build it following the instructions, using ant (I used `ant eclipse`).
1.3 Import the built project into Eclipse, STS, or IDEA.
2. Create a test class
```java
public void test() throws IOException, ParseException {
    Analyzer analyzer = new NGramAnalyzer();

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Path path = FileSystems.getDefault().getPath("E:\\demo\\data", "access.data");
    //Directory directory = FSDirectory.open(path);
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "我是中國人.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    isearcher.setSimilarity(new BM25Similarity());
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("中國,人");
    ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
        Document hitDoc = isearcher.doc(hits[i].doc);
        System.out.println(hitDoc.getFields().toString());
    }
    ireader.close();
    directory.close();
}

private static class NGramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer tokenizer = new KeywordTokenizer();
        return new TokenStreamComponents(tokenizer, new NGramTokenFilter(tokenizer, 1, 4, true));
    }
}
```
Tokenization here uses a custom NGramAnalyzer, which extends Analyzer. An Analyzer analyzes text and converts it into a TokenStream. Its Javadoc describes it as follows:
```java
/**
 * An Analyzer builds TokenStreams, which analyze text. It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * In order to define what analysis is done, subclasses must define their
 * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
 * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
 * <p>
 * Simple example:
 * <pre class="prettyprint">
 * Analyzer analyzer = new Analyzer() {
 *   {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     Tokenizer source = new FooTokenizer(reader);
 *     TokenStream filter = new FooFilter(source);
 *     filter = new BarFilter(filter);
 *     return new TokenStreamComponents(source, filter);
 *   }
 *   {@literal @Override}
 *   protected TokenStream normalize(TokenStream in) {
 *     // Assuming FooFilter is about normalization and BarFilter is about
 *     // stemming, only FooFilter should be applied
 *     return new FooFilter(in);
 *   }
 * };
 * </pre>
 * For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
 * <p>
 * For some concrete implementations bundled with Lucene, look in the analysis modules:
 * <ul>
 *   <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
 *       Analyzers for indexing content in different languages and domains.
 *   <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
 *       Exposes functionality from ICU to Apache Lucene.
 *   <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
 *       Morphological analyzer for Japanese text.
 *   <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
 *       Dictionary-driven lemmatization for the Polish language.
 *   <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
 *       Analysis for indexing phonetic signatures (for sounds-alike search).
 *   <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
 *       Analyzer for Simplified Chinese, which indexes words.
 *   <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
 *       Algorithmic Stemmer for the Polish Language.
 * </ul>
 *
 * @since 3.1
 */
```
ClassicSimilarity is a wrapper around TFIDFSimilarity; because TFIDFSimilarity is an abstract class, it cannot be instantiated directly with `new`. This algorithm was Lucene's default scoring implementation in earlier versions.
Put the test class into the lucene-solr source tree and debug it. If you want to study the TF-IDF algorithm instead, you can simply `new` a ClassicSimilarity and set it on the IndexSearcher; the other similarities work the same way.
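For example, a minimal sketch of switching the searcher over to the classic TF-IDF scoring, reusing the `isearcher` from the test above, so you can set a breakpoint inside ClassicSimilarity and step through it:

```java
import org.apache.lucene.search.similarities.ClassicSimilarity;

// Replace the default BM25 scoring with the classic TF-IDF scoring for this searcher.
isearcher.setSimilarity(new ClassicSimilarity());
```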
3. Algorithm overview
Newer versions of Lucene use BM25Similarity as the default scoring implementation, and the test above sets it explicitly. The full algorithm is described in reference [1]; here is a brief introduction:
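The Okapi BM25 scoring function, in the form that matches the Lucene code walked through below, is:

$$
\mathrm{score}(D,Q)=\sum_{i=1}^{n}\mathrm{IDF}(q_i)\cdot\frac{f(q_i,D)\cdot(k_1+1)}{f(q_i,D)+k_1\cdot\left(1-b+b\cdot\frac{|D|}{avgdl}\right)}
$$

with the IDF component

$$
\mathrm{IDF}(q_i)=\ln\left(1+\frac{N-n(q_i)+0.5}{n(q_i)+0.5}\right)
$$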
where:
D is the document, Q is the query, and score(D,Q) is the score of document D for query Q.
IDF is the inverse document frequency, a measure of how rare a term is across the whole collection; qi is the i-th term obtained by tokenizing Q.
![](https://s4.51cto.com/images/blog/202011/29/c2041d2f7a39e25d7f18abe98f4b48af.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_100,g_se,x_10,y_10,shadow_90,type_ZmFuZ3poZW5naGVpdGk=) Here N is the total number of documents and n(qi) is the number of documents that contain the term qi.
f(qi,D) is the frequency of term qi in document D.
k1 and b are tunable parameters, with default values 1.2 and 0.75.
|D| is the length of the document in words, and avgdl is the average document length across the collection.
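As a quick worked example with hypothetical numbers (IDF(qi) = 2.0, f(qi,D) = 3, |D| = 100, avgdl = 50, and the default k1 = 1.2, b = 0.75), the contribution of qi to the score is:

$$
2.0\cdot\frac{3\cdot(1.2+1)}{3+1.2\cdot\left(1-0.75+0.75\cdot\frac{100}{50}\right)}=2.0\cdot\frac{6.6}{5.1}\approx 2.59
$$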
4. Implementation
1. IDF implementation
The single-term IDF implementation:
```java
/** Implemented as <code>log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5))</code>. */
protected float idf(long docFreq, long docCount) {
    return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
}
```
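As a standalone sanity check of this formula (not Lucene code; the numbers are arbitrary), with docCount = 100 and docFreq = 10 the idf comes out to about 2.26:

```java
public class IdfCheck {
    public static void main(String[] args) {
        long docFreq = 10;   // documents containing the term
        long docCount = 100; // total documents for the field
        double idf = Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
        System.out.println(idf); // prints roughly 2.2638
    }
}
```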
The IDF implementation for a set of terms:
```java
@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1
        ? idfExplain(collectionStats, termStats[0])
        : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
        oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
        cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}

/**
 * Computes a score factor for a phrase.
 *
 * <p>
 * The default implementation sums the idf factor for
 * each term in the phrase.
 *
 * @param collectionStats collection-level statistics
 * @param termStats term-level statistics for the terms in the phrase
 * @return an Explain object that includes both an idf
 *         score factor for the phrase and an explanation
 *         for each term.
 */
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats[]) {
    double idf = 0d; // sum into a double before casting into a float
    List<Explanation> details = new ArrayList<>();
    for (final TermStatistics stat : termStats) {
        Explanation idfExplain = idfExplain(collectionStats, stat);
        details.add(idfExplain);
        idf += idfExplain.getValue();
    }
    return Explanation.match((float) idf, "idf(), sum of:", details);
}
```
2. The k1 and b parameters
```java
public BM25Similarity(float k1, float b) {
    if (Float.isFinite(k1) == false || k1 < 0) {
        throw new IllegalArgumentException("illegal k1 value: " + k1 + ", must be a non-negative finite value");
    }
    if (Float.isNaN(b) || b < 0 || b > 1) {
        throw new IllegalArgumentException("illegal b value: " + b + ", must be between 0 and 1");
    }
    this.k1 = k1;
    this.b = b;
}

/** BM25 with these default values:
 * <ul>
 *   <li>{@code k1 = 1.2}</li>
 *   <li>{@code b = 0.75}</li>
 * </ul>
 */
public BM25Similarity() {
    this(1.2f, 0.75f);
}
```
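A minimal sketch of passing non-default values (the numbers here are purely illustrative): lowering b weakens document-length normalization, while raising k1 lets term frequency keep contributing longer before it saturates. This reuses the `ireader` from the test above:

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;

// Tuned BM25: k1 = 1.5 (slower TF saturation), b = 0.3 (weaker length normalization).
IndexSearcher searcher = new IndexSearcher(ireader);
searcher.setSimilarity(new BM25Similarity(1.5f, 0.3f));
```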
3. Computing the average document length avgdl
```java
/** The default implementation computes the average as <code>sumTotalTermFreq / docCount</code> */
protected float avgFieldLength(CollectionStatistics collectionStats) {
    final long sumTotalTermFreq;
    if (collectionStats.sumTotalTermFreq() == -1) {
        // frequencies are omitted (tf=1), its # of postings
        if (collectionStats.sumDocFreq() == -1) {
            // theoretical case only: remove!
            return 1f;
        }
        sumTotalTermFreq = collectionStats.sumDocFreq();
    } else {
        sumTotalTermFreq = collectionStats.sumTotalTermFreq();
    }
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    return (float) (sumTotalTermFreq / (double) docCount);
}
```
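For example, with hypothetical collection statistics of sumTotalTermFreq = 10,000 tokens in the field spread over docCount = 200 documents:

$$
avgdl=\frac{\text{sumTotalTermFreq}}{\text{docCount}}=\frac{10000}{200}=50
$$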
4. Computing the Weight parameter
```java
/** Cache of decoded bytes. */
private static final float[] OLD_LENGTH_TABLE = new float[256];
private static final float[] LENGTH_TABLE = new float[256];

static {
    for (int i = 1; i < 256; i++) {
        float f = SmallFloat.byte315ToFloat((byte) i);
        OLD_LENGTH_TABLE[i] = 1.0f / (f * f);
    }
    OLD_LENGTH_TABLE[0] = 1.0f / OLD_LENGTH_TABLE[255]; // otherwise inf

    for (int i = 0; i < 256; i++) {
        LENGTH_TABLE[i] = SmallFloat.byte4ToInt((byte) i);
    }
}

@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1
        ? idfExplain(collectionStats, termStats[0])
        : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
        oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
        cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}
```
This corresponds to precomputing the `k1 * ((1 - b) + b * |D| / avgdl)` factor from the denominator of the formula, for each of the 256 possible encoded document lengths.
5. Computing the weightValue
```java
BM25Stats(String field, float boost, Explanation idf, float avgdl, float[] oldCache, float[] cache) {
    this.field = field;
    this.boost = boost;
    this.idf = idf;
    this.avgdl = avgdl;
    this.weight = idf.getValue() * boost;
    this.oldCache = oldCache;
    this.cache = cache;
}

BM25DocScorer(BM25Stats stats, int indexCreatedVersionMajor, NumericDocValues norms) throws IOException {
    this.stats = stats;
    this.weightValue = stats.weight * (k1 + 1);
    this.norms = norms;
    if (indexCreatedVersionMajor >= 7) {
        lengthCache = LENGTH_TABLE;
        cache = stats.cache;
    } else {
        lengthCache = OLD_LENGTH_TABLE;
        cache = stats.oldCache;
    }
}
```
This corresponds to multiplying the document-independent factors of the formula: `weight = IDF(qi) * boost`, and `weightValue = weight * (k1 + 1) = IDF(qi) * boost * (k1 + 1)`.
6. Computing the overall score
```java
@Override
public float score(int doc, float freq) throws IOException {
    // if there are no norms, we act as if b=0
    float norm;
    if (norms == null) {
        norm = k1;
    } else {
        if (norms.advanceExact(doc)) {
            norm = cache[((byte) norms.longValue()) & 0xFF];
        } else {
            norm = cache[0];
        }
    }
    return weightValue * freq / (freq + norm);
}
```
Here `norm` is looked up from the cache, which holds the `k1 * ((1 - b) + b * |D| / avgdl)` values precomputed in step 4, indexed by the document's encoded length.
With that, the pieces can be assembled back into the complete formula.
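Substituting `weightValue = IDF(qi) * boost * (k1 + 1)` and `norm = k1 * ((1 - b) + b * |D| / avgdl)` into `score = weightValue * freq / (freq + norm)` recovers the per-term BM25 formula from section 3 (with boost = 1 by default):

$$
\mathrm{score}=\mathrm{IDF}(q_i)\cdot\frac{f(q_i,D)\cdot(k_1+1)}{f(q_i,D)+k_1\cdot\left(1-b+b\cdot\frac{|D|}{avgdl}\right)}
$$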
7. Going deeper
The inputs to scoring come from CollectionStatistics, TermStatistics, and freq. Where do those come from?
```java
SynonymWeight(Query query, IndexSearcher searcher, float boost) throws IOException {
    super(query);
    CollectionStatistics collectionStats = searcher.collectionStatistics(terms[0].field()); // 1
    long docFreq = 0;
    long totalTermFreq = 0;
    termContexts = new TermContext[terms.length];
    for (int i = 0; i < termContexts.length; i++) {
        termContexts[i] = TermContext.build(searcher.getTopReaderContext(), terms[i]);
        TermStatistics termStats = searcher.termStatistics(terms[i], termContexts[i]); // 2
        docFreq = Math.max(termStats.docFreq(), docFreq);
        if (termStats.totalTermFreq() == -1) {
            totalTermFreq = -1;
        } else if (totalTermFreq != -1) {
            totalTermFreq += termStats.totalTermFreq();
        }
    }
    TermStatistics[] statics = new TermStatistics[terms.length];
    for (int i = 0; i < terms.length; i++) {
        TermStatistics pseudoStats = new TermStatistics(terms[i].bytes(), docFreq, totalTermFreq, query.getKeyword());
        statics[i] = pseudoStats;
    }
    this.similarity = searcher.getSimilarity(true);
    this.simWeight = similarity.computeWeight(boost, collectionStats, statics);
}
```
Where CollectionStatistics comes from:
```java
/**
 * Returns {@link CollectionStatistics} for a field.
 *
 * This can be overridden for example, to return a field's statistics
 * across a distributed collection.
 * @lucene.experimental
 */
public CollectionStatistics collectionStatistics(String field) throws IOException {
    final int docCount;
    final long sumTotalTermFreq;
    final long sumDocFreq;

    assert field != null;

    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) {
        docCount = 0;
        sumTotalTermFreq = 0;
        sumDocFreq = 0;
    } else {
        docCount = terms.getDocCount();
        sumTotalTermFreq = terms.getSumTotalTermFreq();
        sumDocFreq = terms.getSumDocFreq();
    }
    return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
}
```
Where TermStatistics comes from:
```java
/**
 * Returns {@link TermStatistics} for a term.
 *
 * This can be overridden for example, to return a term's statistics
 * across a distributed collection.
 * @lucene.experimental
 */
public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
    return new TermStatistics(term.bytes(), context.docFreq(), context.totalTermFreq(), term.text());
}
```
Where freq (tf) comes from:
```java
@Override
protected float score(DisiWrapper topList) throws IOException {
    return similarity.score(topList.doc, tf(topList));
}

/** combines TF of all subs. */
final int tf(DisiWrapper topList) throws IOException {
    int tf = 0;
    for (DisiWrapper w = topList; w != null; w = w.next) {
        tf += ((TermScorer) w.scorer).freq();
    }
    return tf;
}
```
The low-level implementation:
Lucene50PostingsReader.BlockPostingsEnum
```java
@Override
public int nextDoc() throws IOException {
    if (docUpto == docFreq) {
        return doc = NO_MORE_DOCS;
    }
    if (docBufferUpto == BLOCK_SIZE) {
        refillDocs();
    }

    accum += docDeltaBuffer[docBufferUpto];
    freq = freqBuffer[docBufferUpto];
    posPendingCount += freq;
    docBufferUpto++;
    docUpto++;

    doc = accum;
    position = 0;
    return doc;
}
```
8. Summary
The full name of the BM25 algorithm is Okapi BM25. It is an extension of the binary independence model and can also be used for relevance ranking in search. This article walked through Lucene's BM25Similarity implementation to build a deeper understanding of the whole scoring formula.
On top of that, it analyzed how the CollectionStatistics, TermStatistics, and freq inputs are computed.
Following this analysis, to customize our own scoring formula we only need to extend Similarity or SimilarityBase and implement the business-specific scoring logic.
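As a minimal sketch (written against the 7.x-era SimilarityBase API used throughout this article; the score method signature differs in later versions), a hypothetical similarity that scores with plain TF·IDF and no length normalization might look like this:

```java
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

public class SimpleTfIdfSimilarity extends SimilarityBase {

    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
        // idf = ln(1 + N / df); BasicStats carries the collection-level counts.
        double idf = Math.log(1.0 + (double) stats.getNumberOfDocuments()
                / Math.max(1, stats.getDocFreq()));
        // Plain tf * idf, ignoring the document length entirely.
        return (float) (stats.getBoost() * freq * idf);
    }

    @Override
    public String toString() {
        return "SimpleTfIdfSimilarity";
    }
}
```

It can then be plugged into the searcher in the same way as BM25Similarity above: `isearcher.setSimilarity(new SimpleTfIdfSimilarity());`.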
References
[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] https://www.elastic.co/cn/blog/found-bm-vs-lucene-default-similarity
[3] http://www.blogjava.net/hoojo/archive/2012/09/06/387140.html