Basic Concepts
When we run a search on Google, each result includes a "Similar pages" link; clicking that link issues another search request, which finds documents similar to the original result.
Solr's similarity query, MoreLikeThis, implements exactly this feature.
The documentation explains it like this:
Generate "more like this" similarity queries.
Based on this mail:
Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the docFreq() of every term in the document is
probably too slow.
You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.
It all depends on what you're trying to do. If you're trying to eek out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you're trying to provide a "more like this" button on a search results page that does a decent job and has good performance, such techniques might be useful. An efficient, effective "more-like-this" query generator would be a great contribution, if anyone's interested. I'd imagine that it would take a Reader or a String (the document's
text), analyzer Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.
Doug
複製代碼
The key point (core idea) of the above:
The MoreLikeThis built into Lucene 5 computes similarity by scoring: candidate terms are placed into a priority queue according to their final score, so the highest-scoring terms naturally sit at the top of the queue.
The official documentation explains:
There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link).
The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results.
The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
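For the most common mode, a request-handler query is just an HTTP call. As an illustrative sketch only (the core name techproducts, the /mlt handler path, and the document id are all assumptions here; the handler must be registered in your solrconfig.xml):

```
http://localhost:8983/solr/techproducts/mlt?q=id:SP2514N&mlt.fl=manu,cat&mlt.mintf=1&mlt.mindf=1
```

mlt.fl lists the source fields for similarity, and mlt.mintf / mlt.mindf correspond to the minimum term and document frequencies discussed below.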
The official documentation further explains:
MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields ( see the mlt.fl parameter, below). For best results, the fields should have stored term vectors in schema.xml. For example:
```xml
<field name="cat" ... termVectors="true" />
```
If term vectors are not stored, MoreLikeThis will generate terms from stored fields. A uniqueKey must also be stored in order for MoreLikeThis to work properly.
The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms, and any other query parameters that have been defined (see the mlt.qf parameter, below) and a new document set is returned.
In other words: MoreLikeThis constructs a Lucene Query from the terms of an indexed document. These terms are pulled from a defined list of fields, specified via the mlt.fl parameter. For best results, those fields should store term vector information, i.e. the field's termVectors attribute should be set to true.
If a field's term vectors are not stored, MoreLikeThis will generate terms from the stored fields (fields with stored="true") instead. For MoreLikeThis to work properly, the uniqueKey must also be stored.
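As a concrete illustration (the field names and types are hypothetical; adapt them to your own schema.xml), that amounts to something like:

```xml
<field name="id"  type="string"       indexed="true" stored="true" />
<field name="cat" type="text_general" indexed="true" stored="true" termVectors="true" />
<uniqueKey>id</uniqueKey>
```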
OK, so we still don't know exactly how MoreLikeThis is implemented, but the descriptions above already give us the key point: build a query from the document's most characteristic terms.
Now let's see how the source code implements this. The key fields of MoreLikeThis are:
| Field | Description |
|---|---|
| private Analyzer analyzer | the analyzer used to tokenize text |
| private int minTermFreq | minimum term frequency (default 2) |
| private int minDocFreq | minimum document frequency (default 5) |
| private int maxDocFreq | maximum document frequency (default 2147483647, i.e. Integer.MAX_VALUE) |
| private int maxQueryTerms | maximum number of query terms (default 25) |
| private TFIDFSimilarity similarity | used to compute relevance (tf*idf) |
To interoperate well with Lucene, MoreLikeThis extends it: it exposes a method that returns a Query object, i.e. a regular Lucene query. Searching with that Query yields the similar results, so MoreLikeThis plugs into Lucene seamlessly; Solr itself is a good example of this. We'll walk through the public Query like(int docNum) method to explain how the similarity query works:
```java
public Query like(int docNum) throws IOException {
  if (this.fieldNames == null) {
    // no field list configured: default to every indexed field
    Collection<String> fields = MultiFields.getIndexedFields(this.ir);
    this.fieldNames = fields.toArray(new String[fields.size()]);
  }
  return this.createQuery(this.retrieveTerms(docNum));
}
```
The docNum parameter is the Lucene document id of the search result for which you want to find similar results. fieldNames can be understood as the fields we selected: the document's values in those fields are what get analyzed for similarity. As the code shows, the field list is optional; if it is not set, every indexed field is used.
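Before digging into the internals, here is a minimal usage sketch (assuming a Lucene 5.x index under /path/to/index with title and body fields; the doc id 42 is just a placeholder):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MltDemo {
  public static void main(String[] args) throws Exception {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
    IndexSearcher searcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setAnalyzer(new StandardAnalyzer());           // needed when term vectors are absent
    mlt.setFieldNames(new String[] {"title", "body"}); // fields to mine for terms
    mlt.setMinTermFreq(2);                             // thresholds from the table above
    mlt.setMinDocFreq(5);

    Query query = mlt.like(42);                        // 42 = docNum of the seed document
    TopDocs hits = searcher.search(query, 10);
    System.out.println("similar docs found: " + hits.totalHits);
    reader.close();
  }
}
```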
The implementation logic of like() is as follows:
termVector means term vector. A term vector is essentially a mathematical model built from how often each term occurs in a document (tf) and how many documents contain the term (df); the similarity of two documents can then be judged from the angle between their term vectors.
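To make the vector intuition concrete, here is a small self-contained sketch (not Lucene code, just the underlying math) computing the cosine of the angle between two term-frequency vectors:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineDemo {
  // cosine similarity between two term-frequency maps: dot(a,b) / (|a| * |b|)
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
      normA += e.getValue() * e.getValue();
    }
    for (int v : b.values()) {
      normB += v * v;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    Map<String, Integer> doc1 = new HashMap<>();
    doc1.put("solr", 3); doc1.put("lucene", 2); doc1.put("query", 1);
    Map<String, Integer> doc2 = new HashMap<>();
    doc2.put("solr", 1); doc2.put("lucene", 4); doc2.put("index", 2);
    System.out.println(cosine(doc1, doc2)); // closer to 1.0 = more similar
  }
}
```

Note that MoreLikeThis does not literally compute this angle; as the code below shows, it extracts the high tf*idf terms and issues them as a Boolean query, which approximates the same intuition.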
```java
/**
 * Find words for a more-like-this query former.
 *
 * @param docNum the id of the lucene document from which to find terms
 */
private PriorityQueue<ScoreTerm> retrieveTerms(int docNum) throws IOException {
  Map<String, Int> termFreqMap = new HashMap<>();
  for (String fieldName : fieldNames) {
    final Fields vectors = ir.getTermVectors(docNum);
    final Terms vector;
    if (vectors != null) {
      vector = vectors.terms(fieldName);
    } else {
      vector = null;
    }
    if (vector == null) {
      // This field does not store term vector info, so recompute it here: read
      // the stored value, tokenize it, and count term frequencies. Note that
      // this path needs an analyzer (by default the StandardAnalyzer!).
      Document d = ir.document(docNum);
      IndexableField[] fields = d.getFields(fieldName);
      for (IndexableField field : fields) {
        final String stringValue = field.stringValue();
        if (stringValue != null) {
          addTermFrequencies(new StringReader(stringValue), termFreqMap, fieldName);
        }
      }
    } else {
      // the term vector was saved at index time, so just add it directly
      addTermFrequencies(termFreqMap, vector);
    }
  }
  return createQueue(termFreqMap);
}
```
Note that in termFreqMap a term is not tied to any field: whether it occurs in the title or in the body, occurrences of the same term text are simply accumulated. addTermFrequencies does exactly that, storing the accumulated counts in termFreqMap.
```java
private void addTermFrequencies(Reader r, Map<String, MoreLikeThis.Int> termFreqMap, String fieldName) throws IOException {
  if (this.analyzer == null) {
    throw new UnsupportedOperationException("To use MoreLikeThis without term vectors, you must provide an Analyzer");
  }
  TokenStream ts = this.analyzer.tokenStream(fieldName, r);
  try {
    int tokenCount = 0;
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      String word = termAtt.toString();
      // stop once we have consumed the maximum number of tokens
      if (++tokenCount > this.maxNumTokensParsed) {
        break;
      }
      // skip words rejected by the noise rules (length limits, stop words)
      if (this.isNoiseWord(word)) {
        continue;
      }
      // increment the term's frequency
      MoreLikeThis.Int cnt = termFreqMap.get(word);
      if (cnt == null) {
        termFreqMap.put(word, new MoreLikeThis.Int());
      } else {
        ++cnt.x;
      }
    }
    ts.end();
  } finally {
    IOUtils.closeWhileHandlingException(ts);
  }
}
```
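The other overload, addTermFrequencies(Map, Terms), walks the stored term vector instead of re-analyzing text. A sketch of its shape (paraphrased from the Lucene 5 source, not copied verbatim):

```java
private void addTermFrequencies(Map<String, MoreLikeThis.Int> termFreqMap, Terms vector) throws IOException {
  TermsEnum termsEnum = vector.iterator();
  BytesRef text;
  while ((text = termsEnum.next()) != null) {
    String term = text.utf8ToString();
    if (isNoiseWord(term)) {
      continue;  // apply the same noise filtering as the analyzer path
    }
    // within a single document's term vector, totalTermFreq is that term's tf
    int freq = (int) termsEnum.totalTermFreq();
    MoreLikeThis.Int cnt = termFreqMap.get(term);
    if (cnt == null) {
      cnt = new MoreLikeThis.Int();
      cnt.x = freq;
      termFreqMap.put(term, cnt);
    } else {
      cnt.x += freq;
    }
  }
}
```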
上面操做中,咱們還須要進行降噪操做 降噪操做有幾個原則:
```java
/**
 * Determines if the passed term is likely to be of interest in "more like" comparisons.
 *
 * @param term The word being considered
 * @return true if the term should be ignored, false if it should be used in further analysis
 */
private boolean isNoiseWord(String term) {
  int len = term.length();
  if (minWordLen > 0 && len < minWordLen) {
    return true;  // too short
  }
  if (maxWordLen > 0 && len > maxWordLen) {
    return true;  // too long
  }
  return stopWords != null && stopWords.contains(term);  // stop word
}
```
The queue here is a priority queue. The previous step collected every <term, frequency> pair; even after de-noising there are still far too many terms, so we need to pick out the most important top N.
Here each term is scored and ranked, essentially by tf * idf.
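For reference, in the DefaultSimilarity/ClassicSimilarity of that Lucene generation (an assumption about which TFIDFSimilarity is plugged in), the formula is idf = 1 + ln(numDocs / (docFreq + 1)). A quick worked example: with numDocs = 1,000,000 and docFreq = 99, idf = 1 + ln(10,000) ≈ 10.2, while a term appearing in half the index (docFreq = 500,000) gets idf ≈ 1.7, so rare terms dominate the score.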
```java
/**
 * Create a PriorityQueue from a word->tf map.
 *
 * @param words a map of words keyed on the word (String) with Int objects as the values
 */
private PriorityQueue<ScoreTerm> createQueue(Map<String, Int> words) throws IOException {
  // we have collected all words in the doc and their frequencies
  int numDocs = ir.numDocs();  // total number of documents in the index
  final int limit = Math.min(maxQueryTerms, words.size());
  FreqQ queue = new FreqQ(limit);  // terms are kept ordered by score
  for (String word : words.keySet()) {  // for each term
    int tf = words.get(word).x;  // this term's frequency
    if (minTermFreq > 0 && tf < minTermFreq) {
      continue;  // like de-noising: drop terms whose tf is too small
    }
    // across all fields, find the one with the largest df for this term
    String topField = fieldNames[0];
    int docFreq = 0;
    for (String fieldName : fieldNames) {
      int freq = ir.docFreq(new Term(fieldName, word));
      topField = (freq > docFreq) ? fieldName : topField;
      docFreq = (freq > docFreq) ? freq : docFreq;
    }
    if (minDocFreq > 0 && docFreq < minDocFreq) {
      continue;  // drop terms whose df is too small
    }
    if (docFreq > maxDocFreq) {
      continue;  // drop terms whose df is too large
    }
    if (docFreq == 0) {
      continue;  // index update problem? df == 0 should not happen; skip just in case
    }
    float idf = similarity.idf(docFreq, numDocs);
    float score = tf * idf;
    // put the result into the bounded priority queue
    if (queue.size() < limit) {
      // there is still space in the queue
      queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
    } else {
      ScoreTerm term = queue.top();
      if (term.score < score) {  // update the smallest entry in place and re-heapify
        term.update(word, topField, score, idf, docFreq, tf);
        queue.updateTop();
      }
    }
  }
  return queue;
}
```
At this point we have the terms scored and ranked: the higher a term's score, the better it represents the main content of the whole document. (This also makes the idea from the start of the article clearer: when we look for articles similar to a given one, the more of these high-value terms they share, the better.)
```java
/**
 * Create the "More like this" query from a PriorityQueue.
 */
private Query createQuery(PriorityQueue<ScoreTerm> q) {
  BooleanQuery query = new BooleanQuery();
  ScoreTerm scoreTerm;
  float bestScore = -1;
  while ((scoreTerm = q.pop()) != null) {
    TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));
    // optionally boost each TermQuery by its relative score (boost defaults to false)
    if (boost) {
      if (bestScore == -1) {
        bestScore = scoreTerm.score;
      }
      float myScore = scoreTerm.score;
      tq.setBoost(boostFactor * myScore / bestScore);
    }
    // assemble the BooleanQuery with SHOULD clauses
    try {
      query.add(tq, BooleanClause.Occur.SHOULD);
    } catch (BooleanQuery.TooManyClauses ignore) {
      break;  // stop once the clause limit is exceeded
    }
  }
  return query;
}
```
In this way, from one document and a set of chosen fields, we obtain a query. This query, acting as the soul of the document, is used to hunt down the documents similar to it.
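As a closing sketch (reusing the hypothetical mlt, searcher, and seed docNum 42 from the earlier example), the generated query then runs like any other Lucene query. One practical note: the seed document matches itself best of all, so it is usually filtered out of the results:

```java
int seedDoc = 42;                              // the document we want "more like"
Query mltQuery = mlt.like(seedDoc);
TopDocs hits = searcher.search(mltQuery, 11);  // ask for one extra hit
for (ScoreDoc sd : hits.scoreDocs) {
  if (sd.doc == seedDoc) {
    continue;                                  // skip the seed document itself
  }
  System.out.println(searcher.doc(sd.doc).get("title") + "  score=" + sd.score);
}
```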