Lucene's MoreLikeThis Similar-Search Component: How It Works and a Code Walkthrough

Date: 2019-12-06



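MoreLikeThis is a contrib module for Lucene that adds a useful extension to Lucene's query functionality. It provides a set of interfaces for similar-document search, which makes it easy to build our own "more like this" feature.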

What is similar search:

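As I understand it, similar search means finding other results that are related to one particular search result. It gives users an alternative to standard search (query -> results): they pick a result that matches what they had in mind and use it to find new results (result -> results).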

MoreLikeThis design analysis:

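First, to work well with Lucene while extending it, MoreLikeThis exposes a method that returns a Query object, Lucene's own query type. Lucene only has to search with that object to get similar results, so MoreLikeThis and Lucene fit together seamlessly; Solr provides a good example of this. The method MoreLikeThis provides is shown below: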

/**
     * Return a query that will return docs like the passed lucene document ID.
     *
     * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
     * @return a query that will return docs like the passed lucene document ID.
     */
    public Query like(int docNum) throws IOException {
        if (fieldNames == null) {
            // gather list of valid fields from lucene
            Collection<String> fields = ir.getFieldNames( IndexReader.FieldOption.INDEXED);
            fieldNames = fields.toArray(new String[fields.size()]);
        }

        return createQuery(retrieveTerms(docNum));
    }

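The docNum parameter is the id of the search result in question, that is, the result you want to use to find other, similar results. fieldNames can be understood as the fields we have chosen: the values this result has in those fields are extracted and used to analyze similarity. As the code shows, these fields are optional: if fieldNames has not been set, all indexed fields are used.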

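Next, let's look at its workflow, that is, how it arrives at the similarity query (the Query that is returned). I drew a flow chart to make this easier to explain:

[Figure: flow chart of the MoreLikeThis workflow]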

The flow chart gives the rough picture. Next, let's see how the MoreLikeThis source code implements it, along with some details.

MoreLikeThis source code analysis:

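The code implements the flow above mainly through four methods:

1. PriorityQueue<Object[]> retrieveTerms(int docNum): extracts the values that the search result identified by docNum has in the chosen fields, fieldNames.

2. void addTermFrequencies(Map<String,Int> termFreqMap, TermFreqVector vector): called from method 1; it fills the Map<String,Int> data structure mentioned in the flow chart, that is, each term together with its frequency.

3. PriorityQueue<Object[]> createQueue(Map<String,Int> words): also called from method 1; it takes the data out of the Map, performs some similarity calculations, and builds a PriorityQueue ready for the next step.

4. Query createQuery(PriorityQueue<Object[]> q): builds the final Query, as described in the last step of the flow chart.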
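To make the four steps concrete before reading the real source, here is a minimal, Lucene-free sketch of the same pipeline. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the "analyzer" is a plain whitespace split, the document frequencies are made-up numbers, and the idf formula is the classic 1 + ln(numDocs / (docFreq + 1)) rather than whatever Similarity is actually configured.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Lucene-free toy version of the four steps. All names are hypothetical.
class MltPipelineSketch {

    // A candidate term together with its tf * idf score.
    static final class ScoredTerm {
        final String word;
        final double score;

        ScoredTerm(String word, double score) {
            this.word = word;
            this.score = score;
        }
    }

    // Steps 1 and 2: split the source text on whitespace and count each word.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                freqs.merge(word, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Step 3: score every word and order the candidates from best to worst.
    static List<ScoredTerm> rank(Map<String, Integer> tf, Map<String, Integer> docFreq, int numDocs) {
        List<ScoredTerm> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // the word is not in the index at all
            }
            // Assumed idf formula: 1 + ln(numDocs / (df + 1)).
            double idf = Math.log(numDocs / (double) (df + 1)) + 1.0;
            ranked.add(new ScoredTerm(e.getKey(), e.getValue() * idf));
        }
        ranked.sort((a, b) -> Double.compare(b.score, a.score));
        return ranked;
    }

    // Step 4: keep the best maxQueryTerms words (real code wraps each one in a TermQuery).
    static List<String> queryTerms(List<ScoredTerm> ranked, int maxQueryTerms) {
        List<String> terms = new ArrayList<>();
        for (ScoredTerm t : ranked) {
            if (terms.size() >= maxQueryTerms) {
                break;
            }
            terms.add(t.word);
        }
        return terms;
    }

    // Runs the whole pipeline on made-up data.
    static List<String> demo() {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 10);
        docFreq.put("index", 40);
        docFreq.put("search", 50);
        docFreq.put("the", 1000);
        Map<String, Integer> tf = termFrequencies("Lucene search lucene index the the the");
        return queryTerms(rank(tf, docFreq, 1000), 2);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this toy data, "the" occurs most often in the source text but appears in every document, so it ranks last; the two terms kept for the query are "lucene" and "index".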

Next, let's go through the source of each in turn:

/**
     * Find words for a more-like-this query former.
     *
     * @param docNum the id of the lucene document from which to find terms
     */
    public PriorityQueue<Object[]> retrieveTerms(int docNum) throws IOException {
        Map<String,Int> termFreqMap = new HashMap<String,Int>();
        for (int i = 0; i < fieldNames.length; i++) {
            String fieldName = fieldNames[i];
            TermFreqVector vector = ir.getTermFreqVector(docNum, fieldName);

            // field does not store term vector info
            if (vector == null) {
                Document d = ir.document(docNum);
                String[] text = d.getValues(fieldName);
                if (text != null) {
                    for (int j = 0; j < text.length; j++) {
                        addTermFrequencies(new StringReader(text[j]), termFreqMap, fieldName);
                    }
                }
            } else {
                addTermFrequencies(termFreqMap, vector);
            }

        }

        return createQueue(termFreqMap);
    }

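Inside the loop, getTermFreqVector(docNum, fieldName) returns a TermFreqVector object that holds an array of strings and an array of integers: the terms found in the fieldName field and the frequency with which each term occurs. If the field stores no term vector, the stored field values are analyzed again instead.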

/**
     * Adds terms and frequencies found in vector into the Map termFreqMap
     *
     * @param termFreqMap a Map of terms and their frequencies
     * @param vector List of terms and their frequencies for a doc/field
     */
    private void addTermFrequencies(Map<String,Int> termFreqMap, TermFreqVector vector) {
        String[] terms = vector.getTerms();
        int[] freqs = vector.getTermFrequencies();
        for (int j = 0; j < terms.length; j++) {
            String term = terms[j];

            if (isNoiseWord(term)) {
                continue;
            }
            // increment frequency
            Int cnt = termFreqMap.get(term);
            if (cnt == null) {
                cnt = new Int();
                termFreqMap.put(term, cnt);
                cnt.x = freqs[j];
            } else {
                cnt.x += freqs[j];
            }
        }
    }

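The calls to vector.getTerms() and vector.getTermFrequencies() take the term array and the frequency array (terms, freqs) out of the TermFreqVector obtained in the previous step; the two arrays correspond element by element. The loop that follows does some checking (noise words are skipped) and puts the data into the Map, accumulating the frequencies from freqs[].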
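The accumulation is easiest to see when the same term turns up in two fields. The sketch below is a stand-alone approximation, not the Lucene code itself: the Int holder mimics the helper class used in the listing, the parallel arrays stand in for a TermFreqVector, the field names in the comments are invented, and the noise-word check is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone approximation of the frequency accumulation. Names are hypothetical.
class TermFreqAccumulationSketch {

    // Mutable counter, playing the role of the Int helper in the listing above.
    static final class Int {
        int x;
    }

    // Adds one field's parallel term/frequency arrays into the shared map.
    static void addTermFrequencies(Map<String, Int> termFreqMap, String[] terms, int[] freqs) {
        for (int j = 0; j < terms.length; j++) {
            Int cnt = termFreqMap.get(terms[j]);
            if (cnt == null) {
                cnt = new Int();
                termFreqMap.put(terms[j], cnt);
            }
            cnt.x += freqs[j]; // a term seen in an earlier field keeps adding up
        }
    }

    // Feeds two made-up fields into one map.
    static Map<String, Int> demo() {
        Map<String, Int> map = new HashMap<>();
        addTermFrequencies(map, new String[] {"lucene", "search"}, new int[] {2, 1}); // e.g. a title field
        addTermFrequencies(map, new String[] {"lucene", "index"}, new int[] {3, 4});  // e.g. a body field
        return map;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Int> e : demo().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().x);
        }
    }
}
```

Here "lucene" ends up with a count of 5 (2 from the first field plus 3 from the second), which is why the later steps can no longer tell which field a frequency came from.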

/**
     * Create a PriorityQueue from a word->tf map.
     *
     * @param words a map of words keyed on the word(String) with Int objects as the values.
     */
    private PriorityQueue<Object[]> createQueue(Map<String,Int> words) throws IOException {
        // have collected all words in doc and their freqs
        int numDocs = ir.numDocs();
        FreqQ res = new FreqQ(words.size()); // will order words by score

        Iterator<String> it = words.keySet().iterator();
        while (it.hasNext()) { // for every word
            String word = it.next();

            int tf = words.get(word).x; // term freq in the source doc
            if (minTermFreq > 0 && tf < minTermFreq) {
                continue; // filter out words that don't occur enough times in the source
            }

            // go through all the fields and find the largest document frequency
            String topField = fieldNames[0];
            int docFreq = 0;
            for (int i = 0; i < fieldNames.length; i++) {
                int freq = ir.docFreq(new Term(fieldNames[i], word));
                topField = (freq > docFreq) ? fieldNames[i] : topField;
                docFreq = (freq > docFreq) ? freq : docFreq;
            }

            if (minDocFreq > 0 && docFreq < minDocFreq) {
                continue; // filter out words that don't occur in enough docs
            }

            if (docFreq > maxDocFreq) {
                continue; // filter out words that occur in too many docs            	
            }

            if (docFreq == 0) {
                continue; // index update problem?
            }

            float idf = similarity.idf(docFreq, numDocs);
            float score = tf * idf;

            // only really need 1st 3 entries, other ones are for troubleshooting
            res.insertWithOverflow(new Object[]{word,                   // the word
                                    topField,               // the top field
                                    Float.valueOf(score),       // overall score
                                    Float.valueOf(idf),         // idf
                                    Integer.valueOf(docFreq),   // freq in all docs
                                    Integer.valueOf(tf)
            });
        }
        return res;
    }

First, new FreqQ(words.size()) creates a priority queue; the while loop then visits each term, word, one at a time.

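Next, the loop over fieldNames finds the field in which the term has the highest document frequency and uses it as the field the term will be searched in. (From the steps above, the frequency of a single term may be the sum of its frequencies in several fields, but each term in the Query can only be searched in one field, so the field with the highest document frequency is chosen.)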
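The field selection can be sketched in isolation. In the example below the map of per-field document frequencies replaces the ir.docFreq(new Term(field, word)) lookups; the class name, the FieldChoice holder, and the sample numbers are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Isolated sketch of the "top field" choice for one term. Names are hypothetical.
class TopFieldSketch {

    // The chosen field and the document frequency of the term in that field.
    static final class FieldChoice {
        final String field;
        final int docFreq;

        FieldChoice(String field, int docFreq) {
            this.field = field;
            this.docFreq = docFreq;
        }
    }

    // Picks the field where the term occurs in the most documents.
    // Falls back to the first field with docFreq 0 when no field contains the term.
    static FieldChoice pickTopField(String[] fieldNames, Map<String, Integer> docFreqByField) {
        String topField = fieldNames[0];
        int docFreq = 0;
        for (String field : fieldNames) {
            int freq = docFreqByField.getOrDefault(field, 0);
            if (freq > docFreq) {
                topField = field;
                docFreq = freq;
            }
        }
        return new FieldChoice(topField, docFreq);
    }

    // Made-up document frequencies for a single term across three fields.
    static FieldChoice demo() {
        Map<String, Integer> docFreqByField = new HashMap<>();
        docFreqByField.put("title", 12);
        docFreqByField.put("body", 87);
        docFreqByField.put("tags", 30);
        return pickTopField(new String[] {"title", "body", "tags"}, docFreqByField);
    }

    public static void main(String[] args) {
        FieldChoice choice = demo();
        System.out.println(choice.field + " (docFreq=" + choice.docFreq + ")");
    }
}
```

When no field contains the term, the result keeps docFreq at 0, which matches the listing's later check that drops such terms.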

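The two lines that compute idf and score do the scoring: score = tf * idf. This value becomes the weight of the basic query object, TermQuery, that is built later; when several Query objects are combined, it indicates which ones matter more. The calculation borrows from the cosine formula, because Lucene's scoring is also based on the vector space model and measures similarity by the cosine of the angle between two vectors. For details, see these two blog posts, both very well written: http://blog.csdn.net/forfuture1978/article/details/5353126 and http://www.cnblogs.com/ansen/articles/1906353.html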
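A small numeric sketch shows why tf * idf favors rare terms. The listing only calls similarity.idf(docFreq, numDocs); the formula used here, 1 + ln(numDocs / (docFreq + 1)), is the classic default as far as I know and should be treated as an assumption, since a custom Similarity may compute something else. The class name and the sample numbers are invented.

```java
// Numeric sketch of the tf * idf score, using an assumed idf formula.
class TfIdfSketch {

    // Assumed classic formula: idf = 1 + ln(numDocs / (docFreq + 1)).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Score of a term, as in the listing: term frequency in the source doc times idf.
    static float score(int tf, int docFreq, int numDocs) {
        return tf * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // A rare term that occurs twice in the source document.
        System.out.println("rare:   " + score(2, 9, numDocs));
        // A very common term that occurs five times in the source document.
        System.out.println("common: " + score(5, 999, numDocs));
    }
}
```

Under this formula the rare term scores higher even though it occurs fewer times in the source document, so it will carry more weight in the generated query.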

Note: in Lucene a boost can be applied to three kinds of element to raise their position in the ranking: field, document, and query.

Finally, the entries are put into the queue, which is returned.

/**
     * Create the More like query from a PriorityQueue
     */
    private Query createQuery(PriorityQueue<Object[]> q) {
        BooleanQuery query = new BooleanQuery();
        Object cur;
        int qterms = 0;
        float bestScore = 0;

        while (((cur = q.pop()) != null)) {
            Object[] ar = (Object[]) cur;
            TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));

            if (boost) {
                if (qterms == 0) {
                    bestScore = ((Float) ar[2]).floatValue();
                }
                float myScore = ((Float) ar[2]).floatValue();

                tq.setBoost(boostFactor * myScore / bestScore);
            }

            try {
                query.add(tq, BooleanClause.Occur.SHOULD);
            }
            catch (BooleanQuery.TooManyClauses ignore) {
                break;
            }

            qterms++;
            if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
                break;
            }
        }

        return query;
    }

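The method first creates a compound query object, BooleanQuery, to which the basic TermQuery objects are added one by one.

The while loop then pops entries off the queue one at a time and wraps each in a TermQuery.

When boost is enabled, each TermQuery is given a different weight, as mentioned earlier: its score relative to the best score, multiplied by boostFactor.

Finally, the Query is returned.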
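The boost arithmetic in createQuery can be tried out on plain numbers. The sketch below assumes the scores arrive best first, which is what the listing implies by taking the first popped score as bestScore; the class name and the sample scores are hypothetical.

```java
// Sketch of the boost normalization, detached from Lucene. Names are hypothetical.
class BoostNormalizationSketch {

    // Returns the boost for each term that would make it into the query.
    // scoresBestFirst must be ordered from highest to lowest score.
    static float[] boosts(float[] scoresBestFirst, float boostFactor, int maxQueryTerms) {
        int n = scoresBestFirst.length;
        if (maxQueryTerms > 0 && n > maxQueryTerms) {
            n = maxQueryTerms; // the listing stops once maxQueryTerms terms were added
        }
        float[] result = new float[n];
        float bestScore = 0;
        for (int i = 0; i < n; i++) {
            if (i == 0) {
                bestScore = scoresBestFirst[i]; // first term popped defines the scale
            }
            result[i] = boostFactor * scoresBestFirst[i] / bestScore;
        }
        return result;
    }

    // Four made-up scores, boostFactor 1, at most three query terms.
    static float[] demo() {
        return boosts(new float[] {8f, 4f, 2f, 1f}, 1f, 3);
    }

    public static void main(String[] args) {
        for (float boost : demo()) {
            System.out.println(boost);
        }
    }
}
```

The best term always gets a boost equal to boostFactor, and every other term gets a proportionally smaller one, so the absolute tf * idf values never leak into the query, only their ratios.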

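That concludes the analysis of the MoreLikeThis implementation. My personal impression is that MoreLikeThis is not used much in real-world search, but it gives us an approach to finding similar results; perhaps, with our own modifications and definitions, we can use it to tune a search engine and make its results more satisfying.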

Original blog post; when reposting, please credit http://my.oschina.net/BreathL/blog/41663