Basic Concepts
When we run a search on Google, each result includes a "Similar pages" link; clicking that link issues another search request, which finds documents similar to the original result.
Solr's similarity query, MoreLikeThis, implements exactly this feature.
The documentation explains it like this:
Generate "more like this" similarity queries.
Based on this mail:
Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough. But looking up the docFreq() of every term in the document is
probably too slow.
You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration. Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.
It all depends on what you're trying to do. If you're trying to eek out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless. But if you're trying to provide a "more like this" button on a search results page that does a decent job and has good performance, such techniques might be useful. An efficient, effective "more-like-this" query generator would be a great contribution, if anyone's interested. I'd imagine that it would take a Reader or a String (the document's
text), analyzer Analyzer, and return a set of representative terms using heuristics like those
above. The frequency and length thresholds could be parameters, etc.
Doug
複製代碼
The key point (core idea) of the above:
The MoreLikeThis built into Lucene 5 computes similarity by scoring: candidate terms are placed into a priority queue according to their final score, so the highest-scoring terms naturally sit at the top of the queue.
The official documentation explains:
There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link).
The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results.
The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
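For the most common mode, a request-handler query is just an HTTP call. As an illustrative sketch only (the core name techproducts, the /mlt handler path, and the document id are all assumptions here; the handler must be registered in your solrconfig.xml):

```
http://localhost:8983/solr/techproducts/mlt?q=id:SP2514N&mlt.fl=manu,cat&mlt.mintf=1&mlt.mindf=1
```

mlt.fl lists the source fields for similarity, and mlt.mintf / mlt.mindf correspond to the minimum term and document frequencies discussed below.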
The official documentation further explains:
MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields ( see the mlt.fl parameter, below). For best results, the fields should have stored term vectors in schema.xml. For example:
```xml
<field name="cat" ... termVectors="true" />
```
If term vectors are not stored, MoreLikeThis will generate terms from stored fields. A uniqueKey must also be stored in order for MoreLikeThis to work properly.
The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms, and any other query parameters that have been defined (see the mlt.qf parameter, below) and a new document set is returned.
In other words: MoreLikeThis constructs a Lucene Query from the terms of an indexed document. These terms are pulled from a defined list of fields, specified via the mlt.fl parameter. For best results, those fields should store term vector information, i.e. the field's termVectors attribute should be set to true.
If a field's term vectors are not stored, MoreLikeThis will generate terms from the stored fields (fields with stored="true") instead. For MoreLikeThis to work properly, the uniqueKey must also be stored.
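As a concrete illustration (the field names and types are hypothetical; adapt them to your own schema.xml), that amounts to something like:

```xml
<field name="id"  type="string"       indexed="true" stored="true" />
<field name="cat" type="text_general" indexed="true" stored="true" termVectors="true" />
<uniqueKey>id</uniqueKey>
```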
OK, so we still don't know exactly how MoreLikeThis is implemented, but the descriptions above already give us the key point: build a query from the document's most characteristic terms.
Now let's see how the source code implements this. The key fields of MoreLikeThis are:
| Field | Description |
|---|---|
| private Analyzer analyzer | the analyzer used to tokenize text |
| private int minTermFreq | minimum term frequency (default 2) |
| private int minDocFreq | minimum document frequency (default 5) |
| private int maxDocFreq | maximum document frequency (default 2147483647, i.e. Integer.MAX_VALUE) |
| private int maxQueryTerms | maximum number of query terms (default 25) |
| private TFIDFSimilarity similarity | used to compute relevance (tf*idf) |
To interoperate well with Lucene, MoreLikeThis extends it: it exposes a method that returns a Query object, i.e. a regular Lucene query. Searching with that Query yields the similar results, so MoreLikeThis plugs into Lucene seamlessly; Solr itself is a good example of this. We'll walk through the public Query like(int docNum) method to explain how the similarity query works:
```java
public Query like(int docNum) throws IOException {
  if (this.fieldNames == null) {
    // no field list configured: default to every indexed field
    Collection<String> fields = MultiFields.getIndexedFields(this.ir);
    this.fieldNames = fields.toArray(new String[fields.size()]);
  }
  return this.createQuery(this.retrieveTerms(docNum));
}
```
The docNum parameter is the Lucene document id of the search result for which you want to find similar results. fieldNames can be understood as the fields we selected: the document's values in those fields are what get analyzed for similarity. As the code shows, the field list is optional; if it is not set, every indexed field is used.
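Before digging into the internals, here is a minimal usage sketch (assuming a Lucene 5.x index under /path/to/index with title and body fields; the doc id 42 is just a placeholder):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MltDemo {
  public static void main(String[] args) throws Exception {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
    IndexSearcher searcher = new IndexSearcher(reader);

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setAnalyzer(new StandardAnalyzer());           // needed when term vectors are absent
    mlt.setFieldNames(new String[] {"title", "body"}); // fields to mine for terms
    mlt.setMinTermFreq(2);                             // thresholds from the table above
    mlt.setMinDocFreq(5);

    Query query = mlt.like(42);                        // 42 = docNum of the seed document
    TopDocs hits = searcher.search(query, 10);
    System.out.println("similar docs found: " + hits.totalHits);
    reader.close();
  }
}
```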
The implementation logic of like() is as follows:
termVector means term vector. A term vector is essentially a mathematical model built from how often each term occurs in a document (tf) and how many documents contain the term (df); the similarity of two documents can then be judged from the angle between their term vectors.
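To make the vector intuition concrete, here is a small self-contained sketch (not Lucene code, just the underlying math) computing the cosine of the angle between two term-frequency vectors:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineDemo {
  // cosine similarity between two term-frequency maps: dot(a,b) / (|a| * |b|)
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
      normA += e.getValue() * e.getValue();
    }
    for (int v : b.values()) {
      normB += v * v;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    Map<String, Integer> doc1 = new HashMap<>();
    doc1.put("solr", 3); doc1.put("lucene", 2); doc1.put("query", 1);
    Map<String, Integer> doc2 = new HashMap<>();
    doc2.put("solr", 1); doc2.put("lucene", 4); doc2.put("index", 2);
    System.out.println(cosine(doc1, doc2)); // closer to 1.0 = more similar
  }
}
```

Note that MoreLikeThis does not literally compute this angle; as the code below shows, it extracts the high tf*idf terms and issues them as a Boolean query, which approximates the same intuition.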
```java
/**
 * Find words for a more-like-this query former.
 *
 * @param docNum the id of the lucene document from which to find terms
 */
private PriorityQueue<ScoreTerm> retrieveTerms(int docNum) throws IOException {
  Map<String, Int> termFreqMap = new HashMap<>();
  for (String fieldName : fieldNames) {
    final Fields vectors = ir.getTermVectors(docNum);
    final Terms vector;
    if (vectors != null) {
      vector = vectors.terms(fieldName);
    } else {
      vector = null;
    }
    if (vector == null) {
      // This field does not store term vector info, so recompute it here: read
      // the stored value, tokenize it, and count term frequencies. Note that
      // this path needs an analyzer (by default the StandardAnalyzer!).
      Document d = ir.document(docNum);
      IndexableField[] fields = d.getFields(fieldName);
      for (IndexableField field : fields) {
        final String stringValue = field.stringValue();
        if (stringValue != null) {
          addTermFrequencies(new StringReader(stringValue), termFreqMap, fieldName);
        }
      }
    } else {
      // the term vector was saved at index time, so just add it directly
      addTermFrequencies(termFreqMap, vector);
    }
  }
  return createQueue(termFreqMap);
}
```
Note that in termFreqMap a term is not tied to any field: whether it occurs in the title or in the body, occurrences of the same term text are simply accumulated. addTermFrequencies does exactly that, storing the accumulated counts in termFreqMap.
```java
private void addTermFrequencies(Reader r, Map<String, MoreLikeThis.Int> termFreqMap, String fieldName) throws IOException {
  if (this.analyzer == null) {
    throw new UnsupportedOperationException("To use MoreLikeThis without term vectors, you must provide an Analyzer");
  }
  TokenStream ts = this.analyzer.tokenStream(fieldName, r);
  try {
    int tokenCount = 0;
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      String word = termAtt.toString();
      // stop once we have consumed the maximum number of tokens
      if (++tokenCount > this.maxNumTokensParsed) {
        break;
      }
      // skip words rejected by the noise rules (length limits, stop words)
      if (this.isNoiseWord(word)) {
        continue;
      }
      // increment the term's frequency
      MoreLikeThis.Int cnt = termFreqMap.get(word);
      if (cnt == null) {
        termFreqMap.put(word, new MoreLikeThis.Int());
      } else {
        ++cnt.x;
      }
    }
    ts.end();
  } finally {
    IOUtils.closeWhileHandlingException(ts);
  }
}
```
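The other overload, addTermFrequencies(Map, Terms), walks the stored term vector instead of re-analyzing text. A sketch of its shape (paraphrased from the Lucene 5 source, not copied verbatim):

```java
private void addTermFrequencies(Map<String, MoreLikeThis.Int> termFreqMap, Terms vector) throws IOException {
  TermsEnum termsEnum = vector.iterator();
  BytesRef text;
  while ((text = termsEnum.next()) != null) {
    String term = text.utf8ToString();
    if (isNoiseWord(term)) {
      continue;  // apply the same noise filtering as the analyzer path
    }
    // within a single document's term vector, totalTermFreq is that term's tf
    int freq = (int) termsEnum.totalTermFreq();
    MoreLikeThis.Int cnt = termFreqMap.get(term);
    if (cnt == null) {
      cnt = new MoreLikeThis.Int();
      cnt.x = freq;
      termFreqMap.put(term, cnt);
    } else {
      cnt.x += freq;
    }
  }
}
```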
上面操做中,咱們還須要進行降噪操做 降噪操做有幾個原則:
```java
/**
 * Determines if the passed term is likely to be of interest in "more like" comparisons.
 *
 * @param term The word being considered
 * @return true if the term should be ignored, false if it should be used in further analysis
 */
private boolean isNoiseWord(String term) {
  int len = term.length();
  if (minWordLen > 0 && len < minWordLen) {
    return true;  // too short
  }
  if (maxWordLen > 0 && len > maxWordLen) {
    return true;  // too long
  }
  return stopWords != null && stopWords.contains(term);  // stop word
}
```
The queue here is a priority queue. The previous step collected every <term, frequency> pair; even after de-noising there are still far too many terms, so we need to pick out the most important top N.
Here each term is scored and ranked, essentially by tf * idf.
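For reference, in the DefaultSimilarity/ClassicSimilarity of that Lucene generation (an assumption about which TFIDFSimilarity is plugged in), the formula is idf = 1 + ln(numDocs / (docFreq + 1)). A quick worked example: with numDocs = 1,000,000 and docFreq = 99, idf = 1 + ln(10,000) ≈ 10.2, while a term appearing in half the index (docFreq = 500,000) gets idf ≈ 1.7, so rare terms dominate the score.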
```java
/**
 * Create a PriorityQueue from a word->tf map.
 *
 * @param words a map of words keyed on the word (String) with Int objects as the values
 */
private PriorityQueue<ScoreTerm> createQueue(Map<String, Int> words) throws IOException {
  // we have collected all words in the doc and their frequencies
  int numDocs = ir.numDocs();  // total number of documents in the index
  final int limit = Math.min(maxQueryTerms, words.size());
  FreqQ queue = new FreqQ(limit);  // terms are kept ordered by score
  for (String word : words.keySet()) {  // for each term
    int tf = words.get(word).x;  // this term's frequency
    if (minTermFreq > 0 && tf < minTermFreq) {
      continue;  // like de-noising: drop terms whose tf is too small
    }
    // across all fields, find the one with the largest df for this term
    String topField = fieldNames[0];
    int docFreq = 0;
    for (String fieldName : fieldNames) {
      int freq = ir.docFreq(new Term(fieldName, word));
      topField = (freq > docFreq) ? fieldName : topField;
      docFreq = (freq > docFreq) ? freq : docFreq;
    }
    if (minDocFreq > 0 && docFreq < minDocFreq) {
      continue;  // drop terms whose df is too small
    }
    if (docFreq > maxDocFreq) {
      continue;  // drop terms whose df is too large
    }
    if (docFreq == 0) {
      continue;  // index update problem? df == 0 should not happen; skip just in case
    }
    float idf = similarity.idf(docFreq, numDocs);
    float score = tf * idf;
    // put the result into the bounded priority queue
    if (queue.size() < limit) {
      // there is still space in the queue
      queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
    } else {
      ScoreTerm term = queue.top();
      if (term.score < score) {  // update the smallest entry in place and re-heapify
        term.update(word, topField, score, idf, docFreq, tf);
        queue.updateTop();
      }
    }
  }
  return queue;
}
```
At this point we have the terms scored and ranked: the higher a term's score, the better it represents the main content of the whole document. (This also makes the idea from the start of the article clearer: when we look for articles similar to a given one, the more of these high-value terms they share, the better.)
```java
/**
 * Create the "More like this" query from a PriorityQueue.
 */
private Query createQuery(PriorityQueue<ScoreTerm> q) {
  BooleanQuery query = new BooleanQuery();
  ScoreTerm scoreTerm;
  float bestScore = -1;
  while ((scoreTerm = q.pop()) != null) {
    TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));
    // optionally boost each TermQuery by its relative score (boost defaults to false)
    if (boost) {
      if (bestScore == -1) {
        bestScore = scoreTerm.score;
      }
      float myScore = scoreTerm.score;
      tq.setBoost(boostFactor * myScore / bestScore);
    }
    // assemble the BooleanQuery with SHOULD clauses
    try {
      query.add(tq, BooleanClause.Occur.SHOULD);
    } catch (BooleanQuery.TooManyClauses ignore) {
      break;  // stop once the clause limit is exceeded
    }
  }
  return query;
}
```
In this way, from one document and a set of chosen fields, we obtain a query. This query, acting as the soul of the document, is used to hunt down the documents similar to it.
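As a closing sketch (reusing the hypothetical mlt, searcher, and seed docNum 42 from the earlier example), the generated query then runs like any other Lucene query. One practical note: the seed document matches itself best of all, so it is usually filtered out of the results:

```java
int seedDoc = 42;                              // the document we want "more like"
Query mltQuery = mlt.like(seedDoc);
TopDocs hits = searcher.search(mltQuery, 11);  // ask for one extra hit
for (ScoreDoc sd : hits.scoreDocs) {
  if (sd.doc == seedDoc) {
    continue;                                  // skip the seed document itself
  }
  System.out.println(searcher.doc(sd.doc).get("title") + "  score=" + sd.score);
}
```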