Use cases for suggest
User input is unpredictable, yet when writing programs we usually want users to search with specific content, or content in a specific format, so we have to intervene in the search conditions the user types. When using search engines such as Baidu or Google, we often see, the moment a key is released, a prompt suggesting related content the user might be looking for. Lucene anticipated exactly this need: the suggest package it provides is designed to solve these problems.
Overview of the suggest package
The suggest package provides Lucene's support for auto-completion and spell checking;
spell-checking classes live in the org.apache.lucene.search.spell package;
suggestion (auto-completion) classes live in the org.apache.lucene.search.suggest package;
analyzer-based suggestion classes live in the org.apache.lucene.search.suggest.analyzing package. A short auto-completion sketch follows.
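To make the auto-completion side concrete, here is a minimal sketch of mine (not from the original post) using AnalyzingSuggester from the analyzing package. It assumes a Lucene 5.x-era API, a StandardAnalyzer, and a suggest.txt in the FileDictionary format described later in this article:

import java.io.FileInputStream;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.search.suggest.Lookup.LookupResult;
import org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester;

public class AutoCompleteDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Lucene 5.x-era constructor; newer versions also require a temp Directory
        AnalyzingSuggester suggester = new AnalyzingSuggester(analyzer);
        // build the in-memory suggester from a "term<TAB>weight" file
        suggester.build(new FileDictionary(new FileInputStream("suggest.txt")));
        // top-5 completions for the prefix 奔馳
        List<LookupResult> results = suggester.lookup("奔馳", false, 5);
        for (LookupResult result : results) {
            System.out.println(result.key + " (weight=" + result.value + ")");
        }
    }
}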
How spell checking works
- Lucene's spell checking is backed by the org.apache.lucene.search.spell.SpellChecker class;
- SpellChecker uses a default accuracy of 0.5; if you need a different threshold, set it with setAccuracy(float accuracy);
- SpellChecker builds its own index from words taken from an external source.
These sources include (see the construction sketch after the list):
DocumentDictionary reads the values of a given field from documents;
FileDictionary is a dictionary backed by a text file, one entry per line, with fields separated by the TAB ("\t") character; an entry must not contain more than two separators;
HighFrequencyDictionary reads the values of a term from an existing index and filters them by how often they occur;
LuceneDictionary also reads term values from an existing index, but without checking occurrence counts;
PlainTextDictionary reads plain text line by line, with no separators.
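As an illustration (my sketch, not from the original post), the five sources can be constructed as follows; the index path and the field and file names ("index", "title", "weight", "words.txt") are assumptions:

import java.io.FileInputStream;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.Dictionary;
import org.apache.lucene.search.spell.HighFrequencyDictionary;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.suggest.DocumentDictionary;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.store.FSDirectory;

public class DictionarySources {
    public static void main(String[] args) throws Exception {
        // an existing "main" index to pull terms from (path/field are assumptions)
        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));

        Dictionary fromDocs  = new DocumentDictionary(reader, "title", "weight");      // field values, weighted by a numeric field
        Dictionary highFreq  = new HighFrequencyDictionary(reader, "title", 0.01f);    // only terms above a doc-frequency threshold
        Dictionary allTerms  = new LuceneDictionary(reader, "title");                  // all terms, no frequency check
        Dictionary fromFile  = new FileDictionary(new FileInputStream("suggest.txt")); // "term<TAB>weight" per line
        Dictionary plainText = new PlainTextDictionary(Paths.get("words.txt"));        // one word per line, no separators

        reader.close();
    }
}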
The indexing itself works as follows:
- The indexing process is synchronized (on modifyCurrentIndexLock);
- it checks whether the SpellChecker has already been closed and, if so, throws an exception with the message "Spellchecker has been closed";
- it iterates over the words of the external source; any word shorter than three characters is ignored, otherwise a Document is built for it and indexed into the local spell index. While indexing, each word is split into n-grams (the addGram method). The whole procedure looks like this:
/**
 * Indexes the data from the given {@link Dictionary}.
 * @param dict Dictionary to index
 * @param config {@link IndexWriterConfig} to use
 * @param fullMerge whether or not the spellcheck index should be fully merged
 * @throws AlreadyClosedException if the Spellchecker is already closed
 * @throws IOException If there is a low-level I/O error.
 */
public final void indexDictionary(Dictionary dict, IndexWriterConfig config, boolean fullMerge) throws IOException {
  synchronized (modifyCurrentIndexLock) {
    ensureOpen();
    final Directory dir = this.spellIndex;
    final IndexWriter writer = new IndexWriter(dir, config);
    IndexSearcher indexSearcher = obtainSearcher();
    final List<TermsEnum> termsEnums = new ArrayList<>();

    final IndexReader reader = searcher.getIndexReader();
    if (reader.maxDoc() > 0) {
      for (final LeafReaderContext ctx : reader.leaves()) {
        Terms terms = ctx.reader().terms(F_WORD);
        if (terms != null)
          termsEnums.add(terms.iterator(null));
      }
    }

    boolean isEmpty = termsEnums.isEmpty();

    try {
      BytesRefIterator iter = dict.getEntryIterator();
      BytesRef currentTerm;

      terms: while ((currentTerm = iter.next()) != null) {

        String word = currentTerm.utf8ToString();
        int len = word.length();
        if (len < 3) {
          continue; // too short we bail but "too long" is fine...
        }

        if (!isEmpty) {
          for (TermsEnum te : termsEnums) {
            if (te.seekExact(currentTerm)) {
              continue terms;
            }
          }
        }

        // ok index the word
        Document doc = createDocument(word, getMin(len), getMax(len));
        writer.addDocument(doc);
      }
    } finally {
      releaseSearcher(indexSearcher);
    }
    if (fullMerge) {
      writer.forceMerge(1);
    }
    // close writer
    writer.close();
    // TODO: this isn't that great, maybe in the future SpellChecker should take
    // IWC in its ctor / keep its writer open?

    // also re-open the spell index to see our own changes when the next suggestion
    // is fetched:
    swapSearcher(dir);
  }
}
The method that splits each word into n-grams is addGram.
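Its body is not quoted in the original post; the following is my paraphrase of the same-era Lucene SpellChecker source (a sketch; exact FieldType details may differ between versions). It uses org.apache.lucene.document.{Document, Field, FieldType, StringField} and org.apache.lucene.index.IndexOptions:

private static void addGram(String text, Document doc, int ng1, int ng2) {
  int len = text.length();
  for (int ng = ng1; ng <= ng2; ng++) {
    String key = "gram" + ng;
    String end = null;
    for (int i = 0; i < len - ng + 1; i++) {
      String gram = text.substring(i, i + ng);
      FieldType ft = new FieldType(StringField.TYPE_NOT_STORED);
      ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
      doc.add(new Field(key, gram, ft)); // every n-gram of the word
      if (i == 0) {
        // the first n-gram is additionally indexed into a "start" field
        doc.add(new StringField("start" + ng, gram, Field.Store.NO));
      }
      end = gram;
    }
    if (end != null) { // may be absent if the word is shorter than ng
      // the last n-gram is additionally indexed into an "end" field
      doc.add(new StringField("end" + ng, end, Field.Store.NO));
    }
  }
}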
As the code shows, the suggestion index records not only the n-grams at the start of each word but also those at its end (the start*/end* fields).
When suggestions are queried, the checker first tests whether the gram fields contain the n-grams of the query word; matching words are put into a SuggestWordQueue, and the final result is the String[] obtained by draining that queue. The implementation:
public String[] suggestSimilar(String word, int numSug, IndexReader ir,
    String field, SuggestMode suggestMode, float accuracy) throws IOException {
  // obtainSearcher calls ensureOpen
  final IndexSearcher indexSearcher = obtainSearcher();
  try {
    if (ir == null || field == null) {
      suggestMode = SuggestMode.SUGGEST_ALWAYS;
    }
    if (suggestMode == SuggestMode.SUGGEST_ALWAYS) {
      ir = null;
      field = null;
    }

    final int lengthWord = word.length();

    final int freq = (ir != null && field != null) ? ir.docFreq(new Term(field, word)) : 0;
    final int goalFreq = suggestMode==SuggestMode.SUGGEST_MORE_POPULAR ? freq : 0;
    // if the word exists in the real index and we don't care for word frequency, return the word itself
    if (suggestMode==SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && freq > 0) {
      return new String[] { word };
    }

    BooleanQuery query = new BooleanQuery();
    String[] grams;
    String key;

    for (int ng = getMin(lengthWord); ng <= getMax(lengthWord); ng++) {

      key = "gram" + ng; // form key

      grams = formGrams(word, ng); // form word into ngrams (allow dups too)

      if (grams.length == 0) {
        continue; // hmm
      }

      if (bStart > 0) { // should we boost prefixes?
        add(query, "start" + ng, grams[0], bStart); // matches start of word
      }
      if (bEnd > 0) { // should we boost suffixes
        add(query, "end" + ng, grams[grams.length - 1], bEnd); // matches end of word
      }
      for (int i = 0; i < grams.length; i++) {
        add(query, key, grams[i]);
      }
    }

    int maxHits = 10 * numSug;

    // System.out.println("Q: " + query);
    ScoreDoc[] hits = indexSearcher.search(query, maxHits).scoreDocs;
    // System.out.println("HITS: " + hits.length());
    SuggestWordQueue sugQueue = new SuggestWordQueue(numSug, comparator);

    // go thru more than 'maxr' matches in case the distance filter triggers
    int stop = Math.min(hits.length, maxHits);
    SuggestWord sugWord = new SuggestWord();
    for (int i = 0; i < stop; i++) {

      sugWord.string = indexSearcher.doc(hits[i].doc).get(F_WORD); // get orig word

      // don't suggest a word for itself, that would be silly
      if (sugWord.string.equals(word)) {
        continue;
      }

      // edit distance
      sugWord.score = sd.getDistance(word,sugWord.string);
      if (sugWord.score < accuracy) {
        continue;
      }

      if (ir != null && field != null) { // use the user index
        sugWord.freq = ir.docFreq(new Term(field, sugWord.string)); // freq in the index
        // don't suggest a word that is not present in the field
        if ((suggestMode==SuggestMode.SUGGEST_MORE_POPULAR && goalFreq > sugWord.freq) || sugWord.freq < 1) {
          continue;
        }
      }
      sugQueue.insertWithOverflow(sugWord);
      if (sugQueue.size() == numSug) {
        // if queue full, maintain the minScore score
        accuracy = sugQueue.top().score;
      }
      sugWord = new SuggestWord();
    }

    // convert to array string
    String[] list = new String[sugQueue.size()];
    for (int i = sugQueue.size() - 1; i >= 0; i--) {
      list[i] = sugQueue.pop().string;
    }

    return list;
  } finally {
    releaseSearcher(indexSearcher);
  }
}
Programming practice
Below is a test program I wrote following the FileDictionary description above:
package com.lucene.search;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class SuggestUtil {
    public static void main(String[] args) {
        Directory spellIndexDirectory;
        try {
            // local directory that will hold the spell/suggest index
            spellIndexDirectory = FSDirectory.open(Paths.get("suggest"));

            SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
            Analyzer analyzer = new IKAnalyzer(true);
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            config.setOpenMode(OpenMode.CREATE_OR_APPEND);
            // accept every candidate regardless of edit-distance score
            spellchecker.setAccuracy(0f);
            //HighFrequencyDictionary dire = new HighFrequencyDictionary(reader, field, thresh)
            spellchecker.indexDictionary(
                    new FileDictionary(new FileInputStream(new File("D:\\hadoop\\lucene_suggest\\src\\suggest.txt"))),
                    config, false);
            String[] similars = spellchecker.suggestSimilar("中國", 10);
            for (String similar : similars) {
                System.out.println(similar);
            }
            // close the checker before the directory it reads from
            spellchecker.close();
            spellIndexDirectory.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The suggest.txt file I used contains:
中國人民 100
奔馳3 101
奔馳中國 102
奔馳S級 103
奔馳A級 104
奔馳C級 105
The test output is as follows; only the two entries sharing n-grams with the query 中國 are matched:
中國人民
奔馳中國