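Spell checking gives users a better search experience, which is why every mainstream search engine offers it.
So what is spell checking? It is easy to understand: the search term you typed is probably misspelled, or it may simply not exist in the engine's index, yet the engine can still return similar or closely related results to help you correct it.
For example, suppose you type 在線電瓶 ("online battery") into Baidu. Its index may not contain that phrase, but it may return terms such as 在線電影 ("online movies"), 在線電視 ("online TV"), and 在線觀看 ("watch online"). That is spell checking at work.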
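Solr is a mature search system built on Lucene's APIs. It implements different search features through different components (Component), and one of them, SpellCheckComponent, provides spell checking.
To add spell checking to the search flow, you must configure the spellcheck component in Solr's solrconfig.xml and add the spell-check parameters to the relevant SearchHandler (select, query, and so on).
The configuration is as follows: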
<!-- Spell Check

     The spell check component can return a list of alternative spelling
     suggestions.

     http://wiki.apache.org/solr/SpellCheckComponent
  -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <!-- Analyze the input keywords as the text_general field type -->
  <str name="queryAnalyzerFieldType">text_general</str>

  <!-- Multiple "Spell Checkers" can be declared and used by this
       component
    -->

  <!-- a spellchecker built from a field of the main index -->
  <lst name="spellchecker">
    <!-- Name of this spell check module -->
    <str name="name">default</str>
    <!-- Index field whose terms are used for spell checking -->
    <str name="field">text</str>
    <!-- Spell checker implementation; a custom class can replace the default one -->
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- Distance measure; the default is the internal Levenshtein distance -->
    <str name="distanceMeasure">internal</str>
    <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
    <float name="accuracy">0.5</float>
    <!-- Maximum edit distance: terms within 2 edits of the input are returned as corrections -->
    <int name="maxEdits">2</int>
    <!-- A suggestion must share at least the first 1 character (the prefix) with the input -->
    <int name="minPrefix">1</int>
    <!-- maximum number of inspections per result (the cap on candidates examined per correction) -->
    <int name="maxInspections">5</int>
    <!-- Minimum length of a query term; terms shorter than 4 characters are not corrected -->
    <int name="minQueryLength">4</int>
    <!-- maximum threshold of documents a query term can appear to be considered for correction -->
    <float name="maxQueryFrequency">0.01</float>
    <!-- uncomment this to require suggestions to occur in 1% of the documents
      <float name="thresholdTokenFrequency">.01</float>
    -->
  </lst>

  <!-- a spellchecker that can break or combine words
       (a spell check module with a different implementation) -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">text</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>

  <!-- a spell check module that uses a different distance measure -->
  <!--
     <lst name="spellchecker">
       <str name="name">jarowinkler</str>
       <str name="field">spell</str>
       <str name="classname">solr.DirectSolrSpellChecker</str>
       <str name="distanceMeasure">
         org.apache.lucene.search.spell.JaroWinklerDistance
       </str>
     </lst>
   -->

  <!-- a spellchecker that use an alternate comparator

       comparatorClass be one of:
        1. score (default)
        2. freq (Frequency first, then score)
        3. A fully qualified class name
    -->
  <!--
     <lst name="spellchecker">
       <str name="name">freq</str>
       <str name="field">lowerfilt</str>
       <str name="classname">solr.DirectSolrSpellChecker</str>
       <str name="comparatorClass">freq</str>
     </lst>
    -->

  <!-- A spellchecker that reads the list of words from a file -->
  <!--
     <lst name="spellchecker">
       <str name="classname">solr.FileBasedSpellChecker</str>
       <str name="name">file</str>
       <str name="sourceLocation">spellings.txt</str>
       <str name="characterEncoding">UTF-8</str>
       <str name="spellcheckIndexDir">spellcheckerFile</str>
     </lst>
    -->
</searchComponent>
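To make the roles of maxEdits, minPrefix, and minQueryLength more concrete, here is a minimal Python sketch of the filtering idea. It is only an illustration under stated assumptions: the function names `levenshtein` and `suggest` and the `index_terms` list are hypothetical, and the code does not reproduce how DirectSolrSpellChecker actually searches the index or how it applies accuracy and frequency thresholds.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def suggest(term, dictionary, max_edits=2, min_prefix=1, min_query_length=4):
    """Return dictionary words that pass the three filters, closest first."""
    if len(term) < min_query_length:
        return []  # term is too short to be corrected
    candidates = []
    for word in dictionary:
        if word == term:
            continue  # an identical term is not a correction
        if word[:min_prefix] != term[:min_prefix]:
            continue  # the leading characters must match
        distance = levenshtein(term, word)
        if distance <= max_edits:
            candidates.append((distance, word))
    return [word for _, word in sorted(candidates)]


# Hypothetical index terms, reusing the example from earlier in this article.
index_terms = ["在線電影", "在線電視", "在線觀看", "離線電影"]
print(suggest("在線電瓶", index_terms))
```

In this sketch 離線電影 is rejected because its first character differs from the input, while the other three terms are within two edits and are kept.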
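Once the SpellCheckComponent is configured, the corresponding SearchHandler has to be configured as well. A production search application would configure select and query; for your own testing you can define a custom handler, for example spell, configured as follows: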
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <!-- Two spell check modules are configured below, the default and wordbreak
         modules defined earlier. Solr checks the input with each of them and
         merges the results. -->
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck">on</str>
    <!-- Add extra information to each suggestion, such as its frequency in the index -->
    <str name="spellcheck.extendedResults">true</str>
    <!-- Number of suggestions returned per correction -->
    <str name="spellcheck.count">10</str>
    <!-- The maximum number of suggestions to return for terms that exist in the index -->
    <str name="spellcheck.alternativeTermCount">5</str>
    <!-- The maximum number of results the query can return while still triggering spelling suggestions -->
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <!-- Whether to return collations (the query rewritten with the suggested terms) -->
    <str name="spellcheck.collate">true</str>
    <!-- Whether to return extended details for each collation -->
    <str name="spellcheck.collateExtendedResults">true</str>
    <!-- The maximum # of collation possibilities to try before giving up. -->
    <str name="spellcheck.maxCollationTries">10</str>
    <!-- Maximum number of collations to return -->
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <!-- The spellcheck component must be added to the handler's component chain;
       without this entry no spell checking is performed -->
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
The test in Solr looks as follows:
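One way to issue such a test request is sketched below in Python. The host, port, and core name (`localhost:8983`, `mycore`) and the helper name `spell_url` are assumptions for illustration, not values taken from this article; only the `/spell` handler path and the `spellcheck` parameter come from the configuration above.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named "mycore".
SOLR_BASE = "http://localhost:8983/solr/mycore"


def spell_url(query: str, base: str = SOLR_BASE) -> str:
    """Build a GET URL for the /spell request handler."""
    params = {
        "q": query,
        "wt": "json",
        # Already "on" in the handler defaults; repeated here to be explicit.
        "spellcheck": "true",
    }
    # urlencode percent-encodes non-ASCII query text as UTF-8.
    return f"{base}/spell?{urlencode(params)}"


print(spell_url("在線電瓶"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the normal search response with an additional `spellcheck` section that carries the suggestions and, because `spellcheck.collate` is enabled, the collations.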