Solr Spell Check Configuration

Spell checking gives users a better search experience, which is why every mainstream search engine provides it.

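So what is spell checking? The idea is simple: the search term you entered may have been mistyped, or it may not exist in the search index at all, yet the engine can still return similar or closely related terms to help you correct it.
For example, if you type 在線電瓶 ("online battery") into Baidu, its index may not contain that term, but it may suggest 在線電影 ("online movies"), 在線電視 ("online TV"), 在線觀看 ("watch online") and so on. Those suggestions come from the spell check feature.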

 

Solr is a mature search system built on the Lucene APIs. It implements different search features through different components, and one of them, SpellCheckComponent, provides spell checking.

To add spell checking to searches, you must configure the spellcheck component in Solr's solrconfig.xml and add the spell check parameters to the relevant SearchHandler (select, query, and so on).

The configuration is as follows:

  <!-- Spell Check

       The spell check component can return a list of alternative spelling
       suggestions.

       http://wiki.apache.org/solr/SpellCheckComponent
    -->
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <!-- Analyze the input query using the text_general field type -->
    <str name="queryAnalyzerFieldType">text_general</str>

    <!-- Multiple "Spell Checkers" can be declared and used by this
         component
      -->

    <!-- a spellchecker built from a field of the main index -->
    <lst name="spellchecker">
      <!-- Name of this spellchecker -->
      <str name="name">default</str>
      <!-- Index field that suggestions are drawn from -->
      <str name="field">text</str>
      <!-- Spellchecker implementation; a custom class can replace the default one -->
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <!-- Distance measure; the default is the internal Levenshtein -->
      <str name="distanceMeasure">internal</str>
      <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
      <float name="accuracy">0.5</float>
      <!-- Maximum edit distance: terms within 2 edits of the input are considered as corrections -->
      <int name="maxEdits">2</int>
      <!-- Minimum shared prefix: a candidate must start with the same first character as the input -->
      <int name="minPrefix">1</int>
      <!-- maximum number of inspections per result -->
      <int name="maxInspections">5</int>
      <!-- Minimum query term length: terms shorter than 4 characters are not corrected -->
      <int name="minQueryLength">4</int>
      <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
      <float name="maxQueryFrequency">0.01</float>
      <!-- uncomment this to require suggestions to occur in 1% of the documents
          <float name="thresholdTokenFrequency">.01</float>
      -->
    </lst>

    <!-- a spellchecker that can break or combine words (a different spellchecker implementation) -->
    <lst name="spellchecker">
      <str name="name">wordbreak</str>
      <str name="classname">solr.WordBreakSolrSpellChecker</str>
      <str name="field">text</str>
      <str name="combineWords">true</str>
      <str name="breakWords">true</str>
      <int name="maxChanges">10</int>
    </lst>

    <!-- a spellchecker that uses a different distance measure -->
    <!--
       <lst name="spellchecker">
         <str name="name">jarowinkler</str>
         <str name="field">spell</str>
         <str name="classname">solr.DirectSolrSpellChecker</str>
         <str name="distanceMeasure">
           org.apache.lucene.search.spell.JaroWinklerDistance
         </str>
       </lst>
     -->

    <!-- a spellchecker that uses an alternate comparator

         comparatorClass can be one of:
          1. score (default)
          2. freq (Frequency first, then score)
          3. A fully qualified class name
      -->
    <!--
       <lst name="spellchecker">
         <str name="name">freq</str>
         <str name="field">lowerfilt</str>
         <str name="classname">solr.DirectSolrSpellChecker</str>
         <str name="comparatorClass">freq</str>
       </lst>
      -->

    <!-- A spellchecker that reads the list of words from a file -->
    <!--
       <lst name="spellchecker">
         <str name="classname">solr.FileBasedSpellChecker</str>
         <str name="name">file</str>
         <str name="sourceLocation">spellings.txt</str>
         <str name="characterEncoding">UTF-8</str>
         <str name="spellcheckIndexDir">spellcheckerFile</str>
       </lst>
      -->
  </searchComponent>
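
To make the maxEdits and minPrefix settings more concrete, here is a minimal Python sketch of edit-distance filtering. It is only an illustration under stated assumptions: it uses plain Levenshtein distance (Lucene's internal measure may differ in details such as how transpositions are counted), it does not reflect how DirectSolrSpellChecker actually enumerates terms, and the function names and the word list are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def candidates(term, vocabulary, max_edits=2, min_prefix=1):
    """Keep words that share the first min_prefix characters with the term
    and are at most max_edits edits away from it."""
    prefix = term[:min_prefix]
    return [w for w in vocabulary
            if w != term
            and w.startswith(prefix)
            and levenshtein(term, w) <= max_edits]


# Hypothetical vocabulary standing in for the terms of the indexed field.
vocab = ["search", "starch", "seared", "reach", "server"]
print(candidates("serch", vocab))
```

With these values, "reach" is dropped by the prefix rule, and "seared" and "server" are dropped because they are more than two edits away from "serch".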

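Once the SpellCheckComponent is configured, you need to configure the corresponding SearchHandler. In a production search system you would configure the select and query handlers; for your own testing you can define a custom one, for example spell, configured as follows: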

  <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="df">text</str>
      <!-- Two spellcheckers are configured below: the default and wordbreak ones defined earlier. Solr checks the input with each of them and merges the results. -->
      <str name="spellcheck.dictionary">default</str>
      <str name="spellcheck.dictionary">wordbreak</str>
      <str name="spellcheck">on</str>
      <!-- Add extra information to each suggestion, such as its frequency in the index -->
      <str name="spellcheck.extendedResults">true</str>
      <!-- Maximum number of suggestions to return per correction -->
      <str name="spellcheck.count">10</str>
      <!-- The maximum number of suggestions to return for terms that exist in the index -->
      <str name="spellcheck.alternativeTermCount">5</str>
      <!-- The maximum number of results the query can return while still triggering spelling suggestions -->
      <str name="spellcheck.maxResultsForSuggest">5</str>
      <!-- Whether to return collations (the query rewritten with the suggested corrections) -->
      <str name="spellcheck.collate">true</str>
      <!-- Whether to return extended details for each collation -->
      <str name="spellcheck.collateExtendedResults">true</str>
      <!-- The maximum # of collation possibilities to try before giving up. -->
      <str name="spellcheck.maxCollationTries">10</str>
      <!-- Maximum number of collations to return -->
      <str name="spellcheck.maxCollations">5</str>
    </lst>

    <!-- The spellcheck component must be added to the search component chain; without this entry no spell checking is performed -->
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

Test it in Solr as follows:
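
The original test output is not reproduced here. As a rough sketch, a request to the /spell handler could be built like this in Python. The host, port, and core name (`mycore`) are assumptions, not values from the configuration above, so adjust them to your installation.

```python
from urllib.parse import urlencode

# Assumed values: change host, port, and core name to match your installation.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "mycore"  # hypothetical core name


def spell_url(query: str) -> str:
    """Build a request URL for the /spell handler configured in solrconfig.xml."""
    params = {
        "q": query,    # the (possibly misspelled) search term
        "wt": "json",  # ask for a JSON response
    }
    return f"{SOLR_BASE}/{CORE}/spell?{urlencode(params)}"


print(spell_url("serch"))
```

Because the handler's defaults already set spellcheck=on and the other spellcheck.* parameters, only q needs to be passed. Opening the printed URL in a browser or an HTTP client should return the normal search results plus a spellcheck section containing the suggestions.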
