Implementing a Custom Chinese Full-Text Index in Neo4j

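<div class="markdown-text"><p>When database query performance needs improving, indexing is usually the first place to look; more complex measures such as load balancing, read/write splitting, and distributed horizontal or vertical sharding of databases and tables come later, as requirements dictate.<br>An index speeds up retrieval by storing redundant information: it trades space for time and slows down writes, so choosing which fields to index matters a great deal.</p>
<ul>
<li>Neo4j can create an index on nodes with a given label; when a matching node's properties are added or updated, the index is updated automatically. Neo4j indexes are backed by Lucene by default (this is customizable; the Spatial Index, for example, uses its own RTree implementation), but a newly created default index only supports exact matching (get). Fuzzy queries (query) require a full-text index, which lets you control how Lucene tokenizes text behind the scenes.</li>
<li>The default analyzers of Neo4j's full-text index target Western languages: exact lookups use Lucene's KeywordAnalyzer, while fulltext queries use a white-space tokenizer, and details such as case handling mean little for Chinese. Chinese text therefore needs a Chinese analyzer plugged in, such as IK Analyzer or Ansj; deep-learning-based segmentation systems such as pullword, from 梁廠長, go further still.</li>
</ul>
<p>Using the widely used IK Analyzer as an example, this article shows how to build a full-text index on a field in Neo4j to support fuzzy queries.</p>
<hr>
<h1>The IKAnalyzer tokenizer</h1>
<p><a href="https://github.com/wks/ik-analyzer" target="_blank">IKAnalyzer</a> is an open-source, lightweight Chinese word-segmentation toolkit written in Java.<br>IKAnalyzer 3.0 features:</p>
<ul>
<li>Uses its own "forward iterative finest-granularity segmentation algorithm" and supports two segmentation modes, fine-grained and maximum word length; it processes up to 830,000 characters per second (1600 KB/s).</li>
<li>Uses a multi-subprocessor analysis mode that handles English letters, digits, and Chinese words, and is compatible with Korean and Japanese characters. Dictionary storage is optimized for a smaller memory footprint, and user-defined dictionary extensions are supported.</li>
<li>Provides IKQueryParser, a query analyzer optimized for Lucene full-text search (strongly recommended by its author). It introduces simple search expressions and uses a disambiguation algorithm to optimize how query keywords are arranged and combined, which can greatly improve Lucene's hit rate.<br>IK Analyzer is not yet available from a Maven repository, so you have to download it and install it into your local repository by hand. When I find the time I plan to set up a private Maven repository on GitHub for toolkits like this one that are missing from Maven Central.</li>
</ul>
<h1>Custom user dictionaries for IKAnalyzer</h1>
<ul>
<li>Dictionary file<br>A custom dictionary is a file with the .dic extension and must be saved as UTF-8 without a BOM.</li>
<li>Dictionary configuration<br>A note on the paths of the dictionaries and the IKAnalyzer.cfg.xml configuration file: IKAnalyzer.cfg.xml must sit in the root of the src directory. The dictionaries can live anywhere, as long as IKAnalyzer.cfg.xml points to them correctly. With the configuration below, ext.dic and stopword.dic should be in the same directory.
<table> <tbody> <tr> <td>
<div>&lt;?xml version="1.0" encoding="UTF-8"?&gt;</div>
<div>&lt;!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd"&gt;</div>
<div>&lt;properties&gt;</div>
<div>&nbsp;&nbsp;&lt;comment&gt;IK Analyzer extension configuration&lt;/comment&gt;</div>
<div>&nbsp;</div>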
<div>&nbsp;&nbsp;&lt;!-- Configure your own extension dictionaries here --&gt;</div>
<div>&nbsp;&nbsp;&lt;entry key="ext_dict"&gt;/ext.dic;&lt;/entry&gt;</div>
<div>&nbsp;</div>
<div>&nbsp;&nbsp;&lt;!-- Configure your own extension stop-word dictionaries here --&gt;</div>
<div>&nbsp;&nbsp;&lt;entry key="ext_stopwords"&gt;/stopword.dic&lt;/entry&gt;</div>
<div>&lt;/properties&gt;</div>
</td> </tr> </tbody> </table> </li> </ul>
<h1>Building the Neo4j full-text index</h1>
<p>Specify IKAnalyzer as the analyzer Lucene uses for tokenization, then build a full-text index on the chosen property of all the nodes (here, those labeled AddressNode).</p>
<table> <tbody> <tr> <td>
<div>@Override</div>
<div>public void createAddressNodeFullTextIndex() {</div>
<div>&nbsp;&nbsp;try (Transaction tx = graphDBService.beginTx()) {</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;IndexManager index = graphDBService.index();</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;// Get or create a Lucene-backed index that tokenizes with IKAnalyzer</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;Index&lt;Node&gt; addressNodeFullTextIndex =</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;index.forNodes("addressNodeFullTextIndex", MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer.class.getName()));</div>
<div>&nbsp;</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;ResourceIterator&lt;Node&gt; nodes = graphDBService.findNodes(DynamicLabel.label("AddressNode"));</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;while (nodes.hasNext()) {</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Node node = nodes.next();</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Build the full-text index on the text property</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Object text = node.getProperty("text", null);</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Skip nodes that have no text property, since a null value cannot be indexed</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (text != null) {</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;addressNodeFullTextIndex.add(node, "text", text);</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;}</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;tx.success();</div>
<div>&nbsp;&nbsp;}</div>
<div>}</div>
</td> </tr> </tbody> </table>
<h1>Testing the Neo4j full-text index</h1>
<p>Both single-keyword queries (e.g. '有限公司') and multi-keyword fuzzy queries (e.g. '蘇州 教育 公司') work out of the box, and the results come back already sorted by relevance.</p>
<table> <tbody> <tr> <td>
<div>package uadb.tr.neodao.test;</div>
<div>&nbsp;</div>
<div>import org.junit.Test;</div>
<div>import org.junit.runner.RunWith;</div>
<div>import org.neo4j.graphdb.GraphDatabaseService;</div>
<div>import org.neo4j.graphdb.Node;</div>
<div>import org.neo4j.graphdb.Transaction;</div>
<div>import org.neo4j.graphdb.index.Index;</div>
<div>import org.neo4j.graphdb.index.IndexHits;</div>
<div>import org.neo4j.graphdb.index.IndexManager;</div>
<div>import org.neo4j.helpers.collection.MapUtil;</div>
<div>import org.springframework.beans.factory.annotation.Autowired;</div>
<div>import org.springframework.test.context.ContextConfiguration;</div>
<div>import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;</div>
<div>import org.wltea.analyzer.lucene.IKAnalyzer;</div>
<div>&nbsp;</div>
<div>import com.lt.uadb.tr.entity.adtree.AddressNode;</div>
<div>import com.lt.util.serialize.JsonUtil;</div>
<div>&nbsp;</div>
<div>/**</div>
<div>&nbsp;* AddressNodeNeoDaoTest</div>
<div>&nbsp;*</div>
<div>&nbsp;* @author geosmart</div>
<div>&nbsp;*/</div>
<div>@RunWith(SpringJUnit4ClassRunner.class)</div>
<div>@ContextConfiguration(locations = { "classpath:app.neo4j.cfg.xml" })</div>
<div>public class AddressNodeNeoDaoTest {</div>
<div>&nbsp;&nbsp;@Autowired</div>
<div>&nbsp;&nbsp;GraphDatabaseService graphDBService;</div>
<div>&nbsp;</div>
<div>&nbsp;&nbsp;@Test</div>
<div>&nbsp;&nbsp;public void test_selectAddressNodeByFullTextIndex() {</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;try (Transaction tx = graphDBService.beginTx()) {</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IndexManager index = graphDBService.index();</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Open the index with the same configuration that was used to build it</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Index&lt;Node&gt; addressNodeFullTextIndex = index.forNodes("addressNodeFullTextIndex",</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer.class.getName()));</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Multi-keyword fuzzy query against the text property</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IndexHits&lt;Node&gt; foundNodes = addressNodeFullTextIndex.query("text", "蘇州 教育 公司");</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for (Node node : foundNodes) {</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;AddressNode entity = JsonUtil.ConvertMap2POJO(node.getAllProperties(), AddressNode.class, false, true);</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.
out.println(entity.getAll地址實全稱());</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tx.success();</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;}</div>
<div>&nbsp;&nbsp;}</div>
<div>}</div>
</td> </tr> </tbody> </table>
<h1>Querying the custom full-text index from Cypher</h1>
<h2>Regular-expression query</h2>
<table> <tbody> <tr> <td>
<div>profile</div>
<div>match (a:AddressNode{ruleabbr:'TOW',text:'惟亭鎮'})&lt;-[r:BELONGTO]-(b:AddressNode{ruleabbr:'STR'})</div>
<div>where b.text=~ '金陵.*'</div>
<div>return a,b</div>
</td> </tr> </tbody> </table>
<h2>Full-text index query</h2>
<table> <tbody> <tr> <td>
<div>profile</div>
<div>START b=node:addressNodeFullTextIndex("text:金陵*")</div>
<div>match (a:AddressNode{ruleabbr:'TOW',text:'惟亭鎮'})&lt;-[r:BELONGTO]-(b:AddressNode)</div>
<div>where b.ruleabbr='STR'</div>
<div>return a,b</div>
</td> </tr> </tbody> </table>
<h1>Combining exact and fulltext indexes with the legacy index</h1>
<p>Nodes labeled AddressNode are split by their ruleabbr property into two indexes on the text property, each of a different type: addressnode_fulltext_index (province -&gt; city -&gt; district/county -&gt; township/subdistrict -&gt; street, road, or lane / residential estate) and addressnode_exact_index (house number -&gt; building number -&gt; unit number -&gt; floor number -&gt; room number).</p>
<table> <tbody> <tr> <td>
<div>profile</div>
<div>START a=node:addressnode_fulltext_index("text:商業街"),b=node:addressnode_exact_index("text:二期19")</div>
<div>match (a:AddressNode{ruleabbr:'STR'})-[r:BELONGTO]-(b:AddressNode{ruleabbr:'TAB'})</div>
<div>return a,b limit 10</div>
</td> </tr> </tbody> </table>
<p>Original article: http://neo4j.com.cn/topic/58184ea2cdf6c5bf145675c3</p></div>
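<p>As a supplement to the introduction's point that a white-space tokenizer does little for Chinese text, the self-contained Java sketch below contrasts white-space splitting with a simplified dictionary-based forward maximum matching. It is not IK Analyzer's actual algorithm or API: the class name TokenizeSketch, both method names, the four-word dictionary, and the maximum word length of 4 are assumptions made purely for illustration.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; hypothetical names, not part of IK Analyzer or Neo4j.
public class TokenizeSketch {

    // White-space tokenization: split on runs of whitespace and nothing else.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Simplified forward maximum matching: at each position take the longest
    // dictionary word (up to maxWordLength characters), falling back to a
    // single character when no dictionary word starts there.
    static List<String> forwardMaxMatch(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the window until it is a dictionary word or one character long.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical mini dictionary; a real analyzer ships a far larger one.
        Set<String> dictionary = new HashSet<>(Arrays.asList("蘇州", "教育", "公司", "有限公司"));
        String text = "蘇州教育有限公司";

        // No spaces in the text, so the whole string stays one token.
        System.out.println(whitespaceTokens(text));               // [蘇州教育有限公司]
        // Dictionary matching recovers searchable words.
        System.out.println(forwardMaxMatch(text, dictionary, 4)); // [蘇州, 教育, 有限公司]
    }
}
```

<p>With white-space splitting the unspaced text is indexed as a single term, so a query for 教育 would have nothing to match; dictionary-based segmentation produces separate terms that a multi-keyword query can hit. A production analyzer handles ambiguity, mixed scripts, and stop words, none of which this sketch attempts.</p>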
