lucene_02_IKAnalyzer

Date: 2019-12-09


Preface

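Lucene already ships with many analyzers, such as StandardAnalyzer and CJKAnalyzer, but none of them segments Chinese into real words: StandardAnalyzer breaks the text into single characters, and CJKAnalyzer into overlapping two-character pairs.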

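After all, these analyzers were not written with Chinese in mind. This post introduces a Chinese analyzer, IKAnalyzer. Its segmentation is still not perfect.

For example, the sentence 「高富帥，是2012年以後纔有的詞彙」 (roughly: "高富帥 is a term that only appeared after 2012")

is split as shown in the figure below: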

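However, it can be configured through files to add new words and to filter out words that should not appear, such as 「的、啊、呀」 and other modifiers and modal particles that carry no concrete meaning.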

Configuring the IK analyzer

Step 1: Add IK to pom.xml. Note: this analyzer has not been updated since 2012, so it can only be used with older versions of Lucene. This example uses Lucene 4.10.3.

    <!-- IK Chinese analyzer -->
    <!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
    <dependency>
      <groupId>com.janeluo</groupId>
      <artifactId>ikanalyzer</artifactId>
      <version>2012_u6</version>
    </dependency>

Complete pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.chen</groupId>
  <artifactId>lucene</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>lucene</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>4.10.3</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>4.10.3</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>4.10.3</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>2.6</version>
    </dependency>
    <!-- JUnit 4 provides the @Test annotation used by the test code -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
    </dependency>

    <!-- IK Chinese analyzer -->
    <!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
    <dependency>
      <groupId>com.janeluo</groupId>
      <artifactId>ikanalyzer</artifactId>
      <version>2012_u6</version>
    </dependency>



  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

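Step 2: Add the configuration file, the extension dictionary file, and the stop-word file to the resources directory.

IKAnalyzer.cfg.xml is the analyzer's core configuration file. It points to ext.dic (the extension dictionary file) and stopword.dic (the stop-word file).

Its content is as follows: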

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
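    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">ext.dic;</entry>

    <!-- Users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">stopword.dic;</entry>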
    
</properties>
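
The DOCTYPE above is the standard Java XML properties format, so the file can be inspected with plain `java.util.Properties`. The following is a minimal sketch of that idea, not IK's own loading code: the class `IkConfigReader` and its method are hypothetical names, and splitting the value on `;` is an assumption based on the trailing semicolons in the entries above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of IK): lists the file names stored under one key
// of a properties-style XML configuration.
class IkConfigReader {

    static List<String> dictionaryFiles(InputStream xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(xml); // parses the <properties>/<entry> format
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> files = new ArrayList<>();
        // Assumption: several files may be listed, separated by ';'
        for (String part : props.getProperty(key, "").split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                files.add(name);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
                + "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
                + "</properties>\n";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(dictionaryFiles(in, "ext_dict")); // prints [ext.dic]
    }
}
```

A check like this can help confirm that the keys and file names in the configuration are spelled as intended before debugging the analyzer itself.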

Example content of ext.dic:

高富帥
白富美
java工程師
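
To see why adding a word to the dictionary changes the output, here is a small sketch of dictionary-based segmentation. It uses forward maximum matching, a simple textbook technique chosen for illustration. It is not IK's actual algorithm, and the class name `ForwardMaxMatch` and the tiny dictionaries are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative forward maximum matching segmenter (not IK's implementation).
class ForwardMaxMatch {

    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a dictionary word or a single character
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> base = new HashSet<>(Arrays.asList("詞彙"));
        Set<String> extended = new HashSet<>(base);
        extended.add("高富帥"); // plays the role of an ext.dic entry

        System.out.println(segment("高富帥詞彙", base, 4));     // prints [高, 富, 帥, 詞彙]
        System.out.println(segment("高富帥詞彙", extended, 4)); // prints [高富帥, 詞彙]
    }
}
```

Without the extra entry the unknown word falls apart into single characters. With it, the word is kept whole, which is the effect ext.dic is meant to have.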

Example content of stopword.dic:

我
是
用
的
你
它
他
她
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with
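
Conceptually, a stop-word list is applied after segmentation: any token that appears in the list is dropped. The sketch below shows that idea in plain Java. The class name `StopWordFilter` is hypothetical, the token list is invented for illustration rather than taken from IK's real output, and exact string matching is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stop-word filter (not IK's implementation).
class StopWordFilter {

    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) { // exact match only
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up segmentation of the example sentence
        List<String> tokens = Arrays.asList("高富帥", "是", "2012", "年", "以後", "纔有", "的", "詞彙");
        Set<String> stopWords = new HashSet<>(Arrays.asList("是", "的"));
        System.out.println(removeStopWords(tokens, stopWords));
        // prints [高富帥, 2012, 年, 以後, 纔有, 詞彙]
    }
}
```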

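Test code

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

// The class name is arbitrary
public class IKAnalyzerTest {

    // Inspect how an analyzer tokenizes a piece of text
    @Test
    public void testTokenStream() throws Exception {
        // Create the analyzer; swap in one of the commented-out lines to compare
        // Analyzer analyzer = new StandardAnalyzer();
        // Analyzer analyzer = new CJKAnalyzer();
        // Analyzer analyzer = new SmartChineseAnalyzer();
        Analyzer analyzer = new IKAnalyzer();
        // Obtain a TokenStream
        // First argument: the field name, any value will do here
        // Second argument: the text to analyze
        // TokenStream tokenStream = analyzer.tokenStream("test",
        //         "The Spring Framework provides a comprehensive programming and configuration model.");
        TokenStream tokenStream = analyzer.tokenStream("test",
                "高富帥，是2012年以後纔有的詞彙");
        // Attribute that exposes the text of the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Attribute that records the start and end offsets of the current token
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // reset() must be called before the first incrementToken()
        tokenStream.reset();
        // Iterate over the tokens; incrementToken() returns false at the end of the stream
        while (tokenStream.incrementToken()) {
            // Start offset of the token
            System.out.println("start->" + offsetAttribute.startOffset());
            // Token text
            System.out.println(charTermAttribute);
            // End offset of the token
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        // end() finishes the stream before it is closed
        tokenStream.end();
        tokenStream.close();
        analyzer.close();
    }
}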
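
The start and end values printed above are character offsets into the original text, with the end offset pointing one past the token's last character, so `text.substring(start, end)` gives back the token. The following plain-Java sketch illustrates that convention without Lucene. The class `OffsetDemo` is hypothetical, and it assumes tokens appear in order and do not overlap, which real analyzers do not always guarantee.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: recomputes [start, end) offsets for tokens found left to right.
class OffsetDemo {

    static int[][] locate(String text, List<String> tokens) {
        int[][] offsets = new int[tokens.size()][];
        int from = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            int start = text.indexOf(token, from);
            if (start < 0) {
                throw new IllegalArgumentException("token not found: " + token);
            }
            int end = start + token.length(); // exclusive end offset
            offsets[i] = new int[] {start, end};
            from = end;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "高富帥，是2012年";
        int[][] offsets = locate(text, Arrays.asList("高富帥", "2012"));
        for (int[] o : offsets) {
            System.out.println(o[0] + "-" + o[1] + " " + text.substring(o[0], o[1]));
        }
        // prints:
        // 0-3 高富帥
        // 5-9 2012
    }
}
```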

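The result is shown in the figure below:

Quote of the day

Behind every present you are unhappy with, there is a past in which you did not work hard enough.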