Notes on Chinese Word Segmentation in Lucene with IKAnalyzer

Date: 2019-11-06



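This article explains how to use IKAnalyzer (hereafter "IK") with Lucene. The background and purpose of Lucene and the IK analyzer are not repeated here. Lucene versions change remarkably fast: at the time of writing the latest release is already 4.9.0, and presumably any technology goes through this as it grows. This article uses Lucene 4.0 and the 2012FF release of IKAnalyzer.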

Download Lucene from its official site. IK can be downloaded from:

http://code.google.com/p/ik-analyzer/downloads/list

After downloading IK, copy it into your project. The directory structure is shown in the figure below:

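As you can see, the src directory contains three configuration files: the extension dictionary ext.dic, the stop-word dictionary stopword.dic, and the configuration file IKAnalyzer.cfg.xml. IKAnalyzer.cfg.xml specifies the paths of the extension dictionary and the stop-word dictionary. By default it sits at the root of the classpath; moving it elsewhere requires modifying the IK source code.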
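To make the role of IKAnalyzer.cfg.xml more concrete, here is a minimal sketch of reading such a file with the JDK alone. It assumes the file uses the standard Java XML-properties format, with the keys ext_dict and ext_stopwords and semicolon-separated paths, which is how the stock file shipped with IK 2012 appears to be laid out; check your own copy. The class name IkConfigSketch and the inline sample are made up for illustration, and this is not IK's own loading code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class IkConfigSketch {

    // Assumed sample modelled on the stock IKAnalyzer.cfg.xml; verify against your own file.
    static final String SAMPLE =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <comment>IK Analyzer extension settings</comment>\n"
        + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
        + "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
        + "</properties>\n";

    /** Reads one key from an XML-properties stream and splits its value on ';'. */
    static List<String> readPaths(InputStream in, String key) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(in); // standard JDK call; also closes the stream
        List<String> paths = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    /** Convenience wrapper that parses the inline SAMPLE above. */
    static List<String> readSample(String key) {
        try {
            InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
            return readPaths(in, key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ext_dict      = " + readSample("ext_dict"));
        System.out.println("ext_stopwords = " + readSample("ext_stopwords"));
    }
}
```

In a real project you would pass readPaths a stream opened from the classpath root, for example via IkConfigSketch.class.getResourceAsStream("/IKAnalyzer.cfg.xml"), instead of the inline sample.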

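Using IK in a program is simple: you only need to create an IKAnalyzer object, because IKAnalyzer extends Lucene's Analyzer.

The no-argument IK constructor uses the fine-grained segmentation algorithm by default:

Analyzer analyzer = new IKAnalyzer(); // fine-grained segmentation

You can also pass an argument to select the smart segmentation algorithm:

Analyzer analyzer = new IKAnalyzer(true); // smart segmentation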
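The difference between the two modes is easiest to see on a toy example. The sketch below does not use IK at all. It works on a tiny hand-made dictionary, emits every dictionary word found in the text as a stand-in for fine-grained output, and uses plain forward maximum matching as a rough stand-in for smart mode. IK's real smart mode resolves ambiguities with its own logic, so treat the class GranularitySketch, its dictionary, and both methods as illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GranularitySketch {

    // Hypothetical mini-dictionary; IK ships far larger ones.
    static final Set<String> DICT = new HashSet<>(
            Arrays.asList("中文", "分詞", "中文分詞", "工具", "工具包"));

    /** Fine-grained stand-in: every dictionary word at every position, overlaps included. */
    static List<String> allMatches(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                String candidate = text.substring(start, end);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    /** Coarse stand-in: forward maximum matching, producing non-overlapping tokens. */
    static List<String> longestMatch(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int bestEnd = pos + 1; // fall back to a single character if nothing matches
            for (int end = pos + 1; end <= text.length(); end++) {
                if (dict.contains(text.substring(pos, end))) {
                    bestEnd = end; // keep the longest dictionary word starting at pos
                }
            }
            out.add(text.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中文分詞工具包";
        System.out.println("all matches   : " + allMatches(text, DICT));
        System.out.println("longest match : " + longestMatch(text, DICT));
    }
}
```

With this dictionary, allMatches yields the overlapping words 中文, 中文分詞, 分詞, 工具 and 工具包, while longestMatch yields only 中文分詞 and 工具包. What IK actually produces depends on its own dictionaries and disambiguation rules.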

A demo follows:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;




/**
 * Demo of indexing and searching with Lucene using IKAnalyzer
 * 2012-3-2
 * 
 * Written against the Lucene 4.0 API
 *
 */
public class LuceneIndexAndSearchDemo {
	
	
	/**
	 * Simulation:
	 * build an index holding a single record, then search it
	 * @param args
	 */
	public static void main(String[] args){
		// Field name in the Lucene Document
		String fieldName = "text";
		// Content to index and search
		String text = "IK Analyzer是一個結合詞典分詞和文法分詞的中文分詞開源工具包。它使用了全新的正向迭代最細粒度切分算法。";
		
		// Instantiate the IKAnalyzer in smart-segmentation mode
		Analyzer analyzer = new IKAnalyzer(true);
		
		Directory directory = null;
		IndexWriter iwriter = null;
		IndexReader ireader = null;
		IndexSearcher isearcher = null;
		try {
			// Create an in-memory index
			directory = new RAMDirectory();
			
			// Configure the IndexWriterConfig
			IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_40 , analyzer);
			iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
			iwriter = new IndexWriter(directory , iwConfig);
			// Write the document to the index
			Document doc = new Document();
			doc.add(new StringField("ID", "10000", Field.Store.YES));
			doc.add(new TextField(fieldName, text, Field.Store.YES));
			iwriter.addDocument(doc);
			iwriter.close();
			
			
			// Search phase **********************************
			// Instantiate the searcher
			ireader = DirectoryReader.open(directory);
			isearcher = new IndexSearcher(ireader);
			
			String keyword = "中文分詞工具包";
			// Build the Query object with the QueryParser
			QueryParser qp = new QueryParser(Version.LUCENE_40, fieldName,  analyzer);
			qp.setDefaultOperator(QueryParser.AND_OPERATOR);
			Query query = qp.parse(keyword);
			System.out.println("Query = " + query);
			
			// Retrieve the 5 highest-scoring records
			TopDocs topDocs = isearcher.search(query , 5);
			System.out.println("Hits: " + topDocs.totalHits);
			// Print the results; loop over scoreDocs, since totalHits can exceed the 5 returned
			ScoreDoc[] scoreDocs = topDocs.scoreDocs;
			for (int i = 0; i < scoreDocs.length; i++){
				Document targetDoc = isearcher.doc(scoreDocs[i].doc);
				System.out.println("Content: " + targetDoc.toString());
			}
			
		} catch (CorruptIndexException e) {
			e.printStackTrace();
		} catch (LockObtainFailedException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} catch (ParseException e) {
			e.printStackTrace();
		} finally{
			if(ireader != null){
				try {
					ireader.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
			if(directory != null){
				try {
					directory.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
		}
	}
}

As the code shows, using IK really is simple. In fact, this example ships with the IK package under org/wltea/analyzer/sample/. For more on using Lucene, see this other article:

http://www.52jialy.com/article/showArticle?articleId=402881e546d8b14b0146d8e638640008