Building a Search Engine with Lucene 2.3

Date: 2019-11-22


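Lucene is not a complete full-text search application. It is a full-text indexing engine toolkit written in Java that can be embedded easily into all kinds of applications to give them full-text indexing and retrieval.

About the author: Lucene's contributor, Doug Cutting, is a veteran full-text indexing and retrieval expert. He was a principal developer of the V-Twin search engine (one of the achievements of Apple's Copland operating system), later served as a senior system architect at Excite, and at the time of writing was researching low-level Internet infrastructure. His goal in contributing Lucene was to bring full-text retrieval to small and medium-sized applications.

History: Lucene was first published on the author's own site, www.lucene.com, and later on SourceForge. At the end of 2001 it became a subproject of the Apache Software Foundation's Jakarta project: http://jakarta.apache.org/lucene/

Many Java projects already use Lucene as their back-end full-text indexing engine.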

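1. Getting Started
First, download the Lucene 2.3.0 package from Apache. It contains the core jar and the Lucene API documentation. After unpacking it, put lucene-core-2.3.0.jar on the classpath.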
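
Before going further, it can help to confirm that the jar is actually visible to the JVM. The following is a minimal sketch and not part of the original article: the class `ClasspathCheck` and its method `isOnClasspath` are hypothetical helpers, and `org.apache.lucene.index.IndexWriter` is used as the probe only because the indexing code later in this article depends on it.

```java
// Hypothetical helper (not from the original article): reports whether a
// class can be located on the classpath, without initializing it.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String probe = "org.apache.lucene.index.IndexWriter";
        System.out.println(probe
                + (isOnClasspath(probe) ? " found" : " NOT found; check the classpath"));
    }
}
```

If the probe class is reported as not found, the jar is most likely missing from the `-cp` argument or from the IDE's build path.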

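2. Creating the Index

Creating an index requires two locations: the directory where the index will be stored (later searches run against the index in this directory) and, if you are indexing files, the directory that contains those files. The code is as follows: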

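// Imports needed by createIndex() and doIndex(); both methods belong to the same class.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.poi.hwpf.extractor.WordExtractor;

public void createIndex() throws Exception {
    File indexDir = new File("D://luceneIndex"); // directory where the index is stored
    File dataDir = new File("D://test");         // directory containing the files to index

    // PaodingAnalyzer is the Paoding (庖丁解牛) Chinese word-segmentation analyzer.
    // It extends Lucene's abstract Analyzer class and helps a great deal with Chinese text.
    Analyzer luceneAnalyzer = new PaodingAnalyzer();
    File[] dataFiles = dataDir.listFiles();

    // Create a new index only when the index directory is missing or empty.
    // listFiles() returns null if the directory does not exist.
    File[] existing = indexDir.listFiles();
    boolean create = (existing == null || existing.length == 0);

    // The third argument is a boolean: true creates a new index,
    // false operates on the existing index.
    IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, create);

    long startTime = new Date().getTime();
    this.doIndex(dataFiles, indexWriter);
    indexWriter.optimize(); // optimize the index
    indexWriter.close();    // close the index writer
    long endTime = new Date().getTime();

    System.out.println("It takes " + (endTime - startTime)
            + " milliseconds to create index for the files in directory " + dataDir.getPath());
}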
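
The third `IndexWriter` constructor argument decides whether an index is created from scratch or an existing one is reused, so it is worth isolating that decision. The following is a minimal sketch, assuming a new index should be created whenever the index directory is missing or empty; `IndexDirs` and `shouldCreate` are hypothetical names that do not appear in the original article.

```java
import java.io.File;

// Hypothetical helper (not from the original article).
final class IndexDirs {

    // Returns true when a new index should be created: the path is missing,
    // is not a readable directory, or is a directory with no entries.
    static boolean shouldCreate(File indexDir) {
        String[] entries = indexDir.list(); // null if not a directory or on I/O error
        return entries == null || entries.length == 0;
    }
}
```

This only checks whether the directory has any entries at all. Lucene 2.x also provides `IndexReader.indexExists(File)`, which is probably a better fit when the directory might contain unrelated files.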

private void doIndex(File[] dataFiles, IndexWriter indexWriter) throws Exception {
    if (dataFiles == null) {
        return; // listFiles() returns null for a missing or unreadable directory
    }
    for (int i = 0; i < dataFiles.length; i++) {
        File file = dataFiles[i];
        if (file.isFile() && file.getName().endsWith(".html")) { // index all HTML files
            System.out.println("Indexing file " + file.getCanonicalPath());
            Reader txtReader = new FileReader(file);
            try {
                Document document = new Document();
                // Field.Store.YES: store the value; Field.Store.NO: do not store it
                // Field.Index.TOKENIZED: tokenize; Field.Index.UN_TOKENIZED: index as a single term
                document.add(new Field("path", file.getCanonicalPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                document.add(new Field("filename", file.getName(), Field.Store.YES, Field.Index.TOKENIZED));
                // Another constructor, which accepts a Reader (tokenized and indexed, not stored)
                document.add(new Field("contents", txtReader));
                indexWriter.addDocument(document);
            } finally {
                txtReader.close(); // the Reader is consumed during addDocument()
            }
        } else if (file.isFile() && file.getName().endsWith(".doc")) { // index all Word files
            FileInputStream in = new FileInputStream(file); // open the file stream
            String str;
            try {
                WordExtractor extractor = new WordExtractor(in); // parse the Word file with POI
                str = extractor.getText(); // extracted text as a String
            } finally {
                in.close();
            }
            // Build a Document with three Fields: path, filename, contents
            Document document = new Document();
            document.add(new Field("path", file.getCanonicalPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            document.add(new Field("filename", file.getName(), Field.Store.YES, Field.Index.TOKENIZED));
            // This constructor takes a String and also records term vectors with positions and offsets
            document.add(new Field("contents", str, Field.Store.YES, Field.Index.TOKENIZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            indexWriter.addDocument(document);
        } else if (file.isDirectory()) {
            doIndex(file.listFiles(), indexWriter); // recurse into subdirectories
        }
    }
}
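
The directory traversal can also be kept separate from the indexing itself, which makes it easier to test. The following is a sketch under two assumptions that differ from the original code: it requires Java 8 or later, and it matches file extensions case-insensitively. `FileCollector` and `collect` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not from the original article).
final class FileCollector {

    // Recursively collects regular files under root whose lower-cased name
    // ends with one of the given suffixes (for example ".html" or ".doc").
    static List<Path> collect(Path root, String... suffixes) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> hasSuffix(p, suffixes))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static boolean hasSuffix(Path p, String[] suffixes) {
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        for (String suffix : suffixes) {
            if (name.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

A call such as `FileCollector.collect(Paths.get("D:/test"), ".html", ".doc")` would then produce the list of files to hand to the indexing step.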

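As the indexing code above shows, creating an index for files (or, more generally, for data) is straightforward. First decide which folder to index, or which data from a database. Note that Lucene only accepts data and does not care where it comes from: anything you can convert to a String (or supply as a Reader) can be indexed. Then specify where the resulting index will be stored. Finally, after processing the data yourself, create a Document object, add as many Fields as you need, and specify for each Field options such as whether it is tokenized. With that, the index is built.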
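
To make the tokenized versus un-tokenized distinction concrete, here is a toy illustration in plain Java. It is not how Lucene is implemented and uses no Lucene classes: `ToyIndex` and its methods are hypothetical, and whitespace splitting stands in for a real analyzer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps "field:term" to the ids of the documents containing it.
final class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // tokenized=true: split the value into lower-cased words, one term per word.
    // tokenized=false: index the whole value as one exact term.
    void addField(int docId, String field, String value, boolean tokenized) {
        String[] terms = tokenized
                ? value.toLowerCase(Locale.ROOT).trim().split("\\s+")
                : new String[] { value };
        for (String term : terms) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of documents whose field contains exactly this term.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet());
    }
}
```

In this sketch, a tokenized `filename` of `annual report.doc` can be found through the single term `annual`, while an un-tokenized `path` matches only when the complete path string is supplied. That is the reason the article's code indexes `path` as `UN_TOKENIZED` and `filename` as `TOKENIZED`.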

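Note: to use the Paoding Chinese analyzer, put Paoding's dictionary (the dic folder) on the classpath, and put the paoding-analyzer.properties file on the classpath as well. The properties file contains the following: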

paoding.imports = \
ifexists:classpath:paoding-analysis-default.properties;\
ifexists:classpath:paoding-analysis-user.properties;\
ifexists:classpath:paoding-knives-user.properties
paoding.dic.home = classpath:dic
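
A mis-formatted properties file is an easy mistake to make here, so the following sketch shows how `java.util.Properties` reads a value that is continued across lines with trailing backslashes. It assumes the multi-line `paoding.imports` entry uses that standard continuation syntax. The class `PaodingConfigCheck` and its `parse` helper are hypothetical, and this is not how Paoding itself loads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper (not from the original article).
final class PaodingConfigCheck {

    // Parses properties text the same way the JDK parses a .properties file.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text =
                "paoding.imports = \\\n"
                + "    ifexists:classpath:paoding-analysis-default.properties;\\\n"
                + "    ifexists:classpath:paoding-analysis-user.properties;\\\n"
                + "    ifexists:classpath:paoding-knives-user.properties\n"
                + "paoding.dic.home = classpath:dic\n";
        Properties props = parse(text);
        // The continued lines are joined into a single value.
        System.out.println(props.getProperty("paoding.imports"));
        System.out.println(props.getProperty("paoding.dic.home"));
    }
}
```

Without the trailing backslashes, each `ifexists:` line would be parsed as a separate key instead of being part of `paoding.imports`.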