A First Look at Lucene

Date: 2019-11-18

Tags: lucene

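I had heard of Lucene for a long time but never had the chance to use it in a real project. Today, with some free time, I finally took a look and found that it really is impressive: quite a few well-known companies already use it, including Google and Apple.

This article introduces Lucene in three parts: 1. a brief introduction to Lucene, 2. how to build an index, 3. how to search.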

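1. A brief introduction to Lucene

Lucene is a full-text search engine library. Its official site is http://lucene.apache.org, and the latest version at the time of writing is 4.4.

Besides introducing Lucene, the official site prominently features Solr, which is described as a high-performance search server built on Lucene core. I will look into it later when I have time.

I also got hold of a copy of Lucene in Action, 2nd edition, and set out to walk through the classic Hello World.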

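As expected, a search engine cannot escape the following routine.

When building the index:

Use a web crawler or some other data-acquisition tool to fetch the data, then tokenize the content and keep the meaningful words, and finally build the index and save it.

When searching:

Take the user's search terms, parse and tokenize them, locate the index files, run the search, and return the results.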

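Lucene in Action describes it like this:

It is roughly the same: acquire the data from the source, build documents, analyze the documents to tokenize them, then generate the index and save it. At search time, the user enters keywords through the search UI, the application builds a query object, searches the index, and returns the results.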
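To make the routine concrete, here is a toy sketch in plain Java with no Lucene involved. It only shows the idea of an inverted index (term to document IDs) with one method for the indexing side and one for the search side. The class name `ToyInvertedIndex` and its methods are made up for illustration, and this is not how Lucene stores or searches its index internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy illustration only: NOT Lucene's actual index format or API.
class ToyInvertedIndex {
    // term -> IDs of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Indexing side: split the text into lowercase terms and record the doc ID for each term.
    int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Search side: normalize the query term the same way, then look it up.
    List<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase());
        return hits == null ? Collections.<Integer>emptyList() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene is a full-text search library");
        index.add("Solr is built on Lucene core");
        index.add("Hello World");
        System.out.println(index.search("lucene")); // [0, 1]
        System.out.println(index.search("hello"));  // [2]
    }
}
```

The point to notice is that the query term goes through the same normalization as the indexed text; otherwise "Lucene" and "lucene" would never match.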

Next, let's walk through Hello World first and then go over each important part.

2. Building the index

Here is the code that builds the index:

package demo;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter writer;

    /**
     * @param indexDir directory where the index is stored
     * @throws IOException if the index directory cannot be opened
     */
    public Indexer(String indexDir) throws IOException {
        // Store the index on the file system
        Directory dir = FSDirectory.open(new File(indexDir));
        // Configure the writer to use the standard Analyzer
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_44,
                new StandardAnalyzer(Version.LUCENE_44));
        writer = new IndexWriter(dir, conf);
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        // Walk the data directory; listFiles() returns null if it cannot be listed
        File[] files = new File(dataDir).listFiles();
        if (files == null) {
            throw new IOException("Cannot list files in " + dataDir);
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()
                    && (filter == null || filter.accept(f))) {
                // Index every file that passes the filter (txt files here)
                indexFile(f);
            }
        }
        return writer.numDocs();
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        writer.addDocument(doc);
    }

    // Add Fields to the Document
    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        // Reader-based TextField: tokenized and indexed, but not stored
        doc.add(new TextField("content", new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "UTF-8"))));
        doc.add(new StringField("filename", f.getName(), Field.Store.YES));
        doc.add(new StringField("fullpath", f.getCanonicalPath(),
                Field.Store.YES));
        return doc;
    }

    /**
     * @throws IOException if the writer cannot be closed
     */
    public void close() throws IOException {
        writer.close();
    }

    // File filter that accepts only .txt files
    private static class TextFilesFilter implements FileFilter {
        @Override
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    public static void main(String[] args) throws Exception {
        // Where the index is stored
        String indexDir = "/Users/apple/Documents/index";
        // Where the data files live
        String dataDir = "/Users/apple/Documents/data";
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            // Call our own index method to build the index
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }
}

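Here is what the code does.

First it defines the data directory and the index directory. The latter starts out empty; the former contains a few txt files, each holding a few English sentences. (Chinese is supported too, of course, but the code gets more complicated and goes beyond Hello World.)

With the two directories defined, the next step is to read the files and get their content. You can write this part in your own way, as long as it can walk the txt files and read what is in them.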
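As one possible alternative for the file-walking step, here is a small sketch using java.nio.file instead of java.io.File. The class name `TxtReader` and the method `readTxtFiles` are hypothetical, and the demo writes to a temporary directory rather than the paths used above. Note one difference from the TextFilesFilter in the Indexer: the `*.txt` glob does not lowercase the file name first, so whether `A.TXT` matches depends on the file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

class TxtReader {
    // Returns file name -> content for every readable regular *.txt file directly inside dir.
    static Map<String, String> readTxtFiles(Path dir) throws IOException {
        Map<String, String> contents = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p) && Files.isReadable(p)) {
                    contents.put(p.getFileName().toString(),
                            new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                }
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // Demo data in a temp directory; replace with your own data directory.
        Path dir = Files.createTempDirectory("data");
        Files.write(dir.resolve("a.txt"), "hello lucene".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.log"), "ignored".getBytes(StandardCharsets.UTF_8));
        System.out.println(readTxtFiles(dir)); // {a.txt=hello lucene}
    }
}
```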

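Next, the indexFile and getDocument methods turn a File object into a Document, that is, they put the file's content into a Document and then add it to the IndexWriter. A Document contains a number of Fields. You can think of a Document as a row in a database table and a Field as one of its columns.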

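As for the IndexWriter, it is created in the Indexer constructor from an IndexWriterConfig and a StandardAnalyzer.

StandardAnalyzer is, as the name suggests, the standard tokenizing analyzer; my guess is that if you want to handle Chinese, this is the place to swap something in. It helps the IndexWriter analyze the content of the Fields you pass in: tokenizing, weighting terms, and so on. The IndexWriter is responsible for writing the resulting index into the FSDirectory. The index can also live in RAM; to do that, change the FSDirectory.open(...) line in the constructor to use a different Directory implementation.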

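That completes the index creation process. If your code fails to compile, especially because StandardAnalyzer cannot be found, you need to add the jar from the analysis/common directory (the analyzers-common jar) to your classpath.

Run it once and the index is generated in the index directory.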

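3. Searching

Once the index exists, the next step is searching. Searching is relatively simple: build a query object, run the search, and display the results.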

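In the output of the run, the content field comes back as null.

As for why it is null, Lucene in Action does not say. My guess was that, since the field is a TextField read from a file, it could be very large, and putting that content into the index would make the index huge, so it is not kept by default. That is close: a TextField built from a Reader, as in getDocument above, is tokenized and indexed but not stored, so it can be searched but its original text cannot be read back from the hit. The filename and fullpath fields were added with Field.Store.YES, which is why they can be retrieved. If I have this wrong, kind readers are welcome to leave me a comment.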
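The difference between "indexed" and "stored" can be sketched with a toy model in plain Java. Everything here (`ToyStoredFields`, `add`, `search`, `get`) is hypothetical and only mimics the behavior described above; it is not Lucene code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of "indexed but not stored"; not Lucene's implementation.
class ToyStoredFields {
    // term -> doc IDs: what makes the content searchable
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc ID -> field values kept verbatim: what can be read back from a hit
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    int add(String filename, String content) {
        int docId = storedFields.size();
        for (String term : content.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
            }
        }
        Map<String, String> stored = new HashMap<>();
        stored.put("filename", filename); // stored, like Field.Store.YES
        // "content" is deliberately not stored: only its terms went into postings
        storedFields.add(stored);
        return docId;
    }

    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet());
    }

    String get(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyStoredFields index = new ToyStoredFields();
        int id = index.add("notes.txt", "Lucene makes search easy");
        System.out.println(index.search("search").contains(id)); // true
        System.out.println(index.get(id, "filename"));           // notes.txt
        System.out.println(index.get(id, "content"));            // null
    }
}
```

The document is found through its content, yet asking the hit for that content yields null, which matches what the Hello World run shows.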

And with that, the Hello World is done.

Summary:

Core classes for building an index: IndexWriter, Directory, Analyzer, Document, Field

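IndexWriter is the core component for building an index. It creates a new index or opens an existing one, and it adds, deletes, and updates the documents in the index. Since it is the one that writes the index, you have to give it a Directory to tell it where the index is stored.

Directory represents the location where a Lucene index is stored. Several storage options are available, such as the file system or RAM.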

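Analyzer: before text is indexed, it is processed by an Analyzer, which tokenizes it and drops tokens that carry no meaning, such as whitespace, punctuation, and words like "the". There are many Analyzer implementations, so you are free to pick one that suits your content.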

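Document represents a collection of fields. You can think of it as a web page, an email message, or a row in a database table, while the fields are its metadata, such as title, content, creation time, and path; you are free to define them. What you need to remember is that Lucene can only handle text and numbers, so you have to convert non-text content into text first, by whatever means.

Field is a named value that Lucene will index.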
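The Analyzer step described above (tokenize, then drop words that carry no meaning) can be sketched in a few lines of plain Java. `ToyAnalyzer` and its tiny stop-word list are made up for illustration; StandardAnalyzer's real rules and stop words differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy analyzer: lowercases, splits on anything that is not a letter or digit,
// and drops stop words. Not StandardAnalyzer's actual behavior.
class ToyAnalyzer {
    // A tiny, made-up stop-word list for the demo.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox is the king of Lucene."));
        // [quick, brown, fox, king, lucene]
    }
}
```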

Key objects for searching:

IndexSearcher is responsible for searching the index written by IndexWriter.

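Directory dir = FSDirectory.open(new File("/tmp/index"));
// In Lucene 4.x the searcher is built on an IndexReader rather than on the Directory
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
// The field name must match the one used at index time ("content" in the Indexer above)
Query q = new TermQuery(new Term("content", "lucene"));
TopDocs hits = searcher.search(q, 10);
reader.close();

This is its typical usage.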

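Term is the basic unit of search, similar to a Field. It holds the name of the field to search and the value to look for.

Query is the search object, playing the same role as a query in JDBC. Lucene has many implementations; the example uses TermQuery.

TermQuery is the most basic kind of query: it matches documents that contain a given Term.

TopDocs is the search result. It holds the top hits, each with a docID through which you can fetch the Document object.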
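To show the shape of such a result, here is a toy "top N hits" search in plain Java. It scores a document by how many times the term occurs in it, which is only a stand-in for Lucene's real scoring, and the names `ToyTopDocs` and `Hit` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy ranking by raw term count; Lucene's real scoring is more elaborate.
class ToyTopDocs {
    // One hit: which document matched (docId) and how well (score).
    static final class Hit {
        final int docId;
        final int score;

        Hit(int docId, int score) {
            this.docId = docId;
            this.score = score;
        }
    }

    static List<Hit> search(List<String> docs, String term, int n) {
        List<Hit> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (int docId = 0; docId < docs.size(); docId++) {
            int count = 0;
            for (String token : docs.get(docId).toLowerCase().split("\\W+")) {
                if (token.equals(needle)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new Hit(docId, count));
            }
        }
        // Highest score first; the sort is stable, so ties stay in docId order.
        hits.sort((a, b) -> Integer.compare(b.score, a.score));
        return hits.size() > n ? new ArrayList<>(hits.subList(0, n)) : hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene", "lucene lucene search", "hello world", "search with lucene");
        // The docId in each hit is what lets us get back to the document itself.
        for (Hit h : search(docs, "lucene", 2)) {
            System.out.println(h.docId + " -> " + docs.get(h.docId));
        }
        // 1 -> lucene lucene search
        // 0 -> lucene
    }
}
```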