Lucene in Action Reading Notes (Part 1)

Date: 2019-11-07


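Introduction

Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine, but an architecture for one. It provides a complete query engine and indexing engine, plus a partial text analysis engine (for two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it. (Excerpted from Baidu Baike)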

Code environment

Operating system: CentOS 5.8

IDE: Eclipse 4.3

Build tool: Maven 4.0

Maven configuration

To follow the examples in the book, the project depends on Lucene 3.0.1.

<dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>3.0.1</version>
        </dependency>
    </dependencies>

Full configuration:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.linjl.study.book</groupId>
    <artifactId>book_luceneInAction</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source />
                    <target />
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>3.0.1</version>
        </dependency>
    </dependencies>
</project>

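Example programs

The two examples below give a first introduction to Lucene.

Example 1: Building an index

Example 1 shows how to build an index over the .txt files in a given directory.

Full source: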

package com.linjl.study.book.luceneInAction.chapter1;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter indexWriter;

    public Indexer(String indexDir) throws IOException {
        // Step 1: create the Directory
        Directory dir = FSDirectory.open(new File(indexDir));
        // Step 2: create the IndexWriter
        indexWriter = new IndexWriter(dir, new StandardAnalyzer(
                Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws CorruptIndexException, IOException {
        // Step 5: close the IndexWriter
        indexWriter.close();
    }

    public int index(String dataDir, FileFilter fileFilter) throws IOException {
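        File[] files = new File(dataDir).listFiles();
        // listFiles() returns null when dataDir is missing or is not a directory
        if (files == null) {
            throw new IOException(dataDir + " is not a readable directory");
        }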
        for (File file : files) {
            if (!file.isDirectory() && !file.isHidden() && file.exists()
                    && file.canRead()
                    && (fileFilter == null || fileFilter.accept(file))) {
                indexFile(file);
            }
        }
        return indexWriter.numDocs();
    }

    private void indexFile(File file) throws IOException {
        System.out.println("Indexing " + file.getCanonicalPath());
        // Step 3: create the Document object
        Document doc = getDocument(file);
        // Step 4: add the Document to the index
        indexWriter.addDocument(doc);
    }

    protected Document getDocument(File file) throws IOException {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(file)));
        doc.add(new Field("filename", file.getName(), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", file.getCanonicalPath(), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        return doc;
    }

    private static class TextFilesFilter implements FileFilter {

        public boolean accept(File pathname) {
            return pathname.getName().toLowerCase().endsWith(".txt");
        }

    }

    public static void main(String[] strs) throws IOException {
        // Where the index is stored (a Linux path)
        String indexDir = "/opt/test/lucene/index";
        // Where the files to be indexed are stored (a Linux path)
        String dataDir = "/opt/test/lucene/files";
        long startTime = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (endTime - startTime) + "ms");
    }
}

Example 2: Searching the index

Example 2 shows how to run a keyword search against a given index directory.

Full source:

package com.linjl.study.book.luceneInAction.chapter1;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

    public static void search(String indexDir, String searchWord)
            throws IOException, ParseException {
        // Step 1: create the Directory
        Directory dir = FSDirectory.open(new File(indexDir));
        // Step 2: create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(dir);
        // Step 3: create the QueryParser
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_30));
        long startTime = System.currentTimeMillis();
        // Step 4: parse the search string into a Query object
        Query query = parser.parse(searchWord);
        // Step 5: run the search and get the results (only references to the matching documents)
        TopDocs hits = indexSearcher.search(query, 30);
        long endTime = System.currentTimeMillis();
        System.out.println("Found " + hits.totalHits + " document(s) (in "
                + (endTime - startTime) + "ms) that matched query '"
                + searchWord + "':");
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            // Step 6: load the matching document through its reference
            Document doc = indexSearcher.doc(scoreDoc.doc);
            System.out.println(doc.get("fullpath"));
        }
        // Step 7: close the IndexSearcher
        indexSearcher.close();
    }

    public static void main(String[] args) throws IOException, ParseException {
        String indexDir = "/opt/test/lucene/index";
        String searchWord = "牀";
        Searcher.search(indexDir, searchWord);
    }
}

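Core classes in the indexing process

IndexWriter
IndexWriter is the central component of the indexing process. It creates a new index or opens an existing one, and adds, deletes, or updates documents in it. It can only write to the index; it cannot read or search it.
Directory
Directory describes where a Lucene index is stored. It is an abstract class with many subclasses: the FSDirectory used in the examples keeps the index in the file system, and other subclasses keep it in memory or elsewhere.
Analyzer
Text must pass through an Analyzer before it is indexed. The analyzer extracts tokens from the text being indexed and discards everything else. Analyzer is an abstract class and Lucene ships several implementations, but they do not segment Chinese text very well. Better open-source Chinese analyzers are available online, such as IKAnalyzer and mmseg.
Document
A Document object represents a collection of fields (Field). Think of a Document as a virtual document, such as a web page, an e-mail message, or a text file, from which you can later retrieve data.
Field
Every document in an index contains one or more named fields, each represented by the Field class. Each field has a name, a corresponding value, and a set of options that control precisely how Lucene indexes that value. A document may have more than one field with the same name. In that case the values are appended in the order they were added, and at search time the text of all those fields is treated as if it were concatenated into a single text field.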
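The indexing classes above can be tried together in a few lines. The following is a minimal sketch, assuming lucene-core 3.0.1 is on the classpath as in the Maven configuration. The class name CoreClassesSketch, the field names, and the sample values are made up for illustration, and RAMDirectory stands in for the in-memory Directory subclass mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical class for illustration; not part of the book's code.
public class CoreClassesSketch {

    // Analyzer: run text through the analyzer and collect the tokens it emits.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> result = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        try {
            while (stream.incrementToken()) {
                result.add(term.term());
            }
        } finally {
            stream.close();
        }
        return result;
    }

    // Document and Field: two fields share the name "tag" (sample values are made up).
    static Document sampleDocument() {
        Document doc = new Document();
        doc.add(new Field("title", "Reading notes", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("tag", "lucene", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("tag", "search", Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    // Directory and IndexWriter: write the sample document into an in-memory index
    // and return the number of documents the writer reports.
    static int indexInMemory(Analyzer analyzer) throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            writer.addDocument(sampleDocument());
            return writer.numDocs();
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        System.out.println(tokens(analyzer, "The Quick Brown Fox"));
        System.out.println(sampleDocument().getValues("tag").length);
        System.out.println(indexInMemory(analyzer));
    }
}
```

With StandardAnalyzer the tokens come back lowercased and without the stop word "the", which is the "extract tokens, discard the rest" behaviour described under Analyzer. Swapping RAMDirectory for FSDirectory.open(new File(...)) should be the only change needed to persist the same index to disk.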

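Core classes in the search process

IndexSearcher
IndexSearcher searches the index that IndexWriter creates. It is the link to the index and exposes several search methods; think of it as a class that opens an index in read-only mode. It needs a Directory instance that holds a previously created index, and then offers a number of search methods. The simplest usage looks like this: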
```
Directory dir = FSDirectory.open(new File("/tmp/index"));
IndexSearcher searcher = new IndexSearcher(dir);
Query q = new TermQuery(new Term("contents","lucene"));
TopDocs hits = searcher.search(q,10);
searcher.close();
```
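Term
A Term is the basic unit of search. Like a Field, a Term consists of a pair of strings: the name of the field and the word to look for in it.
Query
Lucene ships many concrete Query subclasses.
TermQuery
TermQuery is the most basic query type Lucene provides, and one of the simplest. It matches documents that contain a given term in a given field.
TopDocs
TopDocs is a simple container of pointers to the top N ranked search results, that is, the documents that match the query. For each of the top N results it records the int docID (which can be used to retrieve the document) and a float score.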
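To see how these search classes fit together, including what TopDocs actually holds, here is a minimal sketch that builds a two-document in-memory index and runs a TermQuery against it. It assumes lucene-core 3.0.1 on the classpath. The class name SearchClassesSketch, the file names, and the sample sentences are invented for illustration.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical class for illustration; not part of the book's code.
public class SearchClassesSketch {

    // Builds one document with an analyzed "contents" field and a stored "filename".
    static Document doc(String filename, String contents) {
        Document d = new Document();
        d.add(new Field("filename", filename, Field.Store.YES, Field.Index.NOT_ANALYZED));
        d.add(new Field("contents", contents, Field.Store.NO, Field.Index.ANALYZED));
        return d;
    }

    // Builds a small in-memory index (sample data is made up).
    static Directory buildIndex() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            writer.addDocument(doc("a.txt", "Lucene is a search library"));
            writer.addDocument(doc("b.txt", "Maven builds Java projects"));
        } finally {
            writer.close();
        }
        return dir;
    }

    // Runs a TermQuery and returns the stored filenames of the top hits.
    static String[] search(Directory dir, String word) throws IOException {
        IndexSearcher searcher = new IndexSearcher(dir);
        try {
            Query query = new TermQuery(new Term("contents", word));
            TopDocs hits = searcher.search(query, 10);
            String[] names = new String[hits.scoreDocs.length];
            for (int i = 0; i < hits.scoreDocs.length; i++) {
                ScoreDoc scoreDoc = hits.scoreDocs[i];
                // scoreDoc.doc is only a reference; searcher.doc() loads the stored fields
                names[i] = searcher.doc(scoreDoc.doc).get("filename");
                System.out.println("docID=" + scoreDoc.doc + " score=" + scoreDoc.score
                        + " filename=" + names[i]);
            }
            return names;
        } finally {
            searcher.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Directory dir = buildIndex();
        search(dir, "lucene");
    }
}
```

A TermQuery is not passed through an analyzer, so the term has to match the indexed token exactly. In this sketch StandardAnalyzer lowercases the text at index time, so searching for "lucene" finds a.txt while "Lucene" finds nothing. The QueryParser used in Example 2 avoids this by analyzing the search string first.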

Summary

This post covers the material in Chapter 1 of Lucene in Action: two examples that give a first look at Lucene and how to use it.

(End. linjl, 2013-09-04, Shenzhen)
