Lucene Basics

Date: 2019-12-06

Tags: lucene, basics


Lucene is an efficient, Java-based full-text search library.

Documentation: http://lucene.apache.org/core/5_0_0/core/overview-summary.html

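Looking from the bottom up, it is easy to see that the index is the core of Lucene.

So what does a Lucene index look like?

Suppose we have 1000 documents, numbered 1 to 1000. From them we can build a structure that maps each word to the documents containing it.

The left side (the words) serves as the index, and the right side holds, for each word, a linked list of documents.

For example, the first row says that the word "lucene" appears in documents 2, 3, 10, 35, and 92.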
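The word-to-documents mapping described above can be sketched in plain Java. This is only an illustration of the idea, not how Lucene stores its index on disk; the class name `InvertedIndexSketch` and its methods are hypothetical, and the sketch assumes documents are added in ascending ID order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
final class InvertedIndexSketch {
    // term -> ascending list of IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Registers every whitespace-separated, lowercased word of text under docId.
    // Assumption: documents are added in ascending docId order.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Skip the duplicate when a term occurs more than once in one document.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Returns the IDs of the documents containing the term (empty if none).
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(2, "Lucene is a search library");
        index.add(3, "learning lucene");
        index.add(10, "lucene index");
        System.out.println(index.lookup("lucene")); // prints [2, 3, 10]
    }
}
```

Looking a word up is a single map access, whereas scanning all 1000 documents for it would cost time proportional to the total text size.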

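So how does Lucene build the index?

First, an analyzer (analysis) breaks the text down. Which of the following steps actually run depends on the analyzer you choose:

1 Split the document into individual words

2 Remove punctuation

3 Remove meaningless words (stop words): a, the, this

4 Lowercase the words

5 Reduce words to their base form: for example, convert past-tense and participle forms to the base form

6 Extract the common part: in teamwork, homework, housework, the part "work" can be extracted

The resulting set of terms is then indexed.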
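To make steps 1 to 4 concrete, here is a minimal sketch in plain Java. It does not use Lucene's analyzer classes; `AnalysisSketch`, `analyze`, and the tiny stop-word list are assumptions made for this example, and steps 5 and 6 (base forms and common parts) are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
final class AnalysisSketch {
    // Tiny illustrative stop-word list (an assumption for this example).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "this", "is", "to", "be"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Steps 1 and 2: split on anything that is not a letter or digit,
        // which also drops the punctuation.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Step 4: lowercase.
            String term = token.toLowerCase(Locale.ROOT);
            // Step 3: drop stop words.
            if (!STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is the text to be indexed."));
        // prints [text, indexed]
    }
}
```

The same function has to be applied to the query text as well, otherwise a query for "Indexed" would never match the stored term "indexed".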

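Index data structures

Lucene index files use the following basic types to store information:
Byte: the most basic type, 8 bits long.
UInt32: made up of 4 Bytes.
UInt64: made up of 8 Bytes.
VInt: a variable-length integer type that may span several Bytes. In each Byte, the low 7 bits hold part of the value, and the highest bit says whether another Byte follows: 0 means no, 1 means yes. Earlier Bytes hold the low-order bits of the value; later Bytes hold the high-order bits.
For example, 130 in binary is 1000 0010, which needs 8 bits, so a single Byte (7 value bits) cannot hold it and two Bytes are required. The first Byte holds the low 7 bits and sets its highest bit to 1 to signal that another Byte follows, giving (1) 0000010. The second Byte holds the 8th bit and sets its highest bit to 0 to signal that no Byte follows, giving (0) 0000001.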
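The VInt rule can be checked with a short sketch. It follows the description above (low 7 bits first, highest bit as a continuation flag) and is not Lucene's own code; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of Lucene.
final class VIntSketch {
    // Encodes value as a VInt: 7 value bits per byte, low-order groups first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // While bits above the low 7 remain, emit the low 7 bits
        // with the highest bit set to 1 ("another byte follows").
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        // Last byte: highest bit is 0 ("no more bytes").
        out.write(value);
        return out.toByteArray();
    }

    // Decodes a VInt that starts at the beginning of bytes.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // continuation bit is 0: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (byte b : encode(130)) {
            String bits = Integer.toBinaryString(b & 0xFF);
            System.out.println(String.format("%8s", bits).replace(' ', '0'));
        }
        // prints 10000010 then 00000001, matching the example above
    }
}
```

Small numbers, which are by far the most common in an index (for instance gaps between document IDs), take a single byte instead of the four a UInt32 would need.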

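OK, let's write some code and try it out.

First, let's think through the whole process, then follow the official documentation and get hands-on.

Indexing

1 You start with many documents or pieces of data that need to be stored

2 You specify a directory in which to build the index

3 You use an analyzer to parse and break down the documents or data to be indexed

4 The index is built from the simplified data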

--------------------------

Searching

1 Parse and break down the search text with the analyzer

2 Run the search

3 Read the returned results

pom.xml file

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>Test</groupId>
    <artifactId>Test</artifactId>
    <packaging>war</packaging>
    <version>0.0.1-SNAPSHOT</version>
    <name>Test Maven Webapp</name>
    <url>http://maven.apache.org</url>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.4</version>
        </dependency>
        <dependency>
            <groupId>javax</groupId>
            <artifactId>javaee-api</artifactId>
            <version>7.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>5.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>5.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>5.0.0</version>
        </dependency>
    </dependencies>
    
</project>

LuceneDemo.java (official demo)

package com.newtouchone.lucene;

import static org.junit.Assert.assertEquals;


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LuceneDemo {

    public static void main(String[] args) throws Exception {
        // Analyzer: processes the words of each document to reduce index size;
        // the same analyzer is also applied to the query terms
        Analyzer analyzer = new StandardAnalyzer();
        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead (Lucene 5.0 takes a java.nio.file.Path):
        // Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
        IndexWriterConfig config = new IndexWriterConfig(analyzer); // configure the writer with the analyzer
        IndexWriter iwriter = new IndexWriter(directory, config); // IndexWriter writes documents into the index

        Document doc = new Document(); // create a document
        String text = "This is the text to be indexed."; // the content to store
        doc.add(new Field("body", text, TextField.TYPE_STORED)); // a field of the document: body -> content
        doc.add(new Field("title", "first", TextField.TYPE_STORED)); // title -> first
        iwriter.addDocument(doc); // add the document to the index

        String text2 = "learning lucene"; // second document, same as above
        Document doc2 = new Document();
        doc2.add(new Field("body", text2, TextField.TYPE_STORED));
        doc2.add(new Field("title", "second", TextField.TYPE_STORED)); // fixed: was added to doc instead of doc2
        iwriter.addDocument(doc2);

        iwriter.close(); // close the writer so the documents are committed

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory); // open the index
        IndexSearcher isearcher = new IndexSearcher(ireader); // create the searcher
        // Parse a simple query that searches the "body" field
        // (the documents defined here have two fields: title and body)
        QueryParser parser = new QueryParser("body", analyzer);
        Query query = parser.parse("indexed"); // find documents whose body contains "indexed"
        ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs; // top hits, at most 1000

        assertEquals(1, hits.length); // only the first document matches

        // Iterate through the results:
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc); // load the stored document by its internal ID
            // A document can have several fields; here each one has title and body
            for (IndexableField indexableField : hitDoc.getFields()) {
                System.out.println(indexableField.stringValue()); // print the field's stored value
            }
        }
        ireader.close();
        directory.close();
    }

}

Output

This is the text to be indexed.
first

These are my study notes based on the official documentation. Lucene experience +1.
