Lucene

Date: 2019-12-19


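1. What Is Full-Text Search

Types of data

The data we deal with falls into two broad categories: structured and unstructured.

Structured data: data with a fixed format or bounded length, such as database records and metadata.

Unstructured data: data with no fixed length or format, such as emails, Word documents, and other files on disk.

Searching structured data

The most familiar structured data is the data in a database. Searching it is straightforward: you run an SQL query and usually get results quickly.

Why is searching a database easy?

Because the data is stored in a regular layout: it has rows and columns, and the format and length of each column are fixed.

Searching unstructured data

Sequential scanning (Serial Scanning)

To find the files that contain a given string, a sequential scan reads the documents one by one, each from start to finish. A document that contains the string is a match; the scan then moves on to the next file and continues until every file has been read. The Windows file search can search file contents this way, but it is quite slow.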
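
A minimal sketch of sequential scanning in plain Java, with no Lucene involved. The class name, method name, and sample documents are made up for illustration. Every query reads every document in full, which is why the cost grows with the size of the collection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SequentialScanSketch {
    // Returns the names of all documents whose text contains the keyword.
    // Each call reads every document from start to finish.
    public static List<String> scan(Map<String, String> docs, String keyword) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            if (entry.getValue().contains(keyword)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical in-memory "files": name -> content
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "Lucene is a search library");
        docs.put("b.txt", "Spring is a framework");
        System.out.println(scan(docs, "search")); // prints [a.txt]
    }
}
```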

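Full-text search (Full-text Search)

Extract part of the information from the unstructured data and reorganize it so that it has some structure, then search that structured data instead; this makes searching considerably faster. The information extracted from the unstructured data and reorganized is called the index.

A dictionary is a good example. Its pinyin table and radical lookup table act as the dictionary's index. The explanation of each character is unstructured; without the tables, finding a character would mean scanning the whole dictionary in order. Some properties of a character can be extracted and structured, though. Pronunciation is one: it is made up of an initial and a final, each drawn from a small set that can be enumerated, so the pronunciations can be listed in a fixed order, with each entry pointing to the page that holds the character's full explanation. To look up a character, we search the structured pinyin table for its pronunciation and follow the page number to the unstructured data, that is, the explanation of the character.

This process of first building an index and then searching the index is called full-text search (Full-text Search).

Building the index takes time, but once built it can be used over and over. Full-text search is mostly about answering queries, so the up-front cost of indexing is worth paying.

How to implement full-text search

Lucene can be used to implement full-text search. Lucene is an open-source full-text search toolkit from Apache. It provides a complete query engine and indexing engine, plus part of a text analysis engine. Its goal is to give developers a simple, easy-to-use toolkit for adding full-text search to their own systems.

Where full-text search is used

Full-text search suits large volumes of data without a fixed structure: web search engines such as Baidu and Google, forum search, site search on e-commerce sites, and so on.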

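2. How Lucene Implements Full-Text Search

2.1 Indexing and search flow

1. Indexing (shown in green in the original flow diagram) builds an index over the raw content to be searched. The steps are: determine the raw content to be searched -> acquire documents -> create documents -> analyze documents -> index documents.

2. Searching (shown in red in the original flow diagram) retrieves content from the index. The steps are: the user submits a query through the search interface -> create a query -> run the search against the index -> render the results.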

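2.2 Creating the index

Indexing processes the content of the documents the user wants to search and stores the result in an index (index).

In this example the documents are text files on disk. The requirement is that every file whose name or content contains the keyword must be found, so both the file name and the file content are indexed.

2.2.1 Acquiring the raw documents

Raw documents are the content to be indexed and searched: web pages on the internet, data in a database, files on disk, and so on.

In this example the raw content is a set of files on disk.

Gathering the raw information from the internet, a database, or a file system is called information acquisition; its purpose is to make the raw content available for indexing.

Software that collects information from the internet is usually called a crawler or spider, or sometimes a web robot. A crawler visits web pages and stores the content it retrieves.

Here we need the content of files on disk. Text files can be read through a file stream; for PDF, DOC, XLS and similar formats, a third-party parsing library can extract the content, for example Apache POI for DOC and XLS files.

2.2.2 Creating document objects

The raw content is acquired so that it can be indexed. Before indexing, it has to be turned into documents (Document). A document consists of fields (Field), and the fields hold the content.

Here each file on disk becomes one Document with the fields file_name (file name), file_path (file path), file_size (file size), and file_content (file content).

Note: a Document can have many Fields, different Documents can have different Fields, and one Document can contain the same Field more than once (same field name and same field value).

Every document has a unique number, the document id.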
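
The document model described above can be pictured with a small plain-Java sketch. DocumentSketch and FieldSketch are hypothetical stand-ins, not Lucene classes; they only illustrate that a document is an ordered list of named fields and that a field name may appear more than once.

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentSketch {
    // One (name, value) pair; a hypothetical stand-in for a field.
    static final class FieldSketch {
        final String name;
        final String value;

        FieldSketch(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final List<FieldSketch> fields = new ArrayList<>();

    // Adding never replaces: the same name (and value) may be added repeatedly.
    public void add(String name, String value) {
        fields.add(new FieldSketch(name, value));
    }

    // Value of the first field with this name, or null if there is none.
    public String get(String name) {
        for (FieldSketch f : fields) {
            if (f.name.equals(name)) return f.value;
        }
        return null;
    }

    // All values added under this name, in insertion order.
    public List<String> getValues(String name) {
        List<String> values = new ArrayList<>();
        for (FieldSketch f : fields) {
            if (f.name.equals(name)) values.add(f.value);
        }
        return values;
    }

    public static void main(String[] args) {
        DocumentSketch doc = new DocumentSketch();
        doc.add("file_name", "spring.txt");
        doc.add("file_path", "c:/temp/spring.txt");
        doc.add("file_name", "spring.txt"); // same name and value again is allowed
        System.out.println(doc.getValues("file_name")); // prints [spring.txt, spring.txt]
    }
}
```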

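2.2.3 Analyzing documents

Once the raw content is a document (Document) made of fields (Field), the content of the fields is analyzed. Analysis extracts words from the text, converts letters to lowercase, removes punctuation, removes stop words, and so on, and produces the final tokens. A token can be thought of as a single word.

For example, the document below is analyzed as follows.

Original document content: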

Lucene is a Java full-text search engine.  Lucene is not a complete
application, but rather a code library and API that can easily be used
to add search capabilities to applications.

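Tokens after analysis:

lucene, java, full, text, search, engine, ...

Each word is called a Term. The same word taken from different fields gives different terms: a term has two parts, the name of the field and the text of the word.

For example, apache in the file name and apache in the file content are different terms.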
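
The analysis steps can be sketched in a few lines of plain Java. This is an illustration only, not how Lucene's analyzers are implemented: the stop-word list is a small made-up one, and AnalysisSketch, analyze, and term are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalysisSketch {
    // A tiny, made-up stop-word list; real analyzers ship their own.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "not", "the", "to"));

    // Splits on anything that is not a letter or digit (which drops punctuation),
    // lowercases each piece, and discards stop words.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A term is the pair (field name, token text), written here as "field:text".
    public static String term(String field, String token) {
        return field + ":" + token;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a Java full-text search engine."));
        // prints [lucene, java, full, text, search, engine]
        System.out.println(term("name", "apache").equals(term("content", "apache")));
        // prints false: same word, different field, different term
    }
}
```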

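2.2.4 Creating the index

The tokens produced by analyzing all documents are indexed. The index exists to serve search: a query looks up only the indexed tokens and uses them to find the matching Documents.

Note: the index is built over tokens, so that documents are found by word. This structure is called an inverted index.

The traditional approach goes from a file to its content and matches the keyword against that content. This is sequential scanning, and it becomes slow as the data grows.

An inverted index goes the other way, from content (words) to documents.

An inverted index, also called a reverse index, has two parts: the index and the documents. The index is the vocabulary (the term dictionary) and is relatively small; the document collection is much larger.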
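
A minimal inverted index can be sketched with a sorted map from each word to the list of document ids that contain it. This is a toy model under simple assumptions (ASCII words, each document added once, ids added in increasing order); the names are hypothetical and it says nothing about Lucene's actual on-disk format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // Term dictionary: word -> posting list (ids of the documents containing it)
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Indexes one document. Assumes documents are added in increasing id order.
    public void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) continue;
            List<Integer> ids = postings.computeIfAbsent(word, k -> new ArrayList<>());
            // Record each document at most once per word.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
    }

    // Looks the word up in the dictionary; no document text is read at query time.
    public List<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene is a search library");
        index.add(1, "Spring is a framework");
        index.add(2, "Lucene and Spring");
        System.out.println(index.search("lucene")); // prints [0, 2]
    }
}
```

The index is built once in add and can then answer any number of search calls, which is the trade-off described in section 1.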

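2.3 Querying the index

Querying the index is the search step: the user enters keywords, the keywords are looked up in the index (index), and the index leads to the matching documents and so to the content being searched for (here, files on disk).

2.3.1 User search interface

A full-text search system provides an interface where users submit keywords and see the results when the search completes.

Lucene does not provide any UI for this; the search interface has to be built to suit the application.

2.3.2 Creating the query

Before a search runs, the user's keywords are turned into a query object. The query object specifies the Field to search, the keywords, and so on, and it generates the concrete query syntax.

For example, the syntax "fileName:lucene" searches for documents whose fileName field contains "lucene".

2.3.3 Running the query

The search works as follows:

Using the query syntax, look up each search term in the term dictionary of the inverted index, and from there follow the link to the list of documents for that term.

For example, the query "fileName:lucene" finds the documents whose fileName field contains Lucene.

The search looks in the index for the term with field fileName and text Lucene, and uses that term to obtain the list of document ids.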
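
The lookup for a query such as fileName:lucene can be sketched as follows. The index is modeled as a plain map from "field:text" to document ids; TermLookupSketch and its methods are hypothetical names, and lowercasing the query text is an assumption that mirrors an index built from lowercased tokens.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TermLookupSketch {
    // Splits "field:text" into its two parts and lowercases the text.
    public static String[] parse(String query) {
        int colon = query.indexOf(':');
        if (colon <= 0 || colon == query.length() - 1) {
            throw new IllegalArgumentException("expected field:text but got " + query);
        }
        String field = query.substring(0, colon);
        String text = query.substring(colon + 1).toLowerCase(Locale.ROOT);
        return new String[] {field, text};
    }

    // Finds the term in the dictionary and returns its list of document ids.
    public static List<Integer> search(Map<String, List<Integer>> index, String query) {
        String[] parts = parse(query);
        return index.getOrDefault(parts[0] + ":" + parts[1], Collections.emptyList());
    }

    public static void main(String[] args) {
        // Hypothetical index content: term -> document ids
        Map<String, List<Integer>> index = new HashMap<>();
        index.put("fileName:lucene", Arrays.asList(1, 3));
        index.put("content:lucene", Arrays.asList(2));
        System.out.println(search(index, "fileName:Lucene")); // prints [1, 3]
    }
}
```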

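2.3.4 Rendering the results

The results are presented to the user in a friendly interface, where the user picks out the information they want. To help users find what they need quickly, many presentation aids are used, such as highlighting the keywords in the results or Baidu's cached snapshots.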

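3. Setting Up the Development Environment

3.1 Downloading Lucene

Lucene is a toolkit for building full-text search. Download lucene-7.4.0 from the official site and unpack it.

Official site: http://lucene.apache.org/

Version: lucene-7.4.0

JDK requirement: 1.8 or later

3.2 Required jars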

lucene-core-7.4.0.jar

lucene-analyzers-common-7.4.0.jar

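4. Getting Started

4.1 Requirements

Implement a file search: given a keyword, find every file whose name or content contains it. Searching by Chinese words must also work, and queries with multiple conditions must be supported.

4.2 Creating the index

4.2.1 Steps

Step 1: Create a Java project and add the jars.

Step 2: Create an IndexWriter object.

Specify a Directory object for the location of the index.
Specify an IndexWriterConfig object.

Step 3: Create a Document object.

Step 4: Create Field objects and add them to the Document.

Step 5: Write the Document to the index with the IndexWriter. This is where the index is built: both the index entries and the Document are written to the index directory.

Step 6: Close the IndexWriter.

4.2.2 Code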

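import java.io.File;

import org.apache.commons.io.FileUtils;        // Apache Commons IO (commons-io jar)
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;   // IK Analyzer, a third-party jar (see 5.2.2)

public class LuceneFirst {
    @Test
    public void createIndex() throws Exception {
        // 1. Create a Directory object that specifies where the index is kept.
        // To keep the index in memory:
        //Directory directory = new RAMDirectory();
        // To keep the index on disk:
        Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
        // 2. Create an IndexWriter on top of the Directory
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        IndexWriter indexWriter = new IndexWriter(directory, config);
        // 3. Read the files on disk and create one Document per file
        File dir = new File("C:\\A0.lucene2018\\05.參考資料\\searchsource");
        File[] files = dir.listFiles();
        for (File f : files) {
            // File name
            String fileName = f.getName();
            // File path
            String filePath = f.getPath();
            // File content
            String fileContent = FileUtils.readFileToString(f, "utf-8");
            // File size
            long fileSize = FileUtils.sizeOf(f);
            // 4. Create the fields
            // TextField is an analyzed text field.
            // Arguments: field name, field value, whether to store the value
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            //Field fieldPath = new TextField("path", filePath, Field.Store.YES);
            Field fieldPath = new StoredField("path", filePath);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            //Field fieldSize = new TextField("size", fileSize + "", Field.Store.YES);
            Field fieldSizeValue = new LongPoint("size", fileSize);
            Field fieldSizeStore = new StoredField("size", fileSize);
            // Create the Document
            Document document = new Document();
            // Add the fields to the Document
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            //document.add(fieldSize);
            document.add(fieldSizeValue);
            document.add(fieldSizeStore);
            // 5. Write the Document to the index
            indexWriter.addDocument(document);
        }
        // 6. Close the IndexWriter
        indexWriter.close();
    }
}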

4.2.3 Viewing the index with Luke

We use luke-7.4.0, which matches the Lucene version and can open an index created by Lucene 7.4.0. Note that this build of Luke was compiled with JDK 9, so JDK 9 is required to run it.

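4.3 Querying the index

4.3.1 Steps

Step 1: Create a Directory object that points to the location of the index.

Step 2: Create an IndexReader object from the Directory.

Step 3: Create an IndexSearcher object from the IndexReader.

Step 4: Create a TermQuery object, specifying the field and the keyword to search for.

Step 5: Run the query.

Step 6: Get the results, iterate over them, and print them.

Step 7: Close the IndexReader.

4.3.2 Code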

// Another test method of the LuceneFirst class
@Test
public void searchIndex() throws Exception {
    // 1. Create a Directory object that points to the index
    Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
    // 2. Create an IndexReader
    IndexReader indexReader = DirectoryReader.open(directory);
    // 3. Create an IndexSearcher; the constructor takes the IndexReader
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // 4. Create a Query object, here a TermQuery
    Query query = new TermQuery(new Term("name", "spring"));
    // 5. Run the query and get a TopDocs object
    // Arguments: the query, the maximum number of results to return
    TopDocs topDocs = indexSearcher.search(query, 10);
    // 6. Total number of matching documents
    System.out.println("Total hits: " + topDocs.totalHits);
    // 7. Get the list of matching documents
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    // 8. Print the stored fields of each document
    for (ScoreDoc doc : scoreDocs) {
        // Document id
        int docId = doc.doc;
        // Load the Document by id
        Document document = indexSearcher.doc(docId);
        System.out.println(document.get("name"));
        System.out.println(document.get("path"));
        System.out.println(document.get("size"));
        //System.out.println(document.get("content"));
        System.out.println("-----------------");
    }
    // 9. Close the IndexReader
    indexReader.close();
}

5. Analyzers

5.1 What an analyzer produces

@Test
public void testTokenStream() throws Exception {
    // 1) Create an Analyzer, here a StandardAnalyzer
    Analyzer analyzer = new StandardAnalyzer();
    //Analyzer analyzer = new IKAnalyzer(); // use the IK analyzer instead
    // 2) Get a TokenStream from the analyzer's tokenStream method
    //TokenStream tokenStream = analyzer.tokenStream("", "2017年12月14日 - 傳智播客Lucene概述公安局Lucene是一款高性能的、可擴展的信息檢索(IR)工具庫。信息檢索是指文檔搜索、文檔內信息搜索或者文檔相關的元數據搜索等操做。");
    TokenStream tokenStream = analyzer.tokenStream("", "The Spring Framework provides a comprehensive programming and configuration model.");
    // 3) Add a CharTermAttribute to the TokenStream; it acts like a pointer to the current token
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // 4) Call reset() on the TokenStream; skipping this call causes an exception
    tokenStream.reset();
    // 5) Iterate over the TokenStream with a while loop
    while (tokenStream.incrementToken()) {
        System.out.println(charTermAttribute.toString());
    }
    // 6) Signal the end of the stream, then close the TokenStream
    tokenStream.end();
    tokenStream.close();
}

Output:
spring
framework
provides
comprehensive
programming
configuration
model

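5.2 Chinese analyzers

5.2.1 Chinese analyzers bundled with Lucene

1. StandardAnalyzer

Single-character tokenization: Chinese text is split one character at a time. For example, "我愛中國" becomes "我", "愛", "中", "國".

2. SmartChineseAnalyzer

Handles Chinese reasonably well but is hard to extend: custom dictionaries, stop-word dictionaries, synonym dictionaries and the like are awkward to manage.

5.2.2 IKAnalyzer

Usage:

Step 1: Add the jar to the project.

Step 2: Put the configuration file, the extension dictionary, and the stop-word dictionary on the classpath.

Note: hotword.dic and ext_stopword.dic must be encoded as UTF-8 without a BOM.

That means the dictionary files must not be edited with Windows Notepad, which may add a BOM when saving as UTF-8.

5.3 Using a custom analyzer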

@Test
public void createIndex() throws Exception {
    // 1. Create a Directory object that specifies where the index is kept.
    // To keep the index in memory:
    //Directory directory = new RAMDirectory();
    // To keep the index on disk:
    Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
    // 2. Create an IndexWriter on top of the Directory
    // With the no-argument IndexWriterConfig, Lucene's bundled StandardAnalyzer is used:
    //IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig());
    IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer()); // use the IK analyzer
    IndexWriter indexWriter = new IndexWriter(directory, config);
    // 3. Read the files on disk and create one Document per file
    File dir = new File("C:\\A0.lucene2018\\05.參考資料\\searchsource");
    File[] files = dir.listFiles();
    for (File f : files) {
        // File name
        String fileName = f.getName();
        // File path
        String filePath = f.getPath();
        // File content
        String fileContent = FileUtils.readFileToString(f, "utf-8");
        // File size
        long fileSize = FileUtils.sizeOf(f);
        // 4. Create the fields. Different kinds of data call for different field types:
        // name and content can use TextField;
        // path is just a path and does not need to be tokenized.
        // TextField arguments: field name, field value, whether to store the value
        Field fieldName = new TextField("name", fileName, Field.Store.YES);
        //Field fieldPath = new TextField("path", filePath, Field.Store.YES);
        Field fieldPath = new StoredField("path", filePath);
        Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
        // The file size may be used in comparisons (larger than X, smaller than Y),
        // so a TextField is not a good fit for it.
        //Field fieldSize = new TextField("size", fileSize + "", Field.Store.YES);
        Field fieldSizeValue = new LongPoint("size", fileSize);
        Field fieldSizeStore = new StoredField("size", fileSize);
        // Create the Document
        Document document = new Document();
        // Add the fields to the Document
        document.add(fieldName);
        document.add(fieldPath);
        document.add(fieldContent);
        //document.add(fieldSize);
        document.add(fieldSizeValue);
        document.add(fieldSizeStore);
        // 5. Write the Document to the index
        indexWriter.addDocument(document);
    }
    // 6. Close the IndexWriter
    indexWriter.close();
}

Demo: using the analyzer bundled with Lucene

@Test
public void testTokenStream() throws Exception {
    // 1) Create an Analyzer, here a StandardAnalyzer
    Analyzer analyzer = new StandardAnalyzer();
    //Analyzer analyzer = new IKAnalyzer(); // use the IK analyzer instead
    // 2) Get a TokenStream from the analyzer's tokenStream method
    TokenStream tokenStream = analyzer.tokenStream("", "2017年12月14日 - 傳智播客Lucene概述公安局Lucene是一款高性能的、可擴展的信息檢索(IR)工具庫。信息檢索是指文檔搜索、文檔內信息搜索或者文檔相關的元數據搜索等操做。");
    //TokenStream tokenStream = analyzer.tokenStream("", "The Spring Framework provides a comprehensive programming and configuration model.");
    // 3) Add a CharTermAttribute to the TokenStream; it acts like a pointer to the current token
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    // 4) Call reset() on the TokenStream; skipping this call causes an exception
    tokenStream.reset();
    // 5) Iterate over the TokenStream with a while loop
    while (tokenStream.incrementToken()) {
        System.out.println(charTermAttribute.toString());
    }
    // 6) Signal the end of the stream, then close the TokenStream
    tokenStream.end();
    tokenStream.close();
}

Output (the program prints one token per line; the tokens are shown here separated by spaces):
2017 年 12 月 14 日 傳 智 播 客 lucene 概 述 公 安 局 lucene
是 一 款 高 性 能 的 可 擴 展 的 信 息 檢 索 ir 工 具 庫
信 息 檢 索 是 指 文 檔 搜 索 文 檔 內 信 息 搜 索
或 者 文 檔 相 關 的 元 數 據 搜 索 等 操 做

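Using the IK analyzer

Output (two log lines from IK, then the tokens; the program prints one token per line, shown here separated by spaces):
加載擴展詞典：hotword.dic
加載擴展中止詞典：stopword.dic
2017 12 14 傳智播客 lucene 概述 lucene 一款 一 高性能 性能 可 擴展
信息 檢索 ir 工具 庫 信息 檢索 文檔 搜索 文檔 內 信息 搜索
或者 文檔 相關 數據 搜索 操做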

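6. Maintaining the Index

6.1 Adding to the index

6.1.1 Field properties

Analyzed: whether the content of the field is tokenized. This only matters for fields whose content will be searched.

Indexed: whether the tokens produced by analysis, or the whole Field value, are indexed. Only indexed content can be searched.

For example, product names and product descriptions are analyzed and then indexed; order numbers and ID card numbers are not analyzed but are still indexed, because all of these will be used as search criteria.

Stored: whether the Field value is stored in the document. Only stored Fields can be read back from a Document.

For example, product names and order numbers: any Field that will later be read from the Document must be stored.

The rule of thumb for storing: will the content be shown to the user?

Storing has no effect on searching; it only determines whether the value can be retrieved. A field that is not stored can still be searched through the index: a query such as content:keyword still finds the documents whose content field contains the keyword.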

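Field class	Data type	Analyzed	Indexed	Stored	Notes
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y or N	Builds a string Field that is not analyzed; the whole string is indexed as a single term (order numbers, names, and so on). Store.YES or Store.NO decides whether the value is stored in the document.
LongPoint / IntPoint / FloatPoint(String name, long... point)	Numeric (long, int, float)	N	Y	N	Indexes numeric values as points so that they can be matched, for example in range queries; the value does not go through the text analyzer. The value is not stored; add a StoredField with the same name if it must be retrievable.
StoredField(FieldName, FieldValue)	Overloaded, supports many types	N	N	Y	Builds Fields of various types that are neither analyzed nor indexed, only stored in the document.
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or Reader	Y	Y	Y or N	Analyzed and indexed. When given a Reader, Lucene assumes the content is large and does not store it.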

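My own reasoning: if a field is analyzed, it should also be indexed, since otherwise the analysis serves no purpose. A field can be indexed without being analyzed (StringField): an ID card number, for example, does not need tokenizing but may need to be searched, so it must be indexed.

LongPoint(String name, long... point): the point value is used only for matching (for example in range queries); it is not stored.

StoredField is not analyzed and not indexed, only stored; path is a good candidate.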
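
The difference between indexed and stored can be made concrete with a toy model in plain Java. It keeps two separate structures, one for searching and one for retrieval; the class, its methods, and the boolean flags are hypothetical and only mimic the behavior described above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StoredVsIndexedSketch {
    // "field:token" -> ids of documents containing that term (the searchable part)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> stored field values (the retrievable part)
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    public void addField(int docId, String field, String value, boolean indexed, boolean store) {
        if (indexed) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
        }
    }

    // Searching only consults the postings.
    public Set<Integer> search(String field, String token) {
        return postings.getOrDefault(field + ":" + token, Collections.emptySet());
    }

    // Retrieval only consults the stored values; null if the field was not stored.
    public String get(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    public static void main(String[] args) {
        StoredVsIndexedSketch index = new StoredVsIndexedSketch();
        index.addField(0, "content", "full text search", true, false); // indexed, not stored
        index.addField(0, "path", "c:/temp/a.txt", false, true);       // stored, not indexed
        System.out.println(index.search("content", "search")); // prints [0]
        System.out.println(index.get(0, "content"));           // prints null
        System.out.println(index.get(0, "path"));              // prints c:/temp/a.txt
    }
}
```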

@Test
public void createIndex() throws Exception {
    // 1. Create a Directory object that specifies where the index is kept.
    // To keep the index in memory:
    //Directory directory = new RAMDirectory();
    // To keep the index on disk:
    Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
    // 2. Create an IndexWriter on top of the Directory
    // With the no-argument IndexWriterConfig, Lucene's bundled StandardAnalyzer is used:
    //IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig());
    IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer()); // use the IK analyzer
    IndexWriter indexWriter = new IndexWriter(directory, config);
    // 3. Read the files on disk and create one Document per file
    File dir = new File("C:\\A0.lucene2018\\05.參考資料\\searchsource");
    File[] files = dir.listFiles();
    for (File f : files) {
        // File name
        String fileName = f.getName();
        // File path
        String filePath = f.getPath();
        // File content
        String fileContent = FileUtils.readFileToString(f, "utf-8");
        // File size
        long fileSize = FileUtils.sizeOf(f);
        // 4. Create the fields. Different kinds of data call for different field types:
        // name and content can use TextField.
        // TextField arguments: field name, field value, whether to store the value
        Field fieldName = new TextField("name", fileName, Field.Store.YES);
        // path does not need to be tokenized, so TextField is unnecessary for it
        //Field fieldPath = new TextField("path", filePath, Field.Store.YES);
        Field fieldPath = new StoredField("path", filePath); // a StoredField is always stored
        Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
        // The file size may be used in comparisons (larger than X, smaller than Y),
        // so a TextField is not a good fit for it.
        //Field fieldSize = new TextField("size", fileSize + "", Field.Store.YES);
        Field fieldSizeValue = new LongPoint("size", fileSize);   // used for matching only
        Field fieldSizeStore = new StoredField("size", fileSize); // stores the value separately
        // Create the Document
        Document document = new Document();
        // Add the fields to the Document
        document.add(fieldName);
        document.add(fieldPath);
        document.add(fieldContent);
        //document.add(fieldSize);
        document.add(fieldSizeValue);
        document.add(fieldSizeStore);
        // 5. Write the Document to the index
        indexWriter.addDocument(document);
    }
    // 6. Close the IndexWriter
    indexWriter.close();
}

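6.2 Adding, deleting, and updating documents

Adding documents

Deleting documents:

1) Delete all documents

2) Delete by query (Query... queries) or by term (Term... terms)

Updating documents

How an update works: delete first, then add.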
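
The delete-then-add behavior can be sketched with a plain list of documents. UpdateSketch and its helpers are hypothetical names, and matching on lowercase words is a simplifying assumption. The point to notice is that an update removes every document containing the term and adds exactly one replacement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class UpdateSketch {
    // Each document is a map of field name -> value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    public void add(Map<String, String> doc) {
        docs.add(doc);
    }

    public int size() {
        return docs.size();
    }

    // True if the field value, split into lowercase words, contains the token.
    private static boolean hasTerm(Map<String, String> doc, String field, String token) {
        String value = doc.get(field);
        if (value == null) return false;
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\W+")).contains(token);
    }

    // Removes every document that contains the term; returns how many were removed.
    public int deleteByTerm(String field, String token) {
        int before = docs.size();
        docs.removeIf(d -> hasTerm(d, field, token));
        return before - docs.size();
    }

    // Update = delete all documents matching the term, then add the replacement.
    public void update(String field, String token, Map<String, String> replacement) {
        deleteByTerm(field, token);
        add(replacement);
    }

    // Convenience helper: a document with only a "name" field.
    public static Map<String, String> docNamed(String name) {
        Map<String, String> doc = new HashMap<>();
        doc.put("name", name);
        return doc;
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.add(docNamed("spring mvc"));
        index.add(docNamed("spring boot"));
        index.add(docNamed("lucene intro"));
        index.update("name", "spring", docNamed("updated document"));
        System.out.println(index.size()); // prints 2: two deleted, one added
    }
}
```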

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Index maintenance: add, delete, update
 */
public class IndexManager {

    private IndexWriter indexWriter;

    @Before
    public void init() throws Exception {
        // Create an IndexWriter that uses IKAnalyzer as its analyzer
        indexWriter =
                new IndexWriter(FSDirectory.open(new File("C:\\temp\\index").toPath()),
                        new IndexWriterConfig(new IKAnalyzer()));
    }

    @Test
    public void addDocument() throws Exception {
        // The IndexWriter from init() is reused here. Opening a second IndexWriter
        // on the same directory while the first is still open fails on the write lock.
        // Create a Document
        Document document = new Document();
        // Add fields to the Document
        document.add(new TextField("name", "新添加的文件", Field.Store.YES));
        document.add(new TextField("content", "新添加的文件內容", Field.Store.NO));
        document.add(new StoredField("path", "c:/temp/helo"));
        // Write the Document to the index. In Lucene, different documents may have
        // different fields; they do not all need the same set.
        indexWriter.addDocument(document);
        // Close the index
        indexWriter.close();
    }

    @Test
    public void deleteAllDocument() throws Exception {
        // Delete all documents
        indexWriter.deleteAll();
        // Close the index
        indexWriter.close();
    }

    @Test
    public void deleteDocumentByQuery() throws Exception {
        // Delete by condition: remove documents whose name field contains the term apache
        indexWriter.deleteDocuments(new Term("name", "apache"));
        indexWriter.close();
    }

    @Test
    public void updateDocument() throws Exception {
        // Create a new Document
        Document document = new Document();
        // Add fields to the Document
        document.add(new TextField("name", "更新以後的文檔", Field.Store.YES));
        document.add(new TextField("name1", "更新以後的文檔2", Field.Store.YES));
        document.add(new TextField("name2", "更新以後的文檔3", Field.Store.YES));
        // Update: delete documents whose name field contains spring, then add the new one
        indexWriter.updateDocument(new Term("name", "spring"), document);
        // Close the index
        indexWriter.close();
    }

}

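7. Querying the Index with Lucene

To search, create a Query object describing what to look for. Lucene turns the Query object into its final query syntax. Just as relational databases have SQL, Lucene has its own query syntax; for example, "name:lucene" finds documents whose name Field contains the term "lucene".

A query object can be created in two ways:

Method 1: use one of the Query subclasses provided by Lucene

1. TermQuery

Searches by term; the term to search for must be specified.

2. Range query

Searches a numeric range. In Lucene 7 it is created with LongPoint.newRangeQuery rather than a RangeQuery class: matching uses the LongPoint field, and the value that is read back comes from the StoredField.

Method 2: use QueryParser to parse a query expression

The query text is analyzed first, and the search runs on the resulting tokens.

7.1 TermQuery

TermQuery searches by term. It does not use an analyzer, so it is best suited to Fields that are not tokenized, such as order numbers or category IDs.

Specify the field and the term to search for (a single term only, not a sentence).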

// Search with a TermQuery
@Test
public void testTermQuery() throws Exception {
    // Same index location as in the indexing code
    Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    // Create the query
    Query query = new TermQuery(new Term("content", "lucene"));
    // Run the query
    TopDocs topDocs = indexSearcher.search(query, 10);
    // Number of matching documents
    System.out.println("Total hits: " + topDocs.totalHits);
    // Iterate over the results
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        Document document = indexSearcher.doc(scoreDoc.doc);
        // The file name was indexed under the field "name"
        System.out.println(document.get("name"));
        //System.out.println(document.get("content"));
        System.out.println(document.get("path"));
        System.out.println(document.get("size"));
    }
    // Close the IndexReader
    indexSearcher.getIndexReader().close();
}
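7.2 Numeric range queries and QueryParser queries

A Query can also be created with QueryParser. QueryParser provides a parse method that builds the query directly from query syntax. To see the query syntax a Query object will run, print it with System.out.println(query);.

QueryParser needs an analyzer. The analyzer used for searching should be the same one that was used to build the index.

QueryParser requires an additional jar: lucene-queryparser-7.4.0.jar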
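
The idea of analyzing the query text first and then searching on the resulting tokens can be sketched in plain Java. The sketch assumes that a document matches if it contains any of the tokens, and it uses a simple word split in place of a real analyzer; ParsedQuerySketch and searchAny are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ParsedQuerySketch {
    // Analyzes the query text into tokens, then unions the posting lists
    // of all tokens: a document matches if it contains any of them.
    public static Set<Integer> searchAny(Map<String, List<Integer>> postings, String queryText) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : queryText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                hits.addAll(postings.getOrDefault(token, Collections.emptyList()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical postings for one field: token -> document ids
        Map<String, List<Integer>> postings = new HashMap<>();
        postings.put("lucene", Arrays.asList(0, 2));
        postings.put("java", Arrays.asList(1));
        System.out.println(searchAny(postings, "Lucene Java toolkit")); // prints [0, 1, 2]
    }
}
```

Because the query text goes through the same kind of tokenization as the indexed text, a whole sentence can be used as input, which a single-term lookup cannot handle.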

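import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class SearchIndex {
    private IndexReader indexReader;
    private IndexSearcher indexSearcher;

    @Before
    public void init() throws Exception {
        indexReader = DirectoryReader.open(FSDirectory.open(new File("C:\\temp\\index").toPath()));
        indexSearcher = new IndexSearcher(indexReader);
    }

    private void printResult(Query query) throws Exception {
        // Run the query
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc doc : scoreDocs) {
            // Document id
            int docId = doc.doc;
            // Load the Document by id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            //System.out.println(document.get("content"));
            System.out.println("-----------------");
        }
        // The reader is opened per test in init(), so it is closed after printing
        indexReader.close();
    }

    @Test
    public void testRangeQuery() throws Exception {
        // Matching uses the LongPoint field; the printed value comes from the StoredField.
        // LongPoint is mainly useful for range queries.
        Query query = LongPoint.newRangeQuery("size", 0L, 100L);
        printResult(query);
    }

    /**
     * A Term query accepts only a single term; a whole sentence will not work.
     * QueryParser analyzes the text first and then searches on the resulting tokens.
     * @throws Exception
     */
    @Test
    public void testQueryParser() throws Exception {
        // Create a QueryParser.
        // Arguments: the default field to search, the analyzer
        QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
        // Use the QueryParser to create a Query
        Query query = queryParser.parse("lucene是一個Java開發的全文檢索工具包");
        // Run the query
        printResult(query);
    }
}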
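
The numeric range query in testRangeQuery above can be pictured with a sorted map in plain Java. This is only a stand-in to show the idea of matching on a sorted numeric structure with both bounds included; it is not how Lucene stores point values, and RangeQuerySketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class RangeQuerySketch {
    // size value -> ids of the documents with that size, kept sorted by value
    private final TreeMap<Long, List<Integer>> bySize = new TreeMap<>();

    public void add(int docId, long size) {
        bySize.computeIfAbsent(size, k -> new ArrayList<>()).add(docId);
    }

    // Ids of all documents whose size lies in [lower, upper], both bounds included.
    public List<Integer> range(long lower, long upper) {
        List<Integer> hits = new ArrayList<>();
        for (List<Integer> ids : bySize.subMap(lower, true, upper, true).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        RangeQuerySketch index = new RangeQuerySketch();
        index.add(0, 50L);
        index.add(1, 100L);
        index.add(2, 101L);
        System.out.println(index.range(0L, 100L)); // prints [0, 1]
    }
}
```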
