Lucene & Full-Text Search

Date: 2019-11-12

Tags: lucene, full-text search

Contents:
1. Full-text search
2. Getting started with Lucene
3. Advanced Lucene

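Full-Text Search

I. Search in everyday life:
1. Windows has a built-in search: open "My Computer" and press F3 to find a given file or folder. The search scope is every file on the machine.

2. The Eclipse help subsystem: click Help > Help Contents to look up help topics. The search scope is all of Eclipse's help files.
3. Search engines such as Baidu or Google, which can find web pages, PDF, DOC, and PPT files, images, music, video, and more across the Internet.
4. Spotlight search on the Mac.
5. Searching a database for a keyword, for example:
select * from topic where content like '%java%'
Because the pattern starts with a wildcard, this text search cannot use an ordinary index on the column, so the database falls back to scanning every row.

Problems with this approach:
1. Searching is slow.
2. Search quality is poor.
3. Results are not ranked by relevance.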

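II. What is full-text search?

In full-text search, an indexing program scans every word in a document and builds an index entry for each one, recording how many times and where the word occurs. When a user issues a query, the search program looks it up in the prebuilt index and returns the matches. The process resembles looking up a character through the index table at the front of a dictionary.

Before going further, it helps to classify data:

Structured data: data with a fixed format or bounded length, such as database records and metadata.
Semi-structured data: data with some structure but no rigid schema, such as XML and HTML.
Unstructured data: data with no fixed length or format, such as e-mail and Word documents.
Unstructured data is also called full-text data, and searching it is called full-text search.
Characteristic: it looks only at the text and does not consider meaning.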

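III. Why use it?
Search speed: all data in the data source is indexed up front, so a query consults the full-text index instead of scanning the data.

Matching quality: matching is done on terms, and by implementing the language-analysis interface it can support Chinese and other non-English languages.

Relevance: a scoring algorithm ranks the results with the highest degree of match (similarity) first.

Typical use cases: a fuzzy query in a relational database cannot use the database's own index, so full-text search is needed to speed it up. For example:
Fuzzy queries on content in a website, such as
select * from article where content like '%上海平安%'
Fuzzy queries on product data in ERP systems, article search in BBS and blog software, and so on.
Search engines of every kind, which are built on full-text search.
Indexing and searching only the sites of a given domain (vertical search, such as "818 job search" or "Youdao shopping search").
Searching content held in Word, PDF, and other file formats.
Other cases, such as the Sohu Pinyin input method and the Google input method.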

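IV. How it works

1. How is full-text data queried?

Serial scanning: to find the files that contain a given string, look at the documents one by one, reading each from start to finish. If a document contains the string, it is one of the files we want; then move on to the next file, until every file has been scanned. The search built into Windows works this way.
How can full-text search be made faster?

Scanning unstructured data sequentially is slow, whereas searching structured data is comparatively fast, because its structure lets a search algorithm take shortcuts. So why not give our unstructured data some structure? Relational databases store structured data, which is why they can be searched quickly.
The information extracted from unstructured data and then reorganized is called an index.
This is the same principle as a dictionary or a book's table of contents.

2. The full-text search process

Indexing: extracting information from all the structured and unstructured data in the real world and building an index from it.
Searching: taking the user's query, searching the index that was built, and returning the results.

3. A worked example

What should the index file store?
The index file only needs to store the words and the numbers of the documents that contain them.
To find the documents that contain is, shanghai, and pingan, first fetch the list of documents containing is, then the lists of documents containing shanghai and pingan, and finally intersect the lists. The result is document 1 and document 3.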

First, the unstructured data is stored in the document area:

Doc no.	Content
0	What is your name?
1	My name is shanghai pingan!
2	What is that?
3	It is shanghai pingan, ShangHai Pingan

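How is the index built?
Step 1: the tokenizer processes the documents. This step is called tokenizing.
1. Split each document into individual words (on whitespace).
2. Remove punctuation.
3. Remove stop words, that is, very frequent function words such as is and it (in Chinese: 的, 了, 呢).
The output of the tokenizer is a sequence of tokens:
What,your,name,My,name,shanghai,pingan,What,that,shanghai,pingan,ShangHai,Pingan

Step 2: the tokens are passed to the linguistic processor. For English it does roughly the following:
1. Convert to lowercase.
2. Reduce words to a root form by rule, such as "cars" to "car". This is called stemming.
3. Map words to their dictionary form, such as "drove" to "drive". This is called lemmatization.
The output of the linguistic processor is a set of terms:
my,name,pingan,shanghai,that,what,your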

Step 3: the terms are passed to the indexer, which processes them as follows:
1. Build a dictionary of (term, document) pairs from the terms.

Term	Document
what	0
your	0
name	0
my	1
name	1
shanghai	1
pingan	1
what	2
that	2
shanghai	3
pingan	3
shanghai	3
pingan	3

2. Sort the dictionary alphabetically.

Term	Document
my	1
name	0
name	1
pingan	1
pingan	3
pingan	3
shanghai	1
shanghai	3
shanghai	3
that	2
what	0
what	2
your	0

3. Merge identical terms into posting lists.

Term	Total occurrences	Document	Frequency	Document	Frequency
my	1	1	1	~	~
name	2	0	1	1	1
pingan	3	1	1	3	2
shanghai	3	1	1	3	2
that	1	2	1	~	~
what	2	0	1	2	1
your	1	0	1	~	~

In the end two parts are stored, a document area and an index area:

Term	Document numbers
my	1
name	0,1
pingan	1,3
shanghai	1,3
that	2
what	0,2
your	0
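
The three indexing steps and the intersection search can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: the class name InvertedIndexSketch, its method names, and the two-word stop list are hypothetical, and stemming and lemmatization are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Stop words assumed for this sketch (the text names "is" and "it").
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "it"));

    // Steps 1 and 2: split on non-letters (which drops punctuation),
    // lowercase each token, and discard stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^A-Za-z]+")) {
            String term = token.toLowerCase();
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Step 3: map each term to the sorted set of document numbers containing it.
    static Map<String, TreeSet<Integer>> buildIndex(String[] docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : analyze(docs[docId])) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // AND search: intersect the posting lists of all query words.
    static Set<Integer> searchAll(Map<String, TreeSet<Integer>> index, String... words) {
        Set<Integer> result = null;
        for (String word : words) {
            Set<Integer> postings = index.getOrDefault(word.toLowerCase(), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        String[] docs = {
            "What is your name?",
            "My name is shanghai pingan!",
            "What is that?",
            "It is shanghai pingan, ShangHai Pingan"
        };
        Map<String, TreeSet<Integer>> index = buildIndex(docs);
        // Print the index area: one line per term with its document numbers.
        index.forEach((term, postings) -> System.out.println(term + "\t" + postings));
        // Documents containing both shanghai and pingan.
        System.out.println(searchAll(index, "shanghai", "pingan"));
    }
}
```

Using a sorted set per term keeps each document number once, which matches the index area above, and makes the intersection a single retainAll call.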

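The search side works roughly as follows:
1. Accept the search words and keywords the user entered and do some simple preprocessing.
2. Run lexical analysis, syntax analysis, and language processing on the query.
3. Look up the lists of documents that contain the query terms and apply the required boolean logic.
4. Sort by relevance and return the most relevant documents.

4. Document relevance

Computing term weights:
1. Finding how important a term is to a document is called computing the term weight. Two factors matter most:
A. Term frequency (tf): how many times the term occurs in this document. The larger tf is, the more important the term.
B. Document frequency (df): how many documents contain the term. The larger df is, the less important the term.

2. Judging the relationship between terms to obtain document relevance, which is what the vector space model (VSM) does.
How it works: treat a document as a series of terms, each with a term weight; different terms influence the document's relevance score according to their weights in the document.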
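
To make the two factors concrete, here is a minimal sketch that combines them as tf * ln(N / df), where N is the number of documents. This textbook formula is an assumption chosen for illustration; Lucene's own scoring formula is more elaborate. The class name TermWeightSketch and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class TermWeightSketch {
    // Term frequency: how many times the term occurs in one document.
    static int tf(List<String> doc, String term) {
        int count = 0;
        for (String t : doc) {
            if (t.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: how many documents contain the term at least once.
    static int df(List<List<String>> docs, String term) {
        int count = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Weight grows with tf and shrinks as df grows (assumed formula: tf * ln(N / df)).
    static double weight(List<List<String>> docs, int docId, String term) {
        int docFreq = df(docs, term);
        if (docFreq == 0) {
            return 0.0;
        }
        return tf(docs.get(docId), term) * Math.log((double) docs.size() / docFreq);
    }

    public static void main(String[] args) {
        // The terms of the four example documents after analysis.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("what", "your", "name"),
            Arrays.asList("my", "name", "shanghai", "pingan"),
            Arrays.asList("what", "that"),
            Arrays.asList("shanghai", "pingan", "shanghai", "pingan"));
        System.out.println("shanghai in doc 1: " + weight(docs, 1, "shanghai"));
        System.out.println("shanghai in doc 3: " + weight(docs, 3, "shanghai"));
        System.out.println("my in doc 1: " + weight(docs, 1, "my"));
    }
}
```

With these inputs, shanghai weighs twice as much in document 3 as in document 1 because it occurs twice there, and my outweighs shanghai within document 1 because only one document contains it.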

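5. Full-text search application architecture

6. How the full-text search process maps to Lucene's packages

Lucene's analysis module handles lexical analysis and language processing, producing terms.
Lucene's index module handles index creation; it contains IndexWriter.
Lucene's store module handles reading and writing the index.
Lucene's QueryParser handles query syntax analysis.
Lucene's search module handles searching the index.

Getting Started with Lucene

What is Lucene?

Lucene is a high-performance, scalable full-text search engine toolkit written in Java. It can easily be embedded in all kinds of applications to give them full-text indexing and search over their own data. Lucene's goal is to add full-text search to small and medium-sized applications.

Development steps

Building the index

1. Create a test class, LuceneTest.
2. Add the jars:
lucene-core-4.10.4.jar (core)
lucene-analyzers-common-4.10.4.jar (analyzers)
3. Create an IndexWriter, passing it the location where the index will be stored and an IndexWriterConfig (which sets the version and the analyzer).
4. The documents are written to binary files that are hard to read directly, so use a tool (lukeall-4.10.0.jar) to inspect the index.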

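import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class LuceneTest {
    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "hello";
    Analyzer analyzer = new StandardAnalyzer(); // analyzer

    @Test
    public void testCreateIndex() throws Exception {
        // 1. Create the index writer
        Directory d = FSDirectory.open(new File(indexPath)); // where the index is stored
        // Create the writer configuration (version and analyzer)
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, conf);
        // 2. Write the documents
        // Define how the fields are indexed and stored
        FieldType type = new FieldType();
        type.setIndexed(true); // index the field
        type.setStored(true);  // store the field value
        Document document1 = new Document(); // comparable to one row in a database
        // new Field(field name, field value, field configuration)
        document1.add(new Field("title", "doc1", type)); // title: doc1
        document1.add(new Field("content", content1, type)); // content: hello world
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type));
        writer.addDocument(document3);

        // Commit the added documents and release the writer
        writer.commit();
        writer.close();
    }
}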

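Running the test creates a hello folder in the project directory.

Opening the _0.cfs file directly tells us nothing useful.

Use the tool (lukeall-4.10.0.jar) to inspect the index.
Just run java -jar lukeall-4.10.0.jar from a terminal.

In the Path field, enter the absolute path of the hello index.

Click OK to view the index.

Searching the index

0. Add lucene-queryparser-4.10.4.jar (it turns a string into a Query object).
1. Create a test method, searchIndex().
2. Create an IndexSearcher.
3. Parse the query text into a Query object, specifying the field to search and the analyzer.
4. Run the query to get the matching document numbers.
5. Use each document number to fetch the document's content.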

// Searching the index
@Test
public void searchIndex() throws Exception {
    // 1. Open the index directory
    Directory d = FSDirectory.open(new File(indexPath));
    // Create the analyzer (the same kind that was used for indexing)
    Analyzer analyzer = new StandardAnalyzer();
    // Open a reader on the index
    IndexReader r = DirectoryReader.open(d);
    // Create the searcher
    IndexSearcher searcher = new IndexSearcher(r);
    QueryParser parser = new QueryParser("content", analyzer);

    Query query = parser.parse("hello"); // search for hello
    // search(query, n): return the top n matching documents
    TopDocs search = searcher.search(query, 10000);
    System.out.println("Number of matching documents: " + search.totalHits);

    ScoreDoc[] scoreDocs = search.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
        System.out.println("*******************************");
        System.out.println("score: " + scoreDocs[i].score); // relevance score used for ranking
        int docId = scoreDocs[i].doc; // document number
        Document document = searcher.doc(docId);
        System.out.println("docId --->" + docId);
        System.out.println("title --->" + document.get("title"));
        System.out.println("content --->" + document.get("content"));
    }
    r.close();
}

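Printed output:

Common API

Directory: the index directory, which holds Lucene's index files.
Directory is an abstraction over where the index lives. An index can be stored in ordinary files, in a database, or in some remote service. In most cases it is stored in files, and a Directory then corresponds to a folder.
SimpleFSDirectory: operates on the index files directly through java.io.RandomAccessFile. It is the simplest choice for an ordinary Lucene application.
Constructor:
SimpleFSDirectory(File path): creates an index directory from a folder path.
MMapDirectory: a separate implementation, constructed as MMapDirectory(File path), that asks the OS to map the index files into virtual address space so that Lucene can read the index as if it were in memory.

Document: when content is added to the index, each item is represented by a Document. A Document can be thought of as a record, similar to a row in a relational table.
Once a Document has been created, its fields are manipulated through the methods it provides.
The main methods of Document are:
Adding a field: add(Field field)
Removing fields: removeField, removeFields
Getting fields or values: get, getBinaryValue, getField, getFields, and others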

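Field: a Field is one item of data in a Document, equivalent to a column in a Lucene record.
The description below follows the older Lucene 3.x API, where the Fieldable interface defined field objects and was implemented by Field, NumericField, and others. In Lucene 4.x, the version used in this article, the corresponding interface is IndexableField, and the constructors that take Field.Index are deprecated in favor of the FieldType-based constructor used in the examples here. In practice the Field class is what you use most.
Commonly used constructors of Field:
1. Field(String name, String value, Field.Store store, Field.Index index): creates a field from a name, a value, a storage option, and an indexing option.
2. Field(String name, byte[] value, Field.Store store): creates a field from a name, a binary value, and a storage option.
3. Field(String name, Reader reader): creates a field from a name and a Reader.
4. Other constructors; see the API documentation.
new Field("title", "中國太平", Store.NO, Index.ANALYZED);
new Field("content", "比較好的保險公司", Store.YES, Index.ANALYZED);

FieldType: when creating a Field you can specify its stored and indexed properties.
stored: whether the field value is stored; true means stored, false means not stored.
type.setStored(true); // whether to store the value in the document area
indexed: whether the field is indexed, that is, whether it can be searched.
tokenized: whether the field is split into terms by the Analyzer.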

Create a FieldTest class (copy the class above and rename it).
Define the storage rules for the field:

FieldType type2 = new FieldType();
  type2.setIndexed(true); // index this field
  type2.setStored(true); // store the value in the document area
  type2.setTokenized(false); // do not tokenize this field

Use type2 as the configuration of every content field:

document1.add(new Field("content", content1, type2));
document2.add(new Field("content", content2, type2));
document3.add(new Field("content", content3, type2));

public class FieldTest {
    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "fieldType";
    Analyzer analyzer = new StandardAnalyzer();//分詞器

    //建立索引
    @Test
    public void testCreateIndex() throws Exception {
        //1.建立索引寫入器
        Directory d = FSDirectory.open(new File(indexPath));//索引須要存放的位置
        //建立索引寫入器配置對象
        IndexWriterConfig confg = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        confg.setOpenMode(IndexWriterConfig.OpenMode.CREATE);//索引每次從新建立
        IndexWriter writer = new IndexWriter(d, confg);
        //2.寫入文檔信息
        //添加文檔 定義字段的存儲規則
        FieldType type = new FieldType();
        type.setIndexed(true);//該字段是否要索引
        type.setStored(true);//是否須要存儲
        type.setTokenized(true);

        FieldType type2 = new FieldType();
        type2.setIndexed(true);//該字段是否要索引
        type2.setStored(true);//是否須要存儲
        type2.setTokenized(false);//字段是否分詞

        Document document1 = new Document();//數據庫中的一條數據
        //new Field("字段名","字段內容","字段的配置屬性")
        document1.add(new Field("title", "doc1", type));//該條記錄中的字段 title:doc1
        document1.add(new Field("content", content1, type2));//content: hello world
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type2));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type2));
        writer.addDocument(document3);

        //須要把添加的記錄保存
        writer.commit();
        writer.close();
    }
}

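Run the test class.

Inspect the index.

When searching for user names or place names you usually want the whole value kept as a single term rather than split up. In that case set the field's tokenized property to false so that it is not tokenized.
As an example of mixing these settings, in an article index:
1. Title and content are both indexed through the analyzer.
2. The title is stored in full in the document area, while only the first 30 characters of the content are stored.
3. The article ID is stored in the document area but not tokenized.
4. Time, author, view count, comment count, and source are neither indexed nor stored.

Analyzer (lexical analyzer)

Create a test class, AnalyzerTest.
Wrap the code that exercises each analyzer in a helper method, analyzerMethod(Analyzer analyzer, String content).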

public class AnalyzerTest {
    String en = "good morning boy";
    String ch = "你好 恭喜發財 東方明珠三生三世十里桃花";

    // Helper, not a test: JUnit test methods cannot take parameters,
    // so this method must not carry @Test.
    public void analyzerMethod(Analyzer analyzer, String content) throws Exception {
        TokenStream tokenStream = analyzer.tokenStream("content", content);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            // Prints the attributes (term text, offsets, ...) of the current token
            System.out.println(tokenStream);
        }
        tokenStream.end();
        tokenStream.close();
    }

    // SimpleAnalyzer on English text
    @Test
    public void testSimpleAnalyzer() throws Exception {
        analyzerMethod(new SimpleAnalyzer(), en);
    }
}

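English analysis:
SimpleAnalyzer: the simplest analyzer. It splits text at every non-letter character (whitespace, digits, punctuation) and lowercases the resulting tokens.

// SimpleAnalyzer on English text
    @Test
    public void testSimpleAnalyzer() throws Exception {
        analyzerMethod(new SimpleAnalyzer(), en);
    }

StandardAnalyzer: analyzes English by word and Chinese by character.

// StandardAnalyzer on English text
    @Test
    public void testStandardAnalyzer() throws Exception {
        analyzerMethod(new StandardAnalyzer(), en);
    }

For English, StandardAnalyzer also splits text into words, following Unicode word-boundary rules rather than whitespace alone; it additionally lowercases tokens and drops common English stop words.
Next, the same analyzer on Chinese text, where it emits one token per character:

// StandardAnalyzer on Chinese text
    @Test
    public void testStandardAnalyzerChinese() throws Exception {
        analyzerMethod(new StandardAnalyzer(), ch);
    }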

PerFieldAnalyzerWrapper: applies a different analyzer to each field.

@Test
public void testPerFieldAnalyzerWrapper() throws Exception {
  Map<String, Analyzer> analyzerMap = new HashMap<>();
  analyzerMap.put("en", new SimpleAnalyzer()); // field "en" uses SimpleAnalyzer
  analyzerMap.put("ch", new StandardAnalyzer()); // field "ch" uses StandardAnalyzer
  // The first constructor argument is the default analyzer
  PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new SimpleAnalyzer(), analyzerMap);
  // The wrapper picks the analyzer by the field name passed to tokenStream (the first argument).
  // If no analyzer is registered for that field, the default analyzer is used.
  // "content" is not in the map, so the default SimpleAnalyzer is applied here;
  // pass "ch" instead to analyze the text with StandardAnalyzer.
  TokenStream tokenStream = wrapper.tokenStream("content", ch);
  tokenStream.reset();
  while (tokenStream.incrementToken()) {
    System.out.println(tokenStream);
  }
  tokenStream.end();
  tokenStream.close();
}

Chinese analysis:
StandardAnalyzer: single-character tokenization; every character becomes a term.

// StandardAnalyzer on Chinese text
@Test
public void testStandardAnalyzer() throws Exception {
   analyzerMethod(new StandardAnalyzer(), ch);
 }

CJKAnalyzer: bigram tokenization; every two adjacent characters become a term. For example, 咱們是中國人 yields 咱們, 們是, 是中, 中國, 國人.

// CJKAnalyzer on Chinese text
@Test
public void testCJKAnalyzer() throws Exception {
    analyzerMethod(new CJKAnalyzer(), ch);
}
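
The bigram idea behind CJKAnalyzer can be shown in a few lines of plain Java. This is a simplified sketch, not CJKAnalyzer's implementation: the class BigramSketch is hypothetical, and it assumes the input is one unbroken run of characters from the Basic Multilingual Plane with no whitespace or punctuation.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Emit every pair of adjacent characters as one term.
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Six characters produce five overlapping two-character terms.
        System.out.println(bigrams("咱們是中國人"));
    }
}
```

Because the terms overlap, any two-character word in the text is guaranteed to appear as a term, at the cost of also indexing pairs such as 們是 that are not words.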

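SmartChineseAnalyzer: dictionary-based tokenization. Chinese words are kept in a dictionary that is maintained by some algorithm, and text that matches a dictionary entry is cut out as a word. Dictionary-based tokenization is generally regarded as the best approach for Chinese. For example, 咱們是中國人 becomes 咱們 and 中國人. (Options include SmartChineseAnalyzer, the "極易分詞" MMAnalyzer, the "庖丁分詞" analyzer, and IKAnalyzer. IKAnalyzer is recommended.)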

// SmartChineseAnalyzer on Chinese text
// Requires lucene-analyzers-smartcn-4.10.4.jar
@Test
public void testSmartChineseAnalyzer() throws Exception {
   analyzerMethod(new SmartChineseAnalyzer(), ch);
  }
}

IKAnalyzer: a third-party analyzer
1. Add IKAnalyzer2012FF_u1.jar (not available from the central Maven repository). It supports stop words and custom extension words.
2. Add a stop-word dictionary, stopword.dic.
3. Add an extension dictionary, ext.dic.

// IKAnalyzer on Chinese text
// Requires IKAnalyzer2012FF_u1.jar
 @Test
 public void testIKAnalyzer() throws Exception {
   analyzerMethod(new IKAnalyzer(), ch);
 }

To drop particles such as 的, 了, and 嗎, add the configuration files
IKAnalyzer.cfg.xml and stopword.dic.

List the words you do not want in stopword.dic; they will then no longer be emitted as terms.

To add terms of your own, register an extension dictionary, ext.dic, in IKAnalyzer.cfg.xml
and list the custom terms in ext.dic.

Updating the Index

public class CRUDTest {

    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "luncecrud";
    Analyzer analyzer = new StandardAnalyzer();//分詞器

    //建立索引
    @Test
    public void testCreateIndex() throws Exception {
        //1.建立索引寫入器
        Directory d = FSDirectory.open(new File(indexPath));//索引須要存放的位置
        //建立索引寫入器配置對象
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, conf);
        //2.寫入文檔信息
        //添加文檔 定義字段的存儲規則
        FieldType type = new FieldType();
        type.setIndexed(true);//是否要索引
        type.setStored(true);//是否須要存儲
        Document document1 = new Document();//數據庫中的一條數據
        //new Field("字段名","字段內容","字段的配置屬性")
        document1.add(new Field("title", "doc1", type));//該條記錄中的字段 title:doc1
        document1.add(new Field("content", content1, type));//content: hello world
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type));
        writer.addDocument(document3);

        //須要把添加的記錄保存
        writer.commit();
        writer.close();
        testSearch();
    }

    @Test
    public void testUpdate() throws Exception {
        //建立索引寫入器
        Directory d = FSDirectory.open(new File(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, config);
        //更新對象
        Term term = new Term("title", "doc2");//更新的條件
        Document updateDoc = new Document();//更新以後的文檔對象
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setStored(true);
        updateDoc.add(new Field("title", "doc2", type));
        updateDoc.add(new Field("content", "hello黃河之水天上來吧我要更新內容啦", type));
        writer.updateDocument(term, updateDoc);
        //提交更新內容 釋放資源
        writer.commit();
        writer.close();
        testSearch();
    }

    // Searching the index
    @Test
    public void testSearch() throws Exception {
        // 1. Open the index directory
        Directory d = FSDirectory.open(new File(indexPath));

        //打開索引目錄
        IndexReader r = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(r);
        QueryParser parser = new QueryParser("content", analyzer);

        Query query = parser.parse("hello");//查詢hello
        //search(查詢對象,符合條件的前n條記錄)
        TopDocs search = searcher.search(query, 10000);//n:前幾個結果
        System.out.println("符合條件的記錄有多少個:" + search.totalHits);
        ScoreDoc[] scoreDocs = search.scoreDocs;
        Document doc = null;
        for (int i = 0; i < scoreDocs.length; i++) {
            System.out.println("*******************************");
            System.out.println("分數:" + scoreDocs[i].score);//相關度的排序
            int docId = scoreDocs[i].doc;//文檔編號
            Document document = searcher.doc(docId);
            System.out.println("文檔編號 docId--->" + docId);
            System.out.println("標題內容 title:--->" + document.get("title"));
            System.out.println("正文內容 content:--->" + document.get("content"));
        }
    }
}

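First write testCreateIndex() to build the index and testSearch() to query it, then write testUpdate() to update it.
Run testCreateIndex() first.

Then run testUpdate().

The content of the document titled doc2 is replaced with the new content, and its document number changes: document 1 is deleted and a new document 3 is added. This shows that an update is implemented as a delete followed by an add.

Deleting Documents from the Index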

@Test
    public void testDelete()throws Exception{
        //建立索引寫入器
        Directory d = FSDirectory.open(new File(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, config);
        //刪除記錄
        /**
         * 方式一
         Term term=new Term("title","doc2");
         writer.deleteDocuments(term);
         */
        //方式二
        QueryParser parser = new QueryParser("title", analyzer);
        Query query = parser.parse("doc3");
        writer.deleteDocuments(query);

        //將刪除操做提交
        writer.commit();
        writer.close();
        testSearch();
    }

Advanced Lucene

Match all documents

// Search helper 1: parse a query string, then run it
public void search(String content) throws Exception {
  // Create the analyzer
  Analyzer analyzer = new StandardAnalyzer();
  QueryParser parser = new QueryParser("content", analyzer);
  Query query = parser.parse(content);
  search(query);
}

// Search helper 2: run a Query object and print the hits
public void search(Query query) throws Exception {
  // 1. Open the index directory
  Directory d = FSDirectory.open(new File(indexPath));
  // Open a reader on the index
  IndexReader r = DirectoryReader.open(d);
  IndexSearcher searcher = new IndexSearcher(r);
  // search(query, n): return the top n matching documents
  TopDocs search = searcher.search(query, 10000);
  System.out.println("Number of matching documents: " + search.totalHits);

  ScoreDoc[] scoreDocs = search.scoreDocs;
  for (int i = 0; i < scoreDocs.length; i++) {
    System.out.println("*******************************");
    System.out.println("score: " + scoreDocs[i].score); // relevance score used for ranking
    int docId = scoreDocs[i].doc; // document number
    Document document = searcher.doc(docId);
    System.out.println("docId --->" + docId);
    System.out.println("title --->" + document.get("title"));
    System.out.println("content --->" + document.get("content"));
  }
  r.close();
}

@Test
public void test1() throws Exception {
  search("*:*"); // match all documents in all fields
  search(new MatchAllDocsQuery());
}

Term query

/**
     * Term query
     *
     * @throws Exception
     */
    @Test
    public void test2() throws Exception {
        //search("title:doc1"); ---> search(String content)
        search(new TermQuery(new Term("title", "doc1"))); // ---> search(Query query)
    }

Phrase query

/**
     * Phrase query
     * @throws Exception
     */
    @Test
    public void test3() throws Exception {
     // search("content:\"hello world\"");
      PhraseQuery query =new PhraseQuery();
      query.add(new Term("content","hello"));
      query.add(new Term("content","world"));
      search(query);
    }

Wildcard query

/**
 * Wildcard query
 * @throws Exception
 */
@Test
public void test4() throws Exception {
// Option 1: * in a query string
 search("l*ne");
// Option 2: ? in a query string
 search("lucen?");
// Option 3: build the query object directly
 WildcardQuery query = new WildcardQuery(new Term("content","l*ne"));
   search(query);
}

In search("l*ne"), * matches any sequence of characters, including none.
In search("lucen?"), ? matches exactly one character.
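
The meaning of the two wildcards can be checked without an index by translating a pattern into a regular expression. This is a sketch for illustration only: the class WildcardSketch is hypothetical, and Lucene itself matches wildcards against the term dictionary with automata rather than java.util.regex.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    // Translate * to ".*" and ? to "." and quote every other character literally.
    static boolean matches(String wildcard, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Pattern.matches requires the whole term to match the pattern.
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matches("l*ne", "lucene"));   // * spans "uce"
        System.out.println(matches("lucen?", "lucene")); // ? stands for the final "e"
        System.out.println(matches("lucen?", "lucen"));  // ? needs exactly one character
    }
}
```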

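Fuzzy term query

Lucene tolerates misspellings: content:lucenx~1 matches terms within one edit (a changed, inserted, or deleted letter) of lucenx. In content:lucenx~N the maximum value of N is 2.

@Test
public void test5() throws Exception{
search("content:lxcenX~2");
FuzzyQuery query = new FuzzyQuery(new Term("content","lucenx"),1);
search(query);
}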
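
The number after the tilde is an edit distance. The sketch below computes the classic Levenshtein distance with dynamic programming to show why lucenx is one edit from lucene and lxcenx is two. The class EditDistanceSketch is hypothetical; Lucene's FuzzyQuery is implemented differently (with automata) and by default also counts swapping two adjacent letters as a single edit, which this sketch does not.

```java
class EditDistanceSketch {
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions that turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucenx", "lucene"));
        System.out.println(distance("lxcenx", "lucene"));
    }
}
```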

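Older Lucene versions instead took a similarity value after the ~ (tilde): a number between 0 and 1 such as ~0.85 or ~0.3, where values closer to 1 require a closer match and the default was 0.5. In Lucene 4.x this fractional form is deprecated; the number is a maximum edit distance and defaults to 2 when omitted.

@Test
public void test6() throws Exception{
search("lqcenX~1");
FuzzyQuery query = new FuzzyQuery(new Term("content","lqcenX"));
search(query);
}

Phrase query with slop (proximity query)

content:"hello world"~1 allows one extra word between the words of the phrase
content:"hello world"~N allows up to N extra words between them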

/**
  * Phrase query with slop (proximity query)
  * @throws Exception
  */
 @Test
 public void test7() throws Exception{
  //~1 allows one extra word between the words of the phrase
  //content:\"hello world\"~N allows up to N extra words between them
  //search("content:\"hello world\"~1");
   PhraseQuery query = new PhraseQuery();
   query.add(new Term("content","hello"));
   query.add(new Term("content","world"));
   query.setSlop(1); // allow the phrase terms to be one position apart
   search(query);
 }

Range query

/**
  * Range query
  */
@Test
public void test8() throws Exception {
//  { : exclusive lower bound
//  } : exclusive upper bound
//  [ : inclusive lower bound
//  ] : inclusive upper bound
//search("inputtime:{20101010 TO 20101012}");
//TermRangeQuery(field, lower value, upper value, include lower?, include upper?)
  TermRangeQuery query = new TermRangeQuery("inputtime", new BytesRef("20101010"), new BytesRef("20101012"), false, false);
   search(query);
}
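
A term range compares terms as strings, not as numbers, which is why dates written as fixed-width yyyyMMdd values work well here. The sketch below mimics the four bracket combinations with String.compareTo. It is an illustration under assumptions: the class TermRangeSketch is hypothetical, and Lucene compares the terms' bytes, which gives the same order as compareTo for the ASCII digits used in this example.

```java
class TermRangeSketch {
    // Lexicographic range check with independently inclusive or exclusive bounds.
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int cmpLower = term.compareTo(lower);
        int cmpUpper = term.compareTo(upper);
        boolean aboveLower = includeLower ? cmpLower >= 0 : cmpLower > 0;
        boolean belowUpper = includeUpper ? cmpUpper <= 0 : cmpUpper < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // {20101010 TO 20101012}: both bounds excluded, only 20101011 is inside.
        System.out.println(inRange("20101011", "20101010", "20101012", false, false));
        System.out.println(inRange("20101010", "20101010", "20101012", false, false));
        // [20101010 TO 20101012]: both bounds included.
        System.out.println(inRange("20101010", "20101010", "20101012", true, true));
        // Pitfall: as strings, "9" sorts after "20", so it is outside [10 TO 20].
        System.out.println(inRange("9", "10", "20", true, true));
    }
}
```

The last call shows why values of varying width need zero padding (or a numeric field type) before a term range behaves like a numeric range.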

Boolean queries

AND and &&: goal: find documents whose title contains one and whose content contains java.
Either of the following works:
title:one && content:java
title:one AND content:java

/**
 * Boolean query with AND and &&
 * @throws Exception
 */
 @Test
 public  void test9() throws Exception {
   //search("content:hello AND inputtime:{20101010 TO 20101012}");
    search("content:hello && inputtime:{20101010 TO 20101012}");
   /*
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("content","hello")), BooleanClause.Occur.MUST);
    query.add(new TermRangeQuery("inputtime",new BytesRef("20101010"),new BytesRef("20101012"),false,false), BooleanClause.Occur.MUST);
    search(query);
   */
 }

OR and ||: goal: find documents whose title contains one or whose content contains java.
By default, terms placed side by side are combined with logical OR.
Any of the following three forms works:
title:one || content:java
title:one OR content:java
title:one content:java

/**
 * Boolean query with OR and ||
 * @throws Exception
 */
@Test
public  void test10() throws Exception {
//search("content:lucene OR inputtime:{20101010 TO 20101012}");
//search("content:lucene || inputtime:{20101010 TO 20101012}");
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("content","lucene")), BooleanClause.Occur.SHOULD);
  query.add(new TermRangeQuery("inputtime",new BytesRef("20101010"),new BytesRef("20101012"),false,false), BooleanClause.Occur.SHOULD);
  search(query);
}

NOT and !: goal: find documents whose title contains one but whose content does not contain java.
Either of the following works:
title:one ! content:java
title:one NOT content:java

/**
 * Boolean query with NOT and !
 * @throws Exception
 */
@Test
public  void test11() throws Exception {
  //search("content:lucene NOT inputtime:{20101010 TO 20101012}");
  //search("content:lucene ! inputtime:{20101010 TO 20101012}");
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("content","lucene")), BooleanClause.Occur.MUST);
    query.add(new TermRangeQuery("inputtime",new BytesRef("20101010"),new BytesRef("20101012"),false,false), BooleanClause.Occur.MUST_NOT);
    search(query);
}

Required (+) and excluded (-): goal: find documents whose title contains one but whose content does not contain java.
+title:one -content:java

Boosting

Lucene lets you assign a relevance weight to an individual term in a combined query, so that the results most relevant to that term rank higher.
To boost a term, append ^n to it.
For example, when searching for jakarta apache, change the query to jakarta^4 apache to make jakarta count for more.
Boosts also apply to phrases: "jakarta apache"^4 "Apache Lucene" ranks the documents most relevant to the phrase jakarta apache first.
The default boost is 1. To raise a weight you normally use an integer greater than 1; a decimal between 0 and 1 is also allowed, but negative values are not.

/**
 * Boosting
 * @throws Exception
 */
@Test
public  void test12() throws Exception {
 //search("content:lucene^10 java");
   BooleanQuery query = new BooleanQuery();
   TermQuery termQuery = new TermQuery(new Term("content", "lucene"));
   termQuery.setBoost(10); // boost this clause
   query.add(termQuery, BooleanClause.Occur.SHOULD);
   query.add(new TermQuery(new Term("content","java")), BooleanClause.Occur.SHOULD);
  search(query);
}

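Special characters

Because characters such as | & ! + - ( ) act as operators in query expressions, they must be escaped with \ to be searched for literally.
The special characters in Lucene queries are: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
For example, to search for documents containing (1+1):2, use the expression:
\(1\+1\)\:2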
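
Escaping by hand is error-prone, so it is usually done in code. The following plain-Java sketch prefixes each special character listed above with a backslash. The class QueryEscapeSketch is hypothetical and covers only the characters named in this section; in a real application the static QueryParser.escape(String) method that ships with Lucene's query parser is the safer choice.

```java
class QueryEscapeSketch {
    // The single characters that make up the special symbols listed in the text.
    static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    // Put a backslash in front of every special character.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the escaped form of (1+1):2
        System.out.println(escape("(1+1):2"));
    }
}
```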

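Grouping
Use parentheses () to group query expressions
Lucene's query syntax supports grouping with (), which lets you compose complex queries.
1. Find documents whose title contains one or two but whose content does not contain java:
Query query=parser.parse("title:(one OR two) NOT content:java");

Highlighting

1. Overview: take a snippet of each search result as a summary and highlight the matched terms in it.
Highlighting requires lucene-highlighter-4.10.4.jar.
2. Highlighting involves two parts: A. extracting the snippet, B. marking up the matches.

Formatter formatter = new SimpleHTMLFormatter("<font color=\"red\">","</font>");
Scorer scorer = new QueryScorer(query);
Highlighter hl = new Highlighter(formatter,scorer);
hl.setMaxDocCharsToAnalyze(20); // analyze at most the first 20 characters of the text
String str=hl.getBestFragment(new StandardAnalyzer(), "content",doc.get("content"));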