Learning Lucene: Building an Index

Date: 2019-11-24

1.1. Building an Index

Example (Java):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;

public class IndexSearchDemo {

    // In-memory index, sufficient for a two-document demo.
    private Directory directory = new RAMDirectory();
    private String[] ids = {"1", "2"};
    private String[] teamname = {"fpx", "ig"};
    private String[] contents = {"涅槃隊，鳳凰涅槃，勇奪2019LOLS賽冠軍!", "翻山隊，登峯造極，翻過那座山奪得了2019LOLS賽冠軍!"};
    private String[] players = {"doinb,tian,lwx,crisp,gimgoong", "rookie,jacklove,ning,baolan,theshy"};
    // The analyzer is configured here; IKAnalyzer is a third-party Chinese analyzer.
    private IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
    private IndexWriter indexWriter;

    @Before
    public void createIndex() throws IOException {
        indexWriter = new IndexWriter(directory, indexWriterConfig);
        for (int i = 0; i < ids.length; i++) {
            Document document = new Document();
            // StringField: indexed as a single term, not tokenized.
            Field idField = new StringField("id", ids[i], Field.Store.YES);
            Field teamnameField = new StringField("teamname", teamname[i], Field.Store.YES);
            // TextField: tokenized by the analyzer before indexing.
            Field contentField = new TextField("content", contents[i], Field.Store.YES);
            Field playersField = new StringField("players", players[i], Field.Store.YES);
            document.add(idField);
            document.add(teamnameField);
            document.add(contentField);
            document.add(playersField);
            indexWriter.addDocument(document);
        }
        // Closing the writer commits the documents so that a reader opened later can see them.
        indexWriter.close();
    }

    // This helper is missing from the original listing; it is assumed here
    // to open a reader over the same directory the writer used.
    private IndexSearcher getIndexSearcher() throws IOException {
        return new IndexSearcher(DirectoryReader.open(directory));
    }

    @Test
    public void testTermQuery() throws IOException {
        Term term = new Term("id", "1");
        IndexSearcher indexSearcher = getIndexSearcher();
        TopDocs topDocs = indexSearcher.search(new TermQuery(term), 10);

        // 1) Print the total hit count, similar to the "about 100,000,000 results found"
        //    line shown by a search engine such as Baidu.
        long totalHits = topDocs.totalHits.value;
        System.out.println("Total hits: " + totalHits);
        // LOGGER.info("Max score among hits: " + topDocs.getMaxScore());

        // 2) Get the hits themselves, capped at the requested maximum (10 here).
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;

        // Print each matching document.
        for (ScoreDoc scoreDoc : scoreDocs) {
            int doc = scoreDoc.doc;
            Document document = indexSearcher.doc(doc);
            // Print the stored value of every field.
            System.out.println("id: " + document.get("id"));
            System.out.println("teamname: " + document.get("teamname"));
            System.out.println("content: " + document.get("content"));
            System.out.println("players: " + document.get("players"));
        }
    }
}

With the example above as a starting point, these are the main classes to focus on:

Directory
IndexWriterConfig
IndexWriter
IndexSearcher
TopDocs
ScoreDoc
Document
Field

1.1.1. Directory

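Lucene stores its index as a set of files, so index storage goes through the Directory class. The two implementations used most often are:

FSDirectory: stores the index files on the file system.

RAMDirectory: keeps the index files in memory temporarily. It suits small indexes; a large index causes frequent GC.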
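
To make the idea of a storage abstraction concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: `ToyDirectory`, `ToyRamDirectory`, `ToyFsDirectory` and the file name `segment_1` are hypothetical names invented for this illustration. It only shows why code written against one interface can switch between memory and disk. In Lucene itself, `FSDirectory.open(Path)` is the usual way to get a file-system directory, and RAMDirectory is deprecated in Lucene 8 and removed in Lucene 9 in favor of `ByteBuffersDirectory`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the idea behind Directory: a flat set of named files.
interface ToyDirectory {
    void writeFile(String name, byte[] data);
    byte[] readFile(String name);
}

// Keeps every "file" on the heap, in the spirit of RAMDirectory.
class ToyRamDirectory implements ToyDirectory {
    private final Map<String, byte[]> files = new HashMap<>();

    @Override
    public void writeFile(String name, byte[] data) {
        files.put(name, data.clone());
    }

    @Override
    public byte[] readFile(String name) {
        byte[] data = files.get(name);
        if (data == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return data.clone();
    }
}

// Persists every "file" under a folder on disk, in the spirit of FSDirectory.
class ToyFsDirectory implements ToyDirectory {
    private final Path root;

    ToyFsDirectory(Path root) {
        this.root = root;
    }

    // Convenience for the demo: use a fresh temporary folder as the index location.
    static ToyFsDirectory inTempFolder() {
        try {
            return new ToyFsDirectory(Files.createTempDirectory("toy-index"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void writeFile(String name, byte[] data) {
        try {
            Files.write(root.resolve(name), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public byte[] readFile(String name) {
        try {
            return Files.readAllBytes(root.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class ToyDirectoryDemo {
    // Code written against the interface does not care where the bytes live.
    static String roundTrip(ToyDirectory dir, String text) {
        dir.writeFile("segment_1", text.getBytes(StandardCharsets.UTF_8));
        return new String(dir.readFile("segment_1"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new ToyRamDirectory(), "kept in memory"));
        System.out.println(roundTrip(ToyFsDirectory.inTempFolder(), "kept on disk"));
    }
}
```

The in-memory variant loses its contents when the process exits, which is why it fits tests and small demos rather than a persistent index.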

1.1.2. IndexWriterConfig

This class holds the configuration for IndexWriter, chiefly the analyzer (tokenizer) to use.

1.1.3. IndexWriter

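This is Lucene's index-modifying class: it adds documents to the index and deletes them from it. At this point some readers will ask: what about "update" and "search"?

Note:

IndexWriter does support "update", but it is implemented as a delete followed by a fresh insert. The replacement document must therefore carry every field. You cannot update a single field on its own, because any field left out of the replacement document is lost.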
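
The delete-then-insert behaviour can be sketched with a toy model in plain Java. This is not Lucene code: `ToyUpdateIndex` is a hypothetical class in which a document is just a map of field names to values, keyed by its `id`. In Lucene the corresponding call is `IndexWriter.updateDocument(Term, document)`; the sketch only illustrates why a partial replacement drops the other fields.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: documents are field maps looked up by the value of their "id" field.
class ToyUpdateIndex {
    private final Map<String, Map<String, String>> docsById = new HashMap<>();

    void addDocument(Map<String, String> doc) {
        docsById.put(doc.get("id"), new LinkedHashMap<>(doc));
    }

    void deleteDocuments(String id) {
        docsById.remove(id);
    }

    // "Update" is delete followed by add: nothing from the old document is merged in.
    void updateDocument(String id, Map<String, String> replacement) {
        deleteDocuments(id);
        addDocument(replacement);
    }

    Map<String, String> get(String id) {
        return docsById.get(id);
    }

    public static void main(String[] args) {
        ToyUpdateIndex index = new ToyUpdateIndex();

        Map<String, String> original = new LinkedHashMap<>();
        original.put("id", "1");
        original.put("teamname", "fpx");
        original.put("players", "doinb,tian,lwx,crisp,gimgoong");
        index.addDocument(original);

        // A replacement that carries only the field being changed.
        Map<String, String> partial = new LinkedHashMap<>();
        partial.put("id", "1");
        partial.put("teamname", "FPX");
        index.updateDocument("1", partial);

        // The players field is gone because the replacement did not include it.
        System.out.println(index.get("1"));
    }
}
```

A safe update therefore reads or rebuilds the complete document first, changes the one field, and passes the whole document to the update call.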

In addition, only one IndexWriter may operate on a given index at any one time, so when designing a utility class, prefer to expose the IndexWriter as a singleton.
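
The one-writer rule can be pictured as a lock on the index folder. The sketch below is a simplified model in plain Java, not Lucene's implementation: `ToyLockedWriter` and `newTempIndexDir` are hypothetical names, and the lock is a plain file-existence check. Lucene does name its lock `write.lock` and fails with `LockObtainFailedException` when a second IndexWriter is opened on a locked directory, but its real locking is more robust than what is shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy writer that guards an index folder with a lock file.
class ToyLockedWriter implements AutoCloseable {
    private final Path lockFile;

    private ToyLockedWriter(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Succeeds only if no other writer currently holds the lock on this folder.
    static ToyLockedWriter open(Path indexDir) {
        Path lock = indexDir.resolve("write.lock");
        try {
            // Existence check and creation happen as one atomic operation.
            Files.createFile(lock);
            return new ToyLockedWriter(lock);
        } catch (FileAlreadyExistsException e) {
            throw new IllegalStateException("index is already open for writing: " + indexDir, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Releases the lock so that another writer can be opened.
    @Override
    public void close() {
        try {
            Files.deleteIfExists(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience for the demo: a fresh temporary folder standing in for an index location.
    static Path newTempIndexDir() {
        try {
            return Files.createTempDirectory("toy-index");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newTempIndexDir();
        try (ToyLockedWriter first = open(dir)) {
            try {
                open(dir); // a second writer on the same folder is rejected
            } catch (IllegalStateException e) {
                System.out.println("second writer rejected: " + e.getMessage());
            }
        }
    }
}
```

Sharing a single long-lived writer, as the singleton advice suggests, avoids running into this rejection in the first place.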

Searching is handled by a separate class, IndexSearcher.

1.1.4. IndexSearcher

This class runs queries against the index.

1.1.5. TopDocs

This class holds the top-scoring documents returned by IndexSearcher.search().

1.1.6. ScoreDoc

This class gives access to each individual hit in TopDocs: its document number (doc) and its score.
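
How a total hit count and a capped, score-ordered hit list relate can be sketched in plain Java. This is a toy model and not Lucene's collector: `ToyTopDocs`, `Hit` and `collect` are hypothetical names, and scores are supplied directly as an array in which a value of 0 or less means "no match".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy counterpart of a result object: total match count plus the n best hits.
class ToyTopDocs {
    // One hit: a document number and its relevance score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    final long totalHits; // every matching document, even those not returned
    final List<Hit> hits; // at most n hits, highest score first

    private ToyTopDocs(long totalHits, List<Hit> hits) {
        this.totalHits = totalHits;
        this.hits = hits;
    }

    static ToyTopDocs collect(float[] scoreByDoc, int n) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap: the weakest retained hit sits on top and is evicted first.
        PriorityQueue<Hit> weakestFirst = new PriorityQueue<>(byScore);
        long total = 0;
        for (int doc = 0; doc < scoreByDoc.length; doc++) {
            if (scoreByDoc[doc] <= 0f) {
                continue; // not a match
            }
            total++;
            weakestFirst.add(new Hit(doc, scoreByDoc[doc]));
            if (weakestFirst.size() > n) {
                weakestFirst.poll();
            }
        }
        List<Hit> best = new ArrayList<>(weakestFirst);
        best.sort(byScore.reversed());
        return new ToyTopDocs(total, best);
    }

    public static void main(String[] args) {
        ToyTopDocs top = collect(new float[] {0.2f, 0f, 0.9f, 0.5f}, 2);
        System.out.println("total hits: " + top.totalHits);
        for (Hit hit : top.hits) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

This mirrors the shape of the earlier example, where `search(query, 10)` caps the returned hits at 10 while the total hit count is reported separately.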

1.1.7. Document

A Lucene document. It can be thought of as one row (record) in a database table.

1.1.8. Field

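A Field is part of a Document, and the same field can appear more than once in a document. It can be thought of as a column in a database table.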

1.1.8.1. Field types

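Name	Description
IntPoint	Indexes an int field; indexed only, not stored
FloatPoint	Indexes a float field; indexed only, not stored
LongPoint	Indexes a long field; indexed only, not stored
DoublePoint	Indexes a double field; indexed only, not stored
BinaryDocValuesField	Stores a per-document binary value without sharing it across documents, e.g. title-like fields
NumericDocValuesField	Stores a per-document long value for scoring, sorting and value lookup; to return the value with search results, add a separate StoredField instance
SortedDocValuesField	Stores a per-document value used for sorting on String fields; add a SortedDocValuesField with the same name alongside the StringField
StringField	Indexed but not tokenized; the whole string is indexed as a single term, typically used for fields such as country or id
TextField	Indexed and tokenized; term vectors are not included
StoredField	Stores the field value; stored fields and their values can be retrieved with IndexSearcher.doc and IndexReader.document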
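
The practical difference between an untokenized field (StringField) and a tokenized one (TextField) can be sketched with a toy inverted index in plain Java. This is not Lucene code: `ToyFieldIndex`, `addKeyword`, `addText` and the field name `playersText` are hypothetical, and splitting on commas and whitespace merely stands in for a real analyzer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index: field name -> term -> ids of the documents containing that term.
class ToyFieldIndex {
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // The whole value becomes one term, as with an untokenized field.
    void addKeyword(int docId, String field, String value) {
        index(docId, field, value);
    }

    // The value is split into terms first, as with a tokenized field.
    void addText(int docId, String field, String value) {
        for (String token : value.toLowerCase(Locale.ROOT).split("[,\\s]+")) {
            if (!token.isEmpty()) {
                index(docId, field, token);
            }
        }
    }

    private void index(int docId, String field, String term) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new HashSet<>())
                .add(docId);
    }

    // Exact term lookup with no analysis of the query text.
    Set<Integer> termQuery(String field, String term) {
        Map<String, Set<Integer>> byTerm = postings.get(field);
        if (byTerm == null || !byTerm.containsKey(term)) {
            return Collections.emptySet();
        }
        return byTerm.get(term);
    }

    public static void main(String[] args) {
        ToyFieldIndex index = new ToyFieldIndex();
        String players = "doinb,tian,lwx,crisp,gimgoong";
        index.addKeyword(0, "players", players);
        index.addText(0, "playersText", players);

        System.out.println(index.termQuery("players", "doinb"));     // no match: only the full string is a term
        System.out.println(index.termQuery("players", players));     // matches document 0
        System.out.println(index.termQuery("playersText", "doinb")); // matches document 0
    }
}
```

Under this model, a field such as `players` in the earlier example, which is added as a StringField, is only found by a term query for the complete comma-separated string. Searching for an individual player name would need a tokenized field.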