Lucene Basics (1)

Date: 2019-11-21

Tags: lucene, basics

1. Introduction to Lucene

http://www.kailing.pub/index/columns/colid/16.html

Documentation: http://lucene.apache.org/core/5_5_2/index.html

API: http://lucene.apache.org/core/5_5_2/core/overview-summary.html

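According to the official site: Lucene is a Java full-text search engine. Lucene is not a complete application, but rather a code library and API that can easily be used to add search capabilities to applications.

Lucene is a full-text search engine component (see Wikipedia). It is rarely used on its own; in practice people usually use Solr or Elasticsearch, both of which are built on it. Because the version used with Elasticsearch here is 5.x, all the demos below use Lucene 5.x.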

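Why do we need a search engine? An RDBMS such as MySQL is not well suited to full-text indexing. Suppose we want to run behavior analysis on a website's logs: data at that scale does not belong in MySQL, and even if it could be loaded, neither LIKE '%...%' matching nor the MyISAM FULLTEXT index is a good fit.

Search engine technology is the answer: it finds documents by keyword. The underlying search algorithms are very complex, though, and it is not reasonable to expect programmers who do not specialize in search to spend a lot of time mastering them. Lucene's role is to wrap those complex algorithms in an API that is comparatively easy to use.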

Original article: http://www.kailing.pub/article/index/arcid/72.html, which is very detailed.

Full-text retrieval broadly consists of two processes: index creation (indexing) and searching (search).

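1.1 Inverted index structure

Suppose there are 100 documents, numbered 1 to 100. An inverted index maps each term to the list of document numbers that contain it. To find the documents that contain both lucene and solr, take the intersection of the two lists.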
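The lookup described above can be sketched in plain Java. This is only an illustration, not Lucene's actual data structure: the class name InvertedIndexSketch and the document numbers are made up, and each posting list is assumed to be sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {
    // term -> ascending list of document numbers that contain the term
    static final Map<String, List<Integer>> INDEX = new TreeMap<>();
    static {
        // Hypothetical document numbers, chosen only for the example.
        INDEX.put("lucene", Arrays.asList(1, 4, 7, 20, 63));
        INDEX.put("solr", Arrays.asList(4, 9, 20, 51, 63, 88));
    }

    // Intersects two ascending lists by walking both with one cursor each.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) {          // same document in both lists
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {    // advance whichever cursor is behind
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [4, 20, 63]
        System.out.println(intersect(INDEX.get("lucene"), INDEX.get("solr")));
    }
}
```

Because both lists are sorted, the intersection needs only a single pass over each list.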

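1.2 Index creation process

Step 1: Read the files into memory via IO to obtain Documents.

Step 2: Pass each Document to the tokenizer component (TOKENIZER), which:

    splits the document into words

    removes punctuation

    removes stop words

    The result of tokenization is called a token.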

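Step 3: Pass the tokens to the language processing component (LINGUISTIC PROCESSOR), which processes them further (using English as the example):

    converts everything to lowercase

    reduces words to their root form, for example "cars" -> "car"; this is called stemming

    transforms words into their root form, for example "drove" -> "drive"; this is called lemmatization

    (Note that stemming and lemmatization convert words in different ways, so the algorithms that implement them are also quite different.)

    The result produced by the linguistic processor is called a term.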

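Step 4: Pass the terms to the index component (INDEXER), which:

    builds a dictionary from the terms

    sorts the dictionary alphabetically

    merges identical terms into posting lists (the linked lists of the inverted index)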
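Steps 2 to 4 above can be sketched end to end in plain Java. This is a toy illustration under loud assumptions, not how Lucene implements analysis: the class IndexingSketch, its stop word list, and the "strip a trailing s" stemmer are all invented for the example, and lemmatization is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

class IndexingSketch {
    // Assumption: a tiny hand-picked stop word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "of", "to", "is"));

    // Step 2: split into words, dropping punctuation and stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t.toLowerCase(Locale.ROOT))) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 3: lowercase, then a toy stemmer that only strips a plural "s".
    static String toTerm(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            t = t.substring(0, t.length() - 1);
        }
        return t;
    }

    // Step 4: sorted dictionary; identical terms are merged into one posting list.
    static SortedMap<String, List<Integer>> buildIndex(List<String> docs) {
        SortedMap<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1; // documents are numbered from 1
            for (String token : tokenize(docs.get(i))) {
                List<Integer> postings =
                        index.computeIfAbsent(toTerm(token), k -> new ArrayList<>());
                // Record each document at most once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Students love Lucene cars.",
                "The car of a student.");
        // prints {car=[1, 2], love=[1], lucene=[1], student=[1, 2]}
        System.out.println(buildIndex(docs));
    }
}
```

A TreeMap keeps the dictionary sorted by term, which mirrors the alphabetical sort in step 4.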

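1.3 Search process

step 1: The user enters a query

Queries follow a syntax, much as SQL does. In full-text search the syntax varies by implementation, but the most basic operators are AND, OR, and NOT.

step 2: Lexical analysis, syntax analysis, and language processing of the query

    1. Lexical analysis identifies the words and the keywords.

        For example, the user enters Lucene AND learned NOT hadoop. Analysis yields the words lucene, learned, and hadoop, and the keywords AND and NOT.

    2. Syntax analysis builds a syntax tree from the query according to the grammar rules.

    3. Language processing is applied to the syntax tree. It is almost identical to the linguistic processor step used during indexing (for example, learned becomes learn).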
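The lexical analysis in item 1 can be sketched as follows. This is a minimal illustration, not Lucene's QueryParser: the class QueryLexerSketch is hypothetical, tokens are assumed to be separated by whitespace, and only the uppercase forms AND, OR, and NOT are treated as keywords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class QueryLexerSketch {
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList("AND", "OR", "NOT"));

    // Splits a query on whitespace and sorts each token into words or keywords.
    static Map<String, List<String>> lex(String query) {
        List<String> words = new ArrayList<>();
        List<String> keywords = new ArrayList<>();
        for (String tok : query.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue; // only happens for a blank query
            }
            if (KEYWORDS.contains(tok)) {
                keywords.add(tok);
            } else {
                words.add(tok.toLowerCase(Locale.ROOT));
            }
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        out.put("words", words);
        out.put("keywords", keywords);
        return out;
    }

    public static void main(String[] args) {
        // prints {words=[lucene, learned, hadoop], keywords=[AND, NOT]}
        System.out.println(lex("Lucene AND learned NOT hadoop"));
    }
}
```

A real parser would go on to build the syntax tree from this token stream; the sketch stops at classification.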

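step 3: Use the index to find the documents that match the syntax tree.

First, look up in the inverted index the document lists that contain lucene, learn, and hadoop respectively.
Next, merge (intersect) the lists for lucene and learn to obtain the list of documents that contain both lucene and learn.
Then take the difference between that list and the hadoop list, removing every document that contains hadoop. This leaves the documents that contain both lucene and learn but not hadoop.
That document list is the result we are looking for.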
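The merge and difference operations above can be sketched with sorted sets. This is an illustration only: the class BooleanSearchSketch, the evaluate helper, and the document numbers are assumptions made for the example, and Lucene's own posting list handling is more elaborate.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class BooleanSearchSketch {
    // Returns documents containing every required term and none of the excluded terms.
    static SortedSet<Integer> evaluate(Map<String, List<Integer>> index,
                                       List<String> required,
                                       List<String> excluded) {
        SortedSet<Integer> result = new TreeSet<>();
        boolean first = true;
        for (String term : required) {
            List<Integer> postings = index.getOrDefault(term, Collections.emptyList());
            if (first) {
                result.addAll(postings);    // start from the first term's documents
                first = false;
            } else {
                result.retainAll(postings); // AND: keep only documents in both lists
            }
        }
        for (String term : excluded) {
            // NOT: drop every document that contains the excluded term
            result.removeAll(index.getOrDefault(term, Collections.emptyList()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical posting lists, chosen only for the example.
        Map<String, List<Integer>> index = new HashMap<>();
        index.put("lucene", Arrays.asList(1, 3, 5, 8, 13));
        index.put("learn", Arrays.asList(3, 5, 8, 21));
        index.put("hadoop", Arrays.asList(5, 34));

        // lucene AND learn NOT hadoop; prints [3, 8]
        System.out.println(evaluate(index,
                Arrays.asList("lucene", "learn"),
                Arrays.asList("hadoop")));
    }
}
```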

step 4: Sort the result documents by relevance

    This is very complex and is omitted here.

step 5: Return the results

2. Lucene Hello World

Maven dependencies:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>5.5.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>5.5.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>5.5.3</version>
</dependency>

Creating the index:

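import java.io.File;
import java.io.FileReader;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Indexer {
    public IndexWriter writer;
    /**
     * Creates the index writer.
     */
    public Indexer(String indexDir) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(); // analyzer (tokenizer)
        IndexWriterConfig writerConfig = new IndexWriterConfig(analyzer); // index writer configuration
        //Directory ramDirectory = new RAMDirectory(); // alternative: keep the index in memory
        Directory directory = FSDirectory.open(Paths.get(indexDir)); // on-disk location of the index
        writer = new IndexWriter(directory, writerConfig); // create the index writer
    }
    /**
     * Closes the index writer.
     * @throws Exception
     */
    public void close() throws Exception {
        writer.close();
    }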
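    /**
     * Indexes every file in the given directory.
     * @param dataDir directory that holds the files to index
     * @return the number of documents in the index
     * @throws Exception
     */
    public int index(String dataDir) throws Exception {
        File[] files = new File(dataDir).listFiles(); // null if dataDir is not a readable directory
        if (files == null) {
            throw new IllegalArgumentException("Not a readable directory: " + dataDir);
        }
        for (File file : files) {
            if (file.isFile()) { // skip subdirectories
                indexFile(file);
            }
        }
        return writer.numDocs();
    }
    public void indexFile(File file) throws Exception {
        System.out.println("Indexing file: " + file.getCanonicalPath()); // print the path of the file being indexed
        Document document = getDocument(file); // build a document, comparable to one table row
        writer.addDocument(document); // write it to the index, comparable to inserting a row
    }

    /**
     * Builds the document record for a file.
     * @param file the file to index
     * @return the document to add to the index
     * @throws Exception
     */
    public Document getDocument(File file) throws Exception {
        Document document = new Document(); // create an empty document
        document.add(new TextField("context", new FileReader(file))); // file contents; a field is comparable to a table column
        document.add(new TextField("fileName", file.getName(), Field.Store.YES)); // file name, stored
        document.add(new TextField("filePath", file.getCanonicalPath(), Field.Store.YES)); // file path, stored
        return document;
    }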
    public static void main(String[] args) {
        String indexDir = "G:\\projects-helloworld\\lucene\\src\\main\\resources\\LuceneIndex";
        String dataDir = "G:\\projects-helloworld\\lucene\\src\\main\\resources\\LuceneTestData";
        Indexer indexer = null;
        int indexSum = 0;
        try {
            indexer = new Indexer(indexDir);
            indexSum = indexer.index(dataDir);
            System.out.println("Indexed " + indexSum + " files");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (indexer != null) { // the constructor may have failed
                    indexer.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

}

Querying with the index:

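import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {
    public static void search(String indexDir, String q) throws Exception {
        // try-with-resources closes the reader and directory even if the search fails
        try (Directory dir = FSDirectory.open(Paths.get(indexDir)); // index location
             IndexReader reader = DirectoryReader.open(dir)) { // index reader
            IndexSearcher is = new IndexSearcher(reader);
            Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, same as at index time
            QueryParser parser = new QueryParser("context", analyzer); // default field to search
            Query query = parser.parse(q); // parse the query string
            TopDocs hits = is.search(query, 10); // run the search, keeping the top 10 hits
            System.out.println("Query \"" + q + "\" matched " + hits.totalHits + " documents");
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = is.doc(scoreDoc.doc);
                System.out.println(doc.get("fileName")); // print the stored fileName field
            }
        }
    }
    public static void main(String[] args) {
        String indexDir = "G:\\projects-helloworld\\lucene\\src\\main\\resources\\LuceneIndex";
        String q = "file";
        try {
            search(indexDir, q);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}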

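3. About Luke

Luke is a commonly used tool for working with Lucene. GitHub: https://github.com/DmitryKey/luke/tree/pivot-luke-5.5.0

It is a plain Maven project; run mvn install to build the jar, which can then be used directly.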
