My Full-Text Search Wrapper: The Lucene Part

Date: 2019-11-11


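Recently I have been using evenings after work and weekends to put together my own full-text search wrapper (based on Lucene and Solr). Here is a write-up of the general approach: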

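1. First comes the index object, which also serves as the query VO. It wraps a few common fields (primary key, owner ID, owner name, link to the detail page, creation time, and so on) plus the fields specific to each module (title, content, email, and so on).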

SearchBean.java

The fields are as follows:

/******** common fields: begin ***********/
    /**
     * The content to search on
     */
    protected String keyword;
    /**
     * Owner ID
     */
    protected String owerId;
    /**
     * Owner name
     */
    protected String owerName;
    /**
     * Value of the unique identifier of the indexed object
     */
    protected String id;
    /**
     * Link to the detail page of the object found by a search
     */
    protected String link;
    /**
     * Creation time
     */
    protected String createDate;
    /**
     * Index type
     */
    protected String indexType;

    // setters and getters omitted
/******** common fields: end ***********/

/************* other fields: begin ************/
    /**
     * Map from each field to be retrieved to its value
     */
    private Map<String, String> searchValues;

    /**
     * The value object
     */
    private Object object;

    /**
     * Gets the values of the doIndexFields fields returned by a search
     *
     * @return map of field name to value
     */
    public Map<String, String> getSearchValues() {
        return searchValues;
    }

    /**
     * Sets the values of the doIndexFields fields returned by a search
     *
     * @param searchValues map of field name to value
     */
    public void setSearchValues(Map<String, String> searchValues) {
        this.searchValues = searchValues;
    }
    /******************** other fields: end *******************/

The abstract methods are as follows:

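/***************** abstract methods: begin ******************/
    /**
     * Returns the fields to search on
     *
     * @return names of the fields to search on
     */
    public abstract String[] getDoSearchFields();

    /**
     * Returns the fields to be indexed
     *
     * @return names of the fields to be indexed
     */
    public abstract String[] getDoIndexFields();

    /**
     * Initializes the common fields of the SearchBean (index fields that every object must create)
     * @throws Exception
     */
    public abstract void initPublicFields() throws Exception;

    /**
     * Returns the index type
     *
     * @return the index type
     */
    public abstract String getIndexType();
    /***************** abstract methods: end ********************/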

The shared methods:

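/******************* shared methods: begin **********************/
    /**
     * Gets the key/value map of the fields that need to be indexed
     *
     * @return map of field name to value, or an empty map if there is nothing to index
     */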
    public Map<String, String> getIndexFieldValues() {
        if(this.object == null){
            logger.warn("given object is null!");
            return Collections.emptyMap();
        }

        String[] doIndexFields = this.getDoIndexFields();
        if(doIndexFields == null || doIndexFields.length < 1){
            logger.debug("given no doIndexFields!");
            return Collections.emptyMap();
        }

        Map<String, String> extInfo = new HashMap<String, String>();
        for(String f : doIndexFields){
            String value = getValue(f, object);
            extInfo.put(f, value);
        }

        return extInfo;
    }

    /**
     * Gets the value of one field of an object, converted to a String
     *
     * @param field         field name
     * @param obj           object to read the field from
     * @return the field value as a String, or an empty string if it is null or cannot be read
     */
    private String getValue(String field, Object obj){
        if(StringUtils.isEmpty(field)){
            logger.warn("field is empty!");
            return StringUtils.EMPTY;
        }

        String result = StringUtils.EMPTY;
        try {
            // read the field from the obj parameter
            Object value = ObjectUtils.getFieldValue(obj, field);
            if (value == null)
                result = StringUtils.EMPTY;
            else if (value instanceof String)
                result = (String) value;
            // java.util.Collection and Map values are dumped reflectively
            else if (value instanceof Collection || value instanceof Map)
                result = ToStringBuilder.reflectionToString(value);
            else if (value instanceof Date)
                result = DateUtils.formatDate((Date) value);
            else
                result = value.toString();

        } catch (IllegalAccessException e) {
            logger.error("can not find a value for field '{}' in object class '{}'!", field, obj.getClass());
        }

        return result;
    }

    /**
     * Must be called before creating the index: sets the object whose index will be created.
     *
     * @param object            the object to be indexed
     */
    public void setObject(Object object){
        this.object = object;
    }

    /**
     * Gets the object to be indexed.
     *
     * @return the object to be indexed
     */
    public Object getObject(){
        return this.object;
    }
    /*************** shared methods: end *************/
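
ObjectUtils.getFieldValue is a project utility whose source is not shown in this post. As a rough idea of what getIndexFieldValues does with it, here is a minimal sketch using only java.lang.reflect. The class FieldValueSketch, its methods, the date pattern, and the Message sample class are all assumptions made up for illustration, not the actual implementation.

```java
import java.lang.reflect.Field;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

class FieldValueSketch {
    /** Hypothetical stand-in for ObjectUtils.getFieldValue: walks up the class hierarchy. */
    static Object readField(Object target, String name) throws IllegalAccessException {
        for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Field f = c.getDeclaredField(name);
                f.setAccessible(true);          // allow reading private fields
                return f.get(target);
            } catch (NoSuchFieldException ignored) {
                // not declared on this class, try the superclass
            }
        }
        return null;                            // unknown field
    }

    /** Converts a field value to the text that would be put into the index. */
    static String asIndexText(Object value) {
        if (value == null) return "";
        if (value instanceof String) return (String) value;
        if (value instanceof Date) {
            // assumed date pattern; the real DateUtils.formatDate may differ
            return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format((Date) value);
        }
        return value.toString();
    }

    /** Builds the field name to text map for the given field names. */
    static Map<String, String> indexFieldValues(Object target, String[] fields) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (String field : fields) {
            try {
                out.put(field, asIndexText(readField(target, field)));
            } catch (IllegalAccessException e) {
                out.put(field, "");             // unreadable field: index an empty value
            }
        }
        return out;
    }

    /** Hypothetical value object used only for this demo. */
    static class Message {
        private String title = "hello";
        private int replies = 3;
        private String email;                   // left null on purpose
    }

    public static void main(String[] args) {
        System.out.println(indexFieldValues(new Message(),
                new String[]{"title", "replies", "email"}));
    }
}
```

Null and unreadable fields end up as empty strings here, which mirrors how getValue falls back to StringUtils.EMPTY.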

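2. There are plenty of open-source and closed-source search engines that can be used in a project, so I wrote an interface plus an abstract class that factors out some shared methods. You only need to put the engine-specific code (index creation, searching, and so on) into a subclass of that abstract class, and you can then switch the target engine freely. Here are the interface and the abstract class: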

SearchEngine.java

package com.message.base.search.engine;

import com.message.base.pagination.PaginationSupport;
import com.message.base.search.SearchBean;

import java.util.List;

/**
 * A search engine implementation builds, deletes, and updates indexes, and runs searches.
 *
 * @author sunhao(sunhao.java@gmail.com)
 * @version V1.0
 * @createTime 13-5-5 1:38 AM
 */
public interface SearchEngine {

    /**
     * Creates the index (implementations should be thread-safe)
     *
     * @param searchBeans       the objects to index
     * @throws Exception
     */
    public void doIndex(List<SearchBean> searchBeans) throws Exception;

    /**
     * Deletes the index entry of one object
     *
     * @param bean              the object
     * @throws Exception
     */
    public void deleteIndex(SearchBean bean) throws Exception;

    /**
     * Deletes the index entries of several objects
     *
     * @param beans             the objects
     * @throws Exception
     */
    public void deleteIndexs(List<SearchBean> beans) throws Exception;

    /**
     * Runs a search
     *
     * @param bean              search object (usually only keyword, the search term, needs to be set)
     * @param isHighlighter     whether to highlight matches
     * @param start             index of the first record to return
     * @param num               number of records to return, counted from start
     * @return the page of results
     * @throws Exception
     */
    public PaginationSupport doSearch(SearchBean bean, boolean isHighlighter, int start, int num) throws Exception;

    /**
     * Runs a search over several search objects
     *
     * @param beans             search objects (usually only keyword, the search term, needs to be set)
     * @param isHighlighter     whether to highlight matches
     * @param start             index of the first record to return
     * @param num               number of records to return, counted from start
     * @return the page of results
     * @throws Exception
     */
    public PaginationSupport doSearch(List<SearchBean> beans, boolean isHighlighter, int start, int num) throws Exception;

    /**
     * Deletes all index entries of one type (implementations should be thread-safe)
     *
     * @param clazz             the index type
     * @throws Exception
     */
    public void deleteIndexsByIndexType(Class<? extends SearchBean> clazz) throws Exception;

    /**
     * Deletes all index entries of one type (implementations should be thread-safe)
     *
     * @param indexType         the index type
     * @throws Exception
     */
    public void deleteIndexsByIndexType(String indexType) throws Exception;

    /**
     * Deletes all indexes
     *
     * @throws Exception
     */
    public void deleteAllIndexs() throws Exception;

    /**
     * Updates the index entry of one object
     *
     * @param searchBean        the bean to update
     * @throws Exception
     */
    public void updateIndex(SearchBean searchBean) throws Exception;

    /**
     * Updates index entries in batch
     *
     * @param searchBeans       the beans to update
     * @throws Exception
     */
    public void updateIndexs(List<SearchBean> searchBeans) throws Exception;
}

AbstractSearchEngine.java

package com.message.base.search.engine;

import com.message.base.pagination.PaginationSupport;
import com.message.base.pagination.PaginationUtils;
import com.message.base.search.SearchBean;
import com.message.base.utils.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Collections;

/**
 * Methods shared by all search engines.
 *
 * @author sunhao(sunhao.java@gmail.com)
 * @version V1.0
 * @createTime 13-5-8 10:53 PM
 */
public abstract class AbstractSearchEngine implements SearchEngine {
    private static final Logger logger = LoggerFactory.getLogger(AbstractSearchEngine.class);

    /**
     * Prefix of the HTML fragment used when highlighting
     */
    private String htmlPrefix = "<p>";
    /**
     * Suffix of the HTML fragment used when highlighting
     */
    private String htmlSuffix = "</p>";

    public String getHtmlPrefix() {
        return htmlPrefix;
    }

    public void setHtmlPrefix(String htmlPrefix) {
        this.htmlPrefix = htmlPrefix;
    }

    public String getHtmlSuffix() {
        return htmlSuffix;
    }

    public void setHtmlSuffix(String htmlSuffix) {
        this.htmlSuffix = htmlSuffix;
    }

    public PaginationSupport doSearch(SearchBean bean, boolean isHighlighter, int start, int num) throws Exception {
        if(bean == null){
            logger.debug("given search bean is empty!");
            return PaginationUtils.getNullPagination();
        }

        return doSearch(Collections.singletonList(bean), isHighlighter, start, num);
    }

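    /**
     * Gets the index type: the bean's own indexType if set, otherwise its simple class name
     *
     * @param bean the search bean
     * @return the index type
     */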
    public String getIndexType(SearchBean bean){
        return StringUtils.isNotEmpty(bean.getIndexType()) ? bean.getIndexType() : bean.getClass().getSimpleName();
    }
}
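
To make the "switch engines freely" idea concrete, here is a minimal, self-contained sketch of the same interface / abstract class / subclass layering. Everything in it (MiniEngine, AbstractMiniEngine, InMemoryEngine) is a hypothetical, trimmed-down stand-in for SearchEngine and AbstractSearchEngine, with a toy in-memory engine taking the place of a Lucene- or Solr-backed subclass.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EngineSwitchSketch {
    /** Hypothetical, much smaller counterpart of SearchEngine. */
    interface MiniEngine {
        void doIndex(String id, String text);
        List<String> doSearch(String keyword);
    }

    /** Shared helpers live in the abstract base, as in AbstractSearchEngine. */
    static abstract class AbstractMiniEngine implements MiniEngine {
        protected boolean isBlank(String s) {
            return s == null || s.trim().isEmpty();
        }
    }

    /** Toy engine: stores text in a map and matches by substring. */
    static class InMemoryEngine extends AbstractMiniEngine {
        private final Map<String, String> docs = new LinkedHashMap<String, String>();

        public void doIndex(String id, String text) {
            docs.put(id, text);
        }

        public List<String> doSearch(String keyword) {
            List<String> hits = new ArrayList<String>();
            if (isBlank(keyword)) return hits;          // nothing to search for
            for (Map.Entry<String, String> e : docs.entrySet()) {
                if (e.getValue().contains(keyword)) hits.add(e.getKey());
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        // the caller only depends on the interface, so the engine can be swapped here
        MiniEngine engine = new InMemoryEngine();
        engine.doIndex("1", "lucene full text search");
        engine.doIndex("2", "solr notes");
        System.out.println(engine.doSearch("lucene"));
    }
}
```

Calling code that holds only a MiniEngine reference does not change when another subclass is plugged in, which is the property the real SearchEngine interface is meant to provide.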

3. Now let's talk about Lucene

Code first:

LuceneSearchEngine.java

package com.message.base.search.engine;

import com.message.base.pagination.PaginationSupport;
import com.message.base.pagination.PaginationUtils;
import com.message.base.search.SearchBean;
import com.message.base.search.SearchInitException;
import com.message.base.utils.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.BeanUtils;

import java.io.File;
import java.io.IOException;
import java.util.*;

/**
 * Search engine implemented on top of Lucene.
 *
 * @author sunhao(sunhao.java@gmail.com)
 * @version V1.0
 * @createTime 13-5-5 10:38 AM
 */
public class LuceneSearchEngine extends AbstractSearchEngine {
    private static final Logger logger = LoggerFactory.getLogger(LuceneSearchEngine.class);
    /**
     * Path where the indexes are stored
     */
    private String indexPath;
    /**
     * Analyzer (tokenizer)
     */
    private Analyzer analyzer = new SimpleAnalyzer();

    public synchronized void doIndex(List<SearchBean> searchBeans) throws Exception {
        this.createOrUpdateIndex(searchBeans, true);
    }

    public synchronized void deleteIndex(SearchBean bean) throws Exception {
        if(bean == null){
            logger.warn("Get search bean is empty!");
            return;
        }

        String id = bean.getId();

        if(StringUtils.isEmpty(id)){
            logger.warn("get id and id value from bean is empty!");
            return;
        }
        String indexType = getIndexType(bean);
        Directory indexDir = this.getIndexDir(indexType);
        IndexWriter writer = this.getWriter(indexDir);

        writer.deleteDocuments(new Term("pkId", id));
        writer.commit();
        this.destroy(writer);
    }

    public synchronized void deleteIndexs(List<SearchBean> beans) throws Exception {
        if(beans == null){
            logger.warn("Get beans is empty!");
            return;
        }

        for(SearchBean bean : beans){
            this.deleteIndex(bean);
        }
    }

    public PaginationSupport doSearch(List<SearchBean> beans, boolean isHighlighter, int start, int num) throws Exception {
        if(beans == null || beans.isEmpty()){
            logger.debug("given search beans is empty!");
            return PaginationUtils.getNullPagination();
        }

        List queryResults = new ArrayList();
        int count = 0;
        for(SearchBean bean : beans){
            String indexType = getIndexType(bean);

            // fields to search on; checked before the reader is opened so no reader is left unclosed
            String[] doSearchFields = bean.getDoSearchFields();
            if(doSearchFields == null || doSearchFields.length == 0)
                return PaginationUtils.getNullPagination();

            IndexReader reader = IndexReader.open(this.getIndexDir(indexType));

            List<String> fieldNames = new ArrayList<String>();             // names of the fields to query
            List<String> queryValue = new ArrayList<String>();             // value to look for in each field
            List<BooleanClause.Occur> flags = new ArrayList<BooleanClause.Occur>();

            // default: search the keyword in every search field
            if(StringUtils.isNotEmpty(bean.getKeyword())){
                for(String field : doSearchFields){
                    fieldNames.add(field);
                    queryValue.add(bean.getKeyword());
                    flags.add(BooleanClause.Occur.SHOULD);
                }
            }

            Query query = MultiFieldQueryParser.parse(Version.LUCENE_CURRENT, queryValue.toArray(new String[]{}), fieldNames.toArray(new String[]{}),
                    flags.toArray(new BooleanClause.Occur[]{}), analyzer);

            logger.debug("make query string is '{}'!", query.toString());
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] scoreDocs = searcher.search(query, 1000000).scoreDocs;

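            // position of the first record to return (start == -1 && num == -1 means "return everything")
            int begin = (start == -1 && num == -1) ? 0 : start;
            // position just past the last record to return
            int end = (start == -1 && num == -1) ? scoreDocs.length : Math.min(begin + num, scoreDocs.length);

            // highlighting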
            Highlighter highlighter = null;
            if(isHighlighter){
                SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(this.getHtmlPrefix(), this.getHtmlSuffix());
                highlighter = new Highlighter(formatter, new QueryScorer(query));
            }

            List<SearchBean> results = new ArrayList<SearchBean>();
            for (int i = begin; i < end; i++) {
                SearchBean result = BeanUtils.instantiate(bean.getClass());

                int docID = scoreDocs[i].doc;
                Document hitDoc = searcher.doc(docID);

                result.setId(hitDoc.get("pkId"));
                result.setLink(hitDoc.get("link"));
                result.setOwerId(hitDoc.get("owerId"));
                result.setOwerName(hitDoc.get("owerName"));
                result.setCreateDate(hitDoc.get("createDate"));
                result.setIndexType(indexType);

                String keyword = StringUtils.EMPTY;
                if(isHighlighter && highlighter != null)
                    keyword = highlighter.getBestFragment(analyzer, "keyword", hitDoc.get("keyword"));

                if(StringUtils.isEmpty(keyword))
                    keyword = hitDoc.get("keyword");

                result.setKeyword(keyword);

                Map<String, String> extendValues = new HashMap<String, String>();
                for(String field : doSearchFields){
                    String value = hitDoc.get(field);
                    // only highlight fields that were actually stored on this document
                    if(isHighlighter && highlighter != null && value != null)
                        value = highlighter.getBestFragment(analyzer, field, value);

                    if(StringUtils.isEmpty(value))
                        value = hitDoc.get(field);

                    extendValues.put(field, value);
                }

                result.setSearchValues(extendValues);

                results.add(result);
            }

            queryResults.addAll(results);
            count += scoreDocs.length;
            searcher.close();
            reader.close();
        }

        PaginationSupport paginationSupport = PaginationUtils.makePagination(queryResults, count, num, start);
        return paginationSupport;
    }

    public synchronized void deleteIndexsByIndexType(Class<? extends SearchBean> clazz) throws Exception {
        String indexType = getIndexType(BeanUtils.instantiate(clazz));
        this.deleteIndexsByIndexType(indexType);
    }

    public synchronized void deleteIndexsByIndexType(String indexType) throws Exception {
        // pass readOnly=false explicitly; readers are read-only by default
        IndexReader reader = IndexReader.open(this.getIndexDir(indexType), false);
        int result = reader.deleteDocuments(new Term("indexType", indexType));
        reader.close();
        logger.debug("the rows of delete index is '{}'! index type is '{}'!", result, indexType);
    }

    public synchronized void deleteAllIndexs() throws Exception {
        File indexFolder = new File(this.indexPath);
        if(!indexFolder.isDirectory()){
            // does not exist, or is not a directory
            logger.debug("indexPath is not a folder! indexPath: '{}'!", indexPath);
            return;
        }

        File[] children = indexFolder.listFiles();
        // listFiles() returns null when the directory cannot be read
        if(children == null) return;
        for(File child : children){
            if(child == null || !child.isDirectory()) continue;

            String indexType = child.getName();
            logger.debug("Get indexType is '{}'!", indexType);

            this.deleteIndexsByIndexType(indexType);
        }
    }

    public void updateIndex(SearchBean searchBean) throws Exception {
        this.updateIndexs(Collections.singletonList(searchBean));
    }

    public void updateIndexs(List<SearchBean> searchBeans) throws Exception {
        this.createOrUpdateIndex(searchBeans, false);
    }

    /**
     * Creates or updates index entries
     *
     * @param searchBeans       the objects to create or update
     * @param isCreate          true to create index entries, false to update them
     * @throws Exception
     */
    private synchronized void createOrUpdateIndex(List<SearchBean> searchBeans, boolean isCreate) throws Exception {
        if(searchBeans == null || searchBeans.isEmpty()){
            logger.debug("do no index!");
            return;
        }

        Directory indexDir = null;
        IndexWriter writer = null;
        for(Iterator<SearchBean> it = searchBeans.iterator(); it.hasNext(); ){
            SearchBean sb = it.next();
            if(sb == null){
                logger.debug("given SearchBean is null!");
                // close any open writer before returning so the index lock is released
                this.destroy(writer);
                return;
            }
            String indexType = getIndexType(sb);
            // true when this bean belongs to a different index directory than the previous one
            boolean anotherSearchBean = indexDir != null && !indexType.equals(((FSDirectory) indexDir).getFile().getName());
            if(indexDir == null || anotherSearchBean){
                indexDir = this.getIndexDir(indexType);
            }
            if(writer == null || anotherSearchBean){
                this.destroy(writer);
                writer = this.getWriter(indexDir);
            }

            Document doc = new Document();

            // initialize the common fields
            sb.initPublicFields();
            String id = sb.getId();

            // primary-key field: not used as a search field and not analyzed
            Field idField = new Field("pkId", id, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(idField);

            logger.debug("create id index for '{}', value is '{}'! index is '{}'!", new Object[]{"pkId", id, idField});

            String owerId = sb.getOwerId();
            if(StringUtils.isEmpty(owerId)){
                throw new SearchInitException("you must give a owerId");
            }
            Field owerId_ = new Field("owerId", owerId, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(owerId_);

            String owerName = sb.getOwerName();
            if(StringUtils.isEmpty(owerName)){
                throw new SearchInitException("you must give a owerName");
            }
            Field owerName_ = new Field("owerName", owerName, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(owerName_);

            String link = sb.getLink();
            if(StringUtils.isEmpty(link)){
                throw new SearchInitException("you must give a link");
            }
            Field link_ = new Field("link", link, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(link_);

            String keyword = sb.getKeyword();
            if(StringUtils.isEmpty(keyword)){
                throw new SearchInitException("you must give a keyword");
            }
            Field keyword_ = new Field("keyword", keyword, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(keyword_);

            String createDate = sb.getCreateDate();
            if(StringUtils.isEmpty(createDate)){
                throw new SearchInitException("you must give a createDate");
            }
            Field createDate_ = new Field("createDate", createDate, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(createDate_);

            // index type field
            Field indexType_ = new Field("indexType", indexType, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(indexType_);

            // the module-specific fields to index
            String[] doIndexFields = sb.getDoIndexFields();
            Map<String, String> indexFieldValues = sb.getIndexFieldValues();
            if(doIndexFields != null && doIndexFields.length > 0){
                for(String field : doIndexFields){
                    Field extInfoField = new Field(field, indexFieldValues.get(field), Field.Store.YES, Field.Index.ANALYZED,
                            Field.TermVector.WITH_POSITIONS_OFFSETS);

                    doc.add(extInfoField);
                }
            }

            if(isCreate)
                writer.addDocument(doc);
            else
                writer.updateDocument(new Term("pkId", sb.getId()), doc);

            // optimize() merges index segments; running it after every document is costly for large batches
            writer.optimize();
        }

        this.destroy(writer);
        logger.debug("create or update index success!");
    }

    public Directory getIndexDir(String suffix) throws Exception {
        return FSDirectory.open(new File(indexPath + File.separator + suffix));
    }

    public IndexWriter getWriter(Directory indexDir) throws IOException {
        return new IndexWriter(indexDir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void destroy(IndexWriter writer) throws Exception {
        if(writer != null)
            writer.close();
    }

    public void setIndexPath(String indexPath) {
        this.indexPath = indexPath;
    }

    public void setAnalyzer(Analyzer analyzer) {
        this.analyzer = analyzer;
    }

}
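
The paging arithmetic in doSearch (begin and end derived from start and num, with -1/-1 meaning "everything") can be looked at in isolation. The sketch below is an illustration under assumptions: the class PageWindow and the method window are hypothetical names, and the extra clamping of out-of-range values is an addition that the original code does not perform.

```java
class PageWindow {
    /**
     * Returns {begin, end} for a result list of size total.
     * start == -1 && num == -1 selects the whole list, as in doSearch.
     * Clamping to [0, total] is an assumption added for safety.
     */
    static int[] window(int start, int num, int total) {
        if (start == -1 && num == -1) {
            return new int[]{0, total};
        }
        int begin = Math.max(0, Math.min(start, total));
        int end = Math.min(begin + Math.max(0, num), total);
        return new int[]{begin, end};
    }

    public static void main(String[] args) {
        int[] w = window(5, 10, 7);     // second "page" of a 7-hit result
        System.out.println(w[0] + ".." + w[1]);
    }
}
```

Note that doSearch applies this window once per search bean, so a search over several index types can return up to num results for each type rather than num results in total.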

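I won't repeat how to use Lucene here; there is plenty of material online, and anything unclear can be Googled. Below are some of my own thoughts. If anything is wrong, feel free to criticize: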

....

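There is not much more to say for now; I'll add to this when something comes to mind. There is just one thing I find rather annoying: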

FSDirectory.open(new File("D:\\index\\xxx" /* a directory that does not exist, or one that is not an index */));

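When the index Directory is obtained with the line above, an error is raised if the directory does not exist. Some may think this is no big deal and exactly how it should behave, but the code I wrapped really does depend on it.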

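SearchBean.java above has a field called indexType; when it is not set, it defaults to the class name, for example MessageSearchBean. If I have never built an index for Message, searching throws an error. I still need to think of a way to solve this.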
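
One possible way around this, offered only as a sketch, is to check that the per-type directory exists and looks like an index before opening it, and to treat a missing index as "no results". The class IndexDirGuard and both method names are hypothetical, and using a file name starting with "segments" as the marker of a Lucene index is an assumption about the on-disk layout. Lucene 3.x also appears to offer IndexReader.indexExists(Directory), which may be a more direct check if it fits the version in use.

```java
import java.io.File;

class IndexDirGuard {
    /** True when at least one file name starts with "segments" (assumed marker of an index). */
    static boolean hasSegmentsFile(String[] fileNames) {
        if (fileNames == null) return false;
        for (String name : fileNames) {
            if (name != null && name.startsWith("segments")) return true;
        }
        return false;
    }

    /** True when dir exists, is a directory, and looks like it holds an index. */
    static boolean looksLikeIndex(File dir) {
        return dir != null && dir.isDirectory() && hasSegmentsFile(dir.list());
    }

    public static void main(String[] args) {
        // hypothetical layout: <indexPath>/<indexType>
        File dir = new File("D:\\index", "MessageSearchBean");
        if (!looksLikeIndex(dir)) {
            System.out.println("no index for this type yet, returning an empty result");
        } else {
            System.out.println("index found, safe to open");
        }
    }
}
```

In doSearch, a check like this placed before IndexReader.open could return PaginationUtils.getNullPagination() (or skip that bean) instead of letting the exception escape.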

PS: this is a blog and I can't upload code here, so I uploaded it to the code-sharing section instead: http://www.oschina.net/code/snippet_151849_21445
