Xapian's In-Memory Index

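  Keywords: xapian, in-memory index

  Besides the on-disk index used in production, xapian also provides an in-memory index (InMemoryDatabase). By looking at how the in-memory index is designed, we can learn about the design ideas behind xapian.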

1 Purpose

  The official documentation says:

  "inmemory, This type is a database held entirely in memory. It was originally written for testing purposes only, but may prove useful for building up temporary small databases."

  Notes from the source code of an earlier version:

  "This backend stores a database entirely in memory.  When the database is closed these indexed contents are lost. This is useful for searching through relatively small amounts of data (such as a single large file) which hasn't previously been indexed."

  A source code comment from an earlier version:

  "This is a prototype database, mainly used for debugging and testing"

  In short, this is a prototype DB that was originally meant only for testing and debugging. It has no persistence, so the data is lost as soon as the DB is closed. It can be used to search small amounts of data, and that data can be indexed in memory on the fly.

2 Using the in-memory index

/***************************************************************************
*
* @file ram_xapian.cpp
* @author cswuyg
* @date 2019/02/21
* @brief
*
**************************************************************************/
// The in-memory index uses a deprecated class; disable the resulting compile error (MSVC C4996).
#pragma warning(disable : 4996)
#include <iostream>
#include <string>
#include "xapian.h"

#pragma comment(lib, "libxapian.a")
#pragma comment(lib, "zdll.lib")

const char* const K_DB_PATH = "index_data";
const char* const K_DOC_UNIQUE_ID = "007";

Xapian::WritableDatabase createIndex() {
    std::cout << "--index start--" << std::endl;
    Xapian::WritableDatabase db = Xapian::InMemory::open();

    Xapian::Document doc;
    doc.add_posting("T世界", 1);
    doc.add_posting("T體育", 1);
    doc.add_posting("T比賽", 1);
    doc.set_data("世界體育比賽");
    doc.add_boolean_term(K_DOC_UNIQUE_ID);

    Xapian::docid innerId = db.replace_document(K_DOC_UNIQUE_ID, doc);

    std::cout << "add doc innerId=" << innerId << std::endl;

    db.commit();

    std::cout << "--index finish--" << std::endl;
    return db;
}

void queryIndex(Xapian::WritableDatabase db) {
    std::cout << "--search start--" << std::endl;
    Xapian::Query termOne = Xapian::Query("T世界");
    Xapian::Query termTwo = Xapian::Query("T比賽");
    Xapian::Query termThree = Xapian::Query("T體育");
    auto query = Xapian::Query(Xapian::Query::OP_OR, Xapian::Query(Xapian::Query::OP_OR, termOne, termTwo), termThree);
    std::cout << "query=" << query.get_description() << std::endl;

    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    Xapian::MSet result = enquire.get_mset(0, 10);
    std::cout << "find results count=" << result.get_matches_estimated() << std::endl;

    for (auto it = result.begin(); it != result.end(); ++it) {
        Xapian::Document doc = it.get_document();
        std::string data = doc.get_data();
        // Plain double/int avoid depending on the Xapian::weight / Xapian::percent typedefs.
        double docScoreWeight = it.get_weight();
        int docScorePercent = it.get_percent();

        std::cout << "doc=" << data << ",weight=" << docScoreWeight << ",percent=" << docScorePercent << std::endl;
    }

    std::cout << "--search finish--" << std::endl;
}

int main() {
    auto db = createIndex();
    queryIndex(db);
    return 0;
}

github: https://github.com/cswuyg/xapian_exercise/tree/master/ram_xapian

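3 Data structures

The in-memory index is built from a set of data structures, and they offer a glimpse of how xapian designs its indexes.

The data structures of the in-memory index are shown in the figure below:

[Figure: data structures of the in-memory index]

The main wrapper classes:

InMemoryPostList: the in-memory postlist of a single term; it operates on that term's inverted list;

InMemoryAllDocsPostList: the in-memory postlist of the whole DB; what it actually operates on is the termlist table (the doc table);

InMemoryTermList: the term list of a single doc;

InMemoryDatabase: the in-memory DB;

InMemoryAllTermsList: the in-memory termlist of the whole DB; in practice it is the DB's postlists;

InMemoryDocument: wrapper for operations on a single doc;

InMemoryPositionList: wrapper for operations on an in-memory position list.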
