An Analysis of Xapian's Index-Document Retrieval Process

    This article analyses Xapian's retrieval process and is fairly heavy on source code. Broadly speaking, retrieval means pulling up the posting lists, finding the valid docs, and then scoring and ranking them.

1 Theoretical Analysis

1.1 Query Syntax

Different search tasks bring different retrieval requirements, for example: term A and term B must both appear in the doc; either term A or term B may appear; either A or B appears but term C must not appear; and so on. In symbols:

A & B

A || B

(A || B) & ~C

( A & ( B || C ) ) || D


The retrieval requirements above are many and varied; implementing a separate piece of code for every single requirement is unrealistic, so a simple, efficient, and extensible query syntax is needed to support them all.

1.2 The Retrieval Process

First a query is assembled according to the business need, and then the retrieval engine's API is called to obtain the results.

For the engine internals, take xapian as the example: the user-assembled query is first turned into a query-tree, the query-tree is then converted into a postlist-tree (a tree of posting lists), and finally the result of evaluating the postlist-tree is collected. Relevance-formula computations (e.g. BM25) are interleaved through the construction of the postlist-tree and the final evaluation.

1.3 Relevance

There are several ways to compute query-doc relevance:

(1) Boolean Model
Check whether the user's terms appear in the document: if they do, the document is considered relevant to the user's need, otherwise not.
Pros: simple;
Cons: the outcome is binary, only YES or NO, with no ordering among multiple results;

(2) Vector Space Model
Vectorize both the query and the doc, then compute the cosine between the two vectors; that value is the similarity score of query and doc. Content similarity between query and document stands in for relevance here.
This model tends to suppress long documents.
Cosine formula: dot product of the vectors divided by the product of the vectors' lengths.
So how should things be vectorized, and what value should each dimension take?
Term-frequency factor (TF): the number of times a word occurs in the document. A log is usually taken for smoothing, so that using the raw count does not make a term occurring 10 times weigh wildly more than one occurring once. Common formula: Wtf = 1 + log(TF). The constant 1 prevents TF = 1, i.e. log(TF) = 0, from driving the weight W to 0.
Variant formula: Wtf = a + (1 - a) * TF/Max(TF), where a is a tuning factor usually set to 0.4 or 0.5, TF is the term frequency, and Max(TF) is the frequency of the most frequent word in the document. This variant helps suppress long documents and makes the term-frequency factors of documents of different lengths comparable.
Inverse document frequency factor (IDF): the reciprocal of the number of documents containing a word. If a word occurs in every document, it contributes little to telling documents apart and is not that important; conversely, a rare word is important.
Formula: IDFk = log(N/nk), where N is the total number of documents in the collection and nk is the number of documents the word occurs in.

TF*IDF
Weight = TF * IDF
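The cosine, TF, and IDF formulas above can be sketched directly in code. This is an illustrative sketch of the textbook formulas, not Xapian code; all function names here are made up for the example.

```cpp
#include <cmath>
#include <vector>

// Cosine similarity: dot(q, d) / (|q| * |d|).
double cosine(const std::vector<double>& q, const std::vector<double>& d) {
    double dot = 0, nq = 0, nd = 0;
    for (size_t i = 0; i < q.size(); ++i) {
        dot += q[i] * d[i];
        nq += q[i] * q[i];
        nd += d[i] * d[i];
    }
    return dot / (std::sqrt(nq) * std::sqrt(nd));
}

// Wtf = 1 + log(TF): the log-smoothed term-frequency factor.
double tf_weight(int tf) {
    return tf > 0 ? 1.0 + std::log(static_cast<double>(tf)) : 0.0;
}

// Variant: Wtf = a + (1 - a) * TF / Max(TF); a is typically 0.4 or 0.5.
double tf_weight_normalized(int tf, int max_tf, double a = 0.4) {
    return a + (1.0 - a) * tf / max_tf;
}

// IDFk = log(N / nk): N = total docs, nk = docs containing the term.
double idf(int total_docs, int docs_with_term) {
    return std::log(static_cast<double>(total_docs) / docs_with_term);
}

// Classic TF*IDF weight of one term in one document.
double tf_idf(int tf, int total_docs, int docs_with_term) {
    return tf_weight(tf) * idf(total_docs, docs_with_term);
}
```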

(3) Probabilistic Retrieval Model
The BIM formula is made up of four parts, which can be read as:

1. the number of docs containing the term within the relevant set: a positive factor;
2. the number of docs not containing the term within the relevant set: a negative factor;
3. the number of docs containing the term within the non-relevant set: a negative factor;
4. the number of docs not containing the term within the non-relevant set: a positive factor.


The BM25 formula has three parts: 1. the BIM model, which reduces to IDF; 2. the term's weight in the document (doc-tf); 3. the term's weight in the query (query-tf).

N: the total number of documents in the index;
Ni: the number of documents in the index containing the term, i.e. the df;
fi: the number of times the term occurs in the document;
qfi: the number of times the term occurs in the query;
dl: the document length;
avdl: the average document length.
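Putting the three parts together, here is a sketch of the textbook BM25 score using exactly the variables listed above (k1, b, k3 are the usual tuning constants; this mirrors the common formulation, not Xapian's exact code path):

```cpp
#include <cmath>

// Textbook BM25 score of one term for one (query, document) pair.
double bm25_term_score(double N, double ni,    // total docs, docs containing term
                       double fi,              // term frequency in the document
                       double qfi,             // term frequency in the query
                       double dl, double avdl, // doc length, average doc length
                       double k1 = 1.2, double b = 0.75, double k3 = 8.0) {
    // Part 1: the BIM model reduced to an IDF-like factor.
    double idf = std::log((N - ni + 0.5) / (ni + 0.5));
    // Part 2: the term's weight in the document, with length normalization.
    double K = k1 * (1.0 - b + b * dl / avdl);
    double doc_tf = (k1 + 1.0) * fi / (K + fi);
    // Part 3: the term's weight in the query.
    double query_tf = (k3 + 1.0) * qfi / (k3 + qfi);
    return idf * doc_tf * query_tf;
}
```

The full document score is the sum of this quantity over the query's terms.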

BM25F
To account for different fields, the average length and the tuning factor in part 2 need per-field values, together with a weight associated with each field.

Reference for the relevance section: 《這就是搜索引擎》 (roughly, "This Is How Search Engines Work").

2 Source Code Analysis

2.1 Main Classes

The following walks through a typical retrieval pass, using xapian as the example; since a great deal of source code is involved, minor side strategies are not covered one by one. First, the main classes involved are listed here; they also offer a glimpse of xapian's design thinking for retrieval.

Blocks with a green background are what the user sees; blue backgrounds are the underlying machinery.

Enquire::Internal: the internal implementation of Enquire. Xapian's general style is to wrap everything in a thin shell class and put the actual implementation in an Internal;

BM25Weight: the relevance-scoring class Xapian uses by default;

Weight::Internal: the basic statistics needed for scoring, e.g. the document count of the database, the total term length of the database, the tf/df data of the query's terms, etc.;

MultiMatch: the class that implements retrieval;

LocalSubMatch: a wrapper around operations on a local sub-database. xapian supports remote databases, and also splitting one database into multiple sub-databases;

QueryOptimiser: a helper for building the PostList-Tree from the Query-Tree; it mainly records sub-database-related state, e.g. a reference to the LocalSubMatch and a reference to the Database;

QueryOr, QueryBranch, QueryTerm: the node classes making up the Query Tree;

PostList, LeafPostList: the node classes making up the PostList-Tree;

InMemoryPostList: the PostList wrapper for the in-memory database;

OrContext: context recorded while converting the Query-Tree into the PostList-Tree, including a pointer to the QueryOptimiser object and temporarily held PostList pointers;

2.2 The Retrieval Process

2.2.1 User demo code

Xapian::Query term_one = Xapian::Query("T世界");
Xapian::Query term_two = Xapian::Query("T比賽");
Xapian::Query query = Xapian::Query(Xapian::Query::OP_OR, term_one, term_two); // assemble the query

std::cout << "query=" << query.get_description() << std::endl;

Xapian::Enquire enquire(db);
enquire.set_query(query);
Xapian::MSet result = enquire.get_mset(0, 10); // run the search and fetch the results
std::cout << "find results count=" << result.get_matches_estimated() << std::endl;

for (auto it = result.begin(); it != result.end(); ++it) {
    Xapian::Document doc = it.get_document();
    std::string data = doc.get_data();
    double doc_score_weight = it.get_weight();
    int doc_score_percent = it.get_percent();
    std::cout << "doc=" << data << ",weight=" << doc_score_weight << ",percent=" << doc_score_percent << std::endl;
}

2.2.2 How query assembly is implemented

    A single class, Query, provides everything needed through overloaded constructors.

    e.g.:

Query::Query(const string & term, Xapian::termcount wqf, Xapian::termpos pos)
    : internal(new Xapian::Internal::QueryTerm(term, wqf, pos)) {
    LOGCALL_CTOR(API, "Query", term | wqf | pos);
}
Query(op op_, const Xapian::Query & a, const Xapian::Query & b) {
    init(op_, 2);
    bool positional = (op_ == OP_NEAR || op_ == OP_PHRASE);
    add_subquery(positional, a);
    add_subquery(positional, b);
    done();
}

/* Depending on the OP, a matching Query subclass is created; e.g. OR produces a QueryOr holding the two subqueries, and this QueryOr object becomes the Query's internal member.
When several queries are combined, they are simply appended to a vector.
If the vector ends up empty, internal is set to NULL; if it holds exactly one element, internal is set to that subquery's internal. This avoids pointless vector nesting such as [xxquery]: a single element has no need to live in a vector. */

...

No special design went into organising the query tree; e.g. the children of an OR are simply stored in a vector.

2.2.3 How retrieval is implemented

(1) The retrieval entry point

MSet Enquire::Internal::get_mset(Xapian::doccount first, Xapian::doccount maxitems, Xapian::doccount check_at_least, const RSet *rset, const MatchDecider *mdecider) const {
    LOGCALL(MATCH, MSet, "Enquire::Internal::get_mset", first | maxitems | check_at_least | rset | mdecider);

    if (percent_cutoff && (sort_by == VAL || sort_by == VAL_REL)) {
        throw Xapian::UnimplementedError("Use of a percentage cutoff while sorting primary by value isn't currently supported");
    }

    if (weight == 0) {
        weight = new BM25Weight;  // fall back to BM25Weight if no scoring scheme was specified
    }

    Xapian::doccount first_orig = first;
    {
        Xapian::doccount docs = db.get_doccount();
        first = min(first, docs);
        maxitems = min(maxitems, docs - first);
        check_at_least = min(check_at_least, docs);
        check_at_least = max(check_at_least, first + maxitems);
    }

    AutoPtr<Xapian::Weight::Internal> stats(new Xapian::Weight::Internal);  // records the global stats used for scoring
    // Constructing the MultiMatch object performs the search initialisation work, e.g. filling in the stats object
    ::MultiMatch match(db, query, qlen, rset, collapse_max, collapse_key,
                       percent_cutoff, weight_cutoff, order, sort_key, sort_by,
                       sort_value_forward, time_limit, *(stats.get()), weight,
                       spies, (sorter.get() != NULL), (mdecider != NULL));

    // Run query and put results into supplied Xapian::MSet object.
    MSet retval;
    match.get_mset(first, maxitems, check_at_least, retval, *(stats.get()),
                   mdecider, sorter.get()); // the actual search

    if (first_orig != first && retval.internal.get()) {
        retval.internal->firstitem = first_orig;
    }

    Assert(weight->name() != "bool" || retval.get_max_possible() == 0);

    // The Xapian::MSet needs to have a pointer to ourselves, so that it can
    // retrieve the documents. This is set here explicitly to avoid having
    // to pass it into the matcher, which gets messy particularly in the
    // networked case.
    retval.internal->enquire = this;

    if (!retval.internal->stats) {
        retval.internal->stats = stats.release();
    }

    RETURN(retval);
}

(2) The preparation before retrieval is done while the MultiMatch object is constructed, in prepare_sub_matches:

static void prepare_sub_matches(vector<intrusive_ptr<SubMatch> > & leaves, Xapian::Weight::Internal & stats) {
    LOGCALL_STATIC_VOID(MATCH, "prepare_sub_matches", leaves | stats);
    // We use a vector<bool> to track which SubMatches we're already prepared.
    vector<bool> prepared;
    prepared.resize(leaves.size(), false);
    size_t unprepared = leaves.size();
    bool nowait = true;
    while (unprepared) {
        for (size_t leaf = 0; leaf < leaves.size(); ++leaf) {
            if (prepared[leaf]) continue;
            SubMatch * submatch = leaves[leaf].get();
            if (!submatch || submatch->prepare_match(nowait, stats)) {
                prepared[leaf] = true;
                --unprepared;
            }
        }
        // Use blocking IO on subsequent passes, so that we don't go into
        // a tight loop.
        nowait = false;
    }
}
bool LocalSubMatch::prepare_match(bool nowait, Xapian::Weight::Internal & total_stats) {
    LOGCALL(MATCH, bool, "LocalSubMatch::prepare_match", nowait | total_stats);
    (void)nowait;
    Assert(db);
    total_stats.accumulate_stats(*db, rset);
    RETURN(true);
}
void Weight::Internal::accumulate_stats(const Xapian::Database::Internal &subdb, const Xapian::RSet &rset) {
#ifdef XAPIAN_ASSERTIONS
    Assert(!finalised);
    ++subdbs;
#endif
    total_length += subdb.get_total_length();
    collection_size += subdb.get_doccount();
    rset_size += rset.size();

    total_term_count += subdb.get_doccount() * subdb.get_total_length();
    Xapian::TermIterator t;
    for (t = query.get_unique_terms_begin(); t != Xapian::TermIterator(); ++t) {
        const string & term = *t;

        Xapian::doccount sub_tf;
        Xapian::termcount sub_cf;
        subdb.get_freqs(term, &sub_tf, &sub_cf);
        TermFreqs & tf = termfreqs[term];
        tf.termfreq += sub_tf;
        tf.collfreq += sub_cf;
    }

    const set<Xapian::docid> & items(rset.internal->get_items());
    set<Xapian::docid>::const_iterator d;
    for (d = items.begin(); d != items.end(); ++d) {
        Xapian::docid did = *d;
        Assert(did);
        // The query is likely to contain far fewer terms than the documents,
        // and we can skip the document's termlist, so look for each query term
        // in the document.
        AutoPtr<TermList> tl(subdb.open_term_list(did));
        map<string, TermFreqs>::iterator i;
        for (i = termfreqs.begin(); i != termfreqs.end(); ++i) {
            const string & term = i->first;
            TermList * ret = tl->skip_to(term);
            Assert(ret == NULL);
            (void)ret;
            if (tl->at_end()) {
                break;
            }
            if (term == tl->get_termname()) {
                ++i->second.reltermfreq;
            }
        }
    }
}

prepare_sub_matches(): the preparation work before the BM25 computation.
Weight::Internal::accumulate_stats:
total_length: the sum of document lengths across the db;
collection_size: the total number of documents in the db;
total_term_count: dubious; the variable name suggests a term count, but it is actually (total document length) * (total document count);
termfreqs: per-term tf (how many docs the term appears in) and cf (how many times the term appears in the whole collection);
TF and IDF information is collected for every term involved in the query.
Aggressive compression: VectorTermList packs terms that were held in several strings into a single block of memory. Storing them in a vector would cost roughly an extra 30 bytes per term.
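The packing idea can be illustrated with a toy version. This is not VectorTermList's actual layout, just a sketch of the technique: length-prefixed terms in one contiguous buffer instead of a vector of separately allocated strings.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Pack terms into one buffer, each prefixed by a 1-byte length
// (assumes every term is shorter than 256 bytes).
std::string pack_terms(const std::vector<std::string>& terms) {
    std::string buf;
    for (const std::string& t : terms) {
        buf.push_back(static_cast<char>(t.size()));
        buf += t;
    }
    return buf;
}

// Walk the buffer and recover the original terms.
std::vector<std::string> unpack_terms(const std::string& buf) {
    std::vector<std::string> out;
    size_t pos = 0;
    while (pos < buf.size()) {
        size_t len = static_cast<uint8_t>(buf[pos++]);
        out.push_back(buf.substr(pos, len));
        pos += len;
    }
    return out;
}
```

One allocation holds everything, and iteration is a linear scan, which is exactly the trade-off the per-term 30-byte estimate above is about.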

(3) Opening the posting lists and building the postlist-tree:

Opening the posting lists and the matching itself live together in one huge function of about 800 lines:

void MultiMatch::get_mset(Xapian::doccount first, Xapian::doccount maxitems,
             Xapian::doccount check_at_least,
             Xapian::MSet & mset,
             Xapian::Weight::Internal & stats,
             const Xapian::MatchDecider *mdecider,
             const Xapian::KeyMaker *sorter) {
........
}

The process of opening the posting lists is a deeply nested chain of calls; it is effectively a parse-and-rebuild of the query tree:

PostList * LocalSubMatch::get_postlist(MultiMatch * matcher, Xapian::termcount * total_subqs_ptr) {
    LOGCALL(MATCH, PostList *, "LocalSubMatch::get_postlist", matcher | total_subqs_ptr);

    if (query.empty()) {
        RETURN(new EmptyPostList); // MatchNothing
    }

    // Build the postlist tree for the query.  This calls
    // LocalSubMatch::open_post_list() for each term in the query.
    PostList * pl;
    {
        QueryOptimiser opt(*db, *this, matcher);
        pl = query.internal->postlist(&opt, 1.0);
        *total_subqs_ptr = opt.get_total_subqs();
    }

    AutoPtr<Xapian::Weight> extra_wt(wt_factory->clone());
    // Only uses term-independent stats.
    extra_wt->init_(*stats, qlen);
    if (extra_wt->get_maxextra() != 0.0) {
        // There's a term-independent weight contribution, so we combine the
        // postlist tree with an ExtraWeightPostList which adds in this
        // contribution.
        pl = new ExtraWeightPostList(pl, extra_wt.release(), matcher);
    }

    RETURN(pl);
}
PostingIterator::Internal * QueryOr::postlist(QueryOptimiser * qopt, double factor) const {
    LOGCALL(QUERY, PostingIterator::Internal *, "QueryOr::postlist", qopt | factor);
    OrContext ctx(qopt, subqueries.size());
    do_or_like(ctx, qopt, factor);
    RETURN(ctx.postlist());
}
void QueryBranch::do_or_like(OrContext& ctx, QueryOptimiser * qopt, double factor, Xapian::termcount elite_set_size, size_t first) const {
    LOGCALL_VOID(MATCH, "QueryBranch::do_or_like", ctx | qopt | factor | elite_set_size);

    // FIXME: we could optimise by merging OP_ELITE_SET and OP_OR like we do
    // for AND-like operations.

    // OP_SYNONYM with a single subquery is only simplified by
    // QuerySynonym::done() if the single subquery is a term or MatchAll.
    Assert(subqueries.size() >= 2 || get_op() == Query::OP_SYNONYM);

    vector<PostList *> postlists;
    postlists.reserve(subqueries.size() - first);

    QueryVector::const_iterator q;
    for (q = subqueries.begin() + first; q != subqueries.end(); ++q) {
        // MatchNothing subqueries should have been removed by done().
        Assert((*q).internal.get());
        (*q).internal->postlist_sub_or_like(ctx, qopt, factor);
    }

    if (elite_set_size && elite_set_size < subqueries.size()) {
        ctx.select_elite_set(elite_set_size, subqueries.size());
        // FIXME: not right!
    }
}

...

LeafPostList * LocalSubMatch::open_post_list(const string& term,
                  Xapian::termcount wqf,
                  double factor,
                  bool need_positions,
                  bool in_synonym,
                  QueryOptimiser * qopt,
                  bool lazy_weight) {
    LOGCALL(MATCH, LeafPostList *, "LocalSubMatch::open_post_list", term | wqf | factor | need_positions | qopt | lazy_weight);

    bool weighted = (factor != 0.0 && !term.empty());

    LeafPostList * pl = NULL;
    if (!term.empty() && !need_positions) {
        if ((!weighted && !in_synonym) ||
            !wt_factory->get_sumpart_needs_wdf_()) {
            Xapian::doccount sub_tf;
            db->get_freqs(term, &sub_tf, NULL);
            if (sub_tf == db->get_doccount()) {
                // If we're not going to use the wdf or term positions, and the
                // term indexes all documents, we can replace it with the
                // MatchAll postlist, which is especially efficient if there
                // are no gaps in the docids.
                pl = db->open_post_list(string());
                // Set the term name so the postlist looks up the correct term
                // frequencies - this is necessary if the weighting scheme
                // needs collection frequency or reltermfreq (termfreq would be
                // correct anyway since it's just the collection size in this
                // case).
                pl->set_term(term);
            }
        }
    }

    if (!pl) {
        const LeafPostList * hint = qopt->get_hint_postlist();
        if (hint)
            pl = hint->open_nearby_postlist(term);
        if (!pl)
            pl = db->open_post_list(term);
        qopt->set_hint_postlist(pl);
    }

    if (lazy_weight) {
        // Term came from a wildcard, but we may already have that term in the
        // query anyway, so check before accumulating its TermFreqs.
        map<string, TermFreqs>::iterator i = stats->termfreqs.find(term);
        if (i == stats->termfreqs.end()) {
            Xapian::doccount sub_tf;
            Xapian::termcount sub_cf;
            db->get_freqs(term, &sub_tf, &sub_cf);
            stats->termfreqs.insert(make_pair(term, TermFreqs(sub_tf, 0, sub_cf)));
        }
    }

    if (weighted) {
        Xapian::Weight * wt = wt_factory->clone();
        if (!lazy_weight) {
            wt->init_(*stats, qlen, term, wqf, factor);  // BM25Weight::init() computes the score parts that don't depend on any particular doc (only on the term and the query)
            stats->set_max_part(term, wt->get_maxpart());
        } else {
            // Delay initialising the actual weight object, so that we can
            // gather stats for the terms lazily expanded from a wildcard
            // (needed for the remote database case).
            wt = new LazyWeight(pl, wt, stats, qlen, wqf, factor);
        }
        pl->set_termweight(wt);
    }
    RETURN(pl);
}

The weight's init:

void BM25Weight::init(double factor) {
    Xapian::doccount tf = get_termfreq();

    double tw = 0;
    if (get_rset_size() != 0) {
        Xapian::doccount reltermfreq = get_reltermfreq();

        // There can't be more relevant documents indexed by a term than there
        // are documents indexed by that term.
        AssertRel(reltermfreq,<=,tf);

        // There can't be more relevant documents indexed by a term than there
        // are relevant documents.
        AssertRel(reltermfreq,<=,get_rset_size());

        Xapian::doccount reldocs_not_indexed = get_rset_size() - reltermfreq;

        // There can't be more relevant documents not indexed by a term than
        // there are documents not indexed by that term.
        AssertRel(reldocs_not_indexed,<=,get_collection_size() - tf);

        Xapian::doccount Q = get_collection_size() - reldocs_not_indexed;

        Xapian::doccount nonreldocs_indexed = tf - reltermfreq;
        double numerator = (reltermfreq + 0.5) * (Q - tf + 0.5);
        double denom = (reldocs_not_indexed + 0.5) * (nonreldocs_indexed + 0.5);
        tw = numerator / denom;
    } else {
        tw = (get_collection_size() - tf + 0.5) / (tf + 0.5);
    }

    AssertRel(tw,>,0);

    // The "official" formula can give a negative termweight in unusual cases
    // (without an RSet, when a term indexes more than half the documents in
    // the database).  These negative weights aren't actually helpful, and it
    // is common for implementations to replace them with a small positive
    // weight or similar.
    //
    // Truncating to zero doesn't seem a great approach in practice as it
    // means that some terms in the query can have no effect at all on the
    // ranking, and that some results can have zero weight, both of which
    // are seem surprising.
    //
    // Xapian 1.0.x and earlier adjusted the termweight for any term indexing
    // more than a third of documents, which seems rather "intrusive".  That's
    // what the code currently enabled does, but perhaps it would be better to
    // do something else. (FIXME)
#if 0
    if (rare(tw <= 1.0)) {
        termweight = 0;
    } else {
        termweight = log(tw) * factor;
        if (param_k3 != 0) {
            double wqf_double = get_wqf();
            termweight *= (param_k3 + 1) * wqf_double / (param_k3 + wqf_double);
        }
    }
#else
    if (tw < 2) tw = tw * 0.5 + 1;
    termweight = log(tw) * factor;
    if (param_k3 != 0) {
        double wqf_double = get_wqf();
        termweight *= (param_k3 + 1) * wqf_double / (param_k3 + wqf_double);
    }
#endif
    termweight *= (param_k1 + 1);

    LOGVALUE(WTCALC, termweight);

    if (param_k2 == 0 && (param_b == 0 || param_k1 == 0)) {
        // If k2 is 0, and either param_b or param_k1 is 0 then the document
        // length doesn't affect the weight.
        len_factor = 0;
    } else {
        len_factor = get_average_length();
        // len_factor can be zero if all documents are empty (or the database
        // is empty!)
        if (len_factor != 0) len_factor = 1 / len_factor;
    }

    LOGVALUE(WTCALC, len_factor);
}

總的來講,這一階段:

stats設置給LocalSubMatch對象;
獲取倒排列表,根據query-tree構建postlist-tree;同時,clone一個Weight對象,計算BM25所須要的計算因子;平均文檔長度,文檔的最短長度,term最大的wdf(term在某doc中出現的次數);
計算BM25公式的idf部分:tw = (get_collection_size() - tf + 0.5) / (tf + 0.5); termweight = log(tw) * factor;
計算BM25公式的term在query中的權重部分:double wqf_double = get_wqf(); termweight *= (param_k3 + 1) * wqf_double / (param_k3 + wqf_double);
計算BM25公式的term跟doc相關程度的一部分參數: termweight *= (param_k1 + 1);
計算BM25公式的平均長度分之一:len_factor = 1 / len_factor;
計算maxpart() ,BM25算法,沒有地方用這個值;

這就把BM25公式中,不跟具體doc相關的第一和第三部分計算完成。

When building the postlist-tree, AND syntax uses PostList * AndContext::postlist() to generate the postlist, and the child postlist-tree is then torn down;

(4) Final recall and ranking

Docids are pulled from the postlist-tree in a loop, and BM25 scores are computed for them.
Posting-list intersection (AND):
PostList * MultiAndPostList::find_next_match(double w_min)

Intersecting two sorted lists:
0. advance the first list's position one step;
1. take the first list's current element;
2. find_next_match() --> check_helper() advances the second list's position until it is greater than or equal to the first list's current element;
3. take the second list's current element and compare it against the first list's;
4. if they do not match, advance the first list and repeat.

Note: the classic "zipper" merge.

The main code:

/// Note: before this function is called, next_helper() has already advanced the
/// first list by one position (except on the very first call).
/// find_next_match() positions the plist[0] posting list appropriately and returns
/// once a suitable position is found (plist[0] and plist[1] intersect); otherwise
/// there is no intersection left, did is set to 0 and the function returns. The
/// caller detects that the intersection is finished by checking did == 0.
PostList *
MultiAndPostList::find_next_match(double w_min)
{
advanced_plist0:
    if (plist[0]->at_end()) {
        did = 0;
        return NULL;
    }
    did = plist[0]->get_docid();
    for (size_t i = 1; i < n_kids; ++i) {
        bool valid;
        check_helper(i, did, w_min, valid);
        if (!valid) {
            next_helper(0, w_min);
            goto advanced_plist0;
        }
        if (plist[i]->at_end()) {
            did = 0;
            return NULL;
        }
        Xapian::docid new_did = plist[i]->get_docid();
        if (new_did != did) {
            // The two lists' current elements differ, which can only mean
            // plist[0]'s element is smaller, so it must move forward.
            skip_to_helper(0, new_did, w_min);
            goto advanced_plist0;
        }
    }
    return NULL;
}

Fetching the BM25 score:

double LeafPostList::get_weight() const {
    if (!weight) return 0;
    Xapian::termcount doclen = 0, unique_terms = 0;
    // Fetching the document length and number of unique terms is work we can
    // avoid if the weighting scheme doesn't use them.
    if (need_doclength)
        doclen = get_doclength();
    if (need_unique_terms)
        unique_terms = get_unique_terms();
    double sumpart = weight->get_sumpart(get_wdf(), doclen, unique_terms); // this aggregates the final BM25 score for one doc, reusing the part-1 and part-3 values computed earlier
    AssertRel(sumpart, <=, weight->get_maxpart());
    return sumpart;
}

Union of two sorted lists (OR):
PostList * OrPostList::next(double w_min)
Both lists are consumed, and get_docid() returns the smaller of the two current dids; once one of the lists is exhausted, the remaining list replaces the pair it previously shared an owner with.

How is the percent computed?
    percent_scale = greatest_wt_subqs_matched / double(total_subqs);
    percent_scale /= greatest_wt;
It depends first on the number of matched terms relative to the total number of query terms, and then on the greatest matching weight; percent_scale acts as the factor behind the percent:

    double v = wt * percent_factor + 100.0 * DBL_EPSILON;  // percent_scale is behind percent_factor; v is the percent

    From the way the BM25 score is evaluated, it is apparent that some of its factors (the part-1 IDF factor and part of the part-2 term-doc factor) do not need to be computed online; they could be computed offline and stored in the posting lists.

    The default BM25Weight scoring scheme uses neither get_maxextra() nor get_sumextra().

    A more detailed introduction to percent is available here: http://www.javashuo.com/article/p-exvrpedu-gc.html

How is the final result list truncated to the requested limit?

    When the user wants only n results and more than n are recalled, processing result n+1 builds a heap with std::make_heap (or, if one was already built, simply pushes into it) and pops the lowest-scoring doc, so that only n entries are kept. The code also tracks min_weight: any doc scoring below min_weight is dropped immediately, without going through the heap at all. (See multimatch.cc, from around line 746.)
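The truncation scheme described above can be sketched as follows (illustrative names, not Xapian's; the real logic lives in multimatch.cc):

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <vector>

// Keep only the n best weights: a min-heap of size n plus a min_weight
// threshold that lets cheap candidates be rejected without heap work.
struct TopN {
    size_t n;
    double min_weight = std::numeric_limits<double>::lowest();
    std::vector<double> heap;  // min-heap (std::greater ordering)

    explicit TopN(size_t n_) : n(n_) {}

    void push(double w) {
        if (heap.size() == n && w <= min_weight) return;  // cheap rejection
        heap.push_back(w);
        std::push_heap(heap.begin(), heap.end(), std::greater<double>());
        if (heap.size() > n) {
            std::pop_heap(heap.begin(), heap.end(), std::greater<double>());
            heap.pop_back();            // drop the lowest-scoring doc
            min_weight = heap.front();  // new admission threshold
        }
    }
};
```

After pushing any number of weights, heap holds the n largest and heap.front() is the current admission bar.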
