This article analyses Xapian's retrieval process, and is fairly heavy on source code. Broadly speaking, retrieval means pulling posting lists (inverted lists), collecting the matching docs, and then scoring and ranking them.
1 Theoretical analysis
1.1 Query syntax
Different search scenarios impose different retrieval requirements, for example: both term A and term B must appear in the doc; either term A or term B may appear; either A or B appears but C must not; and so on. In symbols:
A & B
A || B
(A || B) & ~C
(A & (B || C)) || D
…
These retrieval requirements are many and varied; implementing separate code for each of them is unrealistic, so a simple, efficient, extensible query syntax is needed to support them all.
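In Xapian, such expressions are built by nesting Query objects; a minimal sketch (the operators are real Xapian API, the terms are made-up placeholders):

```cpp
#include <iostream>
#include <xapian.h>

int main() {
    // (A || B) & ~C expressed as a nested Query tree.
    Xapian::Query a("apple"), b("banana"), c("cherry");
    Xapian::Query a_or_b(Xapian::Query::OP_OR, a, b);
    Xapian::Query q(Xapian::Query::OP_AND_NOT, a_or_b, c);
    std::cout << q.get_description() << std::endl;
    return 0;
}
```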
1.2 The retrieval process
First, a query expression is assembled according to the business requirement; then the retrieval engine's API is called to fetch the results.
Inside the engine, taking Xapian as the example: the user-assembled query is first turned into a query-tree; the query-tree is then converted into a postlist-tree (a tree of posting lists); finally the postlist-tree is evaluated to produce the result set. Relevance formulas (e.g. BM25) are evaluated both while obtaining the postlist-tree and during the final computation.
1.3 Relevance
There are several ways to compute query-doc relevance:
(1) Boolean Model
Checks whether the user's terms appear in the document; if they do, the document is considered relevant to the user's need, otherwise not.
Pros: simple;
Cons: the outcome is binary (YES or NO), so multiple results have no ordering among them;
(2) Vector Space Model
Both the query and the doc are vectorised, and the cosine between the two vectors is the query-doc similarity score. Content similarity is used here as a stand-in for relevance.
This model tends to dampen long documents.
Cosine formula: the dot product of the vectors divided by the product of their lengths.
So how do we vectorise, and what value should each dimension take?
Term-frequency factor (TF): the number of times a word occurs in the document. A log is usually applied for smoothing, so that a term occurring once and a term occurring 10 times do not differ in weight by a factor of 10. Common formula: Wtf = 1 + log(TF). The constant 1 prevents W from becoming 0 when TF = 1 and log(TF) = 0.
A variant: Wtf = a + (1 - a) * TF / Max(TF), where a is a tuning factor, typically 0.4 or 0.5; TF is the term frequency and Max(TF) is the frequency of the most frequent word in the document. This variant helps dampen long documents, making term-frequency factors comparable across documents of different lengths.
Inverse document frequency factor (IDF): based on the inverse of the number of documents containing a word. If a word occurs in every document, it contributes little to telling documents apart and is not very important; conversely, a rare word matters a lot.
Formula: IDFk = log(N / nk), where N is the total number of documents in the collection and nk is the number of documents in which the word occurs.
The TF*IDF framework:
Weight = TF * IDF
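Putting the formulas above side by side (a restatement in one place, using the definitions just given):

```latex
\mathrm{cosine}(q,d) = \frac{\vec{q}\cdot\vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert},
\qquad
W_{tf} = 1 + \log(TF),
\qquad
IDF_k = \log\frac{N}{n_k},
\qquad
Weight = W_{tf} \cdot IDF_k
```

For example (invented numbers), with N = 1000 documents, a term that occurs in nk = 10 of them and TF = 4 times in a given document gets, using natural log, Wtf = 1 + ln 4 ≈ 2.39 and IDF = ln(1000/10) ≈ 4.61, so Weight ≈ 11.0.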
(3) Probabilistic retrieval model
The BIM model's formula consists of four parts, which can be understood as:
1. the number of docs in the relevant set that contain the term (a positive factor);
2. the number of docs in the relevant set that do not contain the term (a negative factor);
3. the number of docs in the non-relevant set that contain the term (a negative factor);
4. the number of docs in the non-relevant set that do not contain the term (a positive factor).
The BM25 formula has three parts: 1. the BIM model part, which reduces to IDF; 2. the term's weight in the document (doc-tf); 3. the term's weight in the query (query-tf). Symbols:
N: the total number of documents in the index;
Ni: the number of documents in the index containing the term, i.e. the df;
fi: the number of times the term occurs in the document;
qfi: the number of times the term occurs in the query;
dl: the document length;
avdl: the average document length.
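For reference, the textbook form of BM25 that these three parts and symbols assemble into (Xapian's implementation rearranges it, as the source walkthrough below shows):

```latex
\mathrm{BM25}(q,d) = \sum_{i \in q}
\underbrace{\log\frac{N - N_i + 0.5}{N_i + 0.5}}_{\text{1. BIM} \approx \text{IDF}}
\cdot
\underbrace{\frac{(k_1 + 1)\,f_i}{k_1\bigl((1-b) + b\,\frac{dl}{avdl}\bigr) + f_i}}_{\text{2. term weight in doc}}
\cdot
\underbrace{\frac{(k_3 + 1)\,qf_i}{k_3 + qf_i}}_{\text{3. term weight in query}}
```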
BM25F
To account for different fields, the average length and the tuning factor in part 2 must be given per-field values, and a field-specific weight is also needed.
Reference for the relevance section: the book 《這就是搜索引擎》.
2 Source code analysis
2.1 Main classes
Below, Xapian serves as the example for walking through a typical retrieval flow; since a lot of source code is involved, minor side strategies are not all covered. First, here are the main classes involved, which also give a glimpse of Xapian's design approach to retrieval.
In the class diagram, blocks with a green background are what the user sees; blue ones are the underlying internals.
Enquire::Internal: the internal implementation of Enquire. Xapian's house style is to wrap everything in a shell class and put the actual implementation in an Internal;
BM25Weight: the relevance-scoring class Xapian uses by default;
Weight::Internal: the basic statistics scoring needs, e.g. the database's document count, the database's total term length, and the tf/df data of the query's terms...;
MultiMatch: the class that implements the match;
LocalSubMatch: a wrapper for operations on a local sub-database. Xapian supports remote databases, and a database split into multiple sub-databases;
QueryOptimiser: a helper used when building the PostList-tree from the Query-tree; it mainly records sub-database-related information, e.g. a reference to the LocalSubMatch and a reference to the Database...;
QueryOr, QueryBranch, QueryTerm: the family of node classes making up the Query tree;
PostList, LeafPostList: the node classes making up the PostList-tree;
InMemoryPostList: the PostList wrapper for the in-memory database;
OrContext: PostList context recorded during the Query-tree to PostList-tree conversion, including a pointer to the QueryOptimiser and temporarily parked PostList pointers;
2.2 The retrieval flow
2.2.1 User demo code
```cpp
Xapian::Query term_one = Xapian::Query("T世界");
Xapian::Query term_two = Xapian::Query("T比賽");
Xapian::Query query = Xapian::Query(Xapian::Query::OP_OR, term_one, term_two); // assemble the query
std::cout << "query=" << query.get_description() << std::endl;

Xapian::Enquire enquire(db);
enquire.set_query(query);
Xapian::MSet result = enquire.get_mset(0, 10); // run the search, fetch the results
std::cout << "find results count=" << result.get_matches_estimated() << std::endl;
for (auto it = result.begin(); it != result.end(); ++it) {
    Xapian::Document doc = it.get_document();
    std::string data = doc.get_data();
    double doc_score_weight = it.get_weight();
    int doc_score_percent = it.get_percent();
    std::cout << "doc=" << data << ",weight=" << doc_score_weight
              << ",percent=" << doc_score_percent << std::endl;
}
```
2.2.2 How query assembly is implemented
There is just one class, Query; through constructor overloading it provides everything needed.
e.g.:
```cpp
Query::Query(const string & term, Xapian::termcount wqf, Xapian::termpos pos)
    : internal(new Xapian::Internal::QueryTerm(term, wqf, pos))
{
    LOGCALL_CTOR(API, "Query", term | wqf | pos);
}

Query(op op_, const Xapian::Query & a, const Xapian::Query & b)
{
    init(op_, 2);
    bool positional = (op_ == OP_NEAR || op_ == OP_PHRASE);
    add_subquery(positional, a);
    add_subquery(positional, b);
    done();
}
```
/* Depending on the OP, the corresponding Query subclass is created; e.g. OR creates
   a QueryOr holding the two subqueries, and that QueryOr object becomes the Query's
   internal member.
   When combining multiple queries, subqueries are simply appended to a vector.
   If the vector ends up empty, internal is set to NULL; if it holds exactly one
   element, internal is set to that subquery's internal. This avoids pointless
   vector nesting: a single element like [xxquery] does not need to live in a vector. */
...
No special data structure is used to organise the query tree; e.g. the operands of an OR are simply stored in a vector.
2.2.3 How retrieval is implemented
(1) The retrieval entry point:
```cpp
MSet
Enquire::Internal::get_mset(Xapian::doccount first, Xapian::doccount maxitems,
                            Xapian::doccount check_at_least, const RSet *rset,
                            const MatchDecider *mdecider) const
{
    LOGCALL(MATCH, MSet, "Enquire::Internal::get_mset", first | maxitems | check_at_least | rset | mdecider);
    if (percent_cutoff && (sort_by == VAL || sort_by == VAL_REL)) {
        throw Xapian::UnimplementedError("Use of a percentage cutoff while sorting primary by value isn't currently supported");
    }
    if (weight == 0) {
        weight = new BM25Weight; // if no weighting scheme was specified, fall back to BM25Weight
    }

    Xapian::doccount first_orig = first;
    {
        Xapian::doccount docs = db.get_doccount();
        first = min(first, docs);
        maxitems = min(maxitems, docs - first);
        check_at_least = min(check_at_least, docs);
        check_at_least = max(check_at_least, first + maxitems);
    }

    AutoPtr<Xapian::Weight::Internal> stats(new Xapian::Weight::Internal); // records the global stats used for scoring

    // Constructing the MultiMatch object performs the search set-up work,
    // e.g. filling in the stats object.
    ::MultiMatch match(db, query, qlen, rset, collapse_max, collapse_key,
                       percent_cutoff, weight_cutoff, order, sort_key, sort_by,
                       sort_value_forward, time_limit, *(stats.get()), weight,
                       spies, (sorter.get() != NULL), (mdecider != NULL));

    // Run query and put results into supplied Xapian::MSet object.
    MSet retval;
    match.get_mset(first, maxitems, check_at_least, retval, *(stats.get()),
                   mdecider, sorter.get()); // the actual match
    if (first_orig != first && retval.internal.get()) {
        retval.internal->firstitem = first_orig;
    }

    Assert(weight->name() != "bool" || retval.get_max_possible() == 0);

    // The Xapian::MSet needs to have a pointer to ourselves, so that it can
    // retrieve the documents.  This is set here explicitly to avoid having
    // to pass it into the matcher, which gets messy particularly in the
    // networked case.
    retval.internal->enquire = this;

    if (!retval.internal->stats) {
        retval.internal->stats = stats.release();
    }

    RETURN(retval);
}
```
(2) The pre-search preparation is done while the MultiMatch object is constructed, in prepare_sub_matches():
```cpp
static void
prepare_sub_matches(vector<intrusive_ptr<SubMatch> > & leaves,
                    Xapian::Weight::Internal & stats)
{
    LOGCALL_STATIC_VOID(MATCH, "prepare_sub_matches", leaves | stats);
    // We use a vector<bool> to track which SubMatches we're already prepared.
    vector<bool> prepared;
    prepared.resize(leaves.size(), false);
    size_t unprepared = leaves.size();
    bool nowait = true;
    while (unprepared) {
        for (size_t leaf = 0; leaf < leaves.size(); ++leaf) {
            if (prepared[leaf]) continue;
            SubMatch * submatch = leaves[leaf].get();
            if (!submatch || submatch->prepare_match(nowait, stats)) {
                prepared[leaf] = true;
                --unprepared;
            }
        }
        // Use blocking IO on subsequent passes, so that we don't go into
        // a tight loop.
        nowait = false;
    }
}

bool
LocalSubMatch::prepare_match(bool nowait, Xapian::Weight::Internal & total_stats)
{
    LOGCALL(MATCH, bool, "LocalSubMatch::prepare_match", nowait | total_stats);
    (void)nowait;
    Assert(db);
    total_stats.accumulate_stats(*db, rset);
    RETURN(true);
}

void
Weight::Internal::accumulate_stats(const Xapian::Database::Internal &subdb,
                                   const Xapian::RSet &rset)
{
#ifdef XAPIAN_ASSERTIONS
    Assert(!finalised);
    ++subdbs;
#endif
    total_length += subdb.get_total_length();
    collection_size += subdb.get_doccount();
    rset_size += rset.size();

    total_term_count += subdb.get_doccount() * subdb.get_total_length();
    Xapian::TermIterator t;
    for (t = query.get_unique_terms_begin(); t != Xapian::TermIterator(); ++t) {
        const string & term = *t;

        Xapian::doccount sub_tf;
        Xapian::termcount sub_cf;
        subdb.get_freqs(term, &sub_tf, &sub_cf);
        TermFreqs & tf = termfreqs[term];
        tf.termfreq += sub_tf;
        tf.collfreq += sub_cf;
    }

    const set<Xapian::docid> & items(rset.internal->get_items());
    set<Xapian::docid>::const_iterator d;
    for (d = items.begin(); d != items.end(); ++d) {
        Xapian::docid did = *d;
        Assert(did);
        // The query is likely to contain far fewer terms than the documents,
        // and we can skip the document's termlist, so look for each query term
        // in the document.
        AutoPtr<TermList> tl(subdb.open_term_list(did));
        map<string, TermFreqs>::iterator i;
        for (i = termfreqs.begin(); i != termfreqs.end(); ++i) {
            const string & term = i->first;
            TermList * ret = tl->skip_to(term);
            Assert(ret == NULL);
            (void)ret;
            if (tl->at_end())
                break;
            if (term == tl->get_termname())
                ++i->second.reltermfreq;
        }
    }
}
```
prepare_sub_matches(): the preparation work before the BM25 computation.
Weight::Internal::accumulate_stats collects:
total_length: the sum of all document lengths in the db;
collection_size: the total number of documents in the db;
total_term_count: dubious; the variable name says term count, but it is actually the total document length multiplied by the total document count;
termfreqs: each term's tf (how many docs the term appears in) and cf (how many times the term occurs across the whole collection);
through this, every term involved in the query gets its TF and IDF statistics.
An aggressive compression trick: VectorTermList packs several terms, which would otherwise be separately stored strings, into a single block of memory. Storing them in a vector<string> would cost roughly 30 extra bytes per term.
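The idea, as a minimal standalone sketch (this is not Xapian's actual VectorTermList, just the length-prefixed layout it conceptually uses):

```cpp
#include <string>
#include <vector>

// Pack several terms into one contiguous buffer: each term is stored as a
// one-byte length followed by its bytes, so the per-term overhead is 1 byte
// instead of the ~30 bytes of bookkeeping a std::vector<std::string> adds.
std::string pack_terms(const std::vector<std::string>& terms) {
    std::string buf;
    for (const auto& t : terms) {
        buf.push_back(static_cast<char>(t.size())); // assumes terms < 256 bytes
        buf.append(t);
    }
    return buf;
}

// Walk the packed buffer without materialising std::string objects.
void for_each_term(const std::string& buf, void (*fn)(const char*, size_t)) {
    size_t pos = 0;
    while (pos < buf.size()) {
        size_t len = static_cast<unsigned char>(buf[pos++]);
        fn(buf.data() + pos, len);
        pos += len;
    }
}
```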
(3) Opening the posting lists and building the postlist-tree:
Opening the posting lists and the matching itself both live in one enormous, 800-line function:
```cpp
void
MultiMatch::get_mset(Xapian::doccount first, Xapian::doccount maxitems,
                     Xapian::doccount check_at_least,
                     Xapian::MSet & mset,
                     Xapian::Weight::Internal & stats,
                     const Xapian::MatchDecider *mdecider,
                     const Xapian::KeyMaker *sorter)
{
    ........
}
```
Opening the posting lists goes through deeply nested function calls; this is also the query-tree parse --> rebuild process:
```cpp
PostList *
LocalSubMatch::get_postlist(MultiMatch * matcher,
                            Xapian::termcount * total_subqs_ptr)
{
    LOGCALL(MATCH, PostList *, "LocalSubMatch::get_postlist", matcher | total_subqs_ptr);

    if (query.empty())
        RETURN(new EmptyPostList); // MatchNothing

    // Build the postlist tree for the query.  This calls
    // LocalSubMatch::open_post_list() for each term in the query.
    PostList * pl;
    {
        QueryOptimiser opt(*db, *this, matcher);
        pl = query.internal->postlist(&opt, 1.0);
        *total_subqs_ptr = opt.get_total_subqs();
    }

    AutoPtr<Xapian::Weight> extra_wt(wt_factory->clone());
    // Only uses term-independent stats.
    extra_wt->init_(*stats, qlen);
    if (extra_wt->get_maxextra() != 0.0) {
        // There's a term-independent weight contribution, so we combine the
        // postlist tree with an ExtraWeightPostList which adds in this
        // contribution.
        pl = new ExtraWeightPostList(pl, extra_wt.release(), matcher);
    }

    RETURN(pl);
}
```
```cpp
PostingIterator::Internal *
QueryOr::postlist(QueryOptimiser * qopt, double factor) const
{
    LOGCALL(QUERY, PostingIterator::Internal *, "QueryOr::postlist", qopt | factor);
    OrContext ctx(qopt, subqueries.size());
    do_or_like(ctx, qopt, factor);
    RETURN(ctx.postlist());
}
```
```cpp
void
QueryBranch::do_or_like(OrContext& ctx, QueryOptimiser * qopt, double factor,
                        Xapian::termcount elite_set_size, size_t first) const
{
    LOGCALL_VOID(MATCH, "QueryBranch::do_or_like", ctx | qopt | factor | elite_set_size);

    // FIXME: we could optimise by merging OP_ELITE_SET and OP_OR like we do
    // for AND-like operations.

    // OP_SYNONYM with a single subquery is only simplified by
    // QuerySynonym::done() if the single subquery is a term or MatchAll.
    Assert(subqueries.size() >= 2 || get_op() == Query::OP_SYNONYM);

    vector<PostList *> postlists;
    postlists.reserve(subqueries.size() - first);

    QueryVector::const_iterator q;
    for (q = subqueries.begin() + first; q != subqueries.end(); ++q) {
        // MatchNothing subqueries should have been removed by done().
        Assert((*q).internal.get());
        (*q).internal->postlist_sub_or_like(ctx, qopt, factor);
    }

    if (elite_set_size && elite_set_size < subqueries.size()) {
        ctx.select_elite_set(elite_set_size, subqueries.size());
        // FIXME: not right!
    }
}
```
...
```cpp
LeafPostList *
LocalSubMatch::open_post_list(const string& term,
                              Xapian::termcount wqf,
                              double factor,
                              bool need_positions,
                              bool in_synonym,
                              QueryOptimiser * qopt,
                              bool lazy_weight)
{
    LOGCALL(MATCH, LeafPostList *, "LocalSubMatch::open_post_list", term | wqf | factor | need_positions | qopt | lazy_weight);

    bool weighted = (factor != 0.0 && !term.empty());

    LeafPostList * pl = NULL;
    if (!term.empty() && !need_positions) {
        if ((!weighted && !in_synonym) ||
            !wt_factory->get_sumpart_needs_wdf_()) {
            Xapian::doccount sub_tf;
            db->get_freqs(term, &sub_tf, NULL);
            if (sub_tf == db->get_doccount()) {
                // If we're not going to use the wdf or term positions, and the
                // term indexes all documents, we can replace it with the
                // MatchAll postlist, which is especially efficient if there
                // are no gaps in the docids.
                pl = db->open_post_list(string());
                // Set the term name so the postlist looks up the correct term
                // frequencies - this is necessary if the weighting scheme
                // needs collection frequency or reltermfreq (termfreq would be
                // correct anyway since it's just the collection size in this
                // case).
                pl->set_term(term);
            }
        }
    }

    if (!pl) {
        const LeafPostList * hint = qopt->get_hint_postlist();
        if (hint)
            pl = hint->open_nearby_postlist(term);
        if (!pl)
            pl = db->open_post_list(term);
        qopt->set_hint_postlist(pl);
    }

    if (lazy_weight) {
        // Term came from a wildcard, but we may already have that term in the
        // query anyway, so check before accumulating its TermFreqs.
        map<string, TermFreqs>::iterator i = stats->termfreqs.find(term);
        if (i == stats->termfreqs.end()) {
            Xapian::doccount sub_tf;
            Xapian::termcount sub_cf;
            db->get_freqs(term, &sub_tf, &sub_cf);
            stats->termfreqs.insert(make_pair(term, TermFreqs(sub_tf, 0, sub_cf)));
        }
    }

    if (weighted) {
        Xapian::Weight * wt = wt_factory->clone();
        if (!lazy_weight) {
            // BM25Weight::init() computes the part of the score that does not
            // depend on any specific doc (it only depends on the term and the query).
            wt->init_(*stats, qlen, term, wqf, factor);
            stats->set_max_part(term, wt->get_maxpart());
        } else {
            // Delay initialising the actual weight object, so that we can
            // gather stats for the terms lazily expanded from a wildcard
            // (needed for the remote database case).
            wt = new LazyWeight(pl, wt, stats, qlen, wqf, factor);
        }
        pl->set_termweight(wt);
    }
    RETURN(pl);
}
```
The Weight's init():
```cpp
void
BM25Weight::init(double factor)
{
    Xapian::doccount tf = get_termfreq();

    double tw = 0;
    if (get_rset_size() != 0) {
        Xapian::doccount reltermfreq = get_reltermfreq();

        // There can't be more relevant documents indexed by a term than there
        // are documents indexed by that term.
        AssertRel(reltermfreq,<=,tf);

        // There can't be more relevant documents indexed by a term than there
        // are relevant documents.
        AssertRel(reltermfreq,<=,get_rset_size());

        Xapian::doccount reldocs_not_indexed = get_rset_size() - reltermfreq;

        // There can't be more relevant documents not indexed by a term than
        // there are documents not indexed by that term.
        AssertRel(reldocs_not_indexed,<=,get_collection_size() - tf);

        Xapian::doccount Q = get_collection_size() - reldocs_not_indexed;
        Xapian::doccount nonreldocs_indexed = tf - reltermfreq;
        double numerator = (reltermfreq + 0.5) * (Q - tf + 0.5);
        double denom = (reldocs_not_indexed + 0.5) * (nonreldocs_indexed + 0.5);
        tw = numerator / denom;
    } else {
        tw = (get_collection_size() - tf + 0.5) / (tf + 0.5);
    }

    AssertRel(tw,>,0);

    // The "official" formula can give a negative termweight in unusual cases
    // (without an RSet, when a term indexes more than half the documents in
    // the database).  These negative weights aren't actually helpful, and it
    // is common for implementations to replace them with a small positive
    // weight or similar.
    //
    // Truncating to zero doesn't seem a great approach in practice as it
    // means that some terms in the query can have no effect at all on the
    // ranking, and that some results can have zero weight, both of which
    // are seem surprising.
    //
    // Xapian 1.0.x and earlier adjusted the termweight for any term indexing
    // more than a third of documents, which seems rather "intrusive".  That's
    // what the code currently enabled does, but perhaps it would be better to
    // do something else. (FIXME)
#if 0
    if (rare(tw <= 1.0)) {
        termweight = 0;
    } else {
        termweight = log(tw) * factor;
        if (param_k3 != 0) {
            double wqf_double = get_wqf();
            termweight *= (param_k3 + 1) * wqf_double / (param_k3 + wqf_double);
        }
    }
#else
    if (tw < 2) tw = tw * 0.5 + 1;
    termweight = log(tw) * factor;
    if (param_k3 != 0) {
        double wqf_double = get_wqf();
        termweight *= (param_k3 + 1) * wqf_double / (param_k3 + wqf_double);
    }
#endif
    termweight *= (param_k1 + 1);

    LOGVALUE(WTCALC, termweight);

    if (param_k2 == 0 && (param_b == 0 || param_k1 == 0)) {
        // If k2 is 0, and either param_b or param_k1 is 0 then the document
        // length doesn't affect the weight.
        len_factor = 0;
    } else {
        len_factor = get_average_length();
        // len_factor can be zero if all documents are empty (or the database
        // is empty!)
        if (len_factor != 0) len_factor = 1 / len_factor;
    }
    LOGVALUE(WTCALC, len_factor);
}
```
In summary, this stage:
hands the stats object to the LocalSubMatch;
fetches the posting lists and builds the postlist-tree from the query-tree; along the way it clones a Weight object and computes the factors BM25 needs: the average document length, the minimum document length, and the term's largest wdf (the number of times the term occurs in some doc);
computes the IDF part of the BM25 formula: tw = (get_collection_size() - tf + 0.5) / (tf + 0.5); termweight = log(tw) * factor;
computes the query-side term weight of the BM25 formula: double wqf_double = get_wqf(); termweight *= (param_k3 + 1) * wqf_double / (param_k3 + wqf_double);
computes part of the term-doc relevance component of the BM25 formula: termweight *= (param_k1 + 1);
computes the reciprocal of the average document length: len_factor = 1 / len_factor;
computes get_maxpart(); nothing in the BM25 algorithm actually uses this value.
This finishes the parts of the BM25 formula that do not depend on a concrete doc, i.e. parts one and three.
When building the postlist-tree with AND syntax, PostList * AndContext::postlist() generates the postlist, and the child postlist-tree context is then destroyed;
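To make the split concrete, here is a simplified paraphrase of how the precomputed termweight and len_factor later combine with the per-doc wdf and document length, roughly what BM25Weight::get_sumpart() computes (a hedged sketch; bm25_sumpart is a hypothetical helper, not Xapian code):

```cpp
// Precomputed once per term in BM25Weight::init() (doc-independent):
//   termweight: idf-part * query-tf-part * (k1 + 1)
//   len_factor: 1 / average_document_length
// Per doc, at match time (simplified; the real code also handles k2/b edge cases):
double bm25_sumpart(double wdf, double doc_len,
                    double termweight, double len_factor,
                    double k1, double b) {
    double norm_len = doc_len * len_factor;           // dl / avdl
    double denom = k1 * ((1 - b) + b * norm_len) + wdf;
    return termweight * (wdf / denom);                // the doc-dependent factor
}
```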
(4) Final recall and ranking
The matcher loops pulling docids from the postlist-tree and computes each doc's BM25 score.
The posting-list intersection (AND) process:
PostList * MultiAndPostList::find_next_match(double w_min)
Intersecting two sorted lists:
0. advance the first list's position one step;
1. take the first list's current element;
2. find_next_match() --> check_helper() advances the second list's position until it is greater than or equal to the first list's current element;
3. take the second list's current element and compare it against the first list's;
4. if they do not match, advance the first list;
Note: this is the classic zipper merge.
The main code:
```cpp
/// Note: before this function is called, next_helper() has already advanced the
/// first list by one position (except on the very first call).
/// find_next_match() moves plist[0] to a suitable position; if a position where
/// plist[0] and plist[1] intersect is found, it returns; otherwise there is no
/// intersection left and it sets did = 0 before returning. The caller checks
/// did == 0 to know that the intersection is exhausted.
PostList *
MultiAndPostList::find_next_match(double w_min)
{
advanced_plist0:
    if (plist[0]->at_end()) {
        did = 0;
        return NULL;
    }
    did = plist[0]->get_docid();
    for (size_t i = 1; i < n_kids; ++i) {
        bool valid;
        check_helper(i, did, w_min, valid);
        if (!valid) {
            next_helper(0, w_min);
            goto advanced_plist0;
        }
        if (plist[i]->at_end()) {
            did = 0;
            return NULL;
        }
        Xapian::docid new_did = plist[i]->get_docid();
        if (new_did != did) {
            /// The two lists' current elements differ, which can only mean
            /// plist[0]'s element is the smaller one, so move it forward.
            skip_to_helper(0, new_did, w_min);
            goto advanced_plist0;
        }
    }
    return NULL;
}
```
Fetching the BM25 score:
```cpp
double
LeafPostList::get_weight() const
{
    if (!weight) return 0;
    Xapian::termcount doclen = 0, unique_terms = 0;
    // Fetching the document length and number of unique terms is work we can
    // avoid if the weighting scheme doesn't use them.
    if (need_doclength)
        doclen = get_doclength();
    if (need_unique_terms)
        unique_terms = get_unique_terms();
    // This is where the doc's final BM25 score is assembled, using the
    // part-one and part-three factors computed earlier.
    double sumpart = weight->get_sumpart(get_wdf(), doclen, unique_terms);
    AssertRel(sumpart, <=, weight->get_maxpart());
    return sumpart;
}
```
Taking the union of two sorted lists:
PostList * OrPostList::next(double w_min)
Both lists are read, and get_docid() returns the smaller of the two dids; when one of the posting lists is exhausted, the remaining list replaces the node that owned the two.
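A minimal standalone sketch of that union logic over two sorted docid lists (simplified: the real OrPostList works lazily on PostList objects and also propagates weights):

```cpp
#include <cstdint>
#include <vector>

// Union of two sorted docid lists: always emit the smaller current docid;
// when the docids are equal, emit once and advance both sides.
std::vector<uint32_t> or_merge(const std::vector<uint32_t>& l,
                               const std::vector<uint32_t>& r) {
    std::vector<uint32_t> out;
    size_t i = 0, j = 0;
    while (i < l.size() && j < r.size()) {
        if (l[i] < r[j])      out.push_back(l[i++]);
        else if (r[j] < l[i]) out.push_back(r[j++]);
        else { out.push_back(l[i]); ++i; ++j; }  // same doc on both sides
    }
    // One side is exhausted: the remainder of the other side is the result,
    // analogous to OrPostList handing ownership to the surviving child.
    while (i < l.size()) out.push_back(l[i++]);
    while (j < r.size()) out.push_back(r[j++]);
    return out;
}
```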
How is percent computed?
percent_scale = greatest_wt_subqs_matched / double(total_subqs);
percent_scale /= greatest_wt;
It is tied first to the number of matched terms out of the total number of query terms, and then to the greatest matching weight; percent_scale then serves as the factor behind percent:
double v = wt * percent_factor + 100.0 * DBL_EPSILON; // percent_scale is the percent_factor here; v is the percent
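A quick worked example (numbers invented): suppose the query has total_subqs = 4 terms, and the best doc matched greatest_wt_subqs_matched = 3 of them with weight greatest_wt = 12.0. Then percent_scale = (3/4) / 12.0 ≈ 0.0625, so a doc with wt = 8.0 gets v ≈ 8.0 × 0.0625 = 0.5, i.e. 50% once scaled to a percentage; even the top doc only reaches 75%, because it missed one query term.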
From the way the BM25 score is computed, one can see that some of the BM25 scoring factors (the part-one IDF factor and part of the part-two term-doc relevance factor) do not have to be computed online; they can be computed offline and stored in the posting lists.
The BM25Weight scheme used by default makes no use of the get_maxextra() and get_sumextra() functions.
A more detailed introduction to percent is here: http://www.javashuo.com/article/p-exvrpedu-gc.html
How is the final result list truncated to the requested limit?
When the user only wants n results and more than n are recalled, processing of result n+1 calls std::make_heap to build a heap (if one was already built, new elements are simply added to it) and pops the lowest-scoring doc, keeping exactly n entries. The code also tracks min_weight: any doc scoring below min_weight is discarded immediately, without touching the heap. (See multimatch.cc, from line 746.)
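A minimal sketch of that keep-top-n pattern (hypothetical helper names; the real code in multimatch.cc operates on MSet items and also handles collapsing, sort orders, etc.):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Hit { unsigned did; double weight; };

// Comparator that makes std::make_heap build a min-heap on weight,
// so items.front() is always the lowest-scoring kept hit.
static bool min_heap_cmp(const Hit& a, const Hit& b) { return a.weight > b.weight; }

// Keep only the n best-scoring hits while streaming over candidates.
void collect_top_n(std::vector<Hit>& items, const Hit& cand,
                   size_t n, double& min_weight) {
    if (cand.weight <= min_weight) return;      // cheap reject, no heap work
    if (items.size() < n) {
        items.push_back(cand);
        if (items.size() == n) {                // heapify once, on item n
            std::make_heap(items.begin(), items.end(), min_heap_cmp);
            min_weight = items.front().weight;
        }
        return;
    }
    // Evict the current minimum, insert the candidate, track the new minimum.
    std::pop_heap(items.begin(), items.end(), min_heap_cmp);
    items.back() = cand;
    std::push_heap(items.begin(), items.end(), min_heap_cmp);
    min_weight = items.front().weight;
}
```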