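This post records what Xapian's in-memory index does while a document is being added.
It is mainly a walk through the sequence of functions executed along the way.
Demo code: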
Xapian::WritableDatabase db = Xapian::InMemory::open();
Xapian::Document doc;
// Add postings: "T" is the field prefix, the term text is 世界, the position is 1
doc.add_posting("T世界", 1);
doc.add_posting("T體育", 2);
doc.add_posting("T比賽", 3);
// Set the document data
doc.set_data("世界體育比賽");
// Add the document's unique term
doc.add_boolean_term(K_DOC_UNIQUE_ID);
// Use replace_document so that the document carrying K_DOC_UNIQUE_ID stays unique in the index
Xapian::docid innerId = db.replace_document(K_DOC_UNIQUE_ID, doc);
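1. Create and populate the Document
Define a document object and use the add_posting interface to add each term together with its position and wdfinc.
Internal implementation details:
1.1 First try to read the term data the doc already has; if any is found, record the terms and their positions in terms.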
void
Xapian::Document::Internal::need_terms() const
{
    if (terms_here) {
        return;
    }
    if (database.get()) {
        Xapian::TermIterator t(database->open_term_list(did));
        Xapian::TermIterator tend(NULL);
        for ( ; t != tend; ++t) {
            Xapian::PositionIterator p = t.positionlist_begin();
            OmDocumentTerm term(t.get_wdf());
            for ( ; p != t.positionlist_end(); ++p) {
                term.append_position(*p);
            }
            terms.insert(make_pair(*t, term));
        }
    }
    termlist_size = terms.size();
    terms_here = true;
}
1.2 Adding a brand-new term: create a new term object, append the position to it, and then insert it into terms.
void
Xapian::Document::Internal::add_posting(const string & tname,
                                        Xapian::termpos tpos,
                                        Xapian::termcount wdfinc)
{
    need_terms();
    positions_modified = true;

    std::map<std::string, OmDocumentTerm>::iterator i = terms.find(tname);
    if (i == terms.end()) {
        ++termlist_size;
        OmDocumentTerm newterm(wdfinc);
        newterm.append_position(tpos);
        terms.insert(make_pair(tname, newterm));
    } else {
        // The doc already has this term
        if (i->second.add_position(wdfinc, tpos)) {
            ++termlist_size;
        }
    }
}
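To make the two branches concrete, here is a minimal sketch of the same bookkeeping using only the standard library. SketchTerm and SketchDocument are hypothetical stand-ins, not Xapian types, and the sketch deliberately skips the duplicate check and the ordering logic described in 1.3.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for OmDocumentTerm: a wdf counter plus positions.
struct SketchTerm {
    unsigned wdf = 0;
    std::vector<unsigned> positions;  // kept in insertion order in this sketch
};

// Hypothetical stand-in for Document::Internal, reduced to the term map.
struct SketchDocument {
    std::map<std::string, SketchTerm> terms;

    void add_posting(const std::string& tname, unsigned tpos,
                     unsigned wdfinc = 1) {
        // operator[] creates an empty entry when the term is new,
        // otherwise it returns the existing entry.
        SketchTerm& term = terms[tname];
        term.wdf += wdfinc;
        term.positions.push_back(tpos);
    }
};
```

Adding the same term twice at different positions leaves one map entry whose wdf is 2 and which holds both positions.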
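1.3 Adding a term that already exists: call add_position on the OmDocumentTerm object to add the element to its positions, which must end up in ascending order. For every insertion after the first, a small batched insertion-sort trick is used, which reduces the number of comparisons and element moves and is worth reading. Note: positions is not necessarily sorted once the adds are finished; one more merge is done before the doc is added to the DB.
The trick: when adding elements to a sorted array, the usual code does an ordered insert: locate the insertion point, shift the later data back, then insert. Each insert costs O(n), so n inserts cost O(n^2). The approach used here has a flavor of multi-way merging:
(1) The data is split into a history group and a new group. While data is being added, both groups must stay in ascending order; otherwise they have to be merged.
(2) If the incoming value fits the history group (it keeps ascending order and the new group is currently empty), it is appended directly to the tail of the history group.
(3) Otherwise, check whether it fits the new group (it keeps ascending order); if so, append it to the new group.
(4) If it does not fit, merge the new group into the history group. This is a merge of two ascending arrays with time complexity O(n+m). After the merge, repeat steps (2), (3) and (4).
(5) When all the data has been added, the new group may still not have been merged into the history group; that merge is deferred until the doc is added to the db.
In the actual code the history group and the new group are stored together in a single vector, with one variable recording where the history group ends.
With this trick the worst-case time complexity is still O(n^2), but compared with an insertion sort that inserts one number at a time, the actual running time drops several-fold.
This design idea is the same as the large/small (static/dynamic) index design commonly used for search engine indexes.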
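The steps above can be sketched in a few lines of standard C++. This is a simplified illustration under stated assumptions, not Xapian's code: SplitPositions is a hypothetical class, and using std::inplace_merge for the merge step is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified position list. Elements [0, split) form the sorted
// "history" run and [split, end) form the sorted "new" run. split == 0 means
// the whole vector is one sorted run.
class SplitPositions {
    std::vector<unsigned> positions;
    std::size_t split = 0;

  public:
    void add(unsigned pos) {
        if (positions.empty() || pos > positions.back()) {
            // Ascending relative to the tail: append, unless the value is
            // already present in the history run.
            if (split > 0 &&
                std::binary_search(positions.begin(),
                                   positions.begin() + split, pos)) {
                return;
            }
            positions.push_back(pos);
            return;
        }
        if (pos == positions.back()) return;  // duplicate of the last entry
        // Out of order: fold any pending new run into the history run first.
        merge();
        if (std::binary_search(positions.begin(), positions.end(), pos)) {
            return;  // duplicate
        }
        // Start a fresh new run holding just this value.
        split = positions.size();
        positions.push_back(pos);
    }

    // Merge the two sorted runs; linear in the total number of elements.
    void merge() {
        if (split == 0) return;
        std::inplace_merge(positions.begin(), positions.begin() + split,
                           positions.end());
        split = 0;
    }

    // The deferred merge from step (5): callers see a fully sorted list.
    const std::vector<unsigned>& sorted() {
        merge();
        return positions;
    }
};
```

Feeding in 1, 5, 9, 3, 4, 7 triggers no merge at all: 3, 4 and 7 simply accumulate in the new run, and a single merge happens later when an out-of-order value arrives or when sorted() is called.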
bool
OmDocumentTerm::add_position(Xapian::termcount wdf_inc, Xapian::termpos tpos)
{
    LOGCALL(DB, bool, "OmDocumentTerm::add_position", wdf_inc | tpos);

    if (rare(is_deleted())) {
        wdf = wdf_inc;
        split = 0;
        positions.push_back(tpos);
        return true;
    }

    wdf += wdf_inc;

    // Optimise the common case of adding positions in ascending order.
    if (positions.empty()) {
        positions.push_back(tpos);
        return false;
    }
    if (tpos > positions.back()) {
        if (split) {
            // Check for duplicate before split.
            auto i = lower_bound(positions.cbegin(),
                                 positions.cbegin() + split,
                                 tpos);
            if (i != positions.cbegin() + split && *i == tpos) {
                return false;
            }
        }
        positions.push_back(tpos);
        return false;
    }

    if (tpos == positions.back()) {
        // Duplicate of last entry.
        return false;
    }

    if (split > 0) {
        // We could merge in the new entry at the same time, but that seems to
        // make things much more complex for minor gains.
        merge();
    }

    // Search for the position the term occurs at. Use binary chop to
    // search, since this is a sorted list.
    vector<Xapian::termpos>::iterator i;
    i = lower_bound(positions.begin(), positions.end(), tpos);
    if (i == positions.end() || *i != tpos) {
        auto new_split = positions.size();
        if (sizeof(split) < sizeof(Xapian::termpos)) {
            if (rare(new_split > numeric_limits<decltype(split)>::max())) {
                // The split point would be beyond the size of the type used to
                // hold it, which is really unlikely if that type is 32-bit.
                // Just insert the old way in this case.
                positions.insert(i, tpos);
                return false;
            }
        } else {
            // This assertion should always be true because we shouldn't have
            // duplicate entries and the split point can't be after the final
            // entry.
            AssertRel(new_split, <=, numeric_limits<decltype(split)>::max());
        }
        split = new_split;
        positions.push_back(tpos);
    }
    return false;
}
1.4 Setting the data
void
Xapian::Document::Internal::set_data(const string &data_)
{
    data = data_;
    data_here = true;
}
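2. Adding the Document to the in-memory DB
To keep the document unique, replace_document is used here.
After basic parameter checks, it determines whether the database consists of multiple sub-databases. If so, it has to decide which sub-database the data is written to, and it also has to delete any doc with the same unique_term that may exist in the other sub-databases.
It then checks whether a posting list exists for this unique_term; if not, it takes the add path.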
Xapian::docid
WritableDatabase::replace_document(const std::string & unique_term,
                                   const Document & document)
{
    LOGCALL(API, Xapian::docid, "WritableDatabase::replace_document", unique_term | document);
    if (unique_term.empty()) {
        throw InvalidArgumentError("Empty termnames are invalid");
    }
    size_t n_dbs = internal.size();
    if (rare(n_dbs == 0)) {
        no_subdatabases();
    }
    if (n_dbs == 1) {
        RETURN(internal[0]->replace_document(unique_term, document));
    }

    Xapian::PostingIterator postit = postlist_begin(unique_term);
    // If no unique_term in the database, this is just an add_document().
    if (postit == postlist_end(unique_term)) {
        // Which database will the next never used docid be in?
        Xapian::docid did = get_lastdocid() + 1;
        if (rare(did == 0)) {
            throw Xapian::DatabaseError("Run out of docids - you'll have to use copydatabase to eliminate any gaps before you can add more documents");
        }
        size_t i = sub_db(did, n_dbs);
        RETURN(internal[i]->add_document(document));
    }

    Xapian::docid retval = *postit;
    size_t i = sub_db(retval, n_dbs);
    internal[i]->replace_document(sub_docid(retval, n_dbs), document);

    // Delete any other occurrences of unique_term.
    while (++postit != postlist_end(unique_term)) {
        Xapian::docid did = *postit;
        i = sub_db(did, n_dbs);
        internal[i]->delete_document(sub_docid(did, n_dbs));
    }

    return retval;
}
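The listing calls sub_db and sub_docid without showing them. The sketch below assumes an interleaved scheme in which combined docids are dealt out to the sub-databases round-robin; the sketch_ names are hypothetical and the formulas are an assumption for illustration, so check them against the Xapian source before relying on them.

```cpp
#include <cassert>
#include <cstddef>

using docid = unsigned;

// Assumed mapping: combined docid 1 goes to sub-database 0, docid 2 to
// sub-database 1, and so on, wrapping around after n_dbs documents.
inline std::size_t sketch_sub_db(docid did, std::size_t n_dbs) {
    return (did - 1) % n_dbs;
}

// The docid as seen inside the chosen sub-database (1-based).
inline docid sketch_sub_docid(docid did, std::size_t n_dbs) {
    return static_cast<docid>((did - 1) / n_dbs + 1);
}

// Inverse mapping: rebuild the combined docid from shard and local docid.
inline docid sketch_combined_docid(docid sub_did, std::size_t shard,
                                   std::size_t n_dbs) {
    return static_cast<docid>((sub_did - 1) * n_dbs + shard + 1);
}
```

Under this assumption, with three sub-databases the combined docid 7 lands in sub-database 0 as its local docid 3.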
2.1 Adding a new document
The add is split into make_doc and finish_add_doc, probably so that the finish_add_doc code can be reused when a document is actually replaced.
Xapian::docid
InMemoryDatabase::add_document(const Xapian::Document & document)
{
    LOGCALL(DB, Xapian::docid, "InMemoryDatabase::add_document", document);
    if (closed) {
        InMemoryDatabase::throw_database_closed();
    }

    Xapian::docid did = make_doc(document.get_data());
    finish_add_doc(did, document);

    RETURN(did);
}
2.2 Implementation of make_doc
Xapian::docid
InMemoryDatabase::make_doc(const string & docdata)
{
    termlists.push_back(InMemoryDoc(true));
    doclengths.push_back(0);
    doclists.push_back(docdata);

    AssertEqParanoid(termlists.size(), doclengths.size());

    return termlists.size();
}
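What make_doc relies on is that docids are 1-based while the parallel vectors are 0-based, so the new docid is simply the vector size after the push. A minimal sketch of that convention, where SketchStore and get_data are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical reduction of the in-memory store to two parallel vectors.
struct SketchStore {
    std::vector<std::string> doclists;  // document data, index = docid - 1
    std::vector<unsigned> doclengths;   // document lengths, index = docid - 1

    unsigned make_doc(const std::string& docdata) {
        doclists.push_back(docdata);
        doclengths.push_back(0);
        // The size after the push is the 1-based id of the new slot.
        return static_cast<unsigned>(doclists.size());
    }

    const std::string& get_data(unsigned did) const {
        return doclists.at(did - 1);  // convert the docid back to an index
    }
};
```

This is why finish_add_doc indexes with did - 1 when it updates doclengths and termlists.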
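2.3 Implementation of finish_add_doc
It first adds the values, then creates the terms and fills in the termlist and postlist structures.
The termlist is the list of terms for a document and holds all the information about each term: the term name, how many times the term occurs in this document, and the positions at which it occurs in this document.
The postlist is the list of documents for a term and holds the information about each document: the docid, the positions at which the term occurs in that doc, and how many times the term occurs in that doc.
In other words, the position information is stored twice: once in the termlist and once in the postlist.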
void
InMemoryDatabase::finish_add_doc(Xapian::docid did, const Xapian::Document &document)
{
    {
        std::map<Xapian::valueno, string> values;
        Xapian::ValueIterator k = document.values_begin();
        for ( ; k != document.values_end(); ++k) {
            values.insert(make_pair(k.get_valueno(), *k));
            LOGLINE(DB, "InMemoryDatabase::finish_add_doc(): adding value " << k.get_valueno() << " -> " << *k);
        }
        add_values(did, values);
    }

    InMemoryDoc doc(true);
    Xapian::TermIterator i = document.termlist_begin();
    for ( ; i != document.termlist_end(); ++i) {
        make_term(*i);

        LOGLINE(DB, "InMemoryDatabase::finish_add_doc(): adding term " << *i);
        Xapian::PositionIterator j = i.positionlist_begin();
        if (j == i.positionlist_end()) {
            /* Make sure the posting exists, even without a position. */
            make_posting(&doc, *i, did, 0, i.get_wdf(), false);
        } else {
            positions_present = true;
            for ( ; j != i.positionlist_end(); ++j) {
                make_posting(&doc, *i, did, *j, i.get_wdf());
            }
        }

        Assert(did > 0 && did <= doclengths.size());
        doclengths[did - 1] += i.get_wdf();
        totlen += i.get_wdf();
        postlists[*i].collection_freq += i.get_wdf();
        ++postlists[*i].term_freq;
    }
    swap(termlists[did - 1], doc);

    totdocs++;
}
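The handling of position information is somewhat questionable in design: the positions were already sorted once when the doc was populated, yet when they are later added to the termlist or postlist, each position is processed individually all over again.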
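To illustrate the double bookkeeping of termlist and postlist, here is a minimal sketch of a tiny in-memory index. SketchPosting and SketchIndex are hypothetical types, taking wdf as the number of positions is a simplifying assumption, and unlike the code above the sketch copies each position list in one step instead of one position at a time.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using docid = unsigned;
using termpos = unsigned;

// Hypothetical posting record shared by both structures.
struct SketchPosting {
    docid did;
    unsigned wdf;
    std::vector<termpos> positions;
};

struct SketchIndex {
    // termlists[did - 1]: term -> posting, i.e. "which terms are in this doc".
    std::vector<std::map<std::string, SketchPosting>> termlists;
    // postlists[term]: postings in ascending docid order,
    // i.e. "which docs contain this term".
    std::map<std::string, std::vector<SketchPosting>> postlists;

    docid add_document(
        const std::map<std::string, std::vector<termpos>>& terms) {
        termlists.emplace_back();
        docid did = static_cast<docid>(termlists.size());
        for (const auto& entry : terms) {
            // Assumption for this sketch: wdf equals the number of positions.
            SketchPosting p{did,
                            static_cast<unsigned>(entry.second.size()),
                            entry.second};
            termlists[did - 1].emplace(entry.first, p);  // copy 1: termlist
            postlists[entry.first].push_back(p);         // copy 2: postlist
        }
        return did;
    }
};
```

Because docids are handed out in increasing order, appending to a term's postlist keeps it sorted by docid without any extra work.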
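After a document is added to the DB, commit has to be called; because the in-memory index is never persisted to disk, InMemoryDatabase's commit is an empty function.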