Code Analysis of an Open-Source Search Engine Project: Indexing (1)

Date: 2019-11-15


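First, a brief introduction to the basic terms. keywords: search keys; tokens: the terms extracted from a document. Tokens and labels together make up the search keys (keywords) in the indexer.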

1. In the previous post, the relevant posts were crawled from Weibo. This post takes them through the next step of a search engine: indexing.

  // Read in the Weibo data
  file, err := os.Open("../../testdata/weibo_data.txt")

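2. To use the Wukong engine you need to import two packages: the first defines the engine functionality and the second defines the structs. The engine must also be initialized before it is used.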

import (
    "github.com/huichen/wukong/engine"
    "github.com/huichen/wukong/types"
)

3. Before indexing, there are a few basic concepts to keep in mind:

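IndexerInitOptions.IndexType selects one of three types of index table:
    1) DocIdsIndex provides the most basic index and records only the docids of the documents in which a search key appears;
    2) FrequenciesIndex records, besides the docid, how often the search key appears in each document;
    3) LocationsIndex contains everything in the two indexes above and additionally stores the exact positions of the tokens in the document.
From top to bottom, these three index types offer more computational capability and consume more memory; LocationsIndex in particular uses a lot of memory when documents are long. Choose according to your needs. If no type is chosen, the system defaults to FrequenciesIndex.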
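
To make the difference between the three index types concrete, here is a small sketch of what each one would keep for a single search key in a single document. The posting type and buildPosting function are hypothetical and are not Wukong's internal layout; positions are token offsets here as a simplification.

```go
package main

import "fmt"

// posting is a hypothetical record of what an index could hold about
// one search key in one document.
type posting struct {
	docID     uint64 // kept by every index type
	frequency int    // additionally kept by a frequency index
	locations []int  // additionally kept by a location index (token offsets here)
}

// buildPosting scans the tokens of one document for key. The boolean
// result is false when the key does not occur in the document.
func buildPosting(docID uint64, tokens []string, key string) (posting, bool) {
	p := posting{docID: docID}
	for i, tok := range tokens {
		if tok == key {
			p.frequency++
			p.locations = append(p.locations, i)
		}
	}
	return p, p.frequency > 0
}

func main() {
	tokens := []string{"search", "engine", "index", "search"}
	if p, ok := buildPosting(7, tokens, "search"); ok {
		fmt.Println("doc id only:   ", p.docID)     // 7
		fmt.Println("plus frequency:", p.frequency) // 2
		fmt.Println("plus locations:", p.locations) // [0 3]
	}
}
```

The locations slice grows with every occurrence, which is why the location-based index is the most memory-hungry on long documents.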

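4. The Wukong engine lets you add three kinds of index data:
    1) The document body (content), which is segmented into tokens and added to the index.
    2) The document's tokens. When the body is empty, the user may bypass Wukong's built-in segmenter and
          supply the document's tokens directly, which makes it possible to segment documents outside the engine.
    3) The document's attribute labels, such as a Weibo post's author or category. Labels do not appear in the body.
Note: when the body is not empty, tokens are taken from the body first. Tokens and labels together make up the search keys (keywords) in the indexer, and labels never appear in the body.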

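5. The engine indexes asynchronously: when IndexDocument returns, the document may not yet have been added to the index table, which makes it convenient to add documents concurrently in a loop. If you need to wait until indexing has finished before continuing, call the following function: searcher.FlushIndex()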
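
Why a flush call is needed with asynchronous indexing can be shown with a small stand-in. This is only a sketch of the general pattern (a background worker plus a wait group); the asyncIndex type and its methods are hypothetical and say nothing about how Wukong implements FlushIndex.

```go
package main

import (
	"fmt"
	"sync"
)

// request is one queued document.
type request struct {
	id   uint64
	text string
}

// asyncIndex is a hypothetical index that stores documents in the background.
type asyncIndex struct {
	mu      sync.Mutex
	docs    map[uint64]string
	queue   chan request
	pending sync.WaitGroup
}

func newAsyncIndex() *asyncIndex {
	a := &asyncIndex{docs: make(map[uint64]string), queue: make(chan request, 16)}
	go a.worker()
	return a
}

// worker drains the queue and marks each request as done once it is stored.
func (a *asyncIndex) worker() {
	for r := range a.queue {
		a.mu.Lock()
		a.docs[r.id] = r.text
		a.mu.Unlock()
		a.pending.Done()
	}
}

// Add returns once the request is queued, possibly before it is stored.
// In this sketch Add and Flush are meant to be called from one goroutine.
func (a *asyncIndex) Add(id uint64, text string) {
	a.pending.Add(1)
	a.queue <- request{id: id, text: text}
}

// Flush blocks until every queued document has been stored.
func (a *asyncIndex) Flush() { a.pending.Wait() }

// Len reports how many documents are stored right now.
func (a *asyncIndex) Len() int {
	a.mu.Lock()
	defer a.mu.Unlock()
	return len(a.docs)
}

func main() {
	idx := newAsyncIndex()
	for i := uint64(1); i <= 100; i++ {
		idx.Add(i, "doc")
	}
	idx.Flush()            // without this, Len could still be below 100
	fmt.Println(idx.Len()) // prints 100
}
```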

6. The indexing code is analyzed below; parts whose functionality overlaps will be covered in later posts on indexing.

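The code below defines the basic parts of the indexer. A sync.RWMutex read-write lock is embedded to make access to the map safe. Lock and Unlock calls must be paired, however: if the lock is taken once and Unlock is then called more times than Lock, the program fails with a fatal run-time error. A check on the number of calls could guard against such a mismatch; the usual Go guard is to follow each Lock with a deferred Unlock, as AddDocument does further down.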

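// Indexer
type Indexer struct {
    // Inverted index from search keys to document lists
    // A read-write lock is added to make reads and writes safe
    tableLock struct {
        sync.RWMutex
        table map[string]*KeywordIndices
    }
    initOptions types.IndexerInitOptions
    initialized bool
    // This is actually an approximation of the total number of documents
    numDocuments uint64
    // Total number of tokens across all indexed text
    totalTokenLength float32
    // Token length of each document
    docTokenLengths map[uint64]float32
}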
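
The locking pattern used by tableLock can be tried in isolation. The sketch below is a hypothetical safeTable type, not part of Wukong; it embeds the lock next to the map in the same way and pairs every Lock with a deferred Unlock so the two can never get out of step.

```go
package main

import (
	"fmt"
	"sync"
)

// safeTable is a hypothetical map guarded by an embedded RWMutex.
type safeTable struct {
	sync.RWMutex
	table map[string][]uint64
}

// add takes the write lock; the deferred Unlock runs exactly once per Lock.
func (s *safeTable) add(key string, docID uint64) {
	s.Lock()
	defer s.Unlock()
	s.table[key] = append(s.table[key], docID)
}

// get takes the read lock, so concurrent readers do not block each other.
// It returns a copy so callers never share the guarded slice.
func (s *safeTable) get(key string) []uint64 {
	s.RLock()
	defer s.RUnlock()
	return append([]uint64(nil), s.table[key]...)
}

func main() {
	s := &safeTable{table: make(map[string][]uint64)}
	var wg sync.WaitGroup
	for i := uint64(1); i <= 10; i++ {
		wg.Add(1)
		go func(id uint64) {
			defer wg.Done()
			s.add("weibo", id)
		}(i)
	}
	wg.Wait()
	fmt.Println(len(s.get("weibo"))) // prints 10
}
```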

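The struct below follows from the index types explained in the concepts above. Choose the IndexType that matches your business requirements and the memory you can afford.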

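// One row of the inverted index table: it collects all documents in which a search key appears, sorted by DocId in ascending order.
type KeywordIndices struct {
    // Whether the slices below are empty depends on the value of IndexType at initialization
    docIds      []uint64  // present for all types
    frequencies []float32 // IndexType == FrequenciesIndex
    locations   [][]int   // IndexType == LocationsIndex
}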

The indexer is then initialized accordingly:

// Initialize the indexer
func (indexer *Indexer) Init(options types.IndexerInitOptions) {
    if indexer.initialized == true {
        log.Fatal("the indexer cannot be initialized twice")
    }
    indexer.initialized = true
    indexer.tableLock.table = make(map[string]*KeywordIndices)
    indexer.initOptions = options
    indexer.docTokenLengths = make(map[uint64]float32)
}

Next, a document is added to the index: its search keys are extracted, along with their frequencies and even their positions.

// Add a document to the inverted index table
func (indexer *Indexer) AddDocument(document *types.DocumentIndex) {
    if indexer.initialized == false {
        log.Fatal("the indexer has not been initialized")
    }
    indexer.tableLock.Lock()
    defer indexer.tableLock.Unlock()
    // Update the total token length of the documents
    if document.TokenLength != 0 {
        originalLength, found := indexer.docTokenLengths[document.DocId]
        indexer.docTokenLengths[document.DocId] = float32(document.TokenLength)
        if found {
            indexer.totalTokenLength += document.TokenLength - originalLength
        } else {
            indexer.totalTokenLength += document.TokenLength
        }
    }
    ...
    ...
    ...

After the new document has been handled, each of its search keys is looked up:

    docIdIsNew := true
    for _, keyword := range document.Keywords {
        indices, foundKeyword := indexer.tableLock.table[keyword.Text]
        if !foundKeyword {
            // If the search key is not found, add it
            ti := KeywordIndices{}
            switch indexer.initOptions.IndexType {
            case types.LocationsIndex:
                ti.locations = [][]int{keyword.Starts}
            case types.FrequenciesIndex:
                ti.frequencies = []float32{keyword.Frequency}
            }
            ti.docIds = []uint64{document.DocId}
            indexer.tableLock.table[keyword.Text] = &ti
            continue
        }
        // Find the position at which to insert
        position, found := indexer.searchIndex(
            indices, 0, indexer.getIndexLength(indices)-1, document.DocId)
        if found {
            docIdIsNew = false
            // Overwrite the existing index entry
            switch indexer.initOptions.IndexType {
            case types.LocationsIndex:
                indices.locations[position] = keyword.Starts
            case types.FrequenciesIndex:
                indices.frequencies[position] = keyword.Frequency
            }
            continue
        }

Here the new index entry is inserted according to the chosen IndexType:

        // When the index entry does not exist, insert a new one
        switch indexer.initOptions.IndexType {
        case types.LocationsIndex:
            indices.locations = append(indices.locations, []int{})
            copy(indices.locations[position+1:], indices.locations[position:])
            indices.locations[position] = keyword.Starts
        case types.FrequenciesIndex:
            indices.frequencies = append(indices.frequencies, float32(0))
            copy(indices.frequencies[position+1:], indices.frequencies[position:])
            indices.frequencies[position] = keyword.Frequency
        }
        indices.docIds = append(indices.docIds, 0)
        copy(indices.docIds[position+1:], indices.docIds[position:])
        indices.docIds[position] = document.DocId
    }
    // Update the total number of documents
    if docIdIsNew {
        indexer.numDocuments++
    }
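
The append-then-copy idiom used above to keep docIds sorted can be tried on its own. In this sketch insertSorted is a hypothetical helper, and the standard library's sort.Search stands in for the indexer's own searchIndex method, whose body is not shown in this post.

```go
package main

import (
	"fmt"
	"sort"
)

// insertSorted puts id into the ascending slice ids and returns the result.
// If id is already present, the slice is returned unchanged.
func insertSorted(ids []uint64, id uint64) []uint64 {
	// sort.Search returns the first position whose value is >= id,
	// or len(ids) when every value is smaller.
	pos := sort.Search(len(ids), func(i int) bool { return ids[i] >= id })
	if pos < len(ids) && ids[pos] == id {
		return ids
	}
	ids = append(ids, 0)         // grow the slice by one slot
	copy(ids[pos+1:], ids[pos:]) // shift the tail one slot to the right
	ids[pos] = id                // write the new value into the gap
	return ids
}

func main() {
	ids := []uint64{2, 5, 9}
	ids = insertSorted(ids, 7)
	ids = insertSorted(ids, 1)
	ids = insertSorted(ids, 5) // already present, no change
	fmt.Println(ids)           // prints [1 2 5 7 9]
}
```

The built-in copy handles overlapping source and destination correctly, which is what makes the one-slot shift safe.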

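When the search keys combine tokens and labels, the search scope can be narrowed further; again, note that labels do not appear in the body. In the code below I wondered whether copy(keywords[len(tokens):], labels) should instead be copy(keywords[len(tokens)+1:], labels). It should not: slices are zero-indexed, so the tokens fill indexes 0 through len(tokens)-1 and the labels must start at index len(tokens). With +1, one slot would stay empty and the last label would be dropped.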

// Look up the documents that contain all of the search keys (AND operation)
// When docIds is not nil, search only the documents specified by docIds
func (indexer *Indexer) Lookup(
    tokens []string, labels []string, docIds *map[uint64]bool) (docs []types.IndexedDocument) {
    if indexer.initialized == false {
        log.Fatal("the indexer has not been initialized")
    }
    if indexer.numDocuments == 0 {
        return
    }
    // Merge tokens and labels into search keys
    keywords := make([]string, len(tokens)+len(labels))
    copy(keywords, tokens)
    copy(keywords[len(tokens):], labels)
    indexer.tableLock.RLock()
    defer indexer.tableLock.RUnlock()
    table := make([]*KeywordIndices, len(keywords))
    for i, keyword := range keywords {
        indices, found := indexer.tableLock.table[keyword]
        if !found {
            // Return immediately when the search key is not in the inverted index table
            return
        } else {
            // Otherwise add it to the table
            table[i] = indices
        }
    }
    // Return immediately when nothing was found
    if len(table) == 0 {
        return
    }
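
The offset used when merging tokens and labels can be checked with a small experiment. The two helper functions below are hypothetical; the first repeats the merge from Lookup and the second applies the +1 offset for comparison.

```go
package main

import "fmt"

// mergeKeys concatenates tokens and labels the same way Lookup does.
func mergeKeys(tokens, labels []string) []string {
	keywords := make([]string, len(tokens)+len(labels))
	copy(keywords, tokens)               // fills indexes 0 .. len(tokens)-1
	copy(keywords[len(tokens):], labels) // continues at index len(tokens)
	return keywords
}

// mergeKeysPlusOne applies the +1 offset. It leaves a gap after the tokens
// and has room for one label fewer. It expects at least one label; with
// none, the slice expression is out of range and panics.
func mergeKeysPlusOne(tokens, labels []string) []string {
	keywords := make([]string, len(tokens)+len(labels))
	copy(keywords, tokens)
	copy(keywords[len(tokens)+1:], labels)
	return keywords
}

func main() {
	tokens := []string{"search", "engine"}
	labels := []string{"author:li", "tech"}
	fmt.Printf("%q\n", mergeKeys(tokens, labels))        // ["search" "engine" "author:li" "tech"]
	fmt.Printf("%q\n", mergeKeysPlusOne(tokens, labels)) // ["search" "engine" "" "author:li"]
}
```

The empty string in the second result would be looked up as a search key, and the label "tech" is lost, so the original offset is the correct one.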

Summary:

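The indexing code above lays the groundwork for using an inverted index. An inverted index follows the word → document pattern: given a word, it finds all documents that contain it, and it also maps the word to its occurrence count and positions in each of those documents. The code above only illustrates the preliminary details of indexing; the next post will focus on how the index is used.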
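As for the code above, my personal understanding is that in Golang, to keep a program from failing at run time, failures should where possible be reported through error values (whose message is read with err.Error()) and checked by the caller. I am not sure whether this is appropriate here, and it is something to watch in practice.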