Inverted Index (倒排索引)

Date: 2019-11-11

Tags: index, inverted index


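1. What is an inverted index?

Honestly, 倒排索引 (literally "reverse-sorted index") is the worst translation I have ever come across; it completely misleads people about what the term means.

The name easily suggests taking an A-Z ordering and flipping it to Z-A, which is not what it means at all.

The original English term is inverted index; I feel that 反向索引 (roughly "reverse-direction index") would be a more fitting translation.

An inverted index is defined in contrast to a forward index.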

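Explanation:

A document is made up of many words. Each word can occur many times within the same document, and the same word can of course also appear in different documents.

Forward index: looks at the words from the document's point of view. For each document (identified by a document ID), it records which words the document contains, how many times each word occurs (term frequency), and where each occurrence is located (its offset from the start of the document).

Inverted index (also called inverted file): looks at the documents from the word's point of view. For each word, it records which documents the word appears in (document IDs), how many times the word occurs in each of those documents (term frequency), and where each occurrence is located (its offset from the start of that document).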
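To make the inverted-index definition concrete, here is a minimal Java sketch. The class name `PositionalIndexSketch`, the `build` method, and the sample documents are made up for illustration, and offsets are counted in words rather than characters, which is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: word -> (document ID -> word offsets within that document).
// The term frequency of a word in a document is the size of its offset list.
class PositionalIndexSketch {

    static Map<String, Map<Integer, List<Integer>>> build(List<String> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] words = docs.get(docId).trim().split("\\s+");
            for (int offset = 0; offset < words.length; offset++) {
                if (words[offset].isEmpty()) {
                    continue; // empty document
                }
                index.computeIfAbsent(words[offset], w -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(offset);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up documents; the list position serves as the document ID.
        List<String> docs = List.of(
                "I live in Shanghai and I love Shanghai",
                "life in Shanghai");
        Map<Integer, List<Integer>> postings = build(docs).get("Shanghai");
        postings.forEach((docId, offsets) -> System.out.println(
                "doc " + docId + ": frequency " + offsets.size() + ", offsets " + offsets));
        // prints:
        // doc 0: frequency 2, offsets [3, 7]
        // doc 1: frequency 1, offsets [2]
    }
}
```

Storing the offsets and deriving the frequency from their count keeps the two values from drifting apart; an index that never needs positions could store only the count.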

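In short:
Forward index: document ---> words (a conventional index maps documents to keywords)
Inverted index: word ---> documents (an inverted index maps keywords to documents)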
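The two directions of mapping can be sketched in Java as follows. This is only an illustration: `ForwardToInverted`, `invert`, and the sample data are hypothetical, and the forward index here stores just the list of words per document, without frequencies or offsets.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: turn a forward index (document -> words)
// into an inverted index (word -> documents).
class ForwardToInverted {

    static Map<String, Set<String>> invert(Map<String, List<String>> forward) {
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : forward.entrySet()) {
            for (String word : doc.getValue()) {
                // Flip the direction: the word becomes the key, the document the value.
                inverted.computeIfAbsent(word, w -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        // Made-up forward index keyed by document name.
        Map<String, List<String>> forward = new TreeMap<>();
        forward.put("doc1", List.of("inverted", "index"));
        forward.put("doc2", List.of("forward", "index"));

        System.out.println(invert(forward));
        // prints {forward=[doc2], index=[doc1, doc2], inverted=[doc1]}
    }
}
```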

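Use cases:

Inverted indexes are used very widely, for example in search engines, large-scale database indexing, document retrieval, multimedia retrieval, and information retrieval in general. In short, the inverted index is a very important indexing mechanism in the retrieval field.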
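As a rough illustration of the search-engine case, a multi-word query can be answered by intersecting the document sets of the individual query words. The sketch below assumes a tiny hand-built index; `AndQuerySketch`, `search`, and the document IDs are invented for the example, and real engines add ranking, compression, and much more.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: answer an AND query by intersecting posting sets.
class AndQuerySketch {

    // Returns the IDs of documents that contain every word of the query.
    static Set<Integer> search(Map<String, Set<Integer>> index, String query) {
        Set<Integer> result = null;
        for (String word : query.trim().split("\\s+")) {
            Set<Integer> postings = index.getOrDefault(word, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(postings); // copy so the index is not modified
            } else {
                result.retainAll(postings);       // keep only documents seen for all words so far
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        // Made-up index: word -> IDs of the documents containing it.
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("love", Set.of(2, 3));
        index.put("Shanghai", Set.of(2, 4));

        System.out.println(search(index, "love Shanghai")); // prints [2]
        System.out.println(search(index, "love today"));    // prints []
    }
}
```

Because the lookup starts from the words, only the postings of the query words are touched; with a forward index the same query would have to scan every document.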

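2. A Java implementation of an inverted index

Suppose there are three articles, file1, file2, and file3, with the following contents:

The inverted index built from them then looks like this:

Below is a simple implementation of an inverted index. Here each line of the input file plays the role of a document: for every word of a query string, the program reports the line numbers on which that word appears and how many times it appears on each line.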

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndex {

    // word -> (line number -> number of occurrences on that line)
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    // Reads the file line by line and records every word together with its line number.
    public void createIndex(String filePath) {
        index.clear();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath)))) {
            String text;
            int line = 1;
            while ((text = reader.readLine()) != null) {
                // split on runs of whitespace so repeated spaces do not produce empty words
                for (String word : text.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue; // blank line
                    }
                    // TreeMap keeps the line numbers in ascending order when printing
                    index.computeIfAbsent(word, k -> new TreeMap<>())
                         .merge(line, 1, Integer::sum);
                }
                line++;
            }
        } catch (IOException e) {
            System.out.println("error in read file");
        }
    }

    // Prints, for each word of the query, the lines it occurs on and the count per line.
    public void find(String str) {
        for (String word : str.trim().split("\\s+")) {
            StringBuilder sb = new StringBuilder();
            Map<Integer, Integer> lines = index.get(word);
            if (lines != null) {
                sb.append("word: ").append(word).append(" in ");
                for (Map.Entry<Integer, Integer> e : lines.entrySet()) {
                    sb.append("line ").append(e.getKey())
                      .append(" [").append(e.getValue()).append("] , ");
                }
            } else {
                sb.append("word: ").append(word).append(" not found");
            }
            System.out.println(sb);
        }
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.createIndex("news.txt");
        index.find("I love Shanghai today");
    }
}

The input file news.txt contains:

I am eriol  
I live in Shanghai and I love Shanghai  
I also love travelling  
life in Shanghai  
is beautiful

The output is:

word: I in line 1 [1] , line 2 [2] , line 3 [1] ,   
word: love in line 2 [1] , line 3 [1] ,   
word: Shanghai in line 2 [2] , line 4 [1] ,   
word: today not found

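References: 倒排索引簡單實現 (A Simple Implementation of an Inverted Index)

Zhihu: 倒排索引爲何叫倒排索引？ (Why Is the Inverted Index Called 倒排索引?)

Further resources (not covered in this article):

倒排索引的java實現 (A Java Implementation of an Inverted Index)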