Python Basics: Hands-On Object-Oriented Programming (Part 2): A Search Engine

Example: A Search Engine

  A search engine consists of four parts: a searcher, an indexer, a retriever, and a user interface.

  The searcher is the crawler: the content it crawls is fed to the indexer, which builds an index and stores it in an internal database. A user issues a query through the user interface; the query is parsed and sent to the retriever, which searches the index efficiently and returns the results to the user.

  The following five files are the crawled sample corpus.

# # 1.txt
# I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# # 2.txt
# I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# # 3.txt
# I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# # 4.txt
# This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# # 5.txt
# And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

A Simple Search Engine

  SearchEngineBase is the base class; the corpus is the collection of documents to index.
class SearchEngineBase(object):
    def __init__(self):
        pass

    # Read each file and pass its path (used as the id) together with its content to process_corpus
    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    # Indexer: store the processed content as an index, keyed by the file path (the id)
    def process_corpus(self, id, text):
        raise NotImplementedError('process_corpus not implemented.')

    # Retriever
    def search(self, query):
        raise NotImplementedError('search not implemented.')

# Polymorphism: main() works with any SearchEngineBase subclass
def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)



class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = dict()

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results

search_engine = SimpleEngine()
main(search_engine)

########## 輸出 ##########
# simple
# found 0 result(s):
# whe
# found 2 result(s):
# 1.txt
# 5.txt
  Drawbacks: the index stores each document's full text, so both indexing and retrieval consume a lot of space; moreover, a query can only be a single word or a few consecutive words, and multiple words scattered across different positions in a document cannot be matched.
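The "consecutive words only" limitation is easy to demonstrate with a standalone snippet (the sentence below is a hypothetical stand-in, not one of the corpus files):

```python
# Plain substring matching only finds words that appear consecutively.
text = "I have a dream that my four little children will one day live free."

print("dream children" in text)                # False: the words are not adjacent
print("dream" in text and "children" in text)  # True: both words do occur
```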

The Bag-of-Words Model

  This version uses the bag-of-words model, one of the simplest models in NLP.
  process_corpus calls parse_text_to_words to break each document into a set of words.
  search likewise breaks the query into a set of words, checks it against each indexed document, and appends the ids of matching documents to the result list.
 
import re
class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = dict()

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        results = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                results.append(id)
        return results
    
    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
             if query_word not in words:
                 return False
        return True
        # Equivalent alternatives:
        # return all(query_word in words for query_word in query_words)
        # return len(list(filter(lambda x: x not in words, query_words))) == 0

    @staticmethod
    def parse_text_to_words(text):
        # Use a regular expression to strip punctuation and newlines
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Split into a list of words
        word_list = text.split(' ')
        # Drop empty strings
        word_list = filter(None, word_list)
        # Return the words as a set
        return set(word_list)

search_engine = BOWEngine()
main(search_engine)

########## 輸出 ##########
# i have a dream
# found 3 result(s):
# 1.txt
# 2.txt
# 3.txt
# freedom children
# found 1 result(s):
# 5.txt

  Drawback: every search still has to traverse all documents.

Inverted Index

  With an inverted index, what we keep now is a dictionary from word -> list of document ids, so a query only needs to touch the posting lists of its own words.
import re
class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = dict()

    # Build the index: word -> list of document ids
    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id) #{'little':['1.txt','2.txt'],...}

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)
        
        # If any query word's posting list is empty, return immediately
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []
        
        result = []
        while True:
            
            # First, collect the document id at each posting list's current position
            current_ids = []
            
            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word] #['1.txt','2.txt']
                
                # We have reached the end of one posting list: the search is done
                if current_index >= len(current_inverted_list):
                    return result
                current_ids.append(current_inverted_list[current_index])
            
            # Then, if all elements of current_ids are equal, every query word appears in that document
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue
            
            # Otherwise, advance the pointer of the smallest id
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        # Use a regular expression to strip punctuation and newlines
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Split into a list of words
        word_list = text.split(' ')
        # Drop empty strings
        word_list = filter(None, word_list)
        # Return the words as a set
        return set(word_list)

search_engine = BOWInvertedIndexEngine()
main(search_engine)


########## 輸出 ##########


# little
# found 2 result(s):
# 1.txt
# 2.txt
# little vicious
# found 1 result(s):
# 2.txt
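The pointer-advancing merge above assumes each posting list is sorted, which holds here because documents are added in filename order. A shorter, though less memory-frugal, way to intersect posting lists is Python's built-in set intersection; the inverted_index literal below is a hypothetical stand-in for the one the engine builds:

```python
# Hypothetical posting lists, shaped like the engine's inverted_index
inverted_index = {
    'little': ['1.txt', '2.txt'],
    'vicious': ['2.txt'],
}

def search_by_intersection(query_words):
    # A missing word yields an empty posting list, hence an empty intersection
    postings = [set(inverted_index.get(w, ())) for w in query_words]
    return sorted(set.intersection(*postings)) if postings else []

print(search_by_intersection(['little', 'vicious']))  # ['2.txt']
print(search_by_intersection(['little']))             # ['1.txt', '2.txt']
```

The merge in the engine avoids building these intermediate sets, which matters when posting lists are long.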
 

LRUCache

  If more than 90% of searches are repeated queries, we can add a cache to improve performance, implemented here with the Least Recently Used (LRU) eviction policy.
import pylru
class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)
    
    def has(self, key):
        return key in self.cache
    
    def get(self, key):
        return self.cache[key]
    
    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)

    
    def search(self, query):
        if self.has(query):
            print('cache hit!')
            return self.get(query)
        
        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)
        
        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

########## 輸出 ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# little
# cache hit!
# found 2 result(s):
# 1.txt
# 2.txt
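Note that pylru is a third-party package (pip install pylru). If you would rather avoid the dependency, the same has/get/set interface can be sketched with the standard library's collections.OrderedDict; this is an assumed drop-in alternative, not the pylru implementation itself:

```python
from collections import OrderedDict

class StdlibLRUCache(object):
    def __init__(self, size=32):
        self.size = size
        self.cache = OrderedDict()

    def has(self, key):
        return key in self.cache

    def get(self, key):
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def set(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)  # evict the least recently used entry
```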

  Note that BOWInvertedIndexEngineWithCache inherits from two classes.

  In its constructor, super(BOWInvertedIndexEngineWithCache, self).__init__() is used directly to initialize the BOWInvertedIndexEngine parent class.
  The other parent in the multiple inheritance must be initialized explicitly with LRUCache.__init__(self).
  
  BOWInvertedIndexEngineWithCache overrides the search function; to call the parent class BOWInvertedIndexEngine's search inside it, use:
  result = super(BOWInvertedIndexEngineWithCache, self).search(query)
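The initialization order under multiple inheritance can be seen with a minimal sketch (the class names here are illustrative stand-ins, not the engine classes above):

```python
class Base:
    def __init__(self):
        print('Base.__init__')

class Engine(Base):
    def __init__(self):
        super().__init__()
        print('Engine.__init__')

class Cache:
    def __init__(self):
        print('Cache.__init__')

class EngineWithCache(Engine, Cache):
    def __init__(self):
        super().__init__()    # follows the MRO into the Engine branch
        Cache.__init__(self)  # the second base must be initialized explicitly

e = EngineWithCache()
# prints: Base.__init__, Engine.__init__, Cache.__init__
print([c.__name__ for c in EngineWithCache.__mro__])
# ['EngineWithCache', 'Engine', 'Base', 'Cache', 'object']
```

Because Base.__init__ does not call super().__init__(), the cooperative super chain stops before reaching Cache, which is why the explicit Cache.__init__(self) call is needed.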

 

References

  The Geek Time column "Python Core Technologies and Practice" (《Python核心技術與實戰》)

  https://time.geekbang.org/column/intro/176
