I have put the code and the crawled data on GitHub; feel free to use it as a reference:
https://github.com/linyi0604/linyiSearcher
I built this on Manjaro Linux with Python 3. The crawler part involves installing ChromeDriver, which I covered in an earlier blog post of mine.
For building the index I referred to: https://baijiahao.baidu.com/s?id=1597426056496128414&wfr=spider&for=pc
For retrieval, document similarity is measured with cosine similarity; reference: http://www.javashuo.com/article/p-gnoocmvk-kk.html
I wrote this small project to complete the coursework for my Information Retrieval elective.
It is a simple search engine implemented in Python 3.
I named it linyiSearcher.
--------
The required Python dependencies are listed in requirements.txt.
You can install them all at once with pip install -r requirements.txt; the packages involved are summarized below.
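For reference, the third-party packages that the three scripts import are roughly the following (my own summary taken from the import statements; the exact pinned versions are in requirements.txt):

requests
lxml
selenium
pandas
jieba
bottle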
--------
The project is split into 3 parts (slightly more detailed notes follow further down):
1_spider.py is a crawler that collects the corpus for the search engine.
2_clean_data_and_make_index cleans the crawled data, stores it in a database, and builds the index.
A sqlite database is used here so the data can travel together with the project.
3_searcher.py is a simple web back end that implements:
1 receiving the search keywords typed into the web page
2 segmenting the keywords into words
3 looking up the documents related to those words in the index
4 ranking the documents by cosine similarity
5 displaying the most similar documents
--------
My knowledge and coding ability are both quite limited,
so if you are an expert reading this, please bear with me. Criticism and corrections are very welcome so we can all learn together.
--------
1 Crawler:
Since I had no data, I had to write a crawler, and since I only have my own laptop to run it on, the corpus could not be made very large.
The program crawls the titles of all first-level posts under the 娛樂明星 (entertainment stars) category of Baidu Tieba and uses them as the corpus.
The crawled data is stored in ./data/database.csv
with two columns: title and url. A quick way to inspect it with pandas is sketched right below.
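A minimal sanity check of the crawled corpus (this is just my own illustration; it assumes the crawler has already written ./data/database.csv):

import pandas as pd

# load the corpus written by 1_spider.py
data = pd.read_csv("./data/database.csv")
print(len(data))        # number of crawled post titles
print(data.columns)     # expect title and url (plus an unnamed index column, since to_csv was called without index=False)
print(data.head())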
2 Data cleaning and index building:
database.db is a sqlite database file.
First, every document is stored in the database.
The table is page_info(id, keyword, title, url):
id: auto-increment primary key
keyword: the document's text split into words with jieba, stored as one string with the words separated by spaces
title: the text content of the document
url: the link to the document's web page
Then every document is split into words with jieba, and all the words are put into a set (the set removes duplicates).
All the words are then stored in the database to build the index.
The index can be understood like this:
keyword: 你好    documents containing the keyword: <1,2,6,8,9>
The table is page_index(id, keyword, page_id):
id: auto-increment primary key
keyword: the keyword itself
page_id: the id of a document containing that keyword, i.e. page_info.id
A minimal in-memory sketch of this inverted index is given right after this description.
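To make the index concrete, here is a small sketch of the same idea using a plain dict instead of the page_index table (the document ids and titles are made up for illustration):

import jieba

# a few fake documents, keyed by their page_info.id
docs = {
    1: "你好 世界",
    2: "你好 搜索引擎",
    6: "余弦相似度介绍",
}

# inverted index: word -> set of page ids whose title contains that word
index = {}
for page_id, title in docs.items():
    for word in set(jieba.cut(title)):   # set() removes duplicate words within one title
        if not word.strip():             # skip whitespace tokens; the real notebook also drops punctuation
            continue
        index.setdefault(word, set()).add(page_id)

print(index.get("你好"))   # -> {1, 2}, analogous to "你好: <1,2,6,8,9>" above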
3 Retrieval:
The back end uses bottle, a very lightweight web framework, to implement a simple web server.
The front-end page uses Bootstrap's CSS styles, since my own front-end skills are close to nonexistent.
The retrieval process:
1 the back end receives the query, and jieba splits it into a list of words, keyword_list
2 the index table page_index is searched for the page_id of every page containing a word from keyword_list
3 for each document found, the cosine similarity with the keywords is computed, and the documents are sorted in descending order
4 the results are returned to the front end for display
A small reference sketch of the cosine similarity is given right below.
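As a reference for step 3, here is a minimal sketch of cosine similarity over binary (0/1) term vectors; note that 3_searcher.py below uses a simplified denominator, the raw product of the two lengths rather than its square root:

import math

def cosine_similarity(doc_words, query_words):
    """Cosine similarity between two word lists treated as binary term vectors."""
    doc_set, query_set = set(doc_words), set(query_words)
    if not doc_set or not query_set:
        return 0.0
    overlap = len(doc_set & query_set)   # dot product of the 0/1 vectors
    return overlap / math.sqrt(len(doc_set) * len(query_set))

# example: a document's keyword list vs. a segmented query
print(cosine_similarity(["你好", "世界", "搜索"], ["你好", "世界"]))   # ≈ 0.816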
Here is what the search results look like:
1_spider.py: the crawler code
import requests
from lxml import etree
import random
import COMMON
import os
from selenium import webdriver
import pandas as pd
"""
This is the first step of building the search engine: crawling the corpus.
"""


class Spider_BaiduTieba(object):

    def __init__(self):
        self.start_url = "/f/index/forumpark?pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1"
        self.base_url = "http://tieba.baidu.com"
        self.headers = COMMON.HEADERS
        self.driver = webdriver.Chrome()
        self.urlset = set()
        self.titleset = set()

    def get(self, url):
        header = random.choice(self.headers)
        response = requests.get(url=url, headers=header, timeout=10)
        return response.content

    def parse_url(self, url):
        """Fetch a url and return an lxml element for xpath queries."""
        print(url)
        header = random.choice(self.headers)
        response = requests.get(url=url, headers=header, timeout=10)
        # raise if the status code is not 200
        assert response.status_code == 200
        xhtml = etree.HTML(response.content)
        return xhtml

    def get_base_url_list(self):
        """Get the list of first-level forum urls."""
        if os.path.exists(COMMON.BASE_URL_LIST_FILE):
            li = self.read_base_url_list()
            return li
        next_page = [self.start_url]
        url_list = []
        while next_page:
            next_page = next_page[0]
            xhtml = self.parse_url(self.base_url + next_page)
            tmp_list = xhtml.xpath('//div[@id="ba_list"]/div/a/@href')
            url_list += tmp_list
            next_page = xhtml.xpath('//div[@class="pagination"]/a[@class="next"]/@href')
            print(next_page)
        self.save_base_url_list(url_list)
        return url_list

    def save_base_url_list(self, base_url_list):
        with open(COMMON.BASE_URL_LIST_FILE, "w") as f:
            for u in base_url_list:
                f.write(self.base_url + u + "\n")

    def read_base_url_list(self):
        with open(COMMON.BASE_URL_LIST_FILE, "r") as f:
            line = f.readlines()
        li = [s.strip() for s in line]
        return li

    def driver_get(self, url):
        # retry until the page loads within the script timeout
        try:
            self.driver.set_script_timeout(5)
            self.driver.get(url)
        except Exception:
            self.driver_get(url)

    def run(self):
        """Crawler entry point."""
        # crawl the root forum page urls
        base_url_list = self.get_base_url_list()
        data_list = []
        for url in base_url_list:
            self.driver_get(url)
            html = self.driver.page_source
            xhtml = etree.HTML(html)
            a_list = xhtml.xpath('//ul[@id="thread_list"]//a[@rel="noreferrer"]')
            for a in a_list:
                title = a.xpath(".//@title")
                url = a.xpath(".//@href")
                if not url or not title or title[0] == "点击隐藏本贴":
                    continue
                url = self.base_url + url[0]
                title = title[0]

                if url in self.urlset:
                    continue

                data_list.append([title, url])
                self.urlset.add(url)
        # save the crawled (title, url) pairs as the corpus
        data = pd.DataFrame(data_list, columns=["title", "url"])
        data.to_csv("./data/database.csv")


if __name__ == '__main__':
    s = Spider_BaiduTieba()
    s.run()
2 Data cleaning and index building code (this part was done in a notebook, so it looks a bit odd)
#%%
import pandas as pd
import sqlite3
import jieba
#%%
data = pd.read_csv("./data/database.csv")
#%%
def check_contain_chinese(check_str):
    # keep a token if it contains a Chinese character, a latin letter or a digit
    for ch in check_str:
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
        if "a" <= ch <= "z" or "A" <= ch <= "Z":
            return True
        if "0" <= ch <= "9":
            return True
    return False
#%%
# split every title into words with jieba and keep only the useful tokens
data2 = []
for d in data.itertuples():
    title = d[1]
    url = d[2]
    cut = jieba.cut(title)
    keyword = ""
    for c in cut:
        if check_contain_chinese(c):
            keyword += " " + c
    keyword = keyword.strip()
    data2.append([title, keyword, url])
#%%
data3 = pd.DataFrame(data2, columns=["title", "keyword", "url"])
data3
#%%
data3.to_csv("./data/cleaned_database.csv", index=False)
#%%
for line in data3.itertuples():
    title, keyword, url = line[1], line[2], line[3]
    print(title)
    print(keyword)
    print(url)
    break

#%%
conn = sqlite3.connect("./data/database.db")
c = conn.cursor()

# recreate the document table
sql = "drop table page_info;"
c.execute(sql)
conn.commit()

sql = """
create table page_info(
    id INTEGER PRIMARY KEY,
    keyword text not null,
    url text not null
);
"""
c.execute(sql)
conn.commit()


# create the index table
sql = """
create table page_index(
    id INTEGER PRIMARY KEY,
    keyword text not null,
    page_id INTEGER not null
);
"""
c.execute(sql)
conn.commit()
#%%
sql = "delete from page_info;"
c.execute(sql)
conn.commit()


# insert every document into the database
i = 0
for line in data3.itertuples():
    title, keyword, url = line[1], line[2], line[3]
    sql = """
    insert into page_info (url, keyword)
    values('%s', '%s')
    """ % (url, keyword)
    c.execute(sql)
    conn.commit()
    i += 1
    if i % 50 == 0:
        print(i, len(data3))


sql = "delete from page_index;"
c.execute(sql)
conn.commit()

sql = "select * from page_info;"
res = c.execute(sql)
res = list(res)
length = len(res)

# build the inverted index: one row per (word, page_id) pair
i = 0
for line in res:
    pid, words, url = line[0], line[1], line[2]
    words = words.split(" ")
    for w in words:
        sql = """
        insert into page_index (keyword, page_id)
        values('%s', '%s')
        """ % (w, pid)
        c.execute(sql)
        conn.commit()
        i += 1
        if i % 100 == 0:
            print(i, length)
#%%
# exploratory cell: an empty word-vector table that is not used later
titles = list(words)
colums = ["title", "url"] + titles
word_vector = pd.DataFrame(columns=colums)
word_vector
#%%
data = pd.read_csv("./data/database.csv")
#%%
data
#%%
# page_info was created without a title column, so add it here
sql = "alter table page_info add title text;"
conn = sqlite3.connect("./data/database.db")
c = conn.cursor()
c.execute(sql)
conn.commit()
#%%
conn = sqlite3.connect("./data/database.db")
c = conn.cursor()
length = len(data)
i = 0
for line in data.itertuples():
    pid = line[0] + 1
    title = line[1]
    # titles may contain quotes that break the %-formatted SQL, so failures are skipped
    sql = "UPDATE page_info SET title = '%s' WHERE id = %s " % (title, pid)
    try:
        c.execute(sql)
        conn.commit()
    except:
        continue
    i += 1
    if i % 50 == 0:
        print(i, length)
3 Web back end: the code that implements retrieval
# coding=utf-8
import jieba
import sqlite3
from bottle import route, run, template, request, static_file, redirect


@route('/static/<filename>')
def server_static(filename):
    if filename == "jquery.min.js":
        return static_file("jquery.min.js", root='./data/front/js/')
    elif filename == "bootstrap.min.js":
        return static_file("bootstrap.js", root='./data/front/js/')
    elif filename == "bootstrap.min.css":
        return static_file("bootstrap.css", root='./data/front/css/')


@route('/')
def index():
    return redirect("/hello/")


@route('/hello/')
def hello():
    form = request.GET.decode("utf-8")
    keyword = form.get("keyword", "")
    cut = list(jieba.cut(keyword))
    # look up the ids of pages containing the keywords via the index
    page_id_list = get_page_id_list_from_key_word_cut(cut)
    # fetch the full page records for those ids
    page_list = get_page_list_from_page_id_list(page_id_list)
    # rank the pages by the similarity between the query words and the page's words
    page_list = sort_page_list(page_list, cut)
    context = {
        "page_list": page_list[:20],
        "keyword": keyword
    }
    return template("./data/front/searcher.html", context)


# compute the similarity between each page in page_list and the query words cut
def sort_page_list(page_list, cut):
    con_list = []
    for page in page_list:
        url = page[2]
        words = page[1]
        title = page[3]
        vector = words.split(" ")
        same = 0
        for i in vector:
            if i in cut:
                same += 1
        # simplified score: shared-word count over len(vector)*len(cut)
        # (a true cosine similarity would divide by the square root of that product)
        cos = same / (len(vector) * len(cut))
        con_list.append([cos, url, words, title])
    con_list = sorted(con_list, key=lambda i: i[0], reverse=True)
    return con_list


# fetch full page records for a list of page ids
def get_page_list_from_page_id_list(page_id_list):
    id_list = "("
    for k in page_id_list:
        id_list += "%s," % k
    id_list = id_list.strip(",") + ")"
    conn = sqlite3.connect("./data/database.db")
    c = conn.cursor()
    sql = "select * " \
          + "from page_info " \
          + "where id in " + id_list + ";"
    res = c.execute(sql)
    res = [r for r in res]
    return res


# look up page ids in the index table for the query keywords
def get_page_id_list_from_key_word_cut(cut):
    keyword = "("
    for k in cut:
        if k == " ":
            continue
        keyword += "'%s'," % k
    keyword = keyword.strip(",") + ")"
    conn = sqlite3.connect("./data/database.db")
    c = conn.cursor()
    sql = "select page_id " \
          + "from page_index " \
          + "where keyword in " + keyword + ";"
    res = c.execute(sql)
    res = [r[0] for r in res]
    return res


if __name__ == '__main__':
    run(host='localhost', port=8080)