Python Web Crawler Development, Part 1

Date: 2019-11-10

1. How beautifulsoup4 and Scrapy differ in parsing and downloading web pages

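BeautifulSoup (bs) can parse HTML files offline, but fetching the HTML is left to other code that the user writes, for example with urllib or requests.

Scrapy, by contrast, is a complete crawling framework: give it the URLs and it crawls them automatically, which spares the user many details.

It is the difference between a wheel and a car. The former has to be attached to a program; the latter runs by itself.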
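
As a minimal sketch of the "wheel" side of this comparison: BeautifulSoup only parses markup that it is handed, so the HTML below is an inline string rather than a downloaded page. The sample markup and the helper name extract_links are illustrative assumptions, and html.parser is used only so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a page that was fetched by other code.
SAMPLE_HTML = """
<table>
  <tr><td>Synopsis</td></tr>
  <tr><td><a href="1.html">Chapter 1</a></td></tr>
  <tr><td><a href="2.html">Chapter 2</a></td></tr>
</table>
"""


def extract_links(html):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE_HTML):
        print(text, href)
```

Fetching the page (with urllib, requests, or anything else) stays a separate step, which is the part Scrapy would take over.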

beautifulsoup4 performs worse than lxml used directly.

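2. Storing web pages in MongoDB, a non-relational database

MongoDB installation note: leave Compass unchecked. It is the GUI client, and the installer has to download it, which is very slow.

Download and install MongoDB Compass separately.

----------MongoDB management commands----------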

net start MongoDB

net stop MongoDB

Restart (net has no restart subcommand, so stop and then start):

net stop MongoDB && net start MongoDB

Install the service

mongod --logpath "F:\mangodbDATA\log\mongodb.log" --logappend --dbpath "F:\mangodbDATA\database" --directoryperdb --install

Uninstall the service (stop the service first)

mongod --logpath "F:\mangodbDATA\log\mongodb.log" --logappend --dbpath "F:\mangodbDATA\database" --directoryperdb --remove

Reinstall the service

mongod --logpath "F:\mangodbDATA\log\mongodb.log" --logappend --dbpath "F:\mangodbDATA\database" --directoryperdb --reinstall
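
Once the service is running, the chapter records written in section 4 need a collection to go into. The following is a rough sketch of how such a collection might be opened and filled with pymongo. The connection URI, the database name novels, and the helper names open_chapters and save_chapter are assumptions for illustration and do not come from the original script.

```python
def open_chapters(uri="mongodb://localhost:27017",
                  db_name="novels", coll_name="chapters"):
    """Return a collection handle (needs pymongo and a running MongoDB service)."""
    # Imported here so the rest of the sketch can be tried without pymongo.
    from pymongo import MongoClient
    return MongoClient(uri)[db_name][coll_name]


def save_chapter(collection, chapter_name, chapter_url):
    """Insert one chapter record with the same two fields used in section 4."""
    data = {"chapter_name": chapter_name, "chapter_url": chapter_url}
    collection.insert_one(data)
    return data
```

With a live server, usage would look like chapters = open_chapters() followed by save_chapter(chapters, name, url). Any object with an insert_one method can stand in for the collection while experimenting.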

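3. GUI application development with PyQt5

4. Crawler development notes

Scraping the plain text and writing it to a TXT file: the site's anti-scraping measures kick in, so only a few dozen chapters at most can be fetched.

Downloading the HTML files directly:

A 5-second wait between requests gets around the anti-scraping block, but downloading pages in a single thread is slow: 60 chapters take 6 minutes.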
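
The 6-minute figure is roughly what the delay alone predicts: 60 chapters at 5 seconds each is 300 seconds of waiting, which leaves about one second per page for the download itself. A small sketch of the throttled loop and of that estimate follows. The names download_all and estimated_seconds are hypothetical, and seconds_per_page=1.0 is an assumption inferred from the 6-minute figure, not a measurement.

```python
import time


def download_all(urls, fetch_page, delay=5.0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay` seconds after every request."""
    pages = []
    for url in urls:
        pages.append(fetch_page(url))
        sleep(delay)  # the wait that keeps the crawler under the site's limit
    return pages


def estimated_seconds(n_pages, delay=5.0, seconds_per_page=1.0):
    """Rough total time for a single-threaded run."""
    return n_pages * (delay + seconds_per_page)


if __name__ == "__main__":
    print(estimated_seconds(60) / 60, "minutes")
```

Passing fetch_page and sleep as parameters keeps the loop easy to try out without touching the network or actually waiting.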

from urllib import request

from bs4 import BeautifulSoup

# url, headers and chapters (a MongoDB collection) are assumed to be
# defined earlier in the script.
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
# Decode the page as GBK and drop any bytes that cannot be decoded.
strhtml = resp.read().decode('gbk', 'ignore')
html_soup = BeautifulSoup(strhtml, 'lxml')
# index = BeautifulSoup(str(html_soup.find_all('div', class_='dir')), 'lxml')
# print(html_soup.find_all(['td', 'span']))
body_flag = 0  # 1 while chapter links are being collected

for element in html_soup.find_all(['td', 'span']):
   if element.has_attr('id'):
      signId = element['id']
      # Start collecting at id 'jianjie'; stop at id 'xs555' or 'd999'.
      if signId == 'jianjie': body_flag = 1
      if signId in ('xs555', 'd999'): body_flag = 0

   if body_flag == 1 and element.name == 'td':
      # Only table cells that contain a link are chapter entries.
      if element.a is not None:
         chapter_name = element.string
         chapter_url = "https://www.555zw.com/book/40/40943/" + element.a.get('href')
         data = {
            'chapter_name': chapter_name,
            'chapter_url': chapter_url
         }
         chapters.insert_one(data)
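
The flag logic in the loop above can be tried without a network connection or a parser. The sketch below replays it on plain tuples that stand in for parsed tags. The function name between_markers and the tuple layout are illustrative assumptions; only the three id values come from the code above.

```python
def between_markers(elements, start_id="jianjie", stop_ids=("xs555", "d999")):
    """Collect link text from 'td' entries between the start and stop ids.

    `elements` is a list of (tag_name, element_id, link_text) tuples.
    element_id is None when the tag has no id attribute, and link_text is
    None when the tag contains no link.
    """
    collecting = False
    found = []
    for tag_name, element_id, link_text in elements:
        if element_id == start_id:
            collecting = True
        elif element_id in stop_ids:
            collecting = False
        if collecting and tag_name == "td" and link_text is not None:
            found.append(link_text)
    return found


if __name__ == "__main__":
    sample = [
        ("td", None, "Site navigation"),
        ("span", "jianjie", None),
        ("td", None, "Chapter 1"),
        ("td", None, "Chapter 2"),
        ("span", "xs555", None),
        ("td", None, "Footer link"),
    ]
    print(between_markers(sample))
```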

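import time

# filename and item (one chapter record read back from the chapters
# collection) are assumed to be set by the surrounding loop.
with open(filename, "a", encoding="utf8") as f:
   responses = request.urlopen(item["chapter_url"])
   time.sleep(5)  # wait 5 seconds between requests
   # Decode the GBK page to str; the file handle writes it out as UTF-8.
   # (Writing encoded bytes to a text-mode file fails on Python 3.)
   contents = responses.read().decode('gbk', 'ignore')
   f.write(contents)
   #origin_soup = BeautifulSoup(contents, 'lxml')
   #content = origin_soup.find(id='content')
   #move = dict.fromkeys((ord(c) for c in u"\xa0\r\t"))
   #txt = content.text.translate(move)
   #txt = content.text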
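
The commented-out lines above point at the plain-text route: pull the text out of the page, then delete non-breaking spaces, carriage returns and tabs with str.translate. A minimal sketch of just that cleanup step follows. The function name clean_text is hypothetical; the character set is the one in the commented-out code.

```python
# Map each unwanted character to None so that str.translate deletes it.
REMOVE = dict.fromkeys(ord(c) for c in "\xa0\r\t")


def clean_text(text):
    """Strip non-breaking spaces, carriage returns and tabs from text."""
    return text.translate(REMOVE)


if __name__ == "__main__":
    print(repr(clean_text("\xa0\xa0First line\r\n\tSecond line")))
```

Newlines are left alone, so paragraph breaks in the chapter text survive the cleanup.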