Implementing a Web Crawler in Python 3

Date: 2019-12-10


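1 What Is a Web Crawler

A web crawler (also called a web spider, web robot, web chaser, automatic indexer, or simulator) is a program or script that automatically fetches information from the Internet according to a set of rules, collecting the data that is valuable to us. Tips: a crawler extracts web pages automatically. It downloads pages from the World Wide Web on behalf of a search engine and is an important component of one.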

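Designing a crawler involves three main concerns:

(1) describing or defining the crawl target;

(2) analyzing and filtering pages or data;

(3) choosing a search strategy for URLs.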
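The third concern, the URL search strategy, decides the order in which discovered links are visited. As a minimal sketch, assuming a hypothetical in-memory link graph (LINKS) in place of real pages, the following compares a breadth-first order with a depth-first-style order. The names LINKS and crawl_order are illustrative only.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}


def crawl_order(start, breadth_first=True):
    """Return the order in which pages are visited, starting from `start`."""
    frontier = deque([start])
    seen = {start}  # pages already queued, so nothing is visited twice
    order = []
    while frontier:
        # A queue (popleft) gives breadth-first order;
        # a stack (pop) gives a depth-first-style order.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl_order("A", breadth_first=True))   # ['A', 'B', 'C', 'D', 'E']
print(crawl_order("A", breadth_first=False))  # ['A', 'C', 'E', 'D', 'B']
```

Breadth-first order tends to reach pages close to the start page first, which is why it is a common default for general crawls.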

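2 Python Crawler Architecture

A Python crawler is made up of five parts: the scheduler, the URL manager, the page downloader, the page parser, and the application (the valuable data that was crawled).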

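Scheduler: plays the role of a computer's CPU. It coordinates the URL manager, the downloader, and the parser.
URL manager: keeps track of the URLs waiting to be crawled and the URLs already crawled, which prevents fetching the same URL twice or crawling in a loop. It is usually implemented in one of three ways: in memory, in a database, or in a cache database.
Page downloader: takes a URL, downloads the page, and returns it as a string. The options are urllib.request (part of the Python 3 standard library, called urllib2 in Python 2; it can handle pages that need a login, a proxy, or cookies) and requests (a third-party package).
Page parser: parses the page string and extracts the information we need, either by pattern matching or by walking a DOM tree. The options are: regular expressions (straightforward: the page is treated as a plain string and the data is pulled out by fuzzy matching, which becomes very difficult once the document gets complex); html.parser (ships with Python); Beautiful Soup (third-party; it can parse with Python's built-in html.parser or with lxml, and is more capable than the other options); and lxml (third-party; parses both XML and HTML). Beautiful Soup and lxml build a DOM-style tree of the document, while html.parser used on its own is event-driven and calls handler methods as it encounters tags.
Application: an application built from the useful data extracted from the pages.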
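To make the parser options above concrete, here is a minimal sketch that pulls the links out of a small, made-up HTML snippet twice: once with a regular expression and once with the standard library's html.parser. The snippet, the HREF_RE pattern, and the LinkCollector class are illustrative assumptions, not part of the original article.

```python
import re
from html.parser import HTMLParser

# Made-up HTML used only for this illustration.
HTML = (
    '<p><a href="http://a.example/">A</a> '
    '<a class="x" href="http://b.example/">B</a></p>'
)

# Approach 1: regular expression. The page is treated as a flat string.
HREF_RE = re.compile(r'href="([^"]+)"')
regex_links = HREF_RE.findall(HTML)


# Approach 2: html.parser. The parser calls handle_starttag for every
# opening tag, and we collect the href attribute of each <a> tag.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(HTML)

print(regex_links)
print(collector.links)
```

On this tidy snippet both approaches find the same two links. The regular expression would also match an href inside a comment or on a non-link tag, which is the kind of fragility the text describes for complex documents.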

[Figure: how the scheduler coordinates the URL manager, the downloader, and the parser]
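The coordination between the components can also be sketched in code. The following is a minimal sketch, not the article's implementation: UrlManager, run_crawler, and the FAKE_SITE dictionary are hypothetical names, and the downloader and parser are passed in as plain functions so that the loop runs without touching the network.

```python
class UrlManager:
    """In-memory URL manager: one set of pending URLs, one set of visited URLs."""

    def __init__(self):
        self.pending = set()
        self.visited = set()

    def add(self, url):
        # Ignore URLs that are already queued or already crawled.
        if url not in self.pending and url not in self.visited:
            self.pending.add(url)

    def has_pending(self):
        return bool(self.pending)

    def next_url(self):
        url = self.pending.pop()
        self.visited.add(url)
        return url


def run_crawler(start_url, download, parse, max_pages=10):
    """Scheduler loop: ask the manager for a URL, download it, parse it,
    feed new URLs back to the manager, and keep the extracted data."""
    manager = UrlManager()
    manager.add(start_url)
    results = []
    while manager.has_pending() and len(results) < max_pages:
        url = manager.next_url()
        page = download(url)
        if page is None:  # download failed, skip this URL
            continue
        new_urls, data = parse(url, page)
        for new_url in new_urls:
            manager.add(new_url)
        results.append(data)
    return results


# Hypothetical three-page site: path -> (title, outgoing links).
FAKE_SITE = {
    "/a": ("title A", ["/b", "/c"]),
    "/b": ("title B", ["/a"]),
    "/c": ("title C", []),
}

results = run_crawler(
    "/a",
    download=FAKE_SITE.get,                     # stands in for the downloader
    parse=lambda url, page: (page[1], page[0]),  # stands in for the parser
)
print(sorted(results))  # ['title A', 'title B', 'title C']
```

The link from /b back to /a does not cause a loop, because the URL manager refuses URLs it has already handed out.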

3 Three Ways to Download a Web Page with urllib.request

Method 1: call urllib.request.urlopen(url), the most basic way to issue a request (it opens the URL).

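The function prototype is: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None). Note that cafile, capath, and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.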
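Of the parameters in the prototype, timeout is the one most crawlers should set, since a request without one can block for a long time. The following is a minimal sketch; fetch is a hypothetical helper name, and the 10-second default is an arbitrary assumption.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        print(f"request failed: {exc}")
        return None


# urlopen also understands data: URLs, which lets the helper be tried offline.
print(fetch("data:text/plain,hello"))  # b'hello'
```

For a real page the call would look like fetch("http://www.baidu.com", timeout=5).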

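Method 2: build a request with request = urllib.request.Request(url), then pass it to urllib.request.urlopen(request).

request = urllib.request.Request(url) creates a request object that carries the target url and, optionally, data, headers, and method.

urllib.request.urlopen(request) takes that request object as its argument and opens the URL. The response is what urlopen returns, not what Request returns.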

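Tips: to build a complete request that carries extra information such as headers (request headers), use the more capable Request class. Request exists so that this information can be attached to the request; calling urlopen with a bare URL string gives you no way to pass it.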
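As a minimal sketch of what a Request object carries, the following builds one with headers and form data and inspects it without sending anything. The helper name build_request, the example.com URL, and the form field are assumptions made for illustration.

```python
import urllib.parse
import urllib.request


def build_request(url, params=None, user_agent="Mozilla/5.0"):
    """Build a Request with a User-Agent header and optional form data."""
    data = None
    if params:
        # Form data must be URL-encoded and converted to bytes.
        data = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(
        url, data=data, headers={"User-Agent": user_agent}
    )


get_request = build_request("http://example.com/")
post_request = build_request("http://example.com/search", {"q": "python"})

# Without data the method defaults to GET; with data it defaults to POST.
print(get_request.get_method())               # GET
print(post_request.get_method())              # POST
print(post_request.data)                      # b'q=python'
# Request stores header names capitalized as "User-agent".
print(post_request.get_header("User-agent"))  # Mozilla/5.0
```

Passing either object to urllib.request.urlopen would then send it.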

Method 3: give urllib.request the ability to handle cookies, then call urllib.request.urlopen(url).

Tips: in Python 2, use urllib2 in place of urllib.request, cookielib in place of http.cookiejar, and the print statement in place of print().

#!/usr/bin/env python3

import http.cookiejar
import urllib.request

url = "http://www.baidu.com"

print("Method 1")
response1 = urllib.request.urlopen(url)
# Status code; 200 means success
print(response1.getcode())
# Length of the page content in bytes
print(len(response1.read()))

print("Method 2")
request = urllib.request.Request(url)
# Send a Mozilla browser User-Agent so the request looks like a browser's
request.add_header("User-Agent", "Mozilla/5.0")
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

print("Method 3")
cookie = http.cookiejar.CookieJar()
# Give urllib.request the ability to handle cookies
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
# Install the opener globally so that urlopen uses it from now on
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(len(response3.read()))
# Cookies the server set during the request
print(cookie)

[Figure: output of running the script]
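install_opener changes the behaviour of every later urlopen call in the process. As an alternative sketch, the opener can be kept local and used through its own open method; make_cookie_opener and describe_cookies are hypothetical helper names, and no request is sent here, so the cookie jar stays empty.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return an opener that stores cookies in its own CookieJar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def describe_cookies(jar):
    """List (name, value, domain) for every cookie currently in the jar."""
    return [(c.name, c.value, c.domain) for c in jar]


opener, jar = make_cookie_opener()
# No request has been made yet, so there is nothing to list.
print(describe_cookies(jar))  # []
```

After a call such as opener.open("http://www.baidu.com"), describe_cookies(jar) would list whatever cookies the server set, and other code that calls urllib.request.urlopen directly is unaffected.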

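4 Parsing HTML with the Third-Party Library Beautiful Soup

4.1 Installing Beautiful Soup

Beautiful Soup: a third-party Python library for extracting data from XML and HTML. Official site: https://www.crummy.com/software/BeautifulSoup/.

Open cmd (the Command Prompt), change to the Scripts folder inside the Python 3 installation directory, and run dir to check whether pip.exe is there. If it is, you can install Beautiful Soup with the pip command that ships with Python: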

pip install beautifulsoup4

[Figure: output of the pip install command]

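Testing the installation

Write a Python file test.py containing:

import bs4

print(bs4)

Run the file. If it prints the module without raising an ImportError, the installation succeeded.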
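The same check can be made without importing the package. This is a minimal sketch using the standard library's importlib; is_installed is a hypothetical helper name, and lxml is included only because it was mentioned earlier as an optional parser.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


for name in ("bs4", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

What it prints depends on the environment it runs in.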

4.2 Parsing an HTML Document with Beautiful Soup

#!/usr/bin/env python3

import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://news.baidu.com" name="tj_trnews" class="mnav">News</a>
<a href="https://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a>
<a href="http://map.baidu.com" name="tj_trmap" class="mnav">Map</a>
<a href="http://v.baidu.com" name="tj_trvideo" class="mnav">Video</a>
<a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">Tieba</a>
<a href="http://xueshu.baidu.com" name="tj_trxueshu" class="mnav">Scholar</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create a BeautifulSoup object that parses html_doc with html.parser
soup = BeautifulSoup(html_doc, "html.parser")
# Get all the links
links = soup.find_all('a')
print("All links")
for link in links:
    print(link.name, link['href'], link.get_text())

print("Get a specific URL")
link_node = soup.find('a', href="http://news.baidu.com")
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("Regular expression match")
# Find the first <a> whose href contains "hao"
link_node = soup.find('a', href=re.compile(r"hao"))
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("Get the text of a <p> paragraph")
# find returns only the first <p> whose class is "story"
p_node = soup.find('p', class_='story')
print(p_node.name, p_node['class'], p_node.get_text())

[Figure: output of running the script]
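In a crawler, the parsed links usually need to be collected into a data structure instead of being printed. The following is a minimal sketch that assumes beautifulsoup4 is installed; extract_links and the sample markup are hypothetical and only illustrate filtering by class with find_all.

```python
from bs4 import BeautifulSoup


def extract_links(html, css_class):
    """Return text and href for every <a> with the given class and an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", class_=css_class)
        if a.has_attr("href")  # skip anchors that have no href
    ]


# Made-up markup: one matching link, one with another class, one without href.
sample = (
    '<p><a class="mnav" href="http://news.baidu.com">News</a>'
    '<a class="other" href="http://example.com">Other</a>'
    '<a class="mnav">no link</a></p>'
)

print(extract_links(sample, "mnav"))
# [{'text': 'News', 'href': 'http://news.baidu.com'}]
```

The has_attr check matters because link['href'] raises a KeyError when an anchor has no href attribute.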