Implementing a Simple Web Crawler in Python

Date: 2019-12-13


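Python's strength lies in its many full-featured modules. Used well, they spare you a great deal of low-level detail and improve development efficiency.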

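A reasonably complete crawler takes only a few dozen lines of Python. Consider how complex the same thing would be in low-level C: just fetching a page means building the request packets yourself over raw sockets and then parsing the response packets, and parsing the page data itself is a further ordeal.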

The rest of this post walks through how to build a crawler in Python.

0X01 Main functional modules of a simple crawler

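URL manager: maintains the set of URLs waiting to be crawled and the set of URLs already crawled, preventing duplicate and circular crawling. It needs to: add a new URL to the pending set, check whether a URL to be added is already in the container, check whether any pending URLs remain, hand out a URL to crawl, and move a URL from pending to crawled. The URL store can be an in-memory set(), a relational database, or a cache database. For a small crawler, keeping the data in memory is usually sufficient.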
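The operations listed above can be sketched with two in-memory sets. This is only a minimal illustration; the class name UrlManager and its method names are hypothetical, not something the original post defines.

```python
class UrlManager:
    """In-memory URL manager backed by two sets (hypothetical names)."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Skip empty values and anything already queued or crawled.
        if not url:
            return
        if url in self.new_urls or url in self.old_urls:
            return
        self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Take one pending URL and move it to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because a URL is moved to old_urls as soon as it is handed out, adding it again later is a no-op, which is what stops duplicate and circular crawling.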

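Page downloader: fetches the HTML for a URL and saves it as a text file or an in-memory string. In Python 2 the standard-library urllib2 module does this, as does the third-party requests module. The code is analyzed in detail below.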

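Page parser: extracts new URLs and the data of interest from the downloaded HTML document. How do you pull the needed information out of HTML? One option is to analyze the structure of the information and fuzzy-match it with Python regular expressions, but this approach struggles with complex HTML. Alternatively, you can parse with Python's built-in html.parser, or do structured parsing with third-party modules such as Beautiful Soup or lxml. What is structured parsing? It treats the page as a tree structure, formally called the DOM (Document Object Model).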

The data of interest is then obtained by searching for nodes in that tree.
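As a rough illustration of the built-in option, here is a sketch that collects link targets with the standard-library html.parser, written for Python 3. Note that html.parser is event-driven and does not build a DOM tree the way Beautiful Soup or lxml do. The names LinkCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html_text):
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links
```

Calling extract_links with a page URL and its HTML returns the links in document order; anchors without an href are skipped.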

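Run flow: the scheduler asks the URL manager whether any URLs are waiting to be crawled. If so, it takes one, passes it to the downloader to get the HTML content, sends that content to the parser to obtain new URLs and the data of interest, and then puts the newly found URLs into the URL manager.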
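The loop described above might look roughly like the following sketch. To keep it runnable offline, the downloader and parser are passed in as functions, and a small in-memory dictionary stands in for a real site. The names crawl, FAKE_PAGES, fake_download, and fake_parse, as well as the max_pages limit, are assumptions made for illustration.

```python
import re


def crawl(root_url, download, parse, max_pages=10):
    """Scheduler loop: take a pending URL, download it, parse it, queue new URLs."""
    pending = {root_url}  # URLs waiting to be crawled
    seen = set()          # URLs already crawled
    collected = {}        # url -> data of interest
    while pending and len(seen) < max_pages:
        url = pending.pop()
        seen.add(url)
        html_text = download(url)
        if html_text is None:  # download failed, skip this URL
            continue
        new_urls, data = parse(url, html_text)
        collected[url] = data
        for new_url in new_urls:
            if new_url not in seen:
                pending.add(new_url)
    return collected


# Stand-in "site" so the sketch runs without network access.
FAKE_PAGES = {
    "http://example.com/": '<a href="http://example.com/a">a</a>'
                           '<a href="http://example.com/b">b</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a>',
    "http://example.com/b": "no links here",
}


def fake_download(url):
    return FAKE_PAGES.get(url)


def fake_parse(url, html_text):
    # Crude regex link extraction; the "data of interest" is just the page length.
    return re.findall(r'href="([^"]+)"', html_text), len(html_text)


result = crawl("http://example.com/", fake_download, fake_parse)
```

Page a links back to the home page, but since the home page is already in the crawled set it is not fetched again, so the loop terminates.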

0X02 Using the urllib2 module

There are several ways to use urllib2.

Method 1:

Fetch the HTML directly with urlopen.

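import urllib2  # Python 2 only

url = "http://www.baidu.com"

print 'The First method'
# Fetch the page directly with urlopen
response1 = urllib2.urlopen(url)
print response1.getcode()    # HTTP status code
print len(response1.read())  # length of the response body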

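Method 2:

This approach builds the HTTP request headers yourself so the crawler presents itself as a browser, which can get past some anti-crawling checks. Constructing the headers yourself is also more flexible.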

import urllib2  # Python 2 only

url = "http://www.baidu.com"

print 'The Second method'
# Build a Request object so that custom headers can be attached
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

Method 3:

Add cookie handling, which makes it possible to fetch pages that require login.

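import cookielib  # Python 2 only
import urllib2

url = "http://www.baidu.com"

print 'The Third method'
# The CookieJar stores cookies returned by the server
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Make this opener the default for later urlopen calls
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print len(response3.read())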

Of course, all three approaches require importing urllib2, and the third also requires importing cookielib.
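The snippets above are Python 2. In Python 3, urllib2 was merged into urllib.request and cookielib was renamed http.cookiejar, so a rough equivalent of the second and third methods could look like the sketch below. The helper names build_request, build_cookie_opener, and fetch are hypothetical, and the timeout value is an arbitrary assumption.

```python
import http.cookiejar
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    # Method 2: attach a browser-like User-Agent header to the request.
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def build_cookie_opener():
    # Method 3: an opener that stores and resends cookies via a CookieJar.
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    return opener, cookie_jar


def fetch(url, timeout=10):
    # Performs a real network request; returns (status, body bytes, cookies).
    opener, cookie_jar = build_cookie_opener()
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.status, response.read(), cookie_jar
```

Calling fetch("http://www.baidu.com") would then return the status code, the page bytes, and whatever cookies the server set. Here the opener is used directly instead of being installed globally with install_opener, which keeps the cookie handling local to the call.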

0X03 Using BeautifulSoup

Below is a brief look at how to use BeautifulSoup. It comes down to three steps: create a BeautifulSoup object, find nodes, and read the node contents.

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'Get all links'
# find_all returns every <a> tag in the document
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'Get lacie link'
# Match on an exact href value
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'match'
# Match href against a regular expression
link_node = soup.find('a', href=re.compile(r'ill'))
print link_node.name, link_node['href'], link_node.get_text()

print 'p'
# class is a Python keyword, so bs4 uses class_ for the CSS class
p_node = soup.find('p', class_="title")
print p_node.name, p_node.get_text()