Python Web Crawling and Information Extraction

Date: 2021-07-24

1. Getting Started with the Requests Library

Installing Requests

Open a command prompt as administrator and run:

pip install requests

To test the installation, open IDLE:

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> r.encoding = 'utf-8'  # override the default encoding
>>> r.text                # show the page content

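The HTTP Protocol

HTTP stands for Hypertext Transfer Protocol.

HTTP is a stateless application-layer protocol built on a request/response model.

HTTP uses URLs to identify and locate network resources.

URL format

http://host[:port][path]

host: a valid Internet host name or IP address

port: the port number; defaults to 80 when omitted

path: the path of the requested resource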
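As a quick illustration of the URL format above, the following sketch splits a URL into host, port, and path with the standard urllib.parse module. That module is not mentioned in the original notes, and the URL is a made-up example.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration
parts = urlsplit("http://www.example.com:8080/path/index.html")

host = parts.hostname   # the host component
port = parts.port       # the explicit port, or None when the URL omits it
path = parts.path       # the path of the requested resource

# When the URL gives no port, HTTP falls back to port 80
default_port = urlsplit("http://www.example.com/").port or 80

print(host, port, path, default_port)
```

The fallback to 80 is written out explicitly because urlsplit only reports ports that actually appear in the URL.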

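HTTP Operations

Method	Description
GET	Request the resource at the URL
HEAD	Request only the response headers for the resource at the URL
POST	Append new data to the resource at the URL
PUT	Store a resource at the URL, overwriting whatever is already there
PATCH	Partially update the resource at the URL, changing only part of its content
DELETE	Delete the resource stored at the URL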

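Main Methods of Requests

Method	Description
requests.request()	Builds a request; the base method underlying all the methods below
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches the headers of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-update request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE

The core method is request(); the others are convenience wrappers built on top of it.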

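The request() Method

requests.request(method, url, **kwargs)
# method: the request method, one of seven such as get/put/post
# url: the URL of the page to fetch
# **kwargs: 13 parameters that control the request

All of the **kwargs parameters are optional.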

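The get() Method

r = requests.get(url)
Full signature:
requests.get(url, params=None, **kwargs)
	url: the URL of the page to fetch
	params: extra parameters appended to the URL, as a dict or byte stream; optional
	**kwargs: 12 parameters that control the request; optional

The get() method:

builds a Request object that asks the server for a resource, and

returns a Response object containing the server's resource.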
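To make the params argument more concrete, the sketch below builds a query string offline with the standard urllib.parse.urlencode function. It only illustrates the shape of the resulting URL; Requests performs a comparable encoding step internally, and the endpoint and keys here are placeholders.

```python
from urllib.parse import urlencode

base_url = "http://httpbin.org/get"  # placeholder endpoint
params = {"key1": "value1", "key2": "value2"}

# Encode the dict as a query string and append it to the URL
full_url = base_url + "?" + urlencode(params)
print(full_url)
```

With Requests itself, the final URL of a request can be inspected through r.request.url.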

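The Response Object

Attribute	Description
r.status_code	HTTP status of the request; 200 means success, 404 means failure (resource not found)
r.text	The response body as a string, i.e. the content of the page at the URL
r.encoding	The response encoding guessed from the HTTP headers
r.apparent_encoding	The response encoding inferred from the content itself (a fallback encoding)
r.content	The response body as raw bytes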
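The relationship between r.content, r.encoding, and r.text comes down to decoding bytes with a codec. The sketch below shows, with plain Python and no network access, why a wrongly guessed encoding garbles a page. The byte string is a hypothetical stand-in for r.content.

```python
# Hypothetical response body: the bytes a server might send for a GB2312 page
content = "中文网页".encode("gb2312")

# Decoding with a codec that does not match the content garbles the text ...
wrong = content.decode("iso-8859-1")
# ... while the codec that matches the content recovers it
right = content.decode("gb2312")

print(repr(wrong))
print(right)
```

This is the reason the framework code in these notes assigns r.apparent_encoding to r.encoding before reading r.text.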

The head() Method

r = requests.head('http://httpbin.org/get')
r.headers

Fetches summary information about a network resource.

The post() Method

Submits new data to the server.

payload = {'key1':'value1','key2':'value2'}  # create a dict
# POSTing a dict to the URL encodes it automatically as a form
r = requests.post('http://httpbin.org/post',data = payload)
# POSTing a string to the URL encodes it automatically as data
r = requests.post('http://httpbin.org/post',data = 'ABC')
print(r.text)

The put() Method

Works like post(), except that it overwrites the existing content.

The patch() Method

The delete() Method

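Exceptions in Requests

Exception	Description
requests.ConnectionError	Network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError	HTTP error
requests.URLRequired	Missing URL
requests.TooManyRedirects	Maximum number of redirects exceeded
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request to the URL timed out

Method	Description
r.raise_for_status()	Raises requests.HTTPError if the status code indicates an error (4xx or 5xx)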

A General Code Framework for Fetching Pages

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for a 4xx/5xx status
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An error occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))

Examples

Submitting a search keyword to Baidu

import requests

# Submit a keyword to the search engine
url = "http://www.baidu.com/s"
try:
    kv = {'wd': 'python'}
    r = requests.get(url, params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("An error occurred")

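Fetching and saving an image

import requests
import os

url = "http://image.ngchina.com.cn/2019/0423/20190423024928618.jpg"
root = "D://2345//Temp//"
path = root + url.split('/')[-1]  # name the local file after the last URL segment
try:
    if not os.path.exists(root):
        os.makedirs(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)  # r.content holds the raw bytes
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Download failed")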

2. Information Extraction: Getting Started with Beautiful Soup

Installing Beautiful Soup

pip install beautifulsoup4

Test:

import requests
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
from bs4 import BeautifulSoup  # import the BeautifulSoup class from bs4
soup = BeautifulSoup(demo, "html.parser")

Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

Basic Elements of Beautiful Soup

Importing Beautiful Soup

The Beautiful Soup library is also known as beautifulsoup4 or bs4.

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo,"html.parser")

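Basic elements of the BeautifulSoup class

Element	Description
Tag	A tag, the basic unit of information; opened with <> and closed with </>
Name	The tag's name; <p>...</p> has the name 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dict. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment	A comment inside a tag's string; a special Comment type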

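Traversing HTML with bs4

Downward traversal

Attribute	Description
.contents	A list of child nodes; all children of the tag are stored in the list
.children	An iterator over child nodes; like .contents, used to loop over children
.descendants	An iterator over all descendant nodes, used for looping

# Iterate over child nodes
for child in soup.body.children:
    print(child)
# Iterate over descendant nodes
for child in soup.body.descendants:
    print(child)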

Upward traversal

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, used to loop over ancestors

soup = BeautifulSoup(demo,"html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# Output
#p
#body
#html
#[document]

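Sideways traversal

Sideways traversal takes place among nodes that share the same parent.

The next item returned may be a string rather than the next tag.

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	An iterator over all following sibling nodes in HTML document order
.previous_siblings	An iterator over all preceding sibling nodes in HTML document order

# Iterate over following siblings
for sibling in soup.a.next_siblings:
    print(sibling)
# Iterate over preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)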

Formatting and Encoding HTML with bs4

Formatting method: .prettify()

soup = BeautifulSoup(demo,"html.parser")
print(soup.a.prettify())

Encoding: UTF-8 by default

soup = BeautifulSoup("<p>中文</p>","html.parser")
soup.p.string
#'中文'
print(soup.p.prettify())
#<p>
#  中文
#</p>

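3. Organizing and Extracting Information

Three Forms of Information Markup

Marked-up information forms an organizational structure, which adds a dimension to the information;

marked-up information can be used for communication, storage, and display;

the markup structure is as valuable as the information itself;

marked-up information is easier for programs to understand and use.

XML: eXtensible Markup Language

The earliest general-purpose markup language; highly extensible but verbose.

Used for exchanging and transmitting information on the Internet.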

<name>...</name>
<name/>
<!--  -->

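JSON: JavaScript Object Notation

Information is typed, which suits processing by programs (such as JavaScript); more concise than XML.

Used for communication between mobile apps, the cloud, and nodes; has no comments.

# Typed key-value pairs are the markup form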
"key":"value"
"key":["value1","value2"]
"key":{"subkey":"subvalue"}
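To show what "typed" means in practice, the sketch below parses a small JSON document with Python's standard json module. The document is a made-up example that reuses the three value shapes listed above and adds a number.

```python
import json

# Hypothetical document: a string, a list, a nested object, and a number
text = """
{
    "key": "value",
    "list": ["value1", "value2"],
    "nested": {"subkey": "subvalue"},
    "count": 3
}
"""

data = json.loads(text)

# Each JSON value arrives as the matching Python type
for name, value in data.items():
    print(name, type(value).__name__, value)
```

Strings, arrays, objects, and numbers come back as str, list, dict, and int without any further conversion.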

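YAML: YAML Ain't Markup Language

Information is untyped, the proportion of plain text is the highest, and readability is good.

Used for configuration files in all kinds of systems; comments make it easy to read.

# Untyped key-value pairs are the markup form
key : "value"
key : #comment
- value1
- value2
key :
  subkey : subvalue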

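General Methods of Information Extraction

Method 1: fully parse the markup, then extract the key information.

Applies to XML, JSON, and YAML.

Requires a markup parser, such as the tag-tree traversal in bs4.

Advantage: the information is parsed accurately.

Disadvantage: the extraction process is tedious and slow.

Method 2: ignore the markup and search for the key information directly.

Applies to plain-text search.

A text search function over the information is enough.

Advantage: the extraction process is simple and fast.

Disadvantage: accuracy depends on the content of the information.

Combined method: use both markup parsing and searching to extract the key information.

Applies to XML, JSON, and YAML together with search.

Requires both a markup parser and a text search function.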
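The two basic methods can be compared side by side. The sketch below uses only the standard library: html.parser stands in for a full markup parser (method 1) and a regular expression stands in for direct search (method 2). The sample HTML and the helper names are assumptions made for illustration; they are not part of the original notes.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment used only for this comparison
SAMPLE_HTML = (
    '<html><body><p class="course">Courses: '
    '<a href="http://www.example.com/basic" id="link1">Basic Python</a> and '
    '<a href="http://www.example.com/advanced" id="link2">Advanced Python</a>.'
    '</p></body></html>'
)


class LinkCollector(HTMLParser):
    """Method 1: parse the markup and read href from every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def links_by_parsing(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def links_by_search(html):
    """Method 2: ignore the markup and search the text directly."""
    return re.findall(r'href="(.*?)"', html)


print(links_by_parsing(SAMPLE_HTML))
print(links_by_search(SAMPLE_HTML))
```

On this tidy sample both approaches agree. The search version would also pick up href attributes on non-anchor tags or inside comments, which is the accuracy trade-off described above.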

Example: extracting all URL links from an HTML page

Approach:

1. Find all <a> tags.

2. Parse each tag and extract the link after href.

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo,"html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))

Searching HTML with bs4

Method	Description
<>.find_all(name,attrs,recursive,string,**kwargs)	Returns a list holding the search results

Shorthand: <tag>(..) is equivalent to <tag>.find_all(..)

# name: a search string for tag names
soup.find_all('a')
soup.find_all(['a', 'b'])
soup.find_all(True)  # returns every tag in soup
for tag in soup.find_all(True):
    print(tag.name)  # html head title body p b p a a
# Print every tag whose name starts with b, including b and body
import re  # regular expression library
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)  # body b

# attrs: a search string for tag attribute values; attributes can be named
soup.find_all('p', 'course')
soup.find_all(id='link1')
soup.find_all(id=re.compile('link'))

# recursive: whether to search all descendants; defaults to True
soup.find_all('p', recursive=False)

# string: a search string for the text between <> and </>
soup.find_all(string="Basic Python")
soup.find_all(string=re.compile('Python'))
# Shorthand: soup(..) is equivalent to soup.find_all(..)

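Extended methods (same parameters as .find_all())

Method	Description
<>.find()	Searches and returns only the first result (a single element, not a list)
<>.find_parents()	Searches the ancestor nodes; returns a list
<>.find_parent()	Returns the first result among the ancestor nodes
<>.find_next_siblings()	Searches the following sibling nodes; returns a list
<>.find_next_sibling()	Returns the first result among the following sibling nodes
<>.find_previous_siblings()	Searches the preceding sibling nodes; returns a list
<>.find_previous_sibling()	Returns the first result among the preceding sibling nodes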

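4. An Information Extraction Example

A Focused Crawler for Chinese University Rankings

Functional description:

Input: the URL of the university ranking page

Output: the ranking information printed to the screen (rank, university name, total score)

Technical approach: requests-bs4

Focused crawler: crawls only the given URL and does not follow further links

Program structure:

Step 1: fetch the university ranking page from the network

getHTMLText()

Step 2: extract the information from the page into a suitable data structure

fillUnivList()

Step 3: use the data structure to format and print the results

printUnivList()

First draft of the code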

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout= 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名", "學校名稱", "分數"))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo,html)
    printUnivList(uinfo,20) #20 univs
main()

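Aligning Chinese Output

When a Chinese string is narrower than the field width, format() pads it with Western (half-width) spaces by default, which breaks column alignment.

Padding with the full-width Chinese space chr(12288) solves the problem.

The fields of a format specification are:

<fill>: a single character used for padding

<align>: < left-align, > right-align, ^ center

<width>: the output width of the slot

,: thousands separator, for integers and floats

<.precision>: number of decimal places for a float, or maximum output length for a string

<type>: integer types b, c, d, o, x, X; float types e, E, f, %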
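A small sketch of the fill field in action, assuming a four-character university name as sample data. It pads the same string to width 10 twice: once with the default half-width space and once with chr(12288) supplied through a nested slot.

```python
FULL_WIDTH_SPACE = chr(12288)  # the full-width (Chinese) space

name = "清華大學"  # sample value used only for illustration

# Default fill: half-width spaces pad the name to width 10
padded_default = "{:^10}".format(name)
# Custom fill: slot {1} supplies the fill character for slot {0}
padded_cjk = "{0:{1}^10}".format(name, FULL_WIDTH_SPACE)

print("[" + padded_default + "]")
print("[" + padded_cjk + "]")
```

Both results are ten characters long, but only the second keeps columns aligned in a terminal that renders Chinese characters at double width.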

Optimized code

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout= 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "學校名稱", "分數",chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2],chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo,html)
    printUnivList(uinfo,20) #20 univs
main()

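5. Getting Started with the re Library

Regular Expressions

A general framework for describing strings
An expression that concisely describes a set of strings
A tool that captures the "conciseness" and "features" of strings
Used to decide whether a string has a given feature

Regular Expression Syntax

Operator	Description	Example
.	Any single character
[ ]	Character set; a range of allowed values for a single character	[abc] means a, b, or c; [a-z] means a single character from a to z
[^ ]	Negated character set; excluded values for a single character	[^abc] means a single character other than a, b, or c
*	Zero or more repetitions of the preceding character	abc* matches ab, abc, abcc, abccc, etc.
+	One or more repetitions of the preceding character	abc+ matches abc, abcc, abccc, etc.
?	Zero or one repetition of the preceding character	abc? matches ab, abc
|	Either the left or the right expression	abc|def matches abc, def
{m}	Exactly m repetitions of the preceding character	ab{2}c matches abbc
{m,n}	From m to n repetitions (inclusive) of the preceding character	ab{1,2}c matches abc, abbc
^	Matches the start of the string	^abc matches abc at the start of a string
$	Matches the end of the string	abc$ matches abc at the end of a string
( )	Grouping; only the | operator may be used inside	(abc) matches abc; (abc|def) matches abc, def
\d	A digit; equivalent to [0-9]
\w	A word character; equivalent to [A-Za-z0-9_]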

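Classic Regular Expression Examples

Regular expression	Description
^[A-Za-z]+$	A string made up of the 26 letters
^[A-Za-z0-9]+$	A string made up of the 26 letters and digits
^-?\d+$	An integer
^[0-9]*[1-9][0-9]*$	A positive integer
[1-9]\d{5}	A six-digit postal code in mainland China
[\u4e00-\u9fa5]	A Chinese character
\d{3}-\d{8}|\d{4}-\d{7}	A telephone number in mainland China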
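The classic patterns above can be checked with a few lines of Python. The sketch below is illustrative only: the dictionary keys, the helper name, and the sample strings are assumptions, and re.fullmatch is used so that each sample must match in its entirety.

```python
import re

# Patterns taken from the table above; the dictionary keys are made up
patterns = {
    "letters": r"^[A-Za-z]+$",
    "integer": r"^-?\d+$",
    "positive_integer": r"^[0-9]*[1-9][0-9]*$",
    "postal_code": r"[1-9]\d{5}",
    "phone": r"\d{3}-\d{8}|\d{4}-\d{7}",
}


def matches(name, text):
    """Return True when the whole of text matches the named pattern."""
    return re.fullmatch(patterns[name], text) is not None


# Hypothetical sample strings
samples = [
    ("letters", "Python"),
    ("letters", "Py3"),
    ("integer", "-42"),
    ("positive_integer", "0"),
    ("positive_integer", "10"),
    ("postal_code", "100081"),
    ("phone", "010-68913536"),
]
for name, text in samples:
    print(name, repr(text), matches(name, text))
```

Note that the positive-integer pattern needs the two * quantifiers; without them it would accept only three-digit strings.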

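Basic Use of the re Library

re is part of Python's standard library and is used mainly for string matching.

Representing regular expressions

A raw string is a string in which the backslash \ is not treated as an escape character.

The re library conventionally uses raw strings for regular expressions, written as r'text'.

For example: r'[1-9]\d{5}'

r'\d{3}-\d{8}|\d{4}-\d{7}'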

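Main Functions of the re Library

Function	Description
re.search()	Searches a string for the first position that matches the regular expression; returns a match object
re.match()	Matches the regular expression from the start of a string; returns a match object
re.findall()	Searches a string and returns all matching substrings as a list
re.split()	Splits a string at the matches of the regular expression; returns a list
re.finditer()	Searches a string and returns an iterator of matches; each element is a match object
re.sub()	Replaces every substring that matches the regular expression; returns the resulting string

re.search(pattern,string,flags=0)

re.search(pattern,string,flags=0)

Searches a string for the first position that matches the regular expression and returns a match object.

pattern: the regular expression as a string or raw string;
string: the string to search;

flags: control flags for the regular expression;

Common flag	Description
re.I, re.IGNORECASE	Ignore case in the regular expression, so [A-Z] also matches lowercase characters
re.M, re.MULTILINE	The ^ operator matches at the start of every line of the given string
re.S, re.DOTALL	The . operator matches every character; by default it matches every character except a newline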
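The effect of the three flags is easiest to see on a two-line string. The following sketch uses a made-up sample text and compares each search with and without the relevant flag.

```python
import re

text = "bit\nBeijing"  # sample two-line string

# Without flags, ^ anchors only at the very start and [a-z] is case-sensitive
plain = re.findall(r"^b[a-z]+", text)
# re.M lets ^ match at the start of every line; re.I ignores case
flagged = re.findall(r"^b[a-z]+", text, flags=re.M | re.I)

# By default . does not match a newline; re.S makes it do so
no_dotall = re.search(r"bit.Beijing", text)
dotall = re.search(r"bit.Beijing", text, flags=re.S)

print(plain, flagged)
print(no_dotall, dotall is not None)
```

Flags are combined with the | operator, as in re.M | re.I.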

Example:

import re
match = re.search(r'[1-9]\d{5}','BIT 100081')
if match:
    print(match.group(0))  #'100081'

re.match(pattern,string,flags=0)

re.match(pattern,string,flags=0)

Matches the regular expression from the start of a string and returns a match object.
- pattern: the regular expression as a string or raw string;
- string: the string to match;
- flags: control flags for the regular expression;

Example:

import re
match = re.match(r'[1-9]\d{5}','BIT 100081')
if match:
    print(match.group(0))  # never runs: match is None because the string does not start with a postal code
match = re.match(r'[1-9]\d{5}','100081 BIT')
if match:
    print(match.group(0))  #'100081'

re.findall(pattern,string,flags=0)

re.findall(pattern,string,flags=0)

Searches a string and returns all matching substrings as a list.
- pattern: the regular expression as a string or raw string;
- string: the string to search;
- flags: control flags for the regular expression;

Example:

import re
ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls) #['100081', '100084']

re.split(pattern,string,maxsplit=0,flags=0)

re.split(pattern,string,maxsplit=0,flags=0)

Splits a string at the matches of the regular expression and returns a list.
- pattern: the regular expression as a string or raw string;
- string: the string to split;
- maxsplit: the maximum number of splits; the remainder is returned as the last element;
- flags: control flags for the regular expression;

Example:

import re
ls = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls) #['BIT', ' TSU', '']
ls2 = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)
print(ls2) #['BIT', ' TSU100084']

re.finditer(pattern,string,flags=0)

re.finditer(pattern,string,flags=0)

Searches a string and returns an iterator of matches; each element is a match object.
- pattern: the regular expression as a string or raw string;
- string: the string to search;
- flags: control flags for the regular expression;

Example:

import re
for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    if m:
        print(m.group(0)) #100081 100084

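re.sub(pattern,repl,string,count=0,flags=0)

re.sub(pattern,repl,string,count=0,flags=0)

Replaces every substring that matches the regular expression and returns the resulting string.
- pattern: the regular expression as a string or raw string;
- repl: the string that replaces each match;
- string: the string to search;
- count: the maximum number of replacements;
- flags: control flags for the regular expression;

Example:

import re
rst = re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT 100081,TSU 100084')
print(rst) # 'BIT :zipcode,TSU :zipcode'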

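Another Way to Use the re Library

A compiled pattern object has the same methods as the main functions of the re library.

# Functional style: a one-off operation
rst = re.search(r'[1-9]\d{5}', 'BIT 100081')

# Object-oriented style: compile once, then reuse many times
pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('BIT 100081')

re.compile(pattern,flags=0)

Compiles the string form of a regular expression into a regular expression object.
- pattern: the regular expression as a string or raw string;
- flags: control flags for the regular expression;

regex = re.compile(r'[1-9]\d{5}')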
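As a short sketch of why compiling is convenient, the code below compiles the postal-code pattern once and reuses the same object for searching several strings and for a substitution. The input strings are sample data.

```python
import re

postal = re.compile(r"[1-9]\d{5}")  # compile once, reuse below

texts = ["BIT 100081", "TSU 100084", "no code here"]

# search() on the compiled object takes only the string to search
found = [m.group(0) for m in map(postal.search, texts) if m]

# The same object also offers sub(), findall(), split(), and so on
masked = postal.sub(":zipcode", "BIT 100081,TSU 100084")

print(found)
print(masked)
```

Strings without a match yield None from search() and are skipped by the `if m` filter.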

The Match Object

import re
match = re.search(r'[1-9]\d{5}','BIT 100081')
if match:
    print(match.group(0))  # '100081'
print(type(match)) # <class 're.Match'>

Attributes of the Match object

Attribute	Description
.string	The text that was searched
.re	The pattern object (regular expression) used for the match
.pos	The position in the text where the search started
.endpos	The position in the text where the search ended

Methods of the Match object

Method	Description
.group(0)	Returns the matched string
.start()	Start position of the match in the original string
.end()	End position of the match in the original string
.span()	Returns (.start(), .end())

import re
m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(m.string) # BIT100081 TSU100084
print(m.re) # re.compile('[1-9]\\d{5}')
print(m.pos) # 0
print(m.endpos) # 19
print(m.group(0)) # '100081', the first match only; use re.finditer() to get every match
print(m.start()) # 3
print(m.end()) # 9
print(m.span()) # (3, 9)

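Greedy and Minimal Matching in re

By default the re library matches greedily, returning the longest matching substring.

import re
match = re.search(r'PY.*N', 'PYANBNCNDN')
print(match.group(0)) # PYANBNCNDN

Minimal matching:

import re
match = re.search(r'PY.*?N', 'PYANBNCNDN')
print(match.group(0)) # PYAN

Minimal-match operators

Operator	Description
*?	Zero or more repetitions of the preceding character, minimal match
+?	One or more repetitions of the preceding character, minimal match
??	Zero or one repetition of the preceding character, minimal match
{m,n}?	From m to n repetitions (inclusive) of the preceding character, minimal match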
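The remaining minimal-match operators behave in the same way as *?. The sketch below compares greedy and minimal forms on small sample strings chosen for illustration.

```python
import re

text = "PYANBNCNDN"

greedy_plus = re.search(r"PY.+N", text).group(0)    # as long as possible
lazy_plus = re.search(r"PY.+?N", text).group(0)     # as short as possible

greedy_range = re.search(r"ab{1,3}", "abbb").group(0)
lazy_range = re.search(r"ab{1,3}?", "abbb").group(0)

# ?? prefers zero repetitions of the optional character
lazy_optional = re.search(r"ab??", "ab").group(0)

print(greedy_plus, lazy_plus)
print(greedy_range, lazy_range, lazy_optional)
```

Appending ? to a quantifier changes only how much is consumed, not whether the pattern can match.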

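re Example: A Focused Crawler for Taobao Price Comparison

Functional description:

Goal: fetch Taobao search result pages and extract product names and prices

Things to understand:

Taobao's search interface

How pagination is handled

Technical approach: requests-re

Program structure:

Step 1: submit the product search request and fetch the pages in a loop

Step 2: for each page, extract product names and prices

Step 3: print the information to the screen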

import requests
import re

def getHTMLText(url):
    #瀏覽器請求頭中的User-Agent，表明當前請求的用戶代理信息（下方有獲取方式）
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
    try:
        #瀏覽器請求頭中的cookie，包含本身帳號的登陸信息（下方有獲取方式）
        coo = ''
        cookies = {}
        for line in coo.split(';'): #瀏覽器假裝
            name, value = line.strip().split('=', 1)
            cookies[name] = value
        r = requests.get(url, cookies = cookies, headers=headers, timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

#解析請求到的頁面，提取出相關商品的價格和名稱
def parsePage(ilt, html):
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序號", "價格", "商品名稱"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '書包'
    depth = 2 #爬取深度，2表示爬取兩頁數據
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)

main()

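Note that Taobao has anti-crawling measures, so when fetching pages with the get() method of requests you need to send your local cookie; otherwise Taobao returns an error page and no data can be extracted.

You need to fill the coo variable in the code with the cookie from your own browser. To do so, press F12 in the browser, open the Network panel, search for "書包", and find the request URL (usually the first one). Click the request, find Request Headers in the Headers pane on the right, then copy the User-Agent and cookie fields into the corresponding places in the code.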
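The parsing step can be tried offline. The sketch below applies the same two regular expressions to a hypothetical fragment and, as a swapped-in alternative to eval(), uses json.loads to strip the quotes, since evaluating text from a remote page is risky. The sample string and the function name parse_page are assumptions; real Taobao pages may look different.

```python
import json
import re

# Hypothetical fragment in the shape the two patterns expect
sample = (
    '{"raw_title":"School bag: large","view_price":"59.00"},'
    '{"raw_title":"Backpack","view_price":"128.50"}'
)


def parse_page(html):
    """Return [price, title] pairs found in html."""
    prices = re.findall(r'"view_price":"[\d.]*"', html)
    titles = re.findall(r'"raw_title":".*?"', html)
    items = []
    for price_field, title_field in zip(prices, titles):
        # split(':', 1) keeps any colon that appears inside the value
        price = json.loads(price_field.split(':', 1)[1])
        title = json.loads(title_field.split(':', 1)[1])
        items.append([price, title])
    return items


for price, title in parse_page(sample):
    print(price, title)
```

Using zip also avoids an index error when the two lists happen to differ in length.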

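re Example: A Focused Crawler for Stock Data

Functional description:

Goal: get the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges

Output: saved to a file

Technical approach: requests-bs4-re

Choosing a candidate data site:

Sina Stocks: https://finance.sina.com.cn/stock/

Baidu Stocks: https://gupiao.baidu.com/stock/

Selection criteria: the stock information is present statically in the HTML page rather than generated by JavaScript, and the site's Robots protocol does not restrict access.

Program structure

Step 1: get the stock list from Eastmoney

Step 2: for each stock in the list, fetch its details from Baidu Stocks

Step 3: save the results to a file

First draft of the code (error)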

import requests
from bs4 import BeautifulSoup
import traceback
import re
 
def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
 
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
 
def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except:
            traceback.print_exc()
            continue
 
def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
 
main()

Optimized code (error)

Speed-up: optimizing encoding detection

import requests
from bs4 import BeautifulSoup
import traceback
import re
 
def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""
 
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
 
def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
                count = count + 1
                print("\r當前進度: {:.2f}%".format(count*100/len(lst)),end="")
        except:
            count = count + 1
            print("\r當前進度: {:.2f}%".format(count*100/len(lst)),end="")
            continue
 
def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
 
main()

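Code that ran successfully

Because the Eastmoney link returned an error when accessed, a different site was used to get the stock list. The code is as follows: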

import requests
import re
import traceback
from bs4 import BeautifulSoup
import bs4


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return""


def getStockList(lst, stockListURL):
    html = getHTMLText(stockListURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    lst = []
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[S][HZ]\d{6}", href)[0])
        except:
            continue
    lst = [item.lower() for item in lst]  # 將爬取信息轉換小寫
    return lst


def getStockInfo(lst, stockInfoURL, fpath):
    count = 0
    for stock in lst:
        url = stockInfoURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            if isinstance(stockInfo, bs4.element.Tag):  # 判斷類型
                name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
                infoDict.update({'股票名稱': name.text.split('\n')[1].replace(' ','')})
                keylist = stockInfo.find_all('dt')
                valuelist = stockInfo.find_all('dd')
                for i in range(len(keylist)):
                    key = keylist[i].text
                    val = valuelist[i].text
                    infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count = count + 1
                print("\r當前速度：{:.2f}%".format(count*100/len(lst)), end="")
        except:
            count = count + 1
            print("\r當前速度：{:.2f}%".format(count*100/len(lst)), end="")
            traceback.print_exc()
            continue


def main():
    fpath = 'D://gupiao.txt'
    stock_list_url = 'https://hq.gucheng.com/gpdmylb.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    slist = []
    slist = getStockList(slist, stock_list_url)  # avoid shadowing the built-in name list
    getStockInfo(slist, stock_info_url, fpath)


main()

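6. The Scrapy Crawler Framework

A crawler framework is a software structure and a collection of functional components that implement crawling.

A crawler framework is a half-finished product that helps users build professional web crawlers.

Installing Scrapy

pip install scrapy
# Verify the installation
scrapy -h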

An error you may encounter

building 'twisted.test.raiser' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/

Solution

Check the Python version and architecture

C:\Users\ASUS>python
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

This shows Python 3.7.2, 64-bit.

Download Twisted

Download the Twisted wheel (.whl) that matches your version from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted;
for this version that is Twisted-17.9.0-cp37-cp37m-win_amd64.whl

Note: the number after cp is the Python version and amd64 means 64-bit; on a 32-bit system download the 32-bit file.

Install Twisted (use the name of the file you actually downloaded)

python -m pip install D:\download\Twisted-19.2.0-cp37-cp37m-win_amd64.whl

Install Scrapy
```
python -m pip install scrapy
```

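The Scrapy Framework Explained

Engine: no user changes needed
- Controls the data flow between all modules
- Triggers events when conditions are met
Downloader: no user changes needed
- Downloads pages in response to requests
Scheduler: no user changes needed
- Schedules and manages all crawl requests
Downloader Middleware: users may write configuration code
- Purpose: user-configurable control between the Engine, Scheduler, and Downloader
- Function: modify, drop, or add requests or responses
Spider: users must write configuration code
- Parses the responses (Response) returned by the Downloader
- Produces scraped items
- Produces additional crawl requests (Request)
Item Pipelines: users must write configuration code
- Process the items produced by the Spider in pipeline fashion
- Consist of a sequence of operations, like an assembly line; each operation is an Item Pipeline type
- Possible operations include cleaning, validating, and de-duplicating the HTML data in items, and storing data in a database
Spider Middleware: users may write configuration code
- Purpose: reprocess requests and scraped items
- Function: modify, drop, or add requests or scraped items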

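requests vs. Scrapy

Similarities
- Both can request and crawl pages; they are the two main technical routes for Python crawlers
- Both are easy to use, well documented, and simple to get started with
- Neither handles JavaScript, form submission, or CAPTCHAs out of the box (both can be extended)

Differences

requests	Scrapy
Page-level crawler	Site-level crawler
Library	Framework
Little support for concurrency; lower performance	Good concurrency; higher performance
Focuses on downloading pages	Focuses on crawler structure
Flexible to customize	Flexible for ordinary customization; deep customization is difficult
Very easy to pick up	Slightly harder to get started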

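Common Scrapy Commands

The Scrapy command line

Scrapy is a professional crawler framework designed to run continuously, and it is operated through the Scrapy command line.

Command	Description	Format
startproject	Create a new project	scrapy startproject <name> [dir]
genspider	Create a spider	scrapy genspider [options] <name> <domain>
settings	Get crawler configuration	scrapy settings [options]
crawl	Run a spider	scrapy crawl <spider>
list	List all spiders in the project	scrapy list
shell	Start an interactive shell for debugging a URL	scrapy shell [url]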

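Basic Use of the Scrapy Framework

Step 1: create a Scrapy project

# Open a command prompt: press Win+R and enter cmd
# Change to the directory that will hold the project
D:\>cd demo
D:\demo>
# Create a project named python123demo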
D:\demo>scrapy startproject python123demo
New Scrapy project 'python123demo', using template directory 'd:\program files\python\lib\site-packages\scrapy\templates\project', created in:
    D:\demo\python123demo

You can start your first spider with:
    cd python123demo
    scrapy genspider example example.com
D:\demo>

The generated project layout:

python123demo/                  outer directory
    scrapy.cfg                  deployment configuration for the Scrapy crawler
    python123demo/              user-defined Python code for the Scrapy framework
        __init__.py             initialization script
        items.py                Items code template (subclass to use)
        middlewares.py          Middlewares code template (subclass to use)
        pipelines.py            Pipelines code template (subclass to use)
        settings.py             Scrapy crawler configuration
        spiders/                directory of Spiders code templates (subclass to use)
            __init__.py         initial file; no need to modify
            __pycache__/        cache directory; no need to modify

Step 2: generate a Scrapy spider in the project

# Change to the project directory
D:\demo>cd python123demo
# Generate a Scrapy spider
D:\demo\python123demo>scrapy genspider demo python123.io
Created spider 'demo' using template 'basic' in module:
  python123demo.spiders.demo

D:\demo\python123demo>

Step 3: configure the generated spider

Edit D:\demo\python123demo\python123demo\spiders\demo.py

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
#    allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)

The full way to write the same spider

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"

    def start_requests(self):
        urls = [
            'http://python123.io/ws/demo.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)

Step 4: run the spider and fetch the page

# Run the command scrapy crawl
D:\demo\python123demo>scrapy crawl demo

A possible error

ModuleNotFoundError: No module named 'win32api'

Solution

Download the Python 3 wheel pypiwin32-223-py3-none-any.whl from https://pypi.org/project/pypiwin32/#files;

install pypiwin32-223-py3-none-any.whl:

pip install D:\download\pypiwin32-223-py3-none-any.whl

Run the spider again from the project directory:
```
scrapy crawl demo
```

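Using the yield Keyword

yield <-----------------------> generator
- A generator is a function that keeps producing values;
- a function that contains a yield statement is a generator;
- each time the generator produces a value (at the yield statement) the function is frozen, and it produces the next value when it is woken up again.

Example:

def gen(n):
    for i in range(n):
        yield i**2
# Produce the squares of all non-negative integers less than n
for i in gen(5):
    print(i, end=" ")
#0 1 4 9 16

# The ordinary way
def square(n):
    ls = [i**2 for i in range(n)]
    return ls
for i in square(5):
    print(i, end=" ")
#0 1 4 9 16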

Why use generators?

Advantages of a generator over listing everything at once
- Uses less memory
- Responds faster
- More flexible to use
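One way to see these advantages is a generator with no upper bound, which a list-building function could never return. The sketch below is an illustration with made-up names; it takes only the first five values using itertools.islice from the standard library.

```python
from itertools import islice


def squares():
    """Yield 0, 1, 4, 9, ... one value at a time, without building a list."""
    i = 0
    while True:
        yield i ** 2
        i += 1


# Only five values are ever computed, although the generator never ends
first_five = list(islice(squares(), 5))
print(first_five)
```

Each value is produced only when the consumer asks for it, which is also how a Scrapy spider hands requests and items to the engine.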

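Steps for Using a Scrapy Crawler

Step 1: create a project and a Spider template;
Step 2: write the Spider;
Step 3: write the Item Pipeline;
Step 4: tune the configuration.

Data Types in Scrapy

The Request class

class scrapy.http.Request()

A Request object represents an HTTP request.
It is generated by a Spider and executed by the Downloader.

Attribute or method	Description
.url	The URL the Request is for
.method	The request method, such as 'GET' or 'POST'
.headers	The request headers, in a dict-like form
.body	The request body, as bytes
.meta	Extra information added by the user, used to pass data between Scrapy's internal modules
.copy()	Copies the request

The Response class

class scrapy.http.Response()

A Response object represents an HTTP response.
It is generated by the Downloader and processed by a Spider.

Attribute or method	Description
.url	The URL the Response is for
.status	The HTTP status code; defaults to 200
.headers	The response headers
.body	The response content, as bytes
.flags	A set of flags
.request	The Request object that produced this Response
.copy()	Copies the response

The Item class

class scrapy.item.Item()

An Item object represents information extracted from an HTML page.
It is generated by a Spider and processed by the Item Pipeline.
An Item behaves like a dict and can be used in the same way.

Basic Use of CSS Selectors

.css('a::attr(href)').extract()

CSS Selectors are maintained and standardized by the W3C.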

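A Scrapy Crawler for Stock Data

Functional description:

Technical approach: scrapy

Goal: get the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges

Output: saved to a file

Writing the example

Step 1: open a command prompt and create the project and Spider template

scrapy startproject BaiduStocks
cd BaiduStocks
scrapy genspider stocks baidu.com
# Then edit spiders/stocks.py

Step 2: write the Spider
- Configure stocks.py
- Change how returned pages are handled
- Change how newly discovered URLs are requested

Open spiders/stocks.py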

# -*- coding: utf-8 -*-
import scrapy
import re
    
    
class StocksSpider(scrapy.Spider):
    name = "stocks"
    start_urls = ['https://quote.eastmoney.com/stocklist.html']
    
    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except:
                continue
    
    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '--'
            infoDict[key]=val
    
        infoDict.update(
            {'股票名稱': re.findall(r'\s.*\(', name)[0].split()[0] +
                re.findall(r'>.*<', name)[0][1:-1]})
        yield infoDict

Step 3: write the Pipelines
- Configure pipelines.py
- Define a class that processes scraped items (Scrapy Item)
- Configure the ITEM_PIPELINES setting

pipelines.py

# -*- coding: utf-8 -*-
    
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item
    
class BaidustocksInfoPipeline(object):
    def open_spider(self, spider):
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')
    
    def close_spider(self, spider):
        self.f.close()
    
    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item

settings.py

# Configure item pipelines
# See https://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
	'BaiduStocks.pipelines.BaidustocksInfoPipeline': 300,
}

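Configuring Concurrency Options

settings.py

Option	Description
CONCURRENT_REQUESTS	Maximum number of concurrent requests in the Downloader; default 16
CONCURRENT_ITEMS	Maximum number of items processed concurrently in the Item Pipeline; default 100
CONCURRENT_REQUESTS_PER_DOMAIN	Maximum number of concurrent requests per target domain; default 8
CONCURRENT_REQUESTS_PER_IP	Maximum number of concurrent requests per target IP; default 0; takes effect only when non-zero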
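As a sketch of how these options might appear in a project's settings.py, the fragment below lowers the per-domain limit to be gentler on the target site. The values are illustrative assumptions, not recommendations from the original notes; the option names are spelled with PER, as in the Scrapy documentation.

```python
# settings.py (fragment): illustrative values only
CONCURRENT_REQUESTS = 16            # total concurrent requests in the Downloader
CONCURRENT_ITEMS = 100              # items processed in parallel by the pipelines
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # lower than the default, to go easier on one site
CONCURRENT_REQUESTS_PER_IP = 0      # 0 leaves the per-IP limit switched off
```

Because settings.py is an ordinary Python module, each option is just a module-level assignment.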