Introduction to Python Web Crawlers

Date: 2019-11-08


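1. Prerequisites

Readers should already be familiar with Python's numeric types, strings, branching, loops, functions, lists, dictionaries, file handling, and the use of third-party libraries.

Python introduction: http://www.javashuo.com/article/p-cgbvzsyw-ch.html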

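2. Basic Workflow of a Python Crawler

a. Send the request

Use an HTTP library to send a request to the target site. A Request contains request headers, a request body, and so on.

Limitation of the Requests library: it only fetches the raw response and cannot execute JavaScript or apply CSS.

b. Get the response

If the requested content exists on the target server, the server returns it.

A Response may contain HTML, a JSON string, images, video, and so on.

c. Parse the content

For a human user, this means finding the information you need. For a Python crawler, it means extracting the target information with regular expressions or other libraries.

Parsing HTML: regular expressions (the re module), or third-party parsers such as Beautiful Soup and pyquery

Parsing JSON: the json module

Handling binary data: write it to a file opened in wb mode

d. Save the data

The parsed data can be saved locally in many forms, such as text, audio, or video.

Databases (MySQL, MongoDB, Redis)

Files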
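
The parse and save steps (c and d) can be tried without touching the network. The following is only a minimal sketch: a hard-coded HTML string stands in for the response body, and the sample markup, the helper names parse_titles and save_records, and the output file name are all hypothetical.

```python
import json
import os
import re
import tempfile

# Stand-in for r.text from a real response (hypothetical sample markup).
SAMPLE_HTML = """
<html><body>
<h2 class="title">First post</h2>
<h2 class="title">Second post</h2>
</body></html>
"""

def parse_titles(html):
    """Step c: extract the text of every <h2 class="title"> element."""
    return re.findall(r'<h2 class="title">(.*?)</h2>', html)

def save_records(records, path):
    """Step d: save the parsed data as one JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

titles = parse_titles(SAMPLE_HTML)
records = [{"rank": i, "title": t} for i, t in enumerate(titles, start=1)]
out_path = os.path.join(tempfile.gettempdir(), "crawler_demo.jsonl")
save_records(records, out_path)
print(records)
```

In a real crawler, SAMPLE_HTML would be replaced by the text of a response obtained in steps a and b.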

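3. Getting Started with the Requests Library

Requests is an HTTP library written in Python on top of urllib3 and released under the Apache 2.0 open-source license.

3.1 Installing and testing Requests

Installation:

On Windows: run cmd as administrator and execute pip install requests

Test: start a Python shell and run import requests; if no error is raised, the installation succeeded.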

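3.2 The 7 main methods of the Requests library

Method	Description
requests.request()	Constructs a request; the base method that the six methods below are built on
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()	Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to an HTML page; corresponds to HTTP DELETE

Requests with optional parameters:

requests.request(method,url,**kwargs)

method: the request method, one of 7 such as GET, PUT, and POST

url: the URL of the page to fetch

**kwargs: parameters that control the request, all optional; the 13 of them are listed below

params: a dict or byte sequence, appended to the URL as query parameters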

>>> kv = {'key1':'value1','key2':'value2'}
>>> r = requests.request('GET','http://python123.io/ws',params=kv)
>>> print(r.url)
https://python123.io/ws?key1=value1&key2=value2
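
To see what params does to the URL, the same query string can be built with the standard library. This only illustrates the encoding and is not a claim about how Requests works internally; the non-ASCII keyword is a made-up example.

```python
from urllib.parse import urlencode

kv = {'key1': 'value1', 'key2': 'value2'}
query = urlencode(kv)
url = 'https://python123.io/ws' + '?' + query
print(url)  # https://python123.io/ws?key1=value1&key2=value2

# Characters that are not URL-safe are percent-encoded as UTF-8 bytes.
encoded = urlencode({'wd': '爬虫'})
print(encoded)
```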

data: a dict, byte sequence, or file object, sent as the body of the request

>>> kv = {'key1':'value1','key2':'value2'}
>>> r = requests.request('POST','http://python123.io/ws',data=kv)
>>> body = 'body content'
>>> r = requests.request('POST','http://python123.io/ws',data=body)

json: data in JSON format, sent as the body of the request

>>> kv = {'key1':'value1','key2':'value2'}
>>> r = requests.request('POST','http://python123.io/ws',json=kv)

headers: a dict of custom HTTP headers

>>> hd = {'user-agent':'Chrome/10'}
>>> r = requests.request('POST','http://www.baidu.com',headers=hd)

cookies: a dict or CookieJar, the cookies sent with the request

files: a dict, used to upload files

>>> f = {'file':open('/root/po.sh','rb')}
>>> r = requests.request('POST','http://python123.io/ws',files=f)

timeout: the timeout, in seconds

>>> r = requests.request('GET','http://python123.io/ws',timeout=30)

proxies: a dict that sets the proxy servers to use; login credentials can be included

>>> pxs = {'http':'http://user:pass@10.10.10.2:1234',
... 'https':'https://10.10.10.3:1234'}
>>> r = requests.request('GET','http://www.baidu.com',proxies=pxs)

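allow_redirects: True/False, default True; switch for following redirects

stream: True/False, default False; when False the response body is downloaded immediately, when True the download is deferred until the content is accessed

verify: True/False, default True; switch for verifying the SSL certificate

cert: path to a local SSL certificate

auth: a tuple, enables HTTP authentication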

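3.3 The get() method of the Requests library

requests.get(url, params=None, **kwargs) sends a GET request and returns a Response object.

3.4 The Response object

A Response object contains everything the server returned, as well as the Request that produced it.

Attributes of the Response object used in this article include r.status_code, r.text, r.content, r.encoding, and r.apparent_encoding.

3.5 Understanding the Response encoding

Note: ISO-8859-1 cannot represent Chinese characters. If r.text looks garbled, set r.encoding = "utf-8" before reading r.text.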
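
The effect of a wrong encoding can be reproduced with plain bytes and no network access. This is a small sketch; the sample string is an arbitrary example, and the two decode calls only imitate what reading r.text does under each encoding setting.

```python
raw = '中文页面'.encode('utf-8')   # bytes as a server might send them

wrong = raw.decode('iso-8859-1')   # like r.text when r.encoding is ISO-8859-1
right = raw.decode('utf-8')        # like r.text after r.encoding = 'utf-8'

print(repr(wrong))
print(right)

# ISO-8859-1 maps every byte to one character, so no error is raised and
# the text is silently garbled; the original bytes can still be recovered.
assert wrong.encode('iso-8859-1') == raw
```

Because no exception is raised, garbled output is usually the only symptom of a wrong encoding.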

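3.6 Understanding Requests exceptions

The Requests library defines 6 common connection-related exceptions.

Note: network connections are unreliable, so exception handling matters. r.raise_for_status() raises requests.HTTPError when the response status is an error code (4xx or 5xx).

3.7 A general code framework for fetching a page

import requests
def getHTMLText(url):
    try:
        r = requests.get(url,timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding guessed from the content
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))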

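4. Crawler Etiquette: the Robots Protocol

Robots is a protocol between websites and crawlers. robots.txt is an ASCII text file stored in the root directory of a website. It tells search-engine robots (also called web spiders) which parts of the site they may fetch and which they should not. Because URLs are case-sensitive on some systems, the file name should always be lowercase: robots.txt.

[Figure: scale of web crawlers]

4.1 Problems caused by web crawlers

a. Performance harassment

Web servers are designed for human visitors by default. Frequent requests from crawlers impose a heavy resource cost on the server.

b. Legal risk

Data on a server has an owner, and profiting from crawled data carries legal risk.

c. Privacy leaks

A crawler may be able to get past simple access controls and obtain protected data, leaking personal privacy.

4.2 Restricting web crawlers

a. Source checking: restrict by User-Agent

Check the User-Agent field of incoming HTTP request headers and respond only to browsers and well-behaved crawlers.

b. Public notice: the Robots protocol

Tell all crawlers the site's crawling policy and ask them to comply with the Robots protocol.

4.3 A real Robots protocol example

JD.com's Robots protocol:

https://www.jd.com/robots.txt

In robots.txt, # starts a comment, * stands for all, and / stands for the root directory.

4.4 How to comply with the Robots protocol

Understanding the Robots protocol

Identify robots.txt automatically or manually, then crawl the content.

The Robots protocol is advisory rather than binding. A crawler may ignore it, but doing so carries legal risk.

Principle: access that resembles human behavior, such as normal browsing or crawling at a low frequency, need not consult the Robots protocol.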
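
Automatic identification of robots.txt can be sketched with the standard library's urllib.robotparser. The rules below are a made-up robots.txt, and the agent names and example.com URLs are hypothetical; a real crawler would download the file from the site root first.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real crawler would fetch it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: EvilBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("FriendlyBot", "https://example.com/index.html"))
print(allowed("FriendlyBot", "https://example.com/private/data.html"))
print(allowed("EvilBot", "https://example.com/index.html"))
```

Calling allowed() before each request is one simple way for a crawler to respect the published policy.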

5. Hands-on Crawling with Requests

5.1 Fetching a JD.com product page

Target page: https://item.jd.com/5089267.html

Example code:

import requests
url = 'https://item.jd.com/5089267.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")

Result: [screenshot]

5.2 Fetching a Dangdang product page

Target page: http://product.dangdang.com/26487763.html

Code:

import requests
url = 'http://product.dangdang.com/26487763.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding =r.apparent_encoding
    print(r.text[:1000])
except IOError as e:
    print(str(e))

An error is raised:

HTTPConnectionPool(host='127.0.0.1', port=80): Max retries exceeded with url: /26487763.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10fc390>: Failed to establish a new connection: [Errno 111] Connection refused',))

Cause: Dangdang rejects requests that do not appear to come from a legitimate browser.

Inspect the original HTTP request headers:

print(r.request.headers)

Improved code: construct a plausible HTTP request header

import requests
url = 'http://product.dangdang.com/26487763.html'
try:
    kv = {'user-agent':'Mozilla/5.0'}
    r = requests.get(url,headers=kv)
    r.raise_for_status()
    r.encoding =r.apparent_encoding
    print(r.text[:1000])
except IOError as e:
    print(str(e))

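The page is now fetched normally: [screenshot]

5.3 Submitting keywords to the Baidu and 360 search engines

Baidu keyword interface: http://www.baidu.com/s?wd=keyword

Implementation: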

import requests
keyword = "python"
try:
    kv = {'wd':keyword}
    r = requests.get("http://www.baidu.com/s",params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except IOError as e:
    print(str(e))

Output: [screenshot]

360 keyword interface:

http://www.so.com/s?q=keyword

Implementation:

import requests
keyword = "Linux"
try:
    kv = {'q':keyword}
    r = requests.get("http://www.so.com/s",params=kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except IOError as e:
    print(str(e))

Output: [screenshot]

5.4 Fetching and saving an image

Format of an image link:

http://FQDN/picture.jpg

Xiaohuar site: http://www.xiaohuar.com

Pick one image URL: http://www.xiaohuar.com/d/file/20141116030511162.jpg

Implementation:

import requests
import os
url = "http://www.xiaohuar.com/d/file/20141116030511162.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]  # save under the image's original file name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path,'wb') as f:  # the with block closes the file
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except IOError as e:
    print(str(e))

The image now exists on disk: [screenshot]

5.5 IP address location lookup

IP location lookup interface: http://www.ip138.com/ips138.asp?ip=

Implementation:

import requests
url = "http://www.ip38.com/ip.php?ip="
try:
    r = requests.get(url+'104.193.88.77')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except IOError as e:
    print(str(e))

5.6 Submitting a form to Youdao Translate

Open Youdao Translate, open the browser's developer tools, click Network and then XHR, and locate the translation request:

import requests
import json

def get_translate_data(word=None):
    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
    # POST parameters go in the request body, so build a dict for them
    form_data = {'i': word,
                 'from': 'AUTO',
                 'to': 'AUTO',
                 'smartresult': 'dict',
                 'client': 'fanyideskweb',
                 'salt': '15569272902260',
                 'sign': 'b2781ea3e179798436b2afb674ebd223',
                 'ts': '1556927290226',
                 'bv': '94d71a52069585850d26a662e1bcef22',
                 'doctype': 'json',
                 'version': '2.1',
                 'keyfrom': 'fanyi.web',
                 'action': 'FY_BY_REALTlME'
                 }
    # submit the form data
    response = requests.post(url,data=form_data)
    # convert the JSON string into a dict
    content = json.loads(response.text)
    # print the translated text
    print(content['translateResult'][0][0]['tgt'])

if __name__ == '__main__':
    word = input("Enter the text to translate: ")
    get_translate_data(word)

Output: [screenshot]

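6 Getting Started with Beautiful Soup

6.1 Introduction

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape. Because it is simple, a complete application takes very little code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it, in which case you only need to state the original encoding.

Beautiful Soup has become a Python parsing library on a par with lxml and html5lib, giving users the flexibility to choose different parsing strategies or raw speed.

6.2 Installing Beautiful Soup

The current version of Beautiful Soup is 4.x, and earlier versions are no longer developed. Installing with pip is recommended:

pip install beautifulsoup4

Verify the installation:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>','html.parser')
print(soup.p.string)

The output is:

Hello

Note: the package installed is called beautifulsoup4, but it is imported as bs4, because the library directory inside the package source is named bs4. After installation that directory is placed in the local Python 3 library path, so the module is recognized as bs4.

In other words, the name of a package and the name used to import it are not necessarily the same.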

6.3 Beautiful Soup parsers

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk,'html.parser')	install the bs4 library
lxml's HTML parser	BeautifulSoup(mk,'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk,'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk,'html5lib')	pip install html5lib

To use lxml, pass 'lxml' as the second argument when creating the BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>','lxml')
print(soup.p.string)

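6.4 Basic usage of BeautifulSoup

Basic elements of the BeautifulSoup class

Element	Description
Tag	A tag, the basic unit of information; its start and end are marked with <> and </>
Name	The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes	The attributes of the tag, organized as a dict. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string in <>...</>. Format: <tag>.string
Comment	The comment part of a string inside a tag, represented by the special Comment type

An example of basic BeautifulSoup usage: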

>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title  # get the title tag
<title>This is a python demo page</title>
>>> soup.a  # get the first a tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.title.string
'This is a python demo page'
>>> soup.prettify()  # output the HTML in a standard indented format
'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>\n <body>\n <p class="title">\n <b>\n The demo python introduces several python courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
>>> soup.a.name  # every <tag> has a name, available as <tag>.name
'a'
>>> soup.p.name
'p'
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
>>>

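6.5 Traversing the tag tree

Downward traversal of the tag tree: .children iterates over child nodes and .descendants over all descendant nodes

Upward traversal of the tag tree: .parent and .parents walk through all ancestor nodes, including soup itself

Sibling traversal of the tag tree: .next_sibling and .previous_sibling move between nodes that share the same parent

Example:

from bs4 import BeautifulSoup
import requests
demo = requests.get("http://python123.io/ws/demo.html").text
soup = BeautifulSoup(demo,"html.parser")
# downward traversal of the tag tree
print("Child nodes:\n")
for child in soup.body.children:
    print(child)

print("Descendant nodes:\n")
for child1 in soup.body.descendants:
    print(child1)

# upward traversal of the tag tree
print(soup.title.parent)
print(soup.html.parent)
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# sibling traversal of the tag tree
print(soup.a.next_sibling)
print(soup.a.next_sibling.next_sibling)
print(soup.a.previous_sibling)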

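7 Regular Expressions

Regular expressions are a powerful tool for processing strings. They have their own syntax and can search, replace, and validate strings. For a crawler, they make it very convenient to extract information from HTML. Python's re library provides a full regular expression implementation.

7.1 A first example

The online tester at http://tool.oschina.net/regex is well suited to beginners: enter the text to match, choose one of the common regular expressions, and see the matches.

Enter the following text:

hello,my phone is 18898566588 and email is david@gmail.com, and wen is https://www.cnblogs.com/wenwei-blog/

Click the "匹配Email地址" (match email address) option to match the email address in the text.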

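7.2 Common regular expression rules

'.' matches any single character except \n

'-' denotes a range inside brackets, as in [0-9]

'*' matches the preceding subexpression zero or more times; to match a literal *, use \*. re.findall("ab*","cabc3abcbbac") returns ['ab', 'ab', 'a']

'+' matches the preceding subexpression one or more times; to match a literal +, use \+

'^' matches the start of the string

'$' matches the end of the string

'\' is the escape character and changes the meaning of the character after it. To match a literal * in a string, use \* or the character set [*]. re.findall(r'3\*','3*ds') returns ['3*']

'?' matches the preceding character zero or one time. re.findall('ab?','abcabcabcadf') returns ['ab', 'ab', 'ab', 'a']

'{m}' matches the preceding character exactly m times. re.findall('cb{1}','bchbchcbfbcbb') returns ['cb', 'cb']

'{n,m}' matches the preceding character n to m times. re.findall('cb{2,3}','bchbchcbfbcbb') returns ['cbb']

'\d' matches a digit, equivalent to [0-9] for ASCII input. re.findall(r'\d','tel:10086') returns ['1', '0', '0', '8', '6']

'\D' matches a non-digit, equivalent to [^0-9] for ASCII input. re.findall(r'\D','tel:10086') returns ['t', 'e', 'l', ':']

'\w' matches letters, digits, and the underscore, equivalent to [A-Za-z0-9_] for ASCII input. re.findall(r'\w','alex123,./;;;') returns ['a', 'l', 'e', 'x', '1', '2', '3']

'\W' matches any character that \w does not, equivalent to [^A-Za-z0-9_] for ASCII input. re.findall(r'\W','alex123,./;;;') returns [',', '.', '/', ';', ';', ';']

'\s' matches a whitespace character. re.findall(r'\s','3*ds \t\n') returns [' ', '\t', '\n']

'\S' matches a non-whitespace character. re.findall(r'\S','3*ds \t\n') returns ['3', '*', 'd', 's']

'\A' matches the start of the string

'\Z' matches the end of the string

'\t' matches a tab character

'\b' matches at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is marked by whitespace or a non-alphanumeric character

'\B' is the opposite of \b and matches only where the current position is not a word boundary

'(?P<name>...)' is a named group, which gets an alias in addition to its number. re.search("(?P<province>[0-9]{4})(?P<city>[0-9]{2})(?P<birthday>[0-9]{8})","371481199306143242").groupdict() returns {'province': '3714', 'city': '81', 'birthday': '19930614'}

[] defines a set of characters to match. For example, [a-zA-Z0-9] means the character at that position must be an English letter or a digit, and [\s*] means a whitespace character or *

[^...] matches any character not listed in the brackets. For example, [^abc] matches any character other than a, b, and c

.* is greedy: it first matches as much as it can, then backtracks as far as the rest of the expression requires

.*? is lazy: it matches as few characters as possible while still letting the rest of the expression match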
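
The difference between greedy and lazy matching is easiest to see side by side. A short illustration on a made-up HTML fragment:

```python
import re

html = '<b>first</b> and <b>second</b>'

greedy = re.findall(r'<b>(.*)</b>', html)
lazy = re.findall(r'<b>(.*?)</b>', html)

print(greedy)  # ['first</b> and <b>second']
print(lazy)    # ['first', 'second']
```

The greedy version runs to the last closing tag in the string, which is why scraping patterns almost always use the lazy form.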

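7.3 Common matching functions and methods

Function	Purpose
re.match(pattern, string, flags=0)	Matches from the start of the string; returns None if the match fails at the starting position
re.search(pattern, string, flags=0)	Scans the whole string and returns the first successful match
re.findall(pattern, string, flags=0)	Finds all substrings matched by the RE and returns them as a list
re.finditer(pattern, string, flags=0)	Finds all matches of the RE and returns them as an iterator of match objects
re.sub(pattern, repl, string, count=0, flags=0)	Replaces the matched substrings

Parameters:

pattern: the regular expression to match

string: the string to search

flags: flags that control how the pattern is matched, such as case sensitivity or multi-line matching

repl: the replacement string; it can also be a function

count: the maximum number of replacements; the default 0 replaces all matches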

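Example 1:

#!/usr/bin/python3
import os
import re
# sub: replace
phone = '18898537584 # this is my phone number'
print('My phone number:', re.sub(r'#.*', '', phone))  # strip the comment
print(re.sub(r'\D', '', phone))  # keep digits only
# search: first IPv4-looking address in the ifconfig output (Unix-like systems)
ip_addr = re.search(r'(\d{1,3}\.){3}\d{1,3}', os.popen('ifconfig').read())
print(ip_addr.group() if ip_addr else None)
# match: anchored at the start of the string
a = re.match(r'\d+', '2ewrer666dad3123df45')
print(a.group())  # prints 2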
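
The repl argument of re.sub can be a function, and count limits how many replacements are made. A brief sketch of both, where the sample text and the helper name double are made up for illustration:

```python
import re

text = 'apple 3, pear 12, plum 7'

def double(match):
    """Replacement callback: receives a match object, returns a string."""
    return str(int(match.group()) * 2)

all_doubled = re.sub(r'\d+', double, text)
first_only = re.sub(r'\d+', double, text, count=1)

print(all_doubled)  # apple 6, pear 24, plum 14
print(first_only)   # apple 6, pear 12, plum 7
```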

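Methods of the match object:

Method	Purpose
group(num=0)	Returns the string matched by the whole expression. group() also accepts several group numbers at once, in which case it returns a tuple of the corresponding groups
groups()	Returns a tuple containing all subgroups, from group 1 to the last
groupdict()	Returns a dict keyed by the names of named groups, with the substring each group captured as the value
start()	Returns the position where the match starts
end()	Returns the position where the match ends
span()	Returns a tuple (start, end) of the match positions

What are groups in the re module for?

(1) Checking whether a match occurred. (2) Extracting the value captured by each group.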

>>> import re
>>> print(re.search(r'(\d+)-([a-z])','34324-dfsdfs777-hhh').group(0))  # the whole match
34324-d
>>> print(re.search(r'(\d+)-([a-z])','34324-dfsdfs777-hhh').group(1))  # the first group
34324
>>> print(re.search(r'(\d+)-([a-z])','34324-dfsdfs777-hhh').group(2))  # the second group
d
>>> print(re.search(r'(\d+)-([a-z])','34324-dfsdfs777-hhh').group(3))  # no third group, so this raises "no such group"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
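7.4 The re.compile function

The compile function compiles a regular expression into a pattern (Pattern) object. Syntax:

re.compile(pattern[, flags])

Parameters:

pattern: a regular expression given as a string
flags: optional; the matching mode, such as ignoring case or multi-line mode. The values are:
re.I ignores case
re.L makes the special character classes \w, \W, \b, \B, \s, \S depend on the current locale
re.M multi-line mode
re.S makes . match any character including a newline (by default . does not match a newline)
re.U makes the special character classes \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character database
re.X ignores whitespace and comments after # in the pattern, for readability
The most commonly used are re.I and re.S.

>>> import re
>>> pattern = re.compile(r'\d+',re.S)  # matches one or more digits
>>> res = re.findall(pattern,"my phone is 18898566588")
>>> print(res)
['18898566588']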
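
The effect of the most common flags can be checked on a two-line string. A small sketch with made-up sample text:

```python
import re

text = 'Hello\nWORLD'

# re.I: ignore case
case_sensitive = re.findall(r'world', text)
ignore_case = re.findall(r'world', text, re.I)

# re.S: let . match the newline between the two words
no_dotall = re.findall(r'Hello.WORLD', text)
dotall = re.findall(r'Hello.WORLD', text, re.S)

# re.M: let ^ match at the start of every line
single_line = re.findall(r'^\w+', text)
multi_line = re.findall(r'^\w+', text, re.M)

# Flags are combined with |
combined = re.compile(r'hello.world', re.I | re.S).findall(text)

print(ignore_case, dotall, multi_line, combined)
```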

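7.5 Crawling the Maoyan movie TOP ranking

This example uses the requests library and regular expressions to scrape the Maoyan TOP100 movie board. requests is more convenient to use than urllib.

Target

Extract the title, release time, score, image, and other fields of the movies on the Maoyan TOP board. The site URL is https://maoyan.com/board/4

The results are saved to a file.

URL analysis

Open https://maoyan.com/board/4, go to the second and third pages, and watch how the URL changes.

Page 2: https://maoyan.com/board/4?offset=10

Page 3: https://maoyan.com/board/4?offset=20

The only part that changes is offset=x. To fetch the top 100 movies, send 10 separate requests with offset set to 0, 10, 20, ..., 90.

Source analysis and regex extraction

Open the page and press F12 to view the source. Each movie corresponds to one dd node. First extract the rank, which sits in the i node whose class is board-index. Using lazy matching to capture the content of that i node, the regular expression is:

<dd>.*?board-index.*?>(.*?)</i>

Next comes the movie image. An a node follows, containing two img nodes; inspection shows that the data-src attribute of the second img node is the image link. Capturing that attribute, together with the movie title in the a node under the element whose class is name, extends the regular expression to:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>

The cast, release time, and score are extracted on the same principle. The final regular expression is:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i></p>.*?</dd>

Note: do not read the source in the Elements tab, because the markup there may have been modified by JavaScript and differ from the original response. Check the source of the original response in the Network tab instead.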
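
The final expression can be tried offline before sending any requests. The markup below is hypothetical: it only imitates the dd structure described above, and the real page may differ in detail. The titles and image URLs are invented.

```python
import re

# Hypothetical markup imitating the structure described above.
DD_TEMPLATE = '''
<dd>
  <i class="board-index board-index-{rank}">{rank}</i>
  <a href="/films/{rank}">
    <img src="placeholder.png" />
    <img data-src="https://example.com/poster{rank}.jpg" />
  </a>
  <p class="name"><a href="/films/{rank}">{title}</a></p>
  <p class="star">Starring: A, B</p>
  <p class="releasetime">Released: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">{fraction}</i></p>
</dd>
'''
html = (DD_TEMPLATE.format(rank=1, title='First Movie', fraction=5)
        + DD_TEMPLATE.format(rank=2, title='Second Movie', fraction=4))

pattern = re.compile(
    r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    r'.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    r'.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i></p>.*?</dd>',
    re.S,  # let . match newlines, since one <dd> spans several lines
)

items = pattern.findall(html)
for item in items:
    print(item)
```

Each tuple holds the seven captured groups in order: rank, image URL, title, cast, release time, and the integer and fractional parts of the score.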

Putting the code together

import json
import re
import time

import requests
from requests.exceptions import RequestException  # base class of requests errors

def get_one_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:  # judge the result by the status code
            return response.text  # return the page content
        return None
    except RequestException:
        return None

def parse_one_page(html):
    # re.S lets . match newlines, because one <dd> node spans several lines
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)  # list of tuples, one per matched movie
    for item in items:
        yield {  # yield makes this function a generator of dicts
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character label prefix
            'time': item[4].strip()[5:],  # drop the 5-character label prefix
            'score': item[5] + item[6]  # join the integer and fractional parts of the score
        }

def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:  # append results to the file
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)

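8 The Scrapy Framework

Scrapy is an application framework for crawling websites and extracting structured data. It can be used in data mining, information processing, archiving historical data, and similar programs.
It was originally designed for page scraping (more precisely, web scraping), and it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose crawler. Scrapy is widely used for data mining, monitoring, and automated testing. It uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:

[Figure: overall Scrapy architecture]

Scrapy consists mainly of the following components:

Engine (Scrapy Engine)
Controls the data flow of the whole system and triggers events (the core of the framework)
Scheduler
Accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses or links of pages to crawl): it decides which URL to crawl next and removes duplicate URLs
Downloader
Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model)
Spiders
Spiders do the main work: they extract the information you need, the so-called Items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page
Item Pipeline
Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a page is parsed by a spider, the items are sent to the item pipeline and processed in a specific order
Downloader Middlewares
Hooks that sit between the Scrapy engine and the downloader and process the requests and responses passing between them
Spider Middlewares
Hooks that sit between the Scrapy engine and the spiders and process the spiders' response input and request output
Scheduler Middlewares
Middleware that sits between the Scrapy engine and the scheduler and processes the requests and responses sent from the engine to the scheduler

The Scrapy workflow is roughly:

The engine takes a URL from the scheduler for the next crawl
The engine wraps the URL in a Request and passes it to the downloader
The downloader fetches the resource and wraps it in a Response
The spider parses the Response
If an Item is parsed out, it is handed to the item pipeline for further processing
If a link (URL) is parsed out, the URL is handed to the scheduler to await crawling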
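
The workflow above can be imitated in a few lines of plain Python. This is a toy model written for illustration only, not Scrapy code and not how Scrapy is implemented: a dict plays the role of the web, and every URL and item in it is invented.

```python
from collections import deque

# A fake "web": URL -> (items found on the page, links found on the page).
FAKE_WEB = {
    '/page1': (['item-a'], ['/page2', '/page3']),
    '/page2': (['item-b'], ['/page3']),
    '/page3': (['item-c'], ['/page1']),
}

def crawl(start_url):
    scheduler = deque([start_url])   # queue of pending requests
    seen = {start_url}               # the scheduler drops duplicate URLs
    pipeline = []                    # where parsed items end up
    while scheduler:
        url = scheduler.popleft()        # engine takes the next URL
        items, links = FAKE_WEB[url]     # "downloader" and "spider" in one step
        pipeline.extend(items)           # items go to the item pipeline
        for link in links:               # new links go back to the scheduler
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return pipeline

result = crawl('/page1')
print(result)  # ['item-a', 'item-b', 'item-c']
```

The seen set is what keeps the loop from revisiting /page1 when /page3 links back to it.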

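Common scrapy commands

scrapy startproject <project name>  creates a new crawler project

scrapy genspider <spider name> <domain>  creates a spider inside the project (the spider name must be unique)

scrapy list  lists all spider names

scrapy crawl <spider name>  runs a spider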

8.1 Scrapy project 1: crawling the Douban movie TOP250

Target fields: rank, title, score, and number of ratings

Create the project and the spider

scrapy startproject DoubanMovieTop

cd DoubanMovieTop

scrapy genspider douban movie.douban.com

Change the default user agent and disable robots.txt checking

Modify the following settings in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

ROBOTSTXT_OBEY = False

An Item is declared with simple class definition syntax and Field objects.

Declare the Item in items.py with the following code:

import scrapy
class DoubanmovietopItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # rank
    ranking = scrapy.Field()
    # movie title
    movie_name = scrapy.Field()
    # score
    score = scrapy.Field()
    # number of ratings
    score_num = scrapy.Field()

Analyze the page source and extract the required fields

# -*- coding: utf-8 -*-
import scrapy
from DoubanMovieTop.items import DoubanmovietopItem

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    #allowed_domains = ['movie.douban.com']
    def start_requests(self):
        start_url = 'https://movie.douban.com/top250'
        yield scrapy.Request(start_url)

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            item = DoubanmovietopItem()  # a fresh item per movie, so yielded items are independent
            item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            # selector results also offer .re(); the Chinese literal matches the rating-count text on the page
            item['score_num'] = movie.xpath('.//div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
            yield item
        next_url = response.xpath('//span[@class="next"]/a/@href').extract()
        if next_url:  # the last page has no next link
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield scrapy.Request(next_url)
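
The spider builds the next-page URL by string concatenation, which works as long as the href is a bare query string. A more general alternative, sketched here with the standard library, is urllib.parse.urljoin; the href values below are assumed examples, not values taken from the page.

```python
from urllib.parse import urljoin

base = 'https://movie.douban.com/top250'
href = '?start=25&filter='   # assumed example of a next-page href

joined = urljoin(base, href)
print(joined)  # https://movie.douban.com/top250?start=25&filter=

# urljoin also copes with absolute paths and full URLs, which plain
# string concatenation would get wrong.
from_path = urljoin(base, '/top250?start=50')
from_full = urljoin(base, 'https://movie.douban.com/top250?start=75')
print(from_path)
print(from_full)
```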