"Python 3 Web Crawler Development in Practice" (《Python3網絡爬蟲開發實戰》): Using the Basic Libraries

Date: 2019-11-21


1. urllib:

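request: The most basic HTTP request module, used to simulate sending requests. Just as you would type a URL into a browser and press Enter, you pass a URL and any extra parameters to the library's functions to simulate that process.
error: The exception-handling module. It defines the exceptions raised by request (URLError and its subclass HTTPError), so that a failed request can be caught and handled.
parse: A utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: Mainly used to read a site's robots.txt file and determine which pages may be crawled and which may not. In practice it is used relatively rarely.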
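
As a rough illustration of how request and error fit together, the sketch below wraps urlopen in a small helper. The helper name fetch, its default timeout, and the choice to return None on failure are assumptions made for this example and do not come from the book.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=5):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print('HTTP error:', e.code, e.reason)
    except urllib.error.URLError as e:
        # Lower-level failures, such as DNS errors or refused connections.
        print('URL error:', e.reason)
    return None
```

Calling fetch('https://www.python.org') would return the page source as a string, or print the reason and return None if the request fails.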

2. Handler classes:

When you need more advanced functionality, use a Handler.

import http.cookiejar
import urllib.request

filename = 'cookies.txt'
# Alternatives: an in-memory jar, or a jar stored in Mozilla format.
# cookie = http.cookiejar.CookieJar()
# cookie = http.cookiejar.MozillaCookieJar(filename)
cookie = http.cookiejar.LWPCookieJar(filename)
# Load previously saved cookies; cookies.txt must already exist in LWP format.
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# for item in cookie:
#     print(item.name + "=" + item.value)

# To create cookies.txt in the first place, skip load() and save after the request:
# cookie.save(ignore_discard=True, ignore_expires=True)
print(response.read().decode('utf-8'))

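3. urljoin

We can pass a base_url (the base link) as the first argument and the new link as the second. The function analyzes the scheme, netloc, and path of base_url and uses them to fill in whatever parts the new link is missing, then returns the result.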
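
The behavior described above can be sketched with a few calls. The URLs are illustrative only, and example.com is a placeholder that does not appear in the original notes.

```python
from urllib.parse import urljoin

base_url = 'http://www.baidu.com/about.html'

# A relative path replaces the last path segment of the base.
relative = urljoin(base_url, 'FAQ.html')
# A complete URL is returned unchanged; nothing is taken from the base.
absolute = urljoin(base_url, 'https://example.com/FAQ.html')
# A query-only link keeps the base's scheme, netloc, and path.
query_only = urljoin(base_url, '?category=2')

print(relative)    # http://www.baidu.com/FAQ.html
print(absolute)    # https://example.com/FAQ.html
print(query_only)  # http://www.baidu.com/about.html?category=2
```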

4. urlencode()

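from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': '23'
}
base_url = 'http://www.baidu.com?'
# Serialize the dict into a GET query string and append it to the base URL.
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=germey&age=23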

5. parse_qs

Deserialization: converts the parameters of a GET request back into a dictionary.

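from urllib.parse import parse_qs

query = 'name=germey&age=22'
# Each value is a list, because a key may appear more than once in a query string.
print(parse_qs(query))  # {'name': ['germey'], 'age': ['22']}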

parse_qsl: converts the parameters into a list of tuples.

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))  # [('name', 'germey'), ('age', '22')]

6. quote

Converts content into URL-encoded (percent-encoded) form.
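
A minimal sketch of quote and its inverse, unquote, follows. The keyword and the search URL are chosen only for illustration.

```python
from urllib.parse import quote, unquote

keyword = '壁纸'
# Non-ASCII characters are encoded as UTF-8 bytes, each written as %XX.
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)  # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

# unquote reverses the encoding.
decoded = unquote(url)
print(decoded)  # https://www.baidu.com/s?wd=壁纸
```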

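7. Analyzing the Robots protocol

1. The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, generally placed in the site's root directory.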
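
To make the file format concrete, here is a small sketch that feeds a made-up robots.txt to the standard library's urllib.robotparser. The rules and the example.com URLs are hypothetical. Python's parser applies the first matching rule, which is why the Allow line comes before the broader Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler may fetch /public/ and nothing else.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""

rp = RobotFileParser()
# parse() takes the file's lines directly, so no network access is needed.
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch('*', 'http://example.com/public/index.html')
blocked = rp.can_fetch('*', 'http://example.com/private/data.html')
print(allowed, blocked)  # True False
```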

2. robotparser

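set_url: Sets the URL of the robots.txt file. If the URL was passed in when the RobotFileParser object was created, there is no need to call this method.

read: Fetches the robots.txt file and parses it. If neither this method nor parse has been called, every subsequent check returns False, so remember to call it. It returns nothing; it only performs the fetch and the parse.

parse: Parses robots.txt content. The argument is a list of lines from robots.txt, which are analyzed according to the robots.txt syntax rules.

can_fetch: Takes two arguments: the first is the User-agent, the second is the URL to fetch. It returns True or False, indicating whether that user agent may fetch the URL.

mtime: Returns the time robots.txt was last fetched and parsed. This matters for crawlers that run for a long time, which may need to check periodically and fetch the latest robots.txt again.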

modified: Also useful for long-running crawlers. It sets the time robots.txt was last fetched and parsed to the current time.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
# Fetch and parse robots.txt before asking any can_fetch questions.
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))

8. requests

1. get:

import re
import requests

# Browser identification (User-Agent); without it the site refuses the request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
r = requests.get("http://www.zhihu.com/explore", headers=headers)
# Extract the question titles from the page source with a regular expression.
pattern = re.compile('explore-feed.*?question.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)

# Binary data such as images is read from r.content and written in 'wb' mode.
r = requests.get("http://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

2. post:

import requests

data = {
    'name': 'name',
    'age': '22'
}
# The dict is sent as form data in the request body.
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

r = requests.get('http://www.zhihu.com')
print(type(r.status_code), r.status_code)  # status code
print(type(r.headers), r.headers)  # response headers
print(type(r.cookies), r.cookies)  # cookies
print(type(r.url), r.url)  # URL
print(type(r.history), r.history)  # request history

9. Advanced usage of requests:

1. File upload:

import requests

# Upload favicon.ico as the form field 'file'; the with block closes the file afterwards.
with open('favicon.ico', 'rb') as f:
    r = requests.post("http://httpbin.org/post", files={'file': f})
print(r.text)

2. cookies:

import requests

# Read the cookies set by the server.
r = requests.get("http://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

# Send cookies copied from a logged-in browser session to keep the login state.
# The request header is named 'Cookie' (singular).
headers = {
    'Cookie': 'tst=r; __utma=51854390.2112264675.1539419567.1539419567.1539433913.2; __utmb=51854390.0.10.1539433913; __utmc=51854390; __utmv=51854390.100--|2=registration_date=20160218=1^3=entry_date=20160218=1; __utmz=51854390.1539433913.2.2.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; tgw_l7_route=e0a07617c1a38385364125951b19eef8; q_c1=d3c7341e344d460ead79171d4fd56f6f|1539419563000|1516290905000; _xsrf=713s0UsLfr6m5Weplwb4offGhSqnugCy; z_c0="2|1:0|10:1533128251|4:z_c0|92:Mi4xS2VDaEFnQUFBQUFBZ09DVGo1ZUtEU1lBQUFCZ0FsVk5PX3hPWEFEVXNtMXhSbmhjbG5NSjlHQU9naEpLbkwxYlpB|e71c25127cfb23241089a277f5d7c909165085f901f9d58cf93c5d7ec7420217"; d_c0="AIDgk4-Xig2PTlryga7LwT30h_-3DUHnGbc=|1525419053"; __DAYU_PP=zYA2JUmBnVe2bBjq7qav2ac8d8025bbd; _zap=d299f20c-20cc-4202-a007-5dd6863ccce9',
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
r = requests.get("http://www.zhihu.com", headers=headers)
print(r.text)

3. Session persistence:

import requests

# Two independent requests.get calls do not share cookies,
# so the cookie set by the first call is missing from the second.
requests.get("http://httpbin.org/cookies/set/umber/123456789")
r = requests.get("http://httpbin.org/cookies")
print(r.text)

# A Session keeps cookies between requests, so the cookie is still there.
s = requests.Session()
s.get("http://httpbin.org/cookies/set/umber/123456789")
r = s.get('http://httpbin.org/cookies')
print(r.text)

Output:

{
  "cookies": {}
}

{
  "cookies": {
    "umber": "123456789"
  }
}

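4. SSL certificate verification

requests also provides certificate verification. When it sends an HTTPS request, it checks the SSL certificate, and the verify parameter controls whether that check is performed. If verify is omitted, it defaults to True and the certificate is verified automatically.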

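import logging
import requests
# from requests.packages import urllib3

# Route the InsecureRequestWarning into logging instead of printing it.
logging.captureWarnings(True)
# Alternative: suppress the warning entirely.
# urllib3.disable_warnings()
# verify=False skips certificate verification.
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)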

5. Proxy settings
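
The notes give no code for this item, so the following is only a sketch of one common approach: requests accepts a proxies argument that maps a URL scheme to a proxy URL. The proxy addresses and the helper name fetch_via_proxy are placeholders invented for this example.

```python
import requests

# Placeholder addresses for illustration only; substitute a proxy you can reach.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}


def fetch_via_proxy(url, timeout=5):
    """Send a single GET request through the proxies defined above."""
    return requests.get(url, proxies=proxies, timeout=timeout)


# To reuse the same proxies for every request, attach them to a Session instead.
session = requests.Session()
session.proxies.update(proxies)
print(session.proxies)
```

With a working proxy in place, fetch_via_proxy('http://httpbin.org/get') would return the response as seen through that proxy.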

6. Timeout settings

r = requests.get('http://www.taobao.com', timeout=1)
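
A request that exceeds its timeout raises an exception, so in practice the call is usually wrapped in try/except. The sketch below assumes a hypothetical helper named fetch_status. The (connect, read) tuple form of timeout and the exception classes are part of requests, while the default values are arbitrary.

```python
import requests


def fetch_status(url, connect_timeout=3, read_timeout=10):
    """Return the HTTP status code, or None if the request fails or times out."""
    try:
        # A (connect, read) tuple sets the two timeouts separately.
        r = requests.get(url, timeout=(connect_timeout, read_timeout))
        return r.status_code
    except requests.exceptions.Timeout:
        print('request timed out')
    except requests.exceptions.RequestException as e:
        # Base class for the other errors requests raises, e.g. connection failures.
        print('request failed:', e)
    return None
```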

7. Authentication

import requests
from requests.auth import HTTPBasicAuth

# HTTP Basic authentication; auth=('username', 'password') is an equivalent shorthand.
r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)

8. Prepared Request: representing a request as a data structure

from requests import Request, Session

url = 'http://httpbin.org/post'
data = {
    'name': 'germey'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
s = Session()
req = Request('POST',url,data=data,headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)