Python: crawler development workflow, requests, regular expressions, and re basics

Date: 2019-11-19


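I. Understanding web crawlers

1.1 What a crawler is

  A web crawler is a program that automatically downloads and saves network resources in bulk.

  A web crawler is a program that poses as a client and exchanges data with a server.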

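1.2 Related concepts

  1.2.1 Network application architectures

    - C/S: client/server

    - B/S: browser/server

    - M/S: mobile/server

    In every case the client sends a request to the server; the server receives and processes it, then returns a response whose body contains the result.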

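  1.2.2 The HTTP protocol

    1. HTTP stands for Hypertext Transfer Protocol. It transfers data from a server to a client: the client starts with no data and downloads everything from the server.

    2. A complete HTTP transaction:

      1. The user enters the request URL in the browser, and DNS resolves the domain name to an IP address so the server can be located.

      2. The client opens a TCP connection to the server (TCP/IP three-way handshake).

      3. The client sends an HTTP request message.

      4. The server receives and processes the request and returns a response containing the result.

      5. The browser renders and displays the response.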
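The transaction steps above can be tried without touching the internet. The following is a minimal sketch, assuming a throwaway server on the loopback interface; HelloHandler and round_trip are made-up names for illustration, and DNS resolution (step 1) is skipped because the address is already an IP.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a fixed body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def round_trip():
    # Port 0 asks the operating system for any free port
    server = ThreadingHTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", "/")      # opens the TCP connection and sends the request
        resp = conn.getresponse()     # reads the status line and headers
        status, body = resp.status, resp.read()
        conn.close()
        return status, body
    finally:
        server.shutdown()
        server.server_close()


status, body = round_trip()
print(status, body.decode())
```

A browser would add the last step on its own: rendering the body it received.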

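  1.2.3 Requests

    A complete request message has four parts: request line, request headers, a blank line, and the request data (request body).

      Layout: request method, path, protocol and version, CRLF, "Host: www.??.com" (domain and port; the default port is omitted), CRLF, CRLF. A GET request carries no body.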
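To make the four parts concrete, here is a small sketch that assembles a raw request by hand. build_request is a hypothetical helper written for this note (it is not part of any library), and www.example.com is a placeholder host.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble a raw HTTP/1.1 request: request line, headers, blank line, body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        # A request with a body must say how long the body is
        lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines) + "\r\n\r\n"   # the empty line ends the header section
    return head.encode("ascii") + body


get_request = build_request("GET", "/", "www.example.com")
post_request = build_request("POST", "/post", "www.example.com", body=b"key=value")
print(get_request)
print(post_request)
```

The GET request ends right after the blank line, while the POST request carries its data after it.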

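    1. Request methods

      HTTP/1.0: GET, POST, HEAD

      HTTP/1.1: adds OPTIONS, PUT, DELETE, TRACE, and CONNECT

    2. Request headers

      Each header has the form 'name + : + space + value'.

    3. Response

      Status line, response headers, a blank line, and the response body.

    4. Response status codes (three-digit integers in five classes: 1xx informational, 2xx success, 3xx redirection, 4xx client error, 5xx server error)

    5. Response headers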
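The response layout can be checked the same way. This is a rough sketch, assuming a well-formed response held fully in memory; parse_response and the sample bytes are invented for illustration, and real responses (chunked bodies, repeated headers) need more care.

```python
STATUS_CLASSES = {1: "informational", 2: "success", 3: "redirection",
                  4: "client error", 5: "server error"}


def parse_response(raw):
    """Split raw response bytes into (version, status, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")   # the blank line ends the headers
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, status, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(status), reason, headers, body


sample = (b"HTTP/1.1 404 Not Found\r\n"
          b"Content-Type: text/plain\r\n"
          b"Content-Length: 9\r\n"
          b"\r\n"
          b"not found")
version, status, reason, headers, body = parse_response(sample)
status_class = STATUS_CLASSES[status // 100]   # the first digit gives the class
print(version, status, reason, status_class, headers, body)
```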

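  1.2.4 Session techniques

    cookie, session

      cookie: a credential; the information is stored on the client, which is less secure.

      session: a conversation state built on cookies; the information is stored on the server.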
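As a small illustration of the cookie side, the sketch below parses a Set-Cookie value with the standard library and builds the Cookie header a client would send back. The header value is made up; a real server chooses its own names and attributes.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value sent by a server
set_cookie = "sessionid=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie)

# The client stores the cookie and sends it back on later requests
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)
print(jar["sessionid"]["path"])
```

With a server-side session, the cookie usually carries only an identifier like this one, and the server looks up the actual data under that identifier.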

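    Session in requests

      Requests implements a Session object. Like a browser that has not been closed, a Session keeps state (such as cookies) across requests. It is commonly used to fetch data after logging in.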

import requests

session = requests.Session()

# The first request makes the server set a cookie; the session keeps it
session.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
resp = session.get('http://httpbin.org/cookies')

print(resp.text)

# Headers and cookies passed to get() apply to that one request only.
# To apply them for the whole lifetime of the session, set them on the session:
session.headers = {
    'user-agent': 'my-app/0.0.1'
}
# Add one more header
session.headers.update({'x-test': 'true'})

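  1.2.5 Network resources

    Network resource: any resource that can be accessed over the internet.

  1.2.6 URL

    Uniform Resource Locator: a web address that uniquely identifies a resource.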
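A URL packs several pieces of information into one string. As a quick sketch, the standard library can take one apart; the URL below is only an example.

```python
from urllib.parse import parse_qs, urlsplit

parts = urlsplit("http://httpbin.org:80/get?key1=value1&key2=value2#top")

print(parts.scheme)    # http
print(parts.hostname)  # httpbin.org
print(parts.port)      # 80
print(parts.path)      # /get
query = parse_qs(parts.query)
print(query)           # {'key1': ['value1'], 'key2': ['value2']}
print(parts.fragment)  # top
```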

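  1.2.7 HTTPS

    Weakness of HTTP: data is sent in plaintext, so private data is not secure.

    HTTPS is HTTP layered on the SSL/TLS protocol:

      - CA certificates

      - Encryption

      - Standard ports: 80 (HTTP), 443 (HTTPS)

      - The session key is regenerated for every connection, which makes it very hard to crack.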
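The certificate check and the port numbers can be seen from Python's standard library. This is only a sketch of the client-side defaults, not a full HTTPS exchange.

```python
import http.client
import ssl

context = ssl.create_default_context()

# The default client context verifies the server certificate against trusted CAs
# and checks that the certificate matches the host name
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)

# Standard ports as defined by the standard library
print(http.client.HTTP_PORT, http.client.HTTPS_PORT)
```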

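1.3 Application areas

  1.3.1 Data collection

    Crawlers fetch resources from the web. Data flows both ways: while you browse data on the internet, websites collect data about you as well.

  1.3.2 Search engines

    Every search engine relies on crawlers that fetch web pages around the clock and store them in its own index.

  1.3.3 Simulated operations

    Repetitive actions, such as flooding a forum (Tieba) with posts or running paid-poster accounts, can be automated with a crawler.

    Train tickets are hard to get around the Lunar New Year, so a program can keep trying to grab a ticket for you. 360's ticket-grabbing tool is implemented with a crawler.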

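II. Development workflow

  1. Analyze the request flow

    Goal: find the HTTP request that returns the target resource. Specifically, identify:

      1. The request method

      2. The URL

      3. The request headers

      4. The request data (parameters)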
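Once these four items are known, they can be written down as a request object before anything is sent. The sketch below uses urllib.request.Request from the standard library; the URL, header, and form data are placeholder values, as they might be copied from a browser's network panel.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values for the four items found during analysis
url = "http://httpbin.org/post"
headers = {"User-Agent": "my-app/0.0.1"}
data = urlencode({"key": "value"}).encode("ascii")

# Building the Request object does not send anything over the network
req = Request(url, data=data, headers=headers, method="POST")

print(req.get_method())               # POST
print(req.full_url)                   # http://httpbin.org/post
print(req.get_header("User-agent"))   # my-app/0.0.1 (Request stores names capitalized)
print(req.data)                       # b'key=value'
```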

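    Tools (packet capture)

      Fiddler: more complex to install and to use, but it can capture traffic that the Chrome developer tools cannot.

      Google Chrome: press F12 (or Ctrl+Shift+I) to open the developer tools.

  2. Send the request

    2.1 Sending an HTTP request through a socket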

from socket import socket

# Create a client socket
client = socket()

# Connect to the server
client.connect(('www.baidu.com', 80))

# Build the HTTP request message
data = b'GET / HTTP/1.0\r\nHost: www.baidu.com\r\n\r\n'

# Send the message (sendall keeps sending until every byte is written)
client.sendall(data)

# Receive the response, 1024 bytes at a time, until the server closes the connection
res = b''
temp = client.recv(1024)
print('*' * 20)
while temp:
    res += temp
    temp = client.recv(1024)

print(res)
client.close()

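    2.2 Libraries

      1. urllib: part of the Python standard library, dedicated to network requests

      2. urllib3: a third-party HTTP client library (despite the name, not part of the standard library)

      3. requests: simple and easy to use; built on top of urllib3

    2.3 requests

      In brief: an elegant and simple Python HTTP library, built for human beings. It is one of the most downloaded Python packages, with more than 400,000 downloads per day at the time of writing.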

import requests  # simple, clear, and easy to use

resp = requests.get('https://baidu.com')
# HTTP status code
resp.status_code
# Headers can be read like a dictionary
resp.headers['content-type']
# Character encoding of the response
resp.encoding
# Response body as text
resp.text

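      Requests covers essentially everything needed for web requests today. Its features:

        1. Keep-Alive and connection pooling
        2. International domains and URLs
        3. Sessions with cookie persistence
        4. Browser-style SSL verification
        5. Automatic content decoding
        6. Basic/Digest authentication
        7. Elegant key/value cookies
        8. Automatic decompression
        9. Unicode response bodies
        10. HTTP(S) proxy support
        11. Multipart file uploads
        12. Streaming downloads
        13. Connection timeouts
        14. Chunked requests
        15. .netrc support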

      Installing Requests: pip install requests (pip list shows the installed packages)

# Sending requests with Requests: request methods
import requests

resp = requests.get('https://baidu.com')
# POST request
resp = requests.post('http://httpbin.org/post', data={'key': 'value'})
# Other request types
resp = requests.put('http://httpbin.org/put', data={'key': 'value'})
resp = requests.delete('http://httpbin.org/delete')
resp = requests.head('http://httpbin.org/get')
resp = requests.options('http://httpbin.org/get')

# Passing URL parameters
params = {'key1': 'value1', 'key2': 'value2'}
resp = requests.get('http://httpbin.org/get', params=params)

# Custom headers
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
resp = requests.get(url, headers=headers)

# Custom cookies
url = 'http://httpbin.org/cookies'
cookies = {'cookies_are': 'working'}
resp = requests.get(url, cookies=cookies)
resp.text

# Proxies (placeholder addresses; replace them with a proxy you can reach)
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080'
}
requests.get('http://example.org', proxies=proxies)

# Redirects: do not follow them automatically
resp = requests.get('http://github.com', allow_redirects=False)
resp.status_code

# Disable certificate verification (verify defaults to True)
resp = requests.get('https://httpbin.org/get', verify=False)
# Suppress the InsecureRequestWarning shown when verification is disabled
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Timeout in seconds; a value this small will almost certainly raise requests.exceptions.Timeout
requests.get('http://github.com', timeout=0.001)

# Response body
resp.text
# Status code
resp.status_code
# Response headers
resp.headers
# Cookies returned by the server
resp.cookies
# Final URL
resp.url

  3. Get the response content

    Downloading an image with a socket: more tedious, but it shows the basics

# Download an image through a socket
from socket import socket

# Create the client
client = socket()
img_url = 'http://pic22.nipic.com/20120725/9676681_001949824394_2.jpg'

# Connect to the server
client.connect(('pic22.nipic.com', 80))

# Build the request message
data = b'GET /20120725/9676681_001949824394_2.jpg HTTP/1.0\r\nHost: pic22.nipic.com\r\n\r\n'

# Send the request
client.sendall(data)

# Receive the response
res = b''
temp = client.recv(1024)

while temp:
    res += temp
    temp = client.recv(1024)
client.close()

# Split headers from body at the first blank line only;
# the image bytes may contain b'\r\n\r\n' as well
headers, img_data = res.split(b'\r\n\r\n', 1)

# Save the image
with open(r'C:\Users\luowe\Desktop\test.jpg', 'wb') as f:
    f.write(img_data)

print('Done')

    Downloading an image with requests: short and direct

# Download an image with requests
import requests

img_url = 'http://pic22.nipic.com/20120725/9676681_001949824394_2.jpg'

# Receive the response
response = requests.get(img_url)
# Save the image (content holds the raw bytes)
with open(r'C:\Users\luowe\Desktop\test.jpg', 'wb') as f:
    f.write(response.content)
print('Done')

  4. Parse the content

    Response body: text (text) or binary data (content)

    Text: HTML, JSON, (JS, CSS)

      1. HTML parsing

      2. Regular expressions

      3. Beautiful Soup

      4. XPath

      JSON parsing: JSONPath
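As a dependency-free illustration of the parsing step, the sketch below uses only the standard library (html.parser and json) in place of Beautiful Soup, XPath, or JSONPath. LinkCollector and both sample strings are made up; they stand in for resp.text.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Hypothetical response bodies
html_text = '<html><body><a href="/page/1">one</a><a href="/page/2">two</a></body></html>'
json_text = '{"data": [{"thumbURL": "http://example.com/a.jpg"}, {"thumbURL": "http://example.com/b.jpg"}]}'

# HTML: walk the tags and pick out the links
collector = LinkCollector()
collector.feed(html_text)
links = collector.links

# JSON: decode into Python objects, then index into them
thumb_urls = [item["thumbURL"] for item in json.loads(json_text)["data"]]

print(links)
print(thumb_urls)
```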

  5. Persist the data

    1. Write to a file

    2. Write to a database
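Both options can be sketched with the standard library. The records, the file name, and the pages table below are invented for illustration; an in-memory SQLite database and a temporary directory keep the example self-contained.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical scraped records
records = [{"title": "first", "url": "http://example.com/1"},
           {"title": "second", "url": "http://example.com/2"}]

# 1. Write to a file (one JSON file holding all records)
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "records.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

# 2. Write to a database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, url TEXT)")
conn.executemany("INSERT INTO pages (title, url) VALUES (:title, :url)", records)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(json_path, row_count)
```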

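III. Key points and difficulties

  1.1 Getting the data (the hard part)

    1. Request-header checks (UA, the User-Agent)

    2. Cookies (handling Set-Cookie by hand is tedious; requests takes care of it)

    3. CAPTCHAs (click-based, arithmetic, distorted text, text, slider, audio, and so on)

    4. Behavior detection (click frequency and time spent on a page are used to decide whether a visitor is a crawler)

    5. Parameter encryption (MD5 hashing and other schemes)

    6. Font-based anti-crawling (the server obfuscates some data with its own custom font, which is hard to decode)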
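To give a feel for item 5, here is a sketch of one way a site might sign request parameters with MD5. The scheme (sorted query string plus a secret) and the names sign_params and demo-secret are entirely hypothetical; each real site uses its own rules, which have to be recovered from its JavaScript.

```python
import hashlib
from urllib.parse import urlencode


def sign_params(params, secret):
    """Hypothetical scheme: MD5 over the sorted query string plus a secret."""
    query = urlencode(sorted(params.items()))
    digest = hashlib.md5((query + secret).encode("utf-8")).hexdigest()
    return {**params, "sign": digest}


signed = sign_params({"word": "python", "page": 1}, secret="demo-secret")
print(signed)
```

A crawler has to reproduce the same signature, otherwise the server rejects the request.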

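  1.2 Crawling efficiency

    1. Concurrency: crawl with multiple threads or processes at the same time

    2. Asynchronous I/O: issue further requests while earlier ones are still waiting for a response

    3. Distributed crawling: servers monitor IP addresses, and an IP that makes too many requests may be banned, so the crawler is spread across different machines that crawl together.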
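The concurrency idea can be sketched with a thread pool. fake_fetch is a stand-in that sleeps instead of downloading, so the example runs offline; in a real crawler it would call something like requests.get.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a network request: sleeps instead of downloading."""
    time.sleep(0.2)
    return url, 200


urls = [f"http://example.com/page/{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fake_fetch, urls))   # results keep the input order
elapsed = time.perf_counter() - start

print(len(results), f"{elapsed:.2f}s")
```

Fetching the eight pages one after another would take about 1.6 seconds of waiting; with eight threads the waits overlap.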

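IV. Regular expressions

  1.1 Concept

    A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings.

  1.2 Characteristics

    1. Very flexible, logical, and powerful;

    2. Can achieve complex control over strings quickly and in a very compact way;

    3. Fairly obscure and hard to read for newcomers.

  1.3 Symbols

    1. Ordinary characters (case-sensitive): for example, 'testing' matches "testing" and "testing123" but not "Testing".

    2. Metacharacters: . ^ $ * + ? { } [ ] | ( ) \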
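A few quick checks of ordinary characters and metacharacters, using sample strings chosen only for illustration:

```python
import re

# Ordinary characters match themselves and are case-sensitive
print(bool(re.search('testing', 'testing')))     # True
print(bool(re.search('testing', 'testing123')))  # True
print(bool(re.search('testing', 'Testing')))     # False

# Metacharacters have a special meaning
print(re.findall(r'\d+', 'page 12 of 345'))      # ['12', '345']
print(re.findall(r'^h.t$', 'hat'))               # ['hat']
print(re.findall(r'colou?r', 'color colour'))    # ['color', 'colour']
print(re.findall(r'[abc]{2}', 'xabcx'))          # ['ab']

# Escape a metacharacter with a backslash to match it literally
print(re.findall(r'1\+1', '1+1=2'))              # ['1+1']
```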

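  1.4 Common regular-expression functions in Python

    The re module gives Python full regular-expression support.

    1. The re.match function

      re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

re.match(pattern, string, flags=0)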

    2. The re.search function

      re.search scans the whole string and returns the first successful match.

re.search(pattern, string, flags=0)

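    3. Difference between re.match and re.search

      re.match only matches at the beginning of the string: if the start does not fit the pattern, the match fails and the function returns None. re.search scans the whole string until it finds the first match.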
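The difference shows up on a single sample string (chosen here only for illustration):

```python
import re

text = 'www.example.com'

at_start = re.match('www', text)
not_at_start = re.match('com', text)
anywhere = re.search('com', text)

print(at_start.span())    # (0, 3)
print(not_at_start)       # None
print(anywhere.span())    # (12, 15)
```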

    4. Search and replace

      The re module provides re.sub to replace the matches in a string.

re.sub(pattern, repl, string, count=0, flags=0)
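A short sketch of re.sub on a made-up string, including the count parameter:

```python
import re

phone = '2004-959-559 # a phone number'

# Remove the trailing comment
number = re.sub(r'\s*#.*$', '', phone)
print(number)        # 2004-959-559

# Remove everything that is not a digit
digits = re.sub(r'\D', '', number)
print(digits)        # 2004959559

# count limits the number of replacements
first_only = re.sub('-', '', number, count=1)
print(first_only)    # 2004959-559
```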

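    5. The re.findall function

      Finds every substring that matches the regular expression and returns them as a list; returns an empty list if nothing matches.

      Note: match and search return at most one match; findall returns all of them.

re.findall(pattern, string, flags=0)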
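For example, on a made-up string:

```python
import re

text = 'runoob 123 google 456'

numbers = re.findall(r'\d+', text)
nothing = re.findall(r'xyz', text)
# With one capturing group, findall returns the group contents only
names = re.findall(r'(\w+) \d+', text)

print(numbers)   # ['123', '456']
print(nothing)   # []
print(names)     # ['runoob', 'google']
```

The capturing-group behavior is what the image example later in this post relies on.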

    6. The re.compile function

      compile compiles a regular expression into a Pattern object, whose methods such as match() and search() can then be used.

re.compile(pattern[, flags])
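A compiled pattern is handy when the same expression is used many times. A small sketch with an arbitrary sample string:

```python
import re

number = re.compile(r'\d+')
text = 'one12twothree34four'

print(number.match(text))            # None, the string does not start with a digit
print(number.search(text).group())   # 12
print(number.findall(text))          # ['12', '34']
```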

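    7. Regular-expression modifiers: optional flags

      A regular expression can take optional flags that control how matching works. Multiple flags are combined with bitwise OR ( | ); for example, re.I | re.M sets both the I and M flags.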
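The two flags mentioned above, alone and combined, on a made-up two-line string:

```python
import re

text = 'First line\nsecond LINE'

# re.I ignores case
ignore_case = re.findall('line', text, flags=re.I)
# re.M makes ^ and $ match at the start and end of every line
line_starts = re.findall(r'^\w+', text, flags=re.M)
# Both flags combined with bitwise OR
combined = re.findall(r'^second line$', text, flags=re.I | re.M)

print(ignore_case)   # ['line', 'LINE']
print(line_starts)   # ['First', 'second']
print(combined)      # ['second LINE']
```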

V. Fetching 30 images with requests

import re
import requests

page_url = 'http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1571210386205_R&pv=&ic=&nc=1&z=&hd=&latest=&copyright=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&sid=&word=妹子&f=3&oq=meiz&rsp=0'

# Download the HTML of the page
response = requests.get(page_url)
html = response.text

# Extract the image URLs
res = re.findall(r'"thumbURL":"(.*?)"', html)

# Download the images, posing as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
           'Referer': 'image.baidu.com'}
for index, url in enumerate(res):
    response = requests.get(url, headers=headers)
    # Write the image to disk
    with open(r'C:\Users\luowe\Desktop\爬圖\%s.jpg' % index, 'wb') as f:
        f.write(response.content)
    print(url)
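The extraction step in the script above can be tried offline. The fragment below is a made-up stand-in for the page source (the real page is much longer and its exact layout is not shown here); it illustrates why the pattern uses the non-greedy .*? rather than .*.

```python
import re

# Hypothetical fragment of the page source
sample_html = ('{"thumbURL":"http://example.com/img/1.jpg","width":100},'
               '{"thumbURL":"http://example.com/img/2.jpg","width":200}')

# .*? is non-greedy: it stops at the first closing quote
lazy = re.findall(r'"thumbURL":"(.*?)"', sample_html)
# .* is greedy: it runs on to the last quote in the text
greedy = re.findall(r'"thumbURL":"(.*)"', sample_html)

print(lazy)
print(len(greedy))
```

The non-greedy version yields one URL per image, while the greedy version swallows both entries into a single match.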