day02 - Introduction to Web Crawling

Date: 2019-11-15


1. Requesting Baidu Translate (POST)
　　Background
　　　　Requests sent by the desktop web version of Baidu Translate carry a sign parameter generated by JavaScript, so we target the mobile web version instead, whose request is simple to construct.
　　Building the request
　　　　requests.post(post_url, data=post_data, headers=headers)
　　Implementation
　　　　headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36'}
　　　　post_data = {
　　　　　　'query': query_string,
　　　　　　'from': 'zh',
　　　　　　'to': 'en',
　　　　}
　　　　post_url = 'https://fanyi.baidu.com/basetrans'
　　　　ret = requests.post(post_url, data=post_data, headers=headers)

import json

import requests

query_string = input('Enter Chinese text: ')

# POST request; the mobile User-Agent makes the server treat us as the mobile site
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36'}
post_data = {
    'query': query_string,
    'from': 'zh',
    'to': 'en',
}
post_url = 'https://fanyi.baidu.com/basetrans'

ret = requests.post(post_url, data=post_data, headers=headers)
body = ret.content.decode()
print(body)  # raw JSON string; it still needs to be parsed
result = json.loads(body)
print(result)
print(result['trans'][0]['dst'])  # the translated text
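
The last line of the script above indexes straight into the parsed JSON, which raises if the server returns an error body instead of a translation. A minimal sketch of a more defensive lookup follows. The helper name extract_translation and the sample body are made up for illustration; only the trans and dst field names come from the script above.

```python
import json


def extract_translation(body):
    """Return the first translated string from a JSON body, or None if absent."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None  # not JSON at all, e.g. an HTML error page
    if not isinstance(payload, dict):
        return None
    trans = payload.get('trans')
    if not isinstance(trans, list) or not trans:
        return None
    first = trans[0]
    if not isinstance(first, dict):
        return None
    return first.get('dst')


# Hypothetical sample body, shaped like the fields accessed in the script above
sample = '{"trans": [{"dst": "hello", "src": "你好"}]}'
print(extract_translation(sample))
print(extract_translation('not json'))
```

Returning None instead of raising lets the caller decide whether to retry or skip the input.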


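2. Using proxies

　　Why use a proxy
　　　　It makes the server believe the requests do not all come from the same client
　　　　It keeps our real address from being exposed and traced back to us
　　No proxy
　　　　browser ------> server
　　Reverse proxy
　　　　browser ------> nginx ------> server
　　Forward proxy
　　　　browser ------> proxy ------> server
　　Using a proxy
　　　　Usage
　　　　　　requests.get(url, proxies=proxies)
　　　　　　proxies takes a dict:
　　　　　　proxies = {
　　　　　　　　'http': 'http://12.34.56.79:9527'
　　　　　　}
　　Proxy sources
　　　　https://proxy.mimvp.com/freesecret.php
　　　　https://proxy.coderbusy.com/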

import requests

# Route plain-HTTP requests through this proxy
p = {
    'http': 'http://113.73.65.121:7153'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

ret = requests.get('http://www.baidu.com', headers=headers, proxies=p)
print(ret.status_code)


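　　Usage notes

　　　　Prepare a batch of IP addresses to form an IP pool, and pick one at random for each request
　　　　How to pick a proxy IP at random so that less-used addresses are more likely to be chosen
　　　　　　Represent each proxy as {'ip': ip, 'times': 0}
　　　　　　Put these dicts in a list, [{}, {}, {}, {}], and sort the list by use count
　　　　　　Take the 10 least-used IPs and pick one of them at random
　　　　Checking that an IP is usable
　　　　　　Pass a timeout to requests and judge the quality of the IP by whether it answers in time
　　　　　　Use an online proxy IP quality-checking site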
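
The selection rule described above (sort by use count, keep the 10 least-used, pick one at random) can be sketched as follows. The function names pick_proxy and to_proxies are illustrative, and the addresses are placeholders from a documentation range, not working proxies.

```python
import random


def pick_proxy(pool, top_n=10, rng=random):
    """Pick a proxy at random from the top_n least-used entries and count the use."""
    if not pool:
        raise ValueError('proxy pool is empty')
    # Sort by use count so the least-used proxies come first
    candidates = sorted(pool, key=lambda item: item['times'])[:top_n]
    chosen = rng.choice(candidates)
    chosen['times'] += 1
    return chosen


def to_proxies(item):
    """Build the dict that requests expects for its proxies argument."""
    return {'http': 'http://' + item['ip']}


# Placeholder pool: 20 entries whose use counts run from 1 to 20
pool = [{'ip': '192.0.2.%d:8080' % i, 'times': i} for i in range(1, 21)]
chosen = pick_proxy(pool)
print(to_proxies(chosen))
```

Because each pick increments times, heavily used proxies drift out of the candidate window on their own. A proxy that fails a request made with a timeout could simply be removed from the pool.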

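3. Cookies and sessions
　　Differences
　　　　Cookie data is stored in the client's browser; session data is stored on the server
　　　　Cookies are not very secure: someone can inspect the cookies stored locally and forge them (cookie spoofing)
　　　　A session is kept on the server for a certain period; as traffic grows, sessions consume server resources
　　　　A single cookie can hold no more than about 4 KB of data, and many browsers limit a site to 20 cookies
　　In crawlers
　　　　Benefit of sending cookies or a session
　　　　　　Pages that require login become reachable
　　　　Drawbacks of sending cookies or a session
　　　　　　One set of cookies and session usually corresponds to one user
　　　　　　Requesting too fast or too often makes it easy for the server to identify that user as a crawler
　　　　Notes
　　　　　　Avoid cookies whenever they are not needed
　　　　　　To fetch pages behind a login, however, the request must carry cookies
　　　　Usage
　　　　　　Collect many sets of cookies into a cookie pool and send requests with them
　　　　The session provided by requests
　　　　　　requests provides a Session class that keeps a conversation going between client and server
　　　　　　How to use it:
　　　　　　　　1. Instantiate a session object
　　　　　　　　2. Send GET or POST requests through the session
　　　　　　　　　　session = requests.session()
　　　　　　　　　　response = session.get(url, headers=headers)
　　Approach for requesting a site after login (with requests)
　　　　Instantiate a session
　　　　First send the login request through the session, so the site's cookies are stored in the session
　　　　Then request the target pages through the same session, which automatically sends the cookies saved at login
　　Getting logged-in pages with a cookie, without sending a POST request
　　　　Suitable for sites whose cookies take a long time to expire
　　　　Suitable when all the data can be fetched before the cookie expires (more cumbersome)
　　　　Can be combined with another program: that program obtains the cookies, this one requests the pages
　　Three ways to log in
　　　　Instantiate a session, send the login POST through it, then use it to fetch the logged-in pages
　　　　Add a Cookie key to headers whose value is the cookie string
　　　　Pass a cookies argument to the request method; it takes the cookies as a dict whose keys are cookie names and whose values are cookie values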

import requests


######### Method 1: instantiate a session #########

session = requests.session()
# Note: the login URL used here was found in the action attribute of the login form
post_url = 'http://www.renren.com/PLogin.do'
post_data = {
    'email': '13061455882',
    'password': '714466'
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# Send the login POST through the session; the returned cookies are stored in it
session.post(post_url, headers=headers, data=post_data)
# Reuse the session to request a page that is only accessible after login
ret = session.get('http://www.renren.com/969010233/profile', headers=headers)
# Save the page
with open('renren1.html', 'w', encoding='utf-8') as f:
    f.write(ret.content.decode())


'''
######### Method 2: put the cookie string in the headers #########
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Cookie': 'p005baksess=18e345f691fb6a301bb064fd093c9d80d48e7545'
}

ret = requests.get('https://p013.zhenlutech.com/index.php/admin', headers=headers)
with open('admin.html', 'w', encoding='utf-8') as f:
    f.write(ret.content.decode())
'''


'''
######### Method 3: pass the cookies as a dict #########

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

cookies = 'anonymid=jp9m917q-35ny33; depovince=SD; _r01_=1; JSESSIONID=abcZF4nEVWnOPeRBv04Dw; ick_login=de6c9926-372d-4a35-aacd-0dcb7d1253fb; jebecookies=16ef79e7-76b7-4aa8-8014-34a30ea13eb2|||||; _de=FCDB90B914A526660E0B69E00FD48194; p=6bed82a76eef3b37a31d84c21b1ee0ec3; first_login_flag=1; ln_uact=13061455882; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; t=f03bbd03c6c2cf9d0700aedcff9944d03; societyguester=f03bbd03c6c2cf9d0700aedcff9944d03; id=969010233; xnsid=86d166c9; loginfrom=syshome; ch_id=10016; jebe_key=95c30273-9252-4373-adab-c3adda9c5db1%7C21a3b51c0c3373888942d7937b020f92%7C1543926359197%7C1%7C1543926520005; wp_fold=0'
# Split the string on '; ' into name=value pairs, then split each pair on the
# first '=' only, because a cookie value may itself contain '='
cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies.split('; ')}
print(cookies)
ret = requests.get('http://www.renren.com/969010233/profile', headers=headers, cookies=cookies)
with open('renren.html', 'w', encoding='utf-8') as f:
    f.write(ret.content.decode())
'''
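
The cookie pool mentioned earlier in this section can be built from the same string-to-dict conversion used in the script above. This is a minimal sketch under stated assumptions: the helper names are illustrative, and the cookie strings are made up to stand in for several logged-in accounts.

```python
import random


def cookie_str_to_dict(cookie_str):
    """Turn 'a=1; b=2' (the format of a Cookie header) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_str.split(';'):
        # partition splits on the first '=' only, so '=' inside a value survives
        name, sep, value = pair.strip().partition('=')
        if sep:  # skip fragments that have no '='
            cookies[name] = value
    return cookies


def pick_cookies(cookie_pool, rng=random):
    """Choose one set of cookies at random from the pool."""
    return rng.choice(cookie_pool)


# Made-up cookie strings, one per hypothetical account
cookie_pool = [
    cookie_str_to_dict('id=1001; token=abc=='),
    cookie_str_to_dict('id=1002; token=def=='),
]
print(pick_cookies(cookie_pool))
```

The chosen dict can be passed as the cookies argument of a request, as in the third method above.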


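　　Finding the login POST URL

　　　　Look in the form for the URL given by its action attribute
　　　　　　The POST data is a dict whose keys are the name attributes of the input tags and whose values are the real username and password; the POST URL is the URL in action
　　　　Capture the traffic
　　　　　　Find the login URL
　　　　　　Tick the Preserve log checkbox so the URL is not lost when the page redirects
　　　　　　Find the POST URL and determine the parameters
　　　　　　　　Parameters that do not change can be used directly, for example when the password is not dynamically encrypted
　　　　　　　　Parameters that do change
　　　　　　　　　　may be present in the current response
　　　　　　　　　　or may be generated by JavaScript
　　　　Locating the relevant JavaScript
　　　　　　Select the button that triggers the JS event and open Event Listeners to find where the script lives
　　　　　　Use Chrome's "Search all files" to search for a keyword from the URL
　　　　　　Set breakpoints to watch what the script does, then reproduce the same steps in Python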
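
The first approach above, reading the action URL and the input names out of the login form, can also be done in code rather than by eye. The sketch below uses only the standard library. The class name LoginFormParser and the HTML fragment are hypothetical, and a real login page may need more handling (several forms, hidden fields with preset values).

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the action URL of the first form and the names of its inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.input_names = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and not self._done:
            self._in_form = True
            self.action = attrs.get('action')
        elif tag == 'input' and self._in_form and attrs.get('name'):
            # Only named inputs are submitted with the form
            self.input_names.append(attrs['name'])

    def handle_endtag(self, tag):
        if tag == 'form' and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical login page fragment
html = '''
<form action="http://example.com/login" method="post">
  <input type="text" name="email">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
'''
parser = LoginFormParser()
parser.feed(html)
# Skeleton of the POST data: fill in the real username and password as values
post_data = dict.fromkeys(parser.input_names, '')
print(parser.action, post_data)
```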

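4. The requests module
　　Getting started
　　　　requests is built on top of urllib3
　　　　requests offers the same API under Python 2 and Python 3 (recent releases support Python 3 only)
　　　　requests is simple and easy to use
　　　　requests automatically decompresses page content for us (gzip and the like)
　　　　Documentation: http://docs.python-requests.org/zh_CN/latest/
　　Decoding
　　　　response.content.decode()
　　　　response.content.decode('utf-8')
　　　　response.text
　　Difference between response.text and response.content
　　　　response.text
　　　　　　Type: str
　　　　　　Decoding: uses a text encoding guessed from the HTTP headers of the response
　　　　　　Changing the encoding: response.encoding = 'gbk'
　　　　response.content
　　　　　　Type: bytes
　　　　　　Decoding: none applied
　　　　　　Choosing the encoding: response.content.decode('utf-8')
　　　　response.content.decode() is the recommended way to get the HTML of the response (decode() defaults to UTF-8)
　　　　Exercise: saving an image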

import requests

response = requests.get('http://docs.python-requests.org/zh_CN/latest/_static/requests-sidebar.png')

# An image is binary data, so write response.content (bytes) in 'wb' mode
with open('a.png', 'wb') as f:
    f.write(response.content)
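
Returning to the decoding advice earlier in this section: response.content.decode() assumes UTF-8 and raises on pages served in another encoding such as GBK. One possible workaround, sketched here on plain bytes so it runs without a network call, is to try a short list of candidate encodings in order. The helper name decode_body and the candidate list are assumptions, not part of requests.

```python
def decode_body(raw, encodings=('utf-8', 'gbk')):
    """Try each candidate encoding in order and return the first clean decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters rather than fail
    return raw.decode(encodings[0], errors='replace')


utf8_bytes = '爬蟲'.encode('utf-8')
gbk_bytes = '爬虫'.encode('gbk')
print(decode_body(utf8_bytes))
print(decode_body(gbk_bytes))
```

In a crawler, raw would be response.content. UTF-8 goes first because UTF-8 bytes often also decode under GBK without an error and come out as garbage.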


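　　requests tips
　　　　requests.utils.dict_from_cookiejar() converts a cookie jar object into a dict
　　　　SSL certificate verification (verify=False skips the check)
　　　　　　response = requests.get('https://www.12306.cn/mornhweb/', verify=False)
　　　　Setting a timeout
　　　　　　response = requests.get(url, timeout=10)
　　　　Checking the status code to confirm the request succeeded
　　　　　　assert response.status_code == 200
　　　　Example: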

import requests

'''
response = requests.get('http://www.baidu.com')
print(response.cookies) # <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
print(requests.utils.dict_from_cookiejar(response.cookies)) # {'BDORZ': '27315'}

print(requests.utils.cookiejar_from_dict({'BDORZ': '27315'}))

print(requests.utils.unquote('https://tieba.baidu.com/f?ie=utf-8&kw=%E6%9D%8E%E6%AF%85'))
print(requests.utils.quote('https://tieba.baidu.com/f?ie=utf-8&kw=李毅'))
'''

'''
# Exception handling
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36'}

def _parse_url(url):
    response = requests.get(url, headers=headers, timeout=3)
    # Both the request above and this assert can raise, so the caller catches exceptions
    assert response.status_code == 200
    return response.content.decode()

def parse_url(url):
    try:
        html_str = _parse_url(url)
    except Exception:
        html_str = None
    return html_str

if __name__ == '__main__':
    post_url = 'https://www.baidu.com/'
    print(parse_url(post_url))
'''


# With many URLs to request (say 1000), some of them are bound to fail,
# so bring in the retrying module to repeat a failed request
from retrying import retry

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Mobile Safari/537.36'}

# stop_max_attempt_number (found in the retrying source) caps the attempts at three
@retry(stop_max_attempt_number=3)
def _parse_url(url):
    print('*' * 20)  # shows how many times the request was attempted
    response = requests.get(url, headers=headers, timeout=3)
    assert response.status_code == 200
    return response.content.decode()

def parse_url(url):
    try:
        html_str = _parse_url(url)
    except Exception:
        html_str = None
    return html_str

if __name__ == '__main__':
    # No scheme in the URL, so every attempt fails and all three tries are visible
    post_url = 'www.baidu.com/'
    print(parse_url(post_url))
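
To make clearer what a retry decorator does, here is a minimal standard-library sketch of the same idea. It is not the implementation of the retrying package: the parameter names max_attempts and delay are made up, and the flaky function only simulates a request that fails twice.

```python
import functools
import time


def retry(max_attempts=3, delay=0.0):
    """Re-run the wrapped function until it succeeds or max_attempts is reached."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller see the error
                    time.sleep(delay)
        return wrapper
    return decorator


calls = []


@retry(max_attempts=3)
def flaky():
    """Fail twice, then succeed, to make the retries visible."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError('simulated failure')
    return 'ok'


print(flaky(), len(calls))
```

As in the script above, the last failure is re-raised, so an outer try/except can still turn it into a None result.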
