Web Scraping with requests

Date: 2019-11-13

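Crawler overview

What is a crawler?
- A program that imitates a browser visiting the web and fetches data from the internet
Types of crawlers
- General crawler: fetches the full source of an entire page.
- Focused crawler: extracts only specific parts of a page
- Incremental crawler: monitors a site for updates and fetches only the newly published data.
Anti-scraping mechanisms: implemented by the website.
Counter-anti-scraping strategies: implemented by the crawler.
Is scraping legal?
- Crawling itself is not prohibited by law (the technology is neutral)
- Where the legal risk comes from:
  - The crawler interferes with the normal operation of the site it visits
  - The crawler collects specific types of data or information that are protected by law.
- How to reduce the legal risk:
  - Strictly follow the robots protocol published by the site;
  - While working around anti-scraping measures, optimize your own code so that it does not disrupt the normal operation of the site;
  - Before using or distributing scraped data, review it; if it contains users' personal information, private data, or someone else's trade secrets, stop immediately and delete it.
The first anti-scraping mechanism: the robots protocol
- Nature: it only stops those who choose to respect it; nothing enforces it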
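Following the robots protocol can be automated. The snippet below is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content, the crawler name "my-crawler" and the example.com URLs are made up for illustration; a real crawler would load the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_robots_parser(robots_text):
    """Build a parser from the text of a robots.txt file."""
    robots = RobotFileParser()
    robots.parse(robots_text.splitlines())
    return robots


robots = build_robots_parser(ROBOTS_TXT)

# "my-crawler" is an assumed user-agent name for this example.
allowed_public = robots.can_fetch("my-crawler", "https://example.com/index.html")
allowed_private = robots.can_fetch("my-crawler", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a live site, RobotFileParser also offers set_url() and read() to download the file before calling can_fetch().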

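HTTP and HTTPS

HTTP: the protocol that defines how a client and a server exchange data.
Common request headers
- User-Agent: identifies the client that sends the request
- Connection: close
Response headers
- content-type
HTTPS: secure (encrypted) HTTP
- Symmetric-key encryption
- Asymmetric-key encryption
- Certificate-based encryption (***): certificates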

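The requests module

Purpose: imitates a browser sending requests.
Coding workflow:
- Specify the URL
- Send the request and get the response object
- Read the response data
- Persist the data

General crawler exercises

1. A simple web page collector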

import requests
url = 'https://www.sogou.com/web'
# Dynamic query parameter
wd = input('enter a key word:')
param = {'query': wd}
# Send the request with the dynamic parameter attached
response = requests.get(url=url, params=param)
page_text = response.text
fileName = wd + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName, 'downloaded successfully!')

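Running the program above exposes two problems:

Garbled text: fix it by setting the encoding of the response data
Missing data: caused by the UA-detection anti-scraping mechanism

UA detection: the site checks the identity of the client that sends the request (the User-Agent header)

UA spoofing:
- Put the User-Agent value in a dictionary and pass that dictionary to the headers parameter of the request method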

# A simple web page collector
import requests
url = 'https://www.sogou.com/web'
# Dynamic query parameter
msg = input('enter a key word:')
params = {'query': msg}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",}
# Send the request with the dynamic parameter and the spoofed User-Agent
response = requests.get(url=url, params=params, headers=headers)
# The encoding of the response data can be set manually
response.encoding = 'utf-8'
page_text = response.text
filename = msg + '.html'
with open(filename, 'w', encoding='utf8') as f:
    f.write(page_text)

2. Scraping movie details from Douban Movies

import requests
url = 'https://movie.douban.com/j/chart/top_list'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
}
start = input('Start offset (romance films): ')
limit = input('Number of top romance films to fetch: ')
params = {
    "type":"13",
    "interval_id": "100:90",
    "action": "" ,
    "start": start,
    "limit":limit,
}
page_text = requests.get(url=url,params=params,headers=headers).json()
for dic in page_text:
    title,score = dic['title'],dic['score']
    print('Title: {}, score: {}'.format(title,score))

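Dynamically loaded data

Use a packet-capture tool to analyze the traffic and extract the URL that serves the dynamically loaded data

3. Scraping KFC restaurant locations

import requests
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
city = input('Enter the name of the city to search: ')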
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
}
for i in range(1,8):
    data = {
        "cname":"", 
        "pid": "",
        "keyword": city,
        "pageIndex": i,
        "pageSize": "10",
    }
    page_json = requests.post(url=url,headers=headers,data=data).json()
#     print(page_json)
    for dic in page_json['Table1']:
        rownum = dic['rownum']
        storeName = dic['storeName']
        addressDetail = dic['addressDetail']
        print('{}-{}-{}'.format(rownum,storeName,addressDetail))

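4. Scraping cosmetics company details

http://125.35.6.84:81/xk/

# 1. Check whether the data on the page is dynamically loaded
# 2. Use a packet-capture tool to find the packet that carries the dynamically loaded data and extract its request URL
# 3. The packet for the home page contains no detail-page URLs, but it does contain the id of every company
# 4. Every company's detail page shares the same base URL; only the id parameter differs
# 5. So a detail-page URL can be built from the shared base URL plus a company's id
# 6. The packet-capture tool shows that the company details on the detail page are dynamically loaded as well
# 7. A global search in the packet-capture tool locates the packet with that data, and its URL can be extracted
# 8. The detail-data URL is the same for every company; only the value of the id parameter differs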
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}
for page in range(1,11):
    data = {
        "on": "true",
        "page": str(page),
        "pageSize": "15",
        "productName": "",
        "conditionType": "1",
        "applyname": "",
        "applysn": "",
    }
    json_data = requests.post(url=url,data=data,headers=headers).json()
    print('Finished scraping page {}!'.format(page))
    for dic in json_data['list']:
        _id = dic.get('ID')
        # Send a POST request to the URL that serves the company detail data
        post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
        data = {'id':_id}
        detail_data = requests.post(url=post_url,headers=headers,data=data).json()
        company_name = detail_data['epsName']
        print(company_name)

5. Scraping free audio from Ximalaya

import requests
url = 'https://www.ximalaya.com/revision/play/album?'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
for num in range(1,10):
    params = {
        "albumId": "9742789",
        "pageNum": num,
        "sort": "0",
        "pageSize": "30",
    }
    json_data = requests.get(url=url,headers=headers,params=params).json()
    for dic in json_data.get('data').get('tracksAudioPlay'):
        trackName = dic['trackName']
        trackUrl = dic['trackUrl']
        src = dic['src']
        # Use a separate name so the album API url is not overwritten for the next page
        track_page_url = 'https://www.ximalaya.com{}'.format(trackUrl)
        print(track_page_url)
        # Download the audio bytes
        xs_data = requests.get(url=src,headers=headers).content
        with open('./{}.mp3'.format(trackName),'wb') as fp:
            fp.write(xs_data)

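Focused crawling: data parsing

Ways to parse data:
- Regular expressions
- bs4
- xpath
- pyquery
Why parse data?
- Data parsing is the core technique behind a focused crawler: it extracts specific text content from a page's source.
What is the general principle of data parsing?
- The data we want is stored either between tags or in tag attributes, so parsing takes two steps:
- 1. Locate the tag
- 2. Extract its text or its attribute value

1. Regex parsing

1. Case: scraping images from Qiushibaike

// Inspecting the page source reveals the target div tag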
<div class="thumb">

<a href="/article/121960143" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12196/121960143/medium/5E2GWJ5SP18QQ051.jpg" alt="重要通知">
</a>

</div>

import re
import requests
import os
from urllib import request
# Generic URL template (never modified)
url = 'https://www.qiushibaike.com/pic/page/%d/?s=5205111'
if not os.path.exists('./qiutuPic'):
    os.mkdir('./qiutuPic')
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}
for page in range(1,36):
    # Build the URL for this page
    new_url = url % page
    # Request the page
    page_text = requests.get(url=new_url,headers=headers).text
    # Parse out the src attribute of each img tag (the image link)
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    # re.S is required so that . also matches the newlines between the tags
    img_src_list = re.findall(ex,page_text,re.S)
    for src in img_src_list:
        src = 'https:'+src
        img_name = src.split('/')[-1]
        img_path = './qiutuPic/'+img_name
        request.urlretrieve(src,img_path)
        print(img_name+' downloaded successfully!')

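2. bs4 parsing

How it works:
- 1. Instantiate a BeautifulSoup object and load the page source to be parsed into it
- 2. Call the object's attributes and methods to locate tags and extract data
Installation
- pip install lxml (the parser)
- pip install bs4
Instantiating BeautifulSoup
- BeautifulSoup(fp,'lxml'): loads the source of a locally stored HTML page into the object; fp is a file handle
- BeautifulSoup(page_text,'lxml'): loads page source fetched from the internet into the object

`bs` attributes and methods

soup.tagName: locates the first occurrence of the tagName tag; returns a single element
find('tagName') == soup.tagName
Attribute lookup: find('tagName',attrName='value'), also returns a single element
find_all(): same usage as find, but returns a list
select('selector'): accepts id, class, tag, and hierarchy selectors and returns a list; > means one level, a space means any number of levels
Text: string returns only the tag's direct text; text and get_text() return all the text inside the tag, and the two give the same result
Attributes: tag['attrName']

bs4 examples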

1. Basics

from bs4 import BeautifulSoup
# Use bs4 to parse selected parts of a locally stored page
fp = open('./test.html','r',encoding='utf-8')
soup = BeautifulSoup(fp,'lxml')
# Locating tags
soup.a
soup.find('div',class_='tang')
soup.find_all('div')
soup.select('.song')
soup.select('.tang > ul > li > a')
soup.select('.tang > ul > li > a')[1].string
soup.find('div',class_='song').text
soup.find('div',class_='song').get_text()
soup.select('.song > img')[0]['src']

2. Case: scraping a novel

Scrape the novel Journey to the West from shicimingju.com

from bs4 import BeautifulSoup
import requests
url = 'http://www.shicimingju.com/book/xiyouji.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}
page_text = requests.get(url,headers=headers).text
# Parse the table of contents
soup = BeautifulSoup(page_text,'lxml')
a_list = soup.select('.book-mulu > ul > li > a') # '.book-mulu li > a' also works
fp = open('./西遊記.edub','w',encoding='utf-8')
for a in a_list:
    title = a.string
    detail_url = 'http://www.shicimingju.com'+a['href']
    detail_page_text = requests.get(url=detail_url,headers=headers).text
    # Parse the source of the chapter page
    soup = BeautifulSoup(detail_page_text,'lxml')
    content = soup.find('div',class_='chapter_content').text
    fp.write(title+':'+content+'\n')
    print(title,'downloaded successfully!')
fp.close()

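3. XPath parsing

1. Introduction

Advantage: highly portable
How it works:
- 1. Instantiate an etree object and load the page to be parsed into it
- 2. Call the object's xpath method with different XPath expressions to locate tags and extract data
Installation:
- pip install lxml
Instantiating the etree object:
- etree.parse('filePath')
- etree.HTML(page_text)

2. XPath expressions

The xpath method returns a list.
A single leading slash means the expression must start from the root node
A double leading slash means tags can be located from any position in the document
Elsewhere in the expression, a single slash means one level and a double slash means any number of levels
Attribute lookup: //tagName[@attrName='value']
Index lookup: //div[@class="tang"]/ul/li[2] (indexes start at 1)
Text: /text() for direct text, //text() for all descendant text
Attributes: /@attrName

3. XPath cases

1. Scraping crawler job postings from Boss Zhipin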

from lxml import etree
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}
url = 'https://www.zhipin.com/c101010100/?query=python爬蟲&page=%d'
for page in range(1,5):
    new_url = format(url%page)
    page_text = requests.get(url=new_url,headers=headers).text
    # Parse the listing page
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div/div[3]/ul/li')
    for li in li_list:
        job_name = li.xpath('./div/div[@class="info-primary"]/h3/a/div[1]/text()')[0]
        salary = li.xpath('./div/div[@class="info-primary"]/h3/a/span/text()')[0]
        company = li.xpath('./div/div[2]/div/h3/a/text()')[0]
        detail_url = 'https://www.zhipin.com'+li.xpath('./div/div[1]/h3/a/@href')[0]
        detail_page_text = requests.get(url=detail_url,headers=headers).text
        # Job description on the detail page
        tree = etree.HTML(detail_page_text)
        job_desc = tree.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()')
        job_desc = ''.join(job_desc)
        print(job_name,salary,company,job_desc)

Garbled Chinese text

# Generic URL template (never modified)
url = 'http://pic.netbian.com/4kdongwu/index_%d.html'
for page in range(1,11):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kdongwu/'
    else:
        new_url = url % page
    response = requests.get(new_url,headers=headers)
#     response.encoding = 'utf-8'
    page_text = response.text

    # Parse out the image title and the image URL
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        img_title = li.xpath('./a/img/@alt')[0]
        # General fix for garbled Chinese: undo the wrong ISO-8859-1 decoding,
        # then decode the bytes with the page's real encoding (gbk here)
        img_title = img_title.encode('iso-8859-1').decode('gbk')
        img_src = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
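Why the encode('iso-8859-1').decode('gbk') round trip works can be checked offline. The sketch below simulates a GBK page that was wrongly decoded as ISO-8859-1 and then repairs it. The helper name fix_mojibake and the sample string are made up for illustration.

```python
def fix_mojibake(garbled, real_encoding="gbk"):
    """Undo a wrong ISO-8859-1 decoding and decode with the real encoding."""
    # ISO-8859-1 maps every byte to exactly one character, so encoding the
    # garbled text back with it recovers the original bytes without loss.
    return garbled.encode("iso-8859-1").decode(real_encoding)


# Simulate the problem: the server sends GBK bytes ...
raw_bytes = "高清动物图片".encode("gbk")
# ... but the client decodes them as ISO-8859-1, producing garbled text.
garbled = raw_bytes.decode("iso-8859-1")

restored = fix_mojibake(garbled)
print(repr(garbled))
print(restored)
```

The same helper covers pages that are really UTF-8 by passing real_encoding="utf8".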

3. The XPath pipe operator |

# Scrape the names of the popular cities and of all cities from https://www.aqistudy.cn/historydata/
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
# # Parse the popular cities
# hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
# # Parse all cities
# all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text()')
# print(all_cities)

# Benefit: the pipe lets a single XPath expression cover both page structures
tree.xpath('//div[@class="bottom"]/ul/li/a/text() | //div[@class="bottom"]/ul/div[2]/li/a/text()')

# Scraping author names from Qiushibaike; the pipe also covers anonymous authors
url = 'https://www.qiushibaike.com/text/page/%d/'
for page in range(1,10):
    print(str(page)+'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    new_url = format(url%page)
    page_text = requests.get(url=new_url,headers=headers).text
    
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="content-left"]/div')
    for div in div_list:
        author = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()')[0]
        print(author)

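4. Lazy-loaded images

Lazy loading is a web page optimization technique. Images are network resources: requesting them consumes bandwidth just like any other static resource, and loading every image on a page at once greatly increases the time to first render. To avoid this, the front end and back end cooperate so that an image is loaded only when it enters the browser's current viewport, which reduces the number of image requests on first render. This technique is called lazy loading.
How do sites usually implement lazy loading?

In the page source, the img tag first stores the real image link in a "pseudo-attribute" (commonly src2, original, ...) instead of directly in src. When the image enters the visible area of the page, the pseudo-attribute is dynamically replaced by src and the image is loaded.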
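Because a crawler never scrolls the page, it has to read the pseudo-attribute itself. The sketch below shows one way to do that with the standard library's html.parser. The attribute priority list, the class name LazyImgParser and the sample HTML are assumptions for illustration; the pseudo-attribute used by a given site has to be confirmed in its page source.

```python
from html.parser import HTMLParser

# Pseudo-attributes are tried first, falling back to the regular src.
LAZY_ATTRS = ("src2", "original", "src")


class LazyImgParser(HTMLParser):
    """Collect the real image URL of every img tag."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        for name in LAZY_ATTRS:
            if attr_map.get(name):
                self.image_urls.append(attr_map[name])
                break


# Hypothetical page fragment: one lazy-loaded image and one regular image.
SAMPLE_HTML = """
<div id="container">
  <div><a href="#"><img src2="http://example.com/a.jpg" alt="a"></a></div>
  <div><a href="#"><img src="http://example.com/b.jpg" alt="b"></a></div>
</div>
"""

lazy_parser = LazyImgParser()
lazy_parser.feed(SAMPLE_HTML)
print(lazy_parser.image_urls)
```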

import requests
from lxml import etree
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}
url = 'http://sc.chinaz.com/tag_tupian/YaZhouMeiNv.html'
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="container"]/div')
for div in div_list:
    # The real image URL is stored in the src2 pseudo-attribute, not in src
    img_src = div.xpath('./div/a/img/@src2')[0]
    print(img_src)

5. Scraping resume templates and saving them to disk

import os
import requests
from lxml import etree
from urllib import request
if not os.path.exists('./簡歷'):
    os.mkdir('./簡歷')
# Generic URL template for page 2 onwards; page 1 has its own URL
url = 'http://sc.chinaz.com/jianli/free_%d.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}
for page in range(1,20):
    if page == 1:
        new_url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        new_url = url % page
    page_text = requests.get(url=new_url,headers=headers).text
    tree = etree.HTML(page_text)
    # Iterate over every item inside the container, not over the container itself
    div_list = tree.xpath('//*[@id="container"]/div')
    for div in div_list:
        a_href = div.xpath('./a/@href')[0]
        a_name = div.xpath('./p/a/text()')[0]
        # Repair the garbled Chinese name
        a_name = a_name.encode('iso-8859-1').decode('utf8')
        detail_text = requests.get(url=a_href,headers=headers).text
        detail_tree = etree.HTML(detail_text)
        # First download link on the detail page
        detail_url = detail_tree.xpath('//*[@id="down"]/div[2]/ul/li[1]/a/@href')[0]
        jianli_name = a_name
        jianli_path = './簡歷/'+jianli_name+'.rar'
        request.urlretrieve(detail_url,jianli_path)
        print(jianli_name+' downloaded successfully!')

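Proxy IPs

Topics in this part:
- Proxies
- Cookies
- Simulated login
  - CAPTCHA recognition
- Thread pools with requests
- Single thread + multi-task asynchronous coroutines
Proxy: a proxy server
Sites that provide proxies:
- 站大爺
- goubanjia
- 快代理
- 西祠代理
Proxy anonymity levels
- Transparent: the target server knows that you are using a proxy and also knows your real IP
- Anonymous: the target server knows that you are using a proxy but does not know your real IP.
- Elite (high anonymity): the target server does not know that you are using a proxy, let alone your real IP
Types:
- http: the proxy server can only forward HTTP requests
- https: the proxy server can only forward HTTPS requests
In code:
- Pass a proxies argument to the get or post method, with a value of the form {'http': 'ip:port'}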
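The proxies dictionary can be built with a small helper. This is a minimal sketch: the function name build_proxies and the address 203.0.113.10:8080 (a documentation-only IP) are assumptions for illustration, and the proxy URL is written with an explicit http:// scheme, which recent versions of requests expect.

```python
def build_proxies(host, port, protocols=("http",)):
    """Build the dict expected by the proxies parameter of requests.

    Each key names the scheme of the target URL (http or https) whose
    requests should be routed through the proxy.
    """
    proxy_url = "http://{}:{}".format(host, port)
    return {protocol: proxy_url for protocol in protocols}


# Hypothetical proxy address, used only to show the resulting structure.
proxies = build_proxies("203.0.113.10", 8080)
print(proxies)
```

The result is then passed along with the request, for example requests.get(url, headers=headers, proxies=proxies).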

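Handling cookies

Two approaches:
- Manual: add the cookie key-value pairs to the headers dictionary by hand, then pass headers to the headers parameter of the get or post method
- Automatic: use a Session.
  - What a session does: it can send requests just like requests itself
  - Key property: if a request sent through a session produces cookies, they are stored in the session automatically

url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=-1&count=10&category=-1'
# Create a session object
session = requests.Session()
# Visit the home page first so the session picks up the cookies
session.get('https://xueqiu.com/',headers=headers)
# Send the real request; the stored cookies are attached automatically
session.get(url,headers=headers).json()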

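Simulated login

What is simulated login?
- Using requests to send the request that the login button triggers
Why simulate login?
- Some pages are only shown after logging in

CAPTCHA recognition
- Online CAPTCHA-solving platforms: 超級鷹 (Chaojiying), 雲打碼, 打碼兔, ...
- Chaojiying: http://www.chaojiying.com/

How to use Chaojiying
- Register: create an account of the "user center" type
- Log in:
  - Check the remaining points balance
  - Create a software ID: Software ID -> generate a software ID (give it a name and a description)
  - Download the sample code: Developer docs -> choose a language -> click download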

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        # The API expects the MD5 hex digest of the password
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the CAPTCHA that was recognized incorrectly
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

# Recognize the CAPTCHA image on gushiwen.org
from lxml import etree
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
code_img_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(code_img_src,headers=headers).content
with open('./code.jpg','wb') as fp:
    fp.write(img_data)

# Recognize the CAPTCHA with the online platform
# Arguments: username, password, software ID (generate your own under User Center >> Software ID)
chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
# Read the local image file as bytes
with open('code.jpg', 'rb') as f:
    im = f.read()
print(chaojiying.PostPic(im, 1902)['pic_str'])

1. Case: simulated login to gushiwen.org

# A session keeps the cookies shared between the CAPTCHA request and the login request
s = requests.Session()

url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = s.get(url,headers=headers).text
tree = etree.HTML(page_text)
code_img_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
# Request the CAPTCHA image through the session (this produces a cookie)
img_data = s.get(code_img_src,headers=headers).content
with open('./code.jpg','wb') as fp:
    fp.write(img_data)

# Parse the values of the dynamic form parameters
__VIEWSTATE = tree.xpath('//input[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//input[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# Recognize the CAPTCHA with the online platform
# Arguments: username, password, software ID (generate your own under User Center >> Software ID)
chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
with open('code.jpg', 'rb') as f:
    im = f.read()
# Text shown in the CAPTCHA image
code_img_text = chaojiying.PostPic(im, 1902)['pic_str']
print(code_img_text)
# Simulated login
login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    # Dynamic parameters:
    # these are usually hidden in the source of the front-end page
    "__VIEWSTATE": __VIEWSTATE,
    "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,
    "from": "http://so.gushiwen.org/user/collect.aspx",
    "email": "www.zhangbowudi@qq.com",
    "pwd": "bobo328410948",
    "code": code_img_text,
    "denglu": "登陸",
}
login_page_text = s.post(login_url,headers=headers,data=data).text
with open('gushiwen.html','w',encoding='utf-8') as fp:
    fp.write(login_page_text)

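Using a thread pool

- Asynchronous operations can be combined with non-asynchronous ones
- A thread pool is best reserved for time-consuming operations

Synchronous:

import time
def request(url):
    print('Requesting:',url)
    time.sleep(2)  # simulate a slow request
    print('Request succeeded:',url)
start = time.time()
urls = [
    'www.1.com',
    'www.2.com',
    'www.3.com',
    'www.4.com',
]
for url in urls:
    request(url)
print(time.time()-start)

Asynchronous, with a thread pool:

import time
from multiprocessing.dummy import Pool
start = time.time()
# A pool of 4 worker threads
pool = Pool(4)
def request(url):
    print('Requesting:',url)
    time.sleep(2)  # simulate a slow request
    print('Request succeeded:',url)
urls = [
    'www.1.com',
    'www.2.com',
    'www.3.com',
    'www.4.com',
]
# Each url is handed to request() on one of the pool's threads
pool.map(request,urls)
pool.close()
pool.join()
print(time.time()-start)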

2. Case: scraping short videos from Pear Video

import requests
from lxml import etree
from multiprocessing.dummy import Pool
import re
# Scrape short videos from Pear Video
# The video address sits in a script block of the detail page, for example:
#var contId="1570697",liveStatusUrl="liveStatus.jsp",liveSta="",
# playSta="1",autoPlay=!1,isLiving=!1,isVrVideo=!1,hdflvUrl="",sdflvUrl="",hdUrl="",sdUrl="",ldUrl="",
# srcUrl="https://video.pearvideo.com/mp4/adshort/20190626/cont-1570697-14061078_adpkg-ad_hd.mp4",vdoUrl=srcUrl,skinRes="//www.pearvideo.com/domain/skin",videoCDN="//video.pearvideo.com";
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}
pool = Pool(4)

url = 'https://www.pearvideo.com/category_1'
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="listvideoListUl"]/li')
all_video_urls = []
for li in li_list:
    detail_url = 'https://www.pearvideo.com/'+li.xpath('./div/a/@href')[0]
    detail_page_text = requests.get(detail_url,headers=headers).text
    # Extract the video URL with a regex, since it lives inside JavaScript
    ex = 'srcUrl="(.*?)",vdoUrl'
    video_url = re.findall(ex,detail_page_text,re.S)[0]
    all_video_urls.append(video_url)

def download_save(url):
    video_data = requests.get(url,headers=headers).content
    fileName = url.split('/')[-1]
    with open(fileName,'wb') as fp:
        fp.write(video_data)
    print(fileName,'downloaded successfully')

# Requesting and saving video data is slow, so it is done through the thread pool
# The function passed as the first argument must take exactly one parameter
pool.map(download_save,all_video_urls)
pool.close()
pool.join()

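Multi-task asynchronous coroutines (concurrency)

- Coroutine: an object. If a function is defined with the async keyword, calling it returns a coroutine object, and the statements inside the function are not executed at call time (a "special function").

- Task object: an object that further wraps a coroutine, so task object == coroutine == special function. A task object exposes the state of its coroutine and can have a callback bound to it.
- Binding a callback: task.add_done_callback(func); the callback receives the finished task as its only argument

- Event loop: a loop with no fixed number of iterations. Multiple task objects (special functions) are registered with it.
- async: used only to mark functions
- await: suspends the current coroutine until the awaited operation finishes

- Differences between requests and aiohttp
- aiohttp requests: session.get / session.post(url, headers, params/data, proxy="http://ip:port")
- aiohttp response data: response.text(), json(), read()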
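The relationship between coroutine, task, callback and event loop can be tried without any network access. The following is a minimal sketch in which asyncio.sleep stands in for a slow request; the names fetch, on_done and the job labels are made up for illustration, and asyncio.run is used to create and run the event loop.

```python
import asyncio


async def fetch(name, delay):
    """Special function: calling it only creates a coroutine object."""
    # asyncio.sleep stands in for a slow request; await hands control
    # back to the event loop while this coroutine is waiting.
    await asyncio.sleep(delay)
    return "{} done".format(name)


results = []


def on_done(task):
    """Callback: receives the finished task; result() is the return value."""
    results.append(task.result())


async def main():
    tasks = []
    for i in range(3):
        coro = fetch("job{}".format(i), 0.1)   # coroutine object, not yet running
        task = asyncio.ensure_future(coro)     # wrap the coroutine in a task
        task.add_done_callback(on_done)        # bind the callback
        tasks.append(task)
    # The three tasks wait concurrently, so this takes about 0.1 s, not 0.3 s
    await asyncio.gather(*tasks)


asyncio.run(main())
print(sorted(results))
```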

Case 1: scraping audio with multi-task asynchronous coroutines

import os
import aiohttp
import requests
import asyncio

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
}
url = 'https://www.ximalaya.com/revision/play/album?albumId=11219576&pageNum=1&sort=1&pageSize=30'
json_data = requests.get(url, headers=headers).json()

urls = []  # url and name of the 30 audio tracks
for dic in json_data['data']['tracksAudioPlay']:
    audio_url = dic['src']
    audio_name = dic['trackName']
    urls.append({'name': audio_name, 'url': audio_url})


# Special function (marked with async) that sends one request
async def request(dic):
    async with aiohttp.ClientSession() as s:
        # s is an aiohttp session object used to send requests
        # await: placed before blocking operations (sending the request, reading the response data)
        # async: placed before code that involves aiohttp
        # proxy='http://ip:port' can be passed to use a proxy
        async with s.get(dic['url'], headers=headers) as response:
            # text() returns the response data as a string
            # read() returns bytes, json() returns the parsed JSON
            audio_data = await response.read()
            name = dic['name']
            return {'data': audio_data, 'name': name}


# Callback: it must take the task as its only parameter
def saveData(task):
    print('Saving...')
    dic = task.result()  # return value of the async function: audio bytes and name
    filename = dic['name'] + '.m4a'
    data = dic['data']
    with open(f'./相聲/{filename}', 'wb') as f:
        f.write(data)
    print(filename, 'downloaded!')


# Make sure the output directory exists before any callback writes to it
os.makedirs('./相聲', exist_ok=True)
tasks = []  # task list
for dic in urls:
    # Coroutine object
    c = request(dic)
    # Task object
    task = asyncio.ensure_future(c)
    task.add_done_callback(saveData)
    tasks.append(task)
# Create the event loop, register the task objects with it, and start it
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

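selenium

Concept: a module for browser automation.
How it relates to scraping
- It can perform simulated logins
- It makes fetching dynamically loaded page data very convenient
Installation: pip install selenium
Basic usage of selenium
- Instantiate a browser object (the browser's driver program must be loaded)
  - Driver download: http://chromedriver.storage.googleapis.com/index.html
  - Version mapping table: http://blog.csdn.net/huilan_same/article/details/51896672
- Specify the desired actions in code

Usage example: JD joy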

from selenium import webdriver
from time import sleep
# Instantiate a Chrome browser object and load its driver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
# Send the request
bro.get('https://hellojoy.jd.com/')
# Locate tags with the find family of functions
input_tag = bro.find_element_by_id('key')
# Interact with the node
input_tag.send_keys('macbook')
sleep(2)
# Locate the search button
btn = bro.find_element_by_xpath('//*[@id="search-2014"]/div/button')
btn.click()
sleep(10)
# Execute JavaScript to scroll the window to the bottom
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)

bro.close()

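Case 1: scraping dynamically loaded data with selenium

# Dynamically loaded data
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('http://125.35.6.84:81/xk/')
# Goal: the company names shown on the current page
# page_source is the source of the page as rendered by the browser (whatever is visible can be scraped)
page_text = bro.page_source
# Parse the company names with bs4
soup = BeautifulSoup(page_text,'lxml')
dl_list = soup.select('#gzlist > li > dl')
for dl in dl_list:
    name = dl.string
    print(name)
sleep(2)
bro.close()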

Case 2: handling multiple pages with selenium

# Handle multiple pages
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
# Instantiate a Chrome browser object and load its driver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('http://125.35.6.84:81/xk/')
alist = [] # page source (page_source) of each page number
# Wait for the first page to render
sleep(2)
# page_source is the source of the page as rendered by the browser (whatever is visible can be scraped)
page_text = bro.page_source
alist.append(page_text)
# id template of the pagination buttons
id_value_model = 'pageIto_first%d'
for page in range(2,8):
    id_value = id_value_model % page
    btn = bro.find_element_by_id(id_value)
    btn.click()
    sleep(3)
    page_text = bro.page_source
    alist.append(page_text)
    sleep(2)
for page_text in alist:
    sleep(1)
    soup = BeautifulSoup(page_text,'lxml')
    dl_list = soup.select('#gzlist > li > dl')
    for dl in dl_list:
        name = dl.string
        print(name)
bro.quit()

Case 3: simulated login

# Simulated login
from selenium import webdriver
from time import sleep
# Instantiate a Chrome browser object and load its driver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')

bro.get('https://qzone.qq.com/')
# The target a tag sits inside an iframe, so switch_to.frame must be called before the tag can be located
bro.switch_to.frame('login_frame')

a_tag = bro.find_element_by_id('switcher_plogin')
a_tag.click()
sleep(2)
bro.find_element_by_id('u').send_keys('123456')
sleep(2)
bro.find_element_by_id('p').send_keys('XXXXXXXXXXXXX')
sleep(2)
bro.find_element_by_id('login_button').click()

# Page source after a successful login
page_text = bro.page_source
sleep(2)
bro.quit()

Less commonly used operations

Action chains

from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep
# Instantiate a Chrome browser object and load its driver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')

bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

# Drag an element using an action chain
bro.switch_to.frame('iframeResult')
# Locate the tag that will be dragged
div_tag = bro.find_element_by_id('draggable')

# Instantiate an action chain, passing in the current browser object
action = ActionChains(bro)
# Click and hold
action.click_and_hold(div_tag)

for i in range(5):
    # perform() executes the action chain immediately
    action.move_by_offset(15,0).perform()
    sleep(0.5)
# release() only queues the action; perform() is needed to actually release the mouse
action.release().perform()
sleep(2)
bro.quit()

PhantomJS: a browser with no visible interface
Headless Chrome

from selenium import webdriver
from time import sleep

from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')


# Instantiate a Chrome browser object and load its driver
bro = webdriver.Chrome(executable_path='./chromedriver.exe',chrome_options=chrome_options)
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

bro.save_screenshot('./1.png')

print(bro.page_source)

How to reduce the risk of selenium being detected

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions


option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = Chrome(executable_path='./chromedriver.exe',options=option)

Case 4: simulated login to 12306

Chaojiying API

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the CAPTCHA that was recognized incorrectly
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

from selenium import webdriver
import time
import requests
from selenium.webdriver import ActionChains
from PIL import Image

# The PIL module is provided by the Pillow package

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/login/init')

time.sleep(3)

# Locate the img tag of the CAPTCHA; its position and size give the
# top-left and bottom-right corners of the CAPTCHA image
code_img_ele = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')
time.sleep(3)
location = code_img_ele.location  # top-left corner of the CAPTCHA image
print(location, ': top-left corner')

size = code_img_ele.size  # dimensions of the CAPTCHA (width and height)
print(size, ': size')
# Rectangle covering the CAPTCHA image (the area to crop)
rectangle = (
    int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))

# Take a screenshot of the whole login page
bro.save_screenshot('aa.png')

screenshot = Image.open('./aa.png')
# File name for the CAPTCHA image
code_img_name = 'code.png'
# Crop the CAPTCHA out of the screenshot using the rectangle
frame = screenshot.crop(rectangle)

frame.save(code_img_name)

# Arguments: username, password, software ID (generate your own under User Center >> Software ID)
chaojiying = Chaojiying_Client('martin144', 'martin144', '900215')

with open('./code.png', 'rb') as f:
    im = f.read()
result = chaojiying.PostPic(im, 9004)['pic_str']
# result is either a single point "x1,y1" or several points "x1,y1|x2,y2"
# Convert it to a list of points: [[x1,y1],[x2,y2]]
all_list = []  # coordinates returned by Chaojiying
for point in result.split('|'):
    x, y = point.split(',')
    all_list.append([int(x), int(y)])
print(all_list)

# Click each point in all_list; the offsets are relative to the CAPTCHA image
for x, y in all_list:
    ActionChains(bro).move_to_element_with_offset(code_img_ele, x, y).click().perform()
    time.sleep(1)

bro.find_element_by_id('username').send_keys('123ertghjk') # 12306 account
time.sleep(2)
bro.find_element_by_id('password').send_keys('asdfghjka') # 12306 password
time.sleep(2)
bro.find_element_by_id('loginSub').click()
time.sleep(10)
bro.quit()
bro.quit()