Python Web Scraping Roundup: I Spent 6k Learning to Scrape and Finally Know What Scrapers Can Do

Date: 2020-12-02

Tags: php css html python jquery web ajax chrome json api  Category: Python


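Scraping I: The Basics

Usage examples, practical tips, key points, and caveats for the basics of web scraping.

Getting started:

Scraping:

+ requests
+ Scrapy

Data analysis + machine learning:

+ numpy, pandas, matplotlib

Jupyter:

+ Launch: cd into the folder you want to work in, then run jupyter notebook

Cells have different modes (Code: for writing code; Markdown: for writing notes).

Jupyter shortcuts:

Add a cell: a, b (a inserts above, b inserts below)

Delete a cell: x (strictly speaking, x cuts the cell; dd deletes it)

Run: shift+enter (runs the cell and moves to the next one), ctrl+enter (runs the cell and stays on it)

tab: autocomplete

Switch the cell mode:

m: Markdown mode
y: code mode

Open the help documentation: shift+tab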

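Introduction to scrapers (background):

1. What is a scraper?

A program that imitates a browser going online and then fetches data from the internet.

2. Types of scraper:

General-purpose crawler: fetches entire pages from the internet

Focused crawler: fetches only a specific part of a page

Incremental crawler: monitors a site for updates so that only the newly published data is fetched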
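The incremental idea can be sketched with a set of fingerprints of items already seen: on each run, only items whose fingerprint is new are kept. This is a minimal illustration and not the method of any particular framework. The names `fingerprint` and `filter_new` and the example.com urls are made up for the example, and a real scraper would need to persist the set between runs.

```python
import hashlib

def fingerprint(item: str) -> str:
    """Return a stable fingerprint for a url or record (hypothetical helper)."""
    return hashlib.sha256(item.encode('utf-8')).hexdigest()

def filter_new(items, seen):
    """Return the items not seen before and record them in `seen`."""
    fresh = []
    for item in items:
        fp = fingerprint(item)
        if fp not in seen:
            seen.add(fp)
            fresh.append(item)
    return fresh

seen = set()
# First run: everything is new
first_run = filter_new(['https://example.com/a', 'https://example.com/b'], seen)
# Second run: /b was already fetched, so only /c survives
second_run = filter_new(['https://example.com/b', 'https://example.com/c'], seen)
print(first_run)
print(second_run)
```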

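3. Anti-scraping mechanisms

4. Counter-anti-scraping strategies

5. Is scraping legal?

5.1 Where the risk in scraping data lies:

The scraper disrupts the normal operation of the site it visits;

The scraper fetches specific kinds of data or information that are protected by law.

5.2 Reducing the risk:

Strictly follow the robots protocol the site publishes;

While working around anti-scraping measures, also tune your own code so that it does not disrupt the normal operation of the site;

When using or sharing scraped information, review what was scraped; if it turns out to contain users' personal information, private data, or someone else's trade secrets, stop immediately and delete it.

6. The robots protocol:

A plain-text protocol that works on the honor system: it stops well-behaved clients, not determined ones.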
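One way to honor the robots protocol in code is the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string so that it runs offline; against a real site you would point the parser at that site's robots.txt instead. The rules, the urls, and the helper name `allowed` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without a network
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url: str, user_agent: str = '*') -> bool:
    """Return True if the parsed rules let this user agent fetch the url."""
    return parser.can_fetch(user_agent, url)

print(allowed('https://example.com/index.html'))
print(allowed('https://example.com/private/data.html'))
```

A scraper could call such a check before each request and skip any url the rules disallow.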

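Basic use of the requests module:

What is the requests module? A ready-made Python module for making network requests.

What is the requests module for? Imitating a browser sending requests.

Installing the requests module: pip install requests

The requests workflow: specify the url, send the request, get the response data, persist it to storage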

Fetching the source of the Sogou home page

import requests
# 1. Specify the url
url = 'https://www.sogou.com/'
# 2. Send a GET request: get() returns a response object
response = requests.get(url=url)
# 3. Get the response data
page_text = response.text    # the response body as a string
# 4. Persist to storage
with open('sogou.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)

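A simple web page collector

Version 1:

The parameters carried by the url need to be made dynamic
import requests
url = 'https://www.sogou.com/web'
# Make the parameter dynamic
wd = input('enter a key:')
params = {
    'query': wd
}
# Pass the dict of request parameters to the params argument of get()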
response = requests.get(url=url, params=params)

page_text = response.text
file_name = wd+'.html'
with open(file_name,encoding='utf-8',mode='w') as fp:
    fp.write(page_text)
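To see what a params dict turns into on the wire, the standard library's `urlencode` can build the query string by hand. This is only an offline approximation of what requests does with `params=`; the search term is a made-up example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'https://www.sogou.com/web'
params = {'query': 'python 爬虫'}    # sample search term, chosen for illustration

# urlencode percent-encodes the value so it is safe to place in a url
full_url = base_url + '?' + urlencode(params)
print(full_url)

# Decoding the query string gives back the original value
decoded = parse_qs(urlsplit(full_url).query)
print(decoded['query'][0])
```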

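Version 2:

Running the code above reveals two problems: the text is garbled, and the amount of data returned is wrong
Fixing the garbled text: set the encoding used to decode the response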

import requests
url = 'https://www.sogou.com/web'
wd = input('enter a key')
params = {
    'query': wd
}
response = requests.get(url=url, params=params)
response.encoding = 'utf-8'
page_text = response.text
filename = wd + '.html'
with open(filename, mode='w', encoding='utf-8') as fp:
    fp.write(page_text)

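Version 3 (add headers to imitate a browser):

UA detection: the site checks the identity of the client that sent the request (its User-Agent) to decide whether the request came from a scraper
UA spoofing: send a browser's User-Agent, for example Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36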

import requests
url = 'https://www.sogou.com/web'
wd = input('enter a key')
params = {
    'query': wd
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
filename = wd + '.html'
with open(filename, mode='w', encoding='utf-8') as fp:
    fp.write(page_text)

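When a page refreshes partially

Fetching film details from Douban Movies: https://movie.douban.com/typerank?type_name=%E7%88%B1%E6%83%85&type=13&interval_id=100:90&action=
Analysis: when the scroll bar reaches the bottom of the page, the page refreshes partially (an ajax request)

Dynamically loaded page data
is data fetched by a separate request
import requests
url = 'https://movie.douban.com/j/chart/top_list'
start = input('first film: ')
end = input('last film: ')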
dic = {
    'type': '13',
    'interval_id': '100:90',
    'action': '',
     'start': start,
     'end': end
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get(url=url, params=dic, headers=headers)
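movies = response.json()    # json() returns the already-deserialized object (here a list of dicts)
for movie in movies:    # a separate loop name avoids shadowing the dic of request parameters
    print(movie['title'] + movie['score'])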

POST requests take their parameters in data: the KFC example

KFC restaurant lookup: http://www.kfc.com.cn/kfccda/storelist/index.aspx
Note: a GET request passes its parameters as params, whereas a POST request passes them as data

import requests
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
site = input('Enter a location>>')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
for page in range(1, 5):
    data = {
        'cname': '',
        'pid': '',
        'keyword': site,
        'pageIndex': str(page),    # request the current page rather than always page 1
        'pageSize': '10'
    }
    response = requests.post(url=url, data=data, headers=headers)
    print(response.json())

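Introduction to data parsing

Data parsing

What data parsing is for: it lets us build a focused crawler

Ways to parse data: regular expressions, bs4, xpath, pyquery

The general principle behind data parsing:

1. The data a scraper wants is stored inside tags and in the attributes of those tags

2. Locate the tag

3. Extract its text or its attributes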
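The three steps above (data lives in tags, locate the tag, take its text or attribute) can be illustrated with nothing but the standard library. This sketch uses `html.parser` rather than any of the parsers named above, purely so it runs without extra installs; the class name `LinkCollector` and the HTML fragment are invented for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None    # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Step 2: locate the tag
        if tag == 'a':
            self._href = dict(attrs).get('href')    # step 3: take an attribute
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)    # step 3: take the text

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

html = '<div class="song"><a href="/li-bai" id="feng">Li Bai</a><a href="/du-fu">Du Fu</a></div>'
collector = LinkCollector()
collector.feed(html)
collector.close()
print(collector.links)
```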

Fetching byte-type data

import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

# 1. Fetching byte-type data (how to download an image)
url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
img_data = requests.get(url=url).content    # use .content for byte data
with open('./img.jpg', mode='wb') as fp:
    fp.write(img_data)


# Drawback: urlretrieve cannot spoof the UA
from urllib import request
# url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
# request.urlretrieve(url, filename='./qutu.jpg')

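Regex parsing

First fetch the page with the general template, then use a regex to pick out the content you need

import os
import re
import requests
from urllib import request
# Download every image on pages 1-3 of Qiushibaike's image board
# Fetch the page source of the first 3 pages with a general crawler
# General url template (fixed); headers is the dict defined above
# 1. Create the directory
dirName = "./imgLibs"
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = "https://www.qiushibaike.com/imgrank/page/%d/"
# 2. Download the images
for page in range(1, 4):    # range(1, 4) covers pages 1, 2 and 3
    new_url = url % page
    page_text = requests.get(url=new_url, headers=headers).text    # page source for this page number
    ex = '<div class="thumb">.*?<img src="(.*?)".*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    for src in img_src_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dirName + '/' + img_name    # ./imgLibs/xxxx.jpg
        request.urlretrieve(src, filename=img_path)
        print(img_name, 'downloaded')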
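To watch the regex above work without hitting the network, it can be run against a small page fragment. The HTML and the pic.example.com urls below are made up to match the shape the pattern expects; the real page may differ.

```python
import re

# Made-up page fragment in the shape the pattern expects
page_text = '''
<div class="thumb">
  <a href="/article/1"><img src="//pic.example.com/a/one.jpg" alt="one"></a>
</div>
<div class="thumb">
  <a href="/article/2"><img src="//pic.example.com/b/two.jpg" alt="two"></a>
</div>
'''

ex = '<div class="thumb">.*?<img src="(.*?)".*?</div>'
# re.S lets . match newlines, so one match can span several lines
img_src_list = re.findall(ex, page_text, re.S)
# The file name is the last path segment of each src
img_names = [src.split('/')[-1] for src in img_src_list]
print(img_src_list)
print(img_names)
```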

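bs4 parsing

1. Key points:

How bs4 parsing works:

Instantiate a BeautifulSoup object and load the page source to be parsed into it

Call the object's methods and attributes to locate tags and extract data

Installation:

pip install bs4

pip install lxml

Instantiating BeautifulSoup:

BeautifulSoup(fp, 'lxml'): loads the contents of a locally stored html document into the BeautifulSoup object

BeautifulSoup(page_text, 'lxml'): loads page source fetched from the internet into the BeautifulSoup object

Locating tags:

soup.tagName: locates the first tagName tag that appears

Attribute lookup: soup.find('tagName', attrName='value')

Attribute lookup: soup.find_all('tagName', attrName='value'), which returns a list

Selector lookup: soup.select('selector'), which returns a list

Hierarchy selectors: > means exactly one level; a space means any number of levels

Extracting text:

.string: gets only the tag's own direct text

.text: gets all the text inside the tag

Extracting attributes:

tagName['attrName']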

2. Code example:

# Locating tags
from bs4 import BeautifulSoup
fp = open('./test.html', mode='r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')

print(soup.div)    # the first div that appears
# find
print(soup.find('div', class_='song'))    # only class needs the trailing underscore, because class is a Python keyword
print(soup.find('a', id='feng'))
print(soup.find_all('div', class_='song'))    # returns a list
# select
print(soup.select('#feng'))    # returns a list
print(soup.select('.tang > ul > li'))    # returns a list; > means one level
print(soup.select('.tang li'))    # returns a list; a space means any number of levels
# Extracting text
a_tag = soup.select("#feng")[0]
print(a_tag.text)
div = soup.div
print(div.string)    # the direct text content
div = soup.find('div', class_='song')
print(div.string)
# Extracting attributes
a_tag = soup.select('#feng')[0]
print(a_tag['href'])
fp.close()

3. Worked example:

Fetch the full text of Romance of the Three Kingdoms (chapter titles + chapter text): http://www.shicimingju.com/book/sanguoyanyi.html
fp = open('./sanguo.txt', mode='w', encoding='utf-8')
main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=main_url, headers=headers).text
soup1 = BeautifulSoup(page_text, 'lxml')
title_list = soup1.select('.book-mulu > ul > li > a')
for page in title_list:
    title = page.string
    title_url = 'https://www.shicimingju.com' + page['href']
    title_text = requests.get(url=title_url, headers=headers).text
    # Parse the chapter text out of the detail page
    soup = BeautifulSoup(title_text, 'lxml')
    content = soup.find('div', class_='chapter_content').text
    fp.write(title + ':' + content + '\n')
    print(f'{title} downloaded')
fp.close()

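xpath parsing

1. Key points:

How xpath parsing works:

1. Instantiate an etree object and load the page source to be parsed into it

2. Call the etree object's xpath method with different xpath expressions to locate tags and extract data

Installation:

pip install lxml

Instantiating the etree object:

etree.parse('test.html') # local file

etree.HTML(page_text) # page fetched from the internet

xpath expressions: for the tag-locating expressions used here, the xpath method always returns a list

A leading /: the expression must start at the root tag and walk down level by level

A leading //: the expression can locate the tag from any position

A non-leading /: one level

A non-leading //: any number of levels

Attribute lookup: //tagName[@attrName="value"]

Index lookup: //tagName[index] (indexes start at 1)

Extracting text: /text() gets the direct text; //text() gets all the text

Extracting attributes: /@attrName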
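The attribute predicate and the 1-based index can be tried out offline with the standard library's `xml.etree.ElementTree`. Note that this is a swapped-in tool, not lxml: ElementTree supports only a subset of XPath and has no text() or /@attr steps, so text and attributes are read from the matched element instead. The document fragment is invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<div class="song"><p>first</p><p>second</p></div>'
    '<div class="tang"><a id="feng" href="http://example.com/feng">feng</a></div>'
    '</body></html>'
)

# Attribute predicate: every div whose class is "song"
song_divs = doc.findall('.//div[@class="song"]')
# Index predicate: positions start at 1, so p[2] is the second <p>
second_p = doc.find('.//div[@class="song"]/p[2]')
# No text() or /@attr in ElementTree: use .text and .get() on the element
link = doc.find('.//a[@id="feng"]')

print(len(song_divs), second_p.text, link.text, link.get('href'))
```

With lxml the last two lookups would usually be written as a single expression ending in /text() or /@href, as described above.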

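2. Code example:

from lxml import etree
tree = etree.parse('./test.html', etree.HTMLParser())    # HTMLParser tolerates html that is not well-formed xml
# Locating tags
print(tree.xpath('/html/head/title'))
print(tree.xpath('//title'))
print(tree.xpath('/html/body//p'))
print(tree.xpath('//p'))
# Attribute and index lookup
print(tree.xpath('//div[@class="song"]'))
print(tree.xpath('//li[3]'))    # returns a list of Element objects
# Extracting text
print(tree.xpath('//a[@id="feng"]/text()')[0])    # xpath returns a list; [0] takes the first item
print(tree.xpath('//div[@class="song"]//text()'))    # returns a list
# Extracting attributes
print(tree.xpath('//a[@id="feng"]/@href'))    # returns a list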

3. Worked example:

# Fetch the jokes and author names from Qiushibaike
url = 'https://www.qiushibaike.com/text/'
page_text = requests.get(url, headers=headers).text

# Parse the content
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="content-left"]/div')
for div in div_list:
    author = div.xpath('./div[1]/a[2]/h2/text()')[0]    # local parse, relative to this div
    content = div.xpath('./a[1]/div/span//text()')
    content = ''.join(content)

    print(author, content)

4. Making xpath more general

Fetch every city name from https://www.aqistudy.cn/historydata/
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
print(tree)
city_list1 = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
print(city_list1)
city_list2 = tree.xpath('//ul[@class="unstyled"]//li/a/text()')
print(city_list2)
# Use | to make the xpath more general: the result is the union of what the two expressions match, so whichever side matches contributes its nodes
cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text() | //ul[@class="unstyled"]//li/a/text()')
print(cities)

Handling garbled Chinese text

# Handling garbled Chinese text on http://pic.netbian.com/4kmeinv/
dirName = './meinvLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1, 11):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    else:
        new_url = url % page
    page_text = requests.get(new_url, headers=headers).text
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    for a in a_list:
        img_src = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
        img_name = a.xpath('./b/text()')[0]
        img_name = img_name.encode('iso-8859-1').decode('gbk')    # re-encode the garbled text, then decode it with the right codec
        img_data = requests.get(img_src, headers=headers).content
        imgPath = dirName + '/' + img_name + '.jpg'
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded!')
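The encode-then-decode trick used above can be reproduced offline. The sketch below first creates the garbling deliberately (GBK bytes decoded as ISO-8859-1, which is what happens when the response encoding is guessed wrong) and then reverses it. The sample string is made up.

```python
original = '美女壁纸'    # sample text, chosen for illustration

# Simulate the problem: GBK bytes wrongly decoded as ISO-8859-1
garbled = original.encode('gbk').decode('iso-8859-1')

# Undo the wrong step to recover the raw bytes, then decode with the right codec
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(repaired)
```

This works because ISO-8859-1 maps every byte to exactly one character, so re-encoding with it restores the original bytes without loss.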

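Proxies

HTTPConnectionPool errors:

Causes: a burst of high-frequency requests got the ip banned; or the connections in the http connection pool were exhausted
Fixes: use a proxy; or add Connection: "close" to the headers

Proxy: a proxy server, which accepts a request and forwards it

Anonymity levels:

Elite (high anonymity): the target server learns nothing
Anonymous: it knows you are using a proxy but not your real ip
Transparent: it knows you are using a proxy and knows your real ip

Types: http, https

Free proxies: www.goubanjia.com, Kuaidaili, Xici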


A simple example of using a proxy

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection': 'close'
}
# url = 'https://www.baidu.com/s?wd=ip'
url = 'http://ip.chinaz.com/'
page_text = requests.get(url=url, headers=headers, proxies={'http': 'http://123.169.122.111:9999'}).text
with open('./ip.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)

Proxy pool

Proxy pool: building your own

import random
# A list, not a set: dicts are unhashable and random.choice needs a sequence
proxy_list = [
    {'https': '121.231.94.44:8888'},
    {'https': '131.231.94.44:8888'},
    {'https': '141.231.94.44:8888'}
]
url = 'https://www.baidu.com/s?wd=ip'
page_text = requests.get(url=url, headers=headers, proxies=random.choice(proxy_list)).text
with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

Extracting proxy ips from the Daili Jingling (代理精靈) service, to get a batch of ips for building your own proxy pool

from lxml import etree
ip_url = 'http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=4&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
page_text = requests.get(ip_url, headers=headers).text
tree = etree.HTML(page_text)
ip_list = tree.xpath('//body//text()')
print(ip_list)

Step 1: scrape the ip, port, and http type

Scrape Xici Proxy (now defunct) for usable ips to build your own proxy pool

import random
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection': "close"
}
# url = 'https://www.xicidaili.com/nn/%d'    # Xici Proxy (now defunct)
url = 'https://www.kuaidaili.com/free/inha/%d/'
proxy_list_http = []
proxy_list_https = []
for page in range(1, 20):
    new_url = url % page
    ip_port = random.choice(ip_list)    # ip_list comes from the previous snippet
    page_text = requests.get(new_url, headers=headers, proxies={'https': ip_port}).text
    tree = etree.HTML(page_text)
    # tbody must not appear in the xpath expression
    tr_list = tree.xpath('//*[@id="list"]/table//tr')[1:]    # no tbody here; [1:] drops the header row
    for tr in tr_list:
        ip = tr.xpath('./td[1]/text()')[0]    # xpath returns a list
        port = tr.xpath('./td[2]/text()')[0]
        t_type = tr.xpath('./td[4]/text()')[0]    # ./ keeps the lookup relative to this row
        ips = ip + ":" + port
        dic = {t_type.lower(): ips}    # requests expects lowercase scheme keys
        if t_type == 'HTTP':
            proxy_list_http.append(dic)
        else:
            proxy_list_https.append(dic)
print(len(proxy_list_http), len(proxy_list_https))

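Step 2: test the proxies and keep the ones that work

available = []
for proxy in proxy_list_http:
    try:
        # An http:// url is used so that the proxy's 'http' entry is actually applied
        response = requests.get('http://www.sogou.com', headers=headers, proxies=proxy, timeout=5)
    except requests.RequestException:
        continue    # dead or unreachable proxy
    if response.status_code == 200:    # status_code is an int, not a string
        available.append(proxy)
        print('found a usable ip')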
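Handling cookies

Manual handling: put the cookie into headers

Automatic handling: the session object. You can create a session object and send requests with it just as you would with requests;

the difference is that if a cookie is produced while the session is sending requests, it is stored in the session object automatically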
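For the manual route, a cookie string copied from the browser's developer tools can be split into name/value pairs with the standard library before it is reused. This is a small sketch; the cookie string below is made up, and the resulting dict is the kind of mapping that the `cookies=` argument of requests accepts.

```python
from http.cookies import SimpleCookie

# Made-up cookie header of the kind copied from browser dev tools
raw_cookie = 'device_id=abc123; u=201605535281005; theme=dark'

jar = SimpleCookie()
jar.load(raw_cookie)
# Each entry is a Morsel; .value holds the cookie's value
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)
```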

Example: fetching news data from Xueqiu https://xueqiu.com/

Manual

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Cookie':'device_id=24700f9f1986800ab4fcc880530dd0ed; xq_a_token=db48cfe87b71562f38e03269b22f459d974aa8ae; xqat=db48cfe87b71562f38e03269b22f459d974aa8ae; xq_r_token=500b4e3d30d8b8237cdcf62998edbf723842f73a; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYwNjk2MzA1MCwiY3RtIjoxNjA1NTM1Mjc2NzYxLCJjaWQiOiJkOWQwbjRBWnVwIn0.PhEaPnWolUZRgyuOY-QO04Bn_A_HYU46Hm54_kWBxa8IZ6cFw20trOr7rKp7XztprxEFc7fkMN2_5abfh1TUyyFKqTDn7IfoThXyJ2lJCnH33q1q-K9BclYvLHrLGqt8jQ3YOJi7-nyiSb5ZTNk7TLEhiFfsbXaZK9evNrt7W65MdxoEWyCcGjbhI5znffRxDDLHD9511bd9upY9CUGbf4SHQwwx4PxyQqdy9j5bgqPN6rsuHoCvjcr42DZYRd8B72uQTkFs-Lnru4AFxt4o4gdaxPo_Qd_IqzCrXnwoLtCdX6n4NKV44SryBttE0SKQC6UbqC35PwN-JqPeWCHKpQ; u=201605535281005; Hm_lvt_1db88642e346389874251b5a1eded6e3=1605354060,1605411081,1605535282; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1605535282'
}
params = {
    'status_id': '163425862',
    'page': '1',
    'size': '14'
}
url = 'https://xueqiu.com/statuses/reward/list_by_user.json'    # the query string is supplied through params, so it is not repeated here
page_text = requests.get(url=url, headers=headers, params=params).json()
print(page_text)

Automatic: let the session store the cookies

session = requests.Session()
session.get('https://xueqiu.com/', headers=headers)    # the home page's cookies are stored in the session automatically and reused for later requests
url = 'https://xueqiu.com/statuses/reward/list_by_user.json?status_id=163425862&page=1&size=14'
page_text = session.get(url=url, headers=headers).json()
print(page_text)

Recognizing captchas

step1: the sample code from Chaojiying

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):    # username, password, and software id
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the captcha being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

step2: recognizing the captcha on Gushiwen

def tranformImgData(imgPath, t_type):    # path of the captcha image and the captcha type
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')    # your registered Chaojiying username, password, and software id
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, t_type)['pic_str']    # t_type is the type code of the image
# Fetch the captcha image from Gushiwen, save it locally, send it to Chaojiying for recognition, and get back the result
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]    # xpath returns a list
img_data = requests.get(img_src, headers=headers).content    # .content for image bytes
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)
tranformImgData('./code.jpg', 1004)    # pass in the image path and type; returns the recognized code

step3: simulated login (important); the goal is to get the cookie stored in the session

# Log in using the captcha recognized above
s = requests.Session()
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = s.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = s.get(img_src, headers=headers).content    # the cookie is produced when the captcha image is requested, so this request yields both the cookie and the image
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)

# Read the parameters that change on every page load
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
# Recognize the captcha image with Chaojiying
code_text = tranformImgData('./code.jpg', 1004)
print(code_text)    # check that it looks right
# login_url is the url requested when the login button is clicked; it is a POST request
login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from':'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': code_text,
    'denglu': '登录',    # label of the login button, sent as a form value
}
page_text = s.post(url=login_url, headers=headers, data=data).text
with open('login.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)

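Single thread + multi-task asynchronous coroutines

Coroutines:

If a function is defined with the async modifier (a "special function"), calling it returns a coroutine object, and the statements inside the function are not executed immediately

Task objects

A task object is a further wrapper around a coroutine object. Task object == higher-level coroutine object == special function

A task object must be registered with an event loop object

Binding a callback to a task object: used for data parsing in a scraper

Event loop

Think of it as a container, one that must hold task objects;

once the event loop object is started, it executes the task objects stored inside it asynchronously.

aiohttp: a module that supports asynchronous network requests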
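The relationship between coroutine, task, callback, and event loop can be seen in a few lines. The sketch below uses `asyncio.run` and `asyncio.create_task`, the higher-level entry points in current Python, instead of managing the loop by hand; `asyncio.sleep` stands in for a network request, and the names `fake_download` and `on_done` are invented for the example.

```python
import asyncio
import time

async def fake_download(name: str, delay: float) -> str:
    # asyncio.sleep stands in for a non-blocking network request
    await asyncio.sleep(delay)
    return f'{name} done'

def on_done(task: asyncio.Task) -> None:
    # Callback bound to the task; task.result() is the coroutine's return value
    print('callback got:', task.result())

async def main() -> list:
    tasks = []
    for name in ('page-1', 'page-2', 'page-3'):
        task = asyncio.create_task(fake_download(name, 0.2))    # wrap the coroutine in a task
        task.add_done_callback(on_done)
        tasks.append(task)
    # gather waits for every task and returns the results in order
    return await asyncio.gather(*tasks)

start = time.perf_counter()
results = asyncio.run(main())    # creates the event loop, runs main(), closes the loop
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

Because the three waits overlap, the elapsed time should be close to one delay rather than the sum of all three.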

1. Coroutines

import asyncio
def callback(task):    # callback for the task object
    print('i am callback and ', task.result())    # task.result() is the value returned by the special function

async def test():
    print('i am test()')
    return 'bobo'

c = test()
# Wrap the coroutine object in a task object
task = asyncio.ensure_future(c)    # create the task object
task.add_done_callback(callback)    # bind the callback to the task object
# Create an event loop object
loop = asyncio.get_event_loop()    # get the event loop object
loop.run_until_complete(task)    # register the task with the event loop and run it

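2. Multiple tasks

import asyncio
import time
start = time.time()
# Code from modules that do not support async must not appear inside the special function
async def get_request(url):
    await asyncio.sleep(2)    # time.sleep would block here because it does not support async
    print('downloaded:', url)

urls = [
    'www.1.com',
    'www.2.com'
]
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)    # create the task object
    # with multiple tasks, callbacks can be bound here
    tasks.append(task)

loop = asyncio.get_event_loop()    # get the event loop object
# Note: suspension has to be handled explicitly
loop.run_until_complete(asyncio.wait(tasks))    # register the tasks with the event loop and start it (wait suspends until they finish)
print(time.time()-start)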

3. Use in a scraper

import requests
import aiohttp
import time
import asyncio
s = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay'
]

# async def get_request(url):
#     page_text = requests.get(url).text
#     return page_text

# Send the request with aiohttp, which supports async; requests does not
async def get_request(url):
    async with aiohttp.ClientSession() as session:    # named session so it does not shadow the timer s
        async with await session.get(url=url) as response:    # send a GET request; rule of thumb: async before every with, await before every blocking call
            page_text = await response.text()
            print(page_text)
    return page_text
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)    # wrap the coroutine in a task object
    tasks.append(task)

loop = asyncio.get_event_loop()    # get the event loop object
loop.run_until_complete(asyncio.wait(tasks))    # register the tasks with the event loop and start it (wait suspends until they finish)
print(time.time()-s)

4. Summary of single thread + multi-task asynchronous coroutines:

import aiohttp
import asyncio
import time
from lxml import etree
start = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom'
]

# The special function: sends the request and captures the response data
# Rule of thumb: put async before every with, and await before every blocking operation
async def get_request(url):
    async with aiohttp.ClientSession() as s:    # requests cannot send asynchronous requests, so aiohttp is used
        # s.get(url, headers=headers, proxy="http://ip:port", params=params)
        async with await s.get(url) as response:
            page_text = await response.text()    # text() returns a string; read() returns bytes
            return page_text

# Callback (an ordinary function)
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    parse_data = tree.xpath('//li/text()')
    print(parse_data)
# Multiple tasks
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)    # wrap the coroutine in a task object
    task.add_done_callback(parse)    # the callback runs only after the task has finished
    tasks.append(task)

# Register the tasks with the event loop
loop = asyncio.get_event_loop()    # get the event loop object
loop.run_until_complete(asyncio.wait(tasks))    # register the tasks with the event loop and start it; wait suspends until they finish

print(time.time()-start)

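The selenium module

Using the selenium module in scrapers

Concept: a module built on browser automation

How it relates to scraping: it conveniently captures dynamically loaded data (what you see is what you get), at the cost of speed; it can also carry out simulated logins

Installation: pip install selenium

Basic use: get the driver program for your browser, making sure the driver version matches the browser version, then instantiate a browser object

Basic selenium operations: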

from selenium import webdriver
from time import sleep
bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.jd.com/')
sleep(1)
# Locate the elements
search_input = bro.find_element_by_id('key')
search_input.send_keys('mac pro')

btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)

# Run JavaScript
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
page_text = bro.page_source
print(page_text)
sleep(2)
bro.quit()

Fetching dynamically loaded data with selenium

from selenium import webdriver
from time import sleep
from lxml import etree
bro = webdriver.Chrome(executable_path='chromedriver.exe')

bro.get('http://scxk.nmpa.gov.cn:81/xk/')
sleep(2)
page_text = bro.page_source
page_text_list = [page_text]

for i in range(3):
    bro.find_element_by_id('pageIto_next').click()    # click "next page"
    sleep(2)
    page_text_list.append(bro.page_source)

for page_text in page_text_list:
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@id="gzlist"]/li')
    for li in li_list:
        title = li.xpath('./dl/@title')[0]
        num = li.xpath('./ol/@title')[0]
        print(title, num)

sleep(2)
bro.quit()

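Action chains

Action chain: a series of consecutive actions

When locating a tag, if the tag turns out to live inside an iframe, you must first switch into that frame: bro.switch_to.frame('id')

The same applies if further iframes are nested inside it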

from selenium import webdriver
from time import sleep
from selenium.webdriver import ActionChains
bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-example-draggable')
# The target element sits inside an iframe, so switch into it first
bro.switch_to.frame('iframeResult')

div_tag = bro.find_element_by_id('draggable')
print(div_tag)

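# Drag = click and hold + move
action = ActionChains(bro)
action.click_and_hold(div_tag)    # click and hold the element

for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(17, 5).perform()
    sleep(0.5)
ActionChains(bro).release().perform()    # release the mouse button; nothing happens until perform() is called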

sleep(3)
bro.quit()

Simulated login to 12306

# Simulated login to 12306
from selenium import webdriver
from time import sleep
from PIL import Image
from selenium.webdriver import ActionChains
from Cjy import Chaojiying_Client
bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/login/init')
sleep(5)
bro.save_screenshot('main.png')    # save_screenshot expects a .png file name

code_img_tag = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')

location = code_img_tag.location
size = code_img_tag.size
print(location, type(location))
print(size)
# The region to crop
rangle = (int(location['x']),int(location['y']),int(location['x']+size['width']),int(location['y']+size['height']))

print(rangle)
# Crop the image
i = Image.open('./main.png')
frame = i.crop(rangle)
frame.save('code.png')


def get_text(imgPath,imgType):
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, imgType)['pic_str']


# '55,70|267,133' becomes [[55, 70], [267, 133]]
result = get_text('./code.png',9004)
all_list = []
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
# action = ActionChains(bro)
for a in all_list:
    x = a[0]
    y = a[1]
    ActionChains(bro).move_to_element_with_offset(code_img_tag,x,y).click().perform()
    sleep(1)

bro.find_element_by_id('username').send_keys('123456')
sleep(1)
bro.find_element_by_id('password').send_keys('67890000000')
sleep(1)
bro.find_element_by_id('loginSub').click()

sleep(5)
bro.quit()
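The coordinate parsing in the login script above handles the single-point and multi-point cases in separate branches. Since `split('|')` on a string without a `|` returns a one-item list, both cases can share one loop. The helper name `parse_points` is made up for this sketch, and the input format is assumed to be exactly the `x,y|x,y` shape shown in the script's comment.

```python
def parse_points(result: str) -> list:
    """Turn 'x1,y1|x2,y2' (or a single 'x,y') into [[x1, y1], [x2, y2]]."""
    points = []
    for pair in result.split('|'):
        x, y = pair.split(',')
        points.append([int(x), int(y)])
    return points

print(parse_points('55,70|267,133'))
print(parse_points('120,45'))
```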

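Other selenium operations

Overview:

Headless browsing: a browser with no visible window. PhantomJS is no longer maintained

Headless Chrome: the examples here use headless Chrome, and also show how to keep selenium from being detected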

from selenium import webdriver
from time import sleep

# Copy and paste this when you need it
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# The path to your browser driver goes here; the r prefix keeps backslashes from being treated as escapes
driver = webdriver.Chrome(r'chromedriver.exe', options=chrome_options)
driver.get('https://www.cnblogs.com/')
print(driver.page_source)
# How to keep selenium from being detected
# To check, type window.navigator.webdriver in the browser console: undefined means the automation is hidden, true means the site can detect it
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from time import sleep

# Copy and paste this when you need it
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = webdriver.Chrome(r'chromedriver.exe',options=option)
driver.get('https://www.taobao.com/')