Web crawler basics: usage examples, practical tips, a summary of the key knowledge points, and things to watch out for
Crawling:
+ requests + Scrapy
Data analysis + machine learning
+ numpy, pandas, matplotlib
Jupyter:
+ Launch: change into the folder you want to work in, then run jupyter notebook
Cells come in different modes (Code: for writing code; Markdown: for writing notes)
Jupyter shortcuts:
Add a cell: a, b (a inserts a cell above, b inserts a cell below)
Delete a cell: x
Run: Shift+Enter (run and move the cursor to the next cell), Ctrl+Enter (run and keep the cursor in the current cell)
Tab: auto-completion. Switch cell mode: m (to Markdown), y (to Code)
Open the help tooltip: Shift + Tab
1. What is a crawler?
The process of writing a program that simulates a browser surfing the web and then lets it crawl data from the internet.
2. Types of crawlers:
General-purpose crawler: fetches the data of an entire page from the internet
Focused crawler: fetches only a specific portion of a page
Incremental crawler: monitors a website for updates so that only the newest data published by the site is crawled
3. Anti-crawling mechanisms
4. Counter-anti-crawling strategies (e.g. UA spoofing and proxies; a minimal sketch follows below)
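The anti-crawling mechanism these notes run into most often is User-Agent checking, and the matching counter strategy is UA spoofing. A minimal sketch, assuming Sogou as a stand-in target and reusing the UA string from the later examples:

import requests

# UA spoofing: send a browser-like User-Agent header so a simple
# anti-crawling check based on the UA does not reject the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get('https://www.sogou.com/', headers=headers)
print(response.status_code)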
5. Is crawling legal?
5.1 Where the risk of crawling shows up:
the crawler interferes with the normal operation of the visited website;
the crawler fetches specific types of data or information that are protected by law.
5.2 Mitigating the risk:
strictly follow the robots protocol configured by the website;
while working around anti-crawling measures, optimize your own code to avoid interfering with the normal operation of the visited website;
when using or distributing the crawled information, review the content; if it turns out to contain users' personal information, privacy, or someone else's trade secrets, stop immediately and delete it.
6. The robots protocol:
a plain-text convention: it keeps honest people honest but does not stop bad actors.
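A site's robots rules can also be checked programmatically; a minimal sketch using Python's standard urllib.robotparser (the Sogou urls are only illustrative):

from urllib import robotparser

# Read the site's robots.txt and ask whether a given path may be fetched
rp = robotparser.RobotFileParser()
rp.set_url('https://www.sogou.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.sogou.com/web'))  # True/False for the default user agent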
What is the requests module? A ready-made Python module that wraps network requests.
What is the requests module for? It simulates a browser sending requests.
Installing requests: pip install requests
The requests workflow: specify the url, send the request, get the response data, persist it.
import requests

# 1. Specify the url
url = 'https://www.sogou.com/'
# 2. Send a GET request: the return value is a response object
response = requests.get(url=url)
# 3. Get the response data (returned as a string)
page_text = response.text
# 4. Persist it to disk
with open('sogou.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
The parameters carried by the url need to be made dynamic:

import requests

url = 'https://www.sogou.com/web'
# Make the request parameters dynamic
wd = input('enter a key:')
params = {
    'query': wd
}
# Pass the parameter dict to the params argument of the get method
response = requests.get(url=url, params=params)
page_text = response.text
file_name = wd + '.html'
with open(file_name, encoding='utf-8', mode='w') as fp:
    fp.write(page_text)
Fix garbled characters by setting the response encoding:

import requests

url = 'https://www.sogou.com/web'
wd = input('enter a key:')
params = {
    'query': wd
}
response = requests.get(url=url, params=params)
response.encoding = 'utf-8'
page_text = response.text
filename = wd + '.html'
with open(filename, mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
UA spoofing: add a User-Agent header so the request looks like it comes from a browser:

import requests

url = 'https://www.sogou.com/web'
wd = input('enter a key:')
params = {
    'query': wd
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
filename = wd + '.html'
with open(filename, mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
Dynamically loaded page data is fetched through a separate request:

import requests

url = 'https://movie.douban.com/j/chart/top_list'
start = input('movie start index:')
end = input('movie end index:')
dic = {
    'type': '13',
    'interval_id': '100:90',
    'action': '',
    'start': start,
    'end': end
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get(url=url, params=dic, headers=headers)
page_text = response.json()  # .json() returns the deserialized object (a list of dicts here)
for dic in page_text:
    print(dic['title'] + dic['score'])
import requests

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
site = input('enter a location >>>')
for page in range(1, 5):
    data = {
        'cname': '',
        'pid': '',
        'keyword': site,
        'pageIndex': str(page),  # use the loop variable so each page is actually requested
        'pageSize': '10'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    }
    response = requests.post(url=url, data=data, headers=headers)
    print(response.json())
What data parsing is for: it is what lets us build a focused crawler.
How data parsing can be implemented: regex, bs4, xpath, pyquery
The general principle of data parsing:
1. the data a crawler fetches is stored inside the relevant tags and in those tags' attributes
2. locate the tag
3. take its text or its attributes
import requests
from urllib import request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}

# 1. Crawling byte-type data (how to download an image)
url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
img_data = requests.get(url=url).content  # use .content for byte data
with open('./img.jpg', mode='wb') as fp:
    fp.write(img_data)

# Alternative with urllib (drawback: cannot use UA spoofing)
# url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
# request.urlretrieve(url, filename='./qutu.jpg')
import os
import re
import requests
from urllib import request

# Crawl all images from pages 1-3 of the qiushibaike image board
# 1. Use a general-purpose crawl to fetch the page source of the first 3 pages
# (headers as defined in the previous cell)
# Create the output directory
dirName = './imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
# Generic url template (do not modify)
url = 'https://www.qiushibaike.com/imgrank/page/%d/'
# 2. Download the images
for page in range(1, 4):
    new_url = format(url % page)
    page_text = requests.get(url=new_url, headers=headers).text  # page source for each page number
    ex = '<div class="thumb">.*?<img src="(.*?)".*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    for src in img_src_list:
        src = 'https:' + src
        img_name = src.split('/')[-1]
        img_path = dirName + '/' + img_name  # ./imgLibs/xxxx.jpg
        request.urlretrieve(src, filename=img_path)
        print(img_name, 'downloaded successfully')
bs4 parsing: how it works:
instantiate a BeautifulSoup object and load the page source to be parsed into it;
call the BeautifulSoup object's methods and attributes to locate tags and extract data.
Environment setup:
pip install bs4
pip install lxml
Instantiating BeautifulSoup:
BeautifulSoup(fp, 'lxml'): loads the data of a locally stored html document into the BeautifulSoup object
BeautifulSoup(page_text, 'lxml'): loads page source fetched from the internet into the BeautifulSoup object
Locating tags:
soup.tagName: locates the first tagName tag that appears
attribute-based: soup.find('tagName', attrName='value')
attribute-based: soup.find_all('tagName', attrName='value'), returns a list
selector-based: soup.select('selector'), returns a list
hierarchy selectors: > means one level, a space means multiple levels
Taking text:
string: gets only the direct text content
.text: gets all the text content
Taking attributes:
tagName['attrName']
from bs4 import BeautifulSoup

fp = open('./test.html', mode='r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')

# Locating tags
print(soup.div)  # locates the first div that appears
# find-related
print(soup.find('div', class_='song'))  # only the class attribute needs the trailing underscore (class_)
print(soup.find('a', id='feng'))
print(soup.find_all('div', class_='song'))  # returns a list
# select-related
print(soup.select('#feng'))  # returns a list
print(soup.select('.tang > ul > li'))  # returns a list; > means one level
print(soup.select('.tang li'))  # returns a list; a space means multiple levels

# Taking text
a_tag = soup.select('#feng')[0]
print(a_tag.text)
div = soup.div
print(div.string)  # takes only the direct text content
div = soup.find('div', class_='song')
print(div.string)

# Taking attributes
a_tag = soup.select('#feng')[0]
print(a_tag['href'])
Crawl the full text of Romance of the Three Kingdoms (chapter titles + chapter content) from http://www.shicimingju.com/book/sanguoyanyi.html:

import requests
from bs4 import BeautifulSoup

# headers as defined above
fp = open('./sanguo.txt', mode='w', encoding='utf-8')
main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=main_url, headers=headers).text
soup1 = BeautifulSoup(page_text, 'lxml')
title_list = soup1.select('.book-mulu > ul > li > a')
for page in title_list:
    title = page.string
    title_url = 'https://www.shicimingju.com' + page['href']
    title_text = requests.get(url=title_url, headers=headers).text
    # Parse the chapter content from the detail page
    soup = BeautifulSoup(title_text, 'lxml')
    content = soup.find('div', class_='chapter_content').text
    fp.write(title + ':' + content + '\n')
    print(f'{title} downloaded successfully')
How xpath parsing works:
1. instantiate an etree object and load the page source to be parsed into it
2. call the etree object's xpath method with different xpath expressions to locate tags and extract data
Environment setup:
pip install lxml
Instantiating the object:
etree.parse('test.html') # local file
etree.HTML(page_text) # page fetched from the internet
xpath expressions: the xpath method always returns a list
a leading /: the expression must search and locate tags level by level, starting from the root tag
a leading //: the expression may locate the tag starting from any position
a non-leading /: means one level
a non-leading //: means crossing multiple levels
attribute-based location: //tagName[@attrName="value"]
index-based location: //tagName[index], indexing starts at 1
taking text: /text() gets the direct text content, //text() gets all the text content
taking attributes: /@attrName
from lxml import etree

tree = etree.parse('./test.html')

# Locating tags
print(tree.xpath('/html/head/title'))
print(tree.xpath('//title'))
print(tree.xpath('/html/body//p'))
print(tree.xpath('//p'))

# Attribute-based location
print(tree.xpath('//div[@class="song"]'))
print(tree.xpath('//li[3]'))  # returns element objects (their addresses)

# Taking text
print(tree.xpath('//a[@id="feng"]/text()')[0])  # xpath returns a list
print(tree.xpath('//div[@class="song"]//text()'))  # returns a list

# Taking attributes
print(tree.xpath('//a[@id="feng"]/@href'))  # returns a list
# Crawl the joke content and author names from qiushibaike
import requests
from lxml import etree

# headers as defined above
url = 'https://www.qiushibaike.com/text/'
page_text = requests.get(url, headers=headers).text
# Parse the content
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="content-left"]/div')
for div in div_list:
    author = div.xpath('./div[1]/a[2]/h2/text()')[0]  # local (relative) parsing
    content = div.xpath('./a[1]/div/span//text()')
    content = ''.join(content)
    print(author, content)
# https://www.aqistudy.cn/historydata/  crawl all city names
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
print(tree)
city_list1 = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
print(city_list1)
city_list2 = tree.xpath('//ul[@class="unstyled"]//li/a/text()')
print(city_list2)
# Use | to make the xpath expression more general: whichever sub-expression matches takes
# effect; if both match, both are applied
cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text() | //ul[@class="unstyled"]//li/a/text()')
print(cities)
# http://pic.netbian.com/4kmeinv/  handling garbled Chinese characters
import os
import requests
from lxml import etree

dirName = './meinvLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1, 11):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    else:
        new_url = format(url % page)
    page_text = requests.get(new_url, headers=headers).text
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    for a in a_list:
        img_src = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
        img_name = a.xpath('./b/text()')[0]
        img_name = img_name.encode('iso-8859-1').decode('gbk')  # re-encode and decode the garbled part
        img_data = requests.get(img_src, headers=headers).content
        imgPath = dirName + '/' + img_name + '.jpg'
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded successfully!')
HTTPConnectionPool errors: typically caused by sending requests too frequently over reused connections, or by the target site blocking the IP; the usual fixes are the Connection: close header and proxies (as in the code below).
Proxies: a proxy server accepts a request and forwards it on your behalf.
Anonymity levels: transparent, anonymous, high-anonymity.
Types: http, https
Free proxy sources: www.goubanjia.com, 快代理 (Kuaidaili), 西祠 (Xici)
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection': 'close'
}
# url = 'https://www.baidu.com/s?wd=ip'
url = 'http://ip.chinaz.com/'
page_text = requests.get(url=url, headers=headers, proxies={'http': '123.169.122.111:9999'}).text
with open('./ip.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
import random
import requests

# Use a list (not a set) so the proxy dicts can be stored and picked at random
proxy_list = [
    {'https': '121.231.94.44:8888'},
    {'https': '131.231.94.44:8888'},
    {'https': '141.231.94.44:8888'}
]
url = 'https://www.baidu.com/s?wd=ip'
page_text = requests.get(url=url, headers=headers, proxies=random.choice(proxy_list)).text
with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
from lxml import etree
import requests

# Fetch a batch of proxy IPs from the proxy provider's API page
ip_url = 'http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=4&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
page_text = requests.get(ip_url, headers=headers).text
tree = etree.HTML(page_text)
ip_list = tree.xpath('//body//text()')
print(ip_list)
import random
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection': 'close'
}
# url = 'https://www.xicidaili.com/nn/%d'  # Xici proxy (now offline)
url = 'https://www.kuaidaili.com/free/inha/%d/'
proxy_list_http = []
proxy_list_https = []
# ip_list comes from the previous cell
for page in range(1, 20):
    new_url = format(url % page)
    ip_port = random.choice(ip_list)
    page_text = requests.get(new_url, headers=headers, proxies={'https': ip_port}).text
    tree = etree.HTML(page_text)
    # tbody must not appear in the xpath expression; xpath indexing starts at 1
    tr_list = tree.xpath('//*[@id="list"]/table//tr')[1:]
    for tr in tr_list:
        ip = tr.xpath('./td[1]/text()')[0]  # xpath returns a list
        port = tr.xpath('./td[2]/text()')[0]
        t_type = tr.xpath('./td[4]/text()')[0]
        ips = ip + ':' + port
        if t_type == 'HTTP':
            proxy_list_http.append({t_type: ips})
        else:
            proxy_list_https.append({t_type: ips})
print(len(proxy_list_http), len(proxy_list_https))
# Check which of the collected proxies are usable
for ip in proxy_list_http:
    proxy = list(ip.values())[0]  # the 'ip:port' string stored in the dict
    response = requests.get('https://www.sogou.com', headers=headers, proxies={'https': proxy})
    if response.status_code == 200:
        print('found a usable ip')
Handling cookies
Manual handling: put the cookie into the headers dict.
Automatic handling: use a session object. You can create a session object that sends requests just like requests does;
the difference is that if a cookie is produced while sending requests through the session, that cookie is automatically stored inside the session object.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Cookie': 'device_id=24700f9f1986800ab4fcc880530dd0ed; xq_a_token=db48cfe87b71562f38e03269b22f459d974aa8ae; xqat=db48cfe87b71562f38e03269b22f459d974aa8ae; xq_r_token=500b4e3d30d8b8237cdcf62998edbf723842f73a; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYwNjk2MzA1MCwiY3RtIjoxNjA1NTM1Mjc2NzYxLCJjaWQiOiJkOWQwbjRBWnVwIn0.PhEaPnWolUZRgyuOY-QO04Bn_A_HYU46Hm54_kWBxa8IZ6cFw20trOr7rKp7XztprxEFc7fkMN2_5abfh1TUyyFKqTDn7IfoThXyJ2lJCnH33q1q-K9BclYvLHrLGqt8jQ3YOJi7-nyiSb5ZTNk7TLEhiFfsbXaZK9evNrt7W65MdxoEWyCcGjbhI5znffRxDDLHD9511bd9upY9CUGbf4SHQwwx4PxyQqdy9j5bgqPN6rsuHoCvjcr42DZYRd8B72uQTkFs-Lnru4AFxt4o4gdaxPo_Qd_IqzCrXnwoLtCdX6n4NKV44SryBttE0SKQC6UbqC35PwN-JqPeWCHKpQ; u=201605535281005; Hm_lvt_1db88642e346389874251b5a1eded6e3=1605354060,1605411081,1605535282; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1605535282'
}
params = {
    'status_id': '163425862',
    'page': '1',
    'size': '14'
}
url = 'https://xueqiu.com/statuses/reward/list_by_user.json?status_id=163425862&page=1&size=14'
page_text = requests.get(url=url, headers=headers, params=params).json()
print(page_text)
session = requests.Session()
# Handle cookies automatically: visiting the home page stores its cookie in the session,
# so it can be reused when crawling the other pages afterwards
session.get('https://xueqiu.com/', headers=headers)
url = 'https://xueqiu.com/statuses/reward/list_by_user.json?status_id=163425862&page=1&size=14'
page_text = session.get(url=url, headers=headers).json()
print(page_text)
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):  # username, password, and software id
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the incorrectly answered captcha
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()
import requests
from lxml import etree

def tranformImgData(imgPath, t_type):  # path of the captcha image and the captcha type code
    # the Chaojiying username, password and software id you registered
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, t_type)['pic_str']  # t_type is the type code of this image

# Crawl the captcha image from gushiwen.org, save it locally, send it to Chaojiying
# for recognition, and return the result
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]  # xpath returns a list
img_data = requests.get(img_src, headers=headers).content  # .content fetches the image bytes
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)
tranformImgData('./code.jpg', 1004)  # pass in the image path and type code; returns the recognized code
# Use the captcha obtained above to perform a simulated login
s = requests.Session()
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = s.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]
# The cookie is produced when the captcha image is requested, so fetching it through the
# session serves two purposes: 1. produce the cookie, 2. get the image
img_data = s.get(img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)

# Dynamically fetch the changing hidden form parameters
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# Recognize the captcha image with the Chaojiying helper defined earlier
code_text = tranformImgData('./code.jpg', 1004)
print(code_text)  # check whether the result looks correct

# login_url is the page requested when the login button is clicked; it is a POST request
login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from': 'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': code_text,
    'denglu': '登陸',
}
page_text = s.post(url=login_url, headers=headers, data=data).text
with open('login.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
Coroutines:
when a function (a "special function") is defined with the async keyword, calling it returns a coroutine object, and the statements inside the function body are not executed immediately.
Task objects
A task object is a further wrapper around a coroutine object: task object = higher-level coroutine object = special function.
Task objects must be registered with an event loop object.
Binding a callback to a task object: this is where data parsing happens in a crawler.
Event loop
Think of it as a container; the container must hold task objects;
once the event loop object is started, it executes the task objects stored inside it asynchronously.
aiohttp: a module that supports asynchronous network requests
import asyncio

def callback(task):  # callback function for the task object
    print('i am callback and', task.result())  # task.result() receives the special function's return value

async def test():
    print('i am test()')
    return 'bobo'

c = test()  # a coroutine object
task = asyncio.ensure_future(c)  # wrap it in a task object (a further wrapper around the coroutine object)
task.add_done_callback(callback)  # bind a callback to the task object
loop = asyncio.get_event_loop()  # create an event loop object
loop.run_until_complete(task)  # register the task object with the event loop and run it
import asyncio
import time

start = time.time()

# Code from modules that do not support async must not appear inside a special function
async def get_request(url):
    await asyncio.sleep(2)  # time.sleep() would not support async
    print('download finished:', url)

urls = [
    'www.1.com',
    'www.2.com'
]
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)  # create a task object
    # callbacks for multiple tasks could be bound here
    tasks.append(task)

loop = asyncio.get_event_loop()  # create an event loop object
# Note: suspension has to be handled manually: register the tasks with the event loop
# and run them (asyncio.wait suspends the tasks)
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
import aiohttp
import asyncio
import time

s = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay'
]

# requests does not support async, so use aiohttp to send the requests instead of:
# async def get_request(url):
#     page_text = requests.get(url).text
#     return page_text

async def get_request(url):
    async with aiohttp.ClientSession() as session:
        # Send a GET request; detail: add async before each with, and await before each blocking call
        async with await session.get(url=url) as response:
            page_text = await response.text()
            print(page_text)
            return page_text

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)  # wrap a task object
    tasks.append(task)

loop = asyncio.get_event_loop()  # create an event loop object
loop.run_until_complete(asyncio.wait(tasks))  # register the tasks with the event loop and run them
print(time.time() - s)
import aiohttp
import asyncio
import time
from lxml import etree

start = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom'
]

# The special function: sends the request and captures the response data
# Detail: add async before every with, and await before every blocking operation
async def get_request(url):
    async with aiohttp.ClientSession() as session:  # requests cannot send async requests, hence aiohttp
        # session.get(url, headers=headers, proxy="http://ip:port", params=params)
        async with await session.get(url) as response:
            page_text = await response.text()  # read() would return byte-type data instead
            return page_text

# Callback function (an ordinary function), used for data parsing
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    parse_data = tree.xpath('//li/text()')
    print(parse_data)

# Multiple tasks
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)  # wrap a task object
    task.add_done_callback(parse)  # the callback only runs after the task object has finished
    tasks.append(task)

# Register the multiple tasks with the event loop
loop = asyncio.get_event_loop()  # create an event loop object
loop.run_until_complete(asyncio.wait(tasks))  # register the tasks and start the loop; wait suspends the tasks
print(time.time() - start)
Concept: selenium is a module based on browser automation.
Relationship to crawling: it conveniently captures dynamically loaded data (what you can see, you can get); the drawback is that it is slow. It can also be used for simulated login.
Environment setup: pip install selenium
Basic usage: prepare the driver program matching your browser's version, then instantiate a browser object.
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.jd.com/')
sleep(1)
# Locate tags
search_input = bro.find_element_by_id('key')
search_input.send_keys('mac pro')
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# Execute JavaScript (scroll to the bottom of the page)
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
page_text = bro.page_source
print(page_text)
sleep(2)
bro.quit()
from selenium import webdriver
from time import sleep
from lxml import etree

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('http://scxk.nmpa.gov.cn:81/xk/')
sleep(2)
page_text = bro.page_source
page_text_list = [page_text]
for i in range(3):
    bro.find_element_by_id('pageIto_next').click()  # click "next page"
    sleep(2)
    page_text_list.append(bro.page_source)
for page_text in page_text_list:
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@id="gzlist"]/li')
    for li in li_list:
        title = li.xpath('./dl/@title')[0]
        num = li.xpath('./ol/@title')[0]
        print(title, num)
sleep(2)
bro.quit()
動做鏈:
一系列連續的動做在實現標籤訂位時,若是發現定位的標籤是存在於iframe標籤之中的,則在定位時必須執行一個固定的操做:bro.switch_to.frame('id')
若是裏面還嵌套了iframe
from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-example-draggable')
# The draggable div lives inside an iframe, so switch into it first
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')
print(div_tag)
# Dragging = click-and-hold + move
action = ActionChains(bro)
action.click_and_hold(div_tag)  # click and hold before moving
for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(17, 5).perform()
    sleep(0.5)
action.release()  # release the action chain
sleep(3)
bro.quit()
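For the nested-iframe case mentioned above, a minimal sketch: the url and the frame ids (outerFrame, innerFrame, target) are made-up placeholders, only the switching pattern matters.

from selenium import webdriver

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://example.com/')  # placeholder url

# Switch into the outer iframe first, then into the iframe nested inside it
bro.switch_to.frame('outerFrame')   # hypothetical id of the outer iframe
bro.switch_to.frame('innerFrame')   # hypothetical id of the nested iframe
tag = bro.find_element_by_id('target')  # tags inside the inner iframe can now be located

bro.switch_to.default_content()  # return to the top-level page when done
bro.quit()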
# Simulated login to 12306
from selenium import webdriver
from time import sleep
from PIL import Image
from selenium.webdriver import ActionChains
from Cjy import Chaojiying_Client  # the Chaojiying client class defined earlier

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/login/init')
sleep(5)
bro.save_screenshot('main.png')  # the screenshot must be saved as .png

code_img_tag = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')
location = code_img_tag.location
size = code_img_tag.size
print(location, type(location))
print(size)
# The region to crop out of the screenshot
rangle = (int(location['x']), int(location['y']),
          int(location['x'] + size['width']), int(location['y'] + size['height']))
print(rangle)
# Crop out the captcha image
i = Image.open('./main.png')
frame = i.crop(rangle)
frame.save('code.png')

def get_text(imgPath, imgType):
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, imgType)['pic_str']

# '55,70|267,133' ==> [[55, 70], [267, 133]]
result = get_text('./code.png', 9004)
all_list = []
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)

# Click each returned coordinate, offset relative to the captcha image
for a in all_list:
    x = a[0]
    y = a[1]
    ActionChains(bro).move_to_element_with_offset(code_img_tag, x, y).click().perform()
    sleep(1)

bro.find_element_by_id('username').send_keys('123456')
sleep(1)
bro.find_element_by_id('password').send_keys('67890000000')
sleep(1)
bro.find_element_by_id('loginSub').click()
sleep(5)
bro.quit()
無頭瀏覽器的操做:無可視化界面的瀏覽器,PhantomJs:中止更新了
谷歌無頭瀏覽器:讓selenium規避檢測,使用的是谷歌無頭瀏覽器
from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options  # copy-paste this boilerplate when needed

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# The path is the location of your browser driver; prefix it with r to prevent escape issues
driver = webdriver.Chrome(r'chromedriver.exe', chrome_options=chrome_options)
driver.get('https://www.cnblogs.com/')
print(driver.page_source)

# How to keep selenium from being detected.
# To check whether the evasion works, type window.navigator.webdriver in the browser console:
# undefined means the crawler works, True means the site has flagged it
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from time import sleep  # copy-paste this boilerplate when needed

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(r'chromedriver.exe', options=option)
driver.get('https://www.taobao.com/')