## I. Data Parsing

### 1. XPath parsing (works the same across scraping languages)

#### (1) Environment setup

```
pip install lxml
```
#### (2) Parsing workflow

```
- Fetch the page source data.
- Instantiate an etree object and load the page source into it.
- Call the object's xpath method to locate the target tags (the xpath method must be combined with an XPath expression to do the locating and content capture; see the sketch below).
```
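To make the three steps concrete, here is a minimal, self-contained sketch; the HTML snippet is invented for illustration, so no network access is needed:

```python
from lxml import etree

# Step 1: the page source (inlined here; a real crawler would fetch it with requests)
page = '''
<html><body>
  <div class="song"><a href="http://example.com">tune</a></div>
  <ul><li>first</li><li>second</li></ul>
</body></html>'''

# Step 2: instantiate an etree object from the page source
tree = etree.HTML(page)

# Step 3: locate tags and capture content with xpath expressions
print(tree.xpath('//div[@class="song"]/a/text()'))  # ['tune']
print(tree.xpath('//ul/li[2]/text()'))              # ['second']  (indexes start at 1)
print(tree.xpath('//a/@href'))                      # ['http://example.com']
```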
#### (3) XPath syntax (the return value is always a list)

```
Locating by attribute
    /    roughly equivalent to >  (at the start of an expression it begins from the root node)
    //   roughly equivalent to a descendant selector (' ')
    @    denotes an attribute
    e.g. //div[@class="song"]
Locating by index (indexes start at 1)
    //ul/li[2]
Logical operators
    //a[@href='' and @class='du']    and
    //a[@href='' or @class='du']     or
    expr1 | expr2                    union of the results of two expressions
Fuzzy matching
    //div[contains(@class,'ng')]
    //div[starts-with(@class,'ng')]
Extracting text
    //div/text()     text directly inside the tag
    //div//text()    text of all descendants (returns a list)
Extracting attributes
    //div/@href
```

#### (4) Case studies

##### Case 1: scraping second-hand housing data from 58.com

```python
import requests
from lxml import etree

url = 'https://bj.58.com/changping/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0000-1cc0-306c-511ad17612b3&ClickID=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
# fetch the page source and load it into an etree object
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# union of two expressions: listing titles plus the price blocks
title_price_list = tree.xpath('//ul[@class="house-list-wrap"]/li/div[2]/h2/a/text() | //ul[@class="house-list-wrap"]/li/div[3]//text()')
with open('./文件夾1/fangyuan.txt', 'w', encoding='utf-8') as f:
    for title_price in title_price_list:
        f.write(title_price)
print("over")
```

###### *Note: distinguish whether the parse source is the full page or a local subtree*

```
full page source   tree.xpath('//ul...')
local subtree      element.xpath('./ul...')   # starts with .
```

##### Testing whether an XPath expression is correct

###### Method 1: xpath.crx (browser plugin)

```
In the browser, open More tools > Extensions
Enable Developer mode
Drag xpath.crx into the browser window
Plugin launch shortcut: ctrl+shift+x
Purpose: testing whether an XPath expression is correct
```

###### Method 2: the browser's built-in support (e.g. $x("//div") in the DevTools console)

##### Case 2: scraping images from the 4K wallpaper site pic.netbian.com

```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
page_num = int(input("請輸入要爬取的頁數:"))  # prompt: number of pages to crawl
for page in range(1, page_num + 1):
    # page 1 is index.html; every later page is index_<n>.html
    if page == 1:
        url = 'http://pic.netbian.com/4kyingshi/index.html'
    else:
        url = 'http://pic.netbian.com/4kyingshi/index_%d.html' % page
    origin_data = requests.get(url=url, headers=headers).text
    tree = etree.HTML(origin_data)
    a_list = tree.xpath('//ul[@class="clearfix"]/li/a')
    for a in a_list:
        name = a.xpath('./b/text()')[0]
        # the site is GBK-encoded; undo requests' wrong iso-8859-1 guess
        name = name.encode('iso-8859-1').decode('gbk')
        src = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
        picture = requests.get(url=src, headers=headers).content
        picture_name = './文件夾2/' + name + '.jpg'
        with open(picture_name, 'wb') as f:
            f.write(picture)
print('over!!!')
```

###### The garbled-Chinese (mojibake) problem

```
Method 1: set the encoding on the response object
    response.encoding = 'gbk'
Method 2: re-encode with the codec requests guessed, then decode with the site's real encoding
    name = name.encode('iso-8859-1').decode('utf-8')
```

###### Where the parse source comes from

```
etree.HTML()    # parses page data fetched over the network
etree.parse()   # parses a local file
```
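Since both entry points come up in the cases here, a small sketch of the difference; the file name local.html is an assumption for illustration:

```python
from lxml import etree

# network data: the HTML already sits in a Python string (e.g. response.text)
tree = etree.HTML('<div class="song">network</div>')
print(tree.xpath('//div[@class="song"]/text()'))   # ['network']

# local data: parse a file on disk (local.html is a hypothetical file name);
# etree.parse defaults to an XML parser, so pass an HTMLParser for sloppy HTML
local_tree = etree.parse('local.html', etree.HTMLParser())
print(local_tree.xpath('//div//text()'))
```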
##### Case 3: scraping images from jandan.net

```python
import requests
from lxml import etree
import base64

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://jandan.net/ooxx'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# each image's real address is stored, base64-encoded, in a span tag
span_list = tree.xpath('//span[@class="img-hash"]/text()')
for span in span_list:
    src = 'http:' + base64.b64decode(span).decode("utf-8")
    picture_data = requests.get(url=src, headers=headers).content
    name = './文件夾3/' + src.split("/")[-1]
    with open(name, 'wb') as f:
        f.write(picture_data)
print('over!!!')
```

###### Anti-scraping mechanism 3: base64

In the returned response every image has the same src, but each image comes with a span tag holding an encrypted string, and a function named jandan_load_img appears alongside them, so the guess is that this function turns the string into the image address. A global search for jandan_load_img shows that it calls jdtPGUg7oYxbEGFASovweZE267FFvm5aYz; a global search for jdtPGUg7oYxbEGFASovweZE267FFvm5aYz shows that it ends with a call to base64_decode. Hence the conclusion: base64-decoding the string yields the image address. (The original notes illustrated each search step with screenshots.)

##### Case 4: scraping résumé templates from sc.chinaz.com

```python
import requests
from lxml import etree
import random

headers = {
    'Connection': 'close',  # see anti-scraping mechanism 4 below
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free.html'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
src_list = tree.xpath('//div[@id="main"]/div/div/a/@href')
for src in src_list:
    filename = './文件夾4/' + src.split('/')[-1].split('.')[0] + '.rar'
    print(filename)
    # open the detail page and pick one of the mirror download links at random
    down_page_data = requests.get(url=src, headers=headers).text
    detail_tree = etree.HTML(down_page_data)
    down_list = detail_tree.xpath('//div[@id="down"]/div[2]/ul/li/a/@href')
    res = random.choice(down_list)
    print(res)
    jianli = requests.get(url=res, headers=headers).content
    with open(filename, 'wb') as f:
        f.write(jianli)
print('over!!!')
```

###### Anti-scraping mechanism 4: Connection

The classic error:

```
HTTPConnectionPool(host:xx) Max retries exceeded with url
```

Causes:

```
1. Before each transfer the client establishes a TCP connection to the server. To save overhead, the connection defaults to keep-alive (connect once, transfer many times); but if connections are never released, the connection pool fills up, no new connection object can be created, and requests can no longer be sent.
2. The IP has been banned.
3. Requests are being sent too frequently.
```

Solutions (a minimal sketch follows this list):

```
1. Set the Connection request header to close, so the connection is dropped after each successful request.
2. Switch to a different request IP.
3. Use sleep to space out consecutive requests.
```
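A minimal sketch combining solutions 1 and 3; the URLs and the two-second delay are placeholders, not values from the original notes:

```python
import time
import requests

headers = {
    'Connection': 'close',  # solution 1: drop the TCP connection after every request
    'User-Agent': 'Mozilla/5.0'
}
# hypothetical page URLs, for illustration only
urls = ['http://example.com/page/%d' % i for i in range(1, 4)]
for u in urls:
    response = requests.get(url=u, headers=headers)
    print(u, response.status_code)
    time.sleep(2)  # solution 3: space out consecutive requests
```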
##### Case 5: parsing all city names from aqistudy.cn

```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# hot cities: the block heading plus the links underneath it
hot_list = tree.xpath('//div[@class="row"]/div/div[1]/div/text() | //div[@class="row"]/div/div[1]/div[@class="bottom"]/ul[@class="unstyled"]/li/a/text()')
with open('./文件夾1/city.txt', 'w', encoding='utf-8') as f:
    for hot in hot_list:
        f.write(hot.strip())
    # all cities, appended to the same file
    common_list = tree.xpath('//div[@class="row"]/div/div[2]/div[1]/text() | //div[@class="row"]/div/div[2]/div[2]/ul//text()')
    for common in common_list:
        f.write(common.strip())
print('over!!!')
```

##### Case 6: lazy-loaded images, wedding photos on sc.chinaz.com

```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://sc.chinaz.com/tupian/hunsha.html'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
div_list = tree.xpath('//div[@id="container"]/div')
for div in div_list:
    # the site is UTF-8; undo requests' wrong iso-8859-1 guess
    title = div.xpath('./p/a/text()')[0].encode('iso-8859-1').decode('utf-8')
    name = './文件夾1/' + title + '.jpg'
    photo_url = div.xpath('./div/a/@href')[0]
    # follow the detail page to reach the full-size image
    detail_data = requests.get(url=photo_url, headers=headers).text
    detail_tree = etree.HTML(detail_data)
    img_url = detail_tree.xpath('//div[@class="imga"]/a/img/@src')[0]
    picture = requests.get(url=img_url, headers=headers).content
    with open(name, 'wb') as f:
        f.write(picture)
print('over!!!')
```

###### Anti-scraping mechanism 5: proxy IPs

Usage:

```python
import requests
import random

# a small pool of free https proxies (addresses like these go stale quickly)
proxy_pool = [{'https': '116.197.134.153:80'}, {'https': '103.224.100.43:8080'}, {'https': '222.74.237.246:808'}]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'https://www.baidu.com/s?wd=ip'
origin_data = requests.get(url=url, headers=headers, proxies=random.choice(proxy_pool)).text
with open('./ip.html', 'w', encoding='utf-8') as f:
    f.write(origin_data)
print('over!!!')
```

Commonly used proxy sites:

```
www.goubanjia.com
快代理 (kuaidaili)
西祠代理 (xicidaili)
```

Proxy basics:

```
Transparent: the server knows a proxy is in use and knows your real IP
Anonymous:   the server knows a proxy is in use but does not know your real IP
Elite:       the server knows neither that a proxy is in use nor your real IP
```

*Note: the proxy type must match the scheme (http/https) of the request URL; see the sketch below.*

*To do: download movies from https://www.55xia.com*

*Order of investigation: dynamic loading, URL encryption, element*
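A small sketch of the scheme-matching rule; the proxy address is reused from the pool above and is likely stale:

```python
import requests

# requests picks the entry whose key matches the URL scheme, so an
# https:// URL needs an 'https' entry; with no matching key, requests
# skips the proxy and connects directly.
proxies = {'https': '116.197.134.153:80'}  # placeholder from the pool above

try:
    response = requests.get('https://www.baidu.com/s?wd=ip',
                            headers={'User-Agent': 'Mozilla/5.0'},
                            proxies=proxies, timeout=5)
    print(response.status_code)
except requests.exceptions.RequestException as e:
    print('proxy failed, pick another one:', e)  # free proxies expire quickly
```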