# Data Parsing


## I. Data Parsing

### 1. XPath parsing (works the same across scraping languages)

#### (1) Environment setup

```
pip install lxml
```

#### (2) Parsing workflow

```
- Fetch the page's source HTML
- Instantiate an etree object and load the page source into it
- Call the object's xpath method to locate the target tags (the xpath method must be given an XPath expression, which does the tag location and content capture)
```
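
A minimal sketch of these three steps, assuming a placeholder URL and XPath:

```python
import requests
from lxml import etree

# 1. Fetch the page's source HTML (placeholder URL)
origin_data = requests.get('https://example.com').text
# 2. Load the source into an etree object
tree = etree.HTML(origin_data)
# 3. Locate tags with an XPath expression; the result is always a list
titles = tree.xpath('//h1/text()')
print(titles)
```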

#### (3) XPath syntax (the return value is a list)

```
Attribute location
    / is equivalent to CSS's > (at the start of an expression it always begins from the root node)
    // is equivalent to CSS's descendant combinator ' '
    @ denotes an attribute
    e.g. //div[@class="song"]
Index location (indices start at 1)
    //ul/li[2]
Logical operators
    //a[@href='' and @class='du']  both conditions (and)
    //a[@href='' or @class='du']   either condition (or; | is the union operator and joins two complete expressions, not predicates)
Fuzzy matching
    //div[contains(@class,'ng')]
    //div[starts-with(@class,'ng')]
Text extraction
    //div/text()   text directly inside the node
    //div//text()  text of all descendants (returns a list)
Attribute extraction
    //div/@href
```
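
The rules above, exercised against a small made-up HTML snippet:

```python
from lxml import etree

html = '''
<div class="song">
  <ul>
    <li><a href="" class="du">first</a></li>
    <li><a href="/second.html">second</a></li>
  </ul>
</div>'''
tree = etree.HTML(html)
print(tree.xpath('//div[@class="song"]'))           # attribute location
print(tree.xpath('//ul/li[2]/a/text()'))            # index location, starts at 1
print(tree.xpath("//a[@href='' and @class='du']"))  # logical and
print(tree.xpath('//div[contains(@class,"ng")]'))   # fuzzy match ("song" contains "ng")
print(tree.xpath('//li/a/text()'))                  # direct text
print(tree.xpath('//div//text()'))                  # descendant text, returns a list
print(tree.xpath('//a/@href'))                      # attribute extraction
```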

#### (4) Examples

##### Example 1: scraping second-hand housing data from 58.com

```python
import requests
from lxml import etree

url = 'https://bj.58.com/changping/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0000-1cc0-306c-511ad17612b3&ClickID=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# Titles live under div[2]/h2/a, prices under div[3]; | unions the two node sets
title_price_list = tree.xpath('//ul[@class="house-list-wrap"]/li/div[2]/h2/a/text() | //ul[@class="house-list-wrap"]/li/div[3]//text()')
with open('./文件夾1/fangyuan.txt', 'w', encoding='utf-8') as f:
    for title_price in title_price_list:
        f.write(title_price)
print("over")
```

###### *Note: distinguish whether the parsing source is the full page source or a local sub-element*

```
Full page source
    tree.xpath('//ul...')
Local (sub-element) data
    element.xpath('./ul...')  # starts with .
```
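
A small illustration of the difference, with made-up markup:

```python
from lxml import etree

html = '<ul class="list"><li><a href="/x">x</a></li><li><a href="/y">y</a></li></ul>'
tree = etree.HTML(html)
# Query against the full document: the expression starts with //
li_list = tree.xpath('//ul[@class="list"]/li')
for li in li_list:
    # Query relative to one li element: the expression starts with .
    print(li.xpath('./a/@href')[0])
```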

##### Testing that an XPath expression is correct

###### Method 1: xpath.crx (the XPath browser extension)

```
Open the browser's More tools > Extensions
Enable Developer mode
Drag xpath.crx into the browser window
Toggle the plugin with Ctrl+Shift+X
Purpose: testing whether an XPath expression is correct
```


###### Method 2: the browser's built-in console

In Chrome DevTools (F12 → Console), an expression can be tested with `$x("//div[@class='song']")`.



##### Example 2: scraping images from the 4K wallpaper site (pic.netbian.com)

```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
page_num = int(input("請輸入要爬取的頁數:"))
for page in range(1, page_num + 1):
    # Page 1 is index.html; later pages are index_2.html, index_3.html, ...
    if page == 1:
        url = 'http://pic.netbian.com/4kyingshi/index.html'
    else:
        url = 'http://pic.netbian.com/4kyingshi/index_%d.html' % page
    origin_data = requests.get(url=url, headers=headers).text
    tree = etree.HTML(origin_data)
    a_list = tree.xpath('//ul[@class="clearfix"]/li/a')
    for a in a_list:
        # The site is GBK-encoded; undo requests' ISO-8859-1 guess
        name = a.xpath('./b/text()')[0].encode('iso-8859-1').decode('gbk')
        picture_url = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
        picture = requests.get(url=picture_url, headers=headers).content
        picture_name = './文件夾2/' + name + '.jpg'
        with open(picture_name, 'wb') as f:
            f.write(picture)
print('over!!!')
```

###### Fixing garbled Chinese text

```
Option 1: declare the response encoding before reading .text
    response.encoding = 'gbk'
Option 2: round-trip the mis-decoded string
    name = name.encode('iso-8859-1').decode('utf-8')  # or .decode('gbk'), matching the site's real charset
```
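
A sketch of both options, assuming a GBK-encoded page like the one in this example:

```python
import requests

url = 'http://pic.netbian.com/4kyingshi/index.html'

# Option 1: declare the real charset before reading resp.text
resp = requests.get(url)
resp.encoding = 'gbk'
good_text = resp.text

# Option 2: requests guessed ISO-8859-1, so round-trip the bad string
resp2 = requests.get(url)
fixed = resp2.text.encode('iso-8859-1').decode('gbk')
```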

###### Where the data comes from

```
etree.HTML()   # parse HTML fetched over the network
etree.parse()  # parse a local file
```
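
Note that etree.parse defaults to a strict XML parser; for an HTML file, pass an HTMLParser explicitly (the file name below is a placeholder):

```python
from lxml import etree

# Network data: a string of HTML
tree = etree.HTML('<html><body><p>hi</p></body></html>')
print(tree.xpath('//p/text()'))

# Local data: parse a file; HTML needs an explicit HTMLParser
local_tree = etree.parse('page.html', etree.HTMLParser())
print(local_tree.xpath('//p/text()'))
```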



##### Example 3: scraping images from jandan.net

```python
import requests
from lxml import etree
import base64

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://jandan.net/ooxx'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# Each image's real address is stored as a base64 string in span.img-hash
span_list = tree.xpath('//span[@class="img-hash"]/text()')
for span in span_list:
    # Decode the hash to get a protocol-relative URL, then prepend http:
    src = 'http:' + base64.b64decode(span).decode('utf-8')
    picture_data = requests.get(url=src, headers=headers).content
    name = './文件夾3/' + src.split('/')[-1]
    with open(name, 'wb') as f:
        f.write(picture_data)
print('over!!!')
```



###### Anti-scraping countermeasure 3: base64

In the returned response every image has the same src; instead, each image carries a span tag holding an encoded string, and a jandan_load_img function appears nearby. The guess, then, is that this function turns the encoded string into the real image address.

Searching the page's scripts for this function shows that it uses the string jdtPGUg7oYxbEGFASovweZE267FFvm5aYz.

Searching globally for jdtPGUg7oYxbEGFASovweZE267FFvm5aYz leads to a function whose final step calls base64_decode.

Hence the conclusion: base64-decoding the encoded string yields the image address.
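
The decoding step can be checked in isolation; the hash below is generated on the spot rather than taken from the site:

```python
import base64

# Fabricate an img-hash the way the site would (hypothetical URL)
img_hash = base64.b64encode(b'//img.example.com/pic/001.jpg').decode()

# Decoding it recovers the protocol-relative image address
src = 'http:' + base64.b64decode(img_hash).decode('utf-8')
print(src)  # http://img.example.com/pic/001.jpg
```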



##### Example 4: scraping résumé templates from sc.chinaz.com

```python
import requests
from lxml import etree
import random

headers = {
    'Connection': 'close',  # avoid exhausting the connection pool
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free.html'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# Links to the individual résumé detail pages
src_list = tree.xpath('//div[@id="main"]/div/div/a/@href')
for src in src_list:
    filename = './文件夾4/' + src.split('/')[-1].split('.')[0] + '.rar'
    print(filename)
    down_page_data = requests.get(url=src, headers=headers).text
    detail_tree = etree.HTML(down_page_data)
    # Each detail page offers several mirror links; pick one at random
    down_list = detail_tree.xpath('//div[@id="down"]/div[2]/ul/li/a/@href')
    res = random.choice(down_list)
    print(res)
    jianli = requests.get(url=res, headers=headers).content
    with open(filename, 'wb') as f:
        f.write(jianli)
print('over!!!')
```



###### Anti-scraping countermeasure 4: Connection

The classic error

```
HTTPConnectionPool(host:xx) Max retries exceeded with url
```

Causes

```
1. Before each transfer the client opens a TCP connection to the server. To save overhead the default is keep-alive, i.e. one connection carries many transfers; but if connections are never released, the pool eventually fills up, no new connection object can be created, and requests can no longer be sent.
2. The IP has been banned.
3. Requests are being sent too frequently.
```

Fixes

```
1. Set the Connection request header to close so the connection is dropped after each successful request.
2. Switch the requesting IP (use a proxy).
3. Sleep between consecutive requests.
```
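
A sketch combining the three fixes; the proxy address is a placeholder, not a working proxy:

```python
import time
import random
import requests

headers = {
    'Connection': 'close',  # fix 1: drop the TCP connection after each request
    'User-Agent': 'Mozilla/5.0 ...',
}
proxy_pool = [{'https': '1.2.3.4:8080'}]  # fix 2: placeholder proxy pool

for url in ['https://example.com/a', 'https://example.com/b']:
    resp = requests.get(url, headers=headers, proxies=random.choice(proxy_pool))
    time.sleep(1)  # fix 3: space out consecutive requests
```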



##### Example 5: parsing all city names from aqistudy.cn

```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
# Hot cities: the column header plus each city link in the first column
hot_list = tree.xpath('//div[@class="row"]/div/div[1]/div/text() | //div[@class="row"]/div/div[1]/div[@class="bottom"]/ul[@class="unstyled"]/li/a/text()')
with open('./文件夾1/city.txt', 'w', encoding='utf-8') as f:
    for hot in hot_list:
        f.write(hot.strip())
    # All other cities: header and links of the second column
    common_list = tree.xpath('//div[@class="row"]/div/div[2]/div[1]/text() | //div[@class="row"]/div/div[2]/div[2]/ul//text()')
    for common in common_list:
        f.write(common.strip())
print('over!!!')
```



##### Example 6: image lazy loading, wedding-photo images from sc.chinaz.com

```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://sc.chinaz.com/tupian/hunsha.html'
origin_data = requests.get(url=url, headers=headers).text
tree = etree.HTML(origin_data)
div_list = tree.xpath('//div[@id="container"]/div')

for div in div_list:
    # The page is UTF-8 but requests guessed ISO-8859-1; re-decode the title
    title = div.xpath('./p/a/text()')[0].encode('iso-8859-1').decode('utf-8')
    name = './文件夾1/' + title + '.jpg'
    photo_url = div.xpath('./div/a/@href')[0]

    # Follow the link to the detail page and pull the full-size image URL
    detail_data = requests.get(url=photo_url, headers=headers).text
    detail_tree = etree.HTML(detail_data)
    url_it = detail_tree.xpath('//div[@class="imga"]/a/img/@src')[0]

    picture = requests.get(url=url_it, headers=headers).content
    with open(name, 'wb') as f:
        f.write(picture)

print('over!!!')
```
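
*Note on lazy loading: many lazy-loading pages keep the real image URL in a placeholder attribute (commonly `src2` or `data-src`) until the image scrolls into view; if `@src` only returns placeholder paths in the raw response, point the XPath at that attribute instead.*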

###### Anti-scraping countermeasure 5: proxy IPs

Usage

```python
import requests
import random

# Free proxies are ephemeral; treat these as placeholders to be refreshed
proxy_pool = [{'https': '116.197.134.153:80'}, {'https': '103.224.100.43:8080'}, {'https': '222.74.237.246:808'}]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'https://www.baidu.com/s?wd=ip'
# Send the request through a randomly chosen proxy
origin_data = requests.get(url=url, headers=headers, proxies=random.choice(proxy_pool)).text

with open('./ip.html', 'w', encoding='utf-8') as f:
    f.write(origin_data)

print('over!!!')
```

Commonly used proxy sites

```
www.goubanjia.com
快代理
西祠代理
```

Proxy basics

```
Transparent: the server knows a proxy is used and knows the real IP
Anonymous: the server knows a proxy is used but not the real IP
Elite (high anonymity): the server knows neither that a proxy is used nor the real IP
```

*Note: the type of the proxy (the key of the proxies dict) must match the scheme of the request URL.*
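
For instance (the addresses are placeholders):

```python
import requests

# The key of the proxies dict must match the URL's scheme
requests.get('http://example.com',  proxies={'http':  '1.2.3.4:80'})
requests.get('https://example.com', proxies={'https': '5.6.7.8:8080'})
```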

*Practice: download movies from https://www.55xia.com*

*Order to check when dissecting a page: dynamic loading, URL encryption, then the element panel*
