Mao Ge Teaches You to Write Crawlers 032 -- A First Taste of Crawling: BeautifulSoup

Date: 2019-12-05


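What is BeautifulSoup

A crawler needs a tool that can read HTML before it can pull out the data we want. Reading the HTML is what we call parsing the data.

[Extracting data] means picking the data we need out of everything else on the page.

In a crawler, parsing and extracting data are both a key step and a hard one.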

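How to use BeautifulSoup

Install BeautifulSoup ==> pip install beautifulsoup4

BeautifulSoup() takes two arguments inside its parentheses. Argument 0 is the text to be parsed. Note that for our purposes it must be a string (for example res.text), not the Response object itself.

Argument 1 names the parser. We will use html.parser, which is built into Python's standard library.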
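As a minimal sketch of the two arguments described above, the snippet below parses a small hand-written HTML string instead of a downloaded page. The markup and the variable names (sample_html, title_text) are made up for illustration and are not taken from the course page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, so the example runs without a network request.
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Argument 0: the text to parse (a string). Argument 1: the name of the parser.
soup = BeautifulSoup(sample_html, "html.parser")

# soup.title is the first <title> element; .text is the text inside it.
title_text = soup.title.text
print(title_text)
```

When working with a real page, sample_html would be replaced by res.text from requests.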

import requests  # import the requests library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# fetch the page source; res is a Response object
print(res.status_code)  # check whether the request got a normal response
html = res.text  # get the content of res as a string
print(html)  # print html

import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
html = res.text
soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object
print(type(soup))  # <class 'bs4.BeautifulSoup'>

Printing soup shows the same source code we got earlier by printing response.text.

Although response.text and soup look alike when printed, they belong to different classes: <class 'str'> and <class 'bs4.BeautifulSoup'>.
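To make the difference between the two classes concrete, here is a small sketch that can be run offline. The html string is a made-up stand-in for res.text, and the two flag variables are names chosen only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for res.text.
html = "<div><p>first</p></div>"
soup = BeautifulSoup(html, "html.parser")

print(type(html))  # the raw source is a plain string
print(type(soup))  # the parsed document is a BeautifulSoup object

# A plain string has no find_all() method; the BeautifulSoup object does.
str_can_search_tags = hasattr(html, "find_all")
soup_can_search_tags = hasattr(soup, "find_all")
print(str_can_search_tags, soup_can_search_tags)
```

This is why the parsing step matters: only the BeautifulSoup object offers the tag-searching methods used in the rest of the article.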

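Extracting data

We keep using BeautifulSoup to extract the data.

This step covers two topics: find() and find_all(), and the Tag object.

find() and find_all() are two methods of the BeautifulSoup object.

They match HTML tags and attributes and pull the matching data out of the BeautifulSoup object.

The difference is that find() returns only the first match, while find_all() returns every match.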

localprod.pandateacher.com/python-manu…

The HTML contains three <div> elements. find() extracts the first one, and find_all() extracts all of them.

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')  # use find() to extract the first <div> element and store it in item
print(type(item))  # print the data type of item
print(item)        # print item

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('div')  # use find_all() to extract every match and store the result in items
print(type(items))  # print the data type of items
print(items)        # print items
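The same comparison can be tried without a network connection. The sketch below uses a made-up string with three <div> elements (not the real course page) and also shows what happens when nothing matches. The variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three <div> elements, mirroring the page described above.
sample_html = """
<div>one</div>
<div>two</div>
<div>three</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

first = soup.find("div")         # a single Tag: the first match
all_divs = soup.find_all("div")  # a ResultSet, which behaves like a list of Tags
missing = soup.find("table")     # no match: find() returns None

print(type(first).__name__, first.text)
print(type(all_divs).__name__, len(all_divs))
print(missing)
```

Because find() can return None, it is worth checking the result before reading .text from it on pages whose structure you do not control.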

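In the parentheses of the example, class_ has a trailing underscore. This distinguishes it from class, which is a reserved keyword in Python, and avoids a syntax conflict.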
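A short sketch of the class_ argument follows, assuming a made-up snippet of HTML with a books class. It also shows the attrs dictionary form, which expresses the same filter without the underscore.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only two of the three <div> elements have class="books".
sample_html = """
<div class="books">Book A</div>
<div class="other">Not a book</div>
<div class="books">Book B</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class is a reserved word in Python, so the keyword argument is spelled class_.
by_keyword = soup.find_all("div", class_="books")

# The same filter written with the attrs dictionary, where no underscore is needed.
by_attrs = soup.find_all("div", attrs={"class": "books"})

print([tag.text for tag in by_keyword])
print([tag.text for tag in by_attrs])
```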

Exercise: scrape the titles, links, and descriptions of the three books on the page

localprod.pandateacher.com/python-manu…

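The other topic: the Tag object.

We usually check the data type first with the type() function.

Python is an object-oriented language, so we can only call the right attributes and methods once we know what kind of object we have.

The data extracted with find() is the same type as before: a Tag object.

We can use Tag.text to get the text inside a Tag object, and Tag['href'] to get the URL.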

import requests  # import the requests library
from bs4 import BeautifulSoup
# Fetch the data
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# res.status_code  status code
# res.content      raw bytes
# res.text         HTML source as a string
# res.encoding     encoding
# Parse the data
# soup is a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')
# soup.find(tag_name, attribute=value)
# soup.find_all(tag_name, attribute=value)
# Extract the data: the result is list-like and its items are Tag objects
item = soup.find_all('div', class_='books')
for i in item:
    # i.find().find().find()  # Tag objects can be searched level by level
    # i.find_all()
    # i is a Tag object, so find() and find_all() work on it too, and the results are again Tag objects
    # i.find().find().find().find()
    print(i.find('a', class_='title').text)     # get the tag's text
    print(i.find('a', class_='title')['href'])  # get the tag's attribute (href)
    print(i.find('p', class_='info').text)      # get the tag's text
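The Tag operations used above can also be tried on a small local string. This is a sketch under assumptions: the markup and the example.com link are invented, and only the class names (books, title, info) follow the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names follow the ones used in the code above.
sample_html = """
<div class="books">
  <a class="title" href="https://example.com/book-1">Book One</a>
  <p class="info">A short description.</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

book = soup.find("div", class_="books")  # a Tag
link = book.find("a", class_="title")    # a Tag found inside another Tag

title = link.text            # the text between <a> and </a>
url = link["href"]           # attribute value; raises KeyError if the attribute is missing
target = link.get("target")  # .get() returns None when the attribute is absent
info = book.find("p", class_="info").text

print(title, url, target, info)
```

Using .get() instead of square brackets is a reasonable choice when an attribute may not be present on every tag.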

Searching level by level is a bit like finding the snack you want in a supermarket.

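How the objects change

We start by fetching data with requests, then parse it with BeautifulSoup, then extract from it with BeautifulSoup.

At every step, the type of the object we are working with changes.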
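The chain of types can be printed directly. The sketch below skips the network step and starts from a made-up string in place of res.text. In the article's flow, that string comes from a requests Response object. The names type_names and first_item are chosen for this example.

```python
from bs4 import BeautifulSoup

# Stand-in for res.text; in the article this string comes from a requests Response object.
html = '<div class="books"><a class="title" href="#one">Book One</a></div>'

soup = BeautifulSoup(html, "html.parser")          # str -> BeautifulSoup
items = soup.find_all("div", class_="books")       # BeautifulSoup -> ResultSet
first_item = items[0]                              # ResultSet -> Tag
title = first_item.find("a", class_="title").text  # Tag -> str

# Collect the class name of each object in the order it was produced.
type_names = [type(obj).__name__ for obj in (html, soup, items, first_item, title)]
print(" -> ".join(type_names))
```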

Full version


Let's review the code once more...

import requests  # import the requests library
from bs4 import BeautifulSoup
# Fetch the data
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# res.status_code  status code
# res.content      raw bytes
# res.text         HTML source as a string
# res.encoding     encoding
# Parse the data
# soup is a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')
# soup.find(tag_name, attribute=value)
# soup.find_all(tag_name, attribute=value)
# Extract the data: the result is list-like and its items are Tag objects
item = soup.find_all('div', class_='books')
for i in item:
    # i.find().find().find()  # Tag objects can be searched level by level
    # i.find_all()
    # i is a Tag object, so find() and find_all() work on it too, and the results are again Tag objects
    # i.find().find().find().find()
    print(i.find('a', class_='title').text)     # get the tag's text
    print(i.find('a', class_='title')['href'])  # get the tag's attribute (href)
    print(i.find('p', class_='info').text)      # get the tag's text

Summary

BeautifulSoup parsers

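Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(text, "html.parser")	Built into Python's standard library; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(text, "lxml")	Fast; tolerant of malformed documents	Requires installing a C library
lxml XML parser	BeautifulSoup(text, "xml")	Fast; the only parser that supports XML	Requires installing a C library
html5lib	BeautifulSoup(text, "html5lib")	Produces documents in HTML5 format	Slow; requires installing the external html5lib package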
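Since lxml and html5lib are optional installs, one possible way to use the table above is to try the preferred parsers in order and fall back to html.parser, which always ships with Python. The helper name make_soup and its preferred parameter are hypothetical and not part of BeautifulSoup. The sketch relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse html with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>hi</p>")
print(parser_used, soup.find("p").text)
```

Different parsers can build slightly different trees from the same broken HTML, so pinning one parser explicitly remains the simpler choice when results must be reproducible.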