What is BeautifulSoup?
BeautifulSoup is a web page parsing library. It is far more flexible and convenient than picking data out of raw urllib or Requests responses, it parses efficiently, and it supports multiple parsers.
With it, you can extract information from web pages easily, without writing regular expressions.
Installing BeautifulSoup: simply run pip3 install beautifulsoup4. The 4 refers to the major version (Beautiful Soup 4).
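To confirm the install worked, a quick sanity check (a minimal sketch; the exact version printed depends on your environment):

```python
# Verify that Beautiful Soup 4 is importable and check which version is installed.
import bs4

print(bs4.__version__)  # the exact number depends on your install
```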
How to use BeautifulSoup:
Supported parsers:
Parser | Usage | Advantages | Disadvantages |
Python standard library | BeautifulSoup(markup,"html.parser") | Built into Python; moderate speed; tolerant of bad markup | Less tolerant in versions before Python 2.7.3 / 3.2.2 |
lxml HTML parser | BeautifulSoup(markup,"lxml") | Very fast; tolerant of bad markup | Requires an external C library |
lxml XML parser | BeautifulSoup(markup,"xml") | Very fast; the only XML parser listed here | Requires an external C library |
html5lib | BeautifulSoup(markup,"html5lib") | Most tolerant of all; parses documents the way a browser does; produces valid HTML5 | Very slow; requires an external Python dependency |
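The choice of parser can change the resulting tree, especially for invalid markup. A small sketch (assuming lxml and html5lib are installed alongside beautifulsoup4), using the `<a></p>` fragment from the official documentation:

```python
from bs4 import BeautifulSoup

broken = "<a></p>"  # a stray closing tag that each parser treats differently

# html.parser keeps the <a> and drops the stray </p>
print(BeautifulSoup(broken, "html.parser"))   # <a></a>
# lxml additionally wraps the fragment in <html><body>...</body></html>
print(BeautifulSoup(broken, "lxml"))
# html5lib imitates a browser: a full document, with an empty <p> inside the <a>
print(BeautifulSoup(broken, "html5lib"))
```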
Basic usage:
```python
from bs4 import BeautifulSoup

# Note: this HTML is deliberately incomplete (several closing tags are missing)
html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
```
As you can see, the html argument here is a snippet of HTML code, but the markup is incomplete. Let's look at the output below.
From the result we can see that BeautifulSoup automatically completed the missing tags and pretty-printed the HTML for us.
Tag selectors:
Selecting elements
```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
```
Let's look at the output first.
As you can see, soup.title extracts the whole title tag, and soup.head behaves the same way. There are several p tags, however, and this syntax returns only the first one by default.
Getting the tag name:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)
```
The output is title. The .name attribute returns the tag's name (not the value of its name attribute).
Getting attributes:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
```
If we run this, both lines print dromouse; in other words, both forms retrieve the value of the name attribute, but again only from the first matching tag.
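One caveat worth knowing: indexing with tag['name'] raises KeyError when the attribute is missing, while tag.get() returns None, much like a dict. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="dromouse">text</p>', 'lxml')
p = soup.p

print(p['name'])      # dromouse
print(p.get('name'))  # dromouse
print(p.get('id'))    # None -- no exception for a missing attribute
# p['id'] would raise KeyError instead

# Multi-valued attributes such as class come back as a list
print(p['class'])     # ['title']
```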
Getting the content:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story" name="pname">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
```
tag.string retrieves the content of that tag; the output is:
We can see that we got the text content of the first p tag.
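.string has a subtlety worth knowing: it only returns text when the tag has a single child; if a tag contains several children, .string is None and get_text() is the safer choice. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'lxml')

print(soup.b.string)      # world -- <b> has exactly one text child
print(soup.p.string)      # None  -- <p> has mixed children, so .string gives up
print(soup.p.get_text())  # Hello world -- concatenates all descendant text
```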
Nested selection:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story" name="pname">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
```
The output is:
We can keep drilling into child nodes to select tags and get their content.
Getting children and descendants:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story" name="pname">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.body.contents)
.contents returns a list, in which newline characters occupy positions of their own. Let's look at the output:
There is another way to get the children: the .children attribute:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story" name="pname">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(type(soup.body.children))
for i, child in enumerate(soup.body.children):
    print(i, child)
```
The output is as follows:
.children yields an iterable: an iterator, not a list.
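Because .children is an iterator rather than a list, it has no length and is exhausted after one pass; wrap it in list() if you need to index or reuse it. A quick sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>a</li><li>b</li></ul>', 'lxml')

children = soup.ul.children
print(list(children))  # [<li>a</li>, <li>b</li>]
print(list(children))  # [] -- the iterator is already exhausted

# .contents gives the same nodes as a real, reusable list
print(soup.ul.contents)
```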
The .descendants attribute gives all the descendant nodes:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story" name="pname">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(type(soup.body.descendants))
for i, child in enumerate(soup.body.descendants):
    print(i, child)
```
Let's look at the output:
Getting the parent and ancestor nodes:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parent))
print(soup.a.parent)
```
Let's look at the output:
Getting the ancestor nodes:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
```
Let's look at the result:
Sibling nodes:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
```
Let's look at the result:
.next_siblings yields the siblings after a node, and .previous_siblings yields the siblings before it.
That covers the simplest way to extract content: tag selectors. Selection with them is fast, but they are too limited to satisfy most parsing needs, so let's move on to the standard selectors.
Standard selectors:
find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content.
Let's look at the concrete usage.
Searching by name:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')
print(soup.find_all('p'))
print(soup.find_all('p')[0])
```
Here is the result:
find_all returns a list of matches, and indexing into it gives each individual tag. Searches can also be nested.
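Nesting means we can call find_all again on each result. A sketch using a simplified version of the document above:

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
  <a id="link1">Elsie</a>
  <a id="link2">Lacie</a>
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')

# First find every <p>, then search inside each one for its <a> tags
for p in soup.find_all('p'):
    for a in p.find_all('a'):
        print(a['id'])  # link1, then link2
```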
Searching by attrs:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')
print(soup.find_all(attrs={'id': 'link3'}))
print(soup.find_all(attrs={'class': 'sister'}))
for i in soup.find_all(attrs={'class': 'sister'}):
    print(i)
```
The result:
This can also be written as:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')
print(soup.find_all(id='link3'))
print(soup.find_all(class_='sister'))
for i in soup.find_all(class_='sister'):
    print(i)
```
For common attributes such as id and class, we can pass the attribute directly as a keyword argument (note the trailing underscore in class_, since class is a Python keyword); attrs covers the general case and makes arbitrary lookups easy.
Searching by text:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')
print(soup.find_all(text='Title'))
```
The result:
find(name, attrs, recursive, text, **kwargs) takes the same arguments and searches the same way as find_all; the difference is that find returns a single tag (the first match), while find_all returns all of them.
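The difference matters most when nothing matches: find returns None (so check before chaining), whereas find_all returns an empty list. A short sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story">...</p>', 'lxml')

print(soup.find('a'))      # None
print(soup.find_all('a'))  # []

# Guard against None before accessing attributes
tag = soup.find('a')
if tag is not None:
    print(tag['href'])
else:
    print('no <a> tag found')
```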
There are many similar methods:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')

# find(): the first matching tag; these methods are called on a tag, not on soup
first_a = soup.find(class_='sister')
print(first_a)

# find_parents() and find_parent()
print(first_a.find_parents('p'))  # all matching ancestor nodes
print(first_a.find_parent('p'))   # the nearest matching ancestor

# find_next_siblings() and find_next_sibling()
print(first_a.find_next_siblings(class_='sister'))  # all later matching siblings
print(first_a.find_next_sibling(class_='sister'))   # the first later matching sibling

# find_previous_siblings() and find_previous_sibling()
print(first_a.find_previous_siblings(class_='sister'))  # all earlier matching siblings
print(first_a.find_previous_sibling(class_='sister'))   # the first earlier matching sibling

# find_all_next() and find_next()
print(first_a.find_all_next(class_='sister'))  # all matching nodes after this node
print(first_a.find_next(class_='sister'))      # the first matching node after this node

# find_all_previous() and find_previous()
print(first_a.find_all_previous(class_='sister'))  # all matching nodes before this node
print(first_a.find_previous(class_='sister'))      # the first matching node before this node
```
We won't enumerate them all here.
CSS selectors: passing a CSS selector straight into select() performs the selection.
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')
print(soup.select('.body'))
print(soup.select('a'))
print(soup.select('#link3'))
```
select() works much like jQuery: prefix a class with ".", prefix an id with "#", and a bare name selects by tag. Let's look at the result:
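select() returns ordinary Tag objects, so attributes can be read from each result just as before, with either ['attr'] or attrs['attr']. A sketch using the same document shape (the URLs are placeholders):

```python
from bs4 import BeautifulSoup

html = '''
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
'''
soup = BeautifulSoup(html, 'lxml')

for a in soup.select('.sister'):
    print(a['href'])        # index syntax
    print(a.attrs['href'])  # attrs dict, same value
```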
Getting the text:
Calling get_text() returns the text inside an element.
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs4

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" name="pname">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

soup = bs4(html, 'lxml')
for i in soup.select('.sister'):
    print(i.get_text())
```
The result:
Summary:
- Prefer the lxml parser; fall back to html.parser when necessary
- Tag selectors are fast, but their filtering ability is weak
- Use find() and find_all() to match a single result or multiple results
- If you are familiar with CSS selectors, select() is recommended
- Remember the common methods for getting attributes and text
Code: https://gitee.com/dwyui/BeautifulSoup.git
Next time I will cover pyQuery; stay tuned.
Thanks for reading. If anything here is incorrect, corrections are welcome. 🙏