Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
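1. What is BeautifulSoup?
Answer: a flexible, convenient library for parsing web pages. It is efficient and supports multiple parsers.
With it, you can extract information from web pages without writing regular expressions.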
2. Installation
pip3 install beautifulsoup4
3. Usage
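Parser | Usage | Advantages | Disadvantages |
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good tolerance of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; good tolerance of malformed documents | Requires a C library to be installed |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces HTML5 output | Slow; a separate pure-Python package (no C extension needed) |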
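Because lxml and html5lib are separate installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch under that assumption: make_soup is a hypothetical helper (not part of bs4), and it relies on bs4 raising FeatureNotFound when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: try each parser in order and use the first one installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable parser found")


# Unclosed tags on purpose: every parser in the table repairs them.
soup = make_soup("<p class='demo'>Hello<b>world")
print(soup.p.get_text())
```

The order of the tuple encodes a preference (speed first, built-in last); swap it if browser-like parsing matters more than speed.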
4. Basic usage
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # re-indented markup, with the missing closing tags filled in
print(soup.title.string)  # text inside <title>
5. Tag selectors
The examples below use the lxml parser.
(1) Selecting elements
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # only its text
print(type(soup.title))   # <class 'bs4.element.Tag'>
print(soup.head)          # the whole <head> tag
print(soup.p)             # only the first <p> is returned
(2) Getting the tag name
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # prints: title
(3) Getting attributes
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])  # via the attrs dictionary
print(soup.p['name'])        # shorthand, same result
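Two details of attribute access are easy to trip over: class is a multi-valued attribute and comes back as a list, and indexing a missing attribute raises KeyError. A small sketch, using html.parser only so that it runs without extra installs (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dormouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]             # multi-valued attribute: a list of class names
name = p["name"]                 # ordinary attribute: a plain string
missing = p.get("id")            # .get() returns None instead of raising KeyError
fallback = p.get("id", "no-id")  # or a default of your choice

print(classes, name, missing, fallback)
```

Prefer .get() whenever the attribute may be absent, for example href on anchor tags used as named targets.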
(4) Getting content
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)  # text of the first <p>
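.string only works when a tag has a single child (or a single chain of only children); with mixed content it is None, and get_text() is the safer choice. A short sketch with made-up markup, parsed with html.parser so it runs without extra installs:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><b>only child</b></p><div>Hello <i>world</i></div>", "html.parser"
)

single = soup.p.string        # one chain of only children, so .string reaches the text
mixed = soup.div.string       # two children (text plus <i>), so .string is None
joined = soup.div.get_text()  # get_text() concatenates all nested text

print(repr(single), repr(mixed), repr(joined))
```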
(5) Nested selection
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)  # each step returns a Tag, so selections can be chained
(6) Child and descendant nodes
.contents returns a tag's direct children as a list
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # list of direct children (text nodes and tags)
.children is an iterator over the same direct children; loop over it to get each child node
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # an iterator object, not the nodes themselves
for i, child in enumerate(soup.p.children):
    print(i, child)
.descendants is a generator over all descendants (children, grandchildren, and so on)
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # a generator object
for i, child in enumerate(soup.p.descendants):
    print(i, child)
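The difference between the three attributes is easiest to see on a tiny document. The sketch below uses made-up markup and html.parser (so it runs without extra installs) and compares what each attribute yields:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a<b>c<i>d</i></b></p>", "html.parser")
p = soup.p

direct = p.contents               # list of direct children: 'a' and <b>
lazy = list(p.children)           # the same nodes, produced by an iterator
everything = list(p.descendants)  # depth-first walk: 'a', <b>, 'c', <i>, 'd'

print(len(direct), len(lazy), len(everything))
```

Note that text nodes count as nodes, which is why whitespace between tags shows up in real pages.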
(7) Parent and ancestor nodes
.parent returns the direct parent node
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # the <p> that contains the first <a>
.parents iterates over all ancestor nodes, nearest first
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))  # every ancestor of the first <a>
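Printing whole ancestor tags gets noisy quickly; printing only their names shows the chain more clearly. A minimal sketch with made-up markup, using html.parser so it runs without extra installs:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)

parent_name = soup.a.parent.name
ancestor_names = [node.name for node in soup.a.parents]

print(parent_name)
print(ancestor_names)
```

The last entry is the BeautifulSoup object itself, whose name is the special string "[document]".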
(8) Sibling nodes
.next_siblings iterates over the siblings that follow the node
.previous_siblings iterates over the siblings that precede the node
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # siblings after the first <a>
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the first <a>
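Sibling iteration returns text nodes as well as tags, so the results usually need filtering. The sketch below uses made-up markup and html.parser (so it runs without extra installs) and keeps only Tag objects:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<p>one <a id="a1">A</a>, <a id="a2">B</a> and <a id="a3">C</a>.</p>',
    "html.parser",
)
first = soup.a

after = list(first.next_siblings)       # ', ', <a id="a2">, ' and ', <a id="a3">, '.'
before = list(first.previous_siblings)  # 'one '
tag_ids_after = [node["id"] for node in after if isinstance(node, Tag)]

print(tag_ids_after)
```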
Standard selectors
find_all() searches the document by tag name, attributes, or text content.
name
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # list of every <ul> tag
print(type(soup.find_all('ul')[0]))  # each item is a bs4.element.Tag
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# find_all() can be called again on each result to search inside it
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
attrs
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# attrs takes a dictionary of attribute names and the values to match
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'class': 'element'}))
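When filtering by attributes, find_all() also accepts keyword shortcuts, but some attribute names cannot be written as keywords. The sketch below is illustrative only: the markup, including the data-rank attribute, is made up, and html.parser is used so it runs without extra installs.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1" name="elements">
  <li class="element">Foo</li>
  <li class="element" data-rank="2">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="list-1")                   # keyword shortcut
by_class = soup.find_all("li", class_="element")     # class_ avoids the Python keyword
by_name = soup.find_all(attrs={"name": "elements"})  # 'name' clashes with find_all's first parameter
by_data = soup.find_all(attrs={"data-rank": "2"})    # hyphenated names are not valid keywords

print(len(by_id), len(by_class), len(by_name), len(by_data))
```

The attrs dictionary always works, so it is a reasonable default when attribute names come from configuration rather than from code.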
text
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# text matches text nodes and returns strings, not tags
# (newer bs4 releases call this argument string)
print(soup.find_all(text='Foo'))
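Text matching is not limited to exact strings. The sketch below assumes bs4 4.4.0 or later, where the argument is named string (text is the older name), and uses made-up markup with html.parser so it runs without extra installs:

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>", "html.parser"
)

matches = soup.find_all(string=re.compile(r"^Ba"))  # text nodes starting with "Ba"
owners = [s.parent.name for s in matches]           # the tags that hold those strings
tags = soup.find_all("li", string="Foo")            # tag name plus text returns tags

print(matches, owners, tags)
```

Going through .parent, or combining a tag name with string, is the usual way to get back from matched text to the enclosing tag.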
find() returns the first matching element; find_all() returns all matching elements
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))        # the first <ul> only
print(type(soup.find('ul')))  # <class 'bs4.element.Tag'>
print(soup.find('page'))      # no such tag, so None
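Because find() returns None when nothing matches, chaining an attribute access onto its result can raise AttributeError. One way to guard against that is sketched below; first_text is a hypothetical helper, not part of bs4, and html.parser is used so the sketch runs without extra installs.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><h4>Hello</h4></div>", "html.parser")


def first_text(soup, tag_name, default=""):
    """Hypothetical helper: text of the first <tag_name>, or default if absent."""
    tag = soup.find(tag_name)
    return tag.get_text(strip=True) if tag is not None else default


found = first_text(soup, "h4")
absent = first_text(soup, "page", default="(none)")

print(found, absent)
```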
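find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node.
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node.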
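The method pairs listed above can be tried side by side from one starting node. This is a small sketch with made-up markup, parsed with html.parser so it runs without extra installs; the starting point is the middle list item.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul><p>end</p></div>',
    "html.parser",
)
bar = soup.find("li", string="Bar")  # start from the middle <li>

parent_id = bar.find_parent("ul")["id"]                           # nearest matching ancestor
ancestors = [t.name for t in bar.find_parents(["ul", "div"])]     # matching ancestors, nearest first
next_one = bar.find_next_sibling("li").string                     # first following sibling
prev_one = bar.find_previous_sibling("li").string                 # first preceding sibling
following = [t.string for t in soup.li.find_next_siblings("li")]  # all following siblings of Foo
later_items = [t.string for t in bar.find_all_next("li")]         # matches anywhere after bar
next_p = bar.find_next("p").string                                # first match after bar
earlier = bar.find_previous("li").string                          # first match before bar

print(parent_id, ancestors, next_one, prev_one, following, later_items, next_p, earlier)
```

The sibling methods stay within the same parent, while find_next() and find_previous() walk the whole document, which is why find_next("p") can leave the list.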
With select(), you can pass a CSS selector directly to make the selection
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # a leading . selects by class
print(soup.select('ul li'))                  # li tags nested inside ul tags
print(soup.select('#list-2 .element'))       # a leading # selects by id
print(type(soup.select('ul')[0]))            # each result is a bs4.element.Tag
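Beyond class and id selectors, select() accepts attribute selectors and the child combinator, and select_one() returns just the first match. The sketch below uses a shortened, made-up version of the markup and html.parser so it runs without extra installs; it assumes a bs4 release that bundles the soupsieve selector engine (4.7 or later).

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
  <ul class="list" id="list-1" name="elements"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list" id="list-2"><li class="element">Jay</li></ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

first_ul = soup.select_one("ul.list")              # first match only
by_attr = soup.select('ul[name="elements"] > li')  # attribute selector plus direct child
second_items = soup.select("#list-2 li.element")   # id, descendant, and class combined
nothing = soup.select_one("table")                 # no match gives None

print(first_ul["id"], [li.string for li in by_attr], second_items, nothing)
```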
(1) Getting attributes
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # shorthand
    print(ul.attrs['id'])  # via the attrs dictionary, same result
(2) Getting content
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())  # text content of each <li>
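On real pages the extracted text usually carries indentation and line breaks. get_text() takes a separator and a strip flag for this, and stripped_strings yields the same cleaned pieces one by one. A minimal sketch with made-up markup, using html.parser so it runs without extra installs:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul>\n  <li> Foo </li>\n  <li>Bar</li>\n</ul>", "html.parser")
ul = soup.ul

raw = ul.get_text()                     # keeps the original whitespace
clean = ul.get_text(" | ", strip=True)  # strip each piece, then join with a separator
pieces = list(ul.stripped_strings)      # the same pieces as a list

print(repr(raw))
print(clean)
print(pieces)
```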