Beautiful Soup is an HTML/XML parser. Its main job is to parse HTML/XML documents and extract data from them.
pip install beautifulsoup4
from bs4 import BeautifulSoup
In [1]: from bs4 import BeautifulSoup

In [2]: text = '''
   ...: <div>
   ...: <ul>
   ...: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
   ...: <li class="item-1"><a href="link2.html">second item</a></li>
   ...: <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
   ...: <li class="item-1"><a href="link4.html">fourth item</a></li>
   ...: <li class="item-0"><a href="link5.html">fifth item</a></li>
   ...: </ul>
   ...: </div>
   ...: '''

In [3]: bs = BeautifulSoup(text)  # create a BeautifulSoup object; a string can be passed in directly

In [4]: bs1 = BeautifulSoup(open('./test.html'))  # a file object can be passed in as well

In [5]: bs
Out[5]:
<html><body><div>
<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>
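When creating a Beautiful Soup object, you can pass in either a string or a file object. Beautiful Soup turns a complex HTML document into a tree structure and repairs the document automatically; in the example above it added the missing html and body nodes. Every node in the tree is a Python object.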
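The calls above let Beautiful Soup pick whichever parser is installed, and the result depends on that choice: the `<html><body>` wrapper shown in the output is what the lxml parser adds, while the built-in `html.parser` does not add it. The sketch below names the parser explicitly and uses a `with` block for the file case. The markup string and the temporary `test.html` file are assumptions made up for illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <ul> is never closed.
markup = '<ul><li id="first"><a href="link1.html">first item</a></li>'

# Naming the parser keeps the result the same on every machine.
soup = BeautifulSoup(markup, "html.parser")
first_id = soup.li["id"]
repaired = str(soup)  # the missing </ul> is supplied when serializing

# Same idea for a file; the with-block closes the handle afterwards.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.html"  # hypothetical file name
    path.write_text(markup, encoding="utf-8")
    with path.open(encoding="utf-8") as fh:
        soup_from_file = BeautifulSoup(fh, "html.parser")

print(first_id)
print(repaired)
print(soup.html)  # html.parser does not invent <html>/<body>
```

Swapping `"html.parser"` for `"lxml"` should reproduce the wrapped output shown earlier, provided the lxml package is installed.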
In [6]: bs.ul  # get the ul tag
Out[6]:
<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>

In [7]: type(bs.ul)
Out[7]: bs4.element.Tag

In [8]: bs.li  # get the li tag; note that only the first matching tag is returned
Out[8]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [12]: bs.ul.li.a  # tag lookups can be chained
Out[12]: <a href="link1.html">first item</a>
Append '.tagname' to a Beautiful Soup object to get the tag you are looking for. These lookups can be chained.
In [13]: bs.name  # most of the time BeautifulSoup can be treated as a Tag object; it is a special Tag
Out[13]: '[document]'

In [14]: bs.li.name
Out[14]: 'li'
The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object.
In [15]: bs.attrs
Out[15]: {}

In [16]: bs.li.attrs  # all attributes, as a dictionary
Out[16]: {'class': ['item-0'], 'id': 'first'}

In [17]: bs.li.attrs['id']  # get one attribute, method 1
Out[17]: 'first'

In [18]: bs.li['id']  # get one attribute, method 2: '.attrs' can be omitted
Out[18]: 'first'

In [19]: bs.li.get('id')  # get one attribute, method 3: use the get method
Out[19]: 'first'
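The three access styles above behave differently when the attribute is missing, which matters on real pages where not every tag carries an `id`. The following is a small sketch of that difference; the two-item markup and the `"n/a"` fallback are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

markup = '<li class="item-0 active" id="first">x</li><li class="item-1">y</li>'
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("li")

classes = first["class"]            # multi-valued attribute, returned as a list
missing = second.get("id")          # get() returns None when the attribute is absent
fallback = second.get("id", "n/a")  # or a default of your choosing
has_id = second.has_attr("id")

try:
    second["id"]                    # subscript access raises instead
    raised = False
except KeyError:
    raised = True

print(classes, missing, fallback, has_id, raised)
```

Subscript access suits attributes that must be present; `get` suits optional ones.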
In [20]: bs.li.string  # li contains only a single a tag, so .string returns the text of that innermost a tag
Out[20]: 'first item'

In [21]: bs.li.a.string  # the text of the a tag
Out[21]: 'first item'
Note: if a tag's content is a comment, .string drops the comment markers. For "<!-- this is a comment -->" it returns " this is a comment ".
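Because the comment markers are gone, a comment can be mistaken for ordinary text. One way to tell them apart, sketched here with made-up markup, is to check whether the value is an instance of `Comment`.

```python
from bs4 import BeautifulSoup, Comment

markup = "<p><!-- this is a comment --></p><p>plain text</p>"
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("p")

comment = first.string  # the markers are gone, the type is not
is_comment = isinstance(comment, Comment)
text_is_comment = isinstance(second.string, Comment)

print(repr(str(comment)), is_comment, text_is_comment)
```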
In [22]: bs.ul.contents
Out[22]:
['\n',
 <li class="item-0" id="first"><a href="link1.html">first item</a></li>,
 '\n',
 <li class="item-1"><a href="link2.html">second item</a></li>,
 '\n',
 <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
 '\n',
 <li class="item-1"><a href="link4.html">fourth item</a></li>,
 '\n',
 <li class="item-0"><a href="link5.html">fifth item</a></li>,
 '\n']
In [28]: bs.ul.children  # returns a list iterator
Out[28]: <list_iterator at 0x7f2d9e90ea30>

In [29]: for child in bs.ul.children:
    ...:     print(child)
    ...:
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
In [30]: bs.ul.descendants  # returns a generator; iterating over it walks all descendant nodes recursively
Out[30]: <generator object Tag.descendants at 0x7f2d9e79fc80>

In [31]: for d in bs.ul.descendants:
    ...:     print(d)
    ...:
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<a href="link1.html">first item</a>
first item
<li class="item-1"><a href="link2.html">second item</a></li>
<a href="link2.html">second item</a>
second item
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>
<span class="bold">third item</span>
third item
<li class="item-1"><a href="link4.html">fourth item</a></li>
<a href="link4.html">fourth item</a>
fourth item
<li class="item-0"><a href="link5.html">fifth item</a></li>
<a href="link5.html">fifth item</a>
fifth item
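As the `.contents` output shows, the newline strings between tags are children too, so iterating over `.children` or `.descendants` yields them alongside the tags. A minimal sketch of two ways to keep only the tags, plus `.stripped_strings` for the text, follows. The short markup is an assumption made up for the example.

```python
from bs4 import BeautifulSoup, Tag

markup = "<ul>\n<li>one</li>\n<li><b>two</b></li>\n</ul>"
soup = BeautifulSoup(markup, "html.parser")
ul = soup.ul

all_children = list(ul.children)  # tags and '\n' strings mixed
tag_children = [c for c in ul.children if isinstance(c, Tag)]
direct_lis = ul.find_all("li", recursive=False)  # same tags, via find_all
descendant_count = len(list(ul.descendants))     # every nested node, strings included
texts = list(ul.stripped_strings)                # text only, whitespace-only strings skipped

print(len(all_children), len(tag_children), descendant_count, texts)
```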
In [32]: bs.find('li')  # find the first matching li tag
Out[32]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [33]: bs.find(['li','a'])  # find the first tag that is either li or a
Out[33]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [34]: import re

In [35]: bs.find(re.compile(r'^l'))  # find the first tag whose name starts with l; li matches
Out[35]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [36]: bs.find(re.compile(r'l$'))  # find the first tag whose name ends with l; html matches
Out[36]:
<html><body><div>
<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body></html>
In [37]: bs.find(attrs={'class':'item-1'})  # find the first tag whose class attribute is item-1
Out[37]: <li class="item-1"><a href="link2.html">second item</a></li>
In [38]: bs.find('li',recursive=True)  # recursive search; finds the li tag
Out[38]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [39]: bs.find('li',recursive=False)  # nothing is returned: there is no li among the direct children (only html)

In [40]: bs.ul.find('li',recursive=False)  # the direct children of ul are li tags, so this matches
Out[40]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>
In [41]: bs.find(text='first item')  # search by string; a plain string must match the full text, otherwise nothing is found
Out[41]: 'first item'

In [42]: bs.find(text=re.compile(r'item'))  # find the first string containing item
Out[42]: 'first item'

In [43]: bs.find(text=re.compile(r'ir'))  # find the first string containing ir
Out[43]: 'first item'

In [44]: bs.find(text=['second item','third item'])  # find the first string equal to second item or third item
Out[44]: 'second item'
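The Beautiful Soup documentation names this argument `string` (added in 4.4.0) and describes `text` as the earlier name. The sketch below repeats the text searches with `string=` and shows how to get from a matched string back to its tag. The trimmed-down markup is an assumption for the example.

```python
import re

from bs4 import BeautifulSoup

markup = """<ul>
<li><a href="link1.html">first item</a></li>
<li><a href="link2.html">second item</a></li>
<li><a href="link4.html">fourth item</a></li>
</ul>"""
soup = BeautifulSoup(markup, "html.parser")

exact = soup.find(string="first item")   # plain string: whole text must match
partial = soup.find(string="first")      # partial text finds nothing
starts_with_f = soup.find_all(string=re.compile(r"^f"))  # regex: searched within each string

href = exact.parent["href"]              # step from the string back to its <a> tag
second_link = soup.find("a", string="second item")  # tag name plus text filter

print(exact, partial, starts_with_f, href, second_link)
```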
In [45]: bs.find(id='first')  # search using the id attribute as a keyword argument
Out[45]: <li class="item-0" id="first"><a href="link1.html">first item</a></li>

In [43]: bs.find(href='link4.html')  # search using the href attribute as a keyword argument
Out[43]: <a href="link4.html">fourth item</a>

In [44]: bs.find(class='item-inactive')  # the class attribute clashes with the Python keyword class, so this raises an error
  File "<ipython-input-42-a9ab4a3f6cee>", line 1
    bs.find(class='item-inactive')
            ^
SyntaxError: invalid syntax
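To work around the `class` keyword clash, Beautiful Soup accepts `class_` (with a trailing underscore) as the keyword argument; the `attrs` dictionary shown earlier also works. The following is a brief sketch with a shortened version of the sample markup.

```python
from bs4 import BeautifulSoup

markup = """<ul>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>"""
soup = BeautifulSoup(markup, "html.parser")

inactive = soup.find(class_="item-inactive")                 # trailing underscore avoids the keyword
item_1 = soup.find_all("li", class_="item-1")                # can be combined with a tag name
via_attrs = soup.find(attrs={"class": "item-inactive"})      # equivalent dictionary form

print(inactive.a["href"], len(item_1), via_attrs == inactive)
```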
In [45]: bs.find_all('li')  # find all li tags
Out[45]:
[<li class="item-0" id="first"><a href="link1.html">first item</a></li>,
 <li class="item-1"><a href="link2.html">second item</a></li>,
 <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
 <li class="item-1"><a href="link4.html">fourth item</a></li>,
 <li class="item-0"><a href="link5.html">fifth item</a></li>]

In [46]: bs.find_all('li',attrs={"class":"item-1"})  # find all li tags whose class attribute is item-1
Out[46]:
[<li class="item-1"><a href="link2.html">second item</a></li>,
 <li class="item-1"><a href="link4.html">fourth item</a></li>]
In [47]: bs.li.get('class')  # class can hold several values, so it is returned as a list
Out[47]: ['item-0']

In [48]: bs.find(attrs={"class":"item-0"}).get('id')  # the id attribute value is returned as a string
Out[48]: 'first'

In [49]: bs.find_all('a')[1].get('href')
Out[49]: 'link2.html'
In [50]: bs.li.get_text()  # get the innermost text of the first li
Out[50]: 'first item'

In [51]: bs.find(attrs={"class":"bold"}).get_text()  # get the text inside the tag with class bold (the span tag)
Out[51]: 'third item'

In [52]: bs.find_all('a')[3].get_text()  # get the text inside the 4th a tag
Out[52]: 'fourth item'
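Combining `find_all`, `get`, and `get_text` gives the usual extraction pattern: collect every link together with its text. The sketch below does that for the sample list and also shows the `separator` and `strip` arguments of `get_text`. The `" | "` separator is an arbitrary choice for the example.

```python
from bs4 import BeautifulSoup

markup = """<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>"""
soup = BeautifulSoup(markup, "html.parser")

# One (href, text) pair per <a> tag; nested tags such as <span> are flattened.
links = [(a.get("href"), a.get_text(strip=True)) for a in soup.find_all("a")]

# All text under <ul>, pieces stripped and joined with a separator.
joined = soup.ul.get_text(" | ", strip=True)

print(links)
print(joined)
```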
In [53]: bs.select('li')  # find all li tags
Out[53]:
[<li class="item-0" id="first"><a href="link1.html">first item</a></li>,
 <li class="item-1"><a href="link2.html">second item</a></li>,
 <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>,
 <li class="item-1"><a href="link4.html">fourth item</a></li>,
 <li class="item-0"><a href="link5.html">fifth item</a></li>]
In [54]: bs.select('.bold')  # find tags with class='bold'
Out[54]: [<span class="bold">third item</span>]
In [55]: bs.select('#first')  # find the tag whose id is first
Out[55]: [<li class="item-0" id="first"><a href="link1.html">first item</a></li>]
In [56]: bs.select('.item-0 a')  # find a tags under class="item-0"
Out[56]: [<a href="link1.html">first item</a>, <a href="link5.html">fifth item</a>]

In [57]: bs.select('#first a')  # find a tags under id="first"
Out[57]: [<a href="link1.html">first item</a>]

In [58]: bs.select('ul span')  # find span tags under ul
Out[58]: [<span class="bold">third item</span>]

In [59]: bs.select('ul>span')  # ">" means direct child; span is not a direct child of ul, so nothing matches
Out[59]: []

In [60]: bs.select('a>span')  # span is a direct child of a, so it matches
Out[60]: [<span class="bold">third item</span>]
To search for direct child tags only, separate the selectors with >.
In [61]: bs.select('li[class="item-inactive"]')  # find li tags whose class attribute is 'item-inactive'
Out[61]: [<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>]

In [62]: bs.select('a[href="link2.html"]')  # find a tags whose href attribute is 'link2.html'
Out[62]: [<a href="link2.html">second item</a>]
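`select` always returns a list, even for a single match. When only the first match is wanted, `select_one` returns that tag directly, or `None` if nothing matches. The sketch below combines it with the class, child, and attribute selectors shown above. The shortened markup and the `.missing` selector are assumptions for the example.

```python
from bs4 import BeautifulSoup

markup = """<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>"""
soup = BeautifulSoup(markup, "html.parser")

# First match only: the <a> that is a direct child of li.item-inactive.
inactive_link = soup.select_one("li.item-inactive > a")
inactive_href = inactive_link["href"]

# No match gives None rather than an empty list.
nothing = soup.select_one("li.missing")  # hypothetical class, not in the markup

# Selectors combine freely: hrefs of links inside li.item-1.
item_1_hrefs = [a["href"] for a in soup.select("li.item-1 a")]

# Attribute selector on id, then the text of the match.
first_text = soup.select_one('li[id="first"]').get_text()

print(inactive_href, nothing, item_1_hrefs, first_text)
```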