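The useful information on a web page usually lives in its text or in the attribute values of its tags. To extract it, we need search methods that can retrieve those text values and tag attributes, and Beautiful Soup has several of them built in.
The following HTML is the reference page used by the examples: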
<html>
<body>
<div class="ecopyramid">
  <ul id="producers">
    <li class="producerlist">
      <div class="name">plants</div>
      <div class="number">100000</div>
    </li>
    <li class="producerlist">
      <div class="name">algae</div>
      <div class="number">100000</div>
    </li>
  </ul>
  <ul id="primaryconsumers">
    <li class="primaryconsumerlist">
      <div class="name">deer</div>
      <div class="number">1000</div>
    </li>
    <li class="primaryconsumerlist">
      <div class="name">rabbit</div>
      <div class="number">2000</div>
    </li>
  </ul>
  <ul id="secondaryconsumers">
    <li class="secondaryconsumerlist">
      <div class="name">fox</div>
      <div class="number">100</div>
    </li>
    <li class="secondaryconsumerlist">
      <div class="name">bear</div>
      <div class="number">100</div>
    </li>
  </ul>
  <ul id="tertiaryconsumers">
    <li class="tertiaryconsumerlist">
      <div class="name">lion</div>
      <div class="number">80</div>
    </li>
    <li class="tertiaryconsumerlist">
      <div class="name">tiger</div>
      <div class="number">50</div>
    </li>
  </ul>
</div>
</body>
</html>
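The code above is a simple representation of an ecological pyramid. To find the first producer, the first primary consumer, or the first secondary consumer in it, we can use Beautiful Soup's search methods. In general, find() locates the first occurrence of any tag inside a BeautifulSoup object.
The producers clearly sit inside the first <ul> tag. Because that is the first <ul> in the whole HTML document, a plain call to find() is enough to reach the first producer. The HTML tree in the figure below marks the position of the first producer.
Then write the following code in ecologicalpyramid.py, which creates a BeautifulSoup object from the file ecologicalpyramid.html.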
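from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    # Pass a parser explicitly; "html" selects an installed HTML parser
    soup = BeautifulSoup(ecological_pyramid, "html")

# find() returns the first <ul> in the document
producer_entries = soup.find("ul")
print(producer_entries.li.div.string)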
The output is: plants
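The same lookup can be tried without the HTML file. The following is a minimal sketch that inlines a trimmed copy of the producers list; the variable names `html`, `first_ul`, and `first_name` are illustrative, and `html.parser` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the producers list from the reference page, inlined
# so the sketch runs without ecologicalpyramid.html.
html = """
<ul id="producers">
  <li class="producerlist"><div class="name">plants</div><div class="number">100000</div></li>
  <li class="producerlist"><div class="name">algae</div><div class="number">100000</div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

first_ul = soup.find("ul")            # first <ul> in document order
first_name = first_ul.li.div.string   # first <li>, then its first <div>
print(first_name)
```

Attribute access such as `first_ul.li` is shorthand for finding the first descendant tag with that name.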
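The signature of find() is:
find(name, attrs, recursive, text, **kwargs)
These parameters act as filters that narrow the search.
Different filter arguments apply to the cases described below.
We can pass the name of any tag to find its first occurrence. On a match, find() returns a Beautiful Soup Tag object.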
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, "html")

producer_entries = soup.find("ul")
# The result is a Tag object
print(type(producer_entries))
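A point worth checking alongside the return type is what happens when nothing matches. The sketch below, which uses a tiny inline document made up for illustration, shows that find() returns a Tag on success and None otherwise, so the result should be tested before it is used.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Tiny illustrative document, not the full reference page.
soup = BeautifulSoup("<ul><li>plants</li></ul>", "html.parser")

found = soup.find("ul")       # a matching tag exists
missing = soup.find("table")  # no <table> in this document

print(isinstance(found, Tag))
print(missing is None)

# Guard against None before chaining attribute access.
if missing is not None:
    print(missing.name)
```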
A plain string argument is treated as a tag name. To search for text instead, use the text parameter, as shown here:
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, "html")

# text= matches the text inside tags rather than a tag name
plants_string = soup.find(text="plants")
print(plants_string)
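A text search returns the matched string itself, not the tag around it. The sketch below illustrates this on a two-tag inline fragment. It uses the `string` keyword, which Beautiful Soup 4.4.0 and later accept as the newer name for `text`; on an older release, `text=` should be substituted.

```python
from bs4 import BeautifulSoup

# Two <div> tags taken from one entry of the reference page.
html = '<div class="name">plants</div><div class="number">100000</div>'
soup = BeautifulSoup(html, "html.parser")

# string= is the newer spelling of text= (assumes bs4 >= 4.4.0).
plants_string = soup.find(string="plants")

print(type(plants_string).__name__)   # the match is a string node
print(plants_string.parent.name)      # the enclosing tag is reachable via .parent
print(plants_string.parent["class"])
```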
The text parameter also accepts a compiled regular expression. Take the following HTML:
<br/>
<div>The below HTML has the information that has email ids.</div>
abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>
Refer to the following code:
import re
from bs4 import BeautifulSoup

email_id_example = """<br/>
<div>The below HTML has the information that has email ids.</div>
abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>
"""
soup = BeautifulSoup(email_id_example, "html")
# Raw string so that \w and \. reach the regex engine unchanged
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
# Returns the first text node that contains a match
first_email_id = soup.find(text=emailid_regexp)
print(first_email_id)
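The text node returned by a regular-expression search is the whole node, surrounding whitespace included. A minimal sketch of pulling out just the address follows; `email_pattern`, `node`, and `address` are illustrative names, and `string=` assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

html = """<br/>
<div>The below HTML has the information that has email ids.</div>
abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>
"""
soup = BeautifulSoup(html, "html.parser")
email_pattern = re.compile(r"\w+@\w+\.\w+")

# The first matching node is the bare text between the <div> tags,
# which still carries its leading and trailing newlines.
node = soup.find(string=email_pattern)

# Apply the same pattern to the node to keep only the address.
address = email_pattern.search(node).group(0)
print(repr(node))
print(address)
```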
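Looking at the example HTML, the primary consumers sit in the ul tag whose id attribute is primaryconsumers.
Because that ul is not the first ul in the document, searching by tag name as before no longer works. Instead we search by tag attribute, as in the following code: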
from bs4 import BeautifulSoup

with open("ecologicalpyramid.html", "r") as ecological_pyramid:
    soup = BeautifulSoup(ecological_pyramid, "html")

# Any keyword that is not a named parameter of find() is treated as an attribute filter
primary_consumer = soup.find(id="primaryconsumers")
print(primary_consumer.li.div.string)
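An attribute filter can be written in more than one way. The sketch below, run against a trimmed inline copy of two lists from the reference page, shows three spellings that should select the same tag: a keyword argument, a tag name combined with a keyword, and the `attrs` dictionary.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of two lists from the reference page.
html = """
<ul id="producers"><li><div class="name">plants</div></li></ul>
<ul id="primaryconsumers"><li><div class="name">deer</div></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find(id="primaryconsumers")
by_name_and_keyword = soup.find("ul", id="primaryconsumers")
by_attrs = soup.find(attrs={"id": "primaryconsumers"})

# All three refer to the same Tag object in the tree.
print(by_keyword is by_name_and_keyword is by_attrs)
print(by_keyword.li.div.string)
```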
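Searching by keyword works for most tag attributes, including id, style, and title, but one group of attributes is an exception: custom attributes such as data-custom, and class.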
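customattr = """<p data-custom="custom">custom attribute example</p>"""
customsoup = BeautifulSoup(customattr, 'lxml')
# A keyword argument name cannot contain a hyphen, so this line is a SyntaxError:
# customsoup.find(data-custom="custom")

# Pass the custom attribute through the attrs dictionary instead
using_attrs = customsoup.find(attrs={'data-custom': 'custom'})
print(using_attrs)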
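The failure with a hyphenated attribute happens in Python itself, before Beautiful Soup is ever called. The sketch below makes that visible by compiling the offending call as a string, then performs the working search through `attrs`. It uses `html.parser` in place of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

custom_html = '<p data-custom="custom">custom attribute example</p>'
soup = BeautifulSoup(custom_html, "html.parser")

# Python reads data-custom="custom" as an expression (data minus custom)
# used as a keyword name, which is not valid syntax.
try:
    compile('soup.find(data-custom="custom")', "<example>", "eval")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)

# The attrs dictionary accepts any attribute name as a plain string key.
tag = soup.find(attrs={"data-custom": "custom"})
print(tag.string)
```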
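# class is a reserved word in Python, so a CSS class can be searched through attrs
css_class = soup.find(attrs={'class': 'primaryconsumerlist'})
print(css_class)
# The class_ keyword (note the trailing underscore) does the same job
css_class = soup.find(class_="primaryconsumerlist")
# which is equivalent to
css_class = soup.find(attrs={'class': 'primaryconsumerlist'})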
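To confirm that the two spellings of a class search agree, here is a small sketch against an inline copy of the primary consumers list. The names `by_keyword` and `by_attrs` are illustrative.

```python
from bs4 import BeautifulSoup

# Inline copy of the primary consumers list from the reference page.
html = """
<ul id="primaryconsumers">
  <li class="primaryconsumerlist"><div class="name">deer</div></li>
  <li class="primaryconsumerlist"><div class="name">rabbit</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find(class_="primaryconsumerlist")
by_attrs = soup.find(attrs={"class": "primaryconsumerlist"})

print(by_keyword is by_attrs)      # both return the first matching <li>
print(by_keyword.div.string)

# The value primaryconsumers is an id in this page, not a class,
# so a class search for it finds nothing.
print(soup.find(class_="primaryconsumers"))
```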
# A function that takes a tag and returns True or False can also serve as a filter
def is_secondary_consumers(tag):
    return tag.has_attr('id') and tag.get('id') == 'secondaryconsumers'

secondary_consumer = soup.find(is_secondary_consumers)
print(secondary_consumer.li.div.string)
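A function filter is most useful when the condition cannot be written as a single attribute value. As a sketch, the hypothetical helper `is_consumer_list` below accepts any `<ul>` whose id ends in "consumers", which covers three of the four lists in the reference page (shown here with empty bodies to keep the example short).

```python
from bs4 import BeautifulSoup

# The four list ids from the reference page, with empty bodies.
html = """
<ul id="producers"></ul>
<ul id="primaryconsumers"></ul>
<ul id="secondaryconsumers"></ul>
<ul id="tertiaryconsumers"></ul>
"""
soup = BeautifulSoup(html, "html.parser")

def is_consumer_list(tag):
    # Hypothetical filter: a <ul> whose id ends with "consumers".
    return tag.name == "ul" and tag.get("id", "").endswith("consumers")

first_match = soup.find(is_consumer_list)
all_ids = [tag["id"] for tag in soup.find_all(is_consumer_list)]

print(first_match["id"])
print(all_ids)
```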
# find_all() returns every match instead of only the first one
all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")
for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)
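# With the email soup and emailid_regexp from the earlier example:
# find_all() collects every text node that matches the pattern
email_ids = soup.find_all(text=emailid_regexp)
print(email_ids)
# limit caps the number of results returned
email_ids_limited = soup.find_all(text=emailid_regexp, limit=2)
print(email_ids_limited)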
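The effect of `limit` is easiest to see side by side with the unlimited search. The following sketch repeats the email example inline and strips the whitespace from each matched node. It assumes Beautiful Soup 4.4.0 or later for the `string` keyword.

```python
import re
from bs4 import BeautifulSoup

html = """<br/>
<div>The below HTML has the information that has email ids.</div>
abc@example.com
<div>xyz@example.com</div>
<span>foo@example.com</span>
"""
soup = BeautifulSoup(html, "html.parser")
email_pattern = re.compile(r"\w+@\w+\.\w+")

# Every matching text node, in document order.
all_emails = [s.strip() for s in soup.find_all(string=email_pattern)]

# Stop after the first two matches.
first_two = [s.strip() for s in soup.find_all(string=email_pattern, limit=2)]

print(all_emails)
print(first_two)
```

Calling find_all() with `limit=1` returns a one-element list, whereas find() returns the element itself.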
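# With the ecological pyramid soup:
# text=True matches every text node in the document
all_texts = soup.find_all(text=True)
print(all_texts)
# A list matches any text node equal to one of its items
all_texts_in_list = soup.find_all(text=["plants", "algae"])
print(all_texts_in_list)
# Output: ['plants', 'algae']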
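Because a search for all text also returns the whitespace between tags, the raw result usually needs filtering. The sketch below shows one way to do that on a small inline fragment and contrasts it with a list filter. The names are illustrative and `string=` assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

# Three name entries taken from the reference page.
html = """
<ul>
  <li><div class="name">plants</div></li>
  <li><div class="name">algae</div></li>
  <li><div class="name">deer</div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Every text node, including the newlines and indentation between tags.
all_texts = soup.find_all(string=True)

# Keep only nodes that contain something besides whitespace.
non_blank = [t.strip() for t in all_texts if t.strip()]

# A list filter keeps text nodes equal to any listed value.
selected = soup.find_all(string=["plants", "algae"])

print(len(all_texts) > len(non_blank))
print(non_blank)
print(selected)
```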
# A list of tag names matches any tag whose name is in the list
div_li_tags = soup.find_all(["div", "li"])
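# find_parents() searches upward through the ancestors of a tag
primaryconsumers = soup.find_all(class_="primaryconsumerlist")
primaryconsumer = primaryconsumers[0]
parent_ul = primaryconsumer.find_parents('ul')
print(parent_ul)
# find_parent() with no arguments returns the immediate parent
immediateprimary_consumer_parent = primaryconsumer.find_parent()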
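The difference between the two upward searches is that one returns a list of ancestors and the other a single tag. A minimal sketch on an inline fragment of the reference page, with illustrative variable names:

```python
from bs4 import BeautifulSoup

# One branch of the reference page: div > ul > li > div.
html = """
<div class="ecopyramid">
  <ul id="primaryconsumers">
    <li class="primaryconsumerlist"><div class="name">deer</div></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
first_li = soup.find(class_="primaryconsumerlist")

# find_parents() returns a list of every ancestor that matches.
ul_ancestors = first_li.find_parents("ul")

# find_parent() returns only the nearest matching ancestor;
# with no arguments that is the immediate parent.
nearest = first_li.find_parent()
outer_div = first_li.find_parent("div")

print([tag["id"] for tag in ul_ancestors])
print(nearest["id"])
print(outer_div["class"])
```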
# find_next_siblings() returns the siblings that follow a tag at the same level
producers = soup.find(id='producers')
next_siblings = producers.find_next_siblings()
print(next_siblings)
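As a sketch of sibling navigation, the fragment below keeps the four lists of the reference page, with empty bodies, under one parent. The tag name is passed explicitly so that only `<ul>` siblings are considered. find_next_sibling(), the singular form, returns just the first of them.

```python
from bs4 import BeautifulSoup

# The four lists of the reference page as siblings, with empty bodies.
html = """
<div class="ecopyramid">
  <ul id="producers"></ul>
  <ul id="primaryconsumers"></ul>
  <ul id="secondaryconsumers"></ul>
  <ul id="tertiaryconsumers"></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
producers = soup.find(id="producers")

# Every following <ul> at the same level, in document order.
following_ids = [ul["id"] for ul in producers.find_next_siblings("ul")]

# Only the nearest following <ul>.
nearest = producers.find_next_sibling("ul")

print(following_ids)
print(nearest["id"])
```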
# find_all_next() searches everything that comes after a tag in the document
first_div = soup.div
all_li_tags = first_div.find_all_next("li")
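The forward search covers everything that appears later in the document, including the starting tag's own descendants. The sketch below shows this on a shortened inline fragment by starting from two different tags. The `<li>` contents are simplified to bare names for brevity.

```python
from bs4 import BeautifulSoup

# Shortened fragment: two lists with bare names as list items.
html = (
    '<div class="ecopyramid">'
    '<ul id="producers"><li>plants</li><li>algae</li></ul>'
    '<ul id="primaryconsumers"><li>deer</li></ul>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Starting at the first <ul>: its own <li> children come after its
# opening tag, so they are included along with later ones.
from_producers = soup.find(id="producers").find_all_next("li")

# Starting at the second <ul>: only the <li> tags after that point.
from_consumers = soup.find(id="primaryconsumers").find_all_next("li")

print([li.string for li in from_producers])
print([li.string for li in from_consumers])
```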