2. Finding tags in a document with BeautifulSoup

Date: 2019-11-17


1. Finding tags with find_all() (or findAll())

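In BeautifulSoup, find() and findAll() form a pair of functions whose behavior is selected by the arguments passed to them. This post leaves out most of the finer detail from the book; the specifics are in the code comments.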

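Note that Python has two ways of writing comments: # for a single-line comment and ''' ... ''' for a multi-line one. Strictly speaking, the second form is an unused string literal rather than a true comment. In my tests, a file with several multi-line comments did not work, although the language itself sets no limit on them, so the cause was probably an indentation or quoting slip.

Another approach found online is to use a multi-line string delimited by """ """ as the comment.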
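
As a quick check on the comment styles described above, here is a minimal sketch. The function name greet is made up for this illustration. It assumes nothing beyond the standard interpreter and shows several triple-quoted blocks living in one file.

```python
# A single-line comment.

'''
First block: a triple-quoted string literal used as a comment.
'''

"""
Second block: double quotes behave the same way.
"""


def greet():
    """A docstring is the same kind of literal, stored as greet.__doc__."""
    return "hello"


result = greet()
print(result)
```

Inside an indented block, a triple-quoted "comment" has to follow the indentation of the surrounding code, which is one way such blocks can appear to fail.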

# Locate tags by tag name and attributes
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
# find_all(tag, attr): find the text inside <span class="green"> tags
nameList = bsObj.find_all("span", {"class": "green"})
for name in nameList:
    print(name.get_text())  # print the text without the surrounding tags

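# Notes on find() and findAll()
# findAll(tag, attr, recursive, text, limit, keywords)
# find(tag, attr, recursive, text, keywords)
# 1. tag: a tag name, or several names, e.g. bsObj.findAll({"h1","h2","h3"}), which match as "or"
# 2. attr: a dict of attribute names and values: {"name": "value"}
#    Several values for one attribute: bsObj.findAll("span", {"class": {"green", "red"}})
# 3. recursive: True searches all descendants; False looks only at direct children. Default: True
# 4. text: match on text content instead of tag properties. The whole string must match,
#    e.g. to count occurrences of "the prince":
#        nameList = bsObj.findAll(text="the prince")
#        print(len(nameList))
#    (newer bs4 releases call this argument string; text is the older name)
# 5. limit: cap the number of results; find() is findAll() with limit=1
# 6. keywords: select tags by attribute via keyword arguments; several keywords combine as "and",
#    e.g. allText = bsObj.findAll(id="text")
#         print(allText[0].get_text())
# Note: anything done with a keyword can also be done with the attr dict. Because class is a
# reserved word in Python, the following is a syntax error:
#     bsObj.findAll(class="green")
# These work instead: bsObj.findAll(class_="green") or bsObj.findAll("", {"class": "green"})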
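
The argument notes above can be tried without a network connection. The following is a minimal sketch: SAMPLE_HTML is made-up markup for this illustration, not the War and Peace page, and it uses string=, the newer name for the text argument.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = """
<html><body>
<h1>Title</h1>
<h2>Subtitle</h2>
<div id="text">
  <span class="green">Anna</span>
  <span class="red">the prince</span>
  <span class="green">Pierre</span>
</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Several tag names at once ("or" relation).
headings = [tag.name for tag in soup.find_all(["h1", "h2", "h3"])]

# One attribute with several accepted values ("or" relation).
colored = [s.get_text() for s in soup.find_all("span", {"class": ["green", "red"]})]

# Match on text content; the whole string has to be equal.
princes = soup.find_all(string="the prince")

# limit caps the number of results; find() returns only the first match.
first_two = soup.find_all("span", limit=2)
first_span = soup.find("span")

# Keyword arguments; class_ stands in for the reserved word "class".
by_id = soup.find_all(id="text")
greens = [s.get_text() for s in soup.find_all(class_="green")]

print(headings, colored, len(princes), greens)
```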

2. Using children/descendants

# Objects in BeautifulSoup: BeautifulSoup object, Tag object, NavigableString object,
#   Comment object (<!-- comment -->)
# Tags can be accessed directly: bsObj.div.findAll("img") finds every <img> inside the first <div> of the document
from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

for child in bsObj.find("table",{"id":"giftList"}).children:  # children: the direct children of the table tag
    # children covers the first level only; use descendants to walk every nested level
    print(child)
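
To make the difference between children and descendants concrete, here is a small sketch on made-up markup (SAMPLE_HTML is an assumption for this illustration, not the real page3.html). It keeps only Tag nodes so that text nodes do not clutter the result.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table, invented for this illustration.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th></tr>"
    "<tr><td>Vase <b>blue</b></td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# children: only the nodes directly below the table.
child_names = [node.name for node in table.children if isinstance(node, Tag)]

# descendants: every nested node, in document order.
descendant_names = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```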

3. Using sibling/siblings

# Demonstrates sibling tags
from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

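for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    # Prints every following sibling node. Note:
    # 1. the siblings do not include the tr itself
    # 2. next_siblings walks forward; use previous_siblings for the nodes before it
    # 3. next_sibling and previous_sibling return a single node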
    print(sibling)
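
The three notes on siblings can be checked on a tiny table. This is a sketch on made-up markup: the row ids are assumptions for this illustration, and the rows are written without whitespace between them so that every sibling is a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this illustration.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr id="header"></tr><tr id="gift1"></tr><tr id="gift2"></tr>'
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_row = soup.find("table", {"id": "giftList"}).tr
last_row = soup.find("tr", {"id": "gift2"})

# next_siblings walks forward and skips the starting row itself.
following = [row["id"] for row in first_row.next_siblings]

# previous_siblings walks backward, nearest sibling first.
preceding = [row["id"] for row in last_row.previous_siblings]

# The singular forms return one node, or None when there is none.
next_one = first_row.next_sibling["id"]
nothing_before = first_row.previous_sibling

print(following, preceding, next_one, nothing_before)
```

On a real page the rows are usually separated by newlines, so the siblings also include whitespace text nodes.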

4. Using parent

# Find the parent of a tag
from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).
      parent.previous_sibling.get_text())
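
The chain above goes from the image up to its enclosing cell, then to the cell before it. A minimal sketch on a made-up row (the item name and price are assumptions for this illustration) shows each step:

```python
from bs4 import BeautifulSoup

# Hypothetical table row, invented for this illustration.
SAMPLE_HTML = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
cell = img.parent                   # the <td> that holds the image
price_cell = cell.previous_sibling  # the <td> just before it
price = price_cell.get_text()

print(price)
```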

5. Regular expressions

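# This program uses regular expressions
# regular expression = what to match + how many times
# Online tester: RegexPal
# Regular expression for an email address: [A-Za-z0-9_+]+@[A-Za-z0-9]+\.(com|org|edu|net)
r'''
    The 12 regular expression symbols:
    1. *: zero or more occurrences
    2. +: one or more occurrences
    3. []: any one character from a set or range, e.g. [0-9]
    4. (): groups a subexpression, e.g. (a*b)*
    5. {m,n}: at least m and at most n occurrences
    6. [^]: any one character not in the set, e.g. [^A-Za-z] matches a non-letter
    7. |: alternation, e.g. a|b|c matches a, b, or c
    8. .: any single character
    9. ^: anchors at the start of the string, e.g. ^a matches an a at the start
    10. $: anchors at the end of the string, e.g. a$ matches an a as the last character
    11. \: escape character
    12. ?!: "does not contain" (negative lookahead), e.g. ^((?![A-Z]).)*$ matches a string with no uppercase letters
'''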
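
Two of the patterns listed above can be tried with Python's built-in re module. This is a minimal sketch: the test strings are made up for this illustration, and fullmatch is used so that the whole string has to fit the email pattern.

```python
import re

# The email pattern from the notes above.
EMAIL = re.compile(r"[A-Za-z0-9_+]+@[A-Za-z0-9]+\.(com|org|edu|net)")

# Negative lookahead: a string with no uppercase letters.
NO_UPPERCASE = re.compile(r"^((?![A-Z]).)*$")

email_ok = EMAIL.fullmatch("someone_1@example.com") is not None
email_bad = EMAIL.fullmatch("someone@example.io") is not None

lower_ok = NO_UPPERCASE.match("no caps here") is not None
lower_bad = NO_UPPERCASE.match("One Cap") is not None

# {m,n}: between two and three repetitions of "a".
count_ok = re.fullmatch(r"a{2,3}", "aaa") is not None
count_bad = re.fullmatch(r"a{2,3}", "aaaa") is not None

print(email_ok, email_bad, lower_ok, lower_bad, count_ok, count_bad)
```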

# The program below uses a regular expression to match image paths
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

images=bsObj.findAll("img",{"src":re.compile(r"\.\.\/img\/gifts\/img.*\.jpg")})
for image in images:
    print(image["src"])
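
The same filter can be checked offline. In this sketch the markup and file names are made up for this illustration; only the two paths that start with ../img/gifts/img and end in .jpg should survive.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
SAMPLE_HTML = (
    "<div>"
    '<img src="../img/gifts/logo.jpg">'
    '<img src="../img/gifts/img1.jpg">'
    '<img src="../img/gifts/img2.jpg">'
    '<img src="../img/other/img3.jpg">'
    "</div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A raw string keeps the backslashes intact for the regex engine.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# A compiled pattern can be used as an attribute value filter.
sources = [img["src"] for img in soup.find_all("img", {"src": pattern})]

print(sources)
```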

6. Lambda expressions and more

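# A tag's attributes can be accessed directly via tag.attrs, e.g. an image's source: myImgTag.attrs['src']
# lambda expression: a function passed as an argument into another function, e.g. f(g(x), y) instead of f(x, y)
# findAll() accepts a lambda, provided the function takes a single tag argument
# and returns a boolean
# BeautifulSoup passes every tag it encounters to the function and keeps the tags for which it returns True
# For example: soup.findAll(lambda tag: len(tag.attrs) == 2)
'''
Both of these tags have two attributes, so both would be returned:
    <div class="body" id="content"></div>
    <span style="color:red" class="title"></span>

Other libraries that provide the same functionality as BeautifulSoup:
    lxml
    HTML parser
'''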
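
The lambda filter described above can be run on the two example tags. This sketch adds a third, made-up tag with a single attribute (an assumption for this illustration) to show that it is filtered out.

```python
from bs4 import BeautifulSoup

# The two tags from the notes above plus one hypothetical single-attribute tag.
SAMPLE_HTML = (
    '<div class="body" id="content"></div>'
    '<span style="color:red" class="title"></span>'
    '<p id="only-one"></p>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The function receives each tag and returns True for the ones to keep.
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attrs]

# Attributes are available as a dict on each tag.
div_id = two_attrs[0].attrs["id"]

print(names, div_id)
```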
