Basic Usage of Beautiful Soup for Python Web Scraping

1. Introduction

  Simply put, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description goes roughly as follows:

  Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you need to scrape; because it is so simple, a complete application takes very little code.

  Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't need to think about encodings unless the document doesn't declare one and Beautiful Soup cannot detect it automatically; in that case, you only need to state the original encoding.
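A small sketch of that last point (the GBK-encoded markup is a made-up example, and the stdlib html.parser is used here so it runs without lxml): you can hand Beautiful Soup raw bytes and, when detection fails, name the original encoding yourself via the from_encoding argument.

```python
from bs4 import BeautifulSoup

# Raw bytes in a legacy encoding (GBK here, just as an example)
raw = "<html><body><p>你好,世界</p></body></html>".encode("gbk")

# Name the original encoding explicitly; the resulting tree is Unicode.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
print(soup.p.string)  # -> 你好,世界
```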

  Beautiful Soup works with excellent parsers such as lxml and html5lib, letting you choose flexibly between different parsing strategies or raw speed.

 

2. Installation

  Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. BS4 lives in the bs4 package, so it is imported with from bs4 import BeautifulSoup. The version used here is Beautiful Soup 4.3.2 (BS4 for short).

  1. Quick install

pip install beautifulsoup4

  2. If you want a specific version, downloading the source package and installing manually is also convenient:

    1. Beautiful Soup 3.2.1

    https://pypi.python.org/pypi/BeautifulSoup/3.2.1

    2. Beautiful Soup 4.3.2

      https://pypi.python.org/pypi/beautifulsoup4/

    After downloading, unpack the archive

    and run the following command to install:

     python setup.py install

  3. Then install lxml:

   pip install lxml

   Another available parser is html5lib, a pure-Python implementation that parses pages the same way a web browser does. It can be installed with:

   pip install html5lib

    Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Python's built-in parser is used. The lxml parser is more powerful and faster, so installing it is recommended.
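A quick way to see that the choice of parser matters is to feed invalid markup to different parsers; each repairs it in its own way. A minimal sketch with the stdlib parser (lxml and html5lib would each produce a different repaired tree):

```python
from bs4 import BeautifulSoup

broken = "<a></p>"

# The stdlib parser simply drops the stray </p>;
# lxml would also wrap the result in <html><body>...</body></html>.
print(BeautifulSoup(broken, "html.parser"))  # -> <a></a>
```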

3. Usage

  Official docs: http://beautifulsoup.readthedocs.io/zh_CN/latest/

  1. Importing

from bs4 import BeautifulSoup

  2. First, build an HTML string so we can try out the operations below:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

 

  3. Creating a BeautifulSoup object

soup = BeautifulSoup(html, "lxml")

  We can also parse a local HTML file:

soup = BeautifulSoup(open('index.html'), "lxml")

  4. Pretty-printed output

soup = BeautifulSoup(html,"lxml")
print(soup.prettify())

  Output (truncated to the head for brevity):

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>

   This is very handy: an unformatted local HTML file can be a mess to read, but after prettify() the structure of the document is clear at a glance.

  5. The four kinds of objects

  Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all objects fall into four types:

  (1)Tag
  (2)NavigableString
  (3)BeautifulSoup
  (4)Comment

(1)Tag

  What is a tag? A tag is simply an HTML element, familiar to anyone who has written HTML, e.g. <a href="https://www.baidu.com">my name a</a>

  Here is a feel for how tags work:

print(soup.title)
#<title>The Dormouse's story</title>

  

print(soup.head)
#<head><title>The Dormouse's story</title></head>

  Careful readers will notice the document has several p tags, but this only prints the first match from the top of the document down:

print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

  

print(soup.a)
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

  


  Tags have two frequently used attributes: name and attrs.

  name:

soup = BeautifulSoup(html,"lxml")
print(soup.name)
print(soup.head.name)
#[document]
#head

  

  attrs:

soup = BeautifulSoup(html,"lxml")
print(soup.p.attrs)
#{'name': 'dromouse', 'class': ['title']}

  

  Two different ways to read an attribute value:

soup = BeautifulSoup(html,"lxml")
print(soup.p.attrs)
print(soup.p.get("class"))
print(soup.p["class"])
#{'class': ['title'], 'name': 'dromouse'}
#['title']
#['title']

  

  Attributes can be read, and of course they can also be modified and deleted.

  Modify:

soup = BeautifulSoup(html,"lxml")
print(soup.p)
soup.p["class"]="newclass"
print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
#<p class="newclass" name="dromouse"><b>The Dormouse's story</b></p>

  

  Delete:

soup = BeautifulSoup(html,"lxml")
print(soup.p)
del soup.p["class"]
print(soup.p)
#<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
#<p name="dromouse"><b>The Dormouse's story</b></p>

  

(2)NavigableString

  1. We can find a tag with the attribute syntax above, but what if we want the text inside it? Use .string:

soup = BeautifulSoup(html,"lxml")
print(soup.p.string)
#The Dormouse's story

  

(3)BeautifulSoup 

The BeautifulSoup object represents the entire document. Most of the time it can be treated as a special Tag object, so we can read its type, name, and attributes in the same way:

soup = BeautifulSoup(html,"lxml")
print(type(soup.name))
print(soup.name)
print(soup.attrs)
#<class 'str'>
#[document]
#{}

  

(4)Comment

  A Comment object is a special kind of NavigableString. When printed, the comment markers are stripped away, which can cause unexpected trouble for text processing if it isn't handled carefully.

soup = BeautifulSoup(html,"lxml")
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))

#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
#Elsie 
#<class 'bs4.element.Comment'>

  

  The content of the a tag is actually a comment, but .string outputs it with the comment markers removed, which can cause unwanted trouble.

  Printing its type shows that it is a Comment, so it is best to check the type before using the value, like this:

import bs4
soup = BeautifulSoup(html,"lxml")
if type(soup.a.string)==bs4.element.Comment:
    print(soup.a.string)

  

  6. Navigating the document tree

  The difference between contents and children

  1. contents

  A tag's .contents attribute returns the tag's children as a list:

soup = BeautifulSoup(html,"lxml")
print(soup.p.contents)
#[<b>The Dormouse's story</b>]

  

  Since it's a list, we can index into it:

soup = BeautifulSoup(html,"lxml")
print(soup.p.contents[0])
#<b>The Dormouse's story</b>

  

  2. children

  .children does not return a list, but we can iterate over it to get all of the child nodes.

  Printing .children shows that it is a list iterator:

soup = BeautifulSoup(html,"lxml")
print(soup.p.children)
#<list_iterator object at 0x01BAE310>

  The iterator can be consumed with a for loop:

soup = BeautifulSoup(html,"lxml")
print(soup.p.children)
for line in soup.p.children:
    print(line)

  

  3. All descendants (.descendants)

  The .contents and .children attributes only include a tag's direct children. For example, the <head> tag has a single direct child, the <title> tag.

  .descendants:

soup = BeautifulSoup(html,"lxml")
for line in soup.descendants:
    print(line)

  contents and children cover the document one level at a time (children just needs a for loop), whereas .descendants recursively visits every tag and string, children of children and so on. If that's still unclear, the output below makes it obvious (note it comes from a slightly extended sample document, with a nested <span> and <a> inside the last link):

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>
Tillie
<span>Test<a>TEST</a></span>
Test
<a>TEST</a>
TEST
;
and they lived at the bottom of a well.


<p class="story">...</p>
...

 

 

  4. Node content (.string)

  Put simply: if a tag contains no further tags, .string returns its text. If a tag contains exactly one child tag, .string returns that innermost text. But if a tag contains several children, .string cannot tell which one you mean and returns None.

soup = BeautifulSoup(html,"lxml")
print(soup.head.string)
print(soup.title.string)
#The Dormouse's story
#The Dormouse's story

  

  If a tag has many children, the return value is None:

soup = BeautifulSoup(html,"lxml")
print(soup.html.string)
#None

  

  You might say: no problem, if there is a lot of content just iterate over it with a for loop. Let's try.

  The result is an error, because .string is None and None is not iterable.

  Put another way: if you find an a tag that has its own text but also contains another a tag (or any other tag) with its own text, calling .string on the outer tag is bound to give None.

soup = BeautifulSoup(html,"lxml")
for line in soup.html.string:
    print(line)
#TypeError: 'NoneType' object is not iterable

  

  5. Multiple strings

  .strings  .stripped_strings

  strings

  .strings retrieves every string in the document; you iterate over it to collect them, as in the example below:

soup = BeautifulSoup(html,"lxml")
for line in soup.strings:
    print(repr(line))
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
',\n'
'Lacie'
' and\n'
'Tillie'
'Test'
'TEST'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'

 

  

  stripped_strings

   The output above contains a lot of spaces and blank lines; .stripped_strings skips the extra whitespace:

soup = BeautifulSoup(html,"lxml")
for line in soup.stripped_strings:
    print(repr(line))
"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
','
'Lacie'
'and'
'Tillie'
'Test'
'TEST'
';\nand they lived at the bottom of a well.'
'...'

 

   

  6. Parent node

    The .parent attribute

   This returns the tag one level above the current tag (note that the parent is printed together with all of its child tags).

  Example 1:

soup = BeautifulSoup(html,"lxml")
print(soup.title.parent)
#<head><title>The Dormouse's story</title></head>

  Example 2:

soup = BeautifulSoup(html,"lxml")
print(soup.p.parent)

  Result:

<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

  

   7. All parents

   The .parents attribute

soup = BeautifulSoup(html,"lxml")
content = soup.head.title.string
for parent in  content.parents:
    print(parent.name)

  As the result shows, .parents walks recursively upward from the given node: parent, grandparent, great-grandparent, all the way to the top of the document. (It is a handy way to see exactly where a node sits in the tree.)

title
head
html
[document]

 

 

  8. Sibling tags

   The .next_sibling and .previous_sibling attributes

  Siblings are nodes at the same level as the current node: .next_sibling returns the node's next sibling and .previous_sibling the previous one; if no such node exists, None is returned.

  Note: in real documents, .next_sibling and .previous_sibling are usually strings or whitespace, because whitespace and newlines also count as nodes, so the result you get may be a blank string or a newline.

   1. next_sibling

soup = BeautifulSoup(html,"lxml")
print(soup.p.next_sibling.next_sibling)
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>;
and they lived at the bottom of a well.</p>

 

   

  2. previous_sibling

soup = BeautifulSoup(html,"lxml")
print(soup.p.previous_sibling.previous_sibling)
#None

  

  9. All siblings

    The .next_siblings and .previous_siblings attributes

  1. next_siblings

  This iterates over every sibling that comes after the current tag (useful, for example, when walking the fields of a form):

soup = BeautifulSoup(html,"lxml")
for a in soup.a.next_siblings:
    print(a)

  Result:

,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie<span>Test<a>TEST</a></span></a>
;
and they lived at the bottom of a well.

 

   

  10. Next and previous elements

   The .next_element and .previous_element attributes

   Unlike .next_sibling and .previous_sibling, these are not restricted to siblings: they step through every node in document order, regardless of nesting level.

  Note: the previous or next element may be the text inside a tag, not just another tag.

   1. next_element (forward)

soup = BeautifulSoup(html,"lxml")
print(soup.p.next_element )
#<b>The Dormouse's story</b>

  2. previous_element (backward)

  The element before the first a tag is the text at the start of the second p tag (just that string, not the whole tag):

soup = BeautifulSoup(html,"lxml")
print(soup.a.previous_element)
#Once upon a time there were three little sisters; and their names were

  

  11. Searching the document tree

  1. find_all( name , attrs , recursive , text , **kwargs )

  find_all() searches all of the current tag's descendants and collects those that match the filters.

     1) The name parameter

    The name parameter finds every tag with the given name; string nodes are ignored automatically.

    A) Pass a string: all the a tags are found and returned in a list

soup = BeautifulSoup(html,"lxml")
print(soup.find_all('a') )
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  

     B) Pass a regular expression

    If you pass a regular expression, Beautiful Soup matches tag names against it with the expression's match() method. The example below finds every tag whose name starts with "a":

import re

soup = BeautifulSoup(html,"lxml")
print(soup.find_all(re.compile('^a')) )
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  

     C) Pass a list

soup = BeautifulSoup(html,"lxml")
print(soup.find_all(['a','p']) )

    Result:

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

 

 

     D) Pass a function

    If none of the other filters fit, you can define a function that takes a single tag argument and returns True when the tag matches and False otherwise.

    The function below checks the current tag and returns True if it has a class attribute but no id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

    Passing this function to find_all() returns all the <p> tags:

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
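Putting the pieces together, a minimal runnable sketch of the function filter (using a cut-down version of the sample document and the stdlib html.parser for portability):

```python
from bs4 import BeautifulSoup

html = """
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<p class="story">...</p>
"""

def has_class_but_no_id(tag):
    # True for tags that carry a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(has_class_but_no_id):
    # Both p tags match; the a tag has an id, so it is excluded
    print(tag.name)
```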

 

  2) The keyword parameter

  Note: if a keyword argument's name is not one of find_all()'s built-in parameter names, it is treated as a filter on a tag attribute of that name. Passing an id argument, for instance, makes Beautiful Soup search every tag's "id" attribute:

soup = BeautifulSoup(html,"lxml")
print(soup.find_all(id='link2'))
#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

  Regular expressions work here too:

import re

soup = BeautifulSoup(html,"lxml")
print(soup.find_all(href=re.compile('elsie')))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

  

  You can also pass several attribute filters at once for a more precise search:

import re

soup = BeautifulSoup(html,"lxml")
print(soup.find_all(href=re.compile("elsie"), id='link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

  

  What if we want to filter on class? class is a Python keyword, so find_all() accepts class_ with a trailing underscore instead:

soup = BeautifulSoup(html,"lxml")
print(soup.find_all("a", class_="sister"))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  

  Some tag attributes can't be used as keyword arguments, such as the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "lxml")
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
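The workaround is to pass such attributes in a dictionary via the attrs argument instead of as keyword arguments. A minimal sketch (stdlib html.parser used here so it runs without lxml):

```python
from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

# attrs takes a dict, so any attribute name is allowed
print(data_soup.find_all(attrs={"data-foo": "value"}))
# -> [<div data-foo="value">foo!</div>]
```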

  

  3) The text parameter

  The text parameter searches string content rather than tags; it accepts a string, a list of strings, or a regular expression:

import re

soup = BeautifulSoup(html,"lxml")
print(soup.find_all(text=" Elsie "))
# [' Elsie ']

print(soup.find_all(text=["Tillie", " Elsie ", "Lacie"]))
#[' Elsie ', 'Lacie', 'Tillie']

print(soup.find_all(text=re.compile("Dormouse")))
#["The Dormouse's story", "The Dormouse's story"]

  

  4) The limit parameter

  find_all() returns every match; on a large document that can be slow. If you don't need them all, the limit parameter caps the number of results, much like LIMIT in SQL: the search stops as soon as limit matches have been found.

  The document contains three matching tags, but only two are returned because of the limit:

soup = BeautifulSoup(html,"lxml")
print(soup.find_all('a',limit=2))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

  

   5) The recursive parameter

  By default find_all() searches all of a tag's descendants. With recursive=False only direct children are considered; since soup's only direct child is the <html> tag, searching soup for p tags with recursive=False finds nothing:

soup = BeautifulSoup(html,"lxml")
print(soup.find_all('p'))
print(soup.find_all('p',recursive=False))

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
[]

  

   

  6) A wave of find variants

  1. find( name , attrs , recursive , text , **kwargs )

  The only difference from find_all() is that find_all() returns a list of all matching elements, while find() returns the first match directly:

soup = BeautifulSoup(html,"lxml")
print(soup.find_all('a'))
print(soup.find('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

  

  2. find_next_siblings()  find_next_sibling()

    These two methods use the .next_siblings attribute to iterate over the sibling tags parsed after the current one: find_next_siblings() returns all the later siblings that match, and find_next_sibling() returns only the first matching later sibling.

  3. find_previous_siblings()  find_previous_sibling()

    These two methods use the .previous_siblings attribute to iterate over the sibling tags parsed before the current one: find_previous_siblings() returns all the earlier siblings that match, and find_previous_sibling() returns the first matching earlier sibling.

  4. find_all_next()  find_next()

     These two methods use the .next_elements attribute to iterate over the tags and strings that come after the current tag: find_all_next() returns all matching nodes, and find_next() returns the first match.

  5. find_all_previous() and find_previous()

    These two methods use the .previous_elements attribute to iterate over the tags and strings that come before the current node: find_all_previous() returns all matching nodes, and find_previous() returns the first match.

  Note: all of the methods above take the same parameters as find_all() and work on the same principle, so they are not covered again here.
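A minimal sketch of the sibling/element search variants, run against a simplified version of the sample document (stdlib html.parser for portability):

```python
from bs4 import BeautifulSoup

html = """
<p class="story">Once upon a time there were three little sisters;
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>.
</p>
"""
soup = BeautifulSoup(html, "html.parser")
first_a = soup.a

# First following sibling that is an a tag
print(first_a.find_next_sibling("a")["id"])                # -> link2
# All following siblings that are a tags
print([t["id"] for t in first_a.find_next_siblings("a")])  # -> ['link2', 'link3']
# First a tag anywhere after this node in document order
print(first_a.find_next("a")["id"])                        # -> link2
```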

   

  

  12. CSS selectors

   In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id with #. We can filter elements the same way using soup.select(), which returns a list.

  

  (1) Find by tag name

soup = BeautifulSoup(html,"lxml")
print(soup.select('title'))
#[<title>The Dormouse's story</title>]

  

soup = BeautifulSoup(html,"lxml")
print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  

soup = BeautifulSoup(html,"lxml")
print(soup.select('b'))
#[<b>The Dormouse's story</b>]

    

  (2) Find by class name

soup = BeautifulSoup(html,"lxml")
print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  

  

  (3) Find by id

soup = BeautifulSoup(html,"lxml")
print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

   

  (4) Combined selectors

  Selectors combine exactly as they do in a CSS file. For example, to find the element with id link1 inside a p tag, separate the two selectors with a space:

soup = BeautifulSoup(html,"lxml")
print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

  

  (5) Find by attribute

  You can also filter on attributes, written in square brackets. Note that an attribute selector belongs to the same node as the tag name, so there must be no space between them, or nothing will match.

soup = BeautifulSoup(html,"lxml")
print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

  

soup = BeautifulSoup(html,"lxml")
print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

  

  Attribute selectors combine with the other forms in the same way: a space between selectors for different nodes, no space within the same node.

soup = BeautifulSoup(html,"lxml")
print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

  

  select() always returns a list; you can iterate over it and call get_text() on each element to get its text content.

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
#<class 'list'>
print(soup.select('title')[0].get_text())
#The Dormouse's story
for title in soup.select('title'):
    print(title.get_text())
#The Dormouse's story

  

Reprinted from: 靜覓 » Python爬蟲利器二之Beautiful Soup的用法
