Basic Usage of the Beautiful Soup Library

Example page: https://python123.io/ws/demo.html

>>> import requests
>>> r = requests.get('https://python123.io/ws/demo.html')
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

We use soup's prettify() method to pretty-print the HTML page.
Basic Elements of the BeautifulSoup Library
An introduction to the HTML elements involved:

<p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
</p>

In the code above, <p>..</p> is a tag (Tag);
the tag's name is p;
class is an attribute (Attributes), and attributes are organized as key-value pairs;
the BeautifulSoup object itself corresponds to the whole HTML page.
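
You can check this correspondence directly in the interactive session:

>>> type(soup)
<class 'bs4.BeautifulSoup'>
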
BeautifulSoup Library Parsers
Any of the following parsers can be used to process an HTML document.

Parser               Usage                              Requirement
bs4's HTML parser    BeautifulSoup(mk,'html.parser')    install the bs4 library
lxml's HTML parser   BeautifulSoup(mk,'lxml')           pip install lxml
lxml's XML parser    BeautifulSoup(mk,'xml')            pip install lxml
html5lib parser      BeautifulSoup(mk,'html5lib')       pip install html5lib
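
A minimal sketch of switching parsers, assuming the optional lxml and html5lib packages have been installed with pip; all of them accept the same markup string:

from bs4 import BeautifulSoup

markup = "<p class='title'><b>Hello</b></p>"
soup_std  = BeautifulSoup(markup, 'html.parser')   # bundled parser, needs only bs4
soup_lxml = BeautifulSoup(markup, 'lxml')          # fast HTML parsing, requires lxml
soup_xml  = BeautifulSoup(markup, 'xml')           # strict XML parsing, also requires lxml
soup_h5   = BeautifulSoup(markup, 'html5lib')      # browser-like parsing, requires html5lib
print(soup_std.b.string)                           # prints: Hello (the other soups give the same result)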

Basic elements of the BeautifulSoup class

Basic element      Description
Tag                A tag, the most basic unit of information, opened with <> and closed with </>
Name               The tag's name; the name of <p>..</p> is 'p'; accessed as <tag>.name
Attributes         The tag's attributes, organized as a dictionary; accessed as <tag>.attrs
NavigableString    The non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string
Comment            The comment part of the string inside a tag; a special string type (a subclass of NavigableString)
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Every tag in soup can be accessed as soup.tag; when the document contains several tags with the same name, only the first one is returned.
Accessing a tag's name:

>>> tag.name
'a'
>>> tag.parent.name
'p'
>>> tag.parent.parent.name
'body'

Accessing a tag's attributes; a dictionary is returned whether or not the tag has any attributes.

>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> tag.attrs['class']
['py1']
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>  # the tag itself has its own Tag type

Accessing the NavigableString:

>>> tag.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>  # this string is not an ordinary Python str

Accessing comments (Comment):

>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>",'html.parser')
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>

If you do not need the comment content when analyzing the text, you can filter comments out by checking the node's type.
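A minimal sketch of that filter, reusing newsoup from the transcript above; since Comment is a subclass of NavigableString, an isinstance check is enough:

>>> import bs4
>>> [s for s in newsoup.strings if not isinstance(s, bs4.element.Comment)]
['This is not a comment']
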
Traversal of the HTML tag tree:
1. Downward traversal, from the root node toward the leaf nodes
2. Upward traversal, from a leaf node toward the root node
3. Sideways (parallel) traversal, between sibling nodes under the same parent
Attributes for downward traversal:

Attribute      Description
.contents      A list of child nodes: all of <tag>'s children stored in a list
.children      An iterator over the child nodes, similar to .contents, for looping over the children
.descendants   An iterator over the descendant nodes, containing every descendant, for looping over all of them
>>> soup.body
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> for child in soup.body.children:
    print(child)

    


<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>


>>> for child in soup.body.descendants:
    print(child)

    


<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.

Attributes for upward traversal:

Attribute   Description
.parent     The node's parent tag
.parents    An iterator over the node's ancestor tags, for looping over the ancestors
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent
>>> for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

        
p
body
html
[document]

Sideways traversal of the tag tree

Attribute            Description
.next_sibling        Returns the next sibling tag in HTML document order
.previous_sibling    Returns the previous sibling tag in HTML document order
.next_siblings       An iterator over all following sibling tags in HTML document order
.previous_siblings   An iterator over all preceding sibling tags in HTML document order

Important: sideways traversal only takes place among nodes that share the same parent.

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
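
The iterator forms work the same way; note that plain strings such as ' and ' also count as siblings, so you may want to keep only Tag nodes when looping (a short sketch on the same demo page):

>>> for sibling in soup.a.next_siblings:
    print(sibling)

 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.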

Pretty-printing an HTML page with bs4:

print(soup.prettify())

XML (eXtensible Markup Language): similar in form to HTML. The earliest general-purpose information markup language; highly extensible, but verbose. It marks up information with tags in angle brackets.
JSON (JavaScript Object Notation): an information markup form taken from JavaScript's object syntax.
JSON consists of typed key:value pairs. Because the values carry types, it is well suited to processing by programs (e.g. JavaScript) and is more concise than XML. It is mainly used for communication between program nodes and has no comment syntax.
YAML: untyped key:value pairs. The information carries no type markers, the proportion of useful text is the highest of the three, and it reads well. It is typically used for configuration files of all kinds of systems; it allows comments and is easy to read.
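
As a small illustration of the typed key:value idea, the standard-library json module can serialize a Python dict; the record below is made up for illustration, and the YAML form appears only as a comment (producing it programmatically would need a third-party package such as PyYAML):

import json

record = {"name": "BIT", "founded": 1940, "courses": ["Basic Python", "Advanced Python"]}
print(json.dumps(record))
# {"name": "BIT", "founded": 1940, "courses": ["Basic Python", "Advanced Python"]}
# The number and the list keep their types, which is what makes JSON easy for programs to consume.
#
# The same record written as YAML-style untyped key:value text (comments allowed):
#   name: BIT
#   founded: 1940
#   courses:
#     - Basic Python
#     - Advanced Python
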
The basic approach to parsing text (for example, extracting every link in the page):

>>> for link in soup.find_all('a'):
    print(link.get('href'))

    
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

The soup.find_all() method:
<>.find_all(name, attrs, recursive, string, **kwargs)
It returns a list type that stores the search results.
name: a search string (or list of strings) matched against tag names

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a')[0]
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.find_all(['a','b'])  # to search for several tag names at once, pass a list
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> for tag in soup.find_all(True):
    print(tag.name)

    
html
head
title
body
p
b
p
a
a

Now suppose we want to search for every tag whose name starts with 'b', including both <b> and <body>.

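A short sketch using the standard-library re module: find_all also accepts a compiled regular expression, which it matches against tag names with re.search, so '^b' anchors the pattern at the start of the name:

>>> import re
>>> for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

body
b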

attrs: a search string matched against tag attribute values; you can also restrict the search to a specific attribute
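Two typical attribute searches on the demo page (a short sketch; a bare string in the second position is matched against the class attribute, while a keyword argument matches the attribute of that name):

>>> soup.find_all('a', 'py1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link2')
[<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
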
recursive: whether to search all descendants; defaults to True
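For example, the <a> tags are not direct children of the document root, so searching with recursive=False finds nothing:

>>> soup.find_all('a', recursive=False)
[]
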
string: a search string matched against the string areas between <> and </>
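A short sketch of string searches on the demo page; a plain string must match a NavigableString exactly, while a regular expression matches any string containing the pattern (re is case-sensitive, so only the lowercase 'python' occurrences match here):

>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
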
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
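
For instance, soup('a') returns the same list as the soup.find_all('a') call shown earlier:

>>> soup('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]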

Example: implementing a crawler for the Chinese university ranking
The URL is http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html
First, make sure the information we want actually appears in the HTML page itself rather than being generated by JavaScript.
Here all of the information is visible in the HTML page, so we can sketch an initial program structure.
Step 1: fetch the ranking page from the web: getHTMLText()
Step 2: extract the information from the page into a suitable data structure: fillUnivList() (the key step)
Step 3: use that data structure to display and print the results: printUnivList()

import requests
from bs4 import BeautifulSoup as bs
import bs4  # imported so that bs4.element.Tag is available for the isinstance check below

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        print('Failed to fetch the page')
        return ''

def fillUnivList(ulist, html):
    soup = bs(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:  # nothing to parse, e.g. the download failed
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip non-Tag children such as newline strings
            tds = tr('td')  # same as tr.find_all('td'): collect every td cell in the row
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

def printUnivList(ulist, num):
    fmt = "{:^10}\t{:^6}\t{:^10}"
    print(fmt.format('Rank', 'University', 'Province/City'))
    for i in range(num):
        u = ulist[i]
        print(fmt.format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # print the top 20 universities

main()