[Summary] BeautifulSoup Quick Reference

Date: 2019-11-07


1. Download the page HTML:

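from urllib.request import urlopen  # Python 3; the original Python 2 code used urllib2.urlopen
response = urlopen(page_link, timeout=time_out)
web_page = response.read()  # raw bytes, not yet text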

2. Decode the page (this fixes garbled Chinese text):

decode_s = web_page.decode("utf-8")
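
Steps 1 and 2 can be combined into a small helper. The following is a minimal sketch for Python 3, where urllib.request replaces urllib2. The names fetch_html and decode_page, the default timeout, and the fallback behavior for undecodable bytes are assumptions made for illustration, not part of the original notes.

```python
from urllib.request import urlopen


def decode_page(raw_bytes, charset=None):
    """Decode page bytes, defaulting to UTF-8 as in step 2."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw_bytes.decode(charset or "utf-8", errors="replace")


def fetch_html(page_link, time_out=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(page_link, timeout=time_out) as response:
        web_page = response.read()
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset()
    return decode_page(web_page, charset)
```

Passing the declared charset matters for pages that are not UTF-8 (for example GBK), which is a common source of garbled Chinese text.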

3. Parse the string into a tree structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(decode_s, "lxml")  # the "lxml" parser requires the lxml package
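
To see what the resulting tree looks like, here is a small sketch that parses a literal string instead of a downloaded page. It uses Python's built-in "html.parser" so that it runs without lxml installed; the parser choice and the sample markup are assumptions for this example only.

```python
from bs4 import BeautifulSoup

decode_s = '<html><body><p id="greeting">Hello</p></body></html>'

# "html.parser" ships with Python; "lxml" must be installed separately.
soup = BeautifulSoup(decode_s, "html.parser")

first_p = soup.p               # attribute access returns the first <p> tag
print(first_p.name)            # p
print(first_p["id"])           # greeting
print(first_p.string)          # Hello
print(first_p.parent.name)     # body
```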

Now for the main topic. Suppose we are searching the following HTML fragment:

<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>GHIJK</h1>OPQRST</div>

Suppose we want to extract ABCDEFGHIJK (the text only, without the <a> tag):

4. Iterate to find specific elements:

for tag in soup.find_all("div"):

5. Check whether the element has a specific attribute:

if tag.attrs is not None and 'class' in tag.attrs.keys():

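6. Check whether the attribute's value equals a specific value (a space-separated value such as class is returned as a list, so check the list length as well):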

if (len(tag.attrs['class']) == 2 and
        tag.attrs['class'][0] == 'cell' and
        tag.attrs['class'][1] == 'maket'):
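
The same check can be written more compactly. The sketch below shows three ways bs4 can match the two classes; it assumes the sample fragment from above and the built-in "html.parser", and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = ('<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>'
        'GHIJK</h1>OPQRST</div>')
soup = BeautifulSoup(html, "html.parser")
tag = soup.div

# class is a multi-valued attribute, so bs4 stores it as a list.
# tag.get() returns None instead of raising when the attribute is missing.
print(tag.get("class"))  # ['cell', 'maket']

# 1. Compare the whole list at once (order matters).
exact = tag.get("class") == ["cell", "maket"]

# 2. Let find_all match the exact class string.
by_string = soup.find_all("div", class_="cell maket")

# 3. A CSS selector matches both classes in any order,
#    and also matches elements that carry extra classes.
by_selector = soup.select("div.cell.maket")
```

The first two forms are strict about order and count, like the explicit length check; the selector form is looser, so pick whichever matches the pages being scraped.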

7. Get the text of a child tag under this tag:

tag.h1.get_text()
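
Putting steps 4 through 7 together on the sample fragment gives the sketch below. The "html.parser" backend and the results list are assumptions made so that the example runs on its own.

```python
from bs4 import BeautifulSoup

html = ('<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>'
        'GHIJK</h1>OPQRST</div>')
soup = BeautifulSoup(html, "html.parser")

results = []
for tag in soup.find_all("div"):                          # step 4
    if tag.attrs is not None and "class" in tag.attrs:    # step 5
        classes = tag.attrs["class"]
        if (len(classes) == 2 and classes[0] == "cell"
                and classes[1] == "maket"):               # step 6
            # get_text() joins all text inside <h1>, including
            # the text of the nested <a>, and drops the markup.
            results.append(tag.h1.get_text())             # step 7

print(results)  # ['ABCDEFGHIJK']

# Calling get_text() on the <div> itself would also include LMN and OPQRST.
div_text = soup.div.get_text()
```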

A note on using children: .children returns an iterable, but not every item in it is a tag; many are likely to be bs4.element.NavigableString objects. So when iterating, check each item's type first:

from bs4 import element

for span in tag.parent.children:
    if (isinstance(span, element.Tag) and
            span.attrs is not None and
            'class' in span.attrs.keys()):
        pass  # process the tag here
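
To make the mix of types concrete, this sketch lists the direct children of the sample <div>. It assumes the built-in "html.parser"; the kinds list is only there to record what the loop sees.

```python
from bs4 import BeautifulSoup, element

html = ('<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>'
        'GHIJK</h1>OPQRST</div>')
soup = BeautifulSoup(html, "html.parser")

kinds = []
for child in soup.div.children:
    if isinstance(child, element.Tag):
        # Real tags have a name and an attrs dict.
        kinds.append(("Tag", child.name))
    elif isinstance(child, element.NavigableString):
        # Bare text between tags; it has no attrs to inspect.
        kinds.append(("NavigableString", str(child)))

print(kinds)
# [('NavigableString', 'LMN'), ('Tag', 'h1'), ('NavigableString', 'OPQRST')]
```

Only the middle item is a tag, which is why accessing attrs on every child without the isinstance check would fail on the text nodes.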
