1. Download the page's HTML:
import urllib2  # Python 2 only; Python 3 moved this to urllib.request
response = urllib2.urlopen(page_link, timeout=time_out)
web_page = response.read()
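The snippet above relies on urllib2, which exists only in Python 2. A minimal sketch of the same download under Python 3, assuming a hypothetical helper name download_page and a default timeout of 10 seconds (neither is from the original):

```python
import urllib.request


def download_page(page_link, time_out=10):
    """Return the raw response body as bytes (hypothetical helper)."""
    # The with-block closes the connection once the body has been read.
    with urllib.request.urlopen(page_link, timeout=time_out) as response:
        return response.read()
```

Calling web_page = download_page(page_link) then yields the same bytes object that the later decode step expects.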
2. Decode the page (this fixes garbled Chinese text):
decode_s = web_page.decode("utf-8")
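A hard-coded decode("utf-8") raises UnicodeDecodeError when a page uses a different encoding. A minimal sketch, assuming Python 3 and a hypothetical helper named decode_html, that tries candidate encodings in order; the gb18030 fallback is an assumption that suits many Chinese pages, not something the original steps prescribe:

```python
def decode_html(raw, encodings=("utf-8", "gb18030")):
    """Decode raw page bytes, trying each candidate encoding in order.

    Hypothetical helper: the candidate list is an assumption, so adjust
    it to the sites you actually fetch.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate did not fit, try the next one
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


print(decode_html("中文".encode("utf-8")))
print(decode_html("中文".encode("gbk")))
```

Both calls print the same two characters, because the GBK bytes are not valid UTF-8 and fall through to the second candidate.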
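3. Parse the string into a tree structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(decode_s, "lxml")  # the "lxml" parser requires the lxml package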
Now for the main topic. Suppose we are searching the following piece of HTML:
<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>GHIJK</h1>OPQRST</div>
Suppose we want to extract ABCDEFGHIJK, that is, the text of the <h1> without the <a> markup:
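4. Iterate over the candidate elements:
for tag in soup.find_all("div"):
5. Check whether the attribute is present on the element:
    if 'class' in tag.attrs:
6. Check whether the attribute's value equals the target value (bs4 splits a space-separated class attribute into a list, so check the list's length as well):
        if len(tag.attrs['class']) == 2 and tag.attrs['class'][0] == 'cell' and tag.attrs['class'][1] == 'maket':
7. Extract the text of the child tag under that element (get_text() drops the <a> markup and returns ABCDEFGHIJK):
            tag.h1.get_text()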
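Steps 4 to 7 can be put together as one runnable sketch. The function names are hypothetical, and "html.parser" is used instead of "lxml" only so that the example needs no extra package. The second function shows a CSS selector as one possible alternative to comparing the class list by hand:

```python
from bs4 import BeautifulSoup

HTML = '<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>GHIJK</h1>OPQRST</div>'


def h1_text_by_loop(soup):
    """Steps 4 to 7: scan every <div> and compare its class list."""
    for tag in soup.find_all("div"):
        # bs4 stores a multi-valued attribute such as class as a list.
        if tag.attrs.get("class") == ["cell", "maket"] and tag.h1 is not None:
            return tag.h1.get_text()
    return None


def h1_text_by_selector(soup):
    """Alternative: a CSS selector that requires both classes."""
    h1 = soup.select_one("div.cell.maket h1")
    return h1.get_text() if h1 is not None else None


soup = BeautifulSoup(HTML, "html.parser")  # assumption: stdlib parser instead of "lxml"
print(h1_text_by_loop(soup))      # ABCDEFGHIJK
print(h1_text_by_selector(soup))  # ABCDEFGHIJK
```

The list comparison matches only when the classes appear in exactly that order, while the selector accepts them in any order and alongside other classes, so the two are not strictly equivalent.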
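A note on using children: children returns an iterable, but not every item it yields is a Tag. An item may well be a bs4.element.NavigableString (the text between tags), so check each item's type before touching attributes such as attrs:
from bs4 import element
for span in tag.parent.children:
    if isinstance(span, element.Tag) and 'class' in span.attrs:
        pass  # handle the matching tag here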
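To see why the type check matters, the sketch below lists the direct children of the sample <div>. The helper name describe_children is hypothetical, and "html.parser" is again an assumption made to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

HTML = '<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>GHIJK</h1>OPQRST</div>'


def describe_children(parent):
    """Return a (kind, value) pair for each direct child of parent."""
    described = []
    for child in parent.children:
        if isinstance(child, Tag):
            described.append(("tag", child.name))
        elif isinstance(child, NavigableString):
            described.append(("text", str(child)))
    return described


soup = BeautifulSoup(HTML, "html.parser")
print(describe_children(soup.div))
# [('text', 'LMN'), ('tag', 'h1'), ('text', 'OPQRST')]
```

Only the middle item is a Tag. The two text items are NavigableString objects, which have no attributes of their own, so code that reads attrs on every child without the isinstance check would fail on them.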