Original article:
http://www.cnblogs.com/yupeng/p/3362031.html
This article also covers the topic thoroughly:
http://www.cnblogs.com/twinsclover/archive/2012/04/26/2471704.html
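I spent a little time with the bs4 library and everything I ran worked well. It parses the various structures of an HTML document, is similar to the ElementTree library for parsing XML, and is used in much the same way.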
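To make the comparison with ElementTree concrete, here is a minimal sketch that runs the same lookups through both libraries. The `markup` string and its `root`/`item` tag names are made up for illustration and do not come from the original post.

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Hypothetical sample markup, small enough to be valid for both parsers
markup = '<root><item id="a">one</item><item id="b">two</item></root>'

# ElementTree: parse well-formed XML, then search with findall()/find()
root = ET.fromstring(markup)
et_texts = [el.text for el in root.findall('item')]
et_first_id = root.find('item').get('id')

# bs4: parse with the built-in html.parser, then search with find_all()/find()
soup = BeautifulSoup(markup, 'html.parser')
bs_texts = [tag.get_text() for tag in soup.find_all('item')]
bs_first_id = soup.find('item')['id']

print(et_texts, bs_texts)
print(et_first_id, bs_first_id)
```

The main practical difference is that ElementTree expects well-formed XML, while bs4 is designed to cope with the loosely structured HTML found on real pages.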
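The script below can be run and debugged directly to get familiar with how bs4 is used. Uncomment one line at a time to see what each call returns.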
# coding=utf-8
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# print(soup.title)
# print(soup.title.name)
# print(soup.title.string)
# print(soup.p)
# print(soup.a)
# print(soup.find_all('a'))
# a = soup.find_all('a')
# print(len(a))
# print(soup.find_all('p'))  # returns a list-like structure
# p = soup.find_all('p')
# print(len(p))
# print(soup.find(id='link3'))

# print(soup.get_text())  # returns the text of the whole document
# print(soup.p.get_text())  # returns the text of the selected node only
# for i in soup.find_all('p'):
#     print(i.get_text())
#     print(i.contents)
# print(soup.a['href'], soup.a['class'], soup.a['id'], soup.a.text)  # every attribute of a single node is accessible
# print(soup.html, soup.head, soup.body)  # the whole document, the head, the body, with their full structure
# print(soup.p.contents, soup.head.contents)  # returns the child nodes as a list
# for i in list(soup.head.children):  # iterate over the children without knowing their names
#     print(i, end=' ')
# print(soup.a.parent)  # look upward; .parents yields all ancestors
# for i in soup.html.parents:
#     print(i, len(i))
# print(soup.a.parent)
# print(soup.find_all(class_="sister"))
print(soup.find_all('a', limit=1))  # limit the number of results
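As a small follow-up to the calls above, the sketch below combines `find_all`, attribute access, and `.parents` to pull structured data out of the page. It uses a shortened copy of the sample document, and the variable names (`links`, `first_only`, `ancestor_names`) are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document used in the script above
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Collect (id, href, text) for every <a> tag whose class is "sister"
links = [(a['id'], a['href'], a.get_text())
         for a in soup.find_all('a', class_='sister')]

# limit=1 still returns a list, just with at most one element
first_only = soup.find_all('a', limit=1)

# Walk upward from the first <a> tag and record each ancestor's tag name
ancestor_names = [parent.name for parent in soup.a.parents]

for link in links:
    print(link)
print(first_only)
print(ancestor_names)
```

Note that `find_all(..., limit=1)` returns a one-element list, whereas `find(...)` returns the tag itself (or `None` when nothing matches), so the two are not interchangeable without indexing.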