1. Beautiful Soup documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/html
2. Python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc)
print(soup.prettify())
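1) soup.prettify() returns the HTML as a formatted string, with each tag on its own line and indented by nesting depth.
2) Running the code above can emit a warning: No parser was explicitly specified, so I'm using the best available HTML parser for this system. This happens because no parser was named in the BeautifulSoup() call, so a parser needs to be specified (and installed first, if it is a third-party one such as lxml or html5lib; html.parser ships with Python). As in the figure below: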
3) soup = BeautifulSoup(html_doc, "html.parser")  # naming the parser explicitly removes the warning
print(soup.prettify())
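As a minimal sketch of choosing a parser explicitly: the helper name make_soup and its fallback order are assumptions for illustration, not part of Beautiful Soup. It tries lxml first and falls back to the built-in html.parser when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="title"><b>The Dormouse\'s story</b></p></body></html>'
)


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: use `preferred` if available, else html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html_doc)
pretty = soup.prettify()  # a str: one tag per line, indented by depth
print(pretty)
```

Because a parser is always named, the "No parser was explicitly specified" warning should not appear with either branch.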
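4) soup.title  # gets the title tag: <title>The Dormouse's story</title>
soup.<tag_name>  # gets the tag with that name (only the first one in the document)
soup.find_all('a')  # returns a list of all 'a' tags
soup.find(id="link3")  # returns the first tag whose id is "link3"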
for link in soup.find_all('a'):  # iterate over every 'a' tag
    print(link.get('id'))
    # class is a multi-valued attribute, so get('class') returns a list
    print(link.get('class'))
    # ['sister']
    # ['sister']
    # ['sister']
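soup.get_text()  # gets all the text in the document
soup.find('p', {'class': 'story'})  # gets the first p tag with class="story". Note: class is a Python reserved word, so find('p', class='story') is a syntax error; use the dict form or class_='story'
soup.find('p', {'class': 'story'}).string  # the result is None, because this p tag has more than one child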
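The calls in item 4 can be checked together in one small script. This is a sketch against the same sample document; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a                      # attribute access: first <a> only
links = soup.find_all("a")               # every <a>, in document order
ids = [link.get("id") for link in links]
classes = [link.get("class") for link in links]  # each value is a list
third = soup.find(id="link3")

# Two equivalent ways to filter on the class attribute.
story_by_dict = soup.find("p", {"class": "story"})
story_by_kwarg = soup.find("p", class_="story")

# .string is None when a tag has several children;
# get_text() concatenates all nested text instead.
first_story_string = story_by_dict.string
first_story_text = story_by_dict.get_text()

# The last <p class="story"> has a single text child, so .string works.
last_story_string = soup.find_all("p", class_="story")[-1].string

print(ids, classes, first_story_string, last_story_string)
```

The first story paragraph mixes text and three a tags, which is why .string gives None there while the last paragraph returns its text.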
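5) You can also filter with a regular expression: Beautiful Soup tests each tag name against the compiled pattern using its search() method. The example below finds every tag whose name starts with b, which means both <body> and <b> should be found: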
import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
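A slightly larger sketch of regex filters, using a shortened sample document that is an assumption for illustration. It shows that an unanchored pattern matches anywhere in the tag name, and that a compiled pattern can also filter attribute values such as href.

```python
import re
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>T</title></head><body>"
    '<p class="title"><b>Bold</b></p>'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Anchored pattern: tag names that start with "b".
b_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored pattern: tag names that contain "t" anywhere.
t_tags = [tag.name for tag in soup.find_all(re.compile("t"))]

# A pattern can also be applied to an attribute value.
elsie_ids = [a["id"] for a in soup.find_all("a", href=re.compile("elsie"))]

print(b_tags)     # ['body', 'b']
print(t_tags)     # ['html', 'title']
print(elsie_ids)  # ['link1']
```

Without the ^ anchor the pattern would also pick up names that merely contain the letter, as the "t" case shows with html.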
(5.1) Python regular expressions
(Note: image source: https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html)
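For the filters above, the one detail of Python's re module that matters most is the difference between match() and search(). A minimal sketch with the standard library only, where the list of names is made up for illustration:

```python
import re

names = ["body", "b", "table", "title"]
pattern = re.compile(r"b")

# match() only succeeds if the pattern matches at the start of the string.
starts_with_b = [n for n in names if pattern.match(n)]

# search() succeeds if the pattern matches anywhere in the string.
contains_b = [n for n in names if pattern.search(n)]

# Anchoring with ^ makes search() behave like match() for this pattern.
anchored = [n for n in names if re.search(r"^b", n)]

print(starts_with_b)  # ['body', 'b']
print(contains_b)     # ['body', 'b', 'table']
print(anchored)       # ['body', 'b']
```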