Getting started with Python web scraping -- BeautifulSoup

1. BeautifulSoup documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/html

2. Python

from bs4 import BeautifulSoup 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc)
print(soup.prettify())

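 1) soup.prettify() formats the parsed HTML as an indented string for output.

 2) Running this prints a warning: No parser was explicitly specified, so I'm using the best available HTML parser for this system. This happens because no parser was named in the BeautifulSoup() call, so one has to be specified, as shown below: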

3) soup = BeautifulSoup(html_doc, "html.parser")  # naming the parser explicitly removes the warning

print(soup.prettify())
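As a minimal sketch of step 3, the snippet below parses a shortened version of the sample document with an explicitly named parser. The shortened `html_doc` and the variable name `pretty` are assumptions made for illustration; "html.parser" is the parser that ships with the Python standard library.

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption, not the full html_doc above)
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class=\"title\"><b>The Dormouse's story</b></p></body></html>"
)

# Naming the parser explicitly avoids the "No parser was explicitly specified" warning
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with one tag per line, indented by nesting depth
pretty = soup.prettify()
print(pretty)
```

Other parsers such as lxml can be named in the same position, but they have to be installed separately.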

4) soup.title  # gets the title tag: <title>The Dormouse's story</title>

  soup.<tag name>  # gets the tag with that name (only the first one in the document)

soup.find_all('a')  # returns every 'a' tag; the result behaves like a list
soup.find(id="link3")  # returns the first element whose id is "link3"
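The three lookups above can be compared side by side. This is a small sketch on a trimmed copy of the sample document; the variable names and the `id="nope"` lookup are assumptions added to show what happens when nothing matches.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a              # attribute access: only the first <a>
all_links = soup.find_all("a")   # list-like result holding every <a>
third = soup.find(id="link3")    # first element with id="link3"
missing = soup.find(id="nope")   # hypothetical id; find() returns None on no match

print(first_link["href"])
print(len(all_links))
print(third.get_text())
print(missing)
```

Because find() can return None, checking the result before accessing attributes avoids an AttributeError on pages that lack the element.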

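for link in soup.find_all('a'):  # iterate over every 'a' tag
    print(link.get('id'))

# get('class') returns a list, because class can hold several values
for link in soup.find_all('a'):
    print(link.get('class'))
# ['sister']
# ['sister']
# ['sister']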
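To illustrate why class comes back as a list while id comes back as a string, here is a sketch in which the second link carries two classes. The extra class name "big" and the "title" lookup are assumptions added for the example, not part of the original document.

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister big" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# id is a single-valued attribute, so get() returns a plain string
ids = [link.get("id") for link in soup.find_all("a")]

# class is multi-valued, so get() returns a list split on whitespace
classes = [link.get("class") for link in soup.find_all("a")]

# get() returns None for an attribute the tag does not have
no_title = soup.a.get("title")

print(ids)
print(classes)
print(no_title)
```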

soup.get_text()  # returns all the text in the document, with the tags stripped

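print(soup.find('p', {'class': 'story'}))  # gets the first p tag with class=story and everything inside it. Note: class is a Python keyword, so find('p', class='story') is a syntax error; pass a dict or use class_='story'
print(soup.find('p', {'class': 'story'}).string)  # prints None, because this p tag has more than one child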
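The two ways of filtering by class, and the reason .string is None here, can be checked with a short sketch. The trimmed `html_doc` and the variable names are assumptions; `class_` is the keyword argument bs4 provides for this purpose.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" id="link1">Elsie</a> and
<a class="sister" id="link2">Lacie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Both calls select the same tag: a dict of attributes, or the class_ keyword
by_dict = soup.find("p", {"class": "story"})
by_kwarg = soup.find("p", class_="story")

# .string is None when a tag has several children (text plus <a> tags here)
story_string = by_dict.string

# get_text() joins the text of all descendants instead
story_text = by_dict.get_text()

# A tag whose only child chain ends in one string does have a .string
title_string = soup.find("p", class_="title").string

print(story_string)
print(story_text)
print(title_string)
```

When the text of a tag with mixed content is needed, get_text() is usually the safer choice than .string.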
 

 

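5) A compiled regular expression can also be passed as a filter; Beautiful Soup tests each tag name against it using the pattern's search() method. The example below finds all tags whose names start with b, so both <body> and <b> are found: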

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
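To see what the ^ anchor contributes, the sketch below runs an anchored and an unanchored pattern over a tiny document. The one-line `html_doc` and the list names are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Anchored with ^: only tag names that begin with "b"
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# No anchor: any tag name that contains a "t" somewhere
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```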

 (5.1) Python regular expressions

(Note: image source: https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html)
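As a rough companion to this subsection, here are a few calls from the standard re module that come up often when scraping. The sample strings are assumptions chosen to echo the names and ids used earlier.

```python
import re

# match() only succeeds at the start of the string; search() scans the whole string
match_at_start = re.match("ie", "Elsie")     # None: "Elsie" does not start with "ie"
search_anywhere = re.search("ie", "Elsie")   # match object: "ie" occurs at the end

# findall() returns every non-overlapping match as a list of strings
names = re.findall(r"[A-Z][a-z]+", "Elsie, Lacie and Tillie")

# Parentheses capture a group that can be read back with group(1)
link_number = re.search(r"link(\d+)", "id=link3").group(1)

print(match_at_start)
print(search_anywhere is not None)
print(names)
print(link_number)
```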
