Python Web Scraping Basics: BeautifulSoup

Date: 2019-11-11

Tags: python, web scraping basics, beautifulsoup. Category: Python


1. The Chinese-language BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/html

2. Python example

from bs4 import BeautifulSoup 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc)
print(soup.prettify())

1) soup.prettify() returns the HTML as a formatted, indented string, which makes the document structure easier to read.

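2) Running the code above emits a warning: "No parser was explicitly specified, so I'm using the best available HTML parser for this system". It appears because no parser was named when the BeautifulSoup object was created, so the library picks one itself. The fix is to specify a parser explicitly; html.parser ships with Python, while others such as lxml have to be installed separately. (The original post showed a figure here.)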

3) soup = BeautifulSoup(html_doc, "html.parser")  # naming the parser explicitly removes the warning

print(soup.prettify())
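As a rough check of point 3, the sketch below records any warnings raised while building the soup and filters for the "No parser" message. The shortened html_doc and the variable names are assumptions made for this illustration only.

```python
import warnings

from bs4 import BeautifulSoup

# Shortened markup, assumed here just to keep the sketch small.
html_doc = "<html><head><title>The Dormouse's story</title></head><body><p>...</p></body></html>"

# Record every warning raised while the soup is built.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    soup = BeautifulSoup(html_doc, "html.parser")  # parser named explicitly

# Keep only the "no parser specified" warning, if it was raised at all.
parser_warnings = [
    w for w in caught if "No parser was explicitly specified" in str(w.message)
]

print(len(parser_warnings))
print(soup.title.string)
```

With the parser named, the filtered list should be empty; dropping the second argument is what brings the warning back.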

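4) soup.title  # returns the title tag: <title>The Dormouse's story</title>

soup.<tag name>  # returns the first tag with that name in the document, e.g. soup.a

soup.find_all('a')  # returns every 'a' tag as a list-like ResultSet

soup.find(id="link3")  # returns the first tag whose id is "link3"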
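The four lookups above can be tried together in one small script. This is a minimal sketch that reuses a trimmed copy of the sample markup; the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document used in this post.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title           # attribute access: first <title> tag
first_link = soup.a              # attribute access: first <a> tag only
all_links = soup.find_all("a")   # every <a> tag, in document order
link3 = soup.find(id="link3")    # first tag whose id attribute is "link3"

print(title_tag)         # <title>The Dormouse's story</title>
print(first_link["id"])  # link1
print(len(all_links))    # 3
print(link3["href"])     # http://example.com/tillie
```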

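for link in soup.find_all('a'):  # iterate over every 'a' tag
    print(link.get('id'))
    # class may hold several values, so get('class') returns a list
    print(link.get('class'))
# the class values printed for the three links:
# ['sister']
# ['sister']
# ['sister']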
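To show why get('class') returns a list while get('id') returns a plain string, here is a small sketch. The second class value "featured" on the second link is an assumption added for illustration and is not part of the original sample.

```python
from bs4 import BeautifulSoup

# The extra "featured" class on the second link is hypothetical,
# added only to show a multi-valued class attribute.
html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister featured" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

ids = []
classes = []
for link in soup.find_all("a"):
    ids.append(link.get("id"))         # single-valued attribute: a string
    classes.append(link.get("class"))  # multi-valued attribute: a list

# get() returns None when the attribute is absent, instead of raising.
missing = soup.a.get("title")

print(ids)      # ['link1', 'link2']
print(classes)  # [['sister'], ['sister', 'featured']]
print(missing)  # None
```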

soup.get_text()  # returns all the text in the document, with the tags stripped

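soup.find('p', {'class': 'story'})  # returns the first p tag with class="story". Note: class is a reserved word in Python, so find('p', class='story') is a syntax error; use the dictionary form shown here or the keyword class_='story'.

soup.find('p', {'class': 'story'}).string  # the result is None, because this p tag has more than one child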
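The difference between .string and get_text(), and the two ways of filtering by class, can be seen in a short sketch. It uses a trimmed copy of the sample markup, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Two equivalent ways to filter on the class attribute.
by_dict = soup.find("p", {"class": "story"})
by_kwarg = soup.find("p", class_="story")
same_tag = by_dict == by_kwarg

# .string is None when a tag has several children (text mixed with <a> tags).
story_string = by_dict.string

# .string works when there is a single chain of only children: <p> -> <b> -> text.
title_string = soup.find("p", class_="title").string

# get_text() joins all nested text regardless of how many children there are.
story_text = by_dict.get_text()

print(same_tag)      # True
print(story_string)  # None
print(title_string)  # The Dormouse's story
print("Elsie" in story_text and "Lacie" in story_text)  # True
```

When a tag mixes text and child tags, get_text() is usually the safer choice than .string.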

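5) A regular expression can also be used as a filter: Beautiful Soup matches tag names against it. The example below finds every tag whose name starts with b, so both the <body> and <b> tags are found: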

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
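A self-contained version of the loop above is sketched below, together with a second use of the same idea: a compiled pattern as an attribute filter. The compact html_doc is an assumption made to keep the example short.

```python
import re

from bs4 import BeautifulSoup

# Compact markup assumed for this sketch.
html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<a href='http://example.com/elsie' id='link1'>Elsie</a>"
    "<a href='http://example.com/lacie' id='link2'>Lacie</a>"
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Tag names starting with "b": <body> comes first, then the nested <b>.
b_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]

# The same kind of filter applied to an attribute value instead of a tag name.
elsie_ids = [a["id"] for a in soup.find_all("a", href=re.compile("elsie"))]

print(b_tags)     # ['body', 'b']
print(elsie_ids)  # ['link1']
```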

(5.1) Python regular expressions

(Note: image source: https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html)
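As a small companion to this section, the sketch below shows a few basics of Python's re module that are useful when writing filters like the one in point 5. The tag names and the URL are taken from the sample document; the variable names are illustrative.

```python
import re

# A compiled pattern can be reused; "^b" means "starts with b".
pattern = re.compile(r"^b")
names = ["html", "head", "body", "p", "b", "a"]
starts_with_b = [n for n in names if pattern.match(n)]

# match() only tests at the start of the string; search() scans the whole string.
match_result = re.match(r"o", "body")    # None: "body" does not start with "o"
search_result = re.search(r"o", "body")  # a match object: "o" occurs at index 1

# Parentheses capture a group that can be read back with group(1).
href = "http://example.com/elsie"
found = re.search(r"example\.com/(\w+)", href)
name = found.group(1)

print(starts_with_b)          # ['body', 'b']
print(match_result)           # None
print(search_result.start())  # 1
print(name)                   # elsie
```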
