Python Crawler Series (2): requests and BeautifulSoup

Date: 2019-11-09

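This article introduces the basic usage of two essential tools for writing Python crawlers: the requests and BeautifulSoup libraries.

1. Installing requests and BeautifulSoup

Both can be installed in any of three ways:

easy_install
pip
Downloading the source and installing manually

Only the pip method is covered here: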

pip install requests
pip install BeautifulSoup4

2. Basic requests usage example

# coding:utf-8
import requests

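# Download the content of the Sina News home page
url = 'http://news.sina.com.cn/china/'
# Send a GET request with get() and obtain the response
res = requests.get(url)
# Set the response encoding to utf-8 so that Chinese text is not garbled
# (requests falls back to ISO-8859-1 when the response headers declare no charset)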
res.encoding = 'utf-8'

print type(res)
print res
print res.text

Output:

<class 'requests.models.Response'>
<Response [200]>
<!DOCTYPE html>
<!-- [ published at 2017-04-19 23:30:28 ] -->
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<title>國內新聞_新聞中心_新浪網</title>
<meta name="keywords" content="國內時政,內地新聞">
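
The effect of setting res.encoding can be reproduced without a network call. The sketch below is written for Python 3 and uses a hard-coded byte string as a stand-in for the raw response body (an assumption for illustration, not part of the original example). It decodes the same UTF-8 bytes with both codecs:

```python
# Bytes as a server would send them for a UTF-8 page (stand-in for res.content).
raw_body = "國內新聞".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently yields garbled text.
garbled = raw_body.decode("iso-8859-1")

# Decoding with the codec the page was written in recovers the original text.
correct = raw_body.decode("utf-8")

print(repr(garbled))
print(correct)
```

Because ISO-8859-1 maps every byte to some character, the wrong decoding never raises an error, which is why the encoding has to be set explicitly.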

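Next, write the HTML fetched above to a file. One caveat: Python 2 implicitly uses the ASCII codec when converting between unicode and byte strings, so any character outside the ASCII range raises an exception (ordinal not in range(128)). The default encoding therefore has to be set beforehand: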

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Then save the response HTML to a file:

with open('content.txt','w+') as f:
    f.write(res.text)
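
Under Python 3, sys.setdefaultencoding is no longer available, and the usual approach is to pass the encoding to open() directly. A minimal sketch, assuming Python 3; the helper name save_text and the short sample string used in place of res.text are hypothetical:

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path as UTF-8, regardless of the platform default."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Stand-in for res.text (hypothetical sample, not a real response).
sample_html = "<title>國內新聞</title>"

out_path = os.path.join(tempfile.mkdtemp(), "content.txt")
save_text(out_path, sample_html)

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == sample_html)
```

Naming the encoding at the point of writing keeps the behaviour the same on Windows and Linux, where the default file encodings can differ.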

3. Basic BeautifulSoup usage

1. Define a test HTML document

html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''

2. Create a soup object from the HTML text

from bs4 import BeautifulSoup
# Use html.parser (the parser in Python's standard library) and declare the document encoding as utf-8
soup = BeautifulSoup(html,'html.parser',from_encoding='utf-8')
print type(soup)

# Output: <class 'bs4.BeautifulSoup'>

3. Using soup.select()

(1) Get the content of a given tag

header = soup.select('h1')
print type(header)
print header
print header[0]
print type(header[0])
print header[0].text

# Output:
'''
<type 'list'>
[<h1 id="title">Hello World</h1>]
<h1 id="title">Hello World</h1>
<class 'bs4.element.Tag'>
Hello World
'''
alinks = soup.select('a')
print [x.text for x in alinks]

# Output: [u'This is link1', u'This is link2']

(2) Get the content of the tag with a given id (use '#')

title = soup.select('#title')
print type(title)
print title[0].text

# Output:
'''
<type 'list'>
Hello World
'''

(3) Get the content of tags with a given class (use '.')

alinks = soup.select('.link')
print [x.text for x in alinks]

# Output: [u'This is link1', u'This is link2']

(4) Get the link of an a tag (the value of its href attribute)

print alinks[0]['href']

# Output: #link1

(5) Get the text of all child tags under a tag

body = soup.select('body')[0]
print body.text

# Output:
'''

Hello World
This is link1
This is link2
'''

(6) Select a tag that does not exist

aa = soup.select('aa')
print aa

# Output: []

(7) Get the values of custom attributes

html2 = '<a href="www.test.com" qoo="123" abc="456">This is a link.</a>'
soup2 = BeautifulSoup(html2,'html.parser')
alink = soup2.select('a')[0]
print alink['qoo']
print alink['abc']

# Output:
'''
123
456
'''
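
The select() cases above can be gathered into one short script. The sketch below is written for Python 3 and reuses the test document from step 1. It also calls select_one(), which is part of bs4 but is not used in the listings above, to show the difference when nothing matches:

```python
from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.select("h1")[0].text                      # select by tag name
by_id = soup.select("#title")[0].text                    # select by id
link_texts = [a.text for a in soup.select(".link")]      # select by class
link_hrefs = [a["href"] for a in soup.select("a.link")]  # read an attribute
missing = soup.select("aa")           # no match: an empty, list-like result
first_or_none = soup.select_one("aa") # no match: None rather than a list

print(heading, by_id, link_texts, link_hrefs, missing, first_or_none)
```

Since select() always returns a list-like result, indexing it with [0] fails on an empty match; checking the length first, or using select_one() and testing for None, avoids that.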

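4. Using soup.find() and soup.find_all()

(1) Signatures of find() and find_all():

Both find and find_all look up tag objects in the HTML text by one or more criteria. The difference is in what they return: find returns a bs4.element.Tag, namely the first Tag that satisfies the criteria (or None when nothing matches), whereas find_all returns a bs4.element.ResultSet (in practice a list of Tags). This section focuses on find; find_all works similarly.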

find(name=None, attrs={}, recursive=True, text=None, **kwargs)

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Note: in both functions, the values of name, attrs, and text all support regular-expression matching.
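
To make the difference between the two signatures concrete, here is a small Python 3 sketch. The three-link HTML fragment and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<p><a href="#a">first</a><a href="#b">second</a><a href="#c">third</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                        # a single Tag: the first match
all_links = soup.find_all("a")                # every match, as a list-like ResultSet
limited = soup.find_all("a", limit=2)         # stop searching after two matches
nothing = soup.find("table")                  # None when no tag matches
direct = soup.find_all("a", recursive=False)  # direct children of soup only

print(first.text, len(all_links), [t.text for t in limited], nothing, len(direct))
```

Here direct is empty because the a tags sit inside p rather than directly under the document root, which is what recursive=False restricts the search to.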

(2) Examples of using find

html = '<p><a href="www.test.com" class="mylink1 mylink2">this is my link</a></p>'
soup = BeautifulSoup(html,'html.parser')
a1 = soup.find('a')
print type(a1)
# Output: <class 'bs4.element.Tag'>

print a1.name
print a1['href']
print a1['class']
print a1.text
# Output:
'''
a
www.test.com
[u'mylink1', u'mylink2']
this is my link
'''
# Regex matching on several conditions at once:
import re
a2 = soup.find(name = re.compile(r'\w+'),class_ = re.compile(r'mylink\d+'),text = re.compile(r'^this.+link$'))
# Note: the class attribute is written as 'class_' because class is a reserved word in Python; other attribute names are written normally and need no special handling
print a2

# Output:
'''
<a class="mylink1 mylink2" href="www.test.com">this is my link</a>
'''
# Chained find calls
a3 = soup.find('p').find('a')
print a3

# Output:
'''
<a class="mylink1 mylink2" href="www.test.com">this is my link</a>
'''
# Using the attrs parameter
# Note: attribute values (including custom attributes) can be matched with regular expressions
import re
html = '<div class="myclass" my-attr="123abc"></div><div class="myclass" my-attr="abc"></div>'
soup = BeautifulSoup(html,'html.parser')
div = soup.find('div',attrs = {'class':'myclass','my-attr':re.compile(r'\d+\w+')})
print div

# Output:
'''
<div class="myclass" my-attr="123abc"></div>
'''
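
According to the Beautiful Soup documentation, a compiled pattern passed as a filter is applied with its search() method. The standalone Python 3 sketch below uses only the re module to show which of the two my-attr values a "digits followed by word characters" pattern accepts, and how search() differs from an anchored match. The sample strings in the second half are hypothetical:

```python
import re

# Digits followed by word characters, as intended by the my-attr example.
pattern = re.compile(r"\d+\w+")

values = ["123abc", "abc"]
matches = {v: bool(pattern.search(v)) for v in values}
print(matches)

# search() succeeds on a partial match anywhere in the string...
partial_hit = bool(re.compile(r"mylink\d+").search("xx-mylink1-yy"))
# ...so the pattern must be anchored when the whole value has to match.
anchored_hit = bool(re.compile(r"^mylink\d+$").search("xx-mylink1-yy"))
print(partial_hit, anchored_hit)
```

The backslashes in \d and \w matter: without them the pattern looks for the literal letters d and w, and neither attribute value would match.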

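4. Basic architecture of a web crawler

5. Supplementary notes

1. Accessing through a proxy

Sometimes, to avoid having an IP address blocked, or to reach the internet from inside a corporate network, requests have to be sent through a proxy server. Example:

import requests
proxies = {'http':'http://proxy.test.com:8080','https':'http://proxy.test.com:8080'}  # proxy.test.com is the address of the proxy server
url = 'https://www.baidu.com'  # the URL to visit
resp = requests.get(url,proxies = proxies)

If the proxy server requires a username and password, write proxies like this: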

proxies = {'http':'http://{username}:{password}@proxy.test.com:8080','https':'http://{username}:{password}@proxy.test.com:8080'}
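
When the username or password contains characters such as '@' or ':', they need to be percent-encoded before being placed in the proxy URL, otherwise the URL is parsed incorrectly. A minimal Python 3 sketch; the helper build_proxies and the sample credentials are hypothetical, and only the standard library is used:

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Return a requests-style proxies dict with percent-encoded credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = "http://{}:{}@{}:{}".format(user, pwd, host, port)
    # One proxy URL is used for both plain HTTP and HTTPS traffic.
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical credentials; the password contains '@' and ':' on purpose.
proxies = build_proxies("alice", "p@ss:word", "proxy.test.com", 8080)
print(proxies["http"])
```

The resulting dict can be passed as the proxies argument of requests.get() in the same way as the hand-written one.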

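2. Sending requests to HTTPS URLs

Sending a request to an https URL sometimes fails with: ImportError: no module named certifi.

Workaround: turn off certificate verification for the request with verify = False, as in the example below. Because this disables TLS certificate checking, it is best limited to testing; installing the missing package (pip install certifi) addresses the ImportError itself.

resp = requests.get('https://test.com',verify = False)

Note: the problem can also be solved by passing the relevant authentication parameters in headers.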

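3. httpbin.org

httpbin.org is a website developed by the author of the requests library, intended for testing the library's various features.

However, httpbin.org is hosted overseas and can be slow to reach, so it is worth running a local copy of the site as follows:

Prerequisite: the requests library must already be installed, since the point is to test its features against the site.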

pip install gunicorn httpbin
gunicorn httpbin:app

Then open 127.0.0.1:8000 in a browser to access it.
Note: the steps above fail on Windows with missing-module errors (pwd, fcntl); they work without problems on Linux.

4. Official requests documentation

http://docs.python-requests.org/en/master/

Original link:

https://www.cnblogs.com/jiayongji/p/7118939.html
