Python Web Scraping Basics: Using requests and bs4

Date: 2019-11-06

Tags: python, web scraping basics, requests, bs4
Category: Python

Basic usage of requests

1. GET requests

Without parameters

#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests

url = "http://www.baidu.com"
r = requests.get(url)   # send a plain GET request
print(r.text)           # response body decoded as text

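GET request with parameters

#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests

url = "http://www.baidu.com"
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get(url, params=payload)   # params are encoded into the query string
print(r.url)                            # final URL, including the query string
print(r.text)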

Adding headers

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
headers = {'content-type': 'application/json'}
r = requests.get("http://httpbin.org/get", params=payload, headers=headers)
print(r.url)

For the Baidu example above, print(r.url) gives output like the following (the order of the parameters may differ):

http://www.baidu.com/?key2=value2&key1=value1
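
To see how requests combines params and headers without sending anything over the network, one option is to build the request and inspect its prepared form. This is only a minimal sketch; it reuses the httpbin URL and the values from the example above.

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
headers = {'content-type': 'application/json'}

# Build the request object but do not send it
req = requests.Request('GET', 'http://httpbin.org/get',
                       params=payload, headers=headers)
prepared = req.prepare()

print(prepared.url)                      # URL with the encoded query string
print(prepared.headers['Content-Type'])  # header lookup is case-insensitive
```

Inspecting the prepared request this way is handy for debugging, because it shows what would go on the wire before any server is contacted.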

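2. POST requests

An HTTP request has three parts: the request line, the request headers, and the message body, laid out like this:

Request line

Request headers

Message body

The HTTP protocol requires data submitted with POST to be placed in the message body, but it does not prescribe an encoding. The server usually reads the Content-Type field of the request headers to learn how the body is encoded, and then parses the body accordingly. Common encodings include: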

application/x-www-form-urlencoded
The most common way to POST data; the data is submitted as an HTML form.
application/json
The data is submitted as a JSON string.
multipart/form-data
Typically used to upload files.
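
To make the difference between the first two encodings concrete, the sketch below serializes the same dictionary both ways using only the standard library. The variable names are illustrative and not part of the original article.

```python
import json
from urllib.parse import urlencode

d = {'key1': 'value1', 'key2': 'value2'}

form_body = urlencode(d)    # body for application/x-www-form-urlencoded
json_body = json.dumps(d)   # body for application/json

print(form_body)
print(json_body)
```

The data is the same in both cases; only the byte layout of the message body changes, which is why the server needs the Content-Type header to pick the right parser.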

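Sending a POST request as a form

Requests supports sending POST requests as form data: build the request parameters as a dictionary and pass it to the data parameter of requests.post().

import requests

url = 'http://httpbin.org/post'
d = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=d)   # a dict passed as data is form-encoded
print(r.text)

Output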

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  "headers": {
    ……
    "Content-Type": "application/x-www-form-urlencoded",
    ……
  },
  "json": null,
  ……
}

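Sending a POST request as JSON

A JSON string can be passed to the data parameter of requests.post():

import json
import requests

url = 'http://httpbin.org/post'
s = json.dumps({'key1': 'value1', 'key2': 'value2'})
# A string passed as data is sent as-is, so the Content-Type is set explicitly
r = requests.post(url, data=s, headers={'Content-Type': 'application/json'})
print(r.text)

Output

{
  "args": {},
  "data": "{\"key1\": \"value1\", \"key2\": \"value2\"}",
  "files": {},
  "form": {},
  "headers": {
    ……
    "Content-Type": "application/json",
    ……
  },
  "json": {
    "key1": "value1",
    "key2": "value2"
  },
  ……
}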

With the method above, we can POST data in JSON format.
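
requests also accepts a json parameter that serializes the dictionary and sets the Content-Type header itself. The sketch below only prepares the request, without sending it, to show the result; the httpbin URL is reused from the examples above.

```python
import json
import requests

d = {'key1': 'value1', 'key2': 'value2'}

# Prepare (but do not send) a POST request that uses the json parameter
prepared = requests.Request('POST', 'http://httpbin.org/post', json=d).prepare()

print(prepared.headers['Content-Type'])  # set automatically by requests
print(json.loads(prepared.body))         # body decodes back to the original dict
```

In everyday code this corresponds to calling requests.post(url, json=d), which avoids the manual json.dumps() step.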

To upload a file, pass it through the files parameter.

Create a file named report.txt containing Hello World!

Sending a POST request as multipart

Requests also supports sending POST requests as multipart: just pass a file to the files parameter of requests.post().

import requests

url = 'http://httpbin.org/post'
# Open in binary mode; the with block closes the file after the upload
with open('report.txt', 'rb') as f:
    r = requests.post(url, files={'file': f})
print(r.text)

Output


{
  "args": {},
  "data": "",
  "files": {
    "file": "Hello World!"
  },
  "form": {},
  "headers": {
    ……
    "Content-Type": "multipart/form-data; boundary=467e443f4c3d403c8559e2ebd009bf4a",
    ……
  },
  "json": null,
  ……
}
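
To look at what a multipart body contains without touching the network or the disk, the following sketch prepares the request with an in-memory file. The io.BytesIO object is a stand-in for report.txt and is an assumption made only to keep the example self-contained.

```python
import io
import requests

# In-memory stand-in for report.txt (illustrative, not from the original article)
fake_file = io.BytesIO(b'Hello World!')

prepared = requests.Request(
    'POST', 'http://httpbin.org/post',
    files={'file': ('report.txt', fake_file)},
).prepare()

content_type = prepared.headers['Content-Type']
print(content_type.split(';')[0])        # media type without the boundary
print(b'Hello World!' in prepared.body)  # the file content is embedded in the body
```

The boundary string in the Content-Type header is generated per request, which is why the value in the output above will differ from run to run.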

Basic usage of BeautifulSoup

Define a test HTML string and build a soup object from it

html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
from bs4 import BeautifulSoup
# Use html.parser, the parser bundled with Python. from_encoding only applies
# to bytes input; html is already a str here, so it is omitted.
soup = BeautifulSoup(html, 'html.parser')
print(type(soup))
print(soup)

# Output of print(type(soup)):
<class 'bs4.BeautifulSoup'>

# Output of print(soup):
<html>
<body>
<h1 id="title">Hello World</h1>
<a class="link" href="#link1">This is link1</a>
<a class="link" href="#link2">This is link2</a>
</body>
</html>

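1. Using soup.select()

Get the content of a given tag

from bs4 import BeautifulSoup as bs
soup = bs(html, 'html.parser')
header = soup.select('h1')     # list-like collection of matching tags
print(type(header))            # the collection type
print(header)                  # the collection, holding one <h1> tag
print(header[0])               # the first element, an HTML tag
print(type(header[0]))         # its class, bs4.element.Tag
print(header[0].text)          # the text inside the tag

# Output (on older bs4 releases the first line reads <class 'list'>)
'''
<class 'bs4.element.ResultSet'>
[<h1 id="title">Hello World</h1>]
<h1 id="title">Hello World</h1>
<class 'bs4.element.Tag'>
Hello World
'''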

html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
from bs4 import BeautifulSoup as bs
soup = bs(html, 'html.parser')
a_links = soup.select('a')          # all <a> tags
l = [x.text for x in a_links]       # text of each tag
print(l)
print(a_links)
print(type(a_links))
print(a_links[0])
print(type(a_links[0]))
print(a_links[0].text)

Output (on older bs4 releases the type line reads <class 'list'>):

['This is link1', 'This is link2']
[<a class="link" href="#link1">This is link1</a>, <a class="link" href="#link2">This is link2</a>]
<class 'bs4.element.ResultSet'>
<a class="link" href="#link1">This is link1</a>
<class 'bs4.element.Tag'>
This is link1

2. Get the content of the tag with a given id (use '#')

html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
from bs4 import BeautifulSoup as bs
soup = bs(html, 'html.parser')
title = soup.select('#title')       # '#' selects by id
print(title)
print(type(title))
print(title[0])
print(type(title[0]))
print(title[0].text)

Output (on older bs4 releases the type line reads <class 'list'>):

[<h1 id="title">Hello World</h1>]
<class 'bs4.element.ResultSet'>
<h1 id="title">Hello World</h1>
<class 'bs4.element.Tag'>
Hello World

3. Get the content of tags with a given class (use '.')

from bs4 import BeautifulSoup as bs
html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
soup = bs(html, 'html.parser')
h = soup.select('a.link')           # <a> tags whose class is "link"
print(h)
print([x.text for x in h])
for i in [x.text for x in h]:
    print(i)

Output:

[<a class="link" href="#link1">This is link1</a>, <a class="link" href="#link2">This is link2</a>]
['This is link1', 'This is link2']
This is link1
This is link2


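I. Review
1. The earlier notes covered three ways of extracting tags.

All of them use the select() function, which can locate tags in three ways: by tag name ('tag'), by id attribute ('#id'), or by class attribute ('.class').

2. Those notes followed the structure of the HTML page:

(1) select('tag') gets all tags with that name
(2) '#' gets the content of the tag with the specified id
(3) '.' gets the content of tags with the specified class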
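
The three selector forms reviewed above can be compared side by side. This is a compact sketch that reuses the test HTML from the earlier sections; the variable names by_tag, by_id, and by_class are illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

by_tag = [t.text for t in soup.select('a')]        # by tag name
by_id = [t.text for t in soup.select('#title')]    # by id
by_class = [t.text for t in soup.select('.link')]  # by class

print(by_tag)
print(by_id)
print(by_class)
```

In this document the tag and class selectors happen to match the same two links, while the id selector matches only the heading.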
II. The remaining techniques
1. Get the link of an a tag (the value of its href attribute)


2. Get the text of all child tags under a tag

Code example:

from bs4 import BeautifulSoup as bs
html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
soup = bs(html, 'html.parser')
alinks = soup.select('a')
a = [x.text for x in alinks]    # text of every <a> tag
print(a)
for i in a:
    print(i)
print(alinks[0]['href'])        # attribute access works like a dict lookup

Output:

['This is link1', 'This is link2']
This is link1
This is link2
#link1
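
Building on the example above, the href of every link can be collected in one pass. The sketch also shows Tag.get(), which returns None for a missing attribute instead of raising KeyError; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

# href value of every <a> tag
hrefs = [a['href'] for a in soup.select('a')]
print(hrefs)

# <h1> has no href attribute; get() returns None instead of raising KeyError
h1_href = soup.select('h1')[0].get('href')
print(h1_href)
```

Using get() is usually the safer choice when scraping real pages, where not every tag is guaranteed to carry the attribute.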

from bs4 import BeautifulSoup as bs
html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
soup = bs(html, 'html.parser')

# Select by tag name
a = soup.select('h1')
b = [x.text for x in a]
print(b)

# Select by id
a = soup.select('#title')
b = [x.text for x in a]
print(b)

# Select by class, then read the href attribute and the text
h_id = soup.select('.link')
print(h_id[0]['href'])
print([x.text for x in h_id])

4. Get the text of all child tags under a tag

from bs4 import BeautifulSoup as bs
html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
soup = bs(html, 'html.parser')
h = soup.select('body')[0]      # the <body> tag
print(type(h))
print(h)
print(h.text)                   # text of all descendants, joined together


# Output
<class 'bs4.element.Tag'>

<body>
<h1 id="title">Hello World</h1>
<a class="link" href="#link1">This is link1</a>
<a class="link" href="#link2">This is link2</a>
</body>

Hello World
This is link1
This is link2
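
The .text output above keeps the blank lines and newlines between tags. When clean strings are wanted, stripped_strings or get_text() with a separator can be used instead, as in this small sketch (same test HTML, illustrative variable names).

```python
from bs4 import BeautifulSoup

html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
body = soup.select('body')[0]

# One stripped string per text node, whitespace-only nodes skipped
texts = list(body.stripped_strings)
print(texts)

# The same strings joined with a custom separator
joined = body.get_text(' | ', strip=True)
print(joined)
```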

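5. Using soup.find() and soup.find_all()

Signatures of find() and find_all()

Both find and find_all look up tag objects in the HTML text according to one or more conditions. find returns a bs4.element.Tag, the first Tag that satisfies the conditions (or None when nothing matches).
find_all returns a bs4.element.ResultSet, which is effectively a list of Tags.

find(name=None, attrs={}, recursive=True, text=None, **kwargs)
# name, attrs, and text all accept regular expressions.
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
# name, attrs, and text all accept regular expressions.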
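
The difference in return types is easiest to see on a small document. This sketch reuses the test HTML from the select() sections; the lookup of a table tag is an illustrative example of a tag that is not present.

```python
from bs4 import BeautifulSoup

html = '''
<html>
    <body>
        <h1 id="title">Hello World</h1>
        <a href="#link1" class="link">This is link1</a>
        <a href="#link2" class="link">This is link2</a>
    </body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')          # a single Tag: the first match
all_links = soup.find_all('a')  # a ResultSet holding every match
missing = soup.find('table')    # None, because nothing matches
none_found = soup.find_all('table')  # an empty ResultSet

print(first.text)
print(len(all_links))
print(missing)
print(none_found)
```

Because find() can return None, code that chains attribute access onto its result should check for None first.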

In short form:

find_all( name , attrs , recursive , text , **kwargs )

    find( name , attrs , recursive , text , **kwargs )




name parameter
The name parameter finds all tags whose name is name; string objects are skipped automatically.

A simple example:

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]





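Keyword parameters
If a named argument is not one of the built-in search parameter names, it is treated as a tag attribute to search on. For example, if an argument named id is passed, Beautiful Soup filters on each tag's "id" attribute.


    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    If an href argument is passed, Beautiful Soup filters on each tag's "href" attribute.

    The values accepted when searching on a named attribute are: a string, a regular expression, a list, or True.

    Here is some code:
    import re
    from bs4 import BeautifulSoup as bs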
    html = '''<table border=16 width='66%' align='center'>
                    <thead align='center'>
                            <caption>魚C信息</caption>
                            <tr>
                                    <td colspan="3">魚C信息表</td>
                            </tr>
                            <tr>
                                    <th id='th1'>姓名</th>
                                    <th id='th2'>年齡</th>
                                    <th id='th3'>顏值</th>
                            </tr>
                    </thead>
                    <tbody align='center'>
                            <tr>
                                    <td>不二如是：</td>
                                    <td>18</td>
                                    <td>下一位更帥~</td>
                            </tr>
                            <tr>
                                    <td>小甲魚老溼：</td>
                                    <td>28</td>
                                    <td>下一位更帥~</td>
                            </tr>
                            <tr>
                                    <td>MSK：</td>
                                    <td>16</td>
                                    <td>第一位最帥~</td>
                            </tr>
                            <tr>
                                    <td colspan='3'>村裏有個姑娘叫小花~</td>
                            </tr>
                    </tbody>       
            </table>'''
    soup = bs(html,'html.parser')

Note: in this code, only the <th> tags have an id.

When name is a string (a), all Tags whose name is a are returned:

    temp = soup.find_all('tr')
    temp
    #[<tr>
    <td colspan="3">魚C信息表</td>
    </tr>, <tr>
    <th id="th1">姓名</th>
    <th id="th2">年齡</th>
    <th id="th3">顏值</th>
    </tr>, <tr>
    <td>不二如是：</td>
    <td>18</td>
    <td>下一位更帥~</td>
    </tr>, <tr>
    <td>小甲魚老溼：</td>
    <td>28</td>
    <td>下一位更帥~</td>
    </tr>, <tr>
    <td>MSK：</td>
    <td>16</td>
    <td>第一位最帥~</td>
    </tr>, <tr>
    <td colspan="3">村裏有個姑娘叫小花~</td>
    </tr>]

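When a regular expression such as re.compile('a') is passed as name, all Tags whose name contains 'a' are found. A regular expression also works as an attribute value:

    soup.find_all(href=re.compile("elsie"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
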


When a list is passed, all Tags matching any element of the list are found:

    soup.find_all(['th','td'])
    [<td colspan="3">魚C信息表</td>, <th id="th1">姓名</th>, <th id="th2">年齡</th>, <th id="th3">顏值</th>, <td>不二如是：</td>, <td>18</td>, <td>下一位更帥~</td>, <td>小甲魚老溼：</td>, <td>28</td>, <td>下一位更帥~</td>, <td>MSK：</td>, <td>16</td>, <td>第一位最帥~</td>, <td colspan="3">村裏有個姑娘叫小花~</td>]

When True is passed, the result speaks for itself:

    soup.find_all(id=True)
    [<th id="th1">姓名</th>, <th id="th2">年齡</th>, <th id="th3">顏值</th>]

All Tags that have an id attribute are returned.
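
Two attribute cases need slightly different spelling: class is a reserved word in Python, so Beautiful Soup uses class_, and attribute names that are not valid Python identifiers (such as data-* attributes) go through the attrs dictionary. The small table below is a made-up example, not the one from the article.

```python
import re
from bs4 import BeautifulSoup

html = '''
<table>
  <tr><th id="th1" class="head">Name</th><th id="th2" class="head">Age</th></tr>
  <tr><td data-kind="name">Alice</td><td data-kind="age">18</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# class is a Python keyword, so the keyword argument is spelled class_
heads = soup.find_all(class_='head')

# attrs takes a dict, which also covers names such as data-kind
names = soup.find_all(attrs={'data-kind': 'name'})

# a regular expression as the attribute value: ids ending in 2
th2 = soup.find_all(id=re.compile(r'2$'))

print([t.text for t in heads])
print(names[0].text)
print([t['id'] for t in th2])
```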


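text parameter
The text parameter searches the string content of the document. Like name, it accepts a string, a regular expression, a list, or True. (Since Beautiful Soup 4.4.0 this parameter is also available under the name string.)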

    soup.find_all(text='下一位更帥~')
    #['下一位更帥~', '下一位更帥~']
    soup.find_all(text=re.compile('帥'))
    #['下一位更帥~', '下一位更帥~', '第一位最帥~']
    soup.find_all(text=True)
    #['\n', '\n', '魚C信息', '\n', '\n', '魚C信息表', '\n', '\n', '\n', '姓名', '\n', '年齡', '\n', '顏值', '\n', '\n', '\n', '\n', '\n', '不二如是：', '\n', '18', '\n', '下一位更帥~', '\n', '\n', '\n', '小甲魚老溼：', '\n', '28', '\n', '下一位更帥~', '\n', '\n', '\n', 'MSK：', '\n', '16', '\n', '第一位最帥~', '\n', '\n', '\n', '村裏有個姑娘叫小花~', '\n', '\n', '\n']

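limit parameter
Caps the number of results returned.

recursive parameter
When True (the default), all descendants are searched; when False, only direct children are searched.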
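
A short sketch of both parameters, using a tiny made-up document in which one p tag is nested one level deeper than the others:

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p>a</p><span><p>b</p></span><p>c</p></div>'
soup = BeautifulSoup(html, 'html.parser')
box = soup.find(id='box')

# limit stops the search after the first two matches (in document order)
first_two = box.find_all('p', limit=2)

# recursive=False looks only at the direct children of box
direct = box.find_all('p', recursive=False)

print([p.text for p in first_two])
print([p.text for p in direct])
```

The nested p inside the span is counted by the limited search but skipped when recursive is False.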

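bs4 supplement, reposted from https://www.jianshu.com/p/a2d68ae3d02d

soup = BeautifulSoup(open("index.html"), "lxml")
Use requests to fetch a web page from the server

wb_data = requests.get("http://www.baidu.com")    # the complete HTTP response

Use BeautifulSoup to parse the page

soup = BeautifulSoup(wb_data.text, 'lxml')   # `.text` extracts the HTTP body, i.e. the HTML document

Searching the document tree

Describe where the elements to scrape are located, then get a list of elements/tags

Filter types

String
Regular expression (re)
List

If a list is passed, Beautiful Soup returns everything that matches any element of the list. The following code finds all <a> tags and <b> tags in the document: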

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

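CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

xx = soup.select() takes a path describing where the elements are located and returns a list of tags

Find tags by tag name: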

soup.select("title")
# [<title>The Dormouse's story</title>]

Search level by level through tags, matching descendants:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find the direct children of a tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by CSS class name:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by tag id:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find by whether an attribute exists:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
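
When only the first match is needed, select_one() returns a single Tag (or None) instead of a list. The sketch below uses a cut-down version of the three "sister" links seen in the outputs above; the surrounding p tag and its story class are assumptions made so the example is self-contained.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

lacie = soup.select_one('a#link2')    # first match as a Tag
missing = soup.select_one('a#link9')  # None when nothing matches

# Selectors can be combined: direct children of p.story whose href ends in "ie"
ends = [a.text for a in soup.select('p.story > a[href$="ie"]')]

print(lacie.text)
print(missing)
print(ends)
```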