Installing Dependencies and Parsing Pages

Posted: 2019-11-13


Date: 2019-06-19

Author: Sun

Libraries covered in this section:

Networking library: requests

Page parsing library: Beautiful Soup

1 The Requests library

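Although the urllib module in Python's standard library already covers most of the functionality we need day to day, its API is unpleasant to use. Requests bills itself as "HTTP for Humans", signaling that it is simpler and more convenient.

Requests is an HTTP library written in Python, built on urllib3, and released under the Apache 2.0 license. It is more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs. Its philosophy is centered on the idioms of PEP 20, which makes it more Pythonic than urllib. More importantly, it supports Python 3.

Requests describes itself as the only Non-GMO HTTP library for Python, safe for human consumption :)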

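Requests offers everything urllib does. It supports HTTP keep-alive and connection pooling, cookie-based sessions, file uploads, automatic detection of the response content encoding, internationalized URLs, and automatic encoding of POST data.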
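
As a rough illustration of cookie-based sessions, the sketch below uses requests.Session to carry a default header and a cookie into a request. It only prepares the request and sends nothing over the network. The cookie name sessionid, its value, and the User-Agent string are made up for the example.

```python
import requests

# A Session keeps cookies and default headers between requests,
# and reuses pooled connections when requests are actually sent.
session = requests.Session()
session.headers.update({'User-Agent': 'my-app/0.0.1'})  # hypothetical value
session.cookies.set('sessionid', 'abc123')               # hypothetical cookie

# prepare_request() merges the session state into the request without sending it.
request = requests.Request('GET', 'http://httpbin.org/cookies')
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```

In real use you would call session.get(...) or session.post(...) instead; cookies set by the server are then stored on the session and sent back on later requests.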

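Under the hood, requests is built on urllib3.

The Requests documentation is very thorough, and the Chinese translation is quite good too. Requests fully meets today's networking needs and, at the time of writing, supported Python 2.6 through 3.6.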

1.1 Installing Requests

pip install requests

Official Requests documentation:

http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

A site for testing HTTP requests:

http://httpbin.org/

1.2 Basic usage

import requests

response = requests.get('http://www.baidu.com')
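print(response.request.url)  # same as response.url unless the request was redirected
print(response.status_code)
# response.headers holds the response headers; the request headers are in response.request.headers
print(response.headers['content-type'])    # header names are case-insensitive
print(response.encoding)
print(response.text)       # body as text, usually decoded automatically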

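1.3 Request methods

Unlike urllib, Requests does not require you to build Request, opener, and handler objects. You call the method Requests provides and pass in the parameters you need.
Each HTTP method has a corresponding API; a GET request, for example, uses the get() method.

A POST request uses the post() method, with the data to submit passed via the data parameter.

To set a timeout, pass the timeout parameter: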

requests.get('http://github.com', timeout=0.01)
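
With a timeout as short as 0.01 seconds the call above will most likely raise an exception rather than return a response. One possible way to handle that is sketched below. The helper name fetch_text and its get parameter are not part of Requests; get exists only so the function can be exercised without a network.

```python
import requests


def fetch_text(url, timeout=3.0, get=requests.get):
    """Return the response body as text, or None if the request fails.

    `get` is a hypothetical injection point used for offline testing.
    """
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.text
    except requests.exceptions.Timeout:
        print(f'timed out after {timeout}s: {url}')
    except requests.exceptions.RequestException as exc:
        print(f'request failed: {exc}')
    return None
```

Calling fetch_text('http://github.com', timeout=0.01) would then print a message and return None instead of crashing the script.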

Concrete examples:

import requests
response = requests.get('https://httpbin.org/get')        # pull data
response = requests.post('http://httpbin.org/post',data={'key': 'value'})   # push data

# - Four ways to send a POST request body:
#   - application/x-www-form-urlencoded
#   - multipart/form-data
#   - raw
#   - binary

response = requests.put('http://httpbin.org/put',data={'key':'value'})
response = requests.delete('http://httpbin.org/delete')
response = requests.head('http://httpbin.org/get')
response = requests.options('http://httpbin.org/get')
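
To make the body formats listed in the comments above more concrete, the sketch below prepares three POST requests without sending them and inspects the Content-Type header that Requests chooses. The file name hello.txt and its contents are placeholders.

```python
import json

import requests

url = 'http://httpbin.org/post'

# data=dict is sent as application/x-www-form-urlencoded
form = requests.Request('POST', url, data={'key': 'value'}).prepare()
print(form.headers['Content-Type'], form.body)

# json=dict is serialized to JSON and sent as application/json
as_json = requests.Request('POST', url, json={'key': 'value'}).prepare()
print(as_json.headers['Content-Type'], json.loads(as_json.body))

# files=... builds a multipart/form-data body with a generated boundary
multipart = requests.Request(
    'POST', url, files={'file': ('hello.txt', b'hello')}
).prepare()
print(multipart.headers['Content-Type'].split(';')[0])
```

Raw and binary bodies can be sent by passing a str or bytes object to data; in that case Requests does not set a Content-Type for you.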

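1.4 Passing URL parameters

(1) There is no need to concatenate the URL by hand as with urllib. Build a dict and pass it to the params parameter when making the request.

(2) Sometimes the same URL parameter name appears with several values. Python dicts cannot hold duplicate keys, so give the key a list of values instead.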

# Build a dict and pass it via params instead of concatenating the URL by hand
import requests
params = {'key1':'value1','key2':'value2'}
response = requests.get('http://httpbin.org/get',params=params)
# For a repeated parameter name, give the key a list of values
params = {'key1':'value1','key2':['value2','value3']}
response = requests.get('http://httpbin.org/get',params=params)
print(response.url)
print(response.content)
# response.url is http://httpbin.org/get?key1=value1&key2=value2&key2=value3
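
The URL that Requests builds from params can also be checked without touching the network. This small sketch prepares the request and prints the final URL; it uses the same dict as the example above.

```python
import requests

params = {'key1': 'value1', 'key2': ['value2', 'value3']}

# prepare() encodes the parameters into the URL but does not send anything
prepared = requests.Request('GET', 'http://httpbin.org/get', params=params).prepare()
print(prepared.url)
# http://httpbin.org/get?key1=value1&key2=value2&key2=value3
```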

1.5 Custom headers
To customize the request headers, likewise pass a dict to the headers parameter:
import requests
url = 'http://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}  # custom headers
response = requests.get(url,headers=headers)

print(response.headers)
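
Note that response.headers above holds the headers the server sent back, not the ones we supplied. To confirm which headers a request would carry, one option is to prepare it and read them back, as in this offline sketch:

```python
import requests

url = 'http://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}

prepared = requests.Request('GET', url, headers=headers).prepare()

# Header lookup is case-insensitive, whatever casing was used to set it
print(prepared.headers['User-Agent'])
print(prepared.headers['USER-AGENT'])
```

After a real request, the same information is available as response.request.headers.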

1.6 Custom cookies

Custom cookies in Requests do not require constructing a CookieJar object either; pass a dict directly to the cookies parameter:

url = 'http://httpbin.org/cookies'
co = {'cookies_are': 'working'}
response = requests.get(url,cookies=co)
print(response.text)   # {"cookies": {"cookies_are": "working"}}

1.7 Setting proxies

# When a proxy is needed, build a proxies dict and pass it to the proxies parameter
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'https://10.10.1.10:1080'
}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.text)

2 Requests usage examples

Example 1: Baidu search with requests

# -*- coding: utf-8 -*-
__author__ = 'sun'
__date__ = '2019/6/19 14:47'
import requests

def getfromBaidu(key):
    # url = 'http://www.baidu.com.cn/s?wd=' + urllib.parse.quote(key) + '&pn='  # wd is the keyword, pn is the paging offset
    kv = {'wd': key}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)
    with open("baidu.html", "w", encoding='utf8') as f:
        f.write(r.text)

key = 'python'
getfromBaidu(key)

Example 2: Using the get and post methods

# -*- coding: utf-8 -*-
__author__ = 'sun'
__date__ = '2019/6/19 9:32 PM'

import requests

r = requests.get(url='http://www.sina.com')  # the most basic GET request
print(r.status_code)  # status code of the response
r = requests.get(url='http://dict.baidu.com/s', params={'wd': 'python'})  # GET request with parameters
print(r.url)
print(r.text)  # print the decoded response body

print("#####################")
payload = (('key1', 'value1'), ('key1', 'value2'))
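# Sent form-encoded; a sequence of tuples allows the same key to appear twice
r = requests.post('http://httpbin.org/post', data=payload)

print("code: " + str(r.status_code) + ", text:" + r.text)

url = 'http://httpbin.org/post'
# report.xls must exist in the working directory; the with-block closes it after the upload
with open('report.xls', 'rb') as fh:
    files = {'file': ('report.xls', fh, 'application/vnd.ms-excel', {'Expires': '0'})}
    r = requests.post(url, files=files)
print(r.text)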

3 BeautifulSoup

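Introduction

Beautiful Soup is a Python library whose main purpose is extracting data from web pages. The official description reads:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it automatically. In that case, you only need to state the original encoding.

Beautiful Soup has become a Python parsing tool on a par with lxml and html5lib, letting users choose between different parsing strategies or favor raw speed.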

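Installation

Beautiful Soup 3 is no longer developed, and Beautiful Soup 4 is recommended for current projects. The library now ships as the bs4 package, so it is imported from bs4.

Activate your Python virtual environment and install lxml and bs4: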

pip install lxml

pip install bs4

Usage

First, import the bs4 library:

from bs4 import BeautifulSoup

Beautiful Soup transforms a complex HTML document into a tree structure. Every node is a Python object, and all objects fall into four kinds:

1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment
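
A minimal sketch of where each of the four kinds shows up. The HTML fragment is invented for this illustration, and it uses Python's built-in 'html.parser' so that it runs even without lxml installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup('<p class="title"><b>Hello</b><!-- a note --></p>', 'html.parser')

tag = soup.p                  # Tag: an element such as <p> or <b>
text = soup.b.string          # NavigableString: the text inside a tag
comment = soup.p.contents[1]  # Comment: a special kind of NavigableString

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```

Because Comment is a subclass of NavigableString, code that collects text with .string may want to check isinstance(value, Comment) to skip comments.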

Syntax: see the attached file "Beautiful Soup 4.2.0 Documentation - Beautiful Soup.pdf".

Worked example

Assume the following string:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title aq">
    <b>
        The Dormouse's story
    </b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

Create the soup object:

soup = BeautifulSoup(html_doc, 'lxml')

(1) Tag

Put simply, a Tag is one of the tags in the HTML, for example:

<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

The HTML tags above, such as title and a, together with the content they enclose, are Tags. Let's see how easily Beautiful Soup retrieves them:

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# the <head> element, including the <title> inside it

print(soup.a)
# the first <a> tag in the document (the "Elsie" link)

print(soup.p)
# the first <p> tag (class "title aq") together with its <b> child

print(type(soup.a))
# <class 'bs4.element.Tag'>

A Tag has two important attributes: name and attrs.

print(soup.name)
print(soup.head.name)
# [document]
# head

print(soup.p.attrs)
# {'class': ['title', 'aq']}

print(soup.p['class'])
# ['title', 'aq']

print(soup.p.get('class'))   # equivalent to the line above
# ['title', 'aq']

These attributes and contents can be modified, for example:

soup.p['class'] = "newClass"
print(soup.p)
# the first <p> tag, now rendered with class="newClass"

Slightly more involved operations

import re

# Get all the text content
print(soup.get_text())

# Print all attributes of the first a tag
print(soup.a.attrs)

for link in soup.find_all('a'):
    # Get the href attribute of each link
    print(link.get('href'))

# Loop over the child nodes of soup.p
for child in soup.p.children:
    print(child)

# Regex match: tags whose name contains "b"
for tag in soup.find_all(re.compile("b")):
    print(tag.name)

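(2) NavigableString

Now that we have the tag, how do we get the text inside it? Simple: use .string, for example:

print(soup.title.string)
# The Dormouse's story
# Note: for the html_doc above, soup.p.string is None, because that <p> has several children (whitespace plus <b>)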
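
The behavior of .string depends on how many children a tag has, which is easy to trip over with indented HTML. The sketch below contrasts the two cases on small fragments invented for this purpose, and shows get_text() as one alternative; it uses the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

compact = BeautifulSoup('<p><b>Hi</b></p>', 'html.parser')
spaced = BeautifulSoup('<p>\n  <b>Hi</b>\n</p>', 'html.parser')

# A single chain of only children: .string reaches through <b> to the text
print(compact.p.string)               # Hi

# Whitespace around <b> gives <p> three children, so .string is None
print(spaced.p.string)                # None

# get_text() joins all descendant strings; strip=True trims the whitespace
print(spaced.p.get_text(strip=True))  # Hi
```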

Example 2:

Create a file named test.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Hello</title>
</head>
<body>
   <div class="aaa" id="xxx">
       <p>Hello <span>world</span></p>
   </div>
   <div class="bbb" s="sss">bbbb1</div>
   <div class="ccc">ccc</div>
   <div class="ddd">dddd</div>
   <div class="eeee">eeeee</div>
</body>
</html>

The Python test script is as follows:

from bs4 import BeautifulSoup
import re
# 1. Create the BeautifulSoup object
with open("test.html", encoding="utf8") as f:
    html_doc = f.read()

soup = BeautifulSoup(html_doc, 'lxml')
# 2. Find elements by tag name
print(f"2.:{soup.title}")
print(f"2.:{soup.title.string}")
# 3. Get text with get_text()
print(f"3.get_text():{soup.div.get_text()}")
# 4. Read attributes
print("4.", soup.div['class'])
print("4.get", soup.div.get("class"))
print("4.attrs:", soup.div.attrs)
# 5. find_all(self, name=None, attrs={}, recursive=True, text=None,
#                 limit=None, **kwargs):
# 1) Returns every Tag that matches the filters
# 2) Filters may be a single condition or several
# 3) Filters support regular expressions
# 4) Parameters
# - name: tag name; defaults to None
# - attrs: a dict that can hold several tag attributes
# - recursive: whether to search descendants recursively; defaults to True
# - text: match on the text inside the tag; supports regular expressions; defaults to None
# - limit: maximum number of results; None (the default) means no limit,
#   so limit=2 returns only the first two matches
# - kwargs: keyword filters for specific attributes, e.g. id='', class_=''

divs = soup.find_all("div")
for div in divs:
    print("type(div)", type(div))
    print(div.get_text())
print(soup.find_all(name='div', class_='bbb'))
print("==", soup.find_all(limit=1, attrs={"class": re.compile('^b')}))
print(soup.find_all(text="bbbb1"))
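print(soup.find_all(id="xxx"))
# 6. find() behaves like find_all() with limit=1, but returns the single Tag (or None) instead of a list
# 7. A tag can be called like find_all (calling a tag object is a shortcut for its find_all)
soup.find_all(name='div', class_='bbb')
soup.div(class_='bbb')  # searches inside the first <div>; same as soup.div.find_all(class_='bbb')
# Note: Tag and BeautifulSoup objects are treated the same way here.
# 8. Finding the children of the current Tag
# 1) Navigate step by step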
div_tag = soup.div
print(type(soup))
print(type(div_tag))
print(div_tag.p)
# 2) Use contents to get a tag's child nodes as a list
print("8.2):", soup.div.contents)
# 9. children returns an iterator over the same child nodes
body_children = soup.body.children
for child in body_children:
    print("9. ", child)
# 10. Parent node
tag_p = soup.p
print("10.", tag_p.parent)

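# 11. Sibling nodes: find_next_siblings
# Finds all siblings that come after the current tag
div_ccc = soup.find(name='div',class_='ccc')
print("11.", div_ccc)
print("11.", div_ccc.find_next_siblings(name='div'))
# 12. Sibling nodes: find_previous_siblings
print("12.", div_ccc.find_previous_siblings(name='div'))

# find_previous_sibling (singular) returns only the nearest match
print("12.", div_ccc.find_previous_sibling(name='div'))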
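
For readers who want to check these search methods without creating test.html, here is a condensed sketch with the results spelled out as assertable values. The HTML string is a shortened copy of the div elements from test.html, and it is parsed with the built-in 'html.parser' instead of lxml.

```python
import re

from bs4 import BeautifulSoup

html = '''
<div class="aaa" id="xxx"><p>Hello <span>world</span></p></div>
<div class="bbb">bbbb1</div>
<div class="ccc">ccc</div>
<div class="ddd">dddd</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match or None; find_all() always returns a list
first_div = soup.find('div')
all_divs = soup.find_all('div')
two_divs = soup.find_all('div', limit=2)
missing = soup.find('table')

# A compiled regex is matched against the values of the class attribute
b_or_c = soup.find_all('div', class_=re.compile('^[bc]'))

# Sibling searches start from a tag and stay at the same nesting level;
# previous siblings are returned nearest first
ccc = soup.find('div', class_='ccc')
after = [d['class'][0] for d in ccc.find_next_siblings('div')]
before = [d['class'][0] for d in ccc.find_previous_siblings('div')]

print(first_div['id'], len(all_divs), len(two_divs), missing)
print([d.get_text() for d in b_or_c], after, before)
```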

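Homework:
Use the requests library to crawl Baidu search result pages for a given keyword, fetching multiple pages with multithreading or multiprocessing.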

https://www.baidu.com/s?wd=python&pn=20

Crawl with pagination (10 pages in total).
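
One possible starting point for the homework, using a thread pool, is sketched below. It is not a reference solution. The function names are invented here, and the paging rule pn = page * 10 is an assumption inferred from the sample URL above. A real crawl may also need a browser-like User-Agent header.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = 'https://www.baidu.com/s'


def page_params(keyword, page, page_size=10):
    """Query parameters for a zero-based page (assumes pn = page * page_size)."""
    return {'wd': keyword, 'pn': page * page_size}


def fetch_page(keyword, page):
    """Download one result page and return (page, html)."""
    response = requests.get(BASE_URL, params=page_params(keyword, page), timeout=5)
    return page, response.text


def crawl(keyword, pages=10, workers=5, fetch=fetch_page):
    """Fetch `pages` result pages concurrently; results come back in page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda page: fetch(keyword, page), range(pages)))
```

A call such as crawl('python') returns a list of (page, html) pairs that can then be written to files or handed to Beautiful Soup. The fetch parameter exists only so the threading logic can be tried with a stand-in function.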