requests and BeautifulSoup

Date: 2019-11-12

Part 1: The Requests library

Requests is an elegant and simple HTTP library for Python, built for human beings.

1. Installation

pip install requests

Quick test after installation

>>> import requests
>>> r=requests.get("http://www.baidu.com")
>>> print(r.status_code)
200
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç\x99¾åº¦ä¸\x80ä¸\x8b class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ\x96°é\x97»</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>å\x9c°å\x9b¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§\x86é¢\x91</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å\x90§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç\x99»å½\x95</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">ç\x99»å½\x95</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ\x9b´å¤\x9aäº§å\x93\x81</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å\x85³äº\x8eç\x99¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ\x84\x8fè§\x81å\x8f\x8dé¦\x88</a>&nbsp;äº¬ICPè¯\x81030173å\x8f·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

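2. The eight methods of the requests library

requests.request()  Constructs a request; the base method that the methods below build on
requests.get()      The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()     Fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post()     Submits a POST request to an HTML page; corresponds to HTTP POST
requests.put()      Submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch()    Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete()   Submits a delete request to an HTML page; corresponds to HTTP DELETE

requests.options(url, **kwargs)  Submits an OPTIONS request; corresponds to HTTP OPTIONS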

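3. Two important objects in the requests library: Request and Response (the Response holds the content returned to the crawler)

response = requests.get(url)

This call constructs a Request object that asks the server for a resource

and returns a Response object containing the server's resource.

Full signature: requests.get(url, params=None, **kwargs)

∙ url: the URL of the page to fetch
∙ params: extra parameters for the URL, as a dict or byte stream; optional
∙ **kwargs: 12 parameters that control the request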
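
To see how params ends up in the URL without sending anything over the network, a request can be built and prepared by hand. This is only a minimal sketch: the /s path and the wd key are borrowed from the Baidu search form in the output above, and the value "python" is an arbitrary example.

```python
import requests

# Build a Request object by hand and prepare it; nothing is sent.
req = requests.Request(
    "GET",
    "http://www.baidu.com/s",
    params={"wd": "python"},  # example query parameter (an assumption)
)
prepared = req.prepare()

# The params dict has been encoded into the query string.
print(prepared.method)  # GET
print(prepared.url)     # http://www.baidu.com/s?wd=python
```

Calling requests.get(url, params=...) performs the same preparation step internally and then sends the request.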

The Response object contains everything the server returned, and also the Request that produced it.

>>> import requests
>>> r=requests.get("http://www.baidu.com")
>>> print(r.status_code)
200
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Server': 'bfe/1.0.8.18', 'Date': 'Fri, 17 Nov 2017 02:24:03 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:28 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding': 'gzip'}
>>>

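4. Attributes of the Response object

r.status_code        The HTTP status of the response; 200 means success, 404 means the resource was not found
r.text               The response body as a string, i.e. the content of the page at the URL
r.encoding           The response encoding guessed from the HTTP headers
r.apparent_encoding  The response encoding inferred from the content itself (a fallback encoding)
r.content            The response body in binary form

Compare with the quick installation test: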

>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
>>> r.encoding="utf-8"
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>視頻</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>貼吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登陸</a> </noscript> <script>document.write(\'<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=\'+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登陸</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多產品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>關於百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必讀</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意見反饋</a>&nbsp;京ICP證030173號&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

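r.encoding: if the header contains no charset, the encoding is assumed to be ISO-8859-1. r.text decodes the page content according to r.encoding.

r.apparent_encoding: the encoding inferred by analyzing the page content; it can be treated as a fallback for r.encoding.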
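
The garbled title in the first test comes from decoding UTF-8 bytes as ISO-8859-1. The effect can be reproduced with plain bytes and no network access; this is just an illustrative sketch using the page title as sample data.

```python
# Bytes as a server would send them: the page title encoded in UTF-8.
raw = "百度一下".encode("utf-8")

# No charset in the header: the body is decoded as ISO-8859-1.
# This never raises, but every byte becomes one wrong character.
garbled = raw.decode("iso-8859-1")

# Decoding with the encoding detected from the content gives the real text.
correct = raw.decode("utf-8")

print(len(raw), len(garbled), len(correct))  # 12 12 4
print(correct)                               # 百度一下
```

With a real response, the equivalent fix is to set r.encoding = r.apparent_encoding before reading r.text.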

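5. requests exceptions

Exception                   Description
requests.ConnectionError    Network connection error, e.g. DNS lookup failure or connection refused
requests.HTTPError          HTTP error
requests.URLRequired        Missing URL
requests.TooManyRedirects   Maximum number of redirects exceeded
requests.ConnectTimeout     Timed out while connecting to the remote server
requests.Timeout            The request to the URL timed out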
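
When several of these exceptions are caught separately, the order of the except clauses matters, because some of them are subclasses of others. The short check below only inspects the class hierarchy and sends no request.

```python
import requests

# Every exception in the table derives from requests.RequestException,
# so a final "except requests.RequestException" catches all of them.
for exc in (
    requests.ConnectionError,
    requests.HTTPError,
    requests.URLRequired,
    requests.TooManyRedirects,
    requests.ConnectTimeout,
    requests.Timeout,
):
    print(exc.__name__, issubclass(exc, requests.RequestException))

# ConnectTimeout is both a ConnectionError and a Timeout, so an
# "except requests.ConnectTimeout" clause must come before either of them.
print(issubclass(requests.ConnectTimeout, requests.ConnectionError))  # True
print(issubclass(requests.ConnectTimeout, requests.Timeout))          # True
```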

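6. Response exceptions

r.raise_for_status()  Raises requests.HTTPError if the status code signals an error (4xx or 5xx)

r.raise_for_status() checks r.status_code internally, so no extra if statement is needed, and it fits naturally into try-except error handling.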
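
One common way to combine raise_for_status() with try-except is a small fetch helper. The sketch below is an assumption-laden example rather than a fixed recipe: the name get_html_text, the 30-second timeout, and returning None on failure are all choices made here for illustration.

```python
import requests


def get_html_text(url):
    """Return the page text, or None if the request fails in any way."""
    try:
        r = requests.get(url, timeout=30)  # timeout value is an assumption
        r.raise_for_status()               # raises HTTPError on 4xx/5xx
        r.encoding = r.apparent_encoding   # decode with the detected encoding
        return r.text
    except requests.RequestException:
        return None


# A URL without a scheme is rejected before any network traffic happens.
print(get_html_text("not-a-valid-url"))  # None
```

Catching requests.RequestException covers every exception listed in section 5, since they all derive from it.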
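
7. HTTP: Hypertext Transfer Protocol
HTTP is a stateless application-layer protocol based on a request-response model.
HTTP uses URLs to locate network resources. The URL format is:
http://host[:port][path]
host: a valid Internet host domain name or IP address
port: the port number; the default is 80
path: the path of the requested resource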

Example HTTP URLs:
http://www.baidu.com
http://192.168.179.130/duty
How to read an HTTP URL:
A URL is the Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.
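
The host, port, and path parts can be pulled out of the two example URLs with the standard library. This is a small sketch; falling back to port 80 and to "/" when a part is absent is done by hand here, following the defaults described above.

```python
from urllib.parse import urlsplit

for url in ("http://www.baidu.com", "http://192.168.179.130/duty"):
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80    # no explicit port: HTTP defaults to 80
    path = parts.path or "/"   # an empty path means the root resource
    print(host, port, path)
```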

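HTTP methods for operating on resources:

GET     Requests the resource at the URL
HEAD    Requests the response headers for the resource at the URL, i.e. only its header information
POST    Requests that new data be appended to the resource at the URL
PUT     Requests that a resource be stored at the URL, overwriting whatever is there
PATCH   Requests a partial update of the resource at the URL, changing only part of its content
DELETE  Requests deletion of the resource stored at the URL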
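PATCH vs. PUT

Suppose the URL holds a record UserInfo with 20 fields, including UserID and UserName.
Requirement: the user changes UserName and nothing else.
• With PATCH, only the UserName update is submitted to the URL.
• With PUT, all 20 fields must be submitted to the URL; any field left out is deleted.
The main benefit of PATCH: it saves network bandwidth.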
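
The bandwidth difference can be made concrete by preparing both requests and comparing their body sizes, without sending anything. The endpoint URL and the field names other than UserID and UserName are hypothetical placeholders.

```python
import requests

URL = "http://example.com/users/1"  # hypothetical endpoint

# A full record with 20 fields, as in the scenario above.
user_info = {"UserID": 1, "UserName": "old_name"}
user_info.update({f"Field{i}": f"value{i}" for i in range(3, 21)})

# PATCH carries only the changed field; PUT carries the whole record.
patch = requests.Request("PATCH", URL, json={"UserName": "new_name"}).prepare()
put = requests.Request("PUT", URL, json={**user_info, "UserName": "new_name"}).prepare()

print(len(user_info))                   # 20
print(len(patch.body) < len(put.body))  # True
```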

HEAD

>>> r=requests.head("http://www.baidu.com")
>>> r.headers
{'Server': 'bfe/1.0.8.18', 'Date': 'Fri, 17 Nov 2017 02:51:22 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 13 Jun 2016 02:50:50 GMT', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Content-Encoding': 'gzip'}
>>> r.text
''

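Parameters

**kwargs: parameters that control the request; all are optional
params: dict or byte sequence, added to the URL as query parameters
data: dict, byte sequence, or file object, sent as the body of the Request
json: JSON-serializable data, sent as the body of the Request
headers: dict of custom HTTP headers
cookies: dict or CookieJar, the cookies sent with the Request
auth: tuple, enables HTTP authentication
files: dict, for uploading files
timeout: timeout in seconds
proxies: dict of proxy servers to use; may include login credentials
allow_redirects: True/False, default True; switch for following redirects
stream: True/False, default False; when False the response body is downloaded immediately
verify: True/False, default True; switch for verifying the SSL certificate
cert: path to a local SSL certificate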
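
The effect of data and headers can be inspected on a prepared request without contacting any server. In this sketch the endpoint, the form field, and the User-Agent string are made-up values.

```python
import requests

req = requests.Request(
    "POST",
    "http://example.com/post",           # hypothetical endpoint
    data={"key1": "value1"},             # form fields for the request body
    headers={"User-Agent": "demo/0.1"},  # custom header (made-up value)
)
prepared = req.prepare()

print(prepared.body)                     # key1=value1
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.headers["User-Agent"])    # demo/0.1
```

Parameters such as timeout, proxies, verify, and allow_redirects do not change the request itself; they control how it is sent, so they are passed directly to calls like requests.get(url, timeout=10).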

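Robots Exclusion Standard
Purpose:
A website tells web crawlers which pages may be crawled and which may not.
Form:
A robots.txt file in the website's root directory.

Web crawlers: read robots.txt, automatically or manually, before crawling content.

Binding force: the Robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk.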
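
Python's standard library can evaluate robots.txt rules automatically. The sketch below parses a made-up robots.txt from a string so that it runs offline; the crawler name and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real crawler would download it from the site root.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("my-crawler", "http://example.com/index.html"))      # True
print(parser.can_fetch("my-crawler", "http://example.com/private/a.html"))  # False
```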

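Installation: pip install beautifulsoup4

The Beautiful Soup library is also known as beautifulsoup4 or bs4.
By convention it is imported as follows; most of the work is done with the BeautifulSoup class.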

from bs4 import BeautifulSoup
import bs4

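Beautiful Soup parsers (mk is the markup string to parse)

Parser                                            Usage                              Requirement
bs4's HTML parser (Python's built-in html.parser) BeautifulSoup(mk, 'html.parser')   install the bs4 library
lxml's HTML parser                                BeautifulSoup(mk, 'lxml')          pip install lxml
lxml's XML parser                                 BeautifulSoup(mk, 'xml')           pip install lxml
html5lib's parser                                 BeautifulSoup(mk, 'html5lib')      pip install html5lib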

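Basic elements of the BeautifulSoup class

Tag: a tag, the basic unit of organization, opened with <> and closed with </>. Any tag that exists in HTML syntax can be accessed as soup.<tag>; when the document contains several <tag> elements, soup.<tag> returns the first one.

Name: the tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name. Every <tag> has a name, obtained with <tag>.name, of type string.

Attributes: the tag's attributes, organized as a dictionary. Format: <tag>.attrs. A <tag> can have zero or more attributes; the type is dict.

NavigableString: the non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string. A NavigableString can span multiple levels of nesting.

Comment: the comment portion of a string inside a tag; a special subtype of NavigableString.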
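
The five element types can be seen side by side on a small document. The HTML below is made up for this sketch, and the built-in html.parser is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# A small made-up document, parsed with the built-in parser.
html = (
    '<p class="title"><b>The demo page</b></p>'
    '<a href="http://example.com/a" id="link1">First</a>'
    '<a href="http://example.com/b" id="link2">Second</a>'
    "<i><!--a comment--></i>"
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.a                 # Tag: the first <a> in the document
print(tag.name)              # a
print(tag.attrs["id"])       # link1
print(tag.string)            # First
print(soup.p.string)         # The demo page (reaches through the nested <b>)
print(isinstance(soup.i.string, Comment))  # True
```

Because printing a Comment shows only its text, checking the type with isinstance is the reliable way to tell a comment from an ordinary string.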

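Downward traversal of the tag tree

.contents     A list of child nodes; all of <tag>'s children stored in a list
.children     An iterator over child nodes; like .contents, used to loop over the children
.descendants  An iterator over all descendant nodes, used for looping

Upward traversal

.parent   The node's parent tag
.parents  An iterator over the node's ancestor tags, used to loop over ancestors

Sibling traversal

.next_sibling       The next sibling node in HTML document order
.previous_sibling   The previous sibling node in HTML document order
.next_siblings      An iterator over all following sibling nodes in HTML document order
.previous_siblings  An iterator over all preceding sibling nodes in HTML document order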
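
The three directions of traversal can be tried on a tiny tree. The markup is invented for this sketch and deliberately contains no whitespace between tags; in real pages the sibling of a tag is often a whitespace string rather than another tag.

```python
from bs4 import BeautifulSoup

# No whitespace between tags, so every sibling here is a tag.
soup = BeautifulSoup(
    '<div><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>',
    "html.parser",
)
div = soup.div
first = soup.p

# Downward: direct children versus all descendants (tags and strings).
print(len(div.contents))                        # 3
print([child["id"] for child in div.children])  # ['a', 'b', 'c']
print(len(list(div.descendants)))               # 6

# Upward: the parent, then every ancestor up to the document itself.
print(first.parent.name)                        # div
print([p.name for p in first.parents])          # ['div', '[document]']

# Sideways: neighbours that share the same parent.
print(first.next_sibling["id"])                 # b
print(first.previous_sibling)                   # None
print([s["id"] for s in first.next_siblings])   # ['b', 'c']
```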

The prettify() method of bs4
.prettify() adds '\n' after each <> tag and its content in the HTML text
.prettify() can also be called on a single tag: <tag>.prettify()
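
A quick way to see what prettify() does is to print the same markup with and without it. The one-line snippet of HTML is an arbitrary example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")

print(str(soup))          # the markup on a single line
print(soup.prettify())    # the same markup, one tag or string per line
print(soup.b.prettify())  # prettify() also works on a single tag
```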

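The find_all() method

<>.find_all(name, attrs, recursive, string, **kwargs)
∙ name: search string for the tag name
∙ attrs: search string for tag attribute values; a specific attribute can be named
∙ recursive: whether to search all descendants; default True
∙ string: search string for the text between <> and </>
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)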
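
The parameters above can be combined in several ways. The following sketch runs a few typical searches against a made-up document; the ids and URLs are placeholders.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>Links</b></p>'
    '<a href="http://example.com/a" id="link1">First</a>'
    '<a href="http://example.com/b" id="link2">Second</a>'
)
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find_all("a")))                         # 2  (by tag name)
print([t.name for t in soup.find_all(["a", "b"])])     # ['b', 'a', 'a']
print(soup.find_all("p", "title")[0].name)             # p  (attrs string matches class)
print(soup.find_all(id="link1")[0].string)             # First
print(len(soup.find_all(id=re.compile("^link"))))      # 2  (regular expression)
print([str(s) for s in soup.find_all(string="Second")])  # ['Second']
print(soup("a") == soup.find_all("a"))                 # True
```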

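<>.find()                     Searches and returns only one result; same parameters as .find_all()
<>.find_parents()             Searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()              Returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings()       Searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()        Returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings()   Searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()    Returns one result from the preceding sibling nodes; same parameters as .find()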
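
These variants differ only in where they look and in whether they return one result or a list. A short sketch on an invented three-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="box"><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>',
    "html.parser",
)
middle = soup.find("p", id="b")  # find() returns one Tag (or None), not a list

print(middle.string)                                   # two
print(middle.find_parent("div")["id"])                 # box
print([t["id"] for t in middle.find_parents("div")])   # ['box']
print(middle.find_next_sibling("p")["id"])             # c
print(middle.find_previous_sibling("p")["id"])         # a
print([t["id"] for t in soup.find("p").find_next_siblings("p")])  # ['b', 'c']
print(soup.find("table"))                              # None
```

Since find() returns None when nothing matches, its result should be checked before accessing attributes on it.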
