Section 14.9: Using urllib.request + BeautifulSoup in Python to Get Basic Information About a URL Access

Date: 2020-05-06


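After reading the content of a URL with urllib.request and parsing it with BeautifulSoup, you can print basic information about the HTML document and about the access itself from three objects: the BeautifulSoup object, the request object, and the response object. Taking the blog post "Section 14.6: Implementation Code for Simulating Browser Access to Web Pages with Python urllib.request" as the page to visit, the code for reading and parsing it is as follows: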

>>> from bs4 import BeautifulSoup
>>> import urllib.request
>>> def getURLinf(url):
    # Send a browser-like User-Agent so the site returns the normal page
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
    req = urllib.request.Request(url=url, headers=header)
    resp = urllib.request.urlopen(req, timeout=5)
    # decode() defaults to UTF-8, which matches the charset this page declares
    html = resp.read().decode()

    # The 'lxml' parser requires the third-party lxml package to be installed
    soup = BeautifulSoup(html, 'lxml')
    return (soup, req, resp)
>>> soup, req, resp = getURLinf(r'https://blog.csdn.net/LaoYuanPython/article/details/100629947')

The basic information that can be obtained includes:
1. The document title

>>> soup.title
<title>第14.6節 使用Python urllib.request模擬瀏覽器訪問網頁的實現代碼 - 老猿Python - CSDN博客</title>

2. Whether the document is an XML document

>>> soup.is_xml
False
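The two checks above can be tried without any network access. The following is a minimal sketch, assuming a small hand-written HTML string (`SAMPLE_HTML`) and a hypothetical helper `get_title_text` that are not part of the original article; it uses the standard-library `html.parser` backend instead of `lxml` so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a live download.
SAMPLE_HTML = (
    "<html><head><title> Demo page </title></head>"
    "<body><p>Hello</p></body></html>"
)


def get_title_text(soup):
    """Return the stripped <title> text, or None if the page has no <title>."""
    if soup.title is None:  # soup.title is None when no <title> tag exists
        return None
    return soup.title.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_text = get_title_text(soup)
no_title = get_title_text(BeautifulSoup("<p>no head</p>", "html.parser"))

print(title_text)   # Demo page
print(no_title)     # None
print(soup.is_xml)  # False, because an HTML parser built this tree
```

Note that `soup.title` returns the whole tag, as in the transcript above, while `get_text()` returns only the text inside it. Guarding against `None` matters for pages that lack a `<title>` element.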

3. The URL of the document

>>> req.full_url
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>> resp.geturl()
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>> resp.url
'https://blog.csdn.net/LaoYuanPython/article/details/100629947'
>>>
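In the transcript above all three values are identical because no redirect took place. `req.full_url` is the URL that was requested, while `resp.url` (and `resp.geturl()`, which recent Python documentation marks as deprecated in favor of `url`) is the URL of the resource actually retrieved, so the two can differ after a redirect. As a rough sketch, the standard `urllib.parse.urlsplit` function can break either value into components for comparison; the helper name `same_resource` is hypothetical and its comparison rule (ignore query and fragment) is only one possible choice.

```python
from urllib.parse import urlsplit


def same_resource(requested_url, final_url):
    """Compare scheme, host and path of two URLs, ignoring query and fragment."""
    a, b = urlsplit(requested_url), urlsplit(final_url)
    return (a.scheme, a.netloc.lower(), a.path) == (b.scheme, b.netloc.lower(), b.path)


url = "https://blog.csdn.net/LaoYuanPython/article/details/100629947"
parts = urlsplit(url)

print(parts.scheme)  # https
print(parts.netloc)  # blog.csdn.net
print(parts.path)    # /LaoYuanPython/article/details/100629947
```

With a live response, `same_resource(req.full_url, resp.url)` would give a quick indication of whether the request ended up somewhere other than where it was sent.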

4. The host that serves the document

>>> req.host
'blog.csdn.net'

5. Request header information

>>> req.header_items()
[('Host', 'blog.csdn.net'), ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36')]
>>>
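Two details of the output above are worth explaining: the header appears as `User-agent` rather than `User-Agent`, and a `Host` header is listed although the code never set one. The sketch below builds a `Request` without sending it, which suggests that `Request` normalizes header names with `str.capitalize()` and that `Host` is only added when the request is actually sent. The shortened `User-Agent` value is a placeholder chosen for illustration.

```python
import urllib.request

url = "https://blog.csdn.net/LaoYuanPython/article/details/100629947"
# Shortened, hypothetical User-Agent value for illustration only.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# Nothing has been sent yet, so all of this works offline.
print(req.host)            # blog.csdn.net
print(req.get_method())    # GET (no request body was supplied)
print(req.header_items())  # [('User-agent', 'Mozilla/5.0')], no Host entry yet

# Lookups use the stored, capitalized form of the header name.
print(req.has_header("User-agent"))  # True
print(req.has_header("User-Agent"))  # False
```

Because of this normalization, code that inspects request headers with `has_header()` or `get_header()` should use the capitalized form such as `'User-agent'`.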

6. The response status code

>>> resp.getcode()
200
>>>
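A status code of 200 is the only one you will normally see through `resp.getcode()`, because `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses instead of returning them. The following is a minimal sketch of one way to handle both cases; the function names `status_phrase` and `fetch_status` are hypothetical, and `resp.status` is used as the attribute that recent Python documentation recommends over `getcode()`.

```python
from http import HTTPStatus
import urllib.error
import urllib.request


def status_phrase(code):
    """Map a numeric status code to its standard reason phrase."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:  # raised for codes that HTTPStatus does not define
        return "Unknown"


def fetch_status(url, timeout=5):
    """Return the HTTP status code for url, including 4xx/5xx codes.

    Connection-level failures (urllib.error.URLError, timeouts) are not
    caught here and propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


print(status_phrase(200))  # OK
print(status_phrase(404))  # Not Found
```

Calling `status_phrase(fetch_status(some_url))` would then give a readable summary of the outcome; the network call is left out of the example so that it runs offline.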

7. The HTTP response headers

>>> resp.headers.items()
[('Date', 'Sun, 08 Sep 2019 15:07:12 GMT'), ('Content-Type', 'text/html; charset=UTF-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'close'), ('Set-Cookie', 'acw_tc=2760828215679552322374611eb7315abdcfe4ee6f7af5d157db5621c4267d;path=/;HttpOnly;Max-Age=2678401'), ('Server', 'openresty'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'uuid_tt_dd=10_19729129290-1567955232238-614052; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Set-Cookie', 'dc_session_id=10_1567955232238.557324; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Vary', 'Accept-Encoding'), ('Strict-Transport-Security', 'max-age=86400')]
>>>
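The header list above contains several `Set-Cookie` entries and a `Content-Type` that carries the page's charset. `resp.headers` is an `http.client.HTTPMessage`, so it supports case-insensitive lookup, `get_all()` for repeated headers, and `get_content_charset()`. The sketch below demonstrates these calls offline by parsing a small made-up header block (`RAW`, with placeholder cookie values) through `http.client.parse_headers`; with a live response you would call the same methods on `resp.headers` directly.

```python
import io
import http.client

# Made-up header block standing in for a real response; values are placeholders.
RAW = (
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Set-Cookie: a=1; Path=/\r\n"
    b"Set-Cookie: b=2; Path=/\r\n"
    b"Vary: Accept-Encoding\r\n"
    b"\r\n"
)

headers = http.client.parse_headers(io.BytesIO(RAW))

# Plain indexing is case-insensitive but returns only the first match.
first_cookie = headers["set-cookie"]
# get_all() returns every value of a repeated header.
cookies = headers.get_all("Set-Cookie")
# The charset is taken from Content-Type; fall back to UTF-8 if absent.
charset = headers.get_content_charset() or "utf-8"

print(first_cookie)  # a=1; Path=/
print(cookies)       # ['a=1; Path=/', 'b=2; Path=/']
print(charset)       # utf-8
```

One practical use is decoding the body with the declared charset, for example `resp.read().decode(resp.headers.get_content_charset() or 'utf-8')`, rather than relying on the UTF-8 default of `decode()`.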

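This section showed that, after reading the content of a URL with urllib.request and parsing it with BeautifulSoup, some basic information about the URL access is easy to obtain: the document title and type from the BeautifulSoup object, the URL, host, and request headers from the request object, and the status code and response headers from the response object. Together these give a summary of the access.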

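LaoYuanPython (老猿Python): learn Python with LaoYuan!
Blog address: https://blog.csdn.net/LaoYuanPython
LaoYuanPython blog article index: https://blog.csdn.net/LaoYuanPython/article/details/98245036
Your support is appreciated: please like, comment, and follow. Thank you!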