Web Crawler Fundamentals and Basic Use of the urllib Library

Date: 2019-11-16



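Basic principles of crawlers

Definition: a crawler is an automated program that requests websites and extracts data from them.

1. It downloads the data or content its author asks for.

2. It moves from page to page across the web automatically.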

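Types of crawlers

1. General-purpose crawlers (not limited to a topic)

2. Special-purpose crawlers, also called focused crawlers (the main subject here)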

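Basic workflow

1. Send a request: an HTTP Request, using a request method such as GET or POST

2. Get the response content: the HTTP Response

3. Parse the content

4. Save the data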
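
The four steps above can be sketched with the standard library alone. This is a minimal illustration rather than a full crawler: the helper names fetch, parse_title, save, and crawl are hypothetical, and using the page title as the extracted data is an assumption made for the example.

```python
import re
import urllib.request


def fetch(url, timeout=10):
    """Steps 1 and 2: send a GET request and return the decoded response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode('utf-8')


def parse_title(html):
    """Step 3: extract the text of the <title> element, or None if there is none."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def save(text, path):
    """Step 4: write the extracted data to a file."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)


def crawl(url, path):
    """Run the four steps in order for a single page."""
    title = parse_title(fetch(url))
    if title is not None:
        save(title, path)
    return title
```

Calling crawl('http://www.baidu.com', 'title.txt') would run all four steps against a live site. The parsing step can be tried offline, for example with parse_title('<title>Demo</title>').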

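Request

Request method: GET or POST

Request URL: a Uniform Resource Locator. A web page, an image, or a video can each be uniquely identified by a URL.

Request headers: header information such as User-Agent, Host, and Cookies

Request body: additional data carried with the request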
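
These four parts map onto the arguments of urllib.request.Request. The sketch below builds a request without sending it, so each part can be inspected. The User-Agent value and the form data are made-up examples.

```python
import urllib.parse
import urllib.request

# Request body: form data encoded to bytes (sample data for illustration)
body = urllib.parse.urlencode({'name': 'kobe'}).encode('utf-8')

req = urllib.request.Request(
    url='http://httpbin.org/post',                  # request URL
    data=body,                                      # request body
    headers={'User-Agent': 'example-crawler/0.1'},  # hypothetical header value
    method='POST',                                  # request method
)

# Nothing has been sent yet; the object only describes the request.
print(req.get_method())              # POST
print(req.full_url)                  # http://httpbin.org/post
print(req.get_header('User-agent'))  # example-crawler/0.1
print(req.data)                      # b'name=kobe'
```

Request stores header names in capitalized form, which is why the lookup uses 'User-agent'.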

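Response

Status code: 200 (success), 301 (redirect), 404 (page not found), 502 (server error)

Response headers: content type, content length, server information, Set-Cookie, and so on

Response body: the most important part. It contains the content of the requested resource, such as a page's HTML or an image's binary data.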
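
As a small offline illustration of the status codes listed above, the standard library's http.HTTPStatus gives the official phrase for each code. The status_class helper is a hypothetical name; it groups codes by their first digit.

```python
from http import HTTPStatus


def status_class(code):
    """Return a coarse label for an HTTP status code based on its first digit."""
    labels = {2: 'success', 3: 'redirect', 4: 'client error', 5: 'server error'}
    return labels.get(code // 100, 'other')


for code in (200, 301, 404, 502):
    print(code, HTTPStatus(code).phrase, status_class(code))
# 200 OK success
# 301 Moved Permanently redirect
# 404 Not Found client error
# 502 Bad Gateway server error
```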

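What kinds of data can be crawled

1. Web page text: HTML documents and JSON text

2. Images: fetched as binary data and saved in an image format

3. Video: likewise binary data, saved in a video format

4. Other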
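
The practical difference between text and binary data shows up when saving. A rough sketch, assuming the payloads are already in memory as bytes: the helper names save_binary and save_text are hypothetical, and the two payloads are stand-ins for what response.read() would return.

```python
import os
import tempfile


def save_binary(data, path):
    """Write raw bytes unchanged; suitable for images and video."""
    with open(path, 'wb') as f:
        f.write(data)


def save_text(data, path, encoding='utf-8'):
    """Decode bytes to text first, then write the text."""
    with open(path, 'w', encoding=encoding) as f:
        f.write(data.decode(encoding))


# Stand-in payloads; a real crawler would pass response.read() instead.
fake_image = b'\x89PNG\r\n\x1a\n'
fake_page = b'<html>hello</html>'

out_dir = tempfile.mkdtemp()
image_path = os.path.join(out_dir, 'picture.png')
page_path = os.path.join(out_dir, 'page.html')

save_binary(fake_image, image_path)
save_text(fake_page, page_path)
```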

Parsing methods

1. Direct processing

2. JSON parsing (for JSON-formatted data)

3. Regular expressions

4. BeautifulSoup

5. XPath

6. PyQuery
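
Two of these methods, JSON parsing and regular expressions, need only the standard library, so they are easy to try. The sample strings below are invented for illustration. BeautifulSoup, XPath, and PyQuery rely on third-party packages and are not shown here.

```python
import json
import re

# JSON parsing: turn JSON text into Python objects
json_text = '{"name": "zhangsan", "age": 22}'
record = json.loads(json_text)
print(record['name'], record['age'])  # zhangsan 22

# Regular expressions: pull every href value out of a snippet of HTML
html = '<a href="/index.html">Home</a> <a href="/about.html">About</a>'
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['/index.html', '/about.html']
```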

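The urllib library

urllib is an HTTP request library that ships with Python. It contains the following modules:

urllib.request: the request module

urllib.error: the exception-handling module

urllib.parse: the URL-parsing module

urllib.robotparser: the robots.txt-parsing module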
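
urllib.robotparser is the least familiar of the four, so here is a small sketch of it. The rules are fed in as text, so nothing is downloaded. The robots.txt content, the example.com URLs, and the user agent name example-crawler are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, supplied as a list of lines instead of being downloaded
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

allowed = parser.can_fetch('example-crawler', 'http://example.com/index.html')
blocked = parser.can_fetch('example-crawler', 'http://example.com/private/a.html')
print(allowed)  # True
print(blocked)  # False
```

For a live site, set_url() followed by read() fetches the real robots.txt instead of parse().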

#urllib.request

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
# decode() converts the raw response bytes into readable page source, using UTF-8
# The call above is a GET request



import urllib.parse
import urllib.request
d = bytes(urllib.parse.urlencode({'name': 'kobe'}), encoding='utf-8')
response = urllib.request.urlopen('http://www.baidu.com', data=d)
print(response.read().decode('utf-8'))
# Passing data makes urlopen() send a POST request



import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('Time Out')
# Set a timeout for the request



import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)  # get the status code
print(response.getheaders())  # get the response headers; prints a list of tuples
print(response.getheader('Server'))  # get the value of the Server header



from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': '......',
    'Host': '......'
}
form = {'name': 'kobe'}

data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')  # build a Request object
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# urlopen() cannot take request headers directly, so a Request object is used to pass them

# This sends a POST request with added request headers through a Request object





# urllib.error
from urllib import error, request

try:
    response = request.urlopen('http://www.baidu.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful!')

# HTTPError is a subclass of URLError, which is why it is caught first
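
The subclass relationship can be checked directly, and it explains why the order of the except clauses matters. The sketch below constructs the exceptions by hand rather than triggering them over the network. The classify helper and the example.com URL are hypothetical.

```python
import urllib.error

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True


def classify(exc):
    """Raise the given exception and report which except clause catches it."""
    try:
        raise exc
    except urllib.error.HTTPError as e:  # the more specific class goes first
        return 'HTTPError {}'.format(e.code)
    except urllib.error.URLError as e:   # catches everything else from urllib
        return 'URLError {}'.format(e.reason)


http_error = urllib.error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', None, None)
url_error = urllib.error.URLError('unreachable')

print(classify(http_error))  # HTTPError 404
print(classify(url_error))   # URLError unreachable
```

If the URLError clause came first, it would also catch every HTTPError, and the status code would never be reported.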





#urllib.parse
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)

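# Split the URL into its components

# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

# The result is a named tuple with one field per component

# The scheme argument is only a default: it is used when the URL carries no scheme of its own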
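
A short sketch of when the default scheme applies. The three input URLs are variations chosen for illustration. Note the last case: without a leading //, the host name is treated as part of the path.

```python
from urllib.parse import urlparse

with_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')
without_scheme = urlparse('//www.baidu.com/index.html', scheme='https')
bare = urlparse('www.baidu.com/index.html', scheme='https')

print(with_scheme.scheme)     # http  (the URL's own scheme wins)
print(without_scheme.scheme)  # https (the default is used)
print(without_scheme.netloc)  # www.baidu.com
print(repr(bare.netloc))      # ''
print(bare.path)              # www.baidu.com/index.html
```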

from urllib.parse import urlunparse

data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

# http://www.baidu.com/index.html;user?a=6#comment

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com','index.html'))
print(urljoin('http://www.baidu.com#comment','?username="zhangsan"'))
print(urljoin('http://www.baidu.com','www.sohu.com'))

# http://www.baidu.com/index.html

# http://www.baidu.com?username="zhangsan"

# http://www.baidu.com/www.sohu.com

# If the second argument supplies a URL component that the first lacks, it is added; otherwise it overrides the first argument's component
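
A few more cases make the add-or-override rule concrete. The base URL below is an invented variation that has a path and a query of its own.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html?x=1'

relative = urljoin(base, 'c.html')                # replaces the last path segment
rooted = urljoin(base, '/c.html')                 # replaces the whole path
query_only = urljoin(base, '?y=2')                # keeps the path, replaces the query
absolute = urljoin(base, 'https://www.sohu.com/x')  # a full URL replaces everything
host_only = urljoin(base, '//www.sohu.com/x')     # keeps only the base scheme

print(relative)    # http://www.baidu.com/a/c.html
print(rooted)      # http://www.baidu.com/c.html
print(query_only)  # http://www.baidu.com/a/b.html?y=2
print(absolute)    # https://www.sohu.com/x
print(host_only)   # http://www.sohu.com/x
```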



from urllib.parse import urlencode

params = {
    'name': 'zhangsan',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

# http://www.baidu.com?name=zhangsan&age=22
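
urlencode also escapes characters that are not safe in a query string, and parse_qs reverses the process. A brief sketch with made-up parameter values that contain a space and non-ASCII text:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {'name': 'zhang san', 'city': '北京'}
url = 'http://www.baidu.com?' + urlencode(params)
print(url)
# http://www.baidu.com?name=zhang+san&city=%E5%8C%97%E4%BA%AC

# parse_qs decodes the query string back into a dict of lists
decoded = parse_qs(urlsplit(url).query)
print(decoded)
# {'name': ['zhang san'], 'city': ['北京']}
```

Each value comes back as a list because a query string may repeat the same key.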