Python Crawlers: Writing the Simplest Web Crawler

Date: 2019-12-07


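Knowledge is like scraps of cloth: remember to stitch them together, and only then can you make a dazzling entrance.

Recently I have become very interested in Python crawlers. Here I share my own learning path, and suggestions are welcome. Let's exchange ideas and improve together.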

1. Development tools

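I use Sublime Text 3. It is compact yet powerful (a phrase men may not love), and I am quite hooked on it. I recommend it, although if your computer has plenty of resources, PyCharm may suit you better.
For setting up a Python development environment in Sublime Text 3, I recommend this blog post:
[Setting up a Python development environment in Sublime](http://www.cnblogs.com/codefish/p/4806849.html)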

2. Introduction to crawlers

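As the name suggests, a crawler is like a bug crawling over the great web that is the Internet, which is how we can fetch whatever we want.
Since we are crawling the Internet, we need to understand the URL, formally the "Uniform Resource Locator" and informally a "link". It has three main parts:
(1) Protocol: for example, the HTTP protocol commonly seen in web addresses.
(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address that the domain name resolves to.
(3) Path: a directory, a file, and so on.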
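As a small sketch of the three parts listed above, the standard library's urllib.parse.urlsplit can take a URL apart. The host comes from the text; the path /index.html is an assumption added only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL: the host is from the article, the path is made up for illustration.
url = "http://www.baidu.com/index.html"
parts = urlsplit(url)

print(parts.scheme)  # the protocol
print(parts.netloc)  # the domain name (or IP address)
print(parts.path)    # the path
```

Nothing is fetched here; urlsplit only parses the string, so it is safe to experiment with any URL.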

3. Building the simplest crawler with urllib

(1) Overview of urllib

Module	Description
urllib.error	Exception classes raised by urllib.request.
urllib.parse	Parse URLs into or assemble them from components.
urllib.request	Extensible library for opening URLs.
urllib.response	Response classes used by urllib.
urllib.robotparser	Load a robots.txt file and answer questions about fetchability of other URLs.
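To make the last row of the table concrete, here is a minimal sketch of urllib.robotparser. The robots.txt rules and the example.com URLs are hypothetical; the rules are parsed from an in-memory list of lines rather than downloaded from a real site.

```python
from urllib import robotparser

# Hypothetical robots.txt content, written inline so no network access is needed.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Ask whether any crawler ("*") may fetch each of these made-up URLs.
blocked = parser.can_fetch("*", "http://example.com/private/page.html")
allowed = parser.can_fetch("*", "http://example.com/index.html")
print(blocked, allowed)
```

For a real site, one would typically call set_url() with the site's robots.txt address and then read() before asking can_fetch().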

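(2) Building the simplest crawler

The Baidu home page is simple and clean, which makes it a good first target for our crawler.
The crawler code is as follows: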

from urllib import request

def visit_baidu():
    URL = "http://www.baidu.com"
    # open the URL; urlopen returns a response object
    response = request.urlopen(URL)
    # read the response body as bytes
    html = response.read()
    # decode the bytes as UTF-8 text
    html = html.decode("utf-8")
    print(html)

if __name__ == '__main__':
    visit_baidu()

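Running it prints the HTML source of the Baidu home page.

To compare with the output, right-click a blank area of the Baidu home page and choose Inspect Element.
The request module can also build a Request object, which urlopen can then open.
The code is as follows: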

from urllib import request

def visit_baidu():
    # create a Request object
    req = request.Request('http://www.baidu.com')
    # open the Request object
    response = request.urlopen(req)
    # read the response body as bytes, then decode it as UTF-8
    html = response.read()
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    visit_baidu()

The output is the same as before.
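One reason to build a Request object first is that it can be inspected or customized before anything is sent. The sketch below only constructs the object and looks at it, without touching the network. The User-Agent value is a hypothetical name chosen for this example.

```python
from urllib import request

# The User-Agent string is a made-up name for illustration.
req = request.Request(
    "http://www.baidu.com",
    headers={"User-Agent": "my-first-crawler/0.1"},
)

print(req.full_url)      # the URL the request targets
print(req.get_method())  # GET, because no request body (data) was given
print(req.host)          # the host parsed out of the URL
# Request stores header names in capitalized form, hence "User-agent".
print(req.get_header("User-agent"))
```

Passing this req to request.urlopen, as in the code above, would then send the request with that header attached.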

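(3) Error handling

Errors are handled through the urllib.error module. The main exceptions are URLError and HTTPError. HTTPError is a subclass of URLError, so an HTTPError can also be caught as a URLError.
An HTTPError exposes the HTTP status code through its code attribute.
The code for handling HTTPError is as follows: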

from urllib import request
from urllib import error

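def Err():
    # this path does not exist, so the server is expected to answer with an HTTP error
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        # print the HTTP status code carried by the exception
        print(e.code)

if __name__ == '__main__':
    Err()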

Running it prints 404, the HTTP status code of the error. You can look up the details of HTTP status codes yourself.
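For a quick offline lookup of what a status code means, the standard library's http.HTTPStatus enum can help. This is a small sketch; the helper name describe_status is hypothetical and not part of urllib.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found' for a status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one that HTTPStatus knows about.
        return f"{code} (unknown status code)"
    return f"{status.value} {status.phrase}"

for code in (200, 404, 500):
    print(describe_status(code))
```

In the error handler above, describe_status(e.code) could be printed instead of the bare number.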

The cause of a URLError is available through its reason attribute.
The code for handling URLError is as follows:

from urllib import request
from urllib import error

def Err():
    # a misspelled domain, intended to trigger a URLError
    url = "https://segmentf.com/"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.URLError as e:
        # print why the request failed
        print(e.reason)

if __name__ == '__main__':
    Err()

Running it prints the reason the request failed.

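Since the goal is to handle errors, it is best to handle both exceptions in the code; the more specific the handling, the clearer the output. Note that HTTPError is a subclass of URLError, so the HTTPError clause must come before the URLError clause. Otherwise the URLError clause catches everything, and a 404, for example, would be printed as Not Found.
The code is as follows: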

from urllib import request
from urllib import error

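# Approach 1: handle both URLError and HTTPError
def Err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        # the more specific exception comes first
        print(e.code)
    except error.URLError as e:
        # any other URL-level failure ends up here
        print(e.reason)

if __name__ == '__main__':
    Err()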
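The effect of the clause order can be checked without any network access by raising hand-built exceptions. This is only a sketch: the two function names are hypothetical, and the exceptions are constructed manually rather than produced by a real request.

```python
import io
from email.message import Message
from urllib import error

def handle_correct_order(exc):
    """HTTPError is tested first, so its status code is reported."""
    try:
        raise exc
    except error.HTTPError as e:
        return f"HTTPError code: {e.code}"
    except error.URLError as e:
        return f"URLError reason: {e.reason}"

def handle_wrong_order(exc):
    """URLError is tested first, so it also swallows every HTTPError."""
    try:
        raise exc
    except error.URLError as e:
        return f"URLError reason: {e.reason}"
    except error.HTTPError as e:  # never reached: HTTPError is a URLError
        return f"HTTPError code: {e.code}"

# Hand-built exceptions standing in for real failures.
http_err = error.HTTPError(
    "https://segmentfault.com/zzz", 404, "Not Found", Message(), io.BytesIO()
)
url_err = error.URLError("name lookup failed")

print(handle_correct_order(http_err))
print(handle_wrong_order(http_err))
print(handle_correct_order(url_err))
```

With the wrong order, the 404 is reported through the reason attribute, which is why it shows up as Not Found.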

You can change the url to see how the different errors are printed.

Being a newcomer isn't easy. If you found this even a little bit useful, please don't be stingy with your likes~
