Study notes - first web crawler

Date: 2019-11-08

Tags: study notes, web crawler. Category: HTML


Open Jupyter.

First, import the request module from Python's urllib library:

from urllib.request import urlopen

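urlopen opens and reads a remote object fetched over the network. It is a general-purpose function that can read HTML files, image files, and any other file stream.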

html = urlopen("http://www.naver.com")
print(html.read())

Crawl result: print shows the page's raw HTML as a bytes object (b'...').
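Because read() returns bytes rather than text, the output is usually decoded before further use. The sketch below is a minimal illustration and not part of the original notes: sample_html and the data: URL are made-up stand-ins so the example runs without network access, and UTF-8 is an assumed encoding.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical inline page, wrapped in a data: URL so no network is needed
sample_html = "<html><body><h1>Hello</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

with urlopen(url) as response:
    raw = response.read()      # bytes, exactly as received

text = raw.decode("utf-8")     # str, ready for printing or parsing
print(type(raw).__name__, type(text).__name__)
print(text)
```

For a real site the encoding may differ, so it is worth checking the response headers before decoding.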

BeautifulSoup

The BeautifulSoup library formats and organizes messy web content by locating HTML tags, and presents the XML structure as easy-to-use Python objects.

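Because BeautifulSoup is not part of the Python standard library, it has to be installed separately (pip install beautifulsoup4). Some Jupyter setups, such as the Anaconda distribution, already ship with it, in which case it can be imported directly and the installation step can be skipped. (The most commonly used object in the BeautifulSoup library happens to be the BeautifulSoup object.)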
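Whether the library is already present can be checked before importing it. This is a small sketch using only the standard library; is_installed is a hypothetical helper name introduced here for illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if module_name can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# bs4 is the import name of the beautifulsoup4 package
if is_installed("bs4"):
    print("BeautifulSoup is available")
else:
    print("BeautifulSoup is missing; install it with: pip install beautifulsoup4")
```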

Adjusting the example from the start of this article:

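from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)  # the first h1 tag on the page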


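We import urlopen, then call html.read() to get the page's HTML content. That HTML is passed to the BeautifulSoup object, which converts it into a tree of nested tag objects.

To extract a particular tag, walk the tree by tag name. On this page, each of the following reaches the same h1 tag: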

bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
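The access paths can be compared on a small inline document. This is a hedged sketch: the HTML string is a made-up stand-in for the real page, and "html.parser" (Python's built-in parser) is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical page, used instead of a live URL so the example runs offline
sample_html = """
<html>
  <head><title>A Sample Page</title></head>
  <body>
    <h1>A Sample Heading</h1>
    <div>Some body text.</div>
  </body>
</html>
"""

bsObj = BeautifulSoup(sample_html, "html.parser")

# Each path walks a different part of the tree but ends at the same tag
paths = [bsObj.html.body.h1, bsObj.body.h1, bsObj.html.h1, bsObj.h1]
for tag in paths:
    print(tag)

# True when every path returned the very same tag object
print(all(tag is paths[0] for tag in paths))
```

The shorter paths work because attribute access searches all descendants for the first matching tag, not only direct children.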

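Now that a basic framework is in place, it is time to think about what can go wrong.

Problems

When the page does not exist on the server (or an error occurs while fetching it):

The request comes back with an HTTP error, such as "404 Page Not Found" or "500 Internal Server Error". In all such cases, urlopen raises an HTTPError exception.

Solution: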

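from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return None, stop the program, or fall back to another plan
else:
    # The program continues. Note: if the except block above already returned
    # or broke out, the else clause is unnecessary and this code never runs.
    pass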
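The caught exception carries more than its message. As a minimal sketch, the function below reads the code and reason attributes of HTTPError; fetch and always_404 are hypothetical names introduced here, and always_404 stands in for urlopen so the failure can be reproduced without a network connection.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch(url, opener=urlopen):
    """Return (status_code, body_bytes); body is None if the server reports an error."""
    try:
        with opener(url) as response:
            return response.status, response.read()
    except HTTPError as e:
        # e.code is the numeric status, e.reason the short text that came with it
        print(f"Request failed: {e.code} {e.reason}")
        return e.code, None

def always_404(url):
    """Hypothetical stand-in for urlopen that simulates a missing page."""
    raise HTTPError(url, 404, "Not Found", None, None)

result = fetch("http://example.com/missing", opener=always_404)
print(result)
```

Returning the status code lets the caller decide whether to retry, skip the page, or stop.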

When the server does not exist:

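If the server cannot be reached at all (the link is dead or the URL is mistyped), urlopen does not return None. It raises URLError, the parent class of HTTPError, so the failure is caught the same way:

from urllib.error import URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except URLError as e:
    print("URL is not found")
else:
    pass  # the program continues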
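In Python 3, urlopen signals an unreachable server by raising URLError, and HTTPError is a subclass of it, so the order of the except clauses matters. The sketch below wraps both cases in a helper that returns None on failure, which makes a later "is None" check meaningful. try_open and unreachable are hypothetical names, and unreachable simulates a DNS failure instead of making a real request.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def try_open(url, opener=urlopen):
    """Return the response, or None after reporting why the request failed."""
    try:
        return opener(url)
    except HTTPError as e:       # must come first: HTTPError subclasses URLError
        print(f"Server answered with an error: {e.code}")
    except URLError as e:
        print(f"Server could not be reached: {e.reason}")
    return None

def unreachable(url):
    """Hypothetical stand-in for urlopen that simulates a DNS failure."""
    raise URLError("Name or service not known")

html = try_open("http://no-such-host.invalid", opener=unreachable)
if html is None:
    print("URL is not found")
```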

When the tag you ask for does not exist:

If the tag you ask for does not exist, BeautifulSoup returns None. However, accessing a child tag on that None object raises an AttributeError.

Solution:

try:
    badContent = bsObj.body.li
except AttributeError as e:
    # bsObj.body itself was None, so .li could not be looked up
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
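The two failure modes can be reproduced on an inline document. This is a hedged sketch: the HTML string is made up, "html.parser" is an assumed parser choice, and find_nested is a hypothetical helper (not part of BeautifulSoup) that stops at the first missing tag instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical page with an h1 but no li tags
soup = BeautifulSoup("<html><body><h1>Title</h1></body></html>", "html.parser")

missing = soup.body.li           # tag is absent: None comes back, no exception
print(missing)

try:
    soup.body.li.a               # attribute access on that None
except AttributeError as e:
    print("AttributeError:", e)

def find_nested(root, *tag_names):
    """Follow tag_names one level at a time; return None as soon as one is missing."""
    node = root
    for name in tag_names:
        if node is None:
            return None
        node = node.find(name)
    return node

print(find_nested(soup, "body", "li", "a"))
print(find_nested(soup, "body", "h1"))
```

A helper like this trades the try/except block for a single None check at the call site.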

Reorganized code:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    """Return the first h1 tag inside body, or None on any failure."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # the server reported an error, or could not be reached at all
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        # the page has no body tag
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Page or title not found")
else:
    print(title)

Result: the script prints the page's h1 tag, or the not-found message if getTitle returned None.