Python Web Crawler 009 (Programming): Using regular expressions to get all the URL links in a web page and download the source of those links

Date: 2019-11-21



System used: Windows 10, 64-bit
Python version: Python 2.7.10
IDE used for Python programming: PyCharm 2016 04
urllib version used: urllib2

Note: I am using Python 2 here, not Python 3.

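I. Preface

Over the previous two sections (a crawler that fetches a single web page, and fixing garbled text when displaying the fetched page), we arrived at the final download() function.
Two sections ago, we crawled every page of the target site by parsing the URLs in its sitemap. The previous section introduced a way to crawl all the pages linked from a web page. In this section, we use regular expressions to get all the URL links in a web page and download the source of those links.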

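II. Introduction

So far, we have implemented two simple crawlers that take advantage of the target site's structure. Whenever these two techniques are available, they should be used, because they minimize the number of pages that need to be downloaded. For some websites, however, we need the crawler to behave more like an ordinary user: follow links and visit the content of interest.

By following every link, we can easily download the pages of an entire website. However, this approach downloads a large number of pages we do not need. For example, if we want to scrape user account detail pages from an online forum, we only need to download the account pages, not the discussion thread pages. The link crawler in this post uses a regular expression to decide which pages to download.

III. Initial code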

import re

def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        # filter for links matching our regular expression
        for link in get_links(html):
            if re.match(link_regex, link):
                crawl_queue.append(link)

def get_links(html):
    """Return a list of links from html """
    # a regular expression to extract all links from the webpage 
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)

IV. Explaining the initial code

1 .

def link_crawler(seed_url, link_regex):

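This is the function we call from outside. What it does: it first downloads the source of the seed_url page and extracts all the link URLs in it. Each extracted link is then matched against link_regex. Because re.match() is used, the link must match link_regex from its beginning, not merely contain it somewhere. A link that matches is put into the queue, and the next pass of while crawl_queue: performs the same steps on that link. This repeats until crawl_queue is empty, at which point the function returns.

2 .

get_links(html): returns all the link URLs found in the html page.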
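As a small illustration of how re.match() treats the links, here is a sketch using the same '/(index|view)' pattern that is passed to link_crawler() later in this post. The variable names are made up for the example. It shows that a relative link passes the check, while an absolute URL containing the same path does not, because re.match() only matches at the start of the string.

```python
import re

link_regex = '/(index|view)'

# re.match() only succeeds when the pattern matches at the START of the string
relative_match = re.match(link_regex, '/index/1')
absolute_match = re.match(link_regex, 'http://example.webscraping.com/index/1')

# re.search() scans the whole string, so it finds the pattern anywhere
absolute_search = re.search(link_regex, 'http://example.webscraping.com/index/1')

print(relative_match is not None)   # True
print(absolute_match is not None)   # False
print(absolute_search is not None)  # True
```

This is why the crawler tests the link as it appears in the page, before any conversion to an absolute URL.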

3 .

webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

This builds a matching template and stores it in the webpage_regex object. It matches strings of the form <a href="xxx"> and extracts the xxx part, which is the URL.

4 .

return webpage_regex.findall(html)

This applies the webpage_regex template to the html page source, finds every string of the form <a href="xxx">, and extracts the xxx part of each one.

For details on regular expressions, see:
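To make the behaviour of get_links() concrete, here is a minimal sketch that runs it on a small hand-written HTML string. The sample_html content is invented for illustration; the function body is the one from this post.

```python
import re

def get_links(html):
    """Return a list of links from html"""
    # same pattern as in the post: capture the value of href in <a ...> tags
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

# Hypothetical page source, mixing quote styles and letter case
sample_html = (
    '<html><body>'
    '<a href="/index/1">Next</a>'
    "<A class='c' HREF='/view/Zambia-251'>Zambia</A>"
    '<p>no link here</p>'
    '</body></html>'
)

links = get_links(sample_html)
print(links)  # ['/index/1', '/view/Zambia-251']
```

Because of re.IGNORECASE, the upper-case <A HREF=...> tag is picked up as well, and both single and double quotes around the URL are accepted.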
http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

V. Running

First start the Python interactive shell by running the command below in PyCharm's Terminal window or in a Windows command prompt (DOS window):

C:\Python27\python.exe -i 1-4-4-regular_expression.py

Call the link_crawler() function:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  /index/1
Traceback (most recent call last):
  File "1-4-4-regular_expression.py", line 50, in <module>
    link_crawler('http://example.webscraping.com', '/(index|view)')
  File "1-4-4-regular_expression.py", line 36, in link_crawler
    html = download(url)
  File "1-4-4-regular_expression.py", line 13, in download
    html = urllib2.urlopen(request).read()
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 423, in open
    protocol = req.get_type()
  File "C:\Python27\lib\urllib2.py", line 285, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /index/1

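An error occurred at run time, while downloading the URL /index/1. This /index/1 is a relative link on the target site: it is only the path part of a complete URL, without the protocol and server parts. Our download() function cannot download it. In a web browser, relative links work normally, but when a page is downloaded with urllib2 there is no context to resolve them against, so the download fails.

VII. Improving the code

So that urllib2 can locate the page, we need to convert the relative link into an absolute link; that solves the problem.
Python has a module that can do this: urlparse.

The link_crawler() function is improved as follows: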
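The difference between the two kinds of link can be seen by splitting them into their components. The sketch below is an illustration only; it imports urlparse in a way that works on both Python 2 (used in this post) and Python 3, which is an addition for portability.

```python
from __future__ import print_function

try:
    from urlparse import urlparse        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlparse    # Python 3 fallback (assumption)

absolute = urlparse('http://example.webscraping.com/index/1')
relative = urlparse('/index/1')

# The absolute URL carries a protocol (scheme) and a server (netloc)
print(absolute.scheme, absolute.netloc, absolute.path)

# The relative link has only a path; scheme and netloc are empty strings
print(repr(relative.scheme), repr(relative.netloc), relative.path)
```

With an empty scheme, urllib2 has no way to tell which protocol or server to use, which matches the "unknown url type" error in the traceback.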
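As a quick sketch of what urljoin() from that module does, the example below joins a few links against a base URL. The try/except import is an addition so that the snippet also runs on Python 3, where the function lives in urllib.parse; the variable names are made up for the example.

```python
try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3 fallback (assumption)

seed_url = 'http://example.webscraping.com'

# A root-relative link gets the protocol and server of the base URL
joined_relative = urljoin(seed_url, '/index/1')

# A link that is already absolute is returned unchanged
joined_absolute = urljoin(seed_url, 'http://example.webscraping.com/view/Zambia-251')

# A link without a leading slash replaces the last path segment of the base
joined_sibling = urljoin('http://example.webscraping.com/index/1', '2')

print(joined_relative)  # http://example.webscraping.com/index/1
print(joined_absolute)  # http://example.webscraping.com/view/Zambia-251
print(joined_sibling)   # http://example.webscraping.com/index/2
```

The third case is worth noting: a link written without a leading slash is resolved against the page it appears on, so the choice of base URL matters for such links.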

import urlparse
def link_crawler(seed_url, link_regex):
    """Crawl from the given seed URL following links matched by link_regex """
    crawl_queue = [seed_url]
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urlparse.urljoin(seed_url, link)
                crawl_queue.append(link)

VIII. Running:

Run the program:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  http://example.webscraping.com/index/1
Downloading:  http://example.webscraping.com/index/2
Downloading:  http://example.webscraping.com/index/3
Downloading:  http://example.webscraping.com/index/4
Downloading:  http://example.webscraping.com/index/5
Downloading:  http://example.webscraping.com/index/6
Downloading:  http://example.webscraping.com/index/7
Downloading:  http://example.webscraping.com/index/8
Downloading:  http://example.webscraping.com/index/9
Downloading:  http://example.webscraping.com/index/10
Downloading:  http://example.webscraping.com/index/11
Downloading:  http://example.webscraping.com/index/12
Downloading:  http://example.webscraping.com/index/13
Downloading:  http://example.webscraping.com/index/14
Downloading:  http://example.webscraping.com/index/15
Downloading:  http://example.webscraping.com/index/16
Downloading:  http://example.webscraping.com/index/17
Downloading:  http://example.webscraping.com/index/18
Downloading:  http://example.webscraping.com/index/19
Downloading:  http://example.webscraping.com/index/20
Downloading:  http://example.webscraping.com/index/21
Downloading:  http://example.webscraping.com/index/22
Downloading:  http://example.webscraping.com/index/23
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/index/24

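From this output you can see that, although pages now download without errors, the same pages are downloaded over and over. Why? Because these link URLs link to each other. If two pages each contain a link to the other, this program will loop between them forever.

So we need to improve the program further and avoid crawling the same link twice. To do that, we record which links have already been crawled, and skip any link that has been crawled before.

IX. Further improving the `link_crawler()` function: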

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    # keep track which URL's have seen before
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            # check if link matches expected regex
            if re.match(link_regex, link):
                # form absolute link
                link = urlparse.urljoin(seed_url, link)
                # check if have already seen this link
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
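To check the effect of the seen set without touching the network, the sketch below runs the same loop against a tiny in-memory site. FAKE_SITE, fake_download(), the extra download parameter, and the returned visited list are all hypothetical additions for testing and are not part of the original function.

```python
import re

try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3 fallback (assumption)

# Hypothetical three-page site whose index pages link to each other
FAKE_SITE = {
    'http://example.webscraping.com':
        '<a href="/index/1">1</a>',
    'http://example.webscraping.com/index/1':
        '<a href="/index/2">2</a><a href="/about">about</a>',
    'http://example.webscraping.com/index/2':
        '<a href="/index/1">1</a>',
}

def fake_download(url):
    """Stand-in for download(): return canned HTML instead of using the network."""
    return FAKE_SITE.get(url, '')

def get_links(html):
    """Return a list of links from html"""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(seed_url, link_regex, download=fake_download):
    crawl_queue = [seed_url]
    seen = set(crawl_queue)   # every URL that has ever been queued
    visited = []              # download order, returned for inspection
    while crawl_queue:
        url = crawl_queue.pop()
        visited.append(url)
        html = download(url)
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)
    return visited

visited = link_crawler('http://example.webscraping.com', '/(index|view)')
print(visited)
```

Even though /index/1 and /index/2 link to each other, each page is visited once and the loop ends. The /about link is never queued because it does not match the pattern.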

X. Running:

>>> link_crawler('http://example.webscraping.com', '/(index|view)')

Output:

Downloading:  http://example.webscraping.com
Downloading:  http://example.webscraping.com/index/1
Downloading:  http://example.webscraping.com/index/2
Downloading:  http://example.webscraping.com/index/3
Downloading:  http://example.webscraping.com/index/4
Downloading:  http://example.webscraping.com/index/5
Downloading:  http://example.webscraping.com/index/6
Downloading:  http://example.webscraping.com/index/7
Downloading:  http://example.webscraping.com/index/8
Downloading:  http://example.webscraping.com/index/9
Downloading:  http://example.webscraping.com/index/10
Downloading:  http://example.webscraping.com/index/11
Downloading:  http://example.webscraping.com/index/12
Downloading:  http://example.webscraping.com/index/13
Downloading:  http://example.webscraping.com/index/14
Downloading:  http://example.webscraping.com/index/15
Downloading:  http://example.webscraping.com/index/16
Downloading:  http://example.webscraping.com/index/17
Downloading:  http://example.webscraping.com/index/18
Downloading:  http://example.webscraping.com/index/19
Downloading:  http://example.webscraping.com/index/20
Downloading:  http://example.webscraping.com/index/21
Downloading:  http://example.webscraping.com/index/22
Downloading:  http://example.webscraping.com/index/23
Downloading:  http://example.webscraping.com/index/24
Downloading:  http://example.webscraping.com/index/25
Downloading:  http://example.webscraping.com/view/Zimbabwe-252
Downloading:  http://example.webscraping.com/view/Zambia-251
Downloading:  http://example.webscraping.com/view/Yemen-250
Downloading:  http://example.webscraping.com/view/Western-Sahara-249

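The program now works as intended: it crawls all the locations and stops when expected. We finally have a usable crawler.

Summary:
We have now covered three ways of crawling the source of all the link URLs in a site or a web page. These are only preliminary programs. Going forward, we may also run into the following problems:
1 . Some websites mark certain URLs as off limits to crawlers. To respect the site's rules, we have to design the crawler to follow its robots.txt file.
2 . Google is not reachable from within mainland China, so if we want to reach it through a proxy, we need to configure a proxy for our crawler.
3 . If our crawler hits a website too fast, it may get blocked by the target site's server, so we need to limit the download speed.
4 . Some pages contain something like a calendar in which every date is a URL link, and we would not want to crawl such meaningless content. Dates go on without end, so for our crawler this is a crawler trap, and we need to avoid falling into it.

We need to solve these four problems before we get the final version of the crawler.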
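For item 1, one possible approach is the standard library's robots.txt parser. The sketch below is only an illustration under stated assumptions: the robots.txt rules and the user agent names BadCrawler and GoodCrawler are invented, and the rules are parsed from an in-memory list instead of being fetched from a real site.

```python
try:
    import robotparser                   # Python 2, as used in this post
except ImportError:
    from urllib import robotparser       # Python 3 fallback (assumption)

# Hypothetical robots.txt content, one list item per line
ROBOTS_LINES = [
    'User-agent: BadCrawler',
    'Disallow: /',
    '',
    'User-agent: *',
    'Disallow: /trap',
]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_LINES)   # parse rules directly, no network access

url = 'http://example.webscraping.com/index/1'
bad_allowed = rp.can_fetch('BadCrawler', url)
good_allowed = rp.can_fetch('GoodCrawler', url)
trap_allowed = rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/trap/1')

print(bad_allowed)   # False
print(good_allowed)  # True
print(trap_allowed)  # False
```

In a crawler, a check like can_fetch() would sit just before the call to download(), so that disallowed URLs are skipped.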
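For the crawler trap in item 4, one common idea is to record how many links were followed to reach each page and to stop following links beyond a limit. The sketch below is an assumption about how that could look, not the final version from this series: calendar_download(), max_depth, and the returned visited list are hypothetical, and the endless calendar is simulated in memory.

```python
import re

try:
    from urlparse import urljoin        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urljoin    # Python 3 fallback (assumption)

def calendar_download(url):
    """Hypothetical endless calendar: every page links to the next day."""
    found = re.search(r'/day/(\d+)$', url)
    day = int(found.group(1)) if found else 0
    return '<a href="/day/%d">next</a>' % (day + 1)

def get_links(html):
    """Return a list of links from html"""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(seed_url, link_regex, max_depth=3, download=calendar_download):
    crawl_queue = [seed_url]
    seen = {seed_url: 0}      # URL -> number of links followed to reach it
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        visited.append(url)
        html = download(url)
        depth = seen[url]
        if depth >= max_depth:
            continue          # page is downloaded, but its links are not followed
        for link in get_links(html):
            if re.match(link_regex, link):
                link = urljoin(seed_url, link)
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
    return visited

visited = link_crawler('http://example.webscraping.com', '/day/')
print(len(visited))  # 4
```

Without the depth check this simulated calendar would never run out of new links, since every page produces a URL that has not been seen before. With max_depth=3 the crawl stops after the seed page and three calendar pages.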