Scraping web images with Python multiprocessing

Date: 2021-08-14


1. A brief introduction to web crawlers

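When we open a web page and find useful information on it, we normally record that information by hand through a series of manual steps. A crawler instead applies predefined rules and methods to collect the information from the page automatically. In short, it frees up your hands.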

2. Scraping images

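Yes, today's goal is to scrape the images on this site. Most of them are high-resolution wallpapers, including plenty of beautiful girls, so I want to download them and use them as my desktop background.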

2.1 Overview

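As the figure shows, we first fetch a page and parse it to find the tags that hold the images and the tag that holds the page count. We then extract the URLs from those tags and download the images. The overall flow is shown in the figure below: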

This relies on a few Python modules:

bs4 parses the HTML so that we can extract the URLs we need
requests fetches the HTML pages
multiprocessing runs the downloads in multiple processes to make them faster

Only bs4 gets a short introduction below.

3. Using the bs4 module

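From the official documentation:
> Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.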

The first three are used here: Tag, NavigableString, and BeautifulSoup.

In short, it makes parsing HTML much easier.

The examples below use the following sample, stored in a variable named html_doc:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

3.1 Creating a BeautifulSoup object

from bs4 import BeautifulSoup
# Pass in the sample HTML shown above
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

3.1.1 Getting the title

soup = BeautifulSoup(html_doc, 'html.parser')
soup.title

Output (soup.title returns the whole tag; use soup.title.string to get only the text):

<title>The Dormouse's story</title>

3.1.2 Getting a specific tag

soup.p

Output (only the first matching tag is returned):

<p class="title"><b>The Dormouse's story</b></p>

3.1.3 Finding every occurrence of a tag

soup.find_all('a')

Note that find_all looks up every occurrence of a tag. Here it finds all a tags and returns them as a list.

3.1.4 Getting an attribute of a tag

soup.p['class']

Output:

['title']

3.2 Using Tag objects

A Tag object corresponds to a tag in the original XML or HTML document and can be accessed directly by its name. Printing all of a tag's attributes makes this concrete.

Source: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t.attrs)

Output:

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

The tag t has the attributes href, class, and id. Printing each of them in turn:

soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t['href'])
print(t['class'])
print(t['id'])

Output:

http://example.com/elsie
['sister']
link1

The value of class is a list because class is a multi-valued attribute.
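To make the difference between multi-valued and single-valued attributes visible, here is a minimal sketch. The snippet string is a made-up example, not part of the sample document above.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one tag with two CSS classes and an id that contains a space.
snippet = '<p class="body strikeout" id="my id">text</p>'
tag = BeautifulSoup(snippet, 'html.parser').p

# class is multi-valued, so its value is split on whitespace into a list.
print(tag['class'])
# id is single-valued, so its value stays a single string.
print(tag['id'])
```

This is why code that reads class should expect a list, while href and id come back as plain strings.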

Tag objects have two other important attributes, name and string. Their output shows what they do:

soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t.name)
print(t.string)

Output:

a
Elsie

So name is the name of the tag, and string is the string the tag contains.
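One detail worth knowing: string only returns a value when the tag has a single string child. The sketch below uses a made-up snippet rather than the sample document.

```python
from bs4 import BeautifulSoup

# Made-up snippet: <p> contains a string and a nested <b> tag.
snippet = '<p>Hello <b>world</b></p>'
soup = BeautifulSoup(snippet, 'html.parser')

inner = soup.b.string       # <b> has exactly one string child
outer = soup.p.string       # <p> has two children, so .string is None
joined = soup.p.get_text()  # get_text() concatenates all nested strings

print(inner, outer, joined)
```

When a tag may contain nested markup, get_text() is usually the safer choice.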

3.3 Searching the tree

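Searching the tree matters here, because the image scraper in this article gets the content I want by searching for specific tags.
The most commonly used search method is `find_all`. It accepts a string, a regular expression, or a list; each is covered below.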

3.3.1 Finding all tags with a given name

Source:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a'))

This finds all a tags and returns them as a list:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Since the result is a list, we can iterate over it:

soup = BeautifulSoup(html_doc, 'html.parser')
for r in soup.find_all('a'):
    print(r.string)

This prints the string inside each <a></a> tag:

Elsie
Lacie
Tillie

3.3.2 Searching with a regular expression

import re

soup = BeautifulSoup(html_doc, 'html.parser')
for r in soup.find_all(id=re.compile(r'link(\d+)')):
    print(r)

The regular expression specifies what to match: an id made of link followed by digits. The three a tags satisfy it:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
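A regular expression can also be passed as the first argument, in which case it is matched against tag names instead of an attribute. A minimal sketch with a made-up snippet:

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet with several different tag names.
snippet = '<html><body><p>one <b>bold</b></p><a href="#">link</a></body></html>'
soup = BeautifulSoup(snippet, 'html.parser')

# '^b' matches every tag whose name starts with "b".
names = [tag.name for tag in soup.find_all(re.compile('^b'))]
print(names)
```

The matches come back in document order, so the outer body tag is listed before the nested b tag.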

3.3.3 Passing a list to search for several tags at once

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(['a','p']))

Output:

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

That produces a lot of output. Can we add a filter?

3.3.4 Using a filter function

First, modify the sample HTML as follows:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<a href="http://example.com/tillie" class="sister" id="link4">Tillie</a> 
<p class="story">...</p>
"""

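A new a tag with id=link4 has been added to make the following steps easier to check. Next, define a filter function, following the official documentation: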

soup = BeautifulSoup(html_doc, 'html.parser')

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))

Output:

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

The result does not include the a tag with id=link4 that we just added, so the filter works.

3.3.5 Using keyword arguments

Searching by tag name alone is often not precise enough. If you know one of the tag's attributes, you can search by that attribute instead:

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(id='link4'))

Output:

[<a class="sister" href="http://example.com/tillie" id="link4">Tillie</a>]

The result is the a tag we just added.

To find every tag that has an id attribute, whatever its value, use find_all(id=True).

3.3.6 Passing a dictionary of attributes

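Some attributes cannot be searched with keyword arguments, for example data-* attributes, whose names are not valid Python identifiers. In that case, pass a dictionary through the attrs parameter to find tags with such attributes: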

soup.find_all(attrs={"data-foo": "value to search for"})
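As a runnable illustration of the attrs parameter, here is a minimal sketch. The snippet and its data-foo values are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two tags that differ only in their data-foo value.
snippet = '<div data-foo="value">foo!</div><div data-foo="other">bar</div>'
soup = BeautifulSoup(snippet, 'html.parser')

# soup.find_all(data-foo="value") would be a SyntaxError, because
# data-foo is not a valid Python keyword-argument name.
matches = soup.find_all(attrs={"data-foo": "value"})
print(matches)
```

Only the first div matches, because the dictionary compares the attribute value as well as its name.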

3.3.7 Searching by CSS class

Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class with the class_ argument:

soup.find_all("a", class_="sister")

Like other keyword arguments, class_ accepts different kinds of filter: a string, a regular expression, a function, or True:

# Match with a regular expression (requires import re)
soup.find_all(class_=re.compile("itl"))

# Match with a custom filter function
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

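The last call returns the following for the original html_doc (with the modified version from 3.3.4, the link4 tag is listed as well):

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]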
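class_ also behaves in a particular way when a tag carries more than one CSS class. The sketch below uses a made-up snippet to show which searches match.

```python
from bs4 import BeautifulSoup

# Made-up snippet: a single tag with two CSS classes.
soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

# Searching for any one of the classes matches the tag.
by_one = soup.find_all('p', class_='strikeout')
# The exact attribute string matches as well.
exact = soup.find_all('p', class_='body strikeout')
# The same classes in a different order do not match.
reordered = soup.find_all('p', class_='strikeout body')

print(len(by_one), len(exact), len(reordered))
```

For the scraper later on this means class_="lazyload" still matches an img tag even if the site adds further classes to it.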

That covers the basics of bs4, which is enough to start scraping the images we want.

4. Scraping the images step by step

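First visit https://wallhaven.cc/ and search for some images, for example with the keyword sexy girl. The browser's address bar then shows https://wallhaven.cc/search?q=sexy girl&page=2. After a few more searches it becomes clear that the site's search result URLs follow a pattern: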

https://wallhaven.cc/search?q=<keyword>&<parameters>

With that in hand, we can fetch the page directly with requests.
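The URL pattern above can be assembled with the standard library instead of by hand. This is a minimal sketch; build_search_url is a hypothetical helper, not something the article's script uses.

```python
from urllib.parse import quote, urlencode

BASE = "https://wallhaven.cc/search"


def build_search_url(keyword, **params):
    """Return BASE?q=<keyword>&<params>, percent-encoding each value."""
    query = {"q": keyword}
    query.update(params)
    # quote_via=quote encodes a space as %20 rather than the default "+".
    return BASE + "?" + urlencode(query, quote_via=quote)


print(build_search_url("animals", page=2))
print(build_search_url("sexy girl"))
```

Encoding the keyword this way avoids sending a raw space in the URL, which is what the address bar shows but not what should go over the wire.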

4.1 Parsing the site's URLs

I opened the browser's developer tools (F12), looked at the request headers, picked a few of them, and passed them straight to requests:

import requests

def request_client(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
    headers = {
        'user-agent': user_agent,
        'accept-ranges': 'bytes',
        'accept-language': 'zh-CN,zh;q=0.9'
    }
    req = requests.get(url, headers=headers)
    return req

print(request_client("https://wallhaven.cc/search?q=sexy%20girl").text)

This returns HTML. The fragment below is the part that holds an image URL:

<li>
    <figure class="thumb thumb-4y9pv7 thumb-sfw thumb-general" data-wallpaper-id="4y9pv7" style="width:300px;height:200px">
        <img alt="loading" class="lazyload" data-src="https://th.wallhaven.cc/small/4y/4y9pv7.jpg" src="" />
        <a class="preview" href="https://wallhaven.cc/w/4y9pv7" target="_blank">
        </a>
        <div class="thumb-info">
            <span class="wall-res">
                1920 x 1200
            </span>
            <a class="jsAnchor overlay-anchor wall-favs" data-href="https://wallhaven.cc/wallpaper/fav/4y9pv7">
                9
                <i class="fa fa-fw fa-star">
                </i>
            </a>
            <a class="jsAnchor thumb-tags-toggle tagged" data-href="https://wallhaven.cc/wallpaper/tags/4y9pv7" title="Tags">
                <i class="fas fa-fw fa-tags">
                </i>
            </a>
        </div>
    </figure>
</li>

The image URL is in the data-src attribute, and the <img> tag has class=lazyload. We can use these two facts, together with a regular expression, to extract the image URLs:

import os
import re

def get_img_url_list(soup):
    # Extract each thumbnail URL and rewrite it into a downloadable link
    def get_url(tag):
        re_img = re.compile(r'data-src="(.+?\.jpg)"')
        url = re_img.findall(str(tag))[0]
        _, img_name = os.path.split(url)
        replace_content = {
            'th.wallhaven.cc': 'w.wallhaven.cc',
            '/small/': '/full/',
            img_name: 'wallhaven-' + img_name
        }
        for k, v in replace_content.items():
            url = url.replace(k, v)
        return url
    img_url_list = []
    for tag in soup.find_all("img", class_="lazyload"):
        img_url_list.append(get_url(tag))
    return img_url_list

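This step returns a list of image URLs. The code also rewrites each URL, because the URL found in the HTML is not the real image address. Opening an image and inspecting it with the browser's developer tools (F12) shows the difference: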

# Actual download URL
https://w.wallhaven.cc/full/4o/wallhaven-4ozvv9.jpg
# URL found in the HTML
https://th.wallhaven.cc/small/4o/4ozvv9.jpg

So the code makes these replacements: th.wallhaven.cc ---> w.wallhaven.cc, small ---> full, 4ozvv9.jpg ---> wallhaven-4ozvv9.jpg
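The same rewrite can be expressed by splitting the URL into its parts, which avoids replacing text in the wrong place. This is a sketch under the assumption that thumbnails always follow the pattern shown above; to_full_url is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def to_full_url(thumb_url):
    """Rewrite a th.wallhaven.cc thumbnail URL into the full-size image URL."""
    parts = urlsplit(thumb_url)
    # '/small/4o/4ozvv9.jpg' -> ('/small/4o', '4ozvv9.jpg')
    directory, name = posixpath.split(parts.path)
    directory = directory.replace('/small', '/full', 1)
    host = 'w.wallhaven.cc' if parts.netloc == 'th.wallhaven.cc' else parts.netloc
    path = posixpath.join(directory, 'wallhaven-' + name)
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(to_full_url('https://th.wallhaven.cc/small/4o/4ozvv9.jpg'))
```

Because each part is handled separately, the file name is prefixed exactly once even if the same text happens to appear elsewhere in the URL.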

4.2 Getting the page count

This step continues with the HTML we just fetched. The relevant fragment:

<ul class="pagination" data-pagination='{"total":638,"current":1,"url":"https:\/\/wallhaven.cc\/search?q=animals&amp;page=1"}' role="navigation">
    <li>
        <span aria-hidden="true" original-tile="Previous Page">
    <i class="far fa-angle-double-left">
    </i>
    </span>
    </li>
    <li aria-current="page" class="current">
        <span original-title="Page 1">
            1
        </span>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=2" original-title="Page 2">
            2
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=3" original-title="Page 3">
            3
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=4" original-title="Page 4">
            4
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=5" original-title="Page 5">
            5
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=6" original-title="Page 6">
            6
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=7" original-title="Page 7">
           7
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=8" original-title="Page 8">
           8
        </a>
    </li>
    <li aria-disabled="true">
        <span>
            …
        </span>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=637" original-title="Page 637">
            637
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=638" original-title="Page 638">
            638
        </a>
    </li>
    <li>
        <a aria-label="next" class="next" href="https://wallhaven.cc/search?q=animals&amp;page=2" rel="next">
            <i class="far fa-angle-double-right">
            </i>
        </a>
    </li>
</ul>

The contents of the <ul></ul> tag show that the page count is stored in the data-pagination attribute, so we only need that attribute's value:

import json

def get_max_page(soup):
    result = soup.find('ul', class_='pagination')['data-pagination']
    to_json = json.loads(result)
    return to_json['total'] if 'total' in to_json else 1

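The return statement includes a simple check so that a usable value always comes back and the rest of the code can keep running; the exact page count does not affect the result.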
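The check can be made a little more defensive by handling a missing or malformed attribute value as well. The sketch below works on the raw attribute string only; parse_total_pages is a hypothetical helper and the default of 1 is an assumption carried over from the code above.

```python
import json


def parse_total_pages(raw, default=1):
    """Return the 'total' field of a data-pagination value, or default."""
    if not raw:
        return default  # attribute missing or empty
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return default  # not valid JSON
    total = data.get('total', default) if isinstance(data, dict) else default
    return total if isinstance(total, int) and total > 0 else default


# Attribute value as it appears in the HTML fragment, after entity decoding.
raw = r'{"total":638,"current":1,"url":"https:\/\/wallhaven.cc\/search?q=animals&page=1"}'
print(parse_total_pages(raw))
print(parse_total_pages(None))
```

A caller would still need to check that the ul tag exists before reading its attribute, since soup.find returns None when nothing matches.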

4.3 Downloading the images

import os
import platform

def getImg(img_url_list: list, save_path):
    if not os.path.isdir(save_path):
        os.makedirs(save_path)
    # Make sure the save path ends with the platform's path separator
    end_swith = '\\' if platform.system().lower() == 'windows' else '/'

    if not save_path.endswith(end_swith):
        save_path = save_path + end_swith
    # Download each image and save it into the target directory
    for img in img_url_list:
        _, save_name = os.path.split(img)
        whole_save_path = save_path + save_name
        img_content = request_client(img).content
        with open(whole_save_path, 'wb') as fw:
            fw.write(img_content)
        print("ImageUrl: %s download successfully." % img)
    return

Downloading is straightforward: once we have the image URL, the download works as usual.
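The manual separator handling can be left to pathlib. This is a sketch of the path logic only, with placeholder bytes instead of a real download; local_path_for and save_bytes are hypothetical helpers.

```python
import posixpath
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def local_path_for(save_dir, img_url):
    """Return save_dir/<file name taken from the URL path>."""
    name = posixpath.basename(urlsplit(img_url).path)
    return Path(save_dir) / name


def save_bytes(save_dir, img_url, content):
    """Create save_dir if needed and write content to the derived path."""
    target = local_path_for(save_dir, img_url)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target


# Demo with placeholder bytes in a temporary directory (no network access).
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(Path(tmp) / "Wallpapers",
                       "https://w.wallhaven.cc/full/4o/wallhaven-4ozvv9.jpg",
                       b"placeholder")
    print(saved.name)
```

Path joins with the correct separator on every platform, so no check for Windows is needed.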

4.4 Downloading in parallel

To speed up the downloads, this uses multiple processes via multiprocessing. To avoid saturating the machine's CPU, it does not use all of the cores.

def run(base_url, save_path, page=1):
    url = base_url + '&page=%d' % page
    pageHtml = request_client(url).text
    img_url_list = get_img_url_list(BeautifulSoup(pageHtml, 'lxml'))
    getImg(img_url_list, save_path)

if __name__ == '__main__':
    start_time = time.time()
    baseUrl = "https://wallhaven.cc/search?q=sexy%20girls&atleast=2560x1080&sorting=favorites&order=desc"
    save_path = '/data/home/dogfei/Pictures/Wallpapers'
    baseHtml = request_client(baseUrl).text
    pages = get_max_page(BeautifulSoup(baseHtml, 'lxml'))
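    # Use one core fewer than available so the CPU is not saturated,
    # but never fewer than one (Pool rejects processes=0)
    cpu = max(1, cpu_count() - 1)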
    print("Cpu cores: %d" % cpu)
    pages = cpu if pages > cpu else pages
    # Create a process pool
    pool = Pool(processes=cpu)
    for p in range(1, pages + 1):
        pool.apply_async(run, args=(baseUrl, save_path, p,))
    pool.close()
    pool.join()
    end_time = time.time()
    print("Total time: %.2f seconds" % (end_time - start_time))

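This does not download every page of results. A simple check applies: if the total number of pages does not exceed the number of CPU cores used, all pages are downloaded; otherwise only as many pages as there are worker processes are downloaded.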
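One thing to be aware of with apply_async is that an exception raised in a worker stays hidden unless get() is called on the returned result. The sketch below shows how to collect failures. It swaps in multiprocessing.pool.ThreadPool, which has the same interface as Pool but uses threads, and fake_download is a made-up stand-in for run that does no network access.

```python
from multiprocessing.pool import ThreadPool


def fake_download(page):
    """Stand-in for run(base_url, save_path, page); page 3 fails on purpose."""
    if page == 3:
        raise RuntimeError("page %d failed" % page)
    return page


def download_pages(pages, workers=4):
    """Submit one task per page and report which pages succeeded or failed."""
    pool = ThreadPool(processes=workers)
    pending = {p: pool.apply_async(fake_download, args=(p,)) for p in pages}
    pool.close()
    pool.join()
    done, failed = [], {}
    for page, result in pending.items():
        try:
            done.append(result.get())  # re-raises the worker's exception
        except Exception as exc:
            failed[page] = str(exc)
    return done, failed


done, failed = download_pages(range(1, 6))
print(done, failed)
```

Keeping the result objects makes it possible to retry or log the failed pages instead of losing them silently.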

5. Summary

Full source:

import re
import os
import json
import time
import requests
import platform
from bs4 import BeautifulSoup
from bs4 import NavigableString
from multiprocessing import Pool, cpu_count


def request_client(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
    headers = {
        'user-agent': user_agent,
        'accept-ranges': 'bytes',
        'accept-language': 'zh-CN,zh;q=0.9'
    }
    req = requests.get(url, headers=headers)
    return req


def get_max_page(soup):
    result = soup.find('ul', class_='pagination')['data-pagination']
    to_json = json.loads(result)
    return to_json['total'] if 'total' in to_json else 1


def get_img_url_list(soup):
    # Extract each thumbnail URL and rewrite it into a downloadable link
    def get_url(tag):
        re_img = re.compile(r'data-src="(.+?\.jpg)"')
        url = re_img.findall(str(tag))[0]
        _, img_name = os.path.split(url)
        replace_content = {
            'th.wallhaven.cc': 'w.wallhaven.cc',
            '/small/': '/full/',
            img_name: 'wallhaven-' + img_name
        }
        for k, v in replace_content.items():
            url = url.replace(k, v)
        return url
    img_url_list = []
    for tag in soup.find_all("img", class_="lazyload"):
        img_url_list.append(get_url(tag))
    return img_url_list


def getImg(img_url_list: list, save_path):
    if not os.path.isdir(save_path):
        os.makedirs(save_path)

    end_swith = '\\' if platform.system().lower() == 'windows' else '/'

    if not save_path.endswith(end_swith):
        save_path = save_path + end_swith

    for img in img_url_list:
        _, save_name = os.path.split(img)
        whole_save_path = save_path + save_name
        img_content = request_client(img).content
        with open(whole_save_path, 'wb') as fw:
            fw.write(img_content)
        print("ImageUrl: %s download successfully." % img)
    return


def run(base_url, save_path, page=1):
    url = base_url + '&page=%d' % page
    pageHtml = request_client(url).text
    img_url_list = get_img_url_list(BeautifulSoup(pageHtml, 'lxml'))
    getImg(img_url_list, save_path)


if __name__ == '__main__':
    # URL of the search results to download
    baseUrl = "https://wallhaven.cc/search?q=sexy%20girls&atleast=2560x1080&sorting=favorites&order=desc"
    # Directory to save the images in
    save_path = '/data/home/dogfei/Pictures/Wallpapers'
    ######## Nothing below needs to be changed
    start_time = time.time()
    baseHtml = request_client(baseUrl).text
    pages = get_max_page(BeautifulSoup(baseHtml, 'lxml'))
    cpu = max(1, cpu_count() - 1)
    print("Cpu cores: %d" % cpu)
    pages = cpu if pages > cpu else pages
    pool = Pool(processes=cpu)
    for p in range(1, pages + 1):
        pool.apply_async(run, args=(baseUrl, save_path, p,))
    pool.close()
    pool.join()
    end_time = time.time()
    print("Total time: %.2f seconds" % (end_time - start_time))