Scraping Images from Enterdesk (回車桌面)

Date: 2019-12-07



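Today we'll scrape https://tu.enterdesk.com/. The site has plenty of resources worth scraping, but I'll write just one example; the rest can be written by following the same approach.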

First, let's analyze how the site serves its images.

I went to the gallery and picked a tag more or less at random; here I chose Pets (chongwu).

Next, let's see whether the listing is paginated. Open the developer tools with F12.

I'm not used to Firefox, so let's look at it in Chrome instead.

So let's try requesting a page directly. Pick any page number and see whether it returns images:
https://tu.enterdesk.com/chongwu/6.html

And sure enough, it works.

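Now the question is how to find the number of the last page. One option is to keep requesting pages in a loop until one comes back without any image tags, at which point the lookup fails. The other is to read the page count out of the page source, which requires the page to have pagination buttons. I scrolled too quickly a moment ago, so let's scroll more slowly and look for them.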
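The first option can be sketched roughly as follows. This is only an illustration: `count_pages` and the `page_has_images` callback are hypothetical names that do not appear in the original code, and the callback stands in for "request page n and check whether it contains any image tags". The `limit` argument is an added safety cap so the loop cannot run forever.

```python
from typing import Callable


def count_pages(page_has_images: Callable[[int], bool], limit: int = 1000) -> int:
    """Return the number of the last page that still contains images.

    page_has_images(n) is a hypothetical callback that should fetch page n
    and report whether any image tags were found on it.
    """
    page = 0
    # Keep probing the next page until one comes back empty (or we hit the cap).
    while page < limit and page_has_images(page + 1):
        page += 1
    return page


# Simulated site with 7 pages, so no network access is needed here.
print(count_pages(lambda n: n <= 7))  # prints 7
```

Probing costs one request per page, which is why reading the page count from the HTML is usually the cheaper choice when the site exposes it.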

The site does have pagination, which means the page count can be found in the HTML source.

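There are two ways to find it:
Select the element in the F12 developer tools.

Press Ctrl+U to view the page source and search it directly.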

Now that we can find the total number of pages, the next step is to work out the source address of the images.

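Select a target image and check whether its address points to the original. Opening it shows that it does not; it is a thumbnail:
https://up.enterdesk.com/edpic_360_360/4c/3e/c2/4c3ec2be7061121ad5994a9b51241fa3.jpg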

Now click through to the image's own page, where the original is displayed, and inspect the image tag again to get its link.

Copy that link and open it: it is the original image. And doesn't the link look familiar?

Compare the two links:
https://up.enterdesk.com/edpic_360_360/4c/3e/c2/4c3ec2be7061121ad5994a9b51241fa3.jpg

https://up.enterdesk.com/edpic_source/4c/3e/c2/4c3ec2be7061121ad5994a9b51241fa3.jpg

Thumbnail: edpic_360_360
Original: edpic_source

That gives us the overall approach: collect the thumbnail links, rewrite each URL into the original-image link, and then download them in bulk.
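The URL rewrite itself is a single string replacement. Here is a minimal sketch using the two links compared above; the helper name `to_source_url` is made up for illustration, and it assumes every thumbnail URL contains the `edpic_360_360` path segment.

```python
def to_source_url(thumb_url: str) -> str:
    """Rewrite a thumbnail URL into the corresponding original-image URL."""
    # Replace only the first occurrence of the thumbnail directory segment.
    return thumb_url.replace("/edpic_360_360/", "/edpic_source/", 1)


thumb = "https://up.enterdesk.com/edpic_360_360/4c/3e/c2/4c3ec2be7061121ad5994a9b51241fa3.jpg"
print(to_source_url(thumb))
# prints https://up.enterdesk.com/edpic_source/4c/3e/c2/4c3ec2be7061121ad5994a9b51241fa3.jpg
```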

Time to write some code!

First we declare a class, class Spider():, and use def __init__ to define its constructor.

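import re  # regular expressions

import requests
from bs4 import BeautifulSoup  # HTML parsing

all_urls = []  # URL of every listing page, built from the URL pattern

class Spider():
    # Constructor: stores the URL pattern and the request headers
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # Build the URLs of all the listing pages we want to scrape
    def getUrls(self):
        # Fetch page 1 to find the number of the last page
        response = requests.get(self.target_url % 1, headers=self.headers).text
        html = BeautifulSoup(response, 'html.parser')
        res = html.find(class_='wrap no_a').attrs['href']  # link of the "last page" tag
        page_num = int(re.findall(r'(\d+)', res)[0])  # extract the page number with a regex
        global all_urls
        # Build the URL of each page in turn
        for i in range(1, page_num + 1):
            url = self.target_url % i
            all_urls.append(url)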

The figure below shows how the last-page link is located:
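As a rough illustration of the final step, here is how a page number could be pulled out of such a link with the standard `re` module. The helper name `page_number_from_href` and the sample value 150 are assumptions for the example; the real href must be read from the page. The pattern is anchored to the `.html` suffix so that digits elsewhere in the URL cannot be matched by accident.

```python
import re


def page_number_from_href(href: str) -> int:
    """Extract the page number from a link that ends in '<number>.html'."""
    match = re.search(r"(\d+)\.html$", href)
    if match is None:
        raise ValueError(f"no page number found in {href!r}")
    return int(match.group(1))


# Hypothetical last-page link; the page number here is made up.
print(page_number_from_href("https://tu.enterdesk.com/chongwu/150.html"))  # prints 150
```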

We'll scrape with multiple threads, so import the following modules:

from bs4 import BeautifulSoup  # HTML parsing
import threading  # multithreading
import re  # regular expressions
import time  # timestamps

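We add another global variable. Because several threads operate on it, we need a thread lock so that concurrent writes do not corrupt the shared data.

all_img_urls = []  # links of all the images
g_lock = threading.Lock()  # create the lock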

Declare a Producer class that extracts the image links and appends them to the global variable all_img_urls.

class Producer(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
        }
        global all_urls
        while True:
            g_lock.acquire()  # lock before touching all_urls
            if len(all_urls) == 0:  # no pages left: unlock and stop
                g_lock.release()
                break
            page_url = all_urls.pop(0)  # pop removes the first element and returns it
            g_lock.release()  # release promptly so other threads can continue
            try:
                print("Parsing " + page_url)
                response = requests.get(page_url, headers=headers, timeout=3).text
                html = BeautifulSoup(response, 'html.parser')
                pic_link = html.find_all(class_='egeli_pic_li')[:-1]
                # Build the links first so a parsing error cannot leave the lock held
                links = [i.find('img')['src'].replace('edpic_360_360', 'edpic_source') for i in pic_link]
                global all_img_urls
                g_lock.acquire()  # a second lock, this time for all_img_urls
                all_img_urls.extend(links)
                g_lock.release()  # release the lock
                # time.sleep(0.1)
            except Exception as e:
                print("Failed to parse " + page_url + ": " + str(e))

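About the thread lock: in the code above, while one thread runs all_urls.pop(0) we do not want other threads operating on the list at the same time, or unexpected things can happen. So we lock the resource with g_lock.acquire() and, once we are done, make sure to release it right away with g_lock.release(). Otherwise the resource stays locked and the program cannot proceed.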
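An alternative to pairing acquire() and release() by hand is the lock's context-manager form, `with lock:`, which releases the lock automatically even if an exception is raised inside the block. The sketch below is not the code from this post: `pending`, `done`, and `worker` are illustrative stand-ins, and doubling a number replaces the network request.

```python
import threading

pending = list(range(100))  # stand-in for all_urls
done = []                   # stand-in for all_img_urls
lock = threading.Lock()


def worker():
    while True:
        with lock:  # released automatically when the block exits
            if not pending:
                return  # nothing left to do
            item = pending.pop(0)
        result = item * 2  # the slow work happens outside the lock
        with lock:
            done.append(result)


threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(done))  # prints 100
```

Checking for emptiness and popping inside the same locked block matters: if the check happens outside the lock, another thread can take the last item in between and the pop then fails on an empty list.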

if __name__ == "__main__":

    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

    target_url = 'https://tu.enterdesk.com/chongwu/%d.html'  # URL pattern of the listing pages

    print('Collecting all listing-page links...')
    spider = Spider(target_url, headers)
    spider.getUrls()
    print('Listing pages collected, now extracting image links...')

    threads = []
    for x in range(10):
        gain_link = Producer()
        gain_link.start()
        threads.append(gain_link)

    # join() blocks the main thread until each worker thread has finished
    for tt in threads:
        tt.join()

Next, define a DownPic class for downloading the images.

class DownPic(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
        }
        while True:
            global all_img_urls
            # Acquire the lock
            g_lock.acquire()
            if len(all_img_urls) == 0:  # no images left
                # Whatever happens, the lock must be released
                g_lock.release()
                break
            else:
                t = time.time()
                down_time = str(round(t * 1000))  # millisecond timestamp
                pic_name = 'D:\\test\\' + down_time + '.jpg'
                pic = all_img_urls.pop(0)
                g_lock.release()
                response = requests.get(pic, headers=headers)
                # The with statement closes the file automatically
                with open(pic_name, 'wb') as f:
                    f.write(response.content)
                print(pic_name + '   downloaded')

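As you can see, down_time = str(round(t * 1000)) generates a millisecond timestamp that is used to name each image. You could also name each file after the image's own name; I'll leave writing that to you.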
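One possible way to name files after the image itself is to take the last path segment of its URL. This is only a sketch: `filename_from_url` is a made-up helper, and it assumes the URL path ends in a usable file name. It also avoids a weakness of timestamp names, namely that two threads finishing within the same millisecond would write to the same file.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, e.g. 'abc.jpg'."""
    # urlparse drops the scheme, host, and query; URL paths always use '/'.
    return posixpath.basename(urlparse(url).path)


url = "https://up.enterdesk.com/edpic_source/4c/3e/c2/4c3ec2be7061121ad5994a9b51241fa3.jpg"
print(filename_from_url(url))  # prints 4c3ec2be7061121ad5994a9b51241fa3.jpg
```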

Then add the following code under if __name__ == "__main__": to start the multithreaded download.

    print('Image links extracted, starting multithreaded download...')
    for x in range(20):
        download = DownPic()
        download.start()

That completes the whole workflow. Now run the code.

Tip: to run this code you need to create a test folder on drive D, or modify the code yourself to save the files somewhere else.
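Instead of creating the folder by hand, the script could create it at startup. The sketch below uses a temporary directory so it runs anywhere; `ensure_dir` and `save_dir` are illustrative names, and in the real script the path would be the download folder, D:\test in this post.

```python
import os
import tempfile


def ensure_dir(path: str) -> str:
    """Create the directory (and any missing parents) if it does not exist."""
    os.makedirs(path, exist_ok=True)  # no error if it is already there
    return path


# A throwaway location for the demo; substitute your real download folder.
base = tempfile.mkdtemp()
save_dir = ensure_dir(os.path.join(base, "test"))
pic_name = os.path.join(save_dir, "example.jpg")  # path a download would use
print(os.path.isdir(save_dir))  # prints True
```

Building paths with os.path.join also removes the need for escaped backslashes in the file name.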

The complete code:

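import requests
from bs4 import BeautifulSoup  # HTML parsing
import threading  # multithreading
import re  # regular expressions
import time  # timestamps


all_urls = []  # URL of every listing page, built from the URL pattern
all_img_urls = []  # links of all the images
g_lock = threading.Lock()  # create the lock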

class Spider():
    # Constructor: stores the URL pattern and the request headers
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # Build the URLs of all the listing pages we want to scrape
    def getUrls(self):
        # Fetch page 1 to find the number of the last page
        response = requests.get(self.target_url % 1, headers=self.headers).text
        html = BeautifulSoup(response, 'html.parser')
        res = html.find(class_='wrap no_a').attrs['href']  # link of the "last page" tag
        page_num = int(re.findall(r'(\d+)', res)[0])  # extract the page number with a regex
        global all_urls
        # Build the URL of each page in turn
        for i in range(1, page_num + 1):
            url = self.target_url % i
            all_urls.append(url)


# Extracts the image links
class Producer(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
        }
        global all_urls
        while True:
            g_lock.acquire()  # lock before touching all_urls
            if len(all_urls) == 0:  # no pages left: unlock and stop
                g_lock.release()
                break
            page_url = all_urls.pop(0)  # pop removes the first element and returns it
            g_lock.release()  # release promptly so other threads can continue
            try:
                print("Parsing " + page_url)
                response = requests.get(page_url, headers=headers, timeout=3).text
                html = BeautifulSoup(response, 'html.parser')
                pic_link = html.find_all(class_='egeli_pic_li')[:-1]
                # Build the links first so a parsing error cannot leave the lock held
                links = [i.find('img')['src'].replace('edpic_360_360', 'edpic_source') for i in pic_link]
                global all_img_urls
                g_lock.acquire()  # a second lock, this time for all_img_urls
                all_img_urls.extend(links)
                g_lock.release()  # release the lock
                # time.sleep(0.1)
            except Exception as e:
                print("Failed to parse " + page_url + ": " + str(e))


class DownPic(threading.Thread):

    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
        }
        while True:
            global all_img_urls
            # Acquire the lock
            g_lock.acquire()
            if len(all_img_urls) == 0:  # no images left
                # Whatever happens, the lock must be released
                g_lock.release()
                break
            else:
                t = time.time()
                down_time = str(round(t * 1000))  # millisecond timestamp
                pic_name = 'D:\\test\\' + down_time + '.jpg'
                pic = all_img_urls.pop(0)
                g_lock.release()
                response = requests.get(pic, headers=headers)
                # The with statement closes the file automatically
                with open(pic_name, 'wb') as f:
                    f.write(response.content)
                print(pic_name + '   downloaded')


if __name__ == "__main__":

    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

    target_url = 'https://tu.enterdesk.com/chongwu/%d.html'  # URL pattern of the listing pages

    print('Collecting all listing-page links...')
    spider = Spider(target_url, headers)
    spider.getUrls()
    print('Listing pages collected, now extracting image links...')

    threads = []
    for x in range(10):
        gain_link = Producer()
        gain_link.start()
        threads.append(gain_link)

    # join() blocks the main thread until each worker thread has finished
    for tt in threads:
        tt.join()

    print('Image links extracted, starting multithreaded download...')
    for x in range(20):
        download = DownPic()
        download.start()
