Python Crawler Beginner Tutorial 8-100: Scraping Fengniao Images, Part 3

Date: 2019-11-16


Fengniao images: a few words first

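The previous few tutorials were fairly heavy on content, so today's is a comparatively simple one. We are still scraping Fengniao and still using aiohttp; I hope you like it.
The page to scrape is https://tu.fengniao.com/15/ . As before, this tutorial is for learning purposes only. Why Fengniao? No particular reason, I picked it at random.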

After the usual round of poking around, I found the following link:
https://tu.fengniao.com/ajax/ajaxTuPicList.php?page=2&tagsId=15&action=getPicLists

This link returns data in JSON format.

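page=2 is the page number, so just loop starting from 1
tagsId=15 is the tag ID: 15 is children, 13 is beauties, 6391 is boudoir photos. That's as far as I can help you; this is a professional blog, after all ヾ(◍°∇°◍)ﾉﾞ
action=getPicLists is the endpoint action, the part that never changes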
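The three parameters above can be assembled into request URLs with the standard library. This is only a minimal sketch: `build_url` and `tags_id` are names I made up for illustration, and the default of 15 is just the children tag mentioned above.

```python
from urllib.parse import urlencode

BASE = "https://tu.fengniao.com/ajax/ajaxTuPicList.php"


def build_url(page, tags_id=15):
    """Build the list URL for one page of one tag (hypothetical helper)."""
    # Dicts keep insertion order, so the query string matches the link above.
    params = {"page": page, "tagsId": tags_id, "action": "getPicLists"}
    return BASE + "?" + urlencode(params)


# Pages 1 to 20 of the children tag.
urls = [build_url(page) for page in range(1, 21)]
```

Switching to another tag is then just `build_url(1, tags_id=13)`.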

We have the data source, so let's start crawling.

import aiohttp
import asyncio

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
           "X-Requested-With": "XMLHttpRequest",
           "Accept": "*/*"}

async def get_source(url):
    print("Fetching: {}".format(url))
    conn = aiohttp.TCPConnector(ssl=False)  # skip certificate checks to avoid SSL errors (one way to do it)
    async with aiohttp.ClientSession(connector=conn) as session:  # create the session
        async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=10)) as response:  # send the request
            if response.status == 200:  # check the status code
                source = await response.text()  # await the response body
                print(source)
            else:
                print("Request failed")


async def main():
    url_format = "https://tu.fengniao.com/ajax/ajaxTuPicList.php?page={}&tagsId=15&action=getPicLists"
    full_urllist = [url_format.format(i) for i in range(1, 21)]
    tasks = [get_source(url) for url in full_urllist]
    await asyncio.gather(*tasks)  # wait for every task to finish


if __name__ == "__main__":
    asyncio.run(main())  # create the event loop and run main() to completion

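Running the code above fires all 20 requests at once, which makes it easy for the site to flag you as a crawler and possibly ban your IP or account, so we need to put a cap on concurrency.
Use Semaphore to limit the number of simultaneous requests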

import aiohttp
import asyncio
# headers: same as above
sema = asyncio.Semaphore(3)  # at most 3 requests in flight at a time
async def get_source(url):
    # body: same as above
    ...
# Throttle so the crawler does not fire too many requests at once
async def x_get_source(url):
    async with sema:
        await get_source(url)

async def main():
    url_format = "https://tu.fengniao.com/ajax/ajaxTuPicList.php?page={}&tagsId=15&action=getPicLists"
    full_urllist = [url_format.format(i) for i in range(1, 21)]
    tasks = [x_get_source(url) for url in full_urllist]
    await asyncio.gather(*tasks)  # wait for every task to finish


if __name__ == "__main__":
    asyncio.run(main())  # create the event loop and run main() to completion

Run the code; once the JSON responses are printed, you're good to go!
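If you want to see the throttling effect without touching the site, here is a small offline sketch. It is not the crawler itself: `asyncio.sleep` stands in for the HTTP request, and `worker`, `run_demo` and the `state` dict are names invented for this demo.

```python
import asyncio


async def worker(sema, state):
    # Only as many workers as the semaphore allows can be inside this block.
    async with sema:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        state["active"] -= 1


async def run_demo(limit=3, jobs=20):
    sema = asyncio.Semaphore(limit)  # created inside the running event loop
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(worker(sema, state) for _ in range(jobs)))
    return state["peak"]  # highest number of workers running at once


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The value returned by `run_demo` never exceeds `limit`, which is exactly what we want from the 20 page requests.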

Now add the image download code

import aiohttp
import asyncio

import json
import os

# headers and sema: same as in the listings above
async def get_source(url):
    print("Fetching: {}".format(url))
    os.makedirs("photos", exist_ok=True)  # make sure the output folder exists
    conn = aiohttp.TCPConnector(ssl=False)  # skip certificate checks to avoid SSL errors (one way to do it)
    async with aiohttp.ClientSession(connector=conn) as session:  # create the session
        async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=10)) as response:  # send the request
            if response.status == 200:  # check the status code
                source = await response.text()  # await the response body
                ############################################################
                data = json.loads(source)
                photos = data["photos"]["photo"]
                for p in photos:
                    img = p["src"].split('?')[0]  # strip the query string from the image URL
                    try:
                        async with session.get(img, headers=headers) as img_res:
                            imgcode = await img_res.read()  # raw image bytes
                            with open("photos/{}".format(img.split('/')[-1]), 'wb') as f:
                                f.write(imgcode)  # the with block closes the file
                    except Exception as e:
                        print(e)
                ############################################################
            else:
                print("Request failed")


# Throttle so the crawler does not fire too many requests at once
async def x_get_source(url):
    async with sema:
        await get_source(url)


if __name__ == "__main__":
    ...  # same as in the previous listing
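The URL and file name handling inside the loop can be tried out offline. A minimal sketch follows, assuming the response has the `photos` / `photo` / `src` shape used in the code above; `extract_image_targets` is a made-up helper name and the sample payload is invented for illustration, not real site data.

```python
import json
import os


def extract_image_targets(source, out_dir="photos"):
    """Return (image_url, local_path) pairs for one JSON page (hypothetical helper)."""
    data = json.loads(source)
    targets = []
    for p in data["photos"]["photo"]:
        img = p["src"].split("?")[0]  # drop the query string
        name = img.split("/")[-1]     # last path segment becomes the file name
        targets.append((img, os.path.join(out_dir, name)))
    return targets


# Invented sample payload; only the key layout mirrors the code above.
sample = json.dumps(
    {"photos": {"photo": [{"src": "https://example.com/a/b/pic_001.jpg?imageView2/2/w/400"}]}}
)
targets = extract_image_targets(sample)
```

Keeping this step separate from the network calls makes it easy to check the file names before any image is actually downloaded.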

The images download successfully. Another little crawler finished, sweet!