Scraping Toutiao Street-Snap Photos by Analyzing Ajax Requests

Date: 2019-11-21

Note: this article is a set of notes I took after watching the Jingmi (靜覓) tutorial videos. The original tutorial is at https://cuiqingcai.com/html

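Workflow overview

1. Fetch the index page: use requests to request the target site and retrieve the content of the index page.

2. Fetch the detail pages: parse the index response to get the links to the detail pages, then fetch the content of each detail page.

3. Download the images and save to the database: download the images to local disk, and save the page information and image URLs to MongoDB.

4. Loop and parallelize: iterate over multiple index pages and run several workers in parallel to speed up the crawl.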

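Implementation details

1. First visit the Toutiao site and enter a keyword to reach the index page. We need to analyze the site to obtain the URLs that lead to the detail pages.
Each time you scroll the mouse wheel, new title links appear, which shows that the page loads them through Ajax requests in the background. The Ajax URL can be found under the XHR filter of the browser's Network tab, and its response is JSON. Taking the search term 街拍 (street snaps) as an example, the URL is: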

https://www.toutiao.com/search_content/?offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=1&from=search_tab

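2. Clicking through to Query String Parameters shows dictionary-like data. Later we will encode these parameters with urlencode and append them to the base URL to form the final request address.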

offset: 0
format: json
keyword: 街拍
autoload: true
count: 20
cur_tab: 1
from: search_tab
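The parameter list above can be turned into a request URL with urlencode. The following is a minimal sketch, not the exact code of the crawler: the helper name build_index_url and the constant BASE_URL are introduced here for illustration, and whether the endpoint still answers requests today is not something this post can confirm.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address taken from the Ajax request captured in the Network tab.
BASE_URL = "https://www.toutiao.com/search_content/"


def build_index_url(offset, keyword="街拍", count=20):
    """Build the index URL for one offset (hypothetical helper name)."""
    params = {
        "offset": offset,
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": count,
        "cur_tab": 1,
        "from": "search_tab",
    }
    # urlencode percent-encodes the UTF-8 bytes of non-ASCII values.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_index_url(0)
    print(url)
    # Decoding the query string gives the original keyword back.
    print(parse_qs(urlsplit(url).query)["keyword"])
```

Keeping the parameters in a dict means only the offset value has to change from one request to the next.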

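3. As you scroll down to show more image entries, many new Ajax requests appear. This tells us that changing the offset parameter returns different index pages, and therefore different gallery detail-page URLs. Start by crawling a single offset; at the end, use a process pool (Pool) to crawl the URLs for the different offsets in multiple processes and speed up the crawl.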
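The offset idea can be sketched without touching the network. In the sketch below, build_offsets, crawl_offset and crawl_all are illustrative names, and crawl_offset is only a placeholder standing in for the real fetch-and-parse work; the page size of 20 is assumed from count=20 in the query string.

```python
from multiprocessing import Pool

PAGE_SIZE = 20  # assumed to match count=20 in the query string


def build_offsets(group_start, group_end, page_size=PAGE_SIZE):
    """Return one offset per index page, e.g. 0, 20, 40, ..."""
    return [group * page_size for group in range(group_start, group_end + 1)]


def crawl_offset(offset):
    """Placeholder worker: a real crawler would fetch and parse here."""
    return "index page at offset {}".format(offset)


def crawl_all(offsets, processes=4):
    """Distribute the offsets over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # map blocks until every offset has been processed.
        return pool.map(crawl_offset, offsets)


if __name__ == "__main__":
    for line in crawl_all(build_offsets(0, 2)):
        print(line)
```

The worker is a top-level function because multiprocessing has to pickle it to send it to the child processes, and the pool is started under the __main__ guard so that child processes do not start pools of their own when they import the module.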

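4. Next, analyze the source of a gallery detail page to find the image URLs. They are buried fairly deep inside JavaScript code, so BeautifulSoup and PyQuery cannot parse them; a regular expression is the only option, and the pattern has to be written exactly right. My fundamentals are weak, so after refreshing the page it took me a long time to find the data: I switched to Firefox, then back to Chrome, and finally found the link below under the Doc filter of the Network tab. The image URLs are hidden in gallery: JSON.parse.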

https://www.toutiao.com/a6585311263927042573/
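To make the gallery: JSON.parse extraction concrete, here is a small sketch that runs on a made-up fragment. SAMPLE_HTML and the example.com URLs are hypothetical and only imitate the escaping seen on the real page. Instead of stripping backslashes, this sketch decodes the captured text as a JavaScript string literal first and as JSON second, which is an alternative technique rather than the one used in the full listing.

```python
import json
import re

# Hypothetical page fragment; the real page embeds a similar escaped string.
SAMPLE_HTML = r'''
<script>
var BASE_DATA = {
    gallery: JSON.parse("{\"count\":2,\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/a.jpg\"},{\"url\":\"http:\\/\\/example.com\\/b.jpg\"}]}"),
    siblingList: []
};
</script>
'''

GALLERY_PATTERN = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)


def extract_image_urls(html):
    """Return the image URLs found in the gallery JSON, or an empty list."""
    match = GALLERY_PATTERN.search(html)
    if match is None:
        return []
    # Step 1: undo the string-literal escaping (\" becomes ", \\ becomes \).
    gallery_json = json.loads('"' + match.group(1) + '"')
    # Step 2: decode the JSON document itself (\/ becomes /).
    gallery = json.loads(gallery_json)
    return [item["url"] for item in gallery.get("sub_images", []) if item.get("url")]


if __name__ == "__main__":
    for image_url in extract_image_urls(SAMPLE_HTML):
        print(image_url)
```

Decoding in two steps keeps any backslash that is part of the data, whereas removing every backslash would also remove the ones inside escapes such as \u4e2d.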

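5. Code
Here is the code directly; the necessary comments are in the code itself. First create a config.py file that defines the variables used by the code; the crawler script follows it.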

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'

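# Each group is one index page; its offset is group * 20.
# With GROUP_START = 1 the first page (offset 0) is not crawled.
GROUP_START = 1
GROUP_END = 20
KEYWORD = '街拍'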

#!/usr/bin/env python
# coding=utf-8

from urllib.parse import urlencode
from requests.exceptions import ConnectionError
from bs4 import BeautifulSoup
from json.decoder import JSONDecodeError
from hashlib import md5
from config import *
from multiprocessing import Pool
import requests
import json
import re
import os
import pymongo

client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]

def get_page_index(url, headers):
    """
        Purpose: return the source of the index page
        url: request URL
        headers: request headers
    """
    try:
        response = requests.get(url, headers=headers)
        # Only return the body when the request succeeded
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        print('Error occurred')
        return None

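def parse_page_index(html):
    """
        Purpose: yield the article URL of each index entry
        html: body of the index response (JSON text)
    """
    # A failed request returns None, so there is nothing to parse
    if not html:
        return
    try:
        # Decode the JSON text into Python objects
        data = json.loads(html)

        # Make sure data is non-empty and contains the 'data' key
        if data and 'data' in data.keys():
            for item in data.get('data') or []:
                if item.get('article_url'):
                    yield item.get('article_url')
    except JSONDecodeError:
        pass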

def get_page_detail(url, headers):
    """
        Purpose: return the source of an article page
        url: article URL
        headers: request headers
    """
    try:
        response = requests.get(url, headers=headers)
        # Only return the body when the request succeeded
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        print('Error occurred')
        return None

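def parse_page_detail(html, url):
    """
        Purpose: extract every image URL from an article page
        html: source of the article page
        url: article URL
    """
    # A failed request returns None, so there is nothing to parse
    if not html:
        return None
    # Use BeautifulSoup to get the text of <title>
    soup = BeautifulSoup(html, 'lxml')
    title = soup.title.text if soup.title else ''
    # Use a regex to find the gallery JSON that holds the image URLs
    images_pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
    result = images_pattern.search(html)
    if result:
        # Strip the JavaScript string escaping before decoding the JSON
        data = json.loads(result.group(1).replace('\\', ''))
        # Read the value of the sub_images key
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            # Collect every image URL with a list comprehension
            images = [item.get('url') for item in sub_images]
            for image in images:
                # Download the image
                download_image(image)
            # Return only after every image has been downloaded;
            # the caller saves this dict to MongoDB
            return {
                'title': title,
                'url': url,
                'images': images
            }
    return None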

def download_image(url):
    """
        Purpose: download one image and save it to disk
        url: image URL
    """
    print('Downloading', url)
    try:
        response = requests.get(url)
        # Only save the image when the request succeeded
        if response.status_code == 200:
            save_image(response.content)
        return None
    except ConnectionError:
        return None

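def save_image(content):
    """
        Purpose: save an image file
        content: binary image data
    """
    # Name the file after the MD5 hash of its content
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    print(file_path)
    # Skip the write if a file with this name already exists
    if not os.path.exists(file_path):
        # The with block closes the file automatically
        with open(file_path, 'wb') as f:
            f.write(content)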

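def save_to_mongo(result):
    """
        Purpose: save the data to MongoDB
        result: dict with the page title, page URL and image URLs
    """
    # insert() was removed in PyMongo 4; insert_one() replaces it
    if db[MONGO_TABLE].insert_one(result).acknowledged:
        print('Successfully Saved to Mongo', result)
        return True
    return False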

def jiepai_Spider(offset):
    """
        Purpose: scheduler for the whole crawler
        offset: offset of the index page to crawl
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    }

    data = {
        "offset": offset,
        "format": "json",
        "keyword": KEYWORD,  # defined in config.py
        "autoload": "true",
        "count": "20",
        "cur_tab": "1",
        "from": "search_tab"
    }
    # Build the request URL with urlencode
    index_url = 'https://www.toutiao.com/search_content/' + '?' + urlencode(data)

    # Fetch the index page
    index_html = get_page_index(index_url, headers)
    if not index_html:
        return

    # Parse the index response to get the article URLs
    for article_url in parse_page_index(index_html):
        # Fetch the HTML of each article page
        detail_html = get_page_detail(article_url, headers)
        if not detail_html:
            continue
        result = parse_page_detail(detail_html, article_url)

        # Save non-empty results to MongoDB
        if result:
            save_to_mongo(result)

if __name__ == "__main__":
    # Create the process pool
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(jiepai_Spider, groups)
    pool.close()
    # Wait for all child processes to finish; must come after close()
    pool.join()
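The file-naming step in the listing above can be tried in isolation. This sketch is a variant of save_image, not the original function: it takes a target directory and returns whether a file was written, both of which are additions made here so the behaviour can be observed. The bytes used in the usage example are placeholders, not real image data.

```python
import os
import tempfile
from hashlib import md5


def save_image(content, directory, extension="jpg"):
    """Write content to <directory>/<md5 of content>.<extension>.

    Returns (file_path, written); written is False when an identical
    file was already saved.
    """
    file_name = "{}.{}".format(md5(content).hexdigest(), extension)
    file_path = os.path.join(directory, file_name)
    if os.path.exists(file_path):
        return file_path, False
    with open(file_path, "wb") as f:
        f.write(content)
    return file_path, True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Saving the same bytes twice writes the file only once.
        print(save_image(b"placeholder bytes", tmp))
        print(save_image(b"placeholder bytes", tmp))
```

Because the name is derived from the content, the same image reached through two different articles is stored once.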

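Summary and reflections

1. When matching with a regular expression, characters in the source text that have a special meaning in regex, such as the parentheses in ("…"), must be escaped with a preceding '\'. In theory a raw string (r'...') should work as well, but when I tried it the match returned None. I am leaving that as an open question.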

pattern = re.compile('gallery: JSON\.parse\("(.*?)"\),', re.S)
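The cause of the None result cannot be determined from this post. The sketch below only checks one thing: for this pattern, the raw-string form and the form with doubled backslashes compile to the same regex, so the raw string by itself should not change the outcome. SAMPLE is a made-up fragment and first_group is a helper introduced here.

```python
import re

# Hypothetical fragment imitating the inline script of a gallery page.
SAMPLE = 'gallery: JSON.parse("{\\"count\\":1}"),'

# Ordinary string literal: every regex backslash has to be doubled.
escaped = re.compile('gallery: JSON\\.parse\\("(.*?)"\\),', re.S)
# Raw string literal: backslashes reach the regex engine unchanged.
raw = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)


def first_group(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(escaped.pattern == raw.pattern)
    print(first_group(raw, SAMPLE))
    print(first_group(raw, "a page without the gallery script"))
```

If both forms are equivalent, a None result points to a difference elsewhere, for example a page whose text does not contain the exact sequence the pattern expects.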

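2. db = client[MONGO_DB] must use square brackets rather than ( ); otherwise the database cannot be accessed.

3. I could not find the image URLs in Chrome at first, so I went back and forth between Chrome and Firefox to inspect the page.

4. Full source code: https://github.com/XiaoFei-97/toutiao_Spider-Ajax

Original source: https://www.jzfblog.com/detail/66. Updates and edits to this article are made at that link.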
