Analyzing Ajax to Scrape Toutiao Street-Snap Photos, Following Cui Qingcai's Approach

Date: 2020-02-15

Tags: analysis, ajax, Toutiao, approach. Category: Ajax


- Site analysis
- Source code and problems encountered

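Site analysis

First, open Toutiao and enter a keyword in the search box. On the results page, open the browser's developer tools and tick Preserve log in the Network panel, so that earlier requests are not cleared when the page changes.

The response contains none of the usual HTML, so a first guess is that the site loads its content dynamically via Ajax.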

pic-1581682361199.png

Switch to the XHR filter to take a closer look.

pic-1581682361200.png

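As the page scrolls, Ajax requests like this one keep appearing. Looking closely at their content, you can see the title and article_url values that correspond to the entries on the page.

So the initial plan is to collect the article entries first, via the article_url field.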

Analyzing the JSON data shows the article_url field. Since the target this time is gallery-style pages, the has_gallery field also deserves attention.
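The filtering idea described above can be sketched as follows. This is a minimal illustration, not the real API response: the sample body is hypothetical and heavily trimmed, and gallery_urls is a name made up for this example.

```python
import json

# Hypothetical, trimmed-down response body; real responses carry many more fields.
sample = '''
{"data": [
  {"title": "Street snap A", "article_url": "http://example.com/a", "has_gallery": true},
  {"title": "Plain article", "article_url": "http://example.com/b", "has_gallery": false},
  {"title": "Entry without url or gallery flag"}
]}
'''

def gallery_urls(body):
    """Yield article_url for every item flagged as a gallery."""
    payload = json.loads(body)
    # 'data' may be missing or null, so fall back to an empty list
    for item in payload.get('data') or []:
        if item.get('has_gallery') and item.get('article_url'):
            yield item['article_url']

print(list(gallery_urls(sample)))
```

Using .get() throughout keeps the loop from failing on entries that lack either field, which matters because not every item in the feed is an article.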

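Next, let's look at a detail page.

In the detail page's HTML, all of the image links are included directly in the page source, so we can simply fetch the source and pull them out with a regular expression.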

pic-1581682361200.png

That completes the page analysis.

Time to get to work!

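Source code and problems encountered

Code structure

Function definitions

def get_page_index(offset, keyword): fetches the search-result index page
def parse_page_index(html): parses the index page; the content is JSON, so this relies on json.loads()
def get_page_detail(url): fetches the page holding the actual images, much like fetching the index page
def parse_page_details(html, url): parses a gallery detail page
def save_to_mongo(result): saves the title, url and other fields to MongoDB. MongoDB was chosen because it is simple and stores documents as key-value pairs, which maps naturally onto Python dicts
def download_image(url): downloads an image
def save_img(content): saves an image to disk
def main(offset): calls all of the functions above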

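Required constants

MONGO_URL = 'localhost'  # database location
MONGO_DB = 'toutiao'     # database name
MONGO_TABLE = 'toutiao'  # collection (table) name
GROUP_START = 1          # loop start value
GROUP_END = 20           # loop end value
KEY_WORD = '街拍'         # search keyword ("street snap")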

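Problems encountered in the code

01. Database connection

This was my first time using a database from Python, and MongoDB at that. After importing the pymongo library, connecting is fairly simple: pass in the URL, then select the database by name on the client you created.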

client = pymongo.MongoClient(MONGO_URL,connect=False)
db = client[MONGO_DB]

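02. Toutiao's anti-scraping mechanism

Toutiao is interesting: instead of rejecting the request outright with a 400 response, it returns wrong, useless code or JSON. I don't understand the mechanism behind it, or whether something in the request is at fault. After some trial and error, it turned out that both the GET parameters and the request headers have to be constructed.
The request headers also need to carry the cookie; otherwise the response still comes back wrong.

Note that the cookie expires. I never worked out exactly how long it lasts, but it should be at least a few hours, so it is best to refresh the cookie before each run.

The same problem exists when fetching detail pages. I also made a mistake dumb enough to make me cry: the headers were never passed to requests.get().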

def get_page_index(offset, keyword):
    timestamp = int(time.time())
    data = {
        "aid": "24",
        "app_name": "web_search",
        "offset": offset,
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": "20",
        "en_qc": "1",
        "cur_tab": "1",
        "from": "search_tab",
        # "pd": "synthesis",
        "timestamp": timestamp
    }
    headers = {
        # Watch out for the cookie expiring
        'cookie': 'tt_webid=6791640396613223949; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6791640396613223949; csrftoken=4a29b1b1d9ecf8b5168f1955d2110f16; s_v_web_id=k6g11cxe_fWBnSuA7_RBx3_4Mo4_9a9z_XNI0WS8B9Fja; ttcid=3fdf0861117e48ac8b18940a5704991216; tt_scid=8Z.7-06X5KIZrlZF0PA9kgiudolF2L5j9bu9g6Pdm.4zcvNjlzQ1enH8qMQkYW8w9feb; __tasessionId=ngww6x1t11581323903383',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(data)
    try:
        # The request must sit inside try, otherwise RequestException is never caught
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request failed!')
        return None
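To see what the query string built by urlencode() looks like without sending a request, here is a small standard-library sketch. build_search_url is a hypothetical helper, and the parameter set is deliberately trimmed for illustration, so it is not claimed to be enough for the real endpoint to answer.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(offset, keyword):
    """Build the index URL from a trimmed, illustrative parameter set."""
    params = {
        "aid": "24",
        "app_name": "web_search",
        "offset": offset,
        "format": "json",
        "keyword": keyword,
        "count": "20",
    }
    # urlencode() percent-encodes non-ASCII values as UTF-8
    return 'https://www.toutiao.com/api/search/content/?' + urlencode(params)

url = build_search_url(20, '街拍')
print(url)

# Parsing the URL back recovers the original values
query = parse_qs(urlsplit(url).query)
print(query['keyword'], query['offset'])
```

Keeping URL construction separate from the request makes it easy to compare the generated URL against the one shown in the browser's XHR panel when the server starts returning empty data.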

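03. A problem with JSON decoding

The problem comes from an extra level of escaping. Python escapes characters with '\'. The data embedded in the page, however, arrives as the argument of a JavaScript JSON.parse("...") call, where '\\' is needed: the '\' itself has to be escaped by another '\'.

print writes a string's characters as they are, so the printed text showed only single '\' and looked fine.
Yet json.loads() in parse_page_details() raised an error saying the JSON was malformed and that an expected '"' could not be found.

Later, while debugging, I saw what the JSON string really looked like.

So the JSON string has to be preprocessed before it is decoded with json.loads().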

eval(repr(result.group(1)).replace('\\\\', '\\'))

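A short digression on the difference between str() and repr(). Both convert an object to a string. print and str() call the class's __str__, while repr() calls __repr__.
Put simply, __str__ aims at readability and returns a display form; for a string that means no surrounding quotes and no escaped backslashes. __repr__ tries to preserve the original data format and aims at precision, so for a string it returns a quoted literal in which every backslash is written as \\. That is why we use repr() here to get the raw form and then replace \\ with \

P.S. In Python source, '\\\\' is two backslashes, each escaped once. Likewise '\\' is a single backslash, because it is escaped too.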
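The difference is easiest to see on a string that contains a backslash. A minimal sketch, using a made-up four-character string:

```python
# Four characters: a, backslash, double quote, b
s = 'a\\"b'

print(len(s))    # number of characters actually stored
print(str(s))    # display form: characters as they are
print(repr(s))   # literal form: quoted, with the backslash doubled

# repr() of a string is a valid Python literal, so eval() round-trips it
round_trip = eval(repr(s))
print(round_trip == s)
```

The stored string never changes; only the way it is written out differs, which is why the same text can look correct under print and wrong in a debugger.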

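The eval() function, in turn, evaluates a string as a Python expression and so turns it into an object of the corresponding type.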

# Converting a string to a list
>>> a = "[[1,2], [3,4], [5,6], [7,8], [9,0]]"
>>> type(a)
<class 'str'>
>>> b = eval(a)
>>> print(b)
[[1, 2], [3, 4], [5, 6], [7, 8], [9, 0]]
>>> type(b)
<class 'list'>
# Converting a string to a dict
>>> a = "{1: 'a', 2: 'b'}"
>>> type(a)
<class 'str'>
>>> b = eval(a)
>>> print(b)
{1: 'a', 2: 'b'}
>>> type(b)
<class 'dict'>

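With repr() and eval() understood, the preprocessing code above is easy to follow: repr() produces the raw literal form, the replacement halves the backslashes, eval() turns the result back into an ordinary string, and json.loads() then decodes it.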
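As an alternative to eval(), the same unescaping can be sketched with json.loads() alone by treating the captured text as the body of a JSON string literal and decoding twice. This swaps in a different technique from the one used in this article. The captured value below is a hypothetical stand-in for result.group(1), and decode_embedded_json is a name invented for this example. It assumes the embedded text only uses escapes that JSON also accepts.

```python
import json

# Hypothetical captured text: quotes and backslashes carry one extra level of escaping
captured = r'{\"sub_images\":[{\"url\":\"http:\\u002F\\u002Fexample.com\\u002Fa.jpg\"}]}'

def decode_embedded_json(text):
    """Undo the string-literal escaping, then parse the JSON inside."""
    # First pass: wrap in quotes and decode as a JSON string literal
    inner = json.loads('"' + text + '"')
    # Second pass: decode the now well-formed JSON document
    return json.loads(inner)

data = decode_embedded_json(captured)
print(data['sub_images'][0]['url'])
```

This avoids evaluating page content as Python code, at the cost of failing on escapes that are legal in JavaScript strings but not in JSON.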

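04. The difference between response.text and response.content

response.text returns the body as text (str), which is what the HTML and JSON pages need
response.content returns the body as raw bytes, which is what the image download needs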
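The distinction can be illustrated without a network call. In this sketch the bytes object stands in for response.content and the decoded string for response.text; in requests, text is essentially content decoded with the encoding the library determines for the response. The file name at the end mirrors the MD5 naming that save_img() uses.

```python
from hashlib import md5

# Stand-in for response.content: raw bytes as received
content = '街拍'.encode('utf-8')

# Stand-in for response.text: the same bytes decoded into str
text = content.decode('utf-8')

print(type(content).__name__, len(content))  # bytes, counted in bytes
print(type(text).__name__, len(text))        # str, counted in characters

# Hashing and binary file writes operate on bytes, not str
file_name = md5(content).hexdigest() + '.jpg'
print(file_name)
```

Passing text instead of content to md5() or to a file opened in 'wb' mode raises a TypeError, which is a quick way to notice that the wrong attribute was used.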

Source code

import json
import os
import re
from hashlib import md5
from multiprocessing import Pool
from urllib.parse import urlencode
import pymongo
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException
from config import *

# MongoDB client object
# connect=False defers the connection until first use, i.e. after each worker process has started
client = pymongo.MongoClient(MONGO_URL,connect=False)
db = client[MONGO_DB]

def get_page_index(offset, keyword):
    data = {
        "aid": "24",
        "app_name": "web_search",
        "offset": offset,
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": "20",
        "en_qc": "1",
        "cur_tab": "1",
        "from": "search_tab",
        # "pd": "synthesis",
        # "timestamp": "1581315480994"
    }
    headers = {
        # Watch out for the cookie expiring
        'cookie': 'tt_webid=6791640396613223949; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6791640396613223949; csrftoken=4a29b1b1d9ecf8b5168f1955d2110f16; s_v_web_id=k6g11cxe_fWBnSuA7_RBx3_4Mo4_9a9z_XNI0WS8B9Fja; ttcid=3fdf0861117e48ac8b18940a5704991216; tt_scid=8Z.7-06X5KIZrlZF0PA9kgiudolF2L5j9bu9g6Pdm.4zcvNjlzQ1enH8qMQkYW8w9feb; __tasessionId=ngww6x1t11581323903383',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(data)
    try:
        # The request must sit inside try, otherwise RequestException is never caught
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request failed!')
        return None

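def parse_page_index(html):
    try:
        # json.loads() parses the response text into a dict;
        # html may be None when the index request failed
        data = json.loads(html)
        if data and 'data' in data.keys():
            for item in data.get('data'):
                # Keep only gallery-style articles
                if item.get('has_gallery'):
                    yield item.get('article_url')
    except (TypeError, ValueError):
        # None input, a null 'data' field, or a body that is not valid JSON
        pass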

def get_page_detail(url):
    headers = {
        'cookie': 'tt_webid=6791640396613223949; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6791640396613223949; csrftoken=4a29b1b1d9ecf8b5168f1955d2110f16; s_v_web_id=k6g11cxe_fWBnSuA7_RBx3_4Mo4_9a9z_XNI0WS8B9Fja; ttcid=3fdf0861117e48ac8b18940a5704991216; tt_scid=8Z.7-06X5KIZrlZF0PA9kgiudolF2L5j9bu9g6Pdm.4zcvNjlzQ1enH8qMQkYW8w9feb; __tasessionId=yix51k4j41581315307695',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
        # ':scheme': 'https',
        # 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # 'accept-encoding': 'gzip, deflate, br',
        # 'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-US;q=0.7'
    }
    try:
        # Painfully dumb mistake: I forgot to pass headers here and lost more than an hour
        response = requests.get(url, headers=headers)
        # print(response.status_code)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the detail page!')
        return None

def parse_page_details(html, url):
    soup = BeautifulSoup(html, 'xml')
    title = soup.select('title')[0].get_text()
    # print(title)
    # Raw string so that \. \( and \) reach the regex engine unchanged
    img_pattern = re.compile(r'JSON\.parse\("(.*?)"\),', re.S)
    result = re.search(img_pattern, html)
    if result:
        # Mind the doubled backslashes here
        data = json.loads(eval(repr(result.group(1)).replace('\\\\', '\\')))
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_image(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }

def save_to_mongo(result):
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False

def download_image(url):
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # Images are binary, so use content rather than text
            save_img(response.content)
        return None
    except RequestException:
        print('Failed to request the image', url)
        return None

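def save_img(content):
    img_dir = os.path.join(os.getcwd(), 'img_download')
    # Create the target directory if needed; open() fails when it is missing
    os.makedirs(img_dir, exist_ok=True)
    # Name the file after the MD5 of its content so duplicates are skipped
    file_path = os.path.join(img_dir, '{0}.{1}'.format(md5(content).hexdigest(), 'jpg'))
    if not os.path.exists(file_path):
        # The with block closes the file, so no explicit close() is needed
        with open(file_path, 'wb') as f:
            f.write(content)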

def main(offset):
    index_html = get_page_index(offset, KEY_WORD)
    if not index_html:
        # Index request failed, nothing to parse
        return
    for url in parse_page_index(index_html):
        detail_html = get_page_detail(url)
        if detail_html:
            result = parse_page_details(detail_html, url)
            if result:
                save_to_mongo(result)

if __name__ == '__main__':
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool = Pool()
    pool.map(main, groups)