The text and images in this article come from the Internet and are for learning and exchange only; they are not for any commercial use. Copyright belongs to the original author. If there is any problem, please contact us promptly so we can deal with it.
Author: 人走茶涼
PS: If you need Python learning materials, you can get them yourself via the link below:
http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef
1. Fields to scrape
There are not many fields to scrape; only three are needed. Of these, the "content" field has to be scraped from the detail page.
2. Page analysis
The Zhihu Explore section is a typical ajax-loaded page. Open the page, right-click and choose Inspect, switch to the Network tab, and click XHR; in this state, every entry that appears on refresh is loaded via Ajax.
Next, keep scrolling down the page, and you can see ajax-loaded entries appearing one after another. The params of these ajax requests are shown below. In the Baidu image downloader crawler I wrote last time, we worked by constructing the params; I tried that approach in this crawler but got a 404, so this time we work by analyzing the ajax URLs instead. Here are two URLs loaded via ajax:
https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A10%2C%22type%22%3A%22day%22%7D
https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A15%2C%22type%22%3A%22day%22%7D
Comparing the two URLs above, we can see that the URL has only one variable part: the offset value inside params. So we only need to change this one parameter.
While analyzing the page, we also found that this ajax loading tops out at 40 pages; the ajax URL of the last page is:
https://www.zhihu.com/node/ExploreAnswerListV2?params=%7B%22offset%22%3A199%2C%22type%22%3A%22day%22%7D
Good. At this point, our analysis of the ajax request URL is done. The field extraction itself won't be analyzed further; the entries are simple static HTML, and XPath is enough.
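To see what the params actually carry, you can url-decode them. Here is a small sketch using only the standard library; the build_params helper is my own illustration, not part of the original project:

import json
from urllib.parse import quote, unquote

raw = "%7B%22offset%22%3A10%2C%22type%22%3A%22day%22%7D"
print(unquote(raw))  # -> {"offset":10,"type":"day"}

# Rebuild params for any offset; quote() escapes {, ", : and , exactly as in the URLs above
def build_params(offset):
    return quote(json.dumps({"offset": offset, "type": "day"}, separators=(",", ":")))

print("https://www.zhihu.com/node/ExploreAnswerListV2?params=" + build_params(15))

So the only thing that changes between requests is the offset value, which is what the spider below exploits.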
1. items
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZhihufaxianItem(scrapy.Item):
    # title
    title = scrapy.Field()
    # author
    author = scrapy.Field()
    # content
    content = scrapy.Field()
This defines the fields we will scrape.
2. settings
# -*- coding: utf-8 -*-
# Scrapy settings for zhihufaxian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'zhihufaxian'
SPIDER_MODULES = ['zhihufaxian.spiders']
NEWSPIDER_MODULE = 'zhihufaxian.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'zhihufaxian.middlewares.ZhihufaxianSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'zhihufaxian.middlewares.ZhihufaxianDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zhihufaxian.pipelines.ZhihufaxianPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
All we did here was enable the ITEM_PIPELINES section and change the USER_AGENT; everything else can stay at its default.
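The post never shows the pipelines.py file that ITEM_PIPELINES points at. As a minimal sketch, assuming we only want to log each item as it comes through (a real pipeline might write to a file or database instead):

# pipelines.py -- hypothetical minimal version, not shown in the original post
class ZhihufaxianPipeline:
    def process_item(self, item, spider):
        # log the scraped title; returning the item lets later pipelines receive it too
        spider.logger.info("scraped: %s", item.get("title"))
        return item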
3. spider
# -*- coding: utf-8 -*-
import scrapy

from zhihufaxian.items import ZhihufaxianItem


class ZhfxSpider(scrapy.Spider):
    name = 'zhfx'
    allowed_domains = ['zhihu.com']
    start_urls = ['http://zhihu.com/']

    # the Zhihu Explore section only ajax-loads 40 pages
    def start_requests(self):
        base_url = "https://www.zhihu.com/node/ExploreAnswerListV2?"
        for page in range(1, 41):
            if page < 40:
                # url-encoded form of {"offset":<page*5>,"type":"day"}
                params = "params=%7B%22offset%22%3A" + str(page * 5) + "%2C%22type%22%3A%22day%22%7D"
            else:
                # the last page uses offset 199
                params = "params=%7B%22offset%22%3A" + str(199) + "%2C%22type%22%3A%22day%22%7D"
            url = base_url + params
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        divs = response.xpath("//body/div")
        for li in divs:
            item = ZhihufaxianItem()
            # title
            item["title"] = "".join(li.xpath(".//h2/a/text()").getall())
            item["title"] = item["title"].replace("\n", "")
            # author
            item["author"] = "".join(li.xpath(".//div[@class='zm-item-answer-author-info']/span[1]/span[1]/a/text()").getall())
            item["author"] = item["author"].replace("\n", "")
            details_url = "".join(li.xpath(".//div[@class='zh-summary summary clearfix']/a/@href").getall())
            details_url = "https://www.zhihu.com" + details_url
            yield scrapy.Request(
                url=details_url,
                callback=self.details,
                meta={"item": item}
            )

    # fetch the content field from the detail page
    def details(self, response):
        item = response.meta["item"]
        item["content"] = "".join(response.xpath("//div[@class='RichContent-inner']/span/p/text()").getall())
        # yield the item so the enabled pipeline actually receives it
        yield item
First, the start_requests method builds the complete URL for each page. The responses are handed to parse for extraction, and finally we follow into the detail page to pick up the content field.
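Assuming the project layout created by scrapy startproject zhihufaxian, the spider can then be run from the project root; the standard -o option dumps the yielded items to a file:

scrapy crawl zhfx -o items.json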
Zhihu is a fairly good site to practice on: the content its ajax requests return is complete HTML rather than JSON, which saves the trouble of parsing JSON. You can scrape it directly.
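You can check this claim without Scrapy. A quick sketch with requests (note that Zhihu's anti-crawling measures change over time, so this endpoint may now require extra headers or a login):

import requests

url = ("https://www.zhihu.com/node/ExploreAnswerListV2"
       "?params=%7B%22offset%22%3A10%2C%22type%22%3A%22day%22%7D")
headers = {"User-Agent": "Mozilla/5.0"}  # same idea as the USER_AGENT in settings
resp = requests.get(url, headers=headers)
print(resp.text[:200])  # begins with HTML markup, not a JSON object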