Scrapy-splash

Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5. The QT reactor makes the service fully asynchronous, allowing it to exploit WebKit concurrency through the QT main loop.
Some Splash features:

  • Process multiple web pages in parallel
  • Get the HTML source code or take screenshots
  • Turn off images, or use Adblock Plus rules, to make rendering faster
  • Execute custom JavaScript in the page context
  • Control the page rendering process with Lua scripts
  • Develop Splash Lua scripts in Splash-Jupyter notebooks
  • Get detailed rendering information in HAR format
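
Because Splash is driven entirely over its HTTP API, you can try it without Scrapy at all. A minimal sketch using only the standard library (the Splash host, target URL, and `wait` value here are placeholders, not from the original article):

```python
from urllib.parse import urlencode


def splash_render_url(splash_host, target_url, wait=2):
    """Build a request URL for Splash's render.html endpoint,
    which returns the JavaScript-rendered HTML of target_url."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_host}/render.html?{query}"


# Fetch the rendered page with any HTTP client, e.g.:
# urllib.request.urlopen(splash_render_url("http://localhost:8050", "https://example.com"))
```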

1. Installing Splash

Installing Scrapy-Splash has two components. One is the Splash service itself, which we install via Docker: running the image starts a Splash service whose HTTP interface does the JavaScript page loading. The other is the Scrapy-Splash Python library; once it is installed, Scrapy can use the Splash service. We install everything in three steps below:

1. Install Docker

(Skipped here.)

2. Install the Splash service

docker pull scrapinghub/splash
docker run -d -p 8050:8050 scrapinghub/splash
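
Once the container is running, you can check that the service answers before wiring it into Scrapy. A small sketch using only the standard library (it relies on Splash's `/_ping` healthcheck endpoint; adjust the host to your setup):

```python
import json
from urllib.request import urlopen


def splash_is_up(base_url):
    """Return True if Splash's /_ping healthcheck reports status 'ok'."""
    try:
        with urlopen(base_url.rstrip("/") + "/_ping", timeout=5) as resp:
            return json.load(resp).get("status") == "ok"
    except OSError:
        # Connection refused, timeout, DNS failure, etc.
        return False


# e.g. splash_is_up("http://localhost:8050")
```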

3. Install the Scrapy-Splash Python package

pip3 install scrapy-splash

2. Using Scrapy-Splash

1. Add configuration to settings.py

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,  # Splash service support
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,  # Splash service support
    'scrapy_splash.SplashMiddleware': 725,  # Splash service support
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,  # Splash service support
}


# Splash server address:
SPLASH_URL = "http://192.168.31.111:8050/"
# Splash-aware duplicate request filter:
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# Enable the HTTP cache
HTTPCACHE_ENABLED = True
# Cache expiration time in seconds (0 = never expire)
HTTPCACHE_EXPIRATION_SECS = 0
# Cache storage directory
HTTPCACHE_DIR = 'httpcache'
# HTTP status codes to exclude from the cache
HTTPCACHE_IGNORE_HTTP_CODES = []
# Finally, configure a Splash-aware cache storage backend
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

2.spider.py

import scrapy
from scrapy_splash import SplashRequest


class JdSpiderSpider(scrapy.Spider):
    name = 'jd_spider'
    allowed_domains = ['jd.com']
    start_urls = ['https://www.baidu.com']

    def start_requests(self):
        # Lua script executed by Splash's 'run' endpoint: disable private mode,
        # set a browser User-Agent, load the page, wait for JS, return the HTML.
        splash_args = {"lua_source": """
                    --splash.response_body_enabled = true
                    splash.private_mode_enabled = false
                    splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
                    assert(splash:go("https://item.jd.com/5089239.html"))
                    splash:wait(3)
                    return {html = splash:html()}
                    """}
        # yield SplashRequest("https://item.jd.com/5089239.html", endpoint='run', args=splash_args, callback=self.onSave)
        yield SplashRequest("https://item.jd.com/35674728065.html", endpoint='run', args=splash_args,
                            callback=self.onSave)

    def onSave(self, response):
        # Extract the JavaScript-rendered price from the response
        value = response.xpath('//span[@class="p-price"]//text()').extract()
        print(value)

    def parse(self, response):
        pass
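
Under the hood, scrapy-splash turns the SplashRequest above into an HTTP POST to the Splash server's /run endpoint, with the args serialized as a JSON body. A rough sketch of that payload (field names follow the Splash HTTP API; the exact body scrapy-splash builds may include additional fields):

```python
import json

# Lua body as accepted by the /run endpoint (it wraps the script in main() for you)
lua_source = """
splash.private_mode_enabled = false
assert(splash:go(args.url))
splash:wait(3)
return {html = splash:html()}
"""

# Approximate JSON body POSTed to http://<splash-host>:8050/run
payload = {
    "url": "https://item.jd.com/35674728065.html",
    "lua_source": lua_source,
}
body = json.dumps(payload)
```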


def SplashRequest(url=None, callback=None, method='GET', endpoint='render.html',
                  args=None, splash_url=None, slot_policy=SlotPolicy.PER_DOMAIN,
                  splash_headers=None, dont_process_response=False,
                  dont_send_headers=False, magic_response=True,
                  session_id='default', http_status_from_error_code=True,
                  cache_args=None, meta=None, **kwargs)

    url: same as in scrapy.Request, i.e. the URL of the page to crawl
    headers: same as in scrapy.Request
    cookies: same as in scrapy.Request
    args: parameters passed to Splash, such as wait (wait time), timeout, images (whether to load images: 0 disables them, 1 enables them), proxy (proxy settings), etc., e.g.:
    args={'wait': 5,
              'lua_source': source,
              'proxy': 'http://proxy_ip:proxy_port'
              }
    endpoint: the Splash service endpoint, 'render.html' (the JS page-rendering service) by default
    splash_url: the Splash server address; defaults to None, meaning the SPLASH_URL setting (e.g. SPLASH_URL = 'http://localhost:8050') from settings.py is used
    method: the HTTP method of the request


def SplashFormRequest(url=None, callback=None, method=None, formdata=None,
                      body=None, **kwargs)

    body: the request body
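
SplashFormRequest mirrors Scrapy's FormRequest: when formdata is given, it is url-encoded into the POST body. A sketch of that encoding (the form field names here are made up for illustration):

```python
from urllib.parse import urlencode

# Hypothetical login form fields
formdata = {"username": "alice", "password": "secret"}

# FormRequest-style url-encoding of the fields into the request body
body = urlencode(formdata)
# body is now "username=alice&password=secret"
```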