Scrapy-splash
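Splash is a JavaScript rendering service. It is a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5. The QT reactor is used to make the service fully asynchronous, which lets it take advantage of WebKit concurrency through the QT main loop.
Some Splash features: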
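- Process multiple web pages in parallel
- Get the HTML source or take screenshots
- Turn off images or use Adblock Plus rules to make rendering faster
- Execute custom JavaScript in the page context
- Control the page rendering process with Lua scripts
- Develop Splash Lua scripts in Splash-Jupyter notebooks
- Get detailed rendering information in HAR format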
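Several of the features above map directly onto Splash's HTTP API. The following is a minimal sketch, assuming a Splash instance is reachable at http://localhost:8050 (an assumed address) and using the render.html, render.png and render.har endpoints. The helper name render_url is hypothetical; it only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed address of a local Splash instance; adjust to your deployment.
SPLASH_BASE = "http://localhost:8050"


def render_url(page_url, endpoint="render.html", wait=None, images=None,
               base=SPLASH_BASE):
    """Build a GET URL for one of Splash's render.* endpoints.

    render_url is a hypothetical helper, not part of Splash itself.
    """
    params = {"url": page_url}
    if wait is not None:
        params["wait"] = wait      # seconds to wait after the page loads
    if images is not None:
        params["images"] = images  # 0 turns image loading off
    return f"{base.rstrip('/')}/{endpoint}?{urlencode(params)}"


# HTML source, rendered with images turned off
html_url = render_url("https://example.com", wait=2, images=0)
# Screenshot of the page
png_url = render_url("https://example.com", endpoint="render.png")
# Rendering details in HAR format
har_url = render_url("https://example.com", endpoint="render.har")
print(html_url)
```

Opening any of these URLs in a browser, or fetching them with an HTTP client, should return the rendered result once the Splash service is running.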
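1. Installing Splash
Installing Scrapy-Splash has two parts. The first is the Splash service itself, which is installed and run through Docker; the running service exposes an interface that is used to load JavaScript pages. The second is the Scrapy-Splash Python library, which lets Scrapy use the Splash service once it is installed. The installation is done in three steps:
1. Install Docker
(Skipped here.)
2. Install the Splash service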
docker pull scrapinghub/splash
docker run -d -p 8050:8050 scrapinghub/splash
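Once the container is started, it can be useful to confirm that the service answers before wiring it into Scrapy. The sketch below assumes Splash is published on http://localhost:8050 (as in the docker run command above) and that its _ping endpoint returns a JSON object with a "status" field. The function names ping_url and splash_is_up are hypothetical.

```python
import json
import urllib.error
import urllib.request


def ping_url(splash_url):
    """Return the URL of the _ping endpoint for a Splash base URL."""
    return splash_url.rstrip("/") + "/_ping"


def splash_is_up(splash_url="http://localhost:8050", timeout=3.0):
    """Return True if the Splash service answers its _ping endpoint.

    Assumes the response is JSON containing {"status": "ok"}.
    """
    try:
        with urllib.request.urlopen(ping_url(splash_url), timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON answer
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


if __name__ == "__main__":
    print("Splash reachable:", splash_is_up())
```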
3. Install the Scrapy-Splash Python package
pip3 install scrapy-splash
2. Using Scrapy-Splash
1. Add the configuration to settings.py
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,  # Splash support
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,  # Splash support
    'scrapy_splash.SplashMiddleware': 725,  # Splash support
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,  # Splash support
}
# Address of the Splash server:
SPLASH_URL = "http://192.168.31.111:8050/"
# Duplicate filter that is aware of Splash requests:
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
# Enable the HTTP cache
HTTPCACHE_ENABLED = True
# Cache expiration time in seconds (0 means cached entries never expire)
HTTPCACHE_EXPIRATION_SECS = 0
# Directory where the cache is stored
HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that are not cached
HTTPCACHE_IGNORE_HTTP_CODES = []
# Finally, configure a Splash-aware cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
2. spider.py
import scrapy
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector
from scrapy_splash.request import SplashRequest, SplashFormRequest


class JdSpiderSpider(scrapy.Spider):
    name = 'jd_spider'
    # The original used ['.com'], which does not match item.jd.com
    allowed_domains = ['jd.com']
    start_urls = ['https://www.baidu.com']

    def start_requests(self):
        # Lua script run by Splash; with the 'run' endpoint the script is
        # the body of the main function, so `splash` is already in scope.
        splash_args = {"lua_source": """
            --splash.response_body_enabled = true
            splash.private_mode_enabled = false
            splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
            assert(splash:go("https://item.jd.com/5089239.html"))
            splash:wait(3)
            return {html = splash:html()}
            """}
        # yield SplashRequest("https://item.jd.com/5089239.html", endpoint='run', args=splash_args, callback=self.onSave)
        yield SplashRequest("https://item.jd.com/35674728065.html", endpoint='run', args=splash_args, callback=self.onSave)

    def onSave(self, response):
        # Extract the price text from the rendered page
        value = response.xpath('//span[@class="p-price"]//text()').extract()
        print(value)

    def parse(self, response):
        pass

Signature of SplashRequest:

def SplashRequest(url=None, callback=None, method='GET', endpoint='render.html', args=None, splash_url=None,
                  slot_policy=SlotPolicy.PER_DOMAIN, splash_headers=None, dont_process_response=False,
                  dont_send_headers=False, magic_response=True, session_id='default',
                  http_status_from_error_code=True, cache_args=None, meta=None, **kwargs):

- url: same as url in scrapy.Request, i.e. the URL of the page to crawl
- headers: same as headers in scrapy.Request
- cookies: same as cookies in scrapy.Request
- args: parameters passed to Splash, such as wait (wait time), timeout (timeout), images (whether to load images: 0 disables them, 1 enables them), proxy (proxy to use), and so on. For example: args={'wait': 5, 'lua_source': source, 'proxy': 'http://proxy_ip:proxy_port'}
- endpoint: the Splash service endpoint; defaults to 'render.html', the JS page rendering service
- splash_url: address of the Splash server; defaults to None, in which case the SPLASH_URL from settings.py is used, e.g. SPLASH_URL = 'http://localhost:8050'
- method: the request method

Signature of SplashFormRequest:

def SplashFormRequest(url=None, callback=None, method=None, formdata=None, body=None, **kwargs):

- body: the request body
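To make the role of endpoint and args more concrete, here is a rough sketch of the kind of HTTP request that ends up at the Splash server: a JSON body containing the arguments, POSTed to the chosen endpoint. It uses only the standard library and does not send anything. The helper build_run_request, the localhost address and the shortened Lua script are assumptions for illustration, not scrapy-splash internals.

```python
import json
import urllib.request

# Shortened Lua script for the 'run' endpoint; values in the JSON body
# are available to the script through `args`.
LUA_SOURCE = """
assert(splash:go(args.url))
splash:wait(args.wait)
return {html = splash:html()}
"""


def build_run_request(page_url, lua_source, wait=3,
                      splash_url="http://localhost:8050"):
    """Build (but do not send) a POST request for Splash's run endpoint.

    build_run_request is a hypothetical helper; splash_url is an assumed
    address and should match SPLASH_URL in settings.py.
    """
    body = {"lua_source": lua_source, "url": page_url, "wait": wait}
    return urllib.request.Request(
        splash_url.rstrip("/") + "/run",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_run_request("https://item.jd.com/35674728065.html", LUA_SOURCE)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen would send it to a running Splash instance; inside a Scrapy project, SplashRequest takes care of this step.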