Install Splash (pull the Docker image)
docker pull scrapinghub/splash
Install scrapy-splash
pip install scrapy-splash
Start the container
docker run -p 8050:8050 scrapinghub/splash
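Before wiring Splash into Scrapy, it can help to confirm the container is reachable. A minimal sketch using the requests library against Splash's render.html HTTP endpoint (the localhost address assumes Docker runs natively on your machine; substitute your Docker machine IP otherwise):
import requests

# Ask Splash to render a page and return the HTML (host/port assume the docker run command above)
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://www.jd.com/', 'wait': 0.5},
)
print(resp.status_code)   # 200 means Splash rendered the page
print(len(resp.text))     # size of the rendered HTML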
Configure the following in settings.py
SPLASH_URL = 'http://192.168.99.100:8050'  # Important: point this at the host running the Splash container; a wrong address gives a "target machine actively refused the connection" error (192.168.99.100 is the default Docker Machine VM IP; use http://localhost:8050 if Docker runs natively)
Add the Splash downloader middlewares and set their priorities
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
Set the Splash-aware duplicate-request filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Set the Splash-aware HTTP cache storage backend
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'  # this and DUPEFILTER_CLASS above are both required
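The upstream scrapy-splash README additionally recommends a spider middleware that deduplicates Splash request arguments; a sketch of that optional setting, in case you want to follow the official configuration:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}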
Example spider:
import scrapy
from scrapy_splash import SplashRequest


class JsSpider(scrapy.Spider):
    name = "jd"
    allowed_domains = ["jd.com"]
    start_urls = [
        "http://www.jd.com/",
    ]

    def start_requests(self):
        # Render each start URL through Splash, waiting 0.5 s for JS content to load
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        print('---------- Crawling asynchronously loaded content from the JD homepage via Splash ----------')
        rs = response.xpath('//span[@class="ui-areamini-text"]/text()').extract()[0]
        print(rs)
        print('---------------- success ----------------')
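If the page needs more than a fixed wait (scrolling, extra delays, custom interaction), SplashRequest can also run a Lua script through Splash's execute endpoint. A minimal sketch of that variation; the script below is illustrative and not part of the original example:
lua_script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1.0)
    return splash:html()
end
"""

# Inside start_requests(), instead of the plain render request:
yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': lua_script})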
Official documentation: https://pypi.python.org/pypi/scrapy-splash