Scrapy — Part 5
What are the commonly used functions of a downloader middleware (Downloader Middleware)?
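A minimal skeleton of the three hooks Scrapy looks for on a downloader middleware (the class name and comments here are illustrative, not from the original project; no base class is required, Scrapy calls these methods by name):

```python
class CustomDownloaderMiddleware:
    """Skeleton downloader middleware showing the three common hooks."""

    def process_request(self, request, spider):
        # Called for every outgoing request. Return None to continue
        # normal processing, a Response to short-circuit the download,
        # or a new Request to reschedule.
        return None

    def process_response(self, request, response, spider):
        # Called for every downloaded response. Must return a Response
        # (possibly modified) or a new Request.
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download handler or an earlier middleware's
        # process_request raises. Return None to fall through to other
        # middlewares, or a Response/Request to recover.
        return None
```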
Register your own downloader middleware class under DOWNLOADER_MIDDLEWARES in settings.py.
For details, see https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/settings.html#concurrent-items
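The registration is a dict mapping the middleware's dotted path to a priority number. A sketch, assuming the project is named Area and uses the AreaMiddleware class defined later in this post (543 is just a typical mid-range priority, not a required value):

```python
# settings.py (sketch): lower priority numbers sit closer to the engine,
# higher numbers closer to the downloader
DOWNLOADER_MIDDLEWARES = {
    'Area.middlewares.AreaMiddleware': 543,  # hypothetical dotted path
}
```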
Some related settings and their defaults:

CONCURRENT_REQUESTS = 16
CONCURRENT_ITEMS = 100
DOWNLOAD_TIMEOUT = 180
DOWNLOAD_DELAY = 0  (e.g. DOWNLOAD_DELAY = 0.25 gives 250 ms of delay)
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_IP = 0
DOWNLOADER_MIDDLEWARES = {}

By default, Scrapy does not wait a fixed interval between two requests; instead it waits a random value between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY. When CONCURRENT_REQUESTS_PER_IP is non-zero, the delay is enforced per IP rather than per domain. A spider can also override the delay via its download_delay attribute.
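The randomization described above can be sketched in plain Python (random.uniform here stands in for Scrapy's internal logic, it is not the actual implementation):

```python
import random

DOWNLOAD_DELAY = 0.25  # the 250 ms example above


def effective_delay(base=DOWNLOAD_DELAY):
    # With RANDOMIZE_DOWNLOAD_DELAY = True, Scrapy multiplies the base
    # delay by a random factor drawn from [0.5, 1.5]
    return random.uniform(0.5, 1.5) * base


# With a 0.25 s base, every wait falls between 0.125 s and 0.375 s
```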
Selenium integration in practice — PM2.5 / air quality index historical data from the China air quality online monitoring and analysis platform
This site transmits its data encrypted. By plugging in Selenium we can skip the decryption step entirely and grab the page source after the JavaScript has rendered, then extract the data from that.
The only drawback is that it is slow.
This exercise focuses on the Selenium integration itself, so we will not bother persisting the data, and there is no need to configure the items file.
# -*- coding: utf-8 -*-
import scrapy


class AqistudySpider(scrapy.Spider):
    name = 'aqistudy'
    # allowed_domains = ['aqistudy.cn']
    start_urls = ['https://www.aqistudy.cn/historydata/']

    def parse(self, response):
        print('Fetching the main city links...')
        city_list = response.xpath("//ul[@class='unstyled']/li/a/@href").extract()
        for city_url in city_list[1:3]:  # limit to two cities for the demo
            yield scrapy.Request(url=self.start_urls[0] + city_url,
                                 callback=self.parse_month)

    def parse_month(self, response):
        print('Fetching the month links for the current city...')
        month_urls = response.xpath('//ul[@class="unstyled1"]/li/a/@href').extract()
        for month_url in month_urls:
            yield scrapy.Request(url=self.start_urls[0] + month_url,
                                 callback=self.parse_day)

    def parse_day(self, response):
        print('Fetching the air quality data...')
        print(response.xpath('//h2[@id="title"]/text()').extract_first() + '\n')
        item_list = response.xpath('//tr')[1:]  # skip the header row
        for item in item_list:
            print('day: ' + item.xpath('./td[1]/text()').extract_first() + '\t' +
                  'AQI: ' + item.xpath('./td[2]/text()').extract_first() + '\t' +
                  'quality: ' + item.xpath('./td[3]/span/text()').extract_first() + '\t' +
                  'PM2.5: ' + item.xpath('./td[4]/text()').extract_first() + '\t' +
                  'PM10: ' + item.xpath('./td[5]/text()').extract_first() + '\t' +
                  'SO2: ' + item.xpath('./td[6]/text()').extract_first() + '\t' +
                  'CO: ' + item.xpath('./td[7]/text()').extract_first() + '\t' +
                  'NO2: ' + item.xpath('./td[8]/text()').extract_first() + '\t' +
                  'O3_8h: ' + item.xpath('./td[9]/text()').extract_first())
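The spider builds absolute URLs by concatenating start_urls[0] with the relative href. That works here because the hrefs are plain relative paths, but urllib.parse.urljoin (or response.urljoin in Scrapy) is the more robust spelling. A quick sketch with a hypothetical href value:

```python
from urllib.parse import urljoin

base = 'https://www.aqistudy.cn/historydata/'
# a relative href like the ones extracted in parse() (hypothetical value)
href = 'monthdata.php?city=Beijing'
print(urljoin(base, href))
# → https://www.aqistudy.cn/historydata/monthdata.php?city=Beijing
```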
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import time

import scrapy
from scrapy import signals
from selenium import webdriver


class AreaSpiderMiddleware(object):
    ...


class AreaDownloaderMiddleware(object):
    ...


class AreaMiddleware(object):

    def process_request(self, request, spider):
        # Only the month/day pages need JavaScript rendering
        if 'month' in request.url:
            # PhantomJS is deprecated in recent Selenium releases;
            # a headless Chrome/Firefox driver works the same way
            self.driver = webdriver.PhantomJS()
            self.driver.get(request.url)
            time.sleep(2)  # crude wait for the JS-rendered content
            html = self.driver.page_source
            self.driver.quit()
            # Returning an HtmlResponse short-circuits the download:
            # Scrapy hands this response straight to the spider callback
            return scrapy.http.HtmlResponse(url=request.url,
                                            body=html,
                                            request=request,
                                            encoding='utf-8')
# -*- coding: utf-8 -*-
# @Time   : 2018/11/12 16:56
# @Author : wjh
# @File   : main.py
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'aqistudy'])
The output looks like this: