Scrapy is a framework written for crawling websites and extracting data. It has many features built in, is highly versatile, and is an easy-to-learn crawler framework.
pip install scrapy -i https://pypi.douban.com/simple
On Windows, installing Scrapy also requires Twisted and pywin32. For Twisted, download the wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, then open cmd in the download directory and run pip install Twisted-19.2.1-cp36-cp36m-win_amd64.whl. For pywin32, simply run pip install pywin32 -i https://pypi.douban.com/simple
Create a project: from the command line, run scrapy startproject <project name>
Create a regular spider file: cd into the project and run scrapy genspider <spider name> <initial url> (the url can be anything; you can change it yourself later in the spider file)
Run a regular spider: scrapy crawl <spider name>
There are optional flags when starting the project: --nolog suppresses the log output, and -o <filename> writes the results to a file with the given name; in most cases they are not needed.
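For example, assuming the spider created below is named project and the results should be exported to a csv file, a typical run could look like:

scrapy crawl project --nolog -o result.csv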
After the project is created, it has the directory structure shown below
└─myproject
   │  items.py        # defines the fields submitted to the pipeline
   │  middlewares.py  # middleware file
   │  pipelines.py    # pipeline file
   │  settings.py     # configuration file
   │  __init__.py
   │
   ├─spiders          # folder for spider files
   │  │  project.py   # the spider file you created
   │  │  __init__.py
settings.py (only the commonly used settings are picked out)
# User-Agent disguise field
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
# log level
LOG_LEVEL = 'INFO'
# robots.txt protocol
ROBOTSTXT_OBEY = False
# number of concurrent requests
CONCURRENT_REQUESTS = 32
# enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
   'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
}
# enable the item pipelines
ITEM_PIPELINES = {
   'wangyi.pipelines.WangyiPipeline': 300,
}
The spider file
# -*- coding: utf-8 -*-
import scrapy


class ProjectSpider(scrapy.Spider):
    name = 'project'
    # domains the spider is allowed to crawl; usually left commented out
    # allowed_domains = ['www.xxx.com']
    # the initial url(s) the spider starts crawling from
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        '''
        Callback invoked after the url has been fetched
        :param response: the response object
        :return:
        '''
        pass
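As a rough sketch of what a filled-in spider can look like (the target site, the CSS selectors, and the text/author fields are assumptions for illustration; the fields are assumed to be declared in items.py as in the sketch further below):

# -*- coding: utf-8 -*-
import scrapy

from myproject.items import MyprojectItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # extract every quote block and submit it to the pipeline
        for quote in response.css('div.quote'):
            item = MyprojectItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            yield item
        # follow the "next page" link, if any, reusing the same callback
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)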
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyprojectItem(scrapy.Item):
    '''
    Fields submitted to the pipeline; just define each one with scrapy.Field()
    '''
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
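A sketch of what the field definitions could look like; the text and author fields are assumptions chosen to match the spider sketch above:

import scrapy


class MyprojectItem(scrapy.Item):
    # each field submitted to the pipeline is declared with scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()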
middlewares.py
Since two classes are auto-generated in the middleware file, only the commonly used one is introduced here, and only its commonly used methods.
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


# this spider middleware class is usually not used
class MyprojectSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyprojectDownloaderMiddleware(object):

    def process_request(self, request, spider):
        '''
        Intercepts every request that goes out normally, i.e. processing done before the request is sent
        :param request: the request object
        :param spider: the instantiated spider object from the spider file; its attributes and methods can be accessed here
        :return:
        '''
        return None

    def process_response(self, request, response, spider):
        '''
        Intercepts every response; the response object can be processed here
        :param request: the request object
        :param response: the response object
        :param spider: the instantiated spider object from the spider file; its attributes and methods can be accessed here
        :return:
        '''
        return response

    def process_exception(self, request, exception, spider):
        '''
        Intercepts every request whose access raised an exception
        :param request: the request object
        :param spider: the instantiated spider object from the spider file; its attributes and methods can be accessed here
        :return:
        '''
        pass
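As an illustration of process_request, here is a minimal sketch of a downloader middleware that rotates the User-Agent header; the UA list is an assumption for demonstration, and the class still has to be enabled in DOWNLOADER_MIDDLEWARES in settings.py before it takes effect:

import random


class RandomUserAgentMiddleware(object):
    # a small pool of User-Agent strings; assumed values for illustration only
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None

    def process_response(self, request, response, spider):
        # responses can be inspected or replaced here; returning the response passes it on unchanged
        return response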
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MyprojectPipeline(object):
    def process_item(self, item, spider):
        # The submitted data arrives here as item, which can be treated like a dict:
        # the keys are the fields defined in items.py and the values are the data yielded from the spider file
        return item
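A minimal sketch of a pipeline that writes every item to a local file; the data.txt file name and the text/author fields are assumptions matching the earlier sketches, and the class must be registered in ITEM_PIPELINES in settings.py before it runs:

class FileWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.fp = open('data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item behaves like a dict whose keys are the fields defined in items.py
        self.fp.write('%s: %s\n' % (item['author'], item['text']))
        return item

    def close_spider(self, spider):
        # called once when the spider closes: release the file handle
        self.fp.close()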