Web Scraping: the Scrapy Framework

Overview

Scrapy is a framework written for crawling websites and extracting data. It has many built-in features, is highly versatile, and is easy to learn.

Installing Scrapy

pip install scrapy -i https://pypi.douban.com/simple

On Windows, installing Scrapy also requires Twisted and pywin32. To install Twisted, download the wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, then open cmd, cd into the download directory, and run pip install Twisted-19.2.1-cp36-cp36m-win_amd64.whl. To install pywin32, simply run pip install pywin32 -i https://pypi.douban.com/simple
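In other words, the two Windows-specific installs come down to something like this (the Twisted wheel file name depends on the exact file you downloaded):

pip install Twisted-19.2.1-cp36-cp36m-win_amd64.whl
pip install pywin32 -i https://pypi.douban.com/simple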

Usage

Create a project: from the command line, run scrapy startproject <project name>

Create a regular spider file: cd into the project and run scrapy genspider <spider file name> <initial url> (the initial url can be anything; you can set it yourself later in the spider file)

Run a regular spider: scrapy crawl <spider file name>

When starting, there are optional arguments: --nolog suppresses log output, and -o <file name> writes the results to the given file; in most cases these are not needed
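For example, a typical sequence of commands might look like the following (the project name myproject and the spider name project match the example project shown below; result.json is just an example output file):

scrapy startproject myproject          # create the project
cd myproject
scrapy genspider project www.xxx.com   # create a spider file named project.py
scrapy crawl project --nolog           # run the spider without log output
scrapy crawl project -o result.json    # run the spider and write the results to result.json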

The created project has the directory structure shown below

└─myproject
    │  items.py           # defines the fields that are submitted to the pipeline
    │  middlewares.py     # middleware file
    │  pipelines.py       # pipeline file
    │  settings.py        # configuration file
    │  __init__.py
    │
    ├─spiders             # spiders folder
    │  │  project.py      # the spider file you create
    │  │  __init__.py

 

How each file is used

settings.py (only the commonly used settings are listed here)

# User-Agent spoofing
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'

# Log level
LOG_LEVEL = 'INFO'

# robots.txt protocol setting
ROBOTSTXT_OBEY = False

# Number of concurrent requests
CONCURRENT_REQUESTS = 32

# Enable downloader middleware
DOWNLOADER_MIDDLEWARES = {
   'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
}

# Enable item pipelines
ITEM_PIPELINES = {
   'wangyi.pipelines.WangyiPipeline': 300,
}

 

The spider file

# -*- coding: utf-8 -*-
import scrapy


class ProjectSpider(scrapy.Spider):
    name = 'project'
    # Domains the spider is allowed to crawl; usually left commented out
    # allowed_domains = ['www.xxx.com']
    # The spider's initial urls
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        '''
        Callback invoked for each fetched url
        :param response: the response object
        :return:
        '''
        pass
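As a sketch of what a filled-in spider might look like (the target site, the xpath selectors and the title/link fields are made up for illustration), parse typically extracts data with response.xpath or response.css, yields items to the pipeline, and yields new requests for further pages:

# -*- coding: utf-8 -*-
import scrapy
from myproject.items import MyprojectItem   # hypothetical fields defined in items.py


class ProjectSpider(scrapy.Spider):
    name = 'project'
    start_urls = ['http://www.xxx.com/']   # placeholder url, as above

    def parse(self, response):
        # Assumed page structure: each entry is a div with class "item"
        for row in response.xpath('//div[@class="item"]'):
            item = MyprojectItem()
            item['title'] = row.xpath('./h2/a/text()').extract_first()
            item['link'] = row.xpath('./h2/a/@href').extract_first()
            # Hand the item over to the pipeline
            yield item

        # Follow a "next page" link, if any, using the same callback
        next_page = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)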

 

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyprojectItem(scrapy.Item):
    '''
    Fields submitted to the pipeline; each one is defined with scrapy.Field()
    '''
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
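For instance, the title and link fields used in the spider sketch above could be declared like this (the field names are illustrative):

import scrapy


class MyprojectItem(scrapy.Item):
    # Every field the spider fills in has to be declared here
    title = scrapy.Field()
    link = scrapy.Field()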

 

middlewares.py

The middleware file contains two auto-generated classes. Only the more commonly used one is covered here, and only its commonly used methods.

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# This class is generally not used
class MyprojectSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyprojectDownloaderMiddleware(object):

    def process_request(self, request, spider):
        '''
        Intercepts every request that goes out normally, i.e. requests are processed here before they are sent
        :param request: the request object
        :param spider: the instance of the spider class; its attributes and methods can be accessed
        :return:
        '''
        return None

    def process_response(self, request, response, spider):
        '''
        Intercepts every response; the response object can be processed here
        :param request: the request object
        :param response: the response object
        :param spider: the instance of the spider class; its attributes and methods can be accessed
        :return:
        '''
        return response

    def process_exception(self, request, exception, spider):
        '''
        Intercepts requests whose handling raised an exception, so failed requests can be fixed up and retried here
        :param request: the request object
        :param spider: the instance of the spider class; its attributes and methods can be accessed
        :return:
        '''
        pass
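As a sketch of how these downloader middleware hooks are commonly used (the user-agent list and the proxy address below are placeholders, not part of the original project): process_request can rotate the User-Agent of every outgoing request, and process_exception can attach a proxy and resend a failed request.

# -*- coding: utf-8 -*-
import random


class MyprojectDownloaderMiddleware(object):
    # Placeholder values for illustration only
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
    ]
    proxy_list = ['http://127.0.0.1:8888']   # hypothetical proxy pool

    def process_request(self, request, spider):
        # Rotate the User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None

    def process_response(self, request, response, spider):
        # Pass the response through unchanged (a new response object could be returned here instead)
        return response

    def process_exception(self, request, exception, spider):
        # On failure, attach a proxy and resend the request
        request.meta['proxy'] = random.choice(self.proxy_list)
        return request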

 

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MyprojectPipeline(object):
    def process_item(self, item, spider):
        # The data yielded by the spider arrives here as item, which behaves like a dict: its keys are the fields defined in items.py and its values are the data submitted from the spider file
        return item
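A common pattern is to open a file (or database connection) once in open_spider, write every item in process_item, and release the resource in close_spider. Below is a minimal sketch, assuming the title and link fields from the earlier examples and a made-up output file name:

# -*- coding: utf-8 -*-


class MyprojectPipeline(object):
    fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.fp = open('result.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item yielded by the spider
        self.fp.write('%s: %s\n' % (item['title'], item['link']))
        # Return the item so any lower-priority pipelines can also process it
        return item

    def close_spider(self, spider):
        # Called once when the spider closes: release the file handle
        self.fp.close()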