一. Introduction to the Scrapy Framework
1. Overview
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is very widely used. Thanks to the power of the framework, a user only needs to implement a few custom modules to easily build a spider that scrapes web pages and images. Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework to handle network communication, which speeds up downloads without requiring you to write your own asynchronous code, and it exposes a variety of middleware hooks so you can flexibly meet all kinds of requirements.
The architecture diagram is as follows:
Workflow:
Scrapy Engine: responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: accepts the Requests sent over by the Engine, arranges and enqueues them in a defined order, and hands them back when the Engine asks for them.
Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them on to the Spider for processing.
Spider: processes all Responses, parses and extracts data from them to fill the fields the Item needs, and submits any URLs that should be followed back to the Engine, where they re-enter the Scheduler.
Item Pipeline: the place where Items produced by the Spider are post-processed (detailed parsing, filtering, storage, and so on).
Downloader Middlewares: a component you can treat as a customizable extension point for the download step.
Spider Middlewares: a component you can think of as a customizable extension point for the communication between the Engine and the Spider (for example the Responses going into the Spider and the Requests coming out of it).
A minimal code sketch of where these components surface in spider code is shown below.
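To make the roles above concrete, here is a minimal sketch, not part of the original project: the spider name demo and the example.com URL are placeholders. The parse callback receives a Response fetched by the Downloader; yielded items flow through the Engine to the Item Pipeline; yielded Requests go back through the Engine to the Scheduler.

import scrapy


class DemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the component roles above.
    name = 'demo'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # `response` was fetched by the Downloader and handed back through the Engine.
        yield {'title': response.xpath('//title/text()').get()}  # item -> Item Pipeline
        for href in response.xpath('//a/@href').getall():
            # New Requests go back to the Engine and are queued by the Scheduler.
            yield scrapy.Request(response.urljoin(href), callback=self.parse)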
2. Usage Steps
1. Create a project (scrapy startproject xxx): create a new spider project.
2. Define the targets (edit items.py): specify the data you want to scrape.
3. Write the spider (spiders/xxspider.py): write the spider and start crawling pages.
4. Store the content (pipelines.py): design a pipeline that stores the scraped data.
3. Installation
Windows (Python 2 / 3):
Upgrade pip: pip install --upgrade pip
Install the Scrapy framework via pip: pip install Scrapy
Ubuntu 9.10 or later (Python 2 / 3):
Install the non-Python dependencies: sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
Install the Scrapy framework via pip: sudo pip install scrapy
After installing, just type scrapy at a command prompt; if the correct help information appears, the installation succeeded.
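If you prefer to verify from Python rather than from the command line, a quick check (the exact version number will of course depend on the release pip installed) is:

import scrapy
print(scrapy.__version__)  # e.g. something like '1.5.1', depending on the installed release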
二. Quick Start
a. Create a new Scrapy project. Go into the directory where you want the project to live and run the following command:
scrapy startproject qsbk  # qsbk is the project name
Here qsbk is the project name; you will see that a qsbk folder is created with roughly the following structure. The main files are:
scrapy.cfg: the project's configuration file
qsbk/: the project's Python module; the code is imported from here
qsbk/items.py: the project's item (target data) definitions
qsbk/pipelines.py: the project's pipeline file
qsbk/settings.py: the project's settings file
qsbk/spiders/: the directory that stores the spider code
b. Create an example spider
Use the following command to create a spider; first cd into the qsbk directory:
scrapy genspider qsbk_spider "qiushibaike.com"  # qsbk_spider is the spider name (it must differ from the project name); qiushibaike.com is the domain to crawl
This creates a spider named qsbk_spider whose crawling is restricted to the qiushibaike.com domain.
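For reference, the file that genspider produces typically looks roughly like the skeleton below; the exact header comment and the generated start_urls value vary by Scrapy version, so treat this as a sketch rather than the exact output:

# -*- coding: utf-8 -*-
import scrapy


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        pass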
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for qsbk project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qsbk'

SPIDER_MODULES = ['qsbk.spiders']
NEWSPIDER_MODULE = 'qsbk.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qsbk (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'qsbk.middlewares.QsbkSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'qsbk.middlewares.QsbkDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'qsbk.pipelines.QsbkPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
qsbk_spider.py
import scrapy


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        # SelectorList
        duanzidivs = response.xpath('//div[@id="content-left"]/div')
        for duanzidiv in duanzidivs:
            # Selector
            authors = duanzidiv.xpath('.//h2/text()').get().strip()
            content = duanzidiv.xpath('.//div[@class="content"]/span/text()').getall()
            content = "".join(content)
            print(authors, content)
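The #SelectorList / #Selector comments above are worth expanding on: xpath() returns a SelectorList, indexing it gives a single Selector, get() returns the first matched text as a str, and getall() returns every match as a list. A small self-contained sketch follows; the HTML snippet is made up purely for illustration:

from scrapy.selector import Selector

# Fabricated HTML, just to show the difference between get() and getall().
html = ('<div id="content-left"><div><h2> Alice </h2>'
        '<div class="content"><span>line one</span><span>line two</span></div></div></div>')
sel = Selector(text=html)
div = sel.xpath('//div[@id="content-left"]/div')[0]

print(div.xpath('.//h2/text()').get().strip())
# 'Alice'  -> get() returns the first matching text as a str (or None if nothing matches)
print(div.xpath('.//div[@class="content"]/span/text()').getall())
# ['line one', 'line two']  -> getall() returns every matching text as a list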
Create start.py in the qsbk directory:
from scrapy import cmdline

cmdline.execute(["scrapy", "crawl", "qsbk_spider"])
Run it and you will see the output.
c. Saving the data
qsbk_spider.py
import scrapy

from qsbk.items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        # SelectorList
        duanzidivs = response.xpath('//div[@id="content-left"]/div')
        items = []
        for duanzidiv in duanzidivs:
            # Selector
            author = duanzidiv.xpath('.//h2/text()').get().strip()
            content = duanzidiv.xpath('.//div[@class="content"]/span/text()').getall()
            content = "".join(content)
            item = QsbkItem(author=author, content=content)
            yield item  # or comment out this yield and uncomment the two lines below
            # items.append(item)
            # return items

# Note: the extracted data is a Selector or a SelectorList.
# getall() returns all the matched text as a list; get() returns the first match as a str.
settings.py
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'qsbk.pipelines.QsbkPipeline': 300,
}  # uncomment this block (it is commented out in the generated settings.py)
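The number 300 is a priority: Scrapy runs the enabled pipelines in ascending order of this value (conventionally somewhere in the 0-1000 range), so items pass through lower-numbered pipelines first. A sketch with a second, purely hypothetical pipeline class:

ITEM_PIPELINES = {
    'qsbk.pipelines.CleanPipeline': 200,  # hypothetical extra pipeline; runs first because 200 < 300
    'qsbk.pipelines.QsbkPipeline': 300,   # the pipeline defined in this post; runs second
}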
pipelines.py
import json


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open("duanzi.json", "w", encoding="utf-8")

    def open_spider(self, spider):
        print("spider started")

    def process_item(self, item, spider):
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("spider finished")

# open_spider: called when the spider is opened
# process_item: called every time the spider passes an item down
# close_spider: called when the spider is closed
# To activate the pipeline, set ITEM_PIPELINES in settings.py
items.py
import scrapy


class QsbkItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
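QsbkItem behaves much like a dict whose keys are restricted to the declared Fields, which is why dict(item) works in the pipeline above. A quick illustration (the sample values are made up):

from qsbk.items import QsbkItem

item = QsbkItem(author="someone", content="a short joke")
print(item['author'])        # field access works like a dict
print(dict(item))            # {'author': 'someone', 'content': 'a short joke'}
item['content'] = "edited"   # assigning to a declared field is fine
# item['age'] = 18           # would raise KeyError: only declared fields can be set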
Create start.py in the qsbk directory (same as before):
from scrapy import cmdline

cmdline.execute(["scrapy", "crawl", "qsbk_spider"])
d. A better way to save the data
pipelines.py
from scrapy.exporters import JsonItemExporter


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print("spider started")

    def process_item(self, item, spider):
        # item_json = json.dumps(dict(item), ensure_ascii=False)
        # self.fp.write(item_json + "\n")
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print("spider finished")
Alternatively:
from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def open_spider(self, spider):
        print("spider started")

    def process_item(self, item, spider):
        # item_json = json.dumps(dict(item), ensure_ascii=False)
        # self.fp.write(item_json + "\n")
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print("spider finished")
Everything else stays the same.
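The practical difference between the two exporters is the file layout: JsonItemExporter buffers items and writes a single JSON array (hence the start_exporting()/finish_exporting() calls that emit the surrounding brackets), while JsonLinesItemExporter writes one JSON object per line, which is friendlier for large crawls and for streaming reads. A small sketch of reading the data back, assuming the JSON Lines variant was used:

import json

# duanzi.json as written by JsonLinesItemExporter: one JSON object per line
with open("duanzi.json", encoding="utf-8") as fp:
    duanzi = [json.loads(line) for line in fp if line.strip()]
print(len(duanzi))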
e. Following the next page (repeated requests)
qsbk_spider.py
import scrapy

from qsbk.items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_domain = "https://www.qiushibaike.com"

    def parse(self, response):
        # SelectorList
        duanzidivs = response.xpath('//div[@id="content-left"]/div')
        for duanzidiv in duanzidivs:
            # Selector
            author = duanzidiv.xpath('.//h2/text()').get().strip()
            content = duanzidiv.xpath('.//div[@class="content"]/span/text()').getall()
            content = "".join(content)
            item = QsbkItem(author=author, content=content)
            yield item

        # Follow the "next page" link until there is none left.
        next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
        if not next_url:
            return
        else:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)
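As a side note, the hard-coded base_domain can be avoided with response.urljoin(), which resolves a relative href against the URL of the page that produced the response. An equivalent sketch of the tail of parse():

        next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)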
To be continued.