12. The Scrapy Framework

 

I. Introduction to the Scrapy Framework

1. Overview

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is very widely used.

That is the power of a framework: you only need to customize a few modules to easily build a crawler that scrapes page content and images of all kinds.

Scrapy uses the Twisted (/ˈtwɪstɪd/) asynchronous networking framework to handle network communication. This speeds up downloads without you having to implement an asynchronous framework yourself, and it provides a variety of middleware interfaces so that all kinds of requirements can be handled flexibly.

The framework follows the standard Scrapy architecture diagram; its components and data flow are as follows:

Scrapy Engine: responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: accepts the Requests sent over by the Engine, organizes and enqueues them in a defined order, and hands them back to the Engine when the Engine asks for them.

Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them on to the Spider for processing.

Spider: processes all Responses, analyzes and extracts data from them, obtains the data needed for the Item fields, and submits any URLs that need to be followed to the Engine, which feeds them into the Scheduler again.

Item Pipeline: the place where Items obtained from the Spider are post-processed (detailed analysis, filtering, storage, and so on).

Downloader Middlewares: a component you can use to customize and extend the download functionality.

Spider Middlewares: a component for customizing and extending the communication between the Engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it).
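
To make the data flow concrete, here is a rough, illustrative sketch (the spider name and URL are made up, not part of any real project) of how a spider interacts with these components: yielded Requests go back through the Engine into the Scheduler, Responses come back from the Downloader, and yielded items go to the Item Pipeline.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/"]   # hypothetical start page

    def parse(self, response):
        # Responses fetched by the Downloader are delivered here by the Engine.
        for href in response.xpath("//a/@href").getall():
            # A yielded Request goes back to the Engine, which enqueues it in the Scheduler.
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
        # A yielded item is sent through the Engine to the Item Pipeline.
        yield {"url": response.url}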

2. Usage steps

1. Create a project (scrapy startproject xxx): create a new crawler project.
2. Define the targets (edit items.py): specify the data you want to scrape.
3. Write the spider (spiders/xxspider.py): write the spider and start crawling pages.
4. Store the content (pipelines.py): design a pipeline to store the scraped data.

3. Installation

Windows (Python 2 / 3):
Upgrade pip: pip install --upgrade pip
Install the Scrapy framework via pip: pip install Scrapy

Ubuntu 9.10 or later (Python 2 / 3):
Install the non-Python dependencies: sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
Install the Scrapy framework via pip: sudo pip install scrapy
After installing, type scrapy in a terminal; if it prints its usage information correctly, the installation succeeded.
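
A quick way to double-check the install from Python itself (assuming the install above succeeded), in addition to running scrapy on the command line:

import scrapy

print(scrapy.__version__)   # prints the installed Scrapy version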

II. Quick Start

a. Create a new Scrapy project. Change into a directory of your choice and run the following command:

scrapy startproject qsbk    # [qsbk] is the project name
Here qsbk is the project name; a qsbk folder will be created with roughly the following directory structure:
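
A freshly generated project typically looks like this (the exact listing may vary slightly between Scrapy versions):

qsbk/
    scrapy.cfg
    qsbk/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py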

A brief description of what each main file does:

scrapy.cfg: the project's configuration file

qsbk/: the project's Python module; your code is imported from here

qsbk/items.py: the project's item definitions (the target data)

qsbk/pipelines.py: the project's pipeline file

qsbk/settings.py: the project's settings file

qsbk/spiders/: the directory that holds the spider code

b. Create an example spider

Create a spider with the following command (first cd into the qsbk directory):
scrapy genspider qsbk_spider "qiushibaike.com"    # qsbk_spider is the spider name (it must not be the same as the project name); qiushibaike.com is the domain to crawl
This creates a spider named qsbk_spider whose crawling is restricted to pages under the qiushibaike.com domain; the generated file looks roughly like the sketch below.
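
The generated spiders/qsbk_spider.py will look roughly like this (the start_urls value is whatever genspider derives from the domain; it is edited in the next step):

import scrapy


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        pass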

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for qsbk project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qsbk'

SPIDER_MODULES = ['qsbk.spiders']
NEWSPIDER_MODULE = 'qsbk.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qsbk (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}


DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}


# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'qsbk.middlewares.QsbkSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'qsbk.middlewares.QsbkDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'qsbk.pipelines.QsbkPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

qsbk_spider.py

import scrapy


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        # response.xpath() returns a SelectorList
        duanzidivs = response.xpath('//div[@id="content-left"]/div')
        for duanzidiv in duanzidivs:
            # each duanzidiv is a Selector
            author = duanzidiv.xpath('.//h2/text()').get().strip()
            content = duanzidiv.xpath('.//div[@class="content"]/span/text()').getall()
            content = "".join(content)
            print(author, content)

Create start.py in the qsbk directory:

from scrapy import cmdline

cmdline.execute(["scrapy","crawl","qsbk_spider"])

Run start.py and you will see the output. (This is equivalent to running scrapy crawl qsbk_spider on the command line from the project root.)

c. Saving the data

qsbk_spider.py

import scrapy
from qsbk.items import QsbkItem

class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        # response.xpath() returns a SelectorList
        duanzidivs = response.xpath('//div[@id="content-left"]/div')
        items = []
        for duanzidiv in duanzidivs:
            # each duanzidiv is a Selector
            author = duanzidiv.xpath('.//h2/text()').get().strip()
            content = duanzidiv.xpath('.//div[@class="content"]/span/text()').getall()
            content = "".join(content)
            item = QsbkItem(author=author, content=content)
            yield item  # alternatively, comment this line out and uncomment the two lines below
            # items.append(item)
        # return items

# Note: the extracted data is a Selector or a SelectorList.
# getall() returns all matched text as a list; get() returns only the first matched text as a str.
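
A small illustration of that difference, runnable on its own with scrapy's Selector (the HTML snippet here is made up just for the demo):

from scrapy.selector import Selector

sel = Selector(text="<div><span>first</span><span>second</span></div>")
print(sel.xpath("//span/text()").get())      # 'first'  -> first match as a str (None if nothing matches)
print(sel.xpath("//span/text()").getall())   # ['first', 'second']  -> every match as a list of str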

settings.py

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {'qsbk.pipelines.QsbkPipeline': 300}  # uncomment this setting (it is commented out by default); 300 is the pipeline's priority

pipelines.py

import json


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "w", encoding="utf-8")

    def open_spider(self, spider):
        print("spider opened")

    def process_item(self, item, spider):
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
        print("spider closed")

# open_spider: called when the spider is opened
# process_item: called every time the spider passes an item to the pipeline
# close_spider: called when the spider is closed
# To activate the pipeline, set ITEM_PIPELINES in settings.py

items.py

import scrapy


class QsbkItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
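
QsbkItem behaves much like a dict, which is why the pipeline above can call dict(item) before json.dumps. A short example (the field values here are made up):

from qsbk.items import QsbkItem

item = QsbkItem(author="someone", content="some text")
print(item["author"])   # fields are accessed like dict keys
print(dict(item))       # {'author': 'someone', 'content': 'some text'}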

Create start.py in the qsbk directory (same as before):

from scrapy import cmdline

cmdline.execute(["scrapy","crawl","qsbk_spider"])

d. Improving how the data is saved

pipelines.py

from scrapy.exporters import JsonItemExporter


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print("spider opened")

    def process_item(self, item, spider):
        # item_json = json.dumps(dict(item), ensure_ascii=False)
        # self.fp.write(item_json + "\n")
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print("spider closed")

Or:

from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):

    def __init__(self):
        self.fp = open("duanzi.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def open_spider(self, spider):
        print("spider opened")

    def process_item(self, item, spider):
        # item_json = json.dumps(dict(item), ensure_ascii=False)
        # self.fp.write(item_json + "\n")
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print("spider closed")

Everything else stays the same. The practical difference: JsonItemExporter produces a single JSON array (it writes "[" in start_exporting and "]" in finish_exporting, so the file is only valid JSON once exporting finishes), whereas JsonLinesItemExporter writes one JSON object per line as each item arrives, so the output stays usable even if the crawl is interrupted.

e. Following the next page with repeated requests

qsbk_spider.py

import scrapy
from qsbk.items import QsbkItem

class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_domain="https://www.qiushibaike.com"

    def parse(self, response):
        # response.xpath() returns a SelectorList
        duanzidivs = response.xpath('//div[@id="content-left"]/div')
        items = []
        for duanzidiv in duanzidivs:
            # each duanzidiv is a Selector
            author = duanzidiv.xpath('.//h2/text()').get().strip()
            content = duanzidiv.xpath('.//div[@class="content"]/span/text()').getall()
            content = "".join(content)
            item = QsbkItem(author=author, content=content)
            yield item  # alternatively, comment this line out and uncomment the two lines below
        #     items.append(item)
        # return items
        # the last <li> in the pagination bar links to the next page; on the final
        # page that link is absent, so next_url is None and the crawl stops
        next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
        if not next_url:
            return
        else:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)
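
An equivalent variant, assuming the same page structure, is to let response.urljoin() resolve the relative href instead of hard-coding base_domain:

# inside parse(), replacing the last few lines above
next_url = response.xpath('//ul[@class="pagination"]/li[last()]/a/@href').get()
if next_url:
    # urljoin resolves the relative link against the URL of the current response
    yield scrapy.Request(response.urljoin(next_url), callback=self.parse)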
