Python — Scrapy framework overview, built-in selectors, pipeline files, and the Spider class

The Self-Cultivation of a Crawler, Part 4

1. Introduction to the Scrapy Framework

  • Scrapy is an application framework written in pure Python for crawling websites and extracting structured data; it is very widely used.

  • That is the power of a framework: you only need to customize a few modules to easily build a crawler that scrapes web pages and all kinds of images. Extremely convenient.

  • Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework (its main rival being Tornado) to handle network communication. This speeds up downloads without you having to implement asynchrony yourself, and it exposes all kinds of middleware hooks so you can flexibly satisfy different requirements.

Scrapy architecture diagram (the green lines show the data flow):

  • Scrapy Engine: handles the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.

  • Scheduler: accepts Request objects sent over by the engine, organizes and enqueues them in a certain order, and hands them back when the engine asks for them.

  • Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which passes them on to the Spider for processing.

  • Spider: processes all Responses, analyzes and extracts data from them, obtains the data needed for the Item fields, and submits any URLs that need following back to the engine, where they enter the Scheduler again.

  • Item Pipeline: where the Items obtained from the Spider are post-processed (detailed analysis, filtering, storage, and so on).

  • Downloader Middlewares: components you can think of as hooks for customizing and extending the download functionality.

  • Spider Middlewares: components for customizing and extending the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).

Everything starts from the Spider we write. We send requests to the engine (Scrapy Engine); the engine hands the incoming Requests to the Scheduler, which enqueues them. When the engine needs them, they are dequeued first-in-first-out and handed to the Downloader. Once the Downloader finishes, it gives the Response back to the engine, and the engine passes it to our spider code. By processing the Response, we hand any URLs that still need crawling back to the engine (repeating the steps above), and send whatever needs to be saved to the Item Pipeline for processing.

Building a Scrapy crawler takes four steps:

  • Create a project (scrapy startproject xxx): create a new crawler project (the commands are sketched right after this list)
  • Define the target (edit items.py): specify the fields you want to scrape
  • Write the spider (spiders/xxspider.py): write the spider and start crawling pages
  • Store the content (pipelines.py): design a pipeline to store the scraped content
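
A minimal command-line walkthrough of these four steps (the project and spider names below are just placeholders):

scrapy startproject mySpider            # 1. create a new project
cd mySpider
# 2. edit mySpider/items.py to define the fields you want to scrape
scrapy genspider itcast "itcast.cn"     # 3. generate a spider skeleton under spiders/
scrapy crawl itcast                     # 4. run it; the pipeline in pipelines.py stores the output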

A start.py file for launching the project from PyCharm:

from scrapy import cmdline
cmdline.execute("scrapy crawl your_spider_name".split())
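
The same thing can be done from a terminal; the standard -o flag additionally dumps the scraped items to a file (your_spider_name is a placeholder):

scrapy crawl your_spider_name -o items.json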

2. Scrapy Selectors

Scrapy Selectors have built-in support for XPath and CSS Selector expressions.

A Selector has four basic methods; xpath() is still the most commonly used:

  • xpath(): takes an XPath expression and returns a selector list of all nodes matching that expression
  • extract(): serializes the matched nodes as unicode strings and returns them as a list
  • css(): takes a CSS expression and returns a selector list of all nodes matching that expression
  • re(): extracts data using the given regular expression and returns a list of unicode strings

Example XPath expressions and their meanings:

/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of the <title> element above
//td: selects all <td> elements
//div[@class="mine"]: selects all <div> elements that have the attribute class="mine"
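
A minimal sketch of the four Selector methods in action, assuming we are inside a parse() callback (or a `scrapy shell <url>` session) and the page contains the elements above:

# xpath() returns a selector list; extract() turns it into a list of unicode strings
title = response.xpath('/html/head/title/text()').extract()[0]
cells = response.xpath('//td/text()').extract()
# css() accepts CSS expressions; ::attr() and ::text are Scrapy extensions to CSS
links = response.css('div.mine a::attr(href)').extract()
# re() applies a regular expression to the matched text and returns plain strings
years = response.xpath('//td/text()').re(r'\d{4}')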

For the rest, see the previous two posts.

3. Item Pipeline

After an Item has been collected in a Spider, it is passed to the Item Pipeline, and the pipeline components process it in the order in which they are defined.

Each Item Pipeline is a Python class implementing a few simple methods, for example deciding whether an Item is dropped or stored. Typical uses of an item pipeline include:

  • validating scraped data (checking that an item contains certain fields, e.g. a name field)
  • checking for (and dropping) duplicates — see the sketch right after this list
  • saving the scraped results to a file or a database
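
As a concrete illustration of the de-duplication case, here is a minimal sketch (the 'name' field and the class name are assumptions, not taken from the examples below):

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.names_seen = set()    # remember every name we have already stored

    def process_item(self, item, spider):
        # drop the item if this name was seen before, otherwise pass it on
        if item['name'] in self.names_seen:
            raise DropItem("Duplicate item found: %r" % item)
        self.names_seen.add(item['name'])
        return item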

Writing an item pipeline

Writing an item pipeline is simple: an item pipeline component is a standalone Python class in which the process_item() method must be implemented:

import json

class XingePipeline(object):
    def __init__(self):
        # Optional: parameter initialization and the like.
        # __init__ and close_spider each run only once; process_item runs once
        # for every item that comes through.
        self.file = open('teacher.json', 'wb')  # open the output file

    def process_item(self, item, spider):
        # item (Item object) – the scraped item
        # spider (Spider object) – the spider that scraped it
        # This method must be implemented; every item pipeline component calls it.
        # It must return an Item object; dropped items are not processed by any
        # later pipeline components.
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content.encode('utf-8'))  # the file was opened in binary mode
        return item

    def open_spider(self, spider):
        # spider (Spider object) – the spider that was opened
        # Optional: called when the spider is opened.
        pass

    def close_spider(self, spider):
        # spider (Spider object) – the spider that was closed
        # Optional: called when the spider is closed.
        self.file.close()

To enable the pipeline, you must uncomment ITEM_PIPELINES in the settings file and register your class there (lower values run earlier):

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "mySpider.pipelines.XingePipeline": 300,
}

4. The Spider Class

A Spider class defines how a certain site (or group of sites) is crawled, including the crawling actions (e.g. whether to follow links) and how to extract structured data from page content (scraping items). In other words, the Spider is where you define the crawling behaviour and the parsing of a particular page (or set of pages).

class scrapy.Spider is the most basic class; every spider you write must inherit from it.

The main functions involved, in calling order, are:

__init__(): initializes the spider name and the start_urls list

start_requests(), which calls make_requests_from_url(): generates the Request objects that Scrapy downloads and returns as responses

parse(): parses the response and returns Items or Requests (with a callback specified). Items are handed to the Item Pipeline for persistence, while Requests are downloaded by Scrapy and handled by the specified callback (parse() by default), looping until all the data has been processed.

Main attributes and methods (a minimal spider pulling them together follows this list)

  • name

    A string defining the spider's name.

    For example, a spider that crawls mywebsite.com would usually be named mywebsite.

  • allowed_domains

    An optional list of the domains the spider is allowed to crawl.

  • start_urls

    A tuple/list of initial URLs. When no particular URLs are specified, the spider starts crawling from this list.

  • start_requests(self)

    This method must return an iterable containing the first Requests the spider uses to crawl (the default implementation builds them from the URLs in start_urls).

    It is called when the spider is opened for crawling and no particular URLs are specified.

  • parse(self, response)

    The default callback used for Requests whose responses come back without a callback specified. It processes the page's response and generates Items or further Request objects.

  • log(self, message[, level, component])

    Records a log message via the scrapy.log.msg() method. See logging for more details.
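
A minimal spider sketch tying the attributes and methods above together (the names and URL are hypothetical):

import scrapy

class MywebsiteSpider(scrapy.Spider):
    name = "mywebsite"                        # string identifying the spider
    allowed_domains = ["mywebsite.com"]       # optional crawl restriction
    start_urls = ["http://mywebsite.com/"]    # the default start_requests() uses these

    def parse(self, response):
        # default callback: extract data and/or yield follow-up Requests
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)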

How the parse method works

1. Because yield is used rather than return, the parse function is treated as a generator. Scrapy fetches the results generated inside parse one by one and checks what type each result is.
2. If the result is a Request, it is added to the crawl queue; if it is an Item, it is handed to a pipeline; any other type produces an error.
3. When Scrapy takes a Request from the first part, it does not send it immediately; it just puts it in the queue and keeps pulling from the generator.
4. Once the Requests of the first part are exhausted, it fetches the Items of the second part; each Item obtained is sent to the corresponding pipeline.
5. The parse() method is assigned to the Request as its callback, i.e. parse() is designated to handle those requests: scrapy.Request(url, callback=self.parse).
6. After scheduling, each Request object produces a scrapy.http.Response object that is sent back to parse(), until no Requests remain in the scheduler (a recursive idea).
7. Once everything has been consumed, parse() finishes, and the engine carries out the corresponding operations according to the queue and the pipelines.
8. Before the program obtains the items of each page, it first finishes processing all the requests already in the request queue, and only then extracts the items.
9. The Scrapy engine and scheduler take care of all of this from start to finish.

Tips

Why use yield?

The main effect of yield is to turn the function into a generator.

With yield you can both return item data and send the next request.
With return, the function would simply end.

If you had to return a list containing hundreds or thousands of elements, that would consume a lot of memory and time; yield alleviates this.
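
A toy (non-Scrapy) illustration of the difference:

def squares_return(n):
    return [i * i for i in range(n)]   # the whole list is built in memory before returning

def squares_yield(n):
    for i in range(n):
        yield i * i                    # values are produced one at a time, on demand

gen = squares_yield(5)
print(next(gen))    # 0
print(next(gen))    # 1
print(list(gen))    # [4, 9, 16] -- the remaining values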

The settings file

# -*- coding: utf-8 -*-

# Scrapy settings for douyuScripy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douyuScripy'    # project name

SPIDER_MODULES = ['douyuScripy.spiders']    # path to the spider modules
NEWSPIDER_MODULE = 'douyuScripy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douyuScripy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True    # whether to obey robots.txt; when writing our own crawler we usually don't, so just comment this line out

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32    # number of concurrent requests to launch, default 16

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 2        # delay between successive requests
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16    # maximum concurrent requests per domain, default 8
#CONCURRENT_REQUESTS_PER_IP = 16    # maximum concurrent requests per IP, default 0; if non-zero, the concurrency limit applies per IP instead of per domain

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False    # whether cookies are enabled, default True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False        # whether to enable the telnet console (nothing to do with Windows), default True

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {        # default request headers
    "User-Agent" : "DYZB/1 CFNetwork/808.2.16 Darwin/16.3.0"
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douyuScripy.middlewares.DouyuscripySpiderMiddleware': 543,	# spider middleware; the smaller the value, the higher the priority
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {		
#    'douyuScripy.middlewares.MyCustomDownloaderMiddleware': 543,	# downloader middleware
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douyuScripy.pipelines.DouyuscripyPipeline': 300,	# which pipeline(s) to use; with several, the one with the smaller value runs first
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Example 1: Crawling itcast teacher names and info

Spider module

# -*- coding: utf-8 -*-
import scrapy
from douyu.items import DouyuItem
import json


class DouyumeinvSpider(scrapy.Spider):
    name = "douyumeinv"
    allowed_domains = ["capi.douyucdn.cn"]

    offset = 0
    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="

    start_urls = [url + str(offset)]

    def parse(self, response):
        # In Scrapy, response.body gives the page as bytes and response.text gives it as a unicode string
        # convert the JSON data to Python objects; the data field is a list
        data = json.loads(response.text)["data"]
        for each in data:
            item = DouyuItem()
            item["nickname"] = each["nickname"]
            item["imagelink"] = each["vertical_src"]

            yield item
        if self.offset < 40:
            self.offset += 20
            yield scrapy.Request(self.url + str(self.offset), callback = self.parse)

Pipeline module

import json

class ItcastPipeline(object):
    def __init__(self):
        self.filename = open('teacher.json','wb')

    def process_item(self, item, spider):
        text = json.dumps(dict(item),ensure_ascii=False)+'\n'
        self.filename.write(text.encode('utf-8'))
        return item

    def close_spider(self,spider):
        self.filename.close()
import scrapy

class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
items.py
# -*- coding: utf-8 -*-

# Scrapy settings for myScripy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myScripy'

SPIDER_MODULES = ['myScripy.spiders']
NEWSPIDER_MODULE = 'myScripy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myScripy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'myScripy.middlewares.MyscripySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'myScripy.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'myScripy.pipelines.ItcastPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
settings.py

Tips:

Getting the page content in Scrapy:
response.body ==> bytes        response.text ==> unicode string

With the requests library:
response.content ==> bytes        response.text ==> unicode string

With urllib2:
response.read()
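
A quick sanity check of the first two (a hedged sketch; assumes the requests library is installed):

import requests

r = requests.get("http://example.com")
print(type(r.content))   # bytes
print(type(r.text))      # str (unicode)
# in a Scrapy callback the equivalents are response.body and response.text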

Example 2: Tencent recruitment site, crawling with automatic pagination

Spider module

# -*- coding: utf-8 -*-
import scrapy
from day_30.TencentScripy.TencentScripy.items import TencentscripyItem

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']   # don't casually append a / here, or sometimes only one page gets crawled
    url = 'http://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = [url+str(offset),]

    def parse(self, response):
        rows = response.xpath("//tr[@class='even']|//tr[@class='odd']")
        for i in rows:
            # initialize the item object; instantiate it inside the for loop
            item = TencentscripyItem()

            name = i.xpath('.//a/text()').extract()[0]
            link = i.xpath('.//a/@href').extract()[0]
            type = i.xpath('./td[2]/text()').extract()[0]
            number = i.xpath('./td[3]/text()').extract()[0]
            place = i.xpath('./td[4]/text()').extract()[0]
            rtime = i.xpath('./td[5]/text()').extract()[0]
            item['name'] = name
            item['link'] = link
            item['type'] = type
            item['number'] = number
            item['place'] = place
            item['rtime'] = rtime
            yield item
        if self.offset < 1680:
            self.offset += 10
            # after each page of data has been processed, send a request for the next page:
            # self.offset increases by 10 and is appended to form the new url; self.parse is the callback that handles the Response
            yield scrapy.Request(self.url+str(self.offset),callback=self.parse)     # don't add () after the callback here

Tip: a classier way to auto-increment the page number

curpage = re.search('(\d+)', response.url).group(1)   # grab the page number from the current url (requires `import re`)
page = int(curpage) + 10                              # add 10 to the page number from the link
url = re.sub('\d+', str(page), response.url)          # find the number in the link and substitute the new value
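
Wrapped up as a standalone helper for clarity (a sketch under the same assumption, i.e. that the first number in the URL is the page offset):

import re

def next_page_url(current_url, step=10):
    # take the first number in the url, add `step`, and substitute it back
    page = int(re.search(r'\d+', current_url).group(0)) + step
    return re.sub(r'\d+', str(page), current_url, count=1)

# next_page_url("http://hr.tencent.com/position.php?&start=0")  ->  "...?&start=10"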

Pipeline module

import json

class TencentscripyPipeline(object):
    def __init__(self):
        self.filename = open('tencent-job.json','wb')

    def process_item(self, item, spider):
        text = json.dumps(dict(item),ensure_ascii=False).encode('utf-8')+b'\n'
        self.filename.write(text)
        return item

    def close_spider(self,spider):
        self.filename.close()
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # job title
    positionname = scrapy.Field()
    # detail link
    positionlink = scrapy.Field()
    # job category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publication date
    publishTime = scrapy.Field()
items.py
# -*- coding: utf-8 -*-

# Scrapy settings for tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tencent (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencent.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencent.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
settings.py

Example 3: Extracting images from a JSON response

Spider module

# -*- coding: utf-8 -*-
import scrapy,json
from day_30.douyuScripy.douyuScripy.items import DouyuscripyItem

class DouyuSpider(scrapy.Spider):
    name = 'douyu'
    allowed_domains = ['capi.douyucdn.cn']
    url = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset='
    offset = 0
    start_urls = [url+str(offset)]

    def parse(self, response):
        # In Scrapy, response.body gives the page as bytes and response.text gives it as a unicode string
        # convert the JSON data to Python objects; the data field is a list
        py_dic = json.loads(response.text)
        # {error: 0,data: [{room_id: "2690605",room_src: "https://rpic.douyucdn.cn/live-cover/appCovers/2017/11/22/2690605_20171122081559_small.jpg",...
        data_list = py_dic['data']

        for data in data_list:

            item = DouyuscripyItem()    # keep this instantiation inside the loop, or you will get some unexpected surprises
            item['room_id'] = data['room_id']
            item['room_name'] = data['room_name']
            item['vertical_src'] = data['vertical_src']
            yield item
        if self.offset < 40:
            self.offset += 20
            yield scrapy.Request(self.url+str(self.offset),callback=self.parse)

Pipeline module (saving the images and renaming them; note that Scrapy's ImagesPipeline requires the Pillow library and an IMAGES_STORE setting)

import scrapy,os
from scrapy.utils.project import get_project_settings
from scrapy.pipelines.images import ImagesPipeline

class DouyuscripyPipeline(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

    def get_media_requests(self, item, info):
        # get the image link
        image_url = item["vertical_src"]
        # send a request to the image url to fetch the image
        yield scrapy.Request(image_url)

    def item_completed(self, result, item, info):
        # get the file name of the downloaded image
        image_path = [x["path"] for ok, x in result if ok]
        # rename the file to "<room name>-<room id>.jpg"
        os.rename(self.IMAGES_STORE + "/" + image_path[0], self.IMAGES_STORE + "/" + item["room_name"] + '-' + item['room_id'] + ".jpg")
        # store the saved path of the image in the item, then return the item
        item["imagePath"] = self.IMAGES_STORE + "/" + item["room_name"] + '-' + item['room_id'] + ".jpg"

        return item
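
For reference (hedged, per the Scrapy docs): the result argument passed to item_completed() is a list of (success, info) tuples, where on success info is a dict with 'url', 'path' and 'checksum' keys; that is why the list comprehension above keeps x["path"] only when ok is True.
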
import scrapy

class DouyuscripyItem(scrapy.Item):
    # define the fields for your item here like:
    vertical_src = scrapy.Field()   # image link
    room_name = scrapy.Field()      # room name
    room_id = scrapy.Field()        # room id
    imagePath = scrapy.Field()      # local save path of the image
items.py
# -*- coding: utf-8 -*-

# Scrapy settings for douyuScripy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douyuScripy'

SPIDER_MODULES = ['douyuScripy.spiders']
NEWSPIDER_MODULE = 'douyuScripy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douyuScripy (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent" : "DYZB/1 CFNetwork/808.2.16 Darwin/16.3.0"
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
}

IMAGES_STORE = "C:/Users/鑫。/PycharmProjects/study/day_30/douyuScripy/Images"

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douyuScripy.middlewares.DouyuscripySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douyuScripy.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douyuScripy.pipelines.DouyuscripyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
settings.py (an image storage path has been added)