Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is used very widely.
Thanks to the power of the framework, a user only needs to customise a few modules to easily implement a crawler that scrapes web pages and all kinds of images.
Scrapy uses the Twisted asynchronous networking framework (its main rival being Tornado) to handle network communication. This speeds up downloads and saves you from implementing an asynchronous framework yourself, and it also exposes a range of middleware interfaces so you can flexibly satisfy all sorts of requirements.
Scrapy Engine: responsible for the communication, signals and data transfer between the Spider, Item Pipeline, Downloader and Scheduler.

Scheduler: receives the Requests sent over by the Engine, arranges and enqueues them in a certain order, and hands them back when the Engine asks for them.

Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them on to the Spider for processing.

Spider: processes all Responses, analyses and extracts data from them, obtains the data needed for the Item fields, and submits any URLs that need to be followed to the Engine, which puts them into the Scheduler again.

Item Pipeline: the place where Items obtained from the Spider are post-processed (detailed analysis, filtering, storage, and so on).

Downloader Middlewares: a component you can treat as a customisable extension of the download functionality.

Spider Middlewares: a functional component you can think of as a customisable extension hooking into the communication between the Engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it).
Everything starts from the spider we write: we send requests to the Scrapy Engine, the Engine hands the incoming Requests to the Scheduler, and the Scheduler enqueues them. When the Engine needs them, they are dequeued first-in-first-out and handed to the Downloader; once the download finishes, the Downloader gives the Response to the Engine, which passes it on to our spider code. By processing the Response we hand the URLs that still need to be crawled back to the Engine (repeating the steps above), and send whatever needs to be saved to the Item Pipeline for processing.
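To make this loop concrete, here is a minimal, self-contained spider sketch. The site, selectors and field name are placeholders for illustration (quotes.toscrape.com is a public demo site) and are not part of any project in this post: yielding a dict/Item sends data toward the Item Pipeline, while yielding a Request sends it back through the Engine, Scheduler and Downloader.

import scrapy

class DemoFlowSpider(scrapy.Spider):
    # Hypothetical spider: name, URL and selectors are illustrative only.
    name = "demo_flow"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # Yielded dicts/Items travel: Spider -> Engine -> Item Pipeline.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").extract_first()}

        # Yielded Requests travel: Spider -> Engine -> Scheduler -> Downloader -> back to parse().
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)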
To launch a spider from a script rather than from the command line:

from scrapy import cmdline
cmdline.execute("scrapy crawl your_spider_name".split())
Selector has four basic methods; the most commonly used is still xpath:

/html/head/title : selects the <title> element inside the <head> of the HTML document
/html/head/title/text() : selects the text of that <title> element
//td : selects all <td> elements
//div[@class="mine"] : selects all div elements that have the attribute class="mine"
其餘的看前兩篇博客吧
After an Item is collected in the Spider, it is passed to the Item Pipeline, and the Item Pipeline components process it in the order they are defined.
Each Item Pipeline is a Python class implementing a few simple methods, and it decides, for example, whether an Item is dropped or stored. Typical uses include cleaning and validating scraped data, checking for duplicates, and storing items in a file or database.
Writing an item pipeline is simple: an item pipeline component is a standalone Python class in which the process_item() method must be implemented:
import json

class XingePipeline(object):
    def __init__(self):
        # Optional: parameter initialisation and so on.
        # __init__ and close_spider each run only once; process_item runs once per item,
        # so there is no need to open the file in append mode ('ab').
        self.file = open('teacher.json', 'wb')   # open the output file

    def process_item(self, item, spider):
        # item (Item object)    – the item being scraped
        # spider (Spider object) – the spider that scraped the item
        # This method must be implemented; every item pipeline component calls it.
        # It must return an Item object; dropped items are not processed by later pipeline components.
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content.encode('utf-8'))
        return item

    def open_spider(self, spider):
        # spider (Spider object) – the spider that was opened
        # Optional: called when the spider is opened.
        pass

    def close_spider(self, spider):
        # spider (Spider object) – the spider that was closed
        # Optional: called when the spider is closed.
        self.file.close()
To enable the pipeline, you must un-comment the ITEM_PIPELINES entry in the settings file:
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "mySpider.pipelines.ItcastJsonPipeline": 300,
}
The Spider class defines how to crawl a certain website (or set of websites). It covers both the crawling actions (for example, whether to follow links) and how to extract structured data (the scraped items) from the page content. In other words, the Spider is where you define the crawling behaviour and the analysis of a particular page (or pages).
class scrapy.Spider
This is the most basic class; every spider you write must inherit from it.
The main methods used, and the order in which they are called, are:
__init__(): initialises the spider name and the start_urls list.

start_requests(), which calls make_requests_from_url(): generates the Request objects that are handed to Scrapy for downloading, returning responses.

parse(): parses a response and returns Items or Requests (with a callback specified). Items are handed to the Item Pipeline for persistence, while Requests are downloaded by Scrapy and handled by the specified callback (parse() by default); this loop continues until all data has been processed.
name
A string defining the spider's name. For example, a spider that crawls mywebsite.com would usually be named mywebsite.

allowed_domains
An optional list of the domains the spider is allowed to crawl.

start_urls
The initial tuple/list of URLs. When no particular URLs are specified, the spider starts crawling from this list.

start_requests(self)
This method must return an iterable containing the first Requests the spider uses to crawl (the default implementation builds them from the URLs in start_urls). It is called when the spider starts crawling and no start_urls have been specified.

parse(self, response)
The default Request callback when a requested URL returns a page and no callback was specified. It processes the returned response and generates Items or further Request objects.

log(self, message[, level, component])
Logs a message via scrapy.log.msg(). See the logging documentation for more details.
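Putting these attributes and methods together, a minimal spider that overrides start_requests() and relies on the default parse() callback might look like this sketch (the domain and URL are placeholders, not a real project from this post):

import scrapy

class MywebsiteSpider(scrapy.Spider):
    name = "mywebsite"                        # the string used by "scrapy crawl mywebsite"
    allowed_domains = ["mywebsite.com"]       # optional; off-site requests get filtered out
    start_urls = ["http://mywebsite.com/list?page=1"]

    def start_requests(self):
        # Must return/yield an iterable of the spider's first Requests
        # (the default implementation builds them from start_urls).
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Default callback: turn the response into Items or follow-up Requests.
        self.log("parsed %s" % response.url)
        yield {"url": response.url,
               "title": response.xpath("//title/text()").extract_first()}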
How the parse method works
1. Because yield is used instead of return, the parse function is treated as a generator. Scrapy fetches the results generated in parse() one by one and checks what type each result is;
2. if it is a Request it is added to the crawl queue, if it is an Item it is handed to the pipeline, and any other type produces an error.
3. When Scrapy obtains the first batch of Requests it does not send them straight away; it just puts them on the queue and keeps pulling from the generator;
4. once the Requests of the first part are exhausted, it moves on to the Items of the second part, and each Item obtained is passed to the corresponding pipeline.
5. The parse() method is assigned to the Request as its callback: scrapy.Request(url, callback=self.parse) specifies parse() to handle those requests.
6. Request objects go through the scheduler, produce scrapy.http.Response objects, and are sent back to parse(), until the scheduler contains no more Requests (a recursive idea).
7. Once everything is exhausted, parse() is done, and the engine carries out the corresponding operations according to the queue and the pipelines.
8. Before extracting the items of each page, the program first finishes all the requests already in the request queue, and only then extracts the items.
9. All of this is handled by the Scrapy engine and the scheduler.
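The dispatch-by-type idea in rules 1–2 can be sketched in plain Python; this is a simplified illustration of the behaviour, not Scrapy's actual implementation:

import scrapy

def dispatch_parse_results(results, request_queue, pipeline_items):
    # "results" is whatever parse() yields; inspect each value and route it by type.
    for result in results:
        if isinstance(result, scrapy.Request):
            request_queue.append(result)      # Requests are queued, not sent immediately
        elif isinstance(result, (scrapy.Item, dict)):
            pipeline_items.append(result)     # Items/dicts are handed to the item pipelines
        else:
            raise TypeError("parse() must yield Requests, Items or dicts, got %r" % type(result))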
Tips
Why use yield? Its main purpose is to turn the function into a generator. With yield you can return item data and also send the next request. If return were used instead, the function would end there. And if you had to return a list containing hundreds or thousands of elements, it would take a lot of memory and time; yield eases that problem.
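The memory argument is easy to demonstrate with plain Python, independently of Scrapy (the numbers are purely illustrative):

def squares_as_list(n):
    # return: the entire list is built in memory before anything is handed back
    return [i * i for i in range(n)]

def squares_as_generator(n):
    # yield: values are produced one at a time, on demand; the function becomes a generator
    for i in range(n):
        yield i * i

# Both can be iterated the same way, but the generator never holds all n values at once.
total = sum(squares_as_generator(1000000))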
The settings file
# -*- coding: utf-8 -*-

# Scrapy settings for douyuScripy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douyuScripy'                      # project name

SPIDER_MODULES = ['douyuScripy.spiders']      # path to the spider files
NEWSPIDER_MODULE = 'douyuScripy.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douyuScripy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True    # whether to follow robots.txt; for our own crawler we usually don't, so just comment this out

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32    # number of concurrent requests, default 16

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 2    # wait time between requests
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16    # maximum concurrent requests per domain, default 8
#CONCURRENT_REQUESTS_PER_IP = 16        # maximum concurrent requests per IP, default 0; if non-zero, the limit applies per IP rather than per domain

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False    # whether to keep cookies, default True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False    # whether to enable the telnet console (unrelated to Windows), default True

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {    # request headers
    "User-Agent": "DYZB/1 CFNetwork/808.2.16 Darwin/16.3.0"
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douyuScripy.middlewares.DouyuscripySpiderMiddleware': 543,    # spider middleware; the smaller the value, the higher the priority
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douyuScripy.middlewares.MyCustomDownloaderMiddleware': 543,    # downloader middleware
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douyuScripy.pipelines.DouyuscripyPipeline': 300,    # which pipeline to use; with several, the ones with smaller values run first
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Example 1: scraping teacher names and info from itcast
Spider module
# -*- coding: utf-8 -*-
import scrapy
from douyu.items import DouyuItem
import json

class DouyumeinvSpider(scrapy.Spider):
    name = "douyumeinv"
    allowed_domains = ["capi.douyucdn.cn"]
    offset = 0
    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    start_urls = [url + str(offset)]

    def parse(self, response):
        # In Scrapy, response.body is bytes and response.text is a unicode string.
        # Convert the JSON data into Python objects; the "data" field is a list.
        data = json.loads(response.text)["data"]
        for each in data:
            item = DouyuItem()
            item["nickname"] = each["nickname"]
            item["imagelink"] = each["vertical_src"]
            yield item

        if self.offset < 40:
            self.offset += 20
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
Pipeline module
import json

class ItcastPipeline(object):
    def __init__(self):
        self.filename = open('teacher.json', 'wb')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.filename.close()
import scrapy

class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
# -*- coding: utf-8 -*-

# Scrapy settings for myScripy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'myScripy'

SPIDER_MODULES = ['myScripy.spiders']
NEWSPIDER_MODULE = 'myScripy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myScripy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'myScripy.middlewares.MyscripySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'myScripy.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'myScripy.pipelines.ItcastPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Tips: getting the page content
Scrapy: response.body → bytes, response.text → str (unicode)
requests: response.content → bytes, response.text → str (unicode)
urllib2: response.read() → bytes
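A small sketch of why the distinction matters when writing scraped content to disk; it assumes response is a Scrapy Response inside a callback, and the file names are arbitrary:

# response.body is raw bytes; response.text is decoded text (unicode).
with open("page_raw.html", "wb") as f:
    f.write(response.body)                      # bytes can go straight into a binary file
with open("page_text.html", "wb") as f:
    f.write(response.text.encode("utf-8"))      # text has to be encoded before writing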
Example 2: automatically paging through the Tencent recruitment site
Spider module
# -*- coding: utf-8 -*-
import scrapy
from day_30.TencentScripy.TencentScripy.items import TencentscripyItem

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']   # don't casually add a trailing / here, or sometimes only one page is fetched
    url = 'http://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = [url + str(offset), ]

    def parse(self, response):
        list = response.xpath("//tr[@class='even']|//tr[@class='odd']")
        for i in list:
            # Instantiate the item model inside the for loop
            item = TencentscripyItem()
            name = i.xpath('.//a/text()').extract()[0]
            link = i.xpath('.//a/@href').extract()[0]
            type = i.xpath('./td[2]/text()').extract()[0]
            number = i.xpath('./td[3]/text()').extract()[0]
            place = i.xpath('./td[4]/text()').extract()[0]
            rtime = i.xpath('./td[5]/text()').extract()[0]
            item['name'] = name
            item['link'] = link
            item['type'] = type
            item['number'] = number
            item['place'] = place
            item['rtime'] = rtime
            yield item

        if self.offset < 1680:
            self.offset += 10
            # After each page has been processed, send the request for the next page:
            # self.offset grows by 10, is appended to a new url, and self.parse handles the response
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)   # no parentheses after the callback
Tip: a classier way to bump the page number

import re

curpage = re.search(r'(\d+)', response.url).group(1)   # find a number anywhere in the URL and assign it to curpage
page = int(curpage) + 10                               # add 10 to the page offset in the link
url = re.sub(r'\d+', str(page), response.url)          # find the number in the link and replace it with the new value
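Applied to a made-up listing URL, the same trick can be wrapped in a tiny helper (the URL below is only an example; it works as long as the page offset is the only number in the URL):

import re

def next_page_url(current_url, step=10):
    # Find the first number in the URL, add the step, and substitute it back in.
    curpage = int(re.search(r'(\d+)', current_url).group(1))
    return re.sub(r'\d+', str(curpage + step), current_url)

print(next_page_url("http://hr.tencent.com/position.php?&start=20"))
# -> http://hr.tencent.com/position.php?&start=30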
Pipeline module
import json

class TencentscripyPipeline(object):
    def __init__(self):
        self.filename = open('tencent-job.json', 'wb')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False).encode('utf-8') + b'\n'
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # position name
    positionname = scrapy.Field()
    # detail link
    positionlink = scrapy.Field()
    # position category
    positionType = scrapy.Field()
    # number of people to hire
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publish date
    publishTime = scrapy.Field()
# -*- coding: utf-8 -*-

# Scrapy settings for tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tencent (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencent.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencent.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Example 3: extracting images from a JSON response
Spider module
# -*- coding: utf-8 -*-
import scrapy, json
from day_30.douyuScripy.douyuScripy.items import DouyuscripyItem

class DouyuSpider(scrapy.Spider):
    name = 'douyu'
    allowed_domains = ['capi.douyucdn.cn']
    url = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset='
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        # In Scrapy, response.body is bytes and response.text is a unicode string.
        # Convert the JSON data into Python objects; the "data" field is a list, e.g.
        # {error: 0, data: [{room_id: "2690605", room_src: "https://rpic.douyucdn.cn/live-cover/appCovers/2017/11/22/2690605_20171122081559_small.jpg", ...
        py_dic = json.loads(response.text)
        data_list = py_dic['data']
        for data in data_list:
            item = DouyuscripyItem()   # keep the instantiation inside the loop, otherwise you are in for a surprise
            item['room_id'] = data['room_id']
            item['room_name'] = data['room_name']
            item['vertical_src'] = data['vertical_src']
            yield item

        if self.offset < 40:
            self.offset += 20
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
Pipeline module (saving and renaming the images)
import scrapy, os
from scrapy.utils.project import get_project_settings
from scrapy.pipelines.images import ImagesPipeline

class DouyuscripyPipeline(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')

    def get_media_requests(self, item, info):
        # Get the image link and request the image from its url
        image_url = item["vertical_src"]
        yield scrapy.Request(image_url)

    def item_completed(self, result, item, info):
        # Get the name of the downloaded file
        image_path = [x["path"] for ok, x in result if ok]
        # Rename the file to "<room name>-<room id>.jpg"
        os.rename(self.IMAGES_STORE + "/" + image_path[0],
                  self.IMAGES_STORE + "/" + item["room_name"] + '-' + item['room_id'] + ".jpg")
        # Store the path of the saved image in the item and return the item
        item["imagePath"] = self.IMAGES_STORE + "/" + item["room_name"] + '-' + item['room_id'] + ".jpg"
        return item
import scrapy

class DouyuscripyItem(scrapy.Item):
    # define the fields for your item here like:
    vertical_src = scrapy.Field()   # image link
    room_name = scrapy.Field()      # room name
    room_id = scrapy.Field()        # room id
    imagePath = scrapy.Field()      # local path where the file is saved
# -*- coding: utf-8 -*-

# Scrapy settings for douyuScripy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douyuScripy'

SPIDER_MODULES = ['douyuScripy.spiders']
NEWSPIDER_MODULE = 'douyuScripy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douyuScripy (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "DYZB/1 CFNetwork/808.2.16 Darwin/16.3.0"
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

IMAGES_STORE = "C:/Users/鑫。/PycharmProjects/study/day_30/douyuScripy/Images"

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douyuScripy.middlewares.DouyuscripySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douyuScripy.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douyuScripy.pipelines.DouyuscripyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'