Concept: an application framework written for crawling website data; it integrates the relevant functionality and serves as a highly generic project template.
Features: the Scrapy framework provides high-performance asynchronous downloading, parsing, persistent storage, and more.
Scrapy Engine: handles the data flow of the whole system and triggers events (the core of the framework)
Scheduler: accepts the requests sent over by the engine, pushes them into a queue, and returns them when the engine asks again. You can think of it as a priority queue of URLs (the addresses, or links, of the pages to crawl): it decides which URL to crawl next and also removes duplicate URLs
Downloader: downloads page content and returns it to the spiders (the Scrapy downloader is built on top of Twisted, an efficient asynchronous model)
Spiders: the spiders do the main work, extracting the information you need from specific pages, i.e. the so-called items. You can also extract links from pages so that Scrapy goes on to crawl the next page
Item Pipeline: handles the items the spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a spider has parsed a page, the items are sent to the pipeline and pass through several specific processing steps in order
Installation:
After a successful installation you can type scrapy in the terminal/command line to check that it is installed.
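The install command itself is not listed here; on most platforms it is simply the following (on Windows, extra dependencies such as Twisted may need to be installed first):

pip install scrapy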
scrapy startproject <project_name>
Directory structure:
- scrapy.cfg: configuration file
- items.py: defines the data storage template, for structured data
- pipelines.py: data persistence
- settings.py: configuration file, e.g. recursion depth, concurrency, download delay
- spiders: spider directory, e.g. create spider files and write parsing rules here

Contents of the first spider file that gets created (typically generated with scrapy genspider <spider_name> <start_url>):
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    # Name of the spider: the name is used to locate this specific spider file
    name = 'first'
    # Allowed domains: only pages under these domains may be crawled
    allowed_domains = ['https://www.qiushibaike.com']
    # Start urls: the urls of the pages this project will crawl
    start_urls = ['http://https://www.qiushibaike.com/']

    # Parse method: parses the specified content out of the fetched page data
    # response: the response object returned after the requests to the start_urls succeed
    # The return value of parse must be an iterable or None
    def parse(self, response):
        pass
Write the crawling logic in the spider file that was created to carry out the crawl. Before running, adjust these settings in settings.py:
# User agent of the request carrier; change it to a browser's user agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Obey robots.txt rules
# Do not strictly obey the portal site's robots.txt protocol
ROBOTSTXT_OBEY = False
When parsing the crawled content, it is recommended to use XPath to extract the specified content.
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        f = open('段子.txt', 'w', encoding='utf-8')
        count = 0
        for div in div_list:
            # The content matched by the xpath is stored in Selector objects
            # extract() pulls the stored data values out of the Selector object
            # extract_first() takes the first value, equivalent to extract()[0]
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            count += 1
            f.write(author + ':\n' + content + '\n---------------divider--------------\n\n\n')
        print('Total scraped:', count)
        f.close()
For terminal-command-based persistence (feed export), have parse return a list of dicts instead:

class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        data_list = []
        for div in div_list:
            # The content matched by the xpath is stored in Selector objects
            # extract() pulls the stored data values out of the Selector object
            # extract_first() takes the first value, equivalent to extract()[0]
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            dic = {
                "author": author,
                "content": content
            }
            data_list.append(dic)
        return data_list

Command:
scrapy crawl first -o qiubai.csv --nolog
Pipeline-based persistence
Code workflow for pipeline-based data storage:
In items.py:
import scrapy


class QiushiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
In pipelines.py:
class QiushiPipeline(object):
    file = None

    def open_spider(self, spider):
        # Called exactly once for the whole crawl; a good place to open the file
        self.file = open('qiubai.txt', 'w', encoding='utf-8')
        print('spider started')

    def process_item(self, item, spider):
        # Receives the item objects submitted by the spider file and persists
        # the page data stored in them
        # The item parameter is the received item object
        # Executed once every time the spider submits an item to the pipeline
        author = item['author']
        content = item['content']
        self.file.write(author + ':\n' + content + '\n-------------------\n\n\n')
        return item

    def close_spider(self, spider):
        # Called exactly once, when the crawl ends
        print('spider finished')
        self.file.close()
In the spider file:
import scrapy
from qiushi.items import QiushiItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            # 1. Store the parsed page data in an item object
            item = QiushiItem()
            item['author'] = author
            item['content'] = content
            # 2. Submit the item object to the pipeline
            yield item
Configuration file: uncomment line 67 of settings.py (the ITEM_PIPELINES setting)
ITEM_PIPELINES = {
    'qiushi.pipelines.QiushiPipeline': 300,
}
Using a MySQL database for persistence is not much different from the pipeline-based storage above; you just need to write the pymysql connection and database operations in pipelines.py.
import pymysql


class QiubaiPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('spider started')
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='qiubai')

    def process_item(self, item, spider):
        sql = 'insert into qiubai(author,content) values ("%s","%s")' % (item['author'], item['content'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('spider finished')
        self.cursor.close()
        self.conn.close()
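The pipeline above assumes a MySQL database named qiubai containing a qiubai(author, content) table. A minimal one-off setup sketch with pymysql; the database, table and column definitions are assumptions inferred from the pipeline code, not taken from the original:

import pymysql

# Sketch: create the database and table the pipeline expects (names inferred from the pipeline above)
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='123456')
cursor = conn.cursor()
cursor.execute('CREATE DATABASE IF NOT EXISTS qiubai CHARACTER SET utf8mb4')
cursor.execute(
    'CREATE TABLE IF NOT EXISTS qiubai.qiubai ('
    'id INT PRIMARY KEY AUTO_INCREMENT, '
    'author VARCHAR(255), '
    'content TEXT)'
)
conn.commit()
cursor.close()
conn.close()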
Install the Redis database
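The download step is omitted above; it would typically be something like the following (the version matches the cd command below):

wget http://download.redis.io/releases/redis-5.0.3.tar.gz
tar -xzf redis-5.0.3.tar.gz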
cd redis-5.0.3
make
./redis-server ../redis.conf
Simple use of Redis:
127.0.0.1:6379> set name 'hahha'
OK
127.0.0.1:6379> get name
"hahha"
Redis-based persistence in pipelines.py (newer versions of redis-py cannot lpush a dict directly, so the dict is serialized to JSON first):

import json

import redis


class QiubaiPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('spider started')
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # Serialize the dict to a JSON string before pushing it into the list
        self.conn.lpush('data', json.dumps(dic))
        return item

    def close_spider(self, spider):
        print('spider finished')
Requirement: store the crawled data to disk, MySQL, and Redis at the same time.
pipelines.py
import redis


class QiubaiPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('spider started')
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # self.conn.lpush('data', dic)
        print('data written to the redis database')
        return item

    def close_spider(self, spider):
        print('spider finished')


class QiubaiFiles(object):
    def process_item(self, item, spider):
        print('data written to a file on disk')
        return item


class QiubaiMySQL(object):
    def process_item(self, item, spider):
        print('data written to the mysql database')
        return item
settings.py (lower numbers run first, so items pass through the pipelines in the order given):
ITEM_PIPELINES = {
    'qiubai.pipelines.QiubaiPipeline': 300,
    'qiubai.pipelines.QiubaiFiles': 400,
    'qiubai.pipelines.QiubaiMySQL': 500
}
By sending requests manually you can crawl data from multiple urls (for example, paginated pages).
import scrapy
from qiushiPage.items import QiushipageItem


class QiushiSpider(scrapy.Spider):
    name = 'qiushi'
    # allowed_domains = ['https://www.qiushibaike.com/text/']
    start_urls = ['https://www.qiushibaike.com/text/']
    pageNum = 1
    url = 'https://www.qiushibaike.com/text/page/%d/'

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            item = QiushipageItem()
            item['author'] = author
            item['content'] = content
            yield item
        print('still running?')
        # Send requests manually to crawl multiple urls
        # Check whether the page number is still within range (<= 13)
        if self.pageNum <= 13:
            self.pageNum += 1
            print('crawling page', self.pageNum)
            new_url = format(self.url % self.pageNum)
            # callback is the callback function; page 2 is parsed the same way as page 1,
            # so parse can be reused here, or you can define your own parsing function
            yield scrapy.Request(url=new_url, callback=self.parse)
To send a POST request with Scrapy, you must override the parent class's start_requests method.
import scrapy


class PostRequestSpider(scrapy.Spider):
    name = 'post_request'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    # Sending a POST request requires overriding the parent class's start_requests method
    def start_requests(self):
        data = {
            'kw': 'dog'
        }
        for url in self.start_urls:
            # Option 1
            # yield scrapy.Request(url=url, callback=self.parse, method='post')
            # Option 2
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

    def parse(self, response):
        print(response.text)
There is no need to extract or store cookies yourself: Scrapy stores cookies automatically, and the next request will carry the automatically stored cookies.
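As a hedged illustration (the login URL, form fields and profile URL below are placeholders, not from the original): post a login form, and any follow-up request will carry the session cookie without extra code.

import scrapy


class LoginDemoSpider(scrapy.Spider):
    # Hypothetical spider for illustration only; URLs and form fields are placeholders
    name = 'login_demo'
    start_urls = ['https://example.com/login']

    def start_requests(self):
        for url in self.start_urls:
            # POST the login form; the response sets a session cookie
            yield scrapy.FormRequest(url=url,
                                     formdata={'username': 'user', 'password': 'pwd'},
                                     callback=self.after_login)

    def after_login(self, response):
        # No manual cookie handling: the cookies middleware automatically attaches
        # the stored session cookie to this follow-up request
        yield scrapy.Request(url='https://example.com/profile', callback=self.parse_profile)

    def parse_profile(self, response):
        print(response.text)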
Scrapy switches the request IP through a downloader middleware. In middlewares.py you can define a class that implements a process_request method taking three parameters: self, request and spider.
Then change the IP by setting the request.meta['proxy'] attribute, and enable the downloader middleware in settings.py.
middlewares.py
class MyPro(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://61.166.153.167:8080'
settings.py
DOWNLOADER_MIDDLEWARES = {
    'postDemo.middlewares.MyPro': 543,
}
When requests are made, the request IP will then be switched automatically.
Log levels: CRITICAL, ERROR, WARNING, INFO, DEBUG. To make the terminal output only log messages at a given level, add LOG_LEVEL in settings.py:
# Have the terminal output only log messages of the specified level
LOG_LEVEL = 'ERROR'
You can also direct the log output to a file instead of the screen; likewise, add a LOG_FILE setting in settings.py:
LOG_FILE = 'log.txt'
Request meta passing: when the data for one item is spread across a list page and a detail page, pass the partially filled item to the detail-page callback through the meta parameter of scrapy.Request:

import scrapy
from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['http://www.55xia.com']
    start_urls = ['http://www.55xia.com/movie/']

    def parseMoviePage(self, response):
        # Take the item back out of meta
        item = response.meta['item']
        direct = response.xpath('//html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]//text()').extract_first()
        country = response.xpath('//html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[4]/td[2]/a/text()').extract_first()
        movie_referral = response.xpath('/html/body/div[1]/div/div/div[1]/div[2]/div[2]/p/text()').extract_first()
        download_url = response.xpath('//td[@class="text-break"]/div/a[@rel="nofollow"]/@href').extract_first()
        password = response.xpath('//td[@class="text-break"]/div/strong/text()').extract_first()
        download = 'link:%s password:%s' % (download_url, password)
        item['download'] = download
        item['country'] = country
        item['direct'] = direct
        item['movie_referral'] = movie_referral
        yield item

    def parse(self, response):
        div_list = response.xpath('//html/body/div[1]/div[1]/div[2]/div')
        for div in div_list:
            name = div.xpath('.//div[@class="meta"]/h1/a/text()').extract_first()
            parse_url = div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()
            genre = div.xpath('.//div[@class="otherinfo"]//text()').extract()
            genre = '|'.join(genre)
            url = 'http:%s' % parse_url
            item = MovieproItem()
            item['name'] = name
            item['genre'] = genre
            # Request meta passing: the two callbacks parse different pages but the data must
            # end up in the same item, so pass the item to the callback through meta; in the
            # callback it can be taken out with response.meta['item']
            # The meta parameter must be given a dict
            yield scrapy.Request(url=url, callback=self.parseMoviePage, meta={'item': item})
Problem: what if we want to crawl all of the data on an entire site?
Solution: either send requests manually as shown above, or use CrawlSpider.
CrawlSpider concept: CrawlSpider is in fact a subclass of Spider with more powerful functionality (link extractor, rule parser). Create a CrawlSpider-based spider file with:
scrapy genspider -t crawl chouti dig.chouti.com
Link extractor: as the name suggests, it is used to extract the specified links (urls).
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # Instantiate a link extractor object
    # Link extractor: as the name suggests, used to extract the specified links (urls)
    # allow parameter: takes a regular expression
    # The link extractor extracts the links in the page that match the regular expression
    # All extracted links are handed to the rule parser
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')

    rules = (
        # Instantiate a rule parser object
        # After receiving the links sent by the link extractor, the rule parser sends requests
        # to them, fetches the page content and parses it according to the specified rule
        # callback: specifies a parsing rule (method/function)
        # follow: whether to keep applying the link extractor to the pages its links point to
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)
        # The response data can be parsed here
Distributed crawling: multiple machines crawl co-operatively by sharing a Redis-based scheduler queue and pipeline, provided by the scrapy-redis component.
Install the component:
pip install scrapy-redis
In the spider file, make the spider class inherit from RedisCrawlSpider:

from scrapy_redis.spiders import RedisCrawlSpider


class QiubaiSpider(RedisCrawlSpider):
    pass
The full spider file:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

from redisPro.items import RedisproItem


class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['https://www.qiushibaike.com/pic/']
    # start_urls = ['https://www.qiushibaike.com/pic//']

    # Name of the scheduler queue; plays the same role as start_urls
    redis_key = 'qiubaispider'

    rules = (
        Rule(LinkExtractor(allow=r'/pic/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            img_url = 'https:' + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            item = RedisproItem()
            item['img_url'] = img_url
            yield item
In settings.py, use the shared pipeline provided by the scrapy-redis component:

ITEM_PIPELINES = {
    # Native pipeline
    # 'redisPro.pipelines.RedisproPipeline': 300,
    # Shared pipeline provided by the distributed component
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
Also in settings.py, switch de-duplication and scheduling to the scrapy-redis components:

# Use the scrapy-redis de-duplication queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use scrapy-redis's own scheduler
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Whether pausing/resuming the crawl is allowed
SCHEDULER_PERSIST = True
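If the shared Redis server does not run on the same machine as the spider, scrapy-redis also needs to be told where it is via REDIS_HOST and REDIS_PORT (the host value below is only an example):

# Location of the shared redis server (example IP)
REDIS_HOST = '192.168.1.100'
REDIS_PORT = 6379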
Start the spider with:
scrapy runspider qiubai.py
Put the start url into the scheduler's queue via redis-cli:
lpush <queue name (the redis_key)> <start url>
lpush qiubaispider https://www.qiushibaike.com/pic/
An example img_url extracted by the spider looks like: https://pic.qiushibaike.com/system/pictures/12140/121401684/medium/59AUGYJ1J0ZAPSOL.jpg