Scrapy is not a library of functions but a crawler framework.
A crawler framework is a collection of software structures and functional components for implementing crawlers.
A crawler framework is a semi-finished product that helps users build professional web crawlers.
1. Scrapy crawler framework structure
Data flow, step 1:
1 The Engine obtains crawl requests (Request) from a Spider
2 The Engine forwards the crawl requests to the Scheduler for scheduling
Data flow, step 2:
3 The Engine obtains the next request to crawl from the Scheduler
4 The Engine sends the crawl request to the Downloader through the middleware
5 After fetching the page, the Downloader forms a response (Response) and sends it back to the Engine through the middleware
6 The Engine sends the received response to the Spider for processing through the middleware
Data flow, step 3:
7 After processing the response, the Spider produces scraped Items and new crawl Requests for the Engine
8 The Engine sends scraped items to the Item Pipeline (the framework exit)
9 The Engine sends crawl requests to the Scheduler
The Engine controls the data flow between all modules, continuously obtaining crawl requests from the Scheduler until no requests remain.
Framework entry: the Spider's initial crawl requests
Framework exit: the Item Pipeline
The Engine, Scheduler, and Downloader are already implemented; the Spiders and Item Pipelines need to be written by the user.
Engine
(1) Controls the data flow between all modules
(2) Triggers events according to conditions
Downloader
Downloads web pages according to requests
Scheduler
Schedules and manages all crawl requests
Downloader Middleware
Purpose: user-configurable control between the Engine, Scheduler, and Downloader
Functions: modify, discard, or add requests or responses
Users can write configuration code
Spider
(1) Parses the responses (Response) returned by the Downloader
(2) Produces scraped items
(3) Produces additional crawl requests (Request)
Users can write configuration code
Item Pipelines
(1) Processes the scraped items produced by Spiders in a pipeline fashion
(2) Consists of a sequence of operations, like an assembly line; each operation is an Item Pipeline class
(3) Possible operations include: cleaning, validating, and de-duplicating the HTML data in scraped items, and storing the data in a database
Requires the user to write configuration code
Spider Middleware
Purpose: re-processing of requests and scraped items
Functions: modify, discard, or add requests or scraped items
2. Common Scrapy commands
startproject   create a new project               scrapy startproject <name> [dir]
genspider      create a spider                    scrapy genspider [options] <name> <domain>
settings       get the crawler configuration      scrapy settings [options]
crawl          run a spider                       scrapy crawl <spider>
list           list all spiders in the project    scrapy list
shell          start the URL debugging shell      scrapy shell [url]
3. Creating a Scrapy project
Using the Scrapy crawler framework mainly means writing configuration-style code.
Step 1: create a Scrapy crawler project
Pick a directory (for example D:\), then run the following command:
scrapy startproject <name> [dir]
The generated directory structure:
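Roughly, for a project named demo, the generated layout looks like the sketch below (the exact files, e.g. whether a middlewares.py is created, depend on the Scrapy version):

demo/                  -> outer directory created by startproject
    scrapy.cfg         -> deployment configuration
    demo/              -> the project's Python module
        __init__.py
        items.py       -> Item definitions
        pipelines.py   -> Item Pipeline code
        settings.py    -> project settings
        spiders/       -> directory that holds the spiders
            __init__.py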
# -*- coding: utf-8 -*-
import scrapy
import sys
import io
from scrapy.http import Request
from scrapy.selector import Selector, HtmlXPathSelector
from ..items import ChoutiItem

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = ['http://dig.chouti.com/']
    visited_urls = set()

    # def start_requests(self):
    #     for url in self.start_urls:
    #         yield Request(url, callback=self.parse)

    def parse(self, response):
        # content = str(response.body, encoding='utf-8')

        # Find all A tags in the document
        # hxs = Selector(response=response).xpath('//a')  # list of selector objects
        # for i in hxs:
        #     print(i)  # a selector object

        # Convert selector objects to strings
        # hxs = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]').extract()
        # hxs = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]')
        # for obj in hxs:
        #     a = obj.xpath('.//a[@class="show-content"]/text()').extract_first()
        #     print(a.strip())

        # Selectors:
        """
        //                   among all descendants
        .//                  among the descendants of the current object
        /                    direct children
        /div                 div tags among the children
        /div[@id="i1"]       div tags among the children with id="i1"
        obj.extract()        # convert every object in the list to a string => []
        obj.extract_first()  # convert to strings and take the first element of the list
        //div/text()         get the text of a tag
        """

        # Get all the page-number links on the current page
        # hxs = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/text()')
        # hxs0 = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        # for item in hxs0:
        #     if item in self.visited_urls:
        #         print('already visited', item)
        #     else:
        #         self.visited_urls.add(item)
        #         print(item)

        # hxs2 = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        # hxs2 = Selector(response=response).xpath('//a[starts-with(@href, "/all/hot/recent/")]/@href').extract()
        # hxs2 = Selector(response=response).xpath('//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        # for url in hxs2:
        #     md5_url = self.md5(url)
        #     if md5_url in self.visited_urls:
        #         pass
        #         # print('already visited', url)
        #     else:
        #         self.visited_urls.add(md5_url)
        #         print(url)
        #         url = "http://dig.chouti.com%s" % url
        #         # add the new URL to the scheduler
        #         yield Request(url=url, callback=self.parse)
        #
        # a/@href                                            get an attribute
        # //a[starts-with(@href, "/all/hot/recent/")]/@href  href starting with the given prefix
        # //a[re:test(@href, "/all/hot/recent/\d+")]         regular expression
        #
        # yield Request(url=url, callback=self.parse)  # add the new URL to the scheduler
        # start_requests can be overridden to specify the method that handles the initial requests
        #
        # def show(self, response):
        #     print(response.text)

        # def md5(self, url):
        #     import hashlib
        #     obj = hashlib.md5()
        #     obj.update(bytes(url, encoding='utf-8'))
        #     return obj.hexdigest()

        # hxs = HtmlXPathSelector(response)

        hxs1 = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]')  # list of selector objects
        for obj in hxs1:
            title = obj.xpath('.//a[@class="show-content color-chag"]/text()').extract_first().strip()
            href = obj.xpath('.//a[@class="show-content color-chag"]/@href').extract_first().strip()
            # print(title)
            # print(href)
            item_obj = ChoutiItem(title=title, href=href)
            # pass the item object to the pipeline
            yield item_obj
parse() processes the response: it parses the content into dictionaries/items and discovers new URLs to request.
Configuring the generated spider means providing (1) the initial URL(s) and (2) how to parse a page once it has been fetched.
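A minimal sketch of such a configured spider (the spider name, URL, and XPath below are illustrative, not taken from a generated template):

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"                               # used by: scrapy crawl demo
    start_urls = ['http://dig.chouti.com/']     # (1) initial URL(s)

    def parse(self, response):                  # (2) how to parse a fetched page
        for href in response.xpath('//a/@href').extract():
            yield {'href': href}                # a scraped item (plain dicts are accepted)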
yield and generators
A generator produces one value at a time (at a yield statement); the function is then frozen and, once resumed, produces the next value. A generator is a function that keeps producing values.
Advantages of a generator over producing all values at once (see the sketch after this list):
1) saves storage space
2) responds more quickly
3) is more flexible to use
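A small illustration of the difference, independent of Scrapy:

def square_list(n):
    return [i * i for i in range(n)]   # builds the whole list in memory up front

def square_gen(n):
    for i in range(n):
        yield i * i                    # produce one value, freeze, resume on demand

for x in square_gen(5):
    print(x)                           # 0 1 4 9 16, computed one at a time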
4. Scrapy data types
The Request class (an example follows the table):
class scrapy.http.Request()
A Request object represents an HTTP request
Generated by a Spider, executed by the Downloader
Attribute or method  Description
.url       the request URL of the Request
.method    the request method, e.g. 'GET' or 'POST'
.headers   dictionary-style request headers
.body      the request body, string type
.meta      user-added extra information, used to pass information between Scrapy modules
.copy()    copy the request
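A sketch of building and inspecting a Request (the URL, header, and meta values are illustrative only):

from scrapy.http import Request

req = Request(
    url='http://dig.chouti.com/',
    method='GET',
    headers={'User-Agent': 'Mozilla/5.0'},   # dictionary-style request headers
    meta={'page': 1},                        # user data carried along with the request
)
print(req.url, req.method)
req2 = req.copy()                            # .copy() duplicates the request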
The Response class:
class scrapy.http.Response()
A Response object represents an HTTP response
Generated by the Downloader, processed by a Spider
Attribute or method  Description
.url        the URL the Response corresponds to
.status     the HTTP status code, 200 by default
.headers    the header information of the Response
.body       the content of the Response (bytes)
.flags      a set of flags
.request    the Request object that produced this Response
.copy()     copy the response
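Inside a spider callback these attributes might be inspected like this (a sketch, not part of the original spider code):

def parse(self, response):
    print(response.url)                          # the URL this response corresponds to
    print(response.status)                       # HTTP status code, e.g. 200
    print(response.headers.get('Content-Type'))  # header information
    print(len(response.body))                    # raw body bytes
    print(response.request.url)                  # the Request that produced this Response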
The Item class
class scrapy.item.Item()
An Item object represents a piece of information extracted from an HTML page
Generated by a Spider, processed by the Item Pipeline
An Item behaves like a dictionary and can be operated on like one
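A sketch of defining an Item and using it like a dictionary (NewsItem and its fields are illustrative):

import scrapy

class NewsItem(scrapy.Item):            # illustrative Item definition
    title = scrapy.Field()
    href = scrapy.Field()

item = NewsItem(title='hello', href='http://example.com')
print(item['title'])                    # dictionary-style access
item['href'] = 'http://example.com/1'   # dictionary-style assignment
print(dict(item))                       # can be converted to a plain dict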
Scrapy spiders support several methods for extracting information from HTML (a small comparison follows this list):
• Beautiful Soup
• lxml
• re
• XPath Selector
• CSS Selector
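For example, the XPath and CSS selectors can express the same extraction; the sketch below assumes a div/a structure and a response object inside a spider callback:

titles_xpath = response.xpath('//div[@class="item"]//a/text()').extract()
titles_css   = response.css('div.item a::text').extract()
first_href   = response.css('div.item a::attr(href)').extract_first()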
Concurrency options in settings.py (an example snippet follows the table)
Option  Description
CONCURRENT_REQUESTS             maximum number of requests the Downloader downloads concurrently, default 16
CONCURRENT_ITEMS                maximum number of items the Item Pipeline processes concurrently, default 100
CONCURRENT_REQUESTS_PER_DOMAIN  maximum number of concurrent requests per target domain, default 8
CONCURRENT_REQUESTS_PER_IP      maximum number of concurrent requests per target IP, default 0; only non-zero values take effect
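In settings.py these are plain module-level variables, e.g. (the values are examples only, not recommendations):

# settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0      # 0 means the per-IP limit is not used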
5. Formatting the scraped data
import scrapy
import sys
import io
from scrapy.http import Request
from scrapy.selector import Selector, HtmlXPathSelector
from ..items import ChoutiItem

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = ['http://dig.chouti.com/']
    visited_urls = set()

    def parse(self, response):
        hxs1 = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]')  # list of selector objects
        for obj in hxs1:
            title = obj.xpath('.//a[@class="show-content"]/text()').extract_first().strip()
            href = obj.xpath('.//a[@class="show-content"]/@href').extract_first().strip()
            item_obj = ChoutiItem(title=title, href=href)
            # pass the item object to the pipeline
            yield item_obj
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ChoutiItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    href = scrapy.Field()
class Day96Pipeline(object):
    def process_item(self, item, spider):
        print(spider, item)
        # if spider.name == 'chouti':
        tpl = "%s\n%s\n\n" % (item['title'], item['href'])
        f = open('news.json', 'a')
        f.write(tpl)
        f.close()
ITEM_PIPELINES = {
    'day96.pipelines.Day96Pipeline': 300,
}
Custom pipeline
from scrapy.exceptions import DropItem


class Day96Pipeline(object):

    def __init__(self, conn_str):
        self.conn_str = conn_str

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at initialization; used to create the pipeline object.
        :param crawler:
        :return:
        """
        conn_str = crawler.settings.get('DB')
        return cls(conn_str)

    def open_spider(self, spider):
        """
        Called when the spider starts running.
        :param spider:
        :return:
        """
        self.conn = open(self.conn_str, 'a')

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        :param spider:
        :return:
        """
        self.conn.close()

    def process_item(self, item, spider):
        """
        Called every time data needs to be persisted.
        :param item:
        :param spider:
        :return:
        """
        # if spider.name == 'chouti':
        tpl = "%s\n%s\n\n" % (item['title'], item['href'])
        self.conn.write(tpl)
        # hand the item on to the next pipeline
        return item
        # to drop the item and stop further processing:
        # raise DropItem()
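For this pipeline to be used, settings.py must register it and define the custom DB key that from_crawler reads; the file name below is only an example:

# settings.py
DB = 'news.txt'    # illustrative path, read via crawler.settings.get('DB')

ITEM_PIPELINES = {
    'day96.pipelines.Day96Pipeline': 300,
}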
6. Cookies
from scrapy.http.cookies import CookieJar

cookie_obj = CookieJar()
cookie_obj.extract_cookies(response, response.request)
print(cookie_obj._cookies)
# -*- coding: utf-8 -*-
import scrapy
import sys
import io
from scrapy.http import Request
from scrapy.http.cookies import CookieJar
from scrapy.selector import Selector, HtmlXPathSelector
from ..items import ChoutiItem

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com", ]
    start_urls = ['http://dig.chouti.com/']
    cookie_dict = None

    def parse(self, response):
        print("spider.response", response)
        cookie_obj = CookieJar()
        cookie_obj.extract_cookies(response, response.request)
        self.cookie_dict = cookie_obj._cookies
        # send the username/password together with the cookies
        yield Request(
            url="http://dig.chouti.com/login",
            method='POST',
            body="phone=8615131255089&password=woshiniba&oneMonth=1",
            headers={'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8"},
            cookies=cookie_obj._cookies,
            callback=self.check_login,
        )

    def check_login(self, response):
        print(response.text)
        yield Request(url="http://dig.chouti.com/", callback=self.good)

    def good(self, response):
        id_list = Selector(response=response).xpath('//div[@share-linkid]/@share-linkid').extract()
        for nid in id_list:
            print(nid)
            url = "http://dig.chouti.com/link/vote?linksId=%s" % nid
            yield Request(
                url=url,
                method="POST",
                cookies=self.cookie_dict,
                callback=self.show,
            )
        page_urls = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_urls:
            url = "http://dig.chouti.com%s" % page
            yield Request(url=url, callback=self.good)

    def show(self, response):
        print(response.text)
7. Custom extensions
from scrapy import signals


class MyExtend:

    def __init__(self, crawler):
        self.crawler = crawler
        # register handlers on the chosen signals (hooks)
        crawler.signals.connect(self.start, signals.engine_started)
        crawler.signals.connect(self.close, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start(self):
        print('signals.engine_started.start')

    def close(self):
        print('signals.spider_closed.close')
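The extension only takes effect once it is registered under EXTENSIONS in settings.py; the module path below assumes MyExtend lives in an extensions.py inside the day96 project:

# settings.py
EXTENSIONS = {
    'day96.extensions.MyExtend': 300,   # assumed module path
}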
8. Proxies
Option 1: use the built-in HttpProxyMiddleware (scrapy.downloadermiddlewares.httpproxy; scrapy.contrib.downloadermiddleware.httpproxy in older versions), which reads the proxy settings from os.environ:

os.environ = {
    'http_proxy':  'http://root:@ip:port/',
    'https_proxy': 'http://ip:port/',
}

Option 2: use a custom downloader middleware:

import random
import base64
import six


def to_bytes(text, encoding=None, errors='strict'):
    if isinstance(text, bytes):
        return text
    if not isinstance(text, six.string_types):
        raise TypeError('to_bytes must receive a unicode, str or bytes '
                        'object, got %s' % type(text).__name__)
    if encoding is None:
        encoding = 'utf-8'
    return text.encode(encoding, errors)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass'])).decode('ascii')
            request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
            print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
        else:
            print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])


DOWNLOADER_MIDDLEWARES = {
    'step8_king.middlewares.ProxyMiddleware': 500,
}
9. Middleware
class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        """
        Called after the download completes, before the response is handed to parse().
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called when the spider has finished processing and returns its results.
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable containing Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called when an exception is raised.
        :param response:
        :param exception:
        :param spider:
        :return: None to pass the exception on to the following middleware;
                 or an iterable containing Response or Item objects, handed to the scheduler or pipeline
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts.
        :param start_requests:
        :param spider:
        :return: an iterable containing Request objects
        """
        return start_requests
class DownMiddleware1(object):

    def process_request(self, request, spider):
        """
        Called, via every downloader middleware's process_request, when a request needs to be downloaded.
        :param request:
        :param spider:
        :return: None: continue to the following middleware and download;
                 Response object: stop process_request and start running process_response;
                 Request object: stop the middleware chain and put the Request back into the scheduler;
                 raise IgnoreRequest: stop process_request and start running process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called on the way back, once the download has produced a response.
        :param request:
        :param response:
        :param spider:
        :return: Response object: handed to the other middlewares' process_response;
                 Request object: stop the middleware chain, the request is rescheduled for download;
                 raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or process_request() (downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return: None: pass the exception on to the following middleware;
                 Response object: stop the following process_exception methods;
                 Request object: stop the middleware chain, the request will be rescheduled for download
        """
        return None
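Both middleware classes must be registered in settings.py to run; the module paths below are assumptions about where the classes above live:

# settings.py
SPIDER_MIDDLEWARES = {
    'day96.middlewares.SpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'day96.middlewares.DownMiddleware1': 543,
}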
10. Custom commands
Besides the command line, Scrapy spiders can also be started from within a program. Because Scrapy is built on the Twisted asynchronous networking library, the crawl must run inside the Twisted reactor.
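A minimal standalone script along these lines can start a spider from Python (assuming it is run from inside the project and a spider named chouti exists):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl('chouti')                           # spider name registered in the project
process.start()                                   # starts the Twisted reactor and blocks until done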
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # find all spider names
        print(type(self.crawler_process))  # <class 'scrapy.crawler.CrawlerProcess'>
        from scrapy.crawler import CrawlerProcess
        # 1. the CrawlerProcess constructor runs
        # 2. the CrawlerProcess object holds the spiders from the configuration
        #    2.1 a Crawler is created for each spider
        #    2.2 d = Crawler.crawl(...) is executed   # ************************ #
        #        d.addBoth(_done)
        #    2.3 CrawlerProcess._active = {d,}
        # 3. dd = defer.DeferredList(self._active)
        #    dd.addBoth(self._stop_reactor)           # self._stop_reactor ==> reactor.stop()
        #    reactor.run()

        # get the names of all current spiders
        spider_list = self.crawler_process.spiders.list()
        # spider_list = ["chouti", 'cnblogs']
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
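To make the command available as, say, scrapy crawlall, Scrapy looks it up through the COMMANDS_MODULE setting; one common arrangement (the module and file names are assumed) is:

# day96/commands/__init__.py   -> empty file
# day96/commands/crawlall.py   -> contains the Command class above

# settings.py
COMMANDS_MODULE = 'day96.commands'

# then the command can be run as:
#   scrapy crawlall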