Scrapy log levels / handling multiple pipelines / handling multiple items

I. Scrapy log levels

  1. Configuration

 - When you run a spider with scrapy crawl spiderFileName, what gets printed in the terminal is Scrapy's log output.
 - Log level categories:
         ERROR: ordinary errors
         WARNING: warnings
         INFO: general information
         DEBUG: debugging information

- To control the log output, add the following to the settings.py configuration file:

                    LOG_LEVEL = 'desired log level'

                    LOG_FILE = 'log.txt'  writes the log to the specified file; once this is set, the terminal no longer displays log output.

 Note: neither setting is present in settings.py by default; you have to add them yourself.
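A minimal settings.py fragment combining the two options above (the level and the file name here are just example values):

# settings.py -- example values, adjust as needed
LOG_LEVEL = 'ERROR'    # only ERROR-level (and more severe) messages are emitted
LOG_FILE = 'log.txt'   # log output is written to this file instead of the terminal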

   2. Usage

# In the .py file that needs logging, import the package
import logging

# Instantiating a logger with __name__ lets the log record which file a message came from
LOGGER = logging.getLogger(__name__)

def start_requests(self):
    if self.shared_data is not None:
        user = self.shared_data['entry_data']['ProfilePage'][0]['graphql']['user']
        self.user_id = user['id']
        self.count = user['edge_owner_to_timeline_media']['count']
        LOGGER.info('\n{}\nUser id:{}\nTotal {} photos.\n{}\n'.format('-' * 20, self.user_id, self.count, '-' * 20))
        for i, url in enumerate(self.start_urls):
            yield self.request("", self.parse_item)
    else:
        LOGGER.error('-----[ERROR] shared_data is None.')

# You can also skip the instantiation and call the module-level functions directly
logging.info('dasdasdasd')
logging.debug('sfdghj')
# Used this way, the log will not show which file the message came from

   3. Extension: using logging in an ordinary (non-Scrapy) program

import logging

# Output format for log messages
LOG_FORMAT = "%(asctime)s %(name)s %(levelname)s %(pathname)s %(message)s "
# Timestamp format; be careful not to mix up month and day
DATE_FORMAT = '%Y-%m-%d %H:%M:%S %a '

# With filename set, output is written to the file instead of the console
logging.basicConfig(level=logging.DEBUG, format=LOG_FORMAT, datefmt=DATE_FORMAT,
                    filename=r"d:\test\test.log")

# Instantiate a logger
logger = logging.getLogger(__name__)

logger.debug("msg1")
logger.info("msg2")
logger.warning("msg3")
logger.error("msg4")
logger.critical("msg5")

 

二.如何提升scrapy的爬取效率

Increase concurrency: by default Scrapy runs 16 concurrent requests, which can be raised as needed. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.
Lower the log level: running Scrapy produces a large amount of log output; to reduce CPU usage, set the log level to INFO or ERROR. In the settings file: LOG_LEVEL = 'INFO'
Disable cookies: unless cookies are genuinely needed, disable them while crawling to cut CPU usage and speed up the crawl. In the settings file: COOKIES_ENABLED = False
Disable retries: re-requesting failed HTTP requests (retrying) slows the crawl down, so retries can be disabled. In the settings file: RETRY_ENABLED = False
Reduce the download timeout: when crawling very slow links, a smaller download timeout lets stuck requests be dropped quickly, improving efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 sets the timeout to 10 seconds.

  The spider file (.py)

# -*- coding: utf-8 -*-
import scrapy
from xiaohua.items import XiaohuaItem

class XiahuaSpider(scrapy.Spider):
    name = 'xiaohua'
    allowed_domains = ['www.521609.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']
    pageNum = 1
    url = 'http://www.521609.com/daxuemeinv/list8%d.html'

    def parse(self, response):
        li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')
        for li in li_list:
            school = li.xpath('./a/img/@alt').extract_first()
            img_url = li.xpath('./a/img/@src').extract_first()
            item = XiaohuaItem()
            item['school'] = school
            item['img_url'] = 'http://www.521609.com' + img_url
            yield item

        if self.pageNum < 10:
            self.pageNum += 1
            url = format(self.url % self.pageNum)
            # print(url)
            yield scrapy.Request(url=url, callback=self.parse)

  items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class XiaohuaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    school = scrapy.Field()
    img_url = scrapy.Field()

  pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import os
import urllib.request

class XiaohuaPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./xiaohua.txt', 'w')

    def download_img(self, item):
        url = item['img_url']
        fileName = item['school'] + '.jpg'
        if not os.path.exists('./xiaohualib'):
            os.mkdir('./xiaohualib')
        filepath = os.path.join('./xiaohualib', fileName)
        urllib.request.urlretrieve(url, filepath)
        print(fileName + ' downloaded successfully')

    def process_item(self, item, spider):
        obj = dict(item)
        json_str = json.dumps(obj, ensure_ascii=False)
        self.fp.write(json_str + '\n')
        # download the image
        self.download_img(item)
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()

  settings.py

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
COOKIES_ENABLED = False
LOG_LEVEL = 'ERROR'
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 3

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
DOWNLOAD_DELAY = 3

 

III. Anti-crawler handling

1. Set download_delay: add a wait time between downloads; large-scale, concentrated access puts the heaviest load on the target server. download_delay can be set in settings.py or on the spider itself.
2. Disable cookies (see COOKIES_ENABLED): some sites use cookies to detect crawler behaviour. Set COOKIES_ENABLED = False in settings.py.
3. Use a user-agent pool and rotate through it, picking one user agent per request. This requires writing your own UserAgent downloader middleware (see the sketch after this list).
4. Use an IP pool. The third-party Crawlera can work around a site's per-IP limits; install scrapy-crawlera and its middleware integrates the feature easily.
5. Crawl distributedly: Scrapy-Redis enables distributed crawling.
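For item 3, a minimal sketch of a user-agent pool as a downloader middleware; the module path, class name and the UA strings below are placeholders to adapt to your own project:

# middlewares.py (hypothetical module)
import random

USER_AGENT_LIST = [
    # fill in real user-agent strings here
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
]

class RandomUserAgentMiddleware(object):
    """Pick a random User-Agent for every outgoing request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)

# settings.py -- enable the middleware (the dotted path depends on your project name)
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RandomUserAgentMiddleware': 543,
# }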

 

IV. Handling multiple pipelines

  Why you might need multiple pipelines:

1. There may be multiple spiders, with different pipelines handling different content.
2. A single spider may need to do different things with its data, for example storing it in different databases.

  Approach 1: check the spider's name in the pipeline

class InsCrawlPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'jingdong':
            pass
        return item

class InsCrawlPipeline1(object):
    def process_item(self, item, spider):
        if spider.name == 'taobao':
            pass
        return item

class InsCrawlPipeline2(object):
    def process_item(self, item, spider):
        if spider.name == 'baidu':
            pass
        return item

  Or: check the spider's name inside a single pipeline class:

class InsCrawlPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'jingdong':
            pass
        elif spider.name == 'taobao':
            pass
        elif spider.name == 'baidu':
            pass
        return item
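Whichever variant you use, every pipeline class must also be registered in ITEM_PIPELINES in settings.py; a minimal sketch, assuming the project package is called ins_crawl (the dotted paths are hypothetical):

# settings.py
ITEM_PIPELINES = {
    'ins_crawl.pipelines.InsCrawlPipeline': 300,    # lower number = runs earlier
    'ins_crawl.pipelines.InsCrawlPipeline1': 301,
    'ins_crawl.pipelines.InsCrawlPipeline2': 302,
}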

  Approach 2: add a key to the item in the spider file

  spider.py

def parse(self, response):
    item = {}
    item['come_from'] = 'baidu'
    # ... fill in the remaining fields ...
    yield item

  pipelines.py

class InsCrawlPipeline(object):
    def process_item(self, item, spider):
        if item['come_from'] == 'taobao':
            pass
        elif item['come_from'] == 'jingdong':
            pass
        elif item['come_from'] == 'baidu':
            pass
        return item

 

V. Handling multiple items

  In items.py

import scrapy

class YangguangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
    publish_date = scrapy.Field()
    content = scrapy.Field()
    content_img = scrapy.Field()

class TaobaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class JingDongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

  pipelines.py

class YangguangPipeline(object):
    def process_item(self, item, spider):
        # TaobaoItem and JingDongItem are imported from the project's items module;
        # 'collection' is assumed to be a database collection object created elsewhere
        if isinstance(item, TaobaoItem):
            collection.insert(dict(item))
        elif isinstance(item, JingDongItem):
            pass  # handle JingDongItem here
        return item
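For the isinstance dispatch above to see the right class, each spider simply yields its own item type; a minimal sketch (the spider, project module path and URL are hypothetical):

# in a spider file of the same project
import scrapy
from yangguang.items import TaobaoItem

class TaobaoSpider(scrapy.Spider):
    name = 'taobao'
    start_urls = ['http://example.com/']

    def parse(self, response):
        item = TaobaoItem()
        # ... fill in TaobaoItem fields once they are defined ...
        yield item   # YangguangPipeline.process_item then routes it by type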

   
