A Simple Distributed Crawler with Scrapy

  After tinkering for a while, I finally worked out how distributed crawling with scrapy is done, so I am writing down a few notes.

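  Scrapy can do a lot, but on its own it falls short for large-scale distributed use. Someone capable reworked scrapy's queue scheduling (the scrapy-redis package): the start URLs are separated from start_urls and read from redis instead, and several clients can read from the same redis at once, which gives a distributed crawler. Even on a single machine the crawler can run as multiple processes, which is very effective for large-scale crawling.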
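To make the idea concrete, here is a minimal sketch of the shared-queue pattern, assuming an in-memory stand-in for redis. `SharedStore`, `worker_step`, and the example.com URLs are hypothetical names for illustration only; scrapy-redis keeps the equivalent queue and fingerprint set inside a real redis server.

```python
from collections import deque


class SharedStore:
    """In-memory stand-in for the redis server that all workers share (hypothetical)."""

    def __init__(self):
        self.queue = deque()  # pending URLs, visible to every worker
        self.seen = set()     # URLs already scheduled, used for de-duplication

    def push(self, url):
        """Schedule a URL unless some worker has already scheduled it."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        """Hand the next pending URL to whichever worker asks first."""
        return self.queue.popleft() if self.queue else None


def worker_step(name, store, discover):
    """One crawl step: take a URL, 'fetch' it, and schedule the links it yields."""
    url = store.pop()
    if url is None:
        return None
    for link in discover(url):
        store.push(link)
    return (name, url)


# Hypothetical link graph standing in for real pages.
LINKS = {
    "http://example.com/list": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],  # duplicate link, filtered out
}

store = SharedStore()
store.push("http://example.com/list")

handled = []
workers = ["linux", "windows"]
turn = 0
while store.queue:
    # Workers alternate here; in practice they run as separate processes.
    handled.append(worker_step(workers[turn % 2], store, lambda u: LINKS.get(u, [])))
    turn += 1
```

Every URL is handled exactly once, no matter which worker picks it up, because both the queue and the "seen" set live in the shared store rather than inside any one process.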

Preparation:

  1. One Windows machine (worker: scrapy)

  2. One Linux machine (master: scrapy, redis, mongo)

      ip: 192.168.184.129

  3. Python 3.6

 

Configuring scrapy on Linux:

  1. Install Python 3.6

    yum install openssl-devel -y   (fixes pip3 being unusable: "pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available")

    Download the Python source package, Python-3.6.1.tar.xz, and after extracting it run:

      ./configure --prefix=/python3

      make

      make install  

    Add it to the environment variables:

      PATH=/python3/bin:$PATH:$HOME/bin

      export PATH

    Once the installation finishes, pip3 is installed by default as well (gcc must be installed first with yum install gcc).
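Because the pip3 failure quoted above comes from a missing `ssl` module, it can be worth confirming the new interpreter before continuing. The following is a small check meant to be run with the freshly built `python3`; the function names are only illustrative.

```python
import sys


def ssl_available():
    """Return True if the ssl module imports, i.e. Python was built with OpenSSL."""
    try:
        import ssl  # noqa: F401  (raises ImportError when _ssl was not built)
    except ImportError:
        return False
    return True


def describe_interpreter():
    """Collect the facts that matter for the following pip3 steps."""
    return {
        "version": "%d.%d.%d" % sys.version_info[:3],
        "executable": sys.executable,
        "ssl_available": ssl_available(),
    }


info = describe_interpreter()
for key, value in info.items():
    print(key, "=", value)
```

If `ssl_available` is reported as False, installing openssl-devel and rebuilding Python is the fix described in this step.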



  2. Install Twisted

    Download Twisted-17.9.0.tar.bz2, extract it, then cd Twisted-17.9.0 and run python3 setup.py install

  3. Install scrapy

    pip3 install scrapy

    pip3 install scrapy-redis

  4. Install redis

    See the blog post "redis安裝與簡單使用" (installing redis and basic usage)
    Error: You need tcl 8.5 or newer in order to run the Redis test
      1. wget http://downloads.sourceforge.net/tcl/tcl8.6.1-src.tar.gz

      2. tar -xvf tcl8.6.1-src.tar.gz
      3. cd tcl8.6.1/unix ; ./configure ; make ; make install

    cp /root/redis-3.2.11/redis.conf /etc/
    Start: /root/redis-3.2.11/src/redis-server /etc/redis.conf &

  5. pip3 install redis

  6. Install mongodb

    See the Runoob tutorial: http://www.runoob.com/mongodb/mongodb-linux-install.html
    Start: # mongod --bind_ip 192.168.184.129 &

  7. pip3 install pymongo
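Before moving on to the worker machine, it may help to confirm that both services are reachable over the network. This is a minimal sketch using only the standard library; `port_open` and `check_services` are hypothetical helpers, and the ports are assumed to be the defaults (6379 for redis, 27017 for mongod).

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host, services):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in services.items()}


# Assumed default ports; adjust if redis.conf or the mongod options say otherwise.
SERVICES = {"redis": 6379, "mongodb": 27017}
```

Calling `check_services("192.168.184.129", SERVICES)` from the Windows machine shows whether each port answers. If redis is reported closed from another machine, the `bind` and `protected-mode` directives in /etc/redis.conf are a likely place to look.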

 

Deploying scrapy on Windows:

    1. Install wheel
        pip install wheel
    2. Install lxml
        https://pypi.python.org/pypi/lxml/4.1.0
    3. Install pyopenssl
        https://pypi.python.org/pypi/pyOpenSSL/17.5.0
    4. Install Twisted
        https://www.lfd.uci.edu/~gohlke/pythonlibs/
    5. Install pywin32
        https://sourceforge.net/projects/pywin32/files/
    6. Install scrapy
        pip install scrapy

 

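  Deploying the code:

  I use scraping the movie listings of 美劇天堂 (meijutt.com) as a simple example to show how the distribution works. One copy of the code goes on Linux and one on Windows, with identical configuration, and the two can run and crawl at the same time.

Only the parts that need to be changed are listed:

  settings

    Set the database that stores the scraped data (mongodb) and the database that stores the fingerprints and the queue (redis)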

ROBOTSTXT_OBEY = False  # do not obey robots.txt
CONCURRENT_REQUESTS = 1  # maximum number of concurrent requests Scrapy performs (default: 16)
ITEM_PIPELINES = {
   'meiju.pipelines.MongoPipeline': 300,
}
MONGO_URI = '192.168.184.129'  # mongodb connection info
MONGO_DATABASE = 'mj'
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # use the scrapy_redis scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # de-duplicate requests (URLs) in redis
# REDIS_URL = 'redis://root:kongzhagen@localhost:6379'  # use this setting instead if redis has a password
REDIS_HOST = '192.168.184.129'  # redis connection info
REDIS_PORT = 6379
SCHEDULER_PERSIST = True  # do not clear the fingerprints when the spider closes
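
The "fingerprint" mentioned in the comments is a hash of each request; redis holds the set of hashes, so every worker sees the same de-duplication state. Below is a simplified sketch of the idea using only the standard library. It is not Scrapy's actual fingerprint implementation, and `simple_fingerprint` and `is_duplicate` are hypothetical helpers.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_fingerprint(method, url):
    """Hash a request so that equivalent URLs map to the same 40-character digest."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 count as the same request.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()


seen = set()  # stands in for the redis set shared by all workers


def is_duplicate(method, url):
    """Return True if an equivalent request was already recorded in `seen`."""
    fp = simple_fingerprint(method, url)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

With SCHEDULER_PERSIST enabled the stored fingerprints survive a restart, so a resumed crawl skips pages that were already requested.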

  pipelines

    Code that stores the items in MongoDB

import pymongo

class MeijuPipeline(object):
    def process_item(self, item, spider):
        return item

class MongoPipeline(object):

    collection_name = 'movies'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

  items

    Data structure

import scrapy


class MeijuItem(scrapy.Item):
    movieName = scrapy.Field()
    status = scrapy.Field()
    english = scrapy.Field()
    alias = scrapy.Field()
    tv = scrapy.Field()
    year = scrapy.Field()
    type = scrapy.Field()

  Spider script mj.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

from meiju.items import MeijuItem


class MjSpider(scrapy.Spider):
    name = 'mj'
    allowed_domains = ['meijutt.com']
    # start_urls = ['http://www.meijutt.com/file/list1.html']

    def start_requests(self):
        # Seed request; the scrapy_redis scheduler stores it in the shared redis queue.
        yield Request(url='http://www.meijutt.com/file/list1.html', callback=self.parse)

    def parse(self, response):
        # Each div.cn_box2 block describes one show.
        movies = response.xpath('//div[@class="cn_box2"]')
        for movie in movies:
            item = MeijuItem()
            item['movieName'] = movie.xpath('./ul[@class="list_20"]/li[1]/a/text()').extract_first()
            item['status'] = movie.xpath('./ul[@class="list_20"]/li[2]/span/font/text()').extract_first()
            item['english'] = movie.xpath('./ul[@class="list_20"]/li[3]/font[2]/text()').extract_first()
            item['alias'] = movie.xpath('./ul[@class="list_20"]/li[4]/font[2]/text()').extract_first()
            item['tv'] = movie.xpath('./ul[@class="list_20"]/li[5]/font[2]/text()').extract_first()
            item['year'] = movie.xpath('./ul[@class="list_20"]/li[6]/font[2]/text()').extract_first()
            item['type'] = movie.xpath('./ul[@class="list_20"]/li[7]/font[2]/text()').extract_first()
            yield item
        # Follow each show's link; with no explicit callback, Scrapy calls parse() again.
        for i in response.xpath('//div[@class="cn_box2"]/ul[@class="list_20"]/li[1]/a/@href').extract():
            yield Request(url='http://www.meijutt.com' + i)
        # Pagination (disabled):
        # next = 'http://www.meijutt.com' + response.xpath("//a[contains(.,'下一頁')]/@href")[1].extract()
        # print(next)
        # yield Request(url=next, callback=self.parse)
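
The spider builds absolute URLs by concatenating the site root with each `href`, which only works when every link is root-relative. As a hedged alternative, the standard library's `urljoin` resolves relative, root-relative, and absolute links against the page URL. The link paths below are made up for illustration, and `absolute_links` is a hypothetical helper.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.meijutt.com/file/list1.html"


def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical hrefs of the three kinds a page may contain.
links = absolute_links(PAGE_URL, [
    "/content/example1.html",                        # root-relative
    "list2.html",                                    # relative to the current directory
    "http://www.meijutt.com/content/example2.html",  # already absolute
])
```

Inside a Scrapy callback, `response.urljoin(href)` performs the same resolution against `response.url`.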

   

 

Take a look at the state in redis:
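
One way to inspect that state from Python is sketched below. It assumes scrapy-redis's default key names, `<spider>:requests` (a sorted set of pending requests) and `<spider>:dupefilter` (a set of fingerprints); `crawl_state` is a hypothetical helper that accepts any client offering redis-py's `zcard` and `scard` methods.

```python
def crawl_state(client, spider_name):
    """Count pending requests and stored fingerprints for one spider.

    `client` is expected to behave like redis.Redis (zcard/scard methods).
    Key names assume the scrapy-redis defaults and change if
    SCHEDULER_QUEUE_KEY or SCHEDULER_DUPEFILTER_KEY is overridden.
    """
    return {
        "pending_requests": client.zcard("%s:requests" % spider_name),
        "fingerprints": client.scard("%s:dupefilter" % spider_name),
    }


# Usage against the redis server from this article (requires `pip3 install redis`):
#   import redis
#   client = redis.Redis(host="192.168.184.129", port=6379)
#   print(crawl_state(client, "mj"))
```

While both machines are crawling, the pending count rises and falls as requests are queued and consumed, and the fingerprint count only grows.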

  

 

Take a look at the data in mongodb:
