Managing Distributed Crawlers with Scrapy-Cluster and SpiderKeeper

Building a Scrapy-cluster

  • The kafka-monitor from the Scrapy-cluster library enables distributed crawling
  • Scrapyd + SpiderKeeper provide visual management of the crawlers

Environment

IP           Role
168.*.*.118  Scrapy-cluster, scrapyd, spiderkeeper
168.*.*.119  Scrapy-cluster, scrapyd, kafka, redis, zookeeper
# cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core) 
# python -V
Python 2.7.5
# java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

ZooKeeper standalone configuration

  • Download and configure
# wget http://mirror.bit.edu.cn/apache/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
# tar -zxvf zookeeper-3.4.13.tar.gz
# cd zookeeper-3.4.13/conf
# cp zoo_sample.cfg zoo.cfg
# cd ..
# PATH=/opt/zookeeper-3.4.13/bin:$PATH
# echo 'export PATH=/opt/zookeeper-3.4.13/bin:$PATH' > /etc/profile.d/zoo.sh
  • Start the single node
# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.13/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

# zkServer.sh start
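  • Optional: verify from Python
To confirm the standalone node actually answers requests, a quick check can be run with the kazoo client (scrapy-cluster relies on kazoo for its ZooKeeper access, so it is installed with the requirements later). A minimal sketch, assuming the default client port 2181:
from kazoo.client import KazooClient

# connect to the standalone ZooKeeper instance and list the root znodes
zk = KazooClient(hosts='168.*.*.119:2181')
zk.start(timeout=10)
print(zk.get_children('/'))   # a fresh install normally shows ['zookeeper']
zk.stop()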

Kafka standalone configuration

  • Download
# wget http://mirrors.hust.edu.cn/apache/kafka/2.0.0/kafka_2.12-2.0.0.tgz
# tar -zxvf kafka_2.12-2.0.0.tgz
# cd kafka_2.12-2.0.0/
  • Configure
# vim config/server.properties

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0                      # unique id of this broker in the cluster
host.name=168.*.*.119            # IP address to bind to
port=9092                        # default port 9092
# Switch to enable topic deletion or not, default value is false
delete.topic.enable=true
############################# Zookeeper #############################
zookeeper.connect=localhost:2181
  • Start
nohup bin/kafka-server-start.sh config/server.properties &

Stop command: bin/kafka-server-stop.sh config/server.properties
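
  • Optional: smoke test from Python
A test message can be produced and consumed with kafka-python (also a scrapy-cluster dependency) to confirm the broker is reachable. A minimal sketch, using a throwaway topic name and assuming automatic topic creation is left at its default:
from kafka import KafkaProducer, KafkaConsumer

# produce a single message to a throwaway topic
producer = KafkaProducer(bootstrap_servers='168.*.*.119:9092')
producer.send('smoke-test', b'hello kafka')
producer.flush()

# read it back; consumer_timeout_ms stops the loop if nothing arrives
consumer = KafkaConsumer('smoke-test',
                         bootstrap_servers='168.*.*.119:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)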

Redis standalone configuration

  • Install and configure
# yum -y install redis
# vim /etc/redis.conf
bind 168.*.*.119
  • Start
# systemctl start redis.service
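
  • Optional: verify from Python
A one-line ping with the redis package (installed with the scrapy-cluster requirements) confirms the service accepts remote connections:
import redis

# PING the Redis instance bound to the internal IP
r = redis.StrictRedis(host='168.*.*.119', port=6379, db=0)
print(r.ping())   # True means the server is reachable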

scrapy-cluster standalone configuration

# git clone https://github.com/istresearch/scrapy-cluster.git
# cd scrapy-cluster
# pip install -r requirements.txt
  • Run the offline unit tests to make sure everything appears to work
# ./run_offline_tests.sh
  • Edit the configuration files
# vim kafka-monitor/settings.py
# vim redis-monitor/settings.py
# vim crawlers/crawling/settings.py
  • Change the following settings in each file:
# Redis host configuration
REDIS_HOST = '168.*.*.119'
REDIS_PORT = 6379
REDIS_DB = 0

KAFKA_HOSTS = '168.*.*.119:9092'
KAFKA_TOPIC_PREFIX = 'demo'
KAFKA_CONN_TIMEOUT = 5
KAFKA_APPID_TOPICS = False
KAFKA_PRODUCER_BATCH_LINGER_MS = 25  # 25 ms before flush
KAFKA_PRODUCER_BUFFER_BYTES = 4 * 1024 * 1024  # 4MB before blocking

# Zookeeper Settings
ZOOKEEPER_ASSIGN_PATH = '/scrapy-cluster/crawler/'
ZOOKEEPER_ID = 'all'
ZOOKEEPER_HOSTS = '168.*.*.119:2181'
  • Start the monitors
# nohup python kafka_monitor.py run >> /root/scrapy-cluster/kafka-monitor/kafka_monitor.log 2>&1 &
# nohup python redis_monitor.py >> /root/scrapy-cluster/redis-monitor/redis_monitor.log 2>&1 &

Scrapyd crawler management tool configuration

  • Install
# pip install scrapyd
  • Configure
# sudo mkdir /etc/scrapyd
# sudo vi /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
  • Start
# nohup scrapyd >> /root/scrapy-cluster/scrapyd.log 2>&1 &
An Nginx reverse proxy in front of scrapyd is recommended.
  • Startup error
File "/usr/local/lib/python3.6/site-packages/scrapyd-1.2.0-py3.6.egg/scrapyd/app.py", line 2, in <module>
from twisted.application.internet import TimerService, TCPServer
File "/usr/local/lib64/python3.6/site-packages/twisted/application/internet.py", line 54, in <module>
from automat import MethodicalMachine
File "/usr/local/lib/python3.6/site-packages/automat/__init__.py", line 2, in <module>
from ._methodical import MethodicalMachine
File "/usr/local/lib/python3.6/site-packages/automat/_methodical.py", line 210, in <module>
    class MethodicalInput(object):
File "/usr/local/lib/python3.6/site-packages/automat/_methodical.py", line 220, in MethodicalInput
    @argSpec.default
builtins.TypeError: '_Nothing' object is not callable


Failed to load application: '_Nothing' object is not callable
  • Fix: downgrade Automat
pip install Automat==0.6.0
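
  • Verify the scrapyd API
Once scrapyd starts cleanly, its JSON web service (the endpoints listed under [services] above) can be used to check the daemon and, later, to schedule spiders. A minimal sketch with requests; the project name 'crawling' matches the scrapy.cfg used below, and the spider name 'link' assumes the stock LinkSpider that ships with scrapy-cluster:
import requests

SCRAPYD = 'http://168.*.*.118:6800'

# daemon status and the projects scrapyd currently knows about
print(requests.get(SCRAPYD + '/daemonstatus.json').json())
print(requests.get(SCRAPYD + '/listprojects.json').json())

# after the project has been deployed with scrapyd-deploy (see below),
# a crawl can also be scheduled through the same API:
# requests.post(SCRAPYD + '/schedule.json',
#               data={'project': 'crawling', 'spider': 'link'})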

SpiderKeeper crawler management UI configuration

  • Install
pip install SpiderKeeper
  • Start
mkdir /root/spiderkeeper/
nohup spiderkeeper --server=http://168.*.*.118:6800 --username=admin --password=admin --database-url=sqlite:////root/spiderkeeper/SpiderKeeper.db >> /root/scrapy-cluster/spiderkeeper.log 2>&1 &
  • Open http://168.*.*.118:5000 in a browser

Managing crawlers with SpiderKeeper

Deploying the crawler project with scrapyd-deploy

  • Edit the scrapy.cfg configuration
vim /root/scrapy-cluster/crawler/scrapy.cfg
[settings]
default = crawling.settings

[deploy]
url = http://168.*.*.118:6800/
project = crawling
  • Add new spiders under the following directory (a minimal spider sketch is shown after the deployment output below)
cd /root/scrapy-cluster/crawler/crawling/spiders
  • Deploy the project with scrapyd-deploy
# cd /root/scrapy-cluster/crawler
# scrapyd-deploy 
Packing version 1536225989
Deploying to project "crawling" in http://168.*.*.118:6800/addversion.json
Server response (200):
{"status": "ok", "project": "crawling", "version": "1536225989", "spiders": 3, "node_name": "ambari"}

Configuring the crawler project in SpiderKeeper

  • Log in to SpiderKeeper and create a project

Use the project name configured in scrapy.cfg.

Once the project is created, all spiders appear under Spiders -> Dashboard.

Scrapy-cluster distributed crawling

Scrapy Cluster coordinates between the different crawler servers to ensure maximum content throughput while controlling how fast the cluster hits the websites it crawls.

Scrapy Cluster provides two main strategies for controlling how fast the crawlers hit different domains, determined by spider type and/or IP address; both act on per-domain queues.

In Scrapy-cluster's distributed crawl, URLs are distributed based on IP address. With the cluster started on different machines, each crawler on each server pulls its links from the queue.

Deploying a second scrapy-cluster node

Configure a new server following the scrapy-cluster standalone configuration above, reusing the first server's kafka-monitor/settings.py, redis-monitor/settings.py and crawling/settings.py.

The "Current public ip" problem

Because the two servers are deployed on the same internal network, the spiders obtain the same "Current public ip" at run time, so the scrapy-cluster scheduler cannot distribute links by IP.

2018-09-07 16:08:29,684 [sc-crawler] DEBUG: Current public ip: b'110.*.*.1'

See /root/scrapy-cluster/crawler/crawling/distributed_scheduler.py, around line 282:

try:
    obj = urllib.request.urlopen(settings.get('PUBLIC_IP_URL',
                                  'http://ip.42.pl/raw'))
    results = self.ip_regex.findall(obj.read())
    if len(results) > 0:
        # results[0] holds the public IP address, e.g. 110.90.122.1
        self.my_ip = results[0]
    else:
        raise IOError("Could not get valid IP Address")
    obj.close()
    self.logger.debug("Current public ip: {ip}".format(ip=self.my_ip))
except IOError:
    self.logger.error("Could not reach out to get public ip")
    pass

The suggested fix is to modify this code to use the local machine's IP instead:

# determine the local LAN IP by opening a UDP socket toward a public resolver
# (requires "import socket" at the top of distributed_scheduler.py)
self.my_ip = [(s.connect(('8.8.8.8', 53)), s.getsockname()[0], s.close()) 
                for s in [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)]][0][1]

Running the distributed crawl

Run the same spider on both scrapy-cluster nodes:

from scrapy.cmdline import execute
execute(['scrapy', 'runspider', 'crawling/spiders/link_spider.py'])

Feed multiple links with python kafka_monitor.py feed (a sketch follows below); the DEBUG logs then show how the links are distributed across the nodes.
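
A crawl request can be fed either with the kafka_monitor.py feed command or by publishing the same JSON to the incoming Kafka topic directly. A minimal sketch of the latter, assuming the incoming topic is named demo.incoming (KAFKA_TOPIC_PREFIX + '.incoming') and using a hypothetical target URL:
import json
from kafka import KafkaProducer

# publish one crawl request to the kafka-monitor's incoming topic
producer = KafkaProducer(bootstrap_servers='168.*.*.119:9092')
request = {
    "url": "http://example.com",   # hypothetical URL to crawl
    "appid": "testapp",
    "crawlid": "abc1234",
}
producer.send('demo.incoming', json.dumps(request).encode('utf-8'))
producer.flush()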

Managing the distributed crawl with SpiderKeeper

Configuring scrapyd for the second scrapy-cluster node

Install and configure scrapyd on the second scrapy-cluster server (see the Scrapyd crawler management tool configuration above) and adjust scrapy.cfg as follows:

[settings]
default = crawling.settings

[deploy]
url = http://168.*.*.119:6800/
project = crawling

After starting scrapyd, use scrapyd-deploy to deploy the crawler project to both scrapy-cluster nodes.

Connecting SpiderKeeper to multiple scrapy-cluster nodes

  • Restart SpiderKeeper so that it connects to the scrapyd instances of both scrapy-cluster nodes.
nohup spiderkeeper --server=http://168.*.*.118:6800 --server=http://168.*.*.119:6800 --username=admin --password=admin --database-url=sqlite:////root/spiderkeeper/SpiderKeeper.db >> /root/scrapy-cluster/spiderkeeper.log 2>&1 &
Note: for SpiderKeeper to manage them as one cluster, the crawler project name must be identical on every node, and each scrapy-cluster node must be configured with the same spiders.
  • Open http://168.*.*.118:5000 in a browser. When launching a crawl you can see both scrapy-cluster nodes; start the spider of the same name on both to kick off the distributed crawl.

  • Status after the distributed crawl has started
