| IP | Role |
| --- | --- |
| 168.*.*.118 | Scrapy-cluster, scrapyd, spiderkeeper |
| 168.*.*.119 | Scrapy-cluster, scrapyd, kafka, redis, zookeeper |
```
# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
# python -V
Python 2.7.5
# java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
```
```
# wget http://mirror.bit.edu.cn/apache/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
# tar -zxvf zookeeper-3.4.13.tar.gz
# cd zookeeper-3.4.13/conf
# cp zoo_sample.cfg zoo.cfg
# cd ..
# PATH=/opt/zookeeper-3.4.13/bin:$PATH
# echo 'export PATH=/opt/zookeeper-3.4.13/bin:$PATH' > /etc/profile.d/zoo.sh
```
```
# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.13/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
# zkServer.sh start
```
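To confirm ZooKeeper really is up, a minimal connectivity check can be run with the kazoo client (installed later with scrapy-cluster's requirements, or via pip). This is only a sketch and assumes the default port 2181 on localhost:

```python
# Minimal ZooKeeper connectivity check using kazoo.
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()                     # raises an exception if ZooKeeper is unreachable
print(zk.get_children('/'))    # e.g. ['zookeeper']
zk.stop()
```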
```
# wget http://mirrors.hust.edu.cn/apache/kafka/2.0.0/kafka_2.12-2.0.0.tgz
# tar -zxvf kafka_2.12-2.0.0.tgz
# cd kafka_2.12-2.0.0/
```
```
# vim config/server.properties

############################# Server Basics #############################
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0                       # broker number for this Kafka node
host.name = 168.*.*.119           # bind IP
port=9092                         # default port 9092
# Switch to enable topic deletion or not, default value is false
delete.topic.enable=true

############################# Zookeeper #############################
zookeeper.connect=localhost:2181
```
nohup bin/kafka-server-start.sh config/server.properties &
To stop Kafka, run `bin/kafka-server-stop.sh config/server.properties`.
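Before wiring up scrapy-cluster it is worth confirming the broker accepts connections. A small sketch using kafka-python (also pulled in by scrapy-cluster's requirements); the topic name `smoke-test` is chosen here only for illustration:

```python
# Produce and consume one message to verify the Kafka broker is reachable.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers='168.*.*.119:9092')
producer.send('smoke-test', b'hello kafka')   # topic is auto-created by default
producer.flush()

consumer = KafkaConsumer('smoke-test',
                         bootstrap_servers='168.*.*.119:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)                      # b'hello kafka'
```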
```
# yum -y install redis
# vim /etc/redis.conf

bind 168.*.*.119
```
# systemctl start redis.service
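A quick reachability check from Python, using the redis client that scrapy-cluster depends on (host, port and db taken from the settings used in this setup):

```python
# Verify that Redis answers on the configured bind address.
import redis

r = redis.StrictRedis(host='168.*.*.119', port=6379, db=0)
print(r.ping())   # True when Redis is up and reachable
```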
```
# git clone https://github.com/istresearch/scrapy-cluster.git
# cd scrapy-cluster
# pip install -r requirements.txt
```
# ./run_offline_tests.sh
```
# vim kafka-monitor/settings.py
# vim redis-monitor/settings.py
# vim crawler/crawling/settings.py
```
```
# Redis host configuration
REDIS_HOST = '168.*.*.119'
REDIS_PORT = 6379
REDIS_DB = 0

KAFKA_HOSTS = '168.*.*.119:9092'
KAFKA_TOPIC_PREFIX = 'demo'
KAFKA_CONN_TIMEOUT = 5
KAFKA_APPID_TOPICS = False
KAFKA_PRODUCER_BATCH_LINGER_MS = 25  # 25 ms before flush
KAFKA_PRODUCER_BUFFER_BYTES = 4 * 1024 * 1024  # 4MB before blocking

# Zookeeper Settings
ZOOKEEPER_ASSIGN_PATH = '/scrapy-cluster/crawler/'
ZOOKEEPER_ID = 'all'
ZOOKEEPER_HOSTS = '168.*.*.119:2181'
```
```
# nohup python kafka_monitor.py run >> /root/scrapy-cluster/kafka-monitor/kafka_monitor.log 2>&1 &
# nohup python redis_monitor.py >> /root/scrapy-cluster/redis-monitor/redis_monitor.log 2>&1 &
```
# pip install scrapyd
```
# sudo mkdir /etc/scrapyd
# sudo vi /etc/scrapyd/scrapyd.conf
```
```
[scrapyd]
eggs_dir          = eggs
logs_dir          = logs
items_dir         =
jobs_to_keep      = 5
dbs_dir           = dbs
max_proc          = 0
max_proc_per_cpu  = 10
finished_to_keep  = 100
poll_interval     = 5.0
bind_address      = 0.0.0.0
http_port         = 6800
debug             = off
runner            = scrapyd.runner
application       = scrapyd.app.application
launcher          = scrapyd.launcher.Launcher
webroot           = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
```
# nohup scrapyd >> /root/scrapy-cluster/scrapyd.log 2>&1 &
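Once scrapyd is running, its JSON API can be polled to confirm the daemon is healthy. A sketch using the third-party `requests` package (assumed to be installed), hitting the `daemonstatus.json` service configured above:

```python
# Ask scrapyd for its status; a healthy daemon returns {"status": "ok", ...}.
import requests

resp = requests.get('http://168.*.*.118:6800/daemonstatus.json')
print(resp.json())
```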
It is recommended to put scrapyd behind an Nginx reverse proxy.
File "/usr/local/lib/python3.6/site-packages/scrapyd-1.2.0-py3.6.egg/scrapyd/app.py", line 2, in <module> from twisted.application.internet import TimerService, TCPServer File "/usr/local/lib64/python3.6/site-packages/twisted/application/internet.py", line 54, in <module> from automat import MethodicalMachine File "/usr/local/lib/python3.6/site-packages/automat/__init__.py", line 2, in <module> from ._methodical import MethodicalMachine File "/usr/local/lib/python3.6/site-packages/automat/_methodical.py", line 210, in <module> class MethodicalInput(object): File "/usr/local/lib/python3.6/site-packages/automat/_methodical.py", line 220, in MethodicalInput @argSpec.default builtins.TypeError: '_Nothing' object is not callable Failed to load application: '_Nothing' object is not callable
downgrading Automat resolves it: `pip install Automat==0.6.0`
pip install SpiderKeeper
```
mkdir /root/spiderkeeper/
nohup spiderkeeper --server=http://168.*.*.118:6800 --username=admin --password=admin --database-url=sqlite:////root/spiderkeeper/SpiderKeeper.db >> /root/scrapy-cluster/spiderkeeper.log 2>&1 &
```
http://168.*.*.118:5000
vim /root/scrapy-cluster/crawler/scrapy.cfg
```
[settings]
default = crawling.settings

[deploy]
url = http://168.*.*.118:6800/
project = crawling
```
cd /root/scrapy-cluster/crawler/crawling/spiders
```
# cd /root/scrapy-cluster/crawler
# scrapyd-deploy
Packing version 1536225989
Deploying to project "crawling" in http://168.*.*.118:6800/addversion.json
Server response (200): {"status": "ok", "project": "crawling", "version": "1536225989", "spiders": 3, "node_name": "ambari"}
```
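To double-check the deployment, the `listspiders.json` endpoint from the scrapyd config above can be queried (again using `requests`, with the project name from scrapy.cfg):

```python
# List the spiders scrapyd now knows about for the "crawling" project.
import requests

resp = requests.get('http://168.*.*.118:6800/listspiders.json',
                    params={'project': 'crawling'})
print(resp.json())   # e.g. {"status": "ok", "spiders": [...], "node_name": "..."}
```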
In SpiderKeeper, create a project using the project name configured in scrapy.cfg. Once it is created, all spiders show up under Spiders -> Dashboard.
Scrapy Cluster needs to coordinate across the different crawl servers to maximize content throughput while controlling how fast the cluster hits each website.

Scrapy Cluster provides two main strategies for throttling how hard the crawlers hit a given domain: by spider type and by crawler IP address. Both act on per-domain queues.

In a distributed Scrapy-cluster crawl, URLs are handed out based on IP address: with the cluster started on several machines, the spiders on each server pull links off the shared domain queues.
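How the throttle behaves is governed by a handful of options in crawling/settings.py. The names below follow the scrapy-cluster defaults; the values are only illustrative:

```python
# crawling/settings.py -- domain throttling (illustrative values)
QUEUE_HITS = 10                 # allowed hits per domain per window
QUEUE_WINDOW = 60               # length of the rolling window, in seconds
QUEUE_MODERATED = True          # spread the hits evenly across the window
SCHEDULER_TYPE_ENABLED = True   # keep a separate throttle per spider type
SCHEDULER_IP_ENABLED = True     # keep a separate throttle per crawler public IP
```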
To add a new server, follow the single-node scrapy-cluster setup above, reusing the first server's configuration for kafka-monitor/settings.py, redis-monitor/settings.py and crawling/settings.py.
Because both servers sit on the same internal network, the spiders obtain the same current public IP when they run, so the scrapy-cluster scheduler cannot distribute links by IP:
```
2018-09-07 16:08:29,684 [sc-crawler] DEBUG: Current public ip: b'110.*.*.1'
```
See /root/scrapy-cluster/crawler/crawling/distributed_scheduler.py, around line 282:
```python
try:
    obj = urllib.request.urlopen(settings.get('PUBLIC_IP_URL',
                                              'http://ip.42.pl/raw'))
    results = self.ip_regex.findall(obj.read())
    if len(results) > 0:
        # results[0] is the public IP address, e.g. 110.90.122.1
        self.my_ip = results[0]
    else:
        raise IOError("Could not get valid IP Address")
    obj.close()
    self.logger.debug("Current public ip: {ip}".format(ip=self.my_ip))
except IOError:
    self.logger.error("Could not reach out to get public ip")
    pass
```
It is recommended to change this code to use the machine's local IP instead:
```python
self.my_ip = [(s.connect(('8.8.8.8', 53)), s.getsockname()[0], s.close())
              for s in [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)]][0][1]
```
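That one-liner is dense; the same idea written out as a small helper (a sketch only, the `get_local_ip` name is not part of scrapy-cluster) may be easier to maintain:

```python
import socket

def get_local_ip():
    """Return the IP of the interface used for outbound traffic.

    Connecting a UDP socket does not send any packets; it only lets
    the OS pick the local address that would be used to reach 8.8.8.8.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(('8.8.8.8', 53))
        return s.getsockname()[0]
    finally:
        s.close()

# In distributed_scheduler.py, the public-IP lookup would then become:
# self.my_ip = get_local_ip()
```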
Run the same spider on both scrapy-cluster nodes:
```python
from scrapy.cmdline import execute

execute(['scrapy', 'runspider', 'crawling/spiders/link_spider.py'])
```
Use `python kafka_monitor.py feed` to submit several links; with logging at DEBUG you can then watch how the links are distributed across the nodes.
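Each feed call takes a small JSON crawl request. A sketch of building one in Python (field names follow the scrapy-cluster examples; the URL, appid and crawlid values are placeholders):

```python
import json

# A minimal crawl request to pass to "python kafka_monitor.py feed".
request = {
    "url": "http://example.com/",   # page to crawl (placeholder)
    "appid": "testapp",             # identifies the submitting application
    "crawlid": "abc123",            # groups all pages from this crawl
}
print(json.dumps(request))
# python kafka_monitor.py feed '{"url": "http://example.com/", "appid": "testapp", "crawlid": "abc123"}'
```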
Install and configure scrapyd on the second scrapy-cluster server, following the scrapyd setup above, and adjust the deploy configuration:
```
[settings]
default = crawling.settings

[deploy]
url = http://168.*.*.119:6800/
project = crawling
```
After starting scrapyd, use scrapyd-deploy to deploy the crawler project on both scrapy-cluster nodes, then restart SpiderKeeper so that it points at both scrapyd servers:
nohup spiderkeeper --server=http://168.*.*.118:6800 --server=http://168.*.*.119:6800 --username=admin --password=admin --database-url=sqlite:////root/spiderkeeper/SpiderKeeper.db >> /root/scrapy-cluster/spiderkeeper.log 2>&1 &
Note: for SpiderKeeper to manage the nodes as one cluster, the crawler project name must be the same on every node, and each scrapy-cluster node must be configured with the same spider tasks.
http://168.*.*.118:5000
When launching a spider you will now see both scrapy-cluster nodes' configurations; start the spider with the same name on both to kick off the distributed crawl.