To start, here are the official definitions of the software used in this article:
Scrapy
An open source and collaborative framework for extracting the data you
need from websites. In a fast, simple, yet extensible way.
Scrapyd
Scrapy comes with a built-in service, called "Scrapyd", which allows
you to deploy (aka. upload) your projects and control their spiders
using a JSON web service.
ScrapydWeb
A full-featured web UI for Scrapyd cluster management,
with Scrapy log analysis & visualization supported.
Docker
Docker Container: A container is a standard unit of software that packages up code and
all its dependencies so the application runs quickly and reliably from
one computing environment to another. A Docker container image is a
lightweight, standalone, executable package of software that includes
everything needed to run an application: code, runtime, system tools,
system libraries and settings.
The system as a whole does not depend on Docker to run. What Docker gives us is a standardized runtime environment, which lowers operations cost and provides a fast, uniform way to roll out distributed deployments later on. Likewise, scrapyd + scrapydweb merely provide a web UI for observing and testing the spiders.
scrapy, scrapyd and scrapydweb could also be split into three separate images, but to keep the explanation simple a single Docker image is used here.
A Scrapy project can be deployed to scrapyd either with the command-line tool scrapyd-deploy or from the deploy console in the scrapydweb admin UI; in both cases the scrapyd service must already be listening (port 6800 by default).
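As a minimal sketch of the command-line route (the deploy target name "docker" is arbitrary, and knowsmore is one of the example projects from the directory layout below; adjust both to your own setup), a [deploy] section in the project's scrapy.cfg plus one scrapyd-deploy call is enough:

# scrapy.cfg in the project root, with a deploy target pointing at the running scrapyd:
#   [deploy:docker]
#   url = http://localhost:6800/
#   project = knowsmore

# list the configured targets, then egg-ify and upload the project to scrapyd
scrapyd-deploy -l
scrapyd-deploy docker -p knowsmore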
The scrapyd service can run on the internal network only: scrapydweb reaches the servers configured in SCRAPYD_SERVERS over internal addresses, and only scrapydweb's own listening port (5000 by default) needs to be exposed to the outside.
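Either way, a quick way to confirm that scrapyd is actually up before deploying anything is its daemonstatus endpoint (shown here against the default local address; substitute the internal address of the scrapyd host if the services are split across containers):

# scrapyd's status endpoint; a healthy instance answers with {"status": "ok", ...}
curl http://127.0.0.1:6800/daemonstatus.json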
The Dockerfile below is adapted from aciobanu/scrapy.
FROM alpine:latest

RUN echo "https://mirror.tuna.tsinghua.edu.cn/alpine/latest-stable/main/" > /etc/apk/repositories

#RUN apk update && apk upgrade

RUN apk -U add \
    gcc \
    bash \
    bash-doc \
    bash-completion \
    libffi-dev \
    libxml2-dev \
    libxslt-dev \
    libevent-dev \
    musl-dev \
    openssl-dev \
    python-dev \
    py-imaging \
    py-pip \
    redis \
    curl ca-certificates \
    && update-ca-certificates \
    && rm -rf /var/cache/apk/*

RUN pip install --upgrade pip \
    && pip install Scrapy

RUN pip install scrapyd \
    && pip install scrapyd-client \
    && pip install scrapydweb

RUN pip install fake_useragent \
    && pip install scrapy_proxies \
    && pip install sqlalchemy \
    && pip install mongoengine \
    && pip install redis

WORKDIR /runtime/app

EXPOSE 5000

COPY launch.sh /runtime/launch.sh
RUN chmod +x /runtime/launch.sh

# Once testing is fine, uncomment the line below
# ENTRYPOINT ["/runtime/launch.sh"]
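Before enabling the ENTRYPOINT it may be worth building and smoke-testing the image once; a minimal check (using the same image tag as the rest of this article) could be:

# build the image from the directory containing the Dockerfile
docker build -t scrapy-d-web:v1 .

# smoke test: the scrapy, scrapyd and scrapydweb executables should all be available
docker run --rm scrapy-d-web:v1 scrapy version
docker run --rm scrapy-d-web:v1 which scrapyd scrapydweb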
If scrapy + scrapyd + scrapydweb are split into three separate images, simply split up the service-launching part below accordingly and let the containers talk to each other via the link option (or a user-defined network) when they start; a rough sketch of that variant follows after the script.
#!/bin/sh

# kill any existing scrapyd process if any
kill -9 $(pidof scrapyd)

# enter directory where configure file lies and launch scrapyd
cd /runtime/app/scrapyd && nohup /usr/bin/scrapyd > ./scrapyd.log 2>&1 &

cd /runtime/app/scrapydweb && /usr/bin/scrapydweb
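For the split-image variant mentioned above, this script disappears and each container runs a single foreground service; a rough sketch (the image names scrapyd-only and scrapydweb-only are hypothetical, and each image's ENTRYPOINT is assumed to start its one service) could look like:

# hypothetical split deployment on the user-defined network created in the next section
docker run -d --net mynetwork --ip 192.168.1.100 --name scrapyd-svc \
    -v /usr/local/src/scrapy-d-web/scrapyd:/runtime/app/scrapyd scrapyd-only:v1
docker run -d --net mynetwork --name scrapydweb-ui -p 5000:5000 \
    -v /usr/local/src/scrapy-d-web/scrapydweb:/runtime/app/scrapydweb scrapydweb-only:v1
# scrapydweb then reaches scrapyd through '192.168.1.100:6800' in SCRAPYD_SERVERS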
The directory layout of /runtime/app is:
Root directory (/usr/local/src/scrapy-d-web [actual directory on the host] : /runtime/app [directory inside the container])
Dockerfile - after editing, run [docker build -t scrapy-d-web:v1 .] to build the image. I initially built on an Aliyun instance with 1 CPU and 1 GB of RAM and lxml kept failing to compile; after upgrading to 2 GB of RAM the build went through fine.
scrapyd - holds the scrapyd configuration file and its other directories
scrapydweb - holds the scrapydweb configuration file
knowsmore - project directory 1, created with scrapy startproject
pxn - project directory 2, created with scrapy startproject
Now let's start each service by hand so we can walk through it step by step. First, start the container and enter a shell:
# Create a user-defined network (this step can be skipped if the containers are not split,
# since everything listens on localhost; once split, fixed IP addresses make the
# scrapyd + scrapydweb configuration below easier)
docker network create --subnet=192.168.0.0/16 mynetwork

# Set the network address and container name; create the directory and port mappings
docker run -it --rm --net mynetwork --ip 192.168.1.100 --name scrapyd -p 5000:5000 \
    -v /usr/local/src/scrapy-d-web/:/runtime/app scrapy-d-web:v1 /bin/sh
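Before going further it can be worth confirming, from a second terminal on the host, that the network and the port mapping look as expected; a quick check might be:

# show the containers attached to the user-defined network and their IP addresses
docker network inspect mynetwork
# confirm that container port 5000 is published to the host
docker port scrapyd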
Go to the directory containing scrapyd.conf (/runtime/app/scrapyd); here I rely on the scrapyd.conf in the current directory. For the order in which scrapyd configuration files take effect, consult the official scrapyd documentation. Below is the official example configuration file:
[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
# no external access is needed, so this is left as-is instead of being changed to 0.0.0.0
bind_address = 127.0.0.1
# if you change the port here, remember to update the scrapydweb configuration as well
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
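The [services] section above is exactly the JSON web service that scrapyd-deploy and scrapydweb talk to, and once scrapyd has been started from this directory you can also exercise it directly with curl (the spider name some_spider is a placeholder; the project must already be deployed):

# list deployed projects and the spiders of one project
curl http://127.0.0.1:6800/listprojects.json
curl "http://127.0.0.1:6800/listspiders.json?project=knowsmore"

# schedule a crawl, then cancel it using the job id returned by schedule.json
curl http://127.0.0.1:6800/schedule.json -d project=knowsmore -d spider=some_spider
curl http://127.0.0.1:6800/cancel.json -d project=knowsmore -d job=JOB_ID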
Open another terminal and enter the Docker container started above, go to the directory containing the scrapydweb configuration file (/runtime/app/scrapydweb), and start scrapydweb:
docker exec -it scrapyd /bin/bash
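Inside that second shell the launch itself is just two commands; scrapydweb generates and reads its settings file (scrapydweb_settings_v*.py, the version suffix depends on the release) from the current working directory, which is why we cd first:

cd /runtime/app/scrapydweb
scrapydweb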
For the full details of the scrapydweb project, see its GitHub page; below is part of my configuration:
############################## ScrapydWeb #####################################
# Setting SCRAPYDWEB_BIND to '0.0.0.0' or IP-OF-CURRENT-HOST would make
# ScrapydWeb server visible externally, otherwise, set it to '127.0.0.1'.
# The default is '0.0.0.0'.
SCRAPYDWEB_BIND = '0.0.0.0'
# Accept connections on the specified port, the default is 5000.
SCRAPYDWEB_PORT = 5000

# The default is False, set it to True to enable basic auth for web UI.
ENABLE_AUTH = True
# In order to enable basic auth, both USERNAME and PASSWORD should be non-empty strings.
USERNAME = 'user'
PASSWORD = 'pass'

############################## Scrapy #########################################
# ScrapydWeb is able to locate projects in the SCRAPY_PROJECTS_DIR,
# so that you can simply select a project to deploy, instead of eggifying it in advance.
# e.g., 'C:/Users/username/myprojects/' or '/home/username/myprojects/'
SCRAPY_PROJECTS_DIR = '/runtime/app/'

############################## Scrapyd ########################################
# Make sure that [Scrapyd](https://github.com/scrapy/scrapyd) has been installed
# and started on all of your hosts.
# Note that for remote access, you have to manually set 'bind_address = 0.0.0.0'
# in the configuration file of Scrapyd and restart Scrapyd to make it visible externally.
# Check out 'https://scrapyd.readthedocs.io/en/latest/config.html#example-configuration-file' for more info.
# - the string format: username:password@ip:port#group
#   - The default port would be 6800 if not provided,
#   - Both basic auth and group are optional.
#   - e.g., '127.0.0.1' or 'username:password@192.168.123.123:6801#group'
# - the tuple format: (username, password, ip, port, group)
#   - When the username, password, or group is too complicated (e.g., contains ':@#'),
#   - or if ScrapydWeb fails to parse the string format passed in,
#   - it's recommended to pass in a tuple of 5 elements.
#   - e.g., ('', '', '127.0.0.1', '', '') or ('username', 'password', '192.168.123.123', '6801', 'group')
SCRAPYD_SERVERS = [
    # use localhost if scrapyd runs in the same container; an IP is used here to show
    # the case of scrapyd running in a different container or on a different host
    '192.168.1.100:6800',
    # 'username:password@localhost:6801#group',
    # ('username', 'password', 'localhost', '6801', 'group'),
]

# If the IP part of a Scrapyd server is added as '127.0.0.1' in the SCRAPYD_SERVERS above,
# ScrapydWeb would try to read Scrapy logs directly from disk, instead of making a request
# to the Scrapyd server.
# Check out this link to find out where the Scrapy logs are stored:
# https://scrapyd.readthedocs.io/en/stable/config.html#logs-dir
# e.g., 'C:/Users/username/logs/' or '/home/username/logs/'
SCRAPYD_LOGS_DIR = '/runtime/app/scrapyd/logs/'
Then visit http://[YOUR IP ADDRESS]:5000.
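Because ENABLE_AUTH is set to True above, the browser will prompt for the USERNAME/PASSWORD pair from the configuration; the same check from the command line might look like:

# basic auth credentials come from USERNAME / PASSWORD in the scrapydweb settings
curl -u user:pass http://YOUR_IP_ADDRESS:5000/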