How to Build a Distributed Crawler: The Basics

Date: 2019-11-10

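The previous post covered the basics of Celery. This post walks step by step through using Celery to build a distributed crawler. The crawl target this time is the official Celery documentation.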

First, create a directory named distributedspider, then create a file workers.py inside it with the following content:

from celery import Celery

app = Celery('crawl_task', include=['tasks'], broker='redis://223.129.0.190:6379/1', backend='redis://223.129.0.190:6379/2')
# The official docs recommend JSON as the message serialization format
app.conf.update(
    CELERY_TIMEZONE='Asia/Shanghai',
    CELERY_ENABLE_UTC=True,
    CELERY_ACCEPT_CONTENT=['json'],
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
)

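The code in workers.py mainly initializes the Celery instance. The include argument lists the modules to import when the Celery app is initialized, that is, the files containing the functions registered as remotely callable tasks. Next, write the task function: create a file tasks.py with the following content: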

import requests
from bs4 import BeautifulSoup

from workers import app


@app.task
def crawl(url):
    print('Crawling URL {}'.format(url))
    resp_text = requests.get(url).text
    soup = BeautifulSoup(resp_text, 'html.parser')
    # Return the text of the first <h1> element on the page
    return soup.find('h1').text

Its job is simple: fetch the given URL and extract the text of the h1 element.
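One thing worth noting about the task: soup.find('h1') returns None when a page has no h1 element, so .text would raise an AttributeError. The following is a minimal offline sketch of the same "first h1" extraction with a guard for that case. It deliberately swaps BeautifulSoup for the standard library's html.parser so it runs without third-party packages; FirstH1Parser and extract_first_h1 are hypothetical names, not part of the original project.

```python
from html.parser import HTMLParser


class FirstH1Parser(HTMLParser):
    """Collects the text of the first <h1> element only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside_h1 = False
        self.found = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <h1> is of interest; later ones are ignored
        if tag == 'h1' and not self.found:
            self._inside_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1' and self._inside_h1:
            self._inside_h1 = False
            self.found = True

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) is collected as well
        if self._inside_h1:
            self.parts.append(data)


def extract_first_h1(html_text):
    """Return the first h1 text, or None when the page has no closed h1."""
    parser = FirstH1Parser()
    parser.feed(html_text)
    parser.close()
    if not parser.found:
        return None
    return ''.join(parser.parts).strip()
```

In the BeautifulSoup version, the equivalent guard would be to check the result of soup.find('h1') for None before reading .text.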

Finally, create a file task_dispatcher.py with the following content:

from workers import app
url_list = [
    'http://docs.celeryproject.org/en/latest/getting-started/introduction.html',
    'http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html',
    'http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html',
    'http://docs.celeryproject.org/en/latest/getting-started/next-steps.html',
    'http://docs.celeryproject.org/en/latest/getting-started/resources.html',
    'http://docs.celeryproject.org/en/latest/userguide/application.html',
    'http://docs.celeryproject.org/en/latest/userguide/tasks.html',
    'http://docs.celeryproject.org/en/latest/userguide/canvas.html',
    'http://docs.celeryproject.org/en/latest/userguide/workers.html',
    'http://docs.celeryproject.org/en/latest/userguide/daemonizing.html',
    'http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html'
]
def manage_crawl_task(urls):
    for url in urls:
        # Send the task by name, so the dispatcher does not need to import tasks.py
        app.send_task('tasks.crawl', args=(url,))


if __name__ == '__main__':
    manage_crawl_task(url_list)

This code simply sends tasks to the worker: the task name is tasks.crawl, and the argument is the url, passed as a tuple.
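The dispatcher above sends the tasks and ignores the return values. Since a result backend is configured in workers.py, the return value of each task can also be read back. The sketch below is one possible way to do that, not the original author's code: dispatch_and_collect is a hypothetical helper and the 30-second timeout is an arbitrary assumption. It relies only on send_task returning a result object and on that object's get(timeout=...) method.

```python
def dispatch_and_collect(app, urls, timeout=30):
    """Send one tasks.crawl task per URL, then wait for every result.

    `app` is expected to be the Celery app from workers.py (assumption).
    """
    # Send everything first so the workers can process the URLs in parallel.
    # Note the trailing comma: (url,) is a one-element tuple, (url) is just a string.
    pending = [(url, app.send_task('tasks.crawl', args=(url,))) for url in urls]

    # get() blocks until the worker has finished the task or the timeout expires
    return {url: result.get(timeout=timeout) for url, result in pending}
```

A call would look like dispatch_and_collect(app, url_list) after from workers import app. By default, get() re-raises an exception if the task failed on the worker, so a production dispatcher would likely want error handling around it.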

Now, let's start a worker on node A (the host whose hostname is resolvewang).