Although the scrapy framework is asynchronous and highly concurrent, it can only run on a single host, so crawl throughput is limited. The scrapy-redis library is built on top of scrapy and provides a distributed queue, scheduler, deduplication and so on, and an existing single-machine scrapy crawler needs only small changes to use it. With it, several hosts can be combined to work on one crawl task together, raising the crawl rate further. Together with Scrapyd and Gerapy, it also makes distributed deployment and operation of crawlers straightforward.
We will use scrapy-redis to crawl the job listings at https://hr.tencent.com/position.php?&start= . The fields to scrape are: position name, detail link, position category, number of openings, work location, publish time, and the detailed job requirements.
pip install scrapy
pip install scrapy-redis
Python version 3.7, scrapy version 1.6.0, scrapy-redis version 0.6.8.
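A quick way to confirm the environment matches is to print the installed versions from Python. This is just a sanity check, and it assumes these packages expose a __version__ attribute:

import scrapy
import scrapy_redis
import redis

print(scrapy.__version__)        # expected: 1.6.0
print(scrapy_redis.__version__)  # expected: 0.6.8
print(redis.__version__)         # redis-py client, pulled in as a dependency of scrapy-redis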
# Create the project
scrapy startproject TencentSpider
# Create the spider
cd TencentSpider
scrapy genspider -t crawl tencent tencent.com
The spider name is tencent, the allowed domain is tencent.com, and the spider template type is crawl.
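After these two commands, the generated project should look roughly like this (the standard scrapy 1.6 template layout):

TencentSpider/
    scrapy.cfg
    TencentSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            tencent.py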
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class TencentspiderItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
    # Position name
    positionname = scrapy.Field()
    # Detail link
    positionlink = scrapy.Field()
    # Position category
    positionType = scrapy.Field()
    # Number of openings
    peopleNum = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publish time
    publishTime = scrapy.Field()
    # Position details
    positiondetail = scrapy.Field()
These are the item fields to be scraped.

spiders/tencent.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisCrawlSpider
# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# Import the link extractor class used to pull links that match the rules
from scrapy.linkextractors import LinkExtractor
from TencentSpider.items import TencentspiderItem


class TencentSpider(RedisCrawlSpider):  # a plain scrapy crawler would inherit from CrawlSpider
    name = 'tencent'
    # allowed_domains = ['tencent.com']
    allowed_domains = ['hr.tencent.com']
    # A plain scrapy crawler would define start_urls here and would have no redis_key variable
    # start_urls = ['https://hr.tencent.com/position.php?&start=0#a']
    redis_key = 'tencent:start_urls'

    # Extraction rule for links inside the response; returns a list of matching link objects
    pagelink = LinkExtractor(allow=(r"start=\d+"))
    rules = (
        # Follow the links in this list, send a request for each one, keep following,
        # and handle the responses with the specified callback
        Rule(pagelink, callback='parse_item', follow=True),
    )

    # The rules attribute of CrawlSpider extracts URLs directly from the response text and
    # automatically creates new requests. Unlike Spider, CrawlSpider has already overridden parse.
    # When `scrapy crawl spidername` starts, requests are built from start_urls and sent, then
    # parse is called on each response; during parsing, the rules extract matching links from the
    # HTML (or XML) text, new requests are generated from those links, and the loop continues
    # until no more matching links are returned or the scheduler runs out of Request objects.
    # If the initial URLs need different handling, you can override parse_start_url(self, response)
    # of CrawlSpider to parse the response of the first URL, but this is optional.

    def parse_item(self, response):
        # print(response.request.headers)
        items = []
        url1 = "https://hr.tencent.com/"
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # Initialise the item object
            item = TencentspiderItem()
            # Position name
            try:
                item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0].strip()
            except BaseException:
                item['positionname'] = ""
            # Detail link
            try:
                item['positionlink'] = "{0}{1}".format(url1, each.xpath("./td[1]/a/@href").extract()[0].strip())
            except BaseException:
                item['positionlink'] = ""
            # Position category
            try:
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0].strip()
            except BaseException:
                item['positionType'] = ""
            # Number of openings
            try:
                item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0].strip()
            except BaseException:
                item['peopleNum'] = ""
            # Work location
            try:
                item['workLocation'] = each.xpath("./td[4]/text()").extract()[0].strip()
            except BaseException:
                item['workLocation'] = ""
            # Publish time
            try:
                item['publishTime'] = each.xpath("./td[5]/text()").extract()[0].strip()
            except BaseException:
                item['publishTime'] = ""
            items.append(item)
            # yield item
        for item in items:
            yield scrapy.Request(url=item['positionlink'], meta={'meta_1': item}, callback=self.second_parseTencent)

    def second_parseTencent(self, response):
        item = TencentspiderItem()
        meta_1 = response.meta['meta_1']
        item['positionname'] = meta_1['positionname']
        item['positionlink'] = meta_1['positionlink']
        item['positionType'] = meta_1['positionType']
        item['peopleNum'] = meta_1['peopleNum']
        item['workLocation'] = meta_1['workLocation']
        item['publishTime'] = meta_1['publishTime']

        tmp = []
        tmp.append(response.xpath("//tr[@class='c']")[0])
        tmp.append(response.xpath("//tr[@class='c']")[1])
        positiondetail = ''
        for i in tmp:
            positiondetail_title = i.xpath("./td[1]/div[@class='lightblue']/text()").extract()[0].strip()
            positiondetail = positiondetail + positiondetail_title
            positiondetail_detail = i.xpath("./td[1]/ul[@class='squareli']/li/text()").extract()
            positiondetail = positiondetail + ' '.join(positiondetail_detail) + ' '
        # positiondetail_title = response.xpath("//div[@class='lightblue']").extract()
        # positiondetail_detail = response.xpath("//ul[@class='squareli']").extract()
        # positiondetail = positiondetail_title[0] + '\n' + positiondetail_detail[0] + '\n' + positiondetail_title[1] + '\n' + positiondetail_detail[1]
        item['positiondetail'] = positiondetail.strip()
        yield item
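As a side note, the allow pattern of the LinkExtractor only needs to match somewhere inside a URL. The small standalone snippet below (not part of the project) shows which kinds of URLs the start=\d+ rule will follow:

import re

pattern = re.compile(r"start=\d+")
urls = [
    "https://hr.tencent.com/position.php?&start=10#a",      # pagination link -> followed by the rule
    "https://hr.tencent.com/position_detail.php?id=49345",  # detail page -> not followed by the rule
]
for url in urls:
    print(url, "matches" if pattern.search(url) else "no match")

The detail pages are instead requested explicitly in parse_item with scrapy.Request.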
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
class TencentspiderPipeline(object):
"""
功能:保存item數據 """ def __init__(self): self.filename = open("tencent.json", "w", encoding='utf-8') def process_item(self, item, spider): try: text = json.dumps(dict(item), ensure_ascii=False) + "\n" self.filename.write(text) except BaseException as e: print(e) return item def close_spider(self, spider): self.filename.close()
This pipeline saves each item as one line of JSON in tencent.json.
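Because every line of the output file is a single JSON object, the results can be read back trivially. A minimal sketch, assuming the crawl has already produced tencent.json in the project root:

import json

with open("tencent.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record["positionname"], record["workLocation"])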
middlewares.py
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import scrapy
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random
class TencentspiderSpiderMiddleware(object):
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class TencentspiderDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyUserAgentMiddleware(UserAgentMiddleware):
    """Set a random User-Agent on every request."""

    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user_agent=crawler.settings.get('MY_USER_AGENT')
        )

    def process_request(self, request, spider):
        agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = agent
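The only custom class here is MyUserAgentMiddleware; the other two are left as generated by the template. A small standalone sketch of what its process_request does, using a shortened, purely illustrative UA pool:

import random
from scrapy.http import Request

ua_pool = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",    # truncated, illustrative only
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8) ...",
]
request = Request("https://hr.tencent.com/position.php?&start=0#a")
request.headers["User-Agent"] = random.choice(ua_pool)  # same assignment the middleware performs
print(request.headers["User-Agent"])                    # prints the stored header value (as bytes)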
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for TencentSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'TencentSpider'
SPIDER_MODULES = ['TencentSpider.spiders']
NEWSPIDER_MODULE = 'TencentSpider.spiders'
# A plain scrapy project does not contain the redis-related settings below.
# Use the scrapy_redis dedupe component so requests are deduplicated in the redis database (required)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy_redis scheduler so requests are dispatched through redis (required)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the scrapy-redis queues in redis so a crawl can be paused and resumed,
# i.e. do not clean up the redis queues (optional)
SCHEDULER_PERSIST = True
# Connection parameters for the redis database (required)
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
DUPEFILTER_DEBUG = True

# scrapy-redis stores its data in redis as key-value pairs; the common keys are:
# 1. "<spider name>:items"      -> list, the scraped items as JSON strings
# 2. "<spider name>:dupefilter" -> set, used to deduplicate visited URLs; holds 40-character URL hash strings
# 3. "<spider name>:start_urls" -> list, the first URL(s) the spider crawls when it starts
# 4. "<spider name>:requests"   -> zset, used by the scheduler; holds serialized request objects

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TencentSpider (+http://www.yourdomain.com)'

# Pool of User-Agent strings to pick from at random
MY_USER_AGENT = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21",
    "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
    "Mozilla/5.0 (Windows NT 6.2; rv:30.0) Gecko/20150101 Firefox/32.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/4.0 (compatib1e; MSIE 6.1; Windows NT)",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30618)",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; Media Center PC 6.0)",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20100101 Firefox/17.0",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36",
    "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.10 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.1.21 (KHTML, like Gecko) Version/9.2 Safari/602.1.21",
    "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
]

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip,deflate,br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
    'upgrade-insecure-requests': '1',
    'host': 'hr.tencent.com'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'TencentSpider.middlewares.TencentspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'TencentSpider.middlewares.TencentspiderDownloaderMiddleware': None,
    'TencentSpider.middlewares.MyUserAgentMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'TencentSpider.pipelines.TencentspiderPipeline': 300,
    # RedisPipeline writes each item into a redis list whose key is "<spider.name>:items",
    # so the items can be processed later in a distributed way. This is already implemented
    # by scrapy-redis; no extra code is needed, just enable it here.
    'scrapy_redis.pipelines.RedisPipeline': 100
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

LOG_LEVEL = 'DEBUG'
redis
Here we set up a single-node Windows build of redis; for a Linux build, look up the installation steps yourself. Download address: https://github.com/rgl/redis/downloads. Pick the latest release that matches your machine; here I chose redis-2.4.6-setup-64-bit.exe. Double-click to install, then add C:\Program Files\Redis to the system environment variables. The configuration file is C:\Program Files\Redis\conf\redis.conf

Command to run the redis server: redis-server
Command to run the redis client: redis-cli

Start the crawler
cd TencentSpider
scrapy crawl tencent
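This same command can be run on several machines at once: every node that points at the same redis shares one scheduler queue and one dupefilter set, so no page is fetched twice. A sketch of the only per-node setting that normally differs (the IP address below is hypothetical):

# settings.py on each crawler node: point at the one machine that actually runs redis-server
REDIS_HOST = '192.168.1.100'   # hypothetical address of the shared redis host
REDIS_PORT = 6379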
TencentSpider is the project folder and tencent is the spider name. After starting, the spider waits until start_urls is pushed into redis, which can be done from the redis command line client:
# redis-cli
redis 127.0.0.1:6379> lpush tencent:start_urls https://hr.tencent.com/position.php?&start=0#a
(integer) 1
redis 127.0.0.1:6379>
Or run the following script:
# -*- coding: utf-8 -*-
# Push the start URL into the redis list that the spider reads (its redis_key)
import redis
if __name__ == '__main__':
    conn = redis.Redis(host='127.0.0.1', port=6379)
    # In settings, REDIS_START_URLS_AS_SET = False by default: if False, start_urls is a
    # redis list; if True, it is a set.
    # list
    conn.lpush('tencent:start_urls', 'https://hr.tencent.com/position.php?&start=0#a')
    # set
    # conn.sadd('tencent:start_urls', 'https://hr.tencent.com/position.php?&start=0#a')
    # conn.close()  # no need to close the connection
tencent:start_urls is the value of the redis_key variable in spiders/tencent.py. Press ctrl + c to stop the crawl. The results are saved both under the key tencent:items in the redis database and in the tencent.json file in the project root; the content looks like this:
(Note: this crawler comes with no guarantee of freshness; if the source site changes, it will stop working.)

{"positionname": "29302-服務採購商務崗", "positionlink": "https://hr.tencent.com/position_detail.php?id=49345&keywords=&tid=0&lid=0", "positionType": "職能類", "peopleNum": "1", "workLocation": "深圳", "publishTime": "2019-04-12", "positiondetail": "工做職責:• 負責相關產品和品類採購策略的制訂及實施; • 負責相關產品及品類的採購運做管理,包括但不限於需求理解、供應商開發及選擇、供應資源有效管理、商務談判、成本控制、交付管理、組織驗收等 • 支持業務部門的採購需求; • 收集、分析市場及行業相關信息,爲採購決策提供依據。 工做要求:• 認同騰訊企業文化理念,正直、進取、盡責; • 本科或以上學歷,管理、傳媒、經濟或其餘相關專業,市場營銷及內容類產品運營工做背景者優先; • 五年以上工做經驗,對採購理念和採購過程管理具備清晰的認知和深入的理解;擁有二年以上營銷/設計採購、招標相關類管理經驗; • 熟悉採購運做及管理,具備獨立管理重大采購項目的經驗,具備較深厚的採購專業知識; • 具有良好的組織協調和溝通能力、學習能力和團隊合做精神強,具備敬業精神,具有較強的分析問題和解決問題的能力; • 瞭解IP及新文創行業現狀及發展,熟悉市場營銷相關行業知識和行業運做特色; • 具備良好的英語據說讀寫能力,英語可做爲工做語言;同時有日語據說讀寫能力的優先; • 具有良好的文檔撰寫能力。計算機操做能力強,熟練使用MS OFFICE辦公軟件和 ERP 等軟件的熟練使用。"}
{"positionname": "CSIG16-自動駕駛高精地圖(地圖編譯)", "positionlink": "https://hr.tencent.com/position_detail.php?id=49346&keywords=&tid=0&lid=0", "positionType": "技術類", "peopleNum": "1", "workLocation": "北京", "publishTime": "2019-04-12", "positiondetail": "工做職責:地圖數據編譯工具軟件開發 工做要求: 碩士以上學歷,2年以上工做經驗,計算機、測繪、GIS、數學等相關專業; 精通C++編程,編程基礎紮實; 熟悉常見數據結構,有較複雜算法設計經驗; 精通數據庫編程,如MySQL、sqlite等; 有實際的地圖項目經驗,如地圖tile、大地座標系、OSM等; 至少熟悉一種地圖數據規格,如MIF、NDS、OpenDrive等; 有較好的數學基礎,熟悉幾何和圖形學基本算法,; 具有較好的溝通表達能力和團隊合做意識。"}
{"positionname": "32032-資深特效美術設計師(上海)", "positionlink": "https://hr.tencent.com/position_detail.php?id=49353&keywords=&tid=0&lid=0", "positionType": "設計類", "peopleNum": "1", "workLocation": "上海", "publishTime": "2019-04-12", "positiondetail": "工做職責:負責遊戲3D和2D特效製做,製做規範和技術標準的制定; 與項目組開發人員深刻溝通,準確實現項目開發需求。 工做要求:5年以上端遊、手遊特效製做經驗,熟悉UE4引擎; 能熟練使用相關軟件和引擎工具製做高品質的3D特效; 善於使用第三方軟件製做高品質序列資源,用於引擎特效; 能夠總結本身的方法論和經驗用於新人和帶領團隊; 對遊戲開發和技術有熱情和追求,有責任心,善於團隊合做,溝通能力良好,應聘簡歷須附帶做品。"}
......
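Since RedisPipeline also leaves every item in the tencent:items list, the results can be collected from redis on any machine. A minimal sketch that drains that list into a local JSON-lines file (the output filename here is made up for the example):

import json
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
with open('tencent_from_redis.json', 'w', encoding='utf-8') as f:
    while True:
        data = conn.lpop('tencent:items')   # pop one serialized item at a time
        if data is None:
            break                           # list drained
        item = json.loads(data)
        f.write(json.dumps(item, ensure_ascii=False) + '\n')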