Use the Scrapy framework to crawl Tencent's recruitment postings. The crawl URL is: https://hr.tencent.com/position.php
Fields to scrape: position name, number of openings, work location, publish time, plus the detailed job duties and requirements on each posting's detail page.
The final results are saved to two files: one holds the list-page fields (position, category, headcount, location, publish time), the other holds the detail-page content.
Comparing the raw page source with the DOM shown in the F12 developer tools shows that the page is a static (server-rendered) page.
So XPath can be used to parse the page source and pull the relevant content out of the tr tags; see the code below for the details.
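Before writing the spider, the row XPath can be checked on its own. Below is a minimal sketch using scrapy.Selector against a hand-written HTML fragment; the fragment is only an approximation of the real listing table, and only the even/odd row classes and column order come from the page structure used later in the spider:

# Quick check of the row XPath against a hand-made HTML fragment
# (the fragment is an assumption that mimics the real listing table).
from scrapy import Selector

html = """
<table>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td>
      <td>TEG</td><td>2</td><td>Shenzhen</td><td>2018-10-01</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Data Analyst</a></td>
      <td>SNG</td><td>1</td><td>Beijing</td><td>2018-10-02</td></tr>
</table>
"""

rows = Selector(text=html).xpath('//tr[@class="even"] | //tr[@class="odd"]')
for row in rows:
    # Print the position name and the work location of each row
    print(row.xpath('./td[1]/a/text()').extract_first(),
          row.xpath('./td[4]/text()').extract_first())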
After generating the project with scrapy startproject <project_name> (here the project is named tencent), open items.py and first define the fields to scrape.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Position name
    position_name = scrapy.Field()
    # Position category
    position_type = scrapy.Field()
    # Number of openings
    wanted_number = scrapy.Field()
    # Work location
    work_location = scrapy.Field()
    # Publish time
    publish_time = scrapy.Field()
    # Link to the detail page
    position_link = scrapy.Field()


class DetailsItem(scrapy.Item):
    """Data extracted from the detail page, saved to a separate file."""
    # Job duties
    work_duties = scrapy.Field()
    # Job requirements / skills
    work_skills = scrapy.Field()
After generating the spider with scrapy genspider <name> <start_url>, open the spider file under the spiders folder and write the crawling logic. The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
# Import the item classes that define the fields to scrape
from tencent.items import TencentItem, DetailsItem


class TencentWantedSpider(scrapy.Spider):
    name = 'tencent_wanted'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']
    base_url = 'https://hr.tencent.com/'

    def parse(self, response):
        # Locate the table rows that hold the job postings
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        # Relative link of the "next page" button
        next_page = response.xpath('//a[@id="next"]/@href').extract_first()

        # Iterate over the rows, fill the list item, then follow the detail page
        for node in node_list:
            item = TencentItem()
            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()
            item['position_type'] = node.xpath('./td[2]/text()').extract_first()
            item['wanted_number'] = node.xpath('./td[3]/text()').extract_first()
            item['work_location'] = node.xpath('./td[4]/text()').extract_first()
            item['publish_time'] = node.xpath('./td[5]/text()').extract_first()
            yield item
            if item['position_link']:
                yield scrapy.Request(url=self.base_url + item['position_link'],
                                     callback=self.details)

        # Follow the next page, if there is one
        if next_page:
            yield scrapy.Request(url=self.base_url + next_page, callback=self.parse)

    def details(self, response):
        """Extract and parse the fields on the detail page."""
        item = DetailsItem()
        # The two <ul class="squareli"> lists hold the duties and the requirements
        item['work_duties'] = ''.join(
            response.xpath('//ul[@class="squareli"]')[0].xpath('./li/text()').extract())
        item['work_skills'] = ''.join(
            response.xpath('//ul[@class="squareli"]')[1].xpath('./li/text()').extract())
        yield item
To save the scraped data, first register the pipeline under the ITEM_PIPELINES setting in settings.py, for example:
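A minimal sketch of that registration, assuming the project is named tencent and the pipeline class is the TencentPipeline shown below (300 is just a conventional priority value; any integer from 0 to 1000 works, lower runs first):

# settings.py -- enable the pipeline so Scrapy passes items through it.
# The key is the import path of the pipeline class; the number controls
# the order when several pipelines are enabled.
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}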
The pipeline code is as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

from tencent.items import TencentItem, DetailsItem


class TencentPipeline(object):

    def open_spider(self, spider):
        """Called when the spider starts: open the two output files."""
        self.file = open('tenc_wanted_2.json', 'w', encoding='utf-8')
        self.file_detail = open('tenc_wanted_detail.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False)
        # Check which item class the data comes from and write to the matching file
        if isinstance(item, TencentItem):
            self.file.write(content + '\n')
        if isinstance(item, DetailsItem):
            self.file_detail.write(content + '\n')
        return item

    def close_spider(self, spider):
        """Called when the spider finishes: close the files."""
        self.file.close()
        self.file_detail.close()
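After starting the crawl with scrapy crawl tencent_wanted, each output file contains one JSON object per line. A small sketch for loading the results back for inspection (the file names match the ones opened in the pipeline above):

# Read the JSON-lines output written by the pipeline (one JSON object per line).
import json

with open('tenc_wanted_2.json', encoding='utf-8') as f:
    positions = [json.loads(line) for line in f if line.strip()]

with open('tenc_wanted_detail.json', encoding='utf-8') as f:
    details = [json.loads(line) for line in f if line.strip()]

print(len(positions), 'positions,', len(details), 'detail records')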