Requirement: scrape each position's name, type, link, number of openings, work location, and publish time.
1. Workflow for creating the Scrapy project
1) Create the Tencent recruitment crawler project: scrapy startproject tencent
2) Enter the project directory: cd tencent, then generate the spider: scrapy genspider tencentPosition hr.tencent.com
3) Open the project in PyCharm
4) Based on the requirements analysis, define the fields in items.py
5) Write the spider
6) Write the pipeline file
7) Configure settings.py
8) The project opened in PyCharm:
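The screenshot is not reproduced here; for reference, the layout generated by the two commands above (plus the main.py added by hand in step 5 below) looks roughly like this:

tencent/
├── scrapy.cfg                 # project entry point Scrapy looks for
├── main.py                    # added manually later, to run the spider from PyCharm
└── tencent/
    ├── __init__.py
    ├── items.py               # field definitions
    ├── middlewares.py
    ├── pipelines.py           # JSON output pipeline
    ├── settings.py            # project configuration
    └── spiders/
        ├── __init__.py
        └── tencentPosition.py # the spider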
2. The code for each file:
1. tencentPosition.py
import scrapy
from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = 'tencentPosition'
    allowed_domains = ['hr.tencent.com']
    offset = 0  # value of the "start" query parameter; each page lists 10 positions
    url = "https://hr.tencent.com/position.php?&start="
    start_urls = [url + str(offset) + '#a']

    def parse(self, response):
        # Position rows alternate between class="even" and class="odd"
        position_lists = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for position in position_lists:
            item = TencentItem()
            item["position_name"] = position.xpath("./td[1]/a/text()").get()
            item["position_link"] = position.xpath("./td[1]/a/@href").get()
            item["position_type"] = position.xpath("./td[2]/text()").get()
            item["people_num"] = position.xpath("./td[3]/text()").get()
            item["work_address"] = position.xpath("./td[4]/text()").get()
            item["publish_time"] = position.xpath("./td[5]/text()").get()
            yield item

        # Next page: keep requesting until the offset reaches the total position count
        total_page = response.xpath('//div[@class="left"]/span/text()').get()
        if self.offset < int(total_page):
            self.offset += 10
            new_url = self.url + str(self.offset) + "#a"
            yield scrapy.Request(new_url, callback=self.parse)
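The pagination above keeps a counter on the spider and stops once offset reaches the total shown in //div[@class="left"]/span. An alternative sketch, assuming the list page carries a next-page anchor with id="next" (an assumption about the old hr.tencent.com markup, where the href becomes "javascript:;" on the last page), follows that link instead and needs no counter state:

# Alternative end-of-parse() pagination sketch.
# Assumption: the page has <a id="next" href="position.php?&start=10#a">,
# whose href turns into "javascript:;" on the last page.
next_href = response.xpath('//a[@id="next"]/@href').get()
if next_href and next_href != "javascript:;":
    # response.urljoin resolves the relative href against the current page URL
    yield scrapy.Request(response.urljoin(next_href), callback=self.parse)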
2. items.py
import scrapy


class TencentItem(scrapy.Item):
    # Define a Field for every value the spider collects
    position_name = scrapy.Field()
    position_link = scrapy.Field()
    position_type = scrapy.Field()
    people_num = scrapy.Field()
    work_address = scrapy.Field()
    publish_time = scrapy.Field()
*****Remember: these field names must stay consistent with the item keys assigned in tencentPosition.py
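The consistency matters because scrapy.Item only accepts keys that were declared as Fields; assigning anything else fails immediately rather than silently. A quick illustration (the salary key is hypothetical):

item = TencentItem()
item["position_name"] = "Backend Engineer"  # fine: declared in TencentItem
item["salary"] = "20k"  # raises KeyError: 'TencentItem does not support field: salary'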
3. pipelines.py
import json


class TencentPipeline(object):
    def __init__(self):
        print("=======start========")
        # Open the output file once, when the pipeline is instantiated
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        print("=====ing=======")
        dict_item = dict(item)  # convert the Item to a plain dict
        # ensure_ascii=False keeps the Chinese text readable in the output file
        json_text = json.dumps(dict_item, ensure_ascii=False) + "\n"
        self.file.write(json_text)
        return item

    def close_spider(self, spider):
        print("=======end===========")
        self.file.close()
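Opening the file in __init__ works, but it happens as soon as Scrapy instantiates the pipeline. A sketch of the same pipeline using Scrapy's open_spider hook instead (the class name TencentJsonPipeline is made up for illustration), so the file is only open while a crawl actually runs:

import json


class TencentJsonPipeline(object):
    # Same behavior as TencentPipeline, but the file is opened and closed
    # by Scrapy's spider lifecycle hooks instead of in __init__
    def open_spider(self, spider):
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()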
4. settings.py
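The original post showed settings.py only as a screenshot; a minimal sketch of the entries this project relies on (the first three are the startproject defaults, the pipeline registration must point at TencentPipeline, and the remaining values are assumptions):

# settings.py — minimal sketch of the configuration this project needs
BOT_NAME = 'tencent'
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

# Assumption: a browser-like User-Agent, since job sites often block Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Assumption: the tutorial crawls without honoring robots.txt
ROBOTSTXT_OBEY = False

# Register the pipeline so scraped items are written to tencent.json
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}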
5. Running the spider:
1) Create a main.py in the project root
2) main.py
# Launch the spider programmatically, equivalent to running
# "scrapy crawl tencentPosition" from the command line
from scrapy import cmdline

cmdline.execute("scrapy crawl tencentPosition".split())
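main.py should sit next to scrapy.cfg, because cmdline.execute locates the project settings by searching for that file upward from the working directory; running main.py in PyCharm is then equivalent to typing scrapy crawl tencentPosition in a terminal at the project root, which makes the spider easy to debug with breakpoints.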
3. Running result: