Requirement: scrape each position's name, type, link, number of openings, work location, and publish time.
1. Workflow for creating the Scrapy project
1) Create the Tencent recruitment crawler project: scrapy startproject tencent
2) Enter the project directory: cd tencent, then generate the spider: scrapy genspider tencentPosition hr.tencent.com
3) Open the project in PyCharm
4) Based on the requirements analysis, define the fields in items.py
5) Write the spider
6) Write the pipeline file
7) Configure settings.py
8) The project opened in PyCharm:
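The screenshot is not reproduced here; for reference, the layout generated by the two commands above (plus the main.py added by hand in step 5 below) looks roughly like this:

tencent/
├── scrapy.cfg                 # project entry point Scrapy looks for
├── main.py                    # added manually later, to run the spider from PyCharm
└── tencent/
    ├── __init__.py
    ├── items.py               # field definitions
    ├── middlewares.py
    ├── pipelines.py           # JSON output pipeline
    ├── settings.py            # project configuration
    └── spiders/
        ├── __init__.py
        └── tencentPosition.py # the spider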
2. The code for each file:
1. tencentPosition.py
import scrapy
from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = 'tencentPosition'
    allowed_domains = ['hr.tencent.com']
    offset = 0  # value of the "start" query parameter; each page lists 10 positions
    url = "https://hr.tencent.com/position.php?&start="
    start_urls = [url + str(offset) + '#a']

    def parse(self, response):
        # Position rows alternate between class="even" and class="odd"
        position_lists = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for position in position_lists:
            item = TencentItem()
            item["position_name"] = position.xpath("./td[1]/a/text()").get()
            item["position_link"] = position.xpath("./td[1]/a/@href").get()
            item["position_type"] = position.xpath("./td[2]/text()").get()
            item["people_num"] = position.xpath("./td[3]/text()").get()
            item["work_address"] = position.xpath("./td[4]/text()").get()
            item["publish_time"] = position.xpath("./td[5]/text()").get()
            yield item

        # Next page: keep requesting until the offset reaches the total position count
        total_page = response.xpath('//div[@class="left"]/span/text()').get()
        if self.offset < int(total_page):
            self.offset += 10
            new_url = self.url + str(self.offset) + "#a"
            yield scrapy.Request(new_url, callback=self.parse)
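The pagination above keeps a counter on the spider and stops once offset reaches the total shown in //div[@class="left"]/span. An alternative sketch, assuming the list page carries a next-page anchor with id="next" (an assumption about the old hr.tencent.com markup, where the href becomes "javascript:;" on the last page), follows that link instead and needs no counter state:

# Alternative end-of-parse() pagination sketch.
# Assumption: the page has <a id="next" href="position.php?&start=10#a">,
# whose href turns into "javascript:;" on the last page.
next_href = response.xpath('//a[@id="next"]/@href').get()
if next_href and next_href != "javascript:;":
    # response.urljoin resolves the relative href against the current page URL
    yield scrapy.Request(response.urljoin(next_href), callback=self.parse)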
2. items.py
import scrapy


class TencentItem(scrapy.Item):
    # Define a Field for every value the spider collects
    position_name = scrapy.Field()
    position_link = scrapy.Field()
    position_type = scrapy.Field()
    people_num = scrapy.Field()
    work_address = scrapy.Field()
    publish_time = scrapy.Field()
*****Remember: these field names must stay consistent with the item keys assigned in tencentPosition.py
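The consistency matters because scrapy.Item only accepts keys that were declared as Fields; assigning anything else fails immediately rather than silently. A quick illustration (the salary key is hypothetical):

item = TencentItem()
item["position_name"] = "Backend Engineer"  # fine: declared in TencentItem
item["salary"] = "20k"  # raises KeyError: 'TencentItem does not support field: salary'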
3. pipelines.py
import json


class TencentPipeline(object):
    def __init__(self):
        print("=======start========")
        # Open the output file once, when the pipeline is instantiated
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        print("=====ing=======")
        dict_item = dict(item)  # convert the Item to a plain dict
        # ensure_ascii=False keeps the Chinese text readable in the output file
        json_text = json.dumps(dict_item, ensure_ascii=False) + "\n"
        self.file.write(json_text)
        return item

    def close_spider(self, spider):
        print("=======end===========")
        self.file.close()
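Opening the file in __init__ works, but it happens as soon as Scrapy instantiates the pipeline. A sketch of the same pipeline using Scrapy's open_spider hook instead (the class name TencentJsonPipeline is made up for illustration), so the file is only open while a crawl actually runs:

import json


class TencentJsonPipeline(object):
    # Same behavior as TencentPipeline, but the file is opened and closed
    # by Scrapy's spider lifecycle hooks instead of in __init__
    def open_spider(self, spider):
        self.file = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()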
4. settings.py
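The original post showed settings.py only as a screenshot; a minimal sketch of the entries this project relies on (the first three are the startproject defaults, the pipeline registration must point at TencentPipeline, and the remaining values are assumptions):

# settings.py — minimal sketch of the configuration this project needs
BOT_NAME = 'tencent'
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

# Assumption: a browser-like User-Agent, since job sites often block Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Assumption: the tutorial crawls without honoring robots.txt
ROBOTSTXT_OBEY = False

# Register the pipeline so scraped items are written to tencent.json
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}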
5. Running the spider:
1) Create a main.py in the project root
2) main.py
# Launch the spider programmatically, equivalent to running
# "scrapy crawl tencentPosition" from the command line
from scrapy import cmdline

cmdline.execute("scrapy crawl tencentPosition".split())
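main.py should sit next to scrapy.cfg, because cmdline.execute locates the project settings by searching for that file upward from the working directory; running main.py in PyCharm is then equivalent to typing scrapy crawl tencentPosition in a terminal at the project root, which makes the spider easy to debug with breakpoints.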
3. Running result: