Use the Scrapy framework to crawl Tencent's recruitment postings. The crawl URL is: https://hr.tencent.com/position.php
Fields to scrape: position name, number of openings, work location, publish time, plus the detailed job duties and requirements on each posting's detail page.
The final results are saved to two files: one holds the list-page fields (position, category, headcount, location, publish time), the other holds the detail-page content.
Comparing the raw page source with the DOM shown in the F12 developer tools shows that the page is a static (server-rendered) page.
So XPath can be used to parse the page source and pull the relevant content out of the tr tags; see the code below for the details.
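Before writing the spider, the row XPath can be checked on its own. Below is a minimal sketch using scrapy.Selector against a hand-written HTML fragment; the fragment is only an approximation of the real listing table, and only the even/odd row classes and column order come from the page structure used later in the spider:

# Quick check of the row XPath against a hand-made HTML fragment
# (the fragment is an assumption that mimics the real listing table).
from scrapy import Selector

html = """
<table>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td>
      <td>TEG</td><td>2</td><td>Shenzhen</td><td>2018-10-01</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Data Analyst</a></td>
      <td>SNG</td><td>1</td><td>Beijing</td><td>2018-10-02</td></tr>
</table>
"""

rows = Selector(text=html).xpath('//tr[@class="even"] | //tr[@class="odd"]')
for row in rows:
    # Print the position name and the work location of each row
    print(row.xpath('./td[1]/a/text()').extract_first(),
          row.xpath('./td[4]/text()').extract_first())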
After generating the project with scrapy startproject <project_name> (here the project is named tencent), open items.py and first define the fields to scrape.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Position name
    position_name = scrapy.Field()
    # Position category
    position_type = scrapy.Field()
    # Number of openings
    wanted_number = scrapy.Field()
    # Work location
    work_location = scrapy.Field()
    # Publish time
    publish_time = scrapy.Field()
    # Link to the detail page
    position_link = scrapy.Field()


class DetailsItem(scrapy.Item):
    """Data extracted from the detail page, saved to a separate file."""
    # Job duties
    work_duties = scrapy.Field()
    # Job requirements / skills
    work_skills = scrapy.Field()
After generating the spider with scrapy genspider <name> <start_url>, open the spider file under the spiders folder and write the crawling logic. The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
# Import the item classes that define the fields to scrape
from tencent.items import TencentItem, DetailsItem


class TencentWantedSpider(scrapy.Spider):
    name = 'tencent_wanted'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']
    base_url = 'https://hr.tencent.com/'

    def parse(self, response):
        # Locate the table rows that hold the job postings
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        # Relative link of the "next page" button
        next_page = response.xpath('//a[@id="next"]/@href').extract_first()

        # Iterate over the rows, fill the list item, then follow the detail page
        for node in node_list:
            item = TencentItem()
            item['position_name'] = node.xpath('./td[1]/a/text()').extract_first()
            item['position_link'] = node.xpath('./td[1]/a/@href').extract_first()
            item['position_type'] = node.xpath('./td[2]/text()').extract_first()
            item['wanted_number'] = node.xpath('./td[3]/text()').extract_first()
            item['work_location'] = node.xpath('./td[4]/text()').extract_first()
            item['publish_time'] = node.xpath('./td[5]/text()').extract_first()
            yield item
            if item['position_link']:
                yield scrapy.Request(url=self.base_url + item['position_link'],
                                     callback=self.details)

        # Follow the next page, if there is one
        if next_page:
            yield scrapy.Request(url=self.base_url + next_page, callback=self.parse)

    def details(self, response):
        """Extract and parse the fields on the detail page."""
        item = DetailsItem()
        # The two <ul class="squareli"> lists hold the duties and the requirements
        item['work_duties'] = ''.join(
            response.xpath('//ul[@class="squareli"]')[0].xpath('./li/text()').extract())
        item['work_skills'] = ''.join(
            response.xpath('//ul[@class="squareli"]')[1].xpath('./li/text()').extract())
        yield item
To save the scraped data, first register the pipeline under the ITEM_PIPELINES setting in settings.py, for example:
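A minimal sketch of that registration, assuming the project is named tencent and the pipeline class is the TencentPipeline shown below (300 is just a conventional priority value; any integer from 0 to 1000 works, lower runs first):

# settings.py -- enable the pipeline so Scrapy passes items through it.
# The key is the import path of the pipeline class; the number controls
# the order when several pipelines are enabled.
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}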
The pipeline code is as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

from tencent.items import TencentItem, DetailsItem


class TencentPipeline(object):

    def open_spider(self, spider):
        """Called when the spider starts: open the two output files."""
        self.file = open('tenc_wanted_2.json', 'w', encoding='utf-8')
        self.file_detail = open('tenc_wanted_detail.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False)
        # Check which item class the data comes from and write to the matching file
        if isinstance(item, TencentItem):
            self.file.write(content + '\n')
        if isinstance(item, DetailsItem):
            self.file_detail.write(content + '\n')
        return item

    def close_spider(self, spider):
        """Called when the spider finishes: close the files."""
        self.file.close()
        self.file_detail.close()
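After starting the crawl with scrapy crawl tencent_wanted, each output file contains one JSON object per line. A small sketch for loading the results back for inspection (the file names match the ones opened in the pipeline above):

# Read the JSON-lines output written by the pipeline (one JSON object per line).
import json

with open('tenc_wanted_2.json', encoding='utf-8') as f:
    positions = [json.loads(line) for line in f if line.strip()]

with open('tenc_wanted_detail.json', encoding='utf-8') as f:
    details = [json.loads(line) for line in f if line.strip()]

print(len(positions), 'positions,', len(details), 'detail records')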