Install Python
No need to cover this here; there are plenty of tutorials online.
Install the scrapy package
pip install scrapy
Create a scrapy project
scrapy startproject aliSpider
Create the spider file inside the project directory
In cmd, change into the project directory and run:
scrapy genspider -t crawl alispi job.alibaba.com
Write the items.py file
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AlispiderItem(scrapy.Item):
    # define the fields for your item here like:
    detail = scrapy.Field()
    workPosition = scrapy.Field()
    jobclass = scrapy.Field()
Write the alispi.py file
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from aliSpider.items import AlispiderItem


class AlispiSpider(CrawlSpider):
    name = 'alispi'
    allowed_domains = ['job.alibaba.com']
    start_urls = ['https://job.alibaba.com/zhaopin/positionList.html#page/0']

    # Follow every link whose URL contains at least one digit.
    # A raw string avoids an invalid "\d" escape in a normal string literal.
    pagelink = LinkExtractor(allow=r"\d+")

    rules = (
        Rule(pagelink, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # for each in response.xpath("//tr[@style='display:none']"):
        for each in response.xpath("//tr"):
            item = AlispiderItem()
            # Detail link (href of the anchor in column 1)
            item['detail'] = each.xpath("./td[1]/span/a/@href").extract()
            # Value stored as workPosition (text of column 3)
            item['workPosition'] = each.xpath("./td[3]/span/text()").extract()
            # Job category (text of column 2)
            item['jobclass'] = each.xpath("./td[2]/span/text()").extract()
            yield item
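The allow pattern given to LinkExtractor is applied as a regular-expression search against each link's absolute URL, so \d+ accepts any link that contains a digit anywhere. Below is a minimal sketch of that matching behaviour using only the standard library. It is not Scrapy's own code; the helper name is_allowed and the sample URLs are made up for illustration.

```python
import re

# Same pattern the spider passes to LinkExtractor(allow=...).
ALLOW_PATTERN = re.compile(r"\d+")


def is_allowed(url, pattern=ALLOW_PATTERN):
    """Hypothetical helper: True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Illustrative URLs only; they are not taken from the real site.
sample_urls = [
    "https://job.alibaba.com/zhaopin/example_list.htm?page=2",
    "https://job.alibaba.com/zhaopin/example_detail.htm?positionId=12345",
    "https://job.alibaba.com/zhaopin/about.htm",
]

allowed = [url for url in sample_urls if is_allowed(url)]

for url in sample_urls:
    print(url, "->", "follow" if is_allowed(url) else "skip")
```

Since a single digit is enough for a match, a narrower pattern may be worth trying if the crawl ends up following more links than intended.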
Run
scrapy crawl alispi
Output to the file items.json
scrapy crawl alispi -o items.json
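The exported file can be loaded back for a quick check. Here is a minimal sketch, assuming the crawl wrote a JSON array to items.json as in the command above. Every field was filled with .extract(), so each value is a list of strings. The spider also iterates over every //tr, so rows without matching cells produce items whose fields are all empty lists; the sketch drops those. The helper names (load_items, non_empty, flatten) and the sample data are made up for illustration.

```python
import json

# Made-up sample mimicking the shape of items.json; real values will differ.
SAMPLE_JSON = """
[
  {"detail": [], "workPosition": [], "jobclass": []},
  {"detail": ["/example/detail.htm?id=1"],
   "workPosition": ["Example City"],
   "jobclass": ["Example Category"]}
]
"""


def load_items(path):
    """Load the JSON array written by `scrapy crawl alispi -o items.json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def non_empty(items):
    """Keep only items where at least one field holds extracted data."""
    return [item for item in items if any(item.values())]


def flatten(item):
    """Join each field's list of strings into one stripped string."""
    return {key: " ".join(value).strip() for key, value in item.items()}


# For a real run, replace json.loads(SAMPLE_JSON) with load_items("items.json").
rows = [flatten(item) for item in non_empty(json.loads(SAMPLE_JSON))]
print(rows)
```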
A successful run displays the following
Version notes
python 3.5.5