My first Scrapy crawler

Install Python

I won't cover this here; there are plenty of tutorials online.

Install the Scrapy package

pip install scrapy
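
If the install succeeds, the scrapy command-line tool becomes available; a quick way to check is:

scrapy version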

Create a Scrapy project

scrapy startproject aliSpider
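
startproject generates a project skeleton; the layout looks roughly like this (the exact files can vary a little between Scrapy versions):

aliSpider/
    scrapy.cfg            # deploy/config file
    aliSpider/
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py   # the spider file is generated here in the next step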

Create the spider file inside the project directory

Open a terminal (cmd) in the project directory and run:

scrapy genspider -t crawl alispi job.alibaba.com
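
Here -t crawl selects the CrawlSpider template, alispi is the spider name, and job.alibaba.com becomes the allowed domain. The generated aliSpider/spiders/alispi.py starts out roughly like this (template details vary by Scrapy version) before we edit it below:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AlispiSpider(CrawlSpider):
    name = 'alispi'
    allowed_domains = ['job.alibaba.com']
    start_urls = ['http://job.alibaba.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item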

Write the items.py file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AlispiderItem(scrapy.Item):
    # define the fields for your item here:
    detail = scrapy.Field()        # link to the job detail page
    workPosition = scrapy.Field()  # work location
    jobclass = scrapy.Field()      # job category
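
A scrapy.Item behaves like a dict with a fixed set of allowed keys; a minimal sketch of how the spider below fills one in (the values here are made up for illustration):

item = AlispiderItem()
item['detail'] = ['/zhaopin/position_detail.htm']   # hypothetical value
item['workPosition'] = ['Hangzhou']                 # hypothetical value
item['jobclass'] = ['Engineering']                  # hypothetical value
print(dict(item))        # an Item converts to a plain dict
# assigning a key that was not declared as a Field raises KeyError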

Write the alispi.py file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from aliSpider.items import AlispiderItem


class AlispiSpider(CrawlSpider):
    name = 'alispi'
    allowed_domains = ['job.alibaba.com']
    start_urls = ['https://job.alibaba.com/zhaopin/positionList.html#page/0']
    pagelink = LinkExtractor(allow=r"\d+")
    rules = (
        Rule(pagelink, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # alternative, narrower selector: response.xpath("//tr[@style='display:none']")
        for each in response.xpath("//tr"):
            item = AlispiderItem()
            # detail page link
            item['detail'] = each.xpath("./td[1]/span/a/@href").extract()
            # work location
            item['workPosition'] = each.xpath("./td[3]/span/text()").extract()
            # job category
            item['jobclass'] = each.xpath("./td[2]/span/text()").extract()
            yield item
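
The XPath expressions above assume a particular table layout on the listing page; if the crawl yields empty items, the selectors can be tested interactively with Scrapy's shell (the URL is just the start URL from above):

scrapy shell "https://job.alibaba.com/zhaopin/positionList.html#page/0"
>>> response.xpath("//tr")                                # rows the spider iterates over
>>> response.xpath("//tr/td[1]/span/a/@href").extract()   # candidate detail links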

Run the crawl

scrapy crawl alispi

Output to the file items.json

scrapy crawl alispi -o items.json
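
The -o option exports every yielded item to items.json as a JSON list; a small sketch for loading the result back in Python (assumes the crawl above has already produced items.json):

import json

with open('items.json', encoding='utf-8') as f:
    items = json.load(f)

print(len(items))   # number of scraped rows
print(items[0])     # e.g. {'detail': [...], 'workPosition': [...], 'jobclass': [...]}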

If the run succeeds, Scrapy prints its crawl log along with the scraped items.


Version notes

Python 3.5.5