Full-Site Data Crawling Based on CrawlSpider

  • CrawlSpider is a subclass of the Spider crawler class

    Usage workflow

  1. Create a CrawlSpider-based spider file: scrapy genspider -t crawl spider_name www.xxx.com
  2. Construct the link extractor(s) and rule parser(s)
    • Link extractor (LinkExtractor):
      • Purpose: extracts links from the page according to a specified rule
      • Extraction rule: allow = "regular expression"
      • All URLs on the page are collected first, then filtered down to the links that match the allow pattern
    • Rule parser (Rule):
      • Purpose: sends a request to each link the link extractor extracted, then parses the page source of the response according to the specified rule (callback)
      • The follow=True parameter applies the link extractor again to the pages reached through the extracted pagination links, so deeper pages are discovered recursively (see the minimal skeleton after this list)
  3. Notes:
    • Link extractors and rule parsers are paired one-to-one: each Rule takes exactly one LinkExtractor
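
  The pairing is easiest to see in a minimal skeleton. Every name below (demo_spider, example.com, the page=\d+ pattern, parse_page) is a placeholder assumption for illustration, not part of the example project that follows.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class DemoSpider(CrawlSpider):
        # Hypothetical names, for illustration only
        name = 'demo_spider'
        start_urls = ['https://www.example.com/list?page=1']

        rules = (
            # One LinkExtractor per Rule: collect every link on the page,
            # keep the ones matching the allow regex, send a request to
            # each, hand the response to parse_page, and apply the same
            # extractor to the newly reached pages (follow=True).
            Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # Parse the page source of each matched page here
            pass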

Example code

  • Deep (multi-level) data crawling based on CrawlSpider

    • Spider file
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from sunspider.items import SunspiderItem, SunspiderItemSecond
    
    
    class SunSpiderSpider(CrawlSpider):
        name = 'sun_spider'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']
        # Two levels of data to crawl, so two link extractors; link
        # extractors and rule parsers are paired one-to-one
        link = LinkExtractor(allow=r'type=4&page=\d+')
        link_detail = LinkExtractor(allow=r'question/\d+/\d+\.shtml')
        rules = (
            # Instantiate the Rule (rule parser) objects; each callback
            # must name a method defined on this spider
            Rule(link, callback='parse_item', follow=True),
            Rule(link_detail, callback='parse_detail', follow=True),
        )
    
        # Parses each list (pagination) page matched by `link`
        def parse_item(self, response):
            tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
            for tr in tr_list:
                title = tr.xpath('./td[2]/a[2]/@title').extract_first()
                status = tr.xpath('./td[3]/span/text()').extract_first()
                num = tr.xpath('./td[1]/text()').extract_first()
                item = SunspiderItem()
                item['title'] = title
                item['status'] = status
                item['num'] = num
                yield item
    
        # Parses each detail page matched by `link_detail`
        def parse_detail(self, response):
            content = response.xpath('/html/body/div[9]/table[2]/tbody/tr[1]//text()').extract()
            content = ''.join(content)
            num = response.xpath('/html/body/div[9]/table[1]/tbody/tr/td[2]/span[2]/text()').extract_first()
            if num:
                num = num.split(':')[-1]
                item = SunspiderItemSecond()
                item['content'] = content
                item['num'] = num
                yield item
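    The two callbacks split the work: parse_item handles the list pages matched by link, parse_detail handles the detail pages matched by link_detail, and the num field yielded by both is what ties the two item streams together. The crawl is then started from the project root with the standard Scrapy command (sun_spider is the name defined above):

      scrapy crawl sun_spider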
    • items.py file

      import scrapy
      # Define two item classes and mark their correspondence through a shared field (num)
      class SunspiderItem(scrapy.Item):
          title = scrapy.Field()
          status = scrapy.Field()
          num = scrapy.Field()
      
      class SunspiderItemSecond(scrapy.Item):
          content = scrapy.Field()
          num = scrapy.Field()
    • pipelines.py file

      • Persist the crawled data
      class SunspiderPipeline(object):
          def process_item(self, item, spider):
              # Determine which item class this item was built from
              if item.__class__.__name__ == "SunspiderItemSecond":
                  content = item['content']
                  num = item['num']
                  print(content, num)
              else:
                  title = item['title']
                  status = item['status']
                  num = item['num']
      
                  print(title, status, num)
              return item
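      For this pipeline to receive the yielded items, it must be enabled in the project's settings.py. A minimal entry, assuming the project module is named sunspider as in the imports above (300 is the conventional priority value):

        ITEM_PIPELINES = {
            'sunspider.pipelines.SunspiderPipeline': 300,
        }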