The Scrapy framework offers two kinds of spiders: the Spider class and the CrawlSpider class.
This example implements the crawler with the CrawlSpider class.
CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages in its start_urls list, while CrawlSpider defines a set of rules that provide a convenient mechanism for following links, which makes it better suited to jobs that extract links from crawled pages and keep crawling them, such as scraping a large recruitment site.
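To make the difference concrete, here is a minimal sketch of the two styles (the class names and URL patterns are illustrative, not part of this project):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# A plain Spider only fetches what is listed in start_urls;
# it never follows links on its own.
class PlainSpider(scrapy.Spider):
    name = 'plain'
    start_urls = ['http://example.com/list']

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').extract_first()}

# A CrawlSpider extracts every link matching its rules from each
# fetched page and keeps crawling them automatically.
class FollowingSpider(CrawlSpider):
    name = 'following'
    start_urls = ['http://example.com/list']
    rules = (
        Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//title/text()').extract_first()}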
Create the project
scrapy startproject tencent  # create the project
Generate the spider from a template
scrapy genspider -t crawl tencent hr.tencent.com  # tencent is the spider name, hr.tencent.com is the allowed domain
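For reference, these two commands leave behind the standard Scrapy project layout, roughly:

tencent/
    scrapy.cfg
    tencent/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            tencent.py    # the generated CrawlSpider template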
After generating the template, a tencent.py file is created:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
rules contains one or more Rule objects, and each Rule defines a specific action for crawling the site. If several rules match the same link, the first one in the order they are defined in this collection is used.

Parameters:

LinkExtractor: a Link Extractor object that defines which links should be extracted.
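Besides the extractor itself, a Rule usually takes a callback (the method that handles each matched page) and follow (whether to keep extracting links from those pages). A short sketch of typical usage; the second pattern is illustrative, not from this project:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # follow pagination links and parse every listing page
    Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=True),
    # parse detail pages without extracting further links from them
    # (follow defaults to False when a callback is given)
    Rule(LinkExtractor(allow=r'position_detail\.php'), callback='parse_detail'),
)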
Next, declare the fields to scrape in items.py:

import scrapy


class TencentItem(scrapy.Item):
    # job title
    name = scrapy.Field()
    # detail page link
    positionlink = scrapy.Field()
    # position category
    positiontype = scrapy.Field()
    # headcount
    peoplenum = scrapy.Field()
    # work location
    worklocation = scrapy.Field()
    # publish date
    publish = scrapy.Field()
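An Item behaves like a dict, which is exactly what the pipeline below relies on when it calls dict(item); a quick illustration with a hypothetical value:

from tencent.items import TencentItem

item = TencentItem()
item['name'] = 'Backend Engineer'  # hypothetical value
print(dict(item))                  # {'name': 'Backend Engineer'}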
pipelines.py writes every item to tencent.json, one JSON object per line:

import json


class TencentPipeline(object):
    def __init__(self):
        # open the output file once when the spider starts
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # serialize the item as one JSON object per line
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
Register the pipeline (along with logging and default headers) in settings.py:

BOT_NAME = 'tencent'

SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

LOG_FILE = 'tenlog.log'
LOG_LEVEL = 'DEBUG'
LOG_ENCODING = 'utf-8'

ROBOTSTXT_OBEY = True

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
With the plumbing in place, the finished spider looks like this:

# -*- coding: utf-8 -*-
import scrapy
# LinkExtractor pulls out every link that matches a pattern
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tencent.items import TencentItem


class TencentSpider(CrawlSpider):
    name = 'tencent1'
    # optional; restricts the crawl to this domain
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/position.php?&start=0#a']

    # matching rule for the links to extract from each response
    pagelink = LinkExtractor(allow=r'start=\d+')

    # several rules can be listed here
    rules = [
        # every link that matches a rule is requested, and the callback
        # handles the response; a rule batches requests.
        # follow=True keeps extracting pagination links from matched
        # pages (with a callback set, follow defaults to False)
        Rule(pagelink, callback='parse_item', follow=True),
    ]

    # do not define a parse() method: CrawlSpider implements it
    # internally, and overriding it breaks the spider
    def parse_item(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # collect the data into an item, which is used like a dict
            item = TencentItem()
            # each.xpath(...) returns a list; extract() converts it to
            # unicode strings and [0] takes the first one
            # job title
            item['name'] = each.xpath('./td[1]/a/text()').extract()[0]
            # detail page link
            item['positionlink'] = each.xpath('./td[1]/a/@href').extract()[0]
            # position category
            item['positiontype'] = each.xpath('./td[2]/text()').extract()[0]
            # headcount
            item['peoplenum'] = each.xpath('./td[3]/text()').extract()[0]
            # work location
            item['worklocation'] = each.xpath('./td[4]/text()').extract()[0]
            # publish date
            item['publish'] = each.xpath('./td[5]/text()').extract()[0]
            # hand the item to the pipeline
            yield item
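From the project root, start the crawl by spider name:

scrapy crawl tencent1                # run the spider; items flow through the pipeline
scrapy crawl tencent1 -o items.json  # alternative: Scrapy's built-in feed export, no custom pipeline needed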
And that is all it takes to implement a simple CrawlSpider-based crawler.