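Project name: qidian
Project description: Use Scrapy to crawl the 500 novels on the overall "Completed Works" ranking (完本榜) of Qidian (起點中文網). The fields extracted are the novel title, author, and category, which are then saved to a CSV file.
Target URL: https://www.qidian.com/rank/fin?style=1
Project requirements (fields to extract):
1. Novel title
2. Author
3. Novel category
Step 1: Create the project in the shell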
scrapy startproject qidian
Step 2: Edit items.py according to the project requirements
# -*- coding: utf-8 -*-
import scrapy


class QidianItem(scrapy.Item):
    name = scrapy.Field()      # novel title
    author = scrapy.Field()    # author
    category = scrapy.Field()  # novel category
Step 3: Analyze the page, extract the data with XPath or CSS selectors, then create and edit spider.py
# -*- coding: utf-8 -*-
import scrapy
from ..items import QidianItem


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    start_urls = ['https://www.qidian.com/rank/fin?style=1&dateType=3']

    def parse(self, response):
        # Each book on the ranking page sits in its own "book-mid-info" block.
        sel = response.xpath('//div[@class="book-mid-info"]')
        for i in sel:
            name = i.xpath('./h4/a/text()').extract_first()
            # The first link in the author line is the author, the last is the category.
            author = i.xpath('./p[@class="author"]/a[1]/text()').extract_first()
            category = i.xpath('./p[@class="author"]/a[last()]/text()').extract_first()
            item = QidianItem()
            item['name'] = name
            item['author'] = author
            item['category'] = category
            yield item
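The relative XPath expressions used in the spider can be tried offline before running a crawl. Here is a minimal sketch that uses only the standard library's xml.etree.ElementTree on a hypothetical, simplified fragment. The markup and the sample values in SAMPLE are assumptions made for illustration, not a copy of the real page, and extract_books is a made-up helper name. ElementTree supports only a subset of XPath (no text()), so the text is read with findtext instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the selectors used in the spider.
# The real page is larger and may be structured differently.
SAMPLE = """
<html><body>
  <div class="book-mid-info">
    <h4><a>Book One</a></h4>
    <p class="author"><a>Author A</a><a>Fantasy</a></p>
  </div>
  <div class="book-mid-info">
    <h4><a>Book Two</a></h4>
    <p class="author"><a>Author B</a><a>Subcategory</a><a>History</a></p>
  </div>
</body></html>
""".strip()


def extract_books(markup):
    """Return one dict per book block found in the markup."""
    root = ET.fromstring(markup)
    books = []
    for block in root.findall(".//div[@class='book-mid-info']"):
        books.append({
            "name": block.findtext("./h4/a"),
            # a[1] is the first link in the author line, a[last()] the last one.
            "author": block.findtext("./p[@class='author']/a[1]"),
            "category": block.findtext("./p[@class='author']/a[last()]"),
        })
    return books


books = extract_books(SAMPLE)
for book in books:
    print(book)
```

The second sample block has three links on purpose: it shows why a[last()] is used for the category rather than a fixed position such as a[2].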
The code above handles a single page of data. Next, follow the link to the next page. (The project is small, so I don't think any sophisticated technique is needed; the code can be written simply by observing how the URL is constructed.) The complete spider.py follows:
# -*- coding: utf-8 -*-
import scrapy
from ..items import QidianItem


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    start_urls = ['https://www.qidian.com/rank/fin?style=1&dateType=3']
    n = 1  # current page number, starting at page 1

    def parse(self, response):
        sel = response.xpath('//div[@class="book-mid-info"]')
        for i in sel:
            name = i.xpath('./h4/a/text()').extract_first()
            author = i.xpath('./p[@class="author"]/a[1]/text()').extract_first()
            category = i.xpath('./p[@class="author"]/a[last()]/text()').extract_first()
            item = QidianItem()
            item['name'] = name
            item['author'] = author
            item['category'] = category
            yield item

        # The ranking spans 25 pages; request the next one until the last is reached.
        if self.n < 25:
            self.n += 1
            next_url = 'https://www.qidian.com/rank/fin?style=1&dateType=3&page=%d' % self.n
            # The callback must be the bound method self.parse; a bare `parse` raises NameError.
            yield scrapy.Request(next_url, callback=self.parse)
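The URL pattern that the pagination relies on can be written out on its own. The sketch below builds the list of page URLs with the standard library. The first URL is the start URL from the spider, and pages 2 to 25 add the page parameter. BASE_URL, LAST_PAGE, and page_urls are names introduced here for illustration, and the whole thing assumes the site keeps this query-string layout.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.qidian.com/rank/fin"
LAST_PAGE = 25  # taken from the page limit used in the spider


def page_urls(last_page=LAST_PAGE):
    """Build the ranking URLs for page 1 through last_page."""
    common = {"style": 1, "dateType": 3}
    # Page 1 is the start URL, which carries no page parameter.
    urls = [f"{BASE_URL}?{urlencode(common)}"]
    for page in range(2, last_page + 1):
        query = urlencode({**common, "page": page})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

A list like this could also be assigned to start_urls in place of the page counter, in which case parse would only need to yield items. Whether that is preferable is a design choice; the counter version requests the pages strictly one after another.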
Step 4: Run the spider and save the data
scrapy crawl qidian -o qidian.csv
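Once the crawl finishes, a quick sanity check on the exported file can confirm that the expected number of rows was collected and that no field came back empty. This is a rough sketch using only the csv module. It assumes the file has a header row with the three field names defined in items.py. The check_rows helper and the sample rows are hypothetical and stand in for the real qidian.csv.

```python
import csv
import io

FIELDS = ("name", "author", "category")


def check_rows(csv_file):
    """Return (row_count, line numbers of rows with an empty field)."""
    reader = csv.DictReader(csv_file)
    total = 0
    incomplete = []
    # Data starts on line 2 because line 1 is the header
    # (assuming no field spans several lines).
    for line_no, row in enumerate(reader, start=2):
        total += 1
        if any(not (row.get(field) or "").strip() for field in FIELDS):
            incomplete.append(line_no)
    return total, incomplete


# Hypothetical sample standing in for the exported file.
SAMPLE = (
    "name,author,category\n"
    "Book One,Author A,Fantasy\n"
    "Book Two,,History\n"
)

total, incomplete = check_rows(io.StringIO(SAMPLE))
print(total, incomplete)
```

For the real export, the same function can be called on a file opened with open("qidian.csv", newline="", encoding="utf-8"), assuming the file was written as UTF-8. A full run should then report 500 rows.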