Scrapy Example Tutorial
第一步:建立項目目錄html
scrapy startproject tutorial
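For orientation, the startproject command generates a project skeleton roughly like the following (the exact files vary slightly between Scrapy versions):

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited in Step 3)
        middlewares.py
        pipelines.py
        settings.py       # project settings (edited in Step 5)
        spiders/          # spiders live here (Step 2 adds baidu.py)
            __init__.py
```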
Step 2: Enter the tutorial directory and generate a spider
scrapy genspider baidu www.baidu.com
Step 3: Create the item container: copy the project's items.py and rename it to BaiduItems.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduItems(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
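A scrapy.Item behaves like a dict whose keys are restricted to the declared fields: assigning an undeclared key raises KeyError. The stdlib sketch below imitates that behavior to show the idea; it is not Scrapy's actual implementation, and the class names merely mirror the tutorial's.

```python
class Field(dict):
    """Stand-in for per-field metadata, mirroring scrapy.Field."""


class Item:
    """Minimal dict-like container that only accepts declared fields."""
    fields = {}

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class BaiduItems(Item):
    fields = {'title': Field(), 'link': Field(), 'desc': Field()}


item = BaiduItems()
item['title'] = ['Example Title']   # declared field: accepted
print(item['title'])
```

This is why the spider in the next step can write item['title'], item['link'] and item['desc'] but nothing else.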
Step 4: Edit spiders/baidu.py to extract data with XPath
# -*- coding: utf-8 -*-
import scrapy

# import the item container
from tutorial.BaiduItems import BaiduItems


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.readingbar.net']
    start_urls = ['http://www.readingbar.net/']

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = BaiduItems()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
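To see what those XPath expressions select without running a crawl, the same structure can be walked with the stdlib's xml.etree.ElementTree on a small, well-formed HTML snippet (the snippet below is invented for illustration; ElementTree supports only a subset of XPath, so text nodes are read via .text instead of text()):

```python
import xml.etree.ElementTree as ET

html = """
<html><body><ul>
  <li>first <a href="/a">Link A</a></li>
  <li>second <a href="/b">Link B</a></li>
</ul></body></html>
"""

root = ET.fromstring(html)
items = []
for sel in root.findall('.//ul/li'):     # ~ response.xpath('//ul/li')
    a = sel.find('a')
    items.append({
        'title': a.text,                 # ~ sel.xpath('a/text()')
        'link': a.get('href'),           # ~ sel.xpath('a/@href')
        'desc': sel.text.strip(),        # ~ sel.xpath('text()')
    })
print(items)
```

Scrapy's own selectors return lists from .extract(), which is why the item fields in the spider hold lists rather than single strings.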
Step 5: Fix the blank-page problem when scraping the Baidu homepage by editing settings.py
# set a browser user agent so Baidu does not serve an empty page
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
# stop obeying robots.txt, which would otherwise block the crawl
ROBOTSTXT_OBEY = False
# fix garbled non-ASCII characters in the exported data
FEED_EXPORT_ENCODING = 'utf-8'
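FEED_EXPORT_ENCODING = 'utf-8' matters because, without it, JSON export escapes every non-ASCII character into \uXXXX sequences. The same effect can be reproduced with the stdlib json module (the sample string here is just an illustration):

```python
import json

data = {'title': '百度一下,你就知道'}

# default behavior: non-ASCII characters are escaped to \uXXXX,
# which is the "garbled" output the setting fixes
escaped = json.dumps(data)
# with ensure_ascii=False the output stays readable UTF-8,
# analogous to setting FEED_EXPORT_ENCODING = 'utf-8'
readable = json.dumps(data, ensure_ascii=False)

print(escaped)
print(readable)
```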
Final step: run the crawl command and save the data to the specified file
On Windows the crawl may fail with: No module named 'win32api'. Installing pywin32 (pip install pywin32) resolves it.
scrapy crawl baidu -o baidu.json
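The -o baidu.json option writes the yielded items as a JSON array, with each field a list (because .extract() returns lists). A quick way to consume that file afterwards, shown here on an invented sample string in place of the real file:

```python
import json

# hypothetical sample matching the shape of Scrapy's -o baidu.json output
sample = '[{"title": ["Example"], "link": ["/x"], "desc": ["intro text"]}]'

items = json.loads(sample)
for it in items:
    # each field is a list, so take the first element
    print(it['title'][0], it['link'][0])
```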
Deep-crawling the Baidu homepage and the pages behind its navigation tabs
# -*- coding: utf-8 -*-
import scrapy

from scrapyProject.BaiduItems import BaiduItems


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    # the nav tabs point at other Baidu subdomains, which must be added
    # to allowed_domains or Scrapy will filter those requests out
    allowed_domains = [
        'www.baidu.com',
        'v.baidu.com',
        'map.baidu.com',
        'news.baidu.com',
        'tieba.baidu.com',
        'xueshu.baidu.com'
    ]
    start_urls = ['https://www.baidu.com/']

    def parse(self, response):
        item = BaiduItems()
        item['title'] = response.xpath('//title/text()').extract()
        yield item
        # note: for this spider, BaiduItems must also declare nav and href fields
        for sel in response.xpath('//a[@class="mnav"]'):
            item = BaiduItems()
            item['nav'] = sel.xpath('text()').extract()
            item['href'] = sel.xpath('@href').extract()
            yield item
            # build a new request from the extracted nav URL and hand it to
            # the callback; guard against anchors that carry no href
            if item['href']:
                yield scrapy.Request(item['href'][0], callback=self.parse_newpage)

    # extract the <title> of each tab page reached by the deep crawl
    def parse_newpage(self, response):
        item = BaiduItems()
        item['title'] = response.xpath('//title/text()').extract()
        yield item
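The request/callback flow above amounts to a breadth-first traversal: the homepage is parsed, each nav link is queued as a new request, and the callback handles each fetched page. A stdlib-only sketch of that flow, with a hypothetical PAGES dict standing in for real HTTP responses:

```python
from collections import deque

# hypothetical (title, outgoing nav links) pairs standing in for fetched pages
PAGES = {
    'https://www.baidu.com/': ('百度一下', ['https://news.baidu.com/',
                                            'https://map.baidu.com/']),
    'https://news.baidu.com/': ('百度新聞', []),
    'https://map.baidu.com/': ('百度地圖', []),
}


def crawl(start):
    """Follow nav links breadth-first, mirroring yield scrapy.Request(...)."""
    results, queue, seen = [], deque([start]), {start}
    while queue:
        url = queue.popleft()
        title, links = PAGES[url]          # "fetch" the page
        results.append({'url': url, 'title': title})  # ~ yield item
        for href in links:                 # ~ yield scrapy.Request(href, ...)
            if href not in seen:
                seen.add(href)
                queue.append(href)
    return results


print(crawl('https://www.baidu.com/'))
```

Scrapy additionally deduplicates requests and schedules them concurrently, but the spider's logic reduces to this traversal.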