Just getting started with Scrapy.
Working through the exercises in the book,
the task is to scrape the book information from the site the book links to.
import scrapy

class FuzxSpider(scrapy.Spider):
    name = "Fuzx"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book on the listing page sits in an <article class="product_pod">.
        for book in response.css('article.product_pod'):
            name = book.xpath('./h3/a/@title').extract_first()
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }
        # Follow the "next" pagination link until there are no pages left.
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
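Before running the full spider, the selectors can be sanity-checked interactively. A quick sketch using scrapy shell (my own aside, not from the book):

scrapy shell 'http://books.toscrape.com/'
>>> book = response.css('article.product_pod')[0]
>>> book.xpath('./h3/a/@title').extract_first()      # should print the first book's title
>>> book.css('p.price_color::text').extract_first()  # should print its price string (with the £ sign)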
Run it with the command: scrapy crawl Fuzx -o fuzx.csv
which saves the scraped items to fuzx.csv.
Looks like the crawl succeeded.
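A side note (my assumption, not something the book covers): on Scrapy 2.1 and later, the same CSV export can be configured once in settings.py instead of passing -o on every run:

# settings.py — equivalent of `scrapy crawl Fuzx -o fuzx.csv`
# (assumes Scrapy >= 2.1, which introduced the FEEDS setting)
FEEDS = {
    'fuzx.csv': {'format': 'csv'},
}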
To take it a bit further: people online say Douban is a relatively easy site for practicing crawlers.
I opened the fiction section of Douban Read: https://read.douban.com/kind/100
The goal is to scrape each book's title and author,
so I changed the CSS selectors accordingly.
The first request came back with a 403 error,
so I added a User-Agent (presumably Douban rejects Scrapy's default user agent).
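Rather than attaching headers to every Request by hand, the user agent can also be set once per spider. A minimal sketch of that alternative (not what I ended up doing below, just another option):

import scrapy

class DoubSpider(scrapy.Spider):
    name = "doub"
    # custom_settings overrides the project settings for this spider only,
    # so every request it sends carries the browser User-Agent.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'),
    }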
The second place things broke was here, because the class attribute contains a space (the <li> carries two classes, item and store-item):
for book in response.css('li.item.store-item'):
Following someone else's article, you can write it as shown above: chaining .item.store-item matches an element whose class attribute is "item store-item".
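To see why the chained form matters, here is a tiny self-contained check (my own illustration with made-up HTML): li.item.store-item requires both classes on the same element, whereas keeping the space would turn it into a descendant selector and match nothing.

from scrapy import Selector

html = '<ul><li class="item store-item">both</li><li class="item">only one</li></ul>'
sel = Selector(text=html)

# Chained classes: the element must carry BOTH "item" and "store-item".
print(sel.css('li.item.store-item::text').extract())  # ['both']

# With a space it becomes a descendant selector looking for a
# <store-item> tag inside li.item — which matches nothing here.
print(sel.css('li.item store-item::text').extract())  # []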
Full code:
import scrapy

class DoubSpider(scrapy.Spider):
    name = "doub"
    # Browser User-Agent to avoid the 403 that the default one triggers.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
    }

    def start_requests(self):
        url = 'https://read.douban.com/kind/100'
        yield scrapy.Request(url, headers=self.headers)

    def parse(self, response):
        # Each book is an <li class="item store-item"> — note the chained class selector.
        for book in response.css('li.item.store-item'):
            name = book.css('div.title a::text').extract_first()
            author = book.css('p span.labeled-text a.author-item::text').extract_first()
            yield {
                'name': name,
                'author': author,
            }
        # Keep paginating, passing the headers along with every request.
        next_url = response.css('div.pagination li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)
Then run: scrapy crawl doub -o db.csv
and check db.csv once the crawl finishes.
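To eyeball the result without opening a spreadsheet, a quick sketch (assuming the crawl wrote name and author columns to db.csv, as the spider above yields):

import csv

# Print the first few exported rows.
with open('db.csv', newline='', encoding='utf-8') as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row['name'], '-', row['author'])
        if i >= 4:
            break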