【Study Notes】Scrapy

  Just getting started with Scrapy.

  Practicing by following the examples in the book.

  The exercise is to crawl the book information from the link given in the book.

  

import scrapy


class FuzxSpider(scrapy.Spider):
    name = "Fuzx"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # each book on the listing page is an <article class="product_pod">
        for book in response.css('article.product_pod'):
            name = book.xpath('./h3/a/@title').extract_first()
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }
        # follow the "next" pagination link if there is one
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
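
  To double-check the selectors before running the whole spider, they can also be tried in Scrapy's interactive shell. A minimal session (standard scrapy shell usage; the printed output is omitted here):

scrapy shell 'http://books.toscrape.com/'
# then, at the shell prompt:
>>> book = response.css('article.product_pod')[0]
>>> book.xpath('./h3/a/@title').extract_first()
>>> book.css('p.price_color::text').extract_first()
>>> response.css('ul.pager li.next a::attr(href)').extract_first()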

 

  Command: scrapy crawl Fuzx -o fuzx.csv

  This saves the scraped items to fuzx.csv.
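
  As an aside, the same CSV export can be configured inside the project instead of on the command line. A minimal sketch, assuming Scrapy 2.1 or newer (the version that introduced the FEEDS setting):

# settings.py -- equivalent to the -o fuzx.csv flag
FEEDS = {
    'fuzx.csv': {'format': 'csv'},
}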

  

  It looks like the crawl succeeded.

  Extending the exercise a bit: people online say Douban is an easy site to practice crawling on.

  I opened the fiction page of Douban Read: https://read.douban.com/kind/100

  The goal is to crawl the book titles and authors.

  

 

  Then I changed the CSS selectors accordingly.

  That returned a 403 error.

  So I added a User-Agent header.
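
  For reference, the User-Agent can also be set once for the whole project instead of on every Request. A minimal sketch using Scrapy's standard USER_AGENT setting (the value is a browser-like string similar to the one in the spider below):

# settings.py
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36')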

  The second problem showed up here: the class attribute in the HTML contains a space (the <li> carries both the item and store-item classes):

  

for book in response.css('li.item.store-item'):

  Following someone else's article, the selector can be written as above: chaining the two classes with dots (and no space) matches elements that carry both classes.
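
  To convince yourself that this works, the selector can be checked against a made-up HTML fragment with Scrapy's Selector (the fragment below is invented just for the test):

from scrapy.selector import Selector

html = '<ul><li class="item store-item">both classes</li><li class="item">only one</li></ul>'
sel = Selector(text=html)
# 'li.item.store-item' matches only the <li> that has BOTH classes
print(sel.css('li.item.store-item::text').extract())  # ['both classes']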

  Full code:

import scrapy


class DoubSpider(scrapy.Spider):
    name = "doub"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
    }

    def start_requests(self):
        url = 'https://read.douban.com/kind/100'
        # send a browser-like User-Agent, otherwise the site answers 403
        yield scrapy.Request(url, headers=self.headers)

    def parse(self, response):
        # each book card is an <li> carrying both the "item" and "store-item" classes
        for book in response.css('li.item.store-item'):
            name = book.css('div.title a::text').extract_first()
            author = book.css('p span.labeled-text a.author-item::text').extract_first()
            yield {
                'name': name,
                'author': author,
            }
        # follow pagination, keeping the custom headers on every request
        next_url = response.css('div.pagination li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)

  Then run: scrapy crawl doub -o db.csv

  After the crawl finishes, check db.csv.
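
  A quick way to peek at the result without opening the file in a spreadsheet, assuming the export contains the name and author columns yielded by the spider:

import csv

with open('db.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(row['name'], '-', row['author'])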
