Just getting started with Scrapy.
Working through the exercises in the book,
the task is to scrape the book information from the site the book links to.
import scrapy

class FuzxSpider(scrapy.Spider):
    name = "Fuzx"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book on the listing page sits in an <article class="product_pod">.
        for book in response.css('article.product_pod'):
            name = book.xpath('./h3/a/@title').extract_first()
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }
        # Follow the "next" pagination link until there are no pages left.
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
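Before running the full spider, the selectors can be sanity-checked interactively. A quick sketch using scrapy shell (my own aside, not from the book):

scrapy shell 'http://books.toscrape.com/'
>>> book = response.css('article.product_pod')[0]
>>> book.xpath('./h3/a/@title').extract_first()      # should print the first book's title
>>> book.css('p.price_color::text').extract_first()  # should print its price string (with the £ sign)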
Run it with the command: scrapy crawl Fuzx -o fuzx.csv
which saves the scraped items to fuzx.csv.
Looks like the crawl succeeded.
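A side note (my assumption, not something the book covers): on Scrapy 2.1 and later, the same CSV export can be configured once in settings.py instead of passing -o on every run:

# settings.py — equivalent of `scrapy crawl Fuzx -o fuzx.csv`
# (assumes Scrapy >= 2.1, which introduced the FEEDS setting)
FEEDS = {
    'fuzx.csv': {'format': 'csv'},
}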
To take it a bit further: people online say Douban is a relatively easy site for practicing crawlers.
I opened the fiction section of Douban Read: https://read.douban.com/kind/100
The goal is to scrape each book's title and author,
so I changed the CSS selectors accordingly.
The first request came back with a 403 error,
so I added a User-Agent (presumably Douban rejects Scrapy's default user agent).
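Rather than attaching headers to every Request by hand, the user agent can also be set once per spider. A minimal sketch of that alternative (not what I ended up doing below, just another option):

import scrapy

class DoubSpider(scrapy.Spider):
    name = "doub"
    # custom_settings overrides the project settings for this spider only,
    # so every request it sends carries the browser User-Agent.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'),
    }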
The second place things broke was here, because the class attribute contains a space (the <li> carries two classes, item and store-item):
for book in response.css('li.item.store-item'):
Following someone else's article, you can write it as shown above: chaining .item.store-item matches an element whose class attribute is "item store-item".
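To see why the chained form matters, here is a tiny self-contained check (my own illustration with made-up HTML): li.item.store-item requires both classes on the same element, whereas keeping the space would turn it into a descendant selector and match nothing.

from scrapy import Selector

html = '<ul><li class="item store-item">both</li><li class="item">only one</li></ul>'
sel = Selector(text=html)

# Chained classes: the element must carry BOTH "item" and "store-item".
print(sel.css('li.item.store-item::text').extract())  # ['both']

# With a space it becomes a descendant selector looking for a
# <store-item> tag inside li.item — which matches nothing here.
print(sel.css('li.item store-item::text').extract())  # []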
Full code:
import scrapy

class DoubSpider(scrapy.Spider):
    name = "doub"
    # Browser User-Agent to avoid the 403 that the default one triggers.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
    }

    def start_requests(self):
        url = 'https://read.douban.com/kind/100'
        yield scrapy.Request(url, headers=self.headers)

    def parse(self, response):
        # Each book is an <li class="item store-item"> — note the chained class selector.
        for book in response.css('li.item.store-item'):
            name = book.css('div.title a::text').extract_first()
            author = book.css('p span.labeled-text a.author-item::text').extract_first()
            yield {
                'name': name,
                'author': author,
            }
        # Keep paginating, passing the headers along with every request.
        next_url = response.css('div.pagination li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)
Then run: scrapy crawl doub -o db.csv
and check db.csv once the crawl finishes.
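To eyeball the result without opening a spreadsheet, a quick sketch (assuming the crawl wrote name and author columns to db.csv, as the spider above yields):

import csv

# Print the first few exported rows.
with open('db.csv', newline='', encoding='utf-8') as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row['name'], '-', row['author'])
        if i >= 4:
            break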