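An original exercise modeled on a tutorial: crawling a single page
Target site: Lianhe Zaobao (zaobao.com). Scrape the title, link, and description of each entry in the left-hand column.
1. items.py: define the fields to scrape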
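import scrapy


class LianhezaobaoItem(scrapy.Item):
    # The class name must match the one imported by the spider
    title = scrapy.Field()  # headline text
    link = scrapy.Field()   # URL of the article
    desc = scrapy.Field()   # short description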
2. Write the spider file
# -*- coding: utf-8 -*-
import sys

import scrapy

from LianHeZaoBao.items import LianhezaobaoItem

# Python 2 only: force UTF-8 as the default encoding
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf-8')


class MaimaiSpider(scrapy.Spider):
    name = "lianhe"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["zaobao.com"]
    start_urls = (
        'http://www.zaobao.com/news/china//',
    )

    def parse(self, response):
        # Each <li> under the element with id "l_title" is one entry
        for li in response.xpath('//*[@id="l_title"]/ul/li'):
            item = LianhezaobaoItem()
            item['title'] = li.xpath('a[1]/p/text()').extract()
            item['link'] = li.xpath('a[1]/@href').extract()
            item['desc'] = li.xpath('a[2]/p/text()').extract()
            yield item
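To see what the three XPath expressions in parse() pick out, here is a minimal sketch that runs the same paths against a made-up HTML fragment. The markup in SAMPLE_HTML and the helper name extract_items are assumptions for illustration only; the real page structure may differ. The standard library's ElementTree stands in for Scrapy's selectors here, so it returns plain strings rather than the lists that extract() produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, inferred only from the XPaths used in the spider.
SAMPLE_HTML = """
<body>
  <div id="l_title">
    <ul>
      <li>
        <a href="/news/china/story-1"><p>Headline one</p></a>
        <a href="/news/china/story-1"><p>Summary one</p></a>
      </li>
      <li>
        <a href="/news/china/story-2"><p>Headline two</p></a>
        <a href="/news/china/story-2"><p>Summary two</p></a>
      </li>
    </ul>
  </div>
</body>
"""


def extract_items(markup):
    """Return one dict per <li>, mirroring the spider's three fields."""
    root = ET.fromstring(markup)
    items = []
    # Same path as in parse(): every <li> under the element with id "l_title"
    for li in root.findall('.//*[@id="l_title"]/ul/li'):
        first_link = li.find('a[1]')       # first <a>: title and href
        desc_para = li.find('a[2]/p')      # second <a>: description
        items.append({
            'title': first_link.find('p').text,
            'link': first_link.get('href'),
            'desc': desc_para.text,
        })
    return items


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_HTML):
        print(entry)
```

ElementTree only accepts well-formed markup, so this is a way to reason about the paths, not a replacement for running the spider against the live page.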
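3. Save the output: run the command scrapy crawl lianhe -o lianhe.csv
Note: Excel shows garbled characters when it opens the file. Re-saving the file in ANSI encoding with Notepad fixes this, and Excel then displays the Chinese text correctly.
4. Finished result: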
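The Notepad step can also be scripted. The sketch below uses a different technique from the ANSI conversion described above: instead of changing the encoding, it prepends a UTF-8 byte order mark, which Excel generally uses to recognize a file as UTF-8. It assumes the exported CSV is UTF-8; the function name add_utf8_bom and the output filename are made up for illustration.

```python
import codecs
from pathlib import Path


def add_utf8_bom(src, dst):
    """Copy a UTF-8 file to dst, adding a byte order mark if it is missing."""
    data = Path(src).read_bytes()
    # Fail early if the input is not valid UTF-8 (assumption: the export is UTF-8)
    data.decode("utf-8")
    if not data.startswith(codecs.BOM_UTF8):
        data = codecs.BOM_UTF8 + data
    Path(dst).write_bytes(data)


if __name__ == "__main__":
    # Hypothetical filenames: the input is the file produced by the crawl command
    add_utf8_bom("lianhe.csv", "lianhe_excel.csv")
```

Working on raw bytes leaves line endings and quoted fields untouched, and running the function twice does not add a second mark.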