Scrapy Basics 3: Scraping a Chinese-Language Website (Imitation Exercise)

Date: 2019-12-05

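Imitation exercise, written independently: scraping a single page
Target: the titles, links, and descriptions in the left-hand column of the Lianhe Zaobao site (zaobao.com)
1. items.py: define the fields to scrape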

import scrapy


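class LianhezaobaoItem(scrapy.Item):  # name must match the import in the spider
    title = scrapy.Field()  # headline text
    link = scrapy.Field()   # article URL
    desc = scrapy.Field()   # short description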

2. Write the spider file

# -*- coding: utf-8 -*-
import scrapy
from LianHeZaoBao.items import LianhezaobaoItem
# The Python 2 hack reload(sys).setdefaultencoding('utf-8') is dropped here:
# it does not exist in Python 3, where str is already Unicode.

class MaimaiSpider(scrapy.Spider):
    name = "lianhe"
    allowed_domains = ["zaobao.com"]  # domain names only, not full URLs
    start_urls = (
        'http://www.zaobao.com/news/china//',
    )

    def parse(self, response):
        # Each <li> under the element with id="l_title" is one news entry
        for li in response.xpath('//*[@id="l_title"]/ul/li'):
            item = LianhezaobaoItem()
            # extract() returns a list of all matching strings
            item['title'] = li.xpath('a[1]/p/text()').extract()
            item['link'] = li.xpath('a[1]/@href').extract()
            item['desc'] = li.xpath('a[2]/p/text()').extract()
            yield item
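
To see what the three relative XPath expressions in parse pick out, here is a minimal sketch that runs the same paths against a small made-up fragment using only the Python standard library. The markup in SAMPLE is hypothetical (the real page structure is not shown in this post and may differ), and extract_rows is an illustrative helper, not part of Scrapy. Unlike Scrapy's extract(), which returns a list of strings, this sketch returns a single string per field.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page may be structured differently.
SAMPLE = """
<html><body>
  <div id="l_title"><ul>
    <li>
      <a href="/news/china/story-1"><p>Headline one</p></a>
      <a href="/news/china/story-1"><p>Summary one</p></a>
    </li>
    <li>
      <a href="/news/china/story-2"><p>Headline two</p></a>
      <a href="/news/china/story-2"><p>Summary two</p></a>
    </li>
  </ul></div>
</body></html>
"""


def extract_rows(markup):
    """Return one dict per <li>, mirroring the fields set in the spider."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Same outer path as the spider: every <li> in the list under id="l_title"
    for li in root.findall(".//*[@id='l_title']/ul/li"):
        rows.append({
            "title": li.findtext("a[1]/p"),        # text of <p> in the first <a>
            "link": li.find("a[1]").get("href"),   # href of the first <a>
            "desc": li.findtext("a[2]/p"),         # text of <p> in the second <a>
        })
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

ElementTree only supports a subset of XPath and needs well-formed XML, so this is a way to reason about the selectors, not a replacement for Scrapy's selector on real HTML.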

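3. Save the output: run the command scrapy crawl lianhe -o lianhe.csv
Note: if Excel shows garbled characters when opening the file, open it in Notepad and re-save it with ANSI encoding; Excel then displays the Chinese text correctly.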
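
As an alternative to the Notepad step (this is not the method used in this post), the CSV can be rewritten with a UTF-8 byte order mark, which Excel generally uses to recognise the file as UTF-8. A minimal sketch, assuming the file Scrapy wrote is UTF-8; resave_with_bom and the file name lianhe_excel.csv are hypothetical names chosen for illustration, and the demo writes a small stand-in file rather than real crawl output.

```python
import os
import tempfile


def resave_with_bom(src_path, dst_path):
    """Copy a UTF-8 text file, writing it back out with a UTF-8 BOM."""
    # "utf-8-sig" on read strips a BOM if one is already present
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()
    # "utf-8-sig" on write prepends the BOM bytes EF BB BF
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(text)


# Demo on a stand-in file (assumed content, not real crawl output)
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "lianhe.csv")
dst = os.path.join(tmp_dir, "lianhe_excel.csv")
with open(src, "w", encoding="utf-8", newline="") as f:
    f.write("title,link,desc\r\n中国新闻,/news/china/story-1,摘要\r\n")

resave_with_bom(src, dst)

with open(dst, "rb") as f:
    head = f.read(3)
print(head)
```

Unlike an ANSI re-save, this keeps the file in UTF-8, so characters outside the local Windows code page are not lost.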
4. Finished result: