Python Web Crawling with Scrapy: Debugging and Scraping Web Pages

Date: 2019-11-18

Tags: python, web crawler, scrapy, debugging, web pages. Category: Python


Shell debugging:

Change into the project directory and run scrapy shell "<URL>".

For example:

scrapy shell http://www.w3school.com.cn/xml/xml_syntax.asp

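Once the shell starts, you can call code interactively in the terminal against the downloaded response and compare the results with the relevant page source.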

Now let's use Scrapy to crawl a real site, taking the Xunsee (迅讀) site as an example.

From its home page, I want to extract the list of articles and the corresponding author names.

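First, define title and author in items.py. Test1Item plays a role similar to a model in Django: think of it as a container, one that inherits from scrapy.Item.

In the Scrapy version used here, Item in turn inherits from DictItem, so Test1Item behaves like a dictionary, with title and author as its two keys.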

# For reference, Scrapy itself declares: class Item(DictItem)

import scrapy


class Test1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
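To make the "Item behaves like a dictionary" point concrete without needing Scrapy installed, here is a minimal sketch. FieldRestrictedItem is a hypothetical stand-in written for this illustration, not Scrapy's actual class; it only mimics the one behaviour that matters here, namely that keys must be declared up front and undeclared keys are rejected with a KeyError.

```python
class FieldRestrictedItem(dict):
    """Hypothetical stand-in for scrapy.Item: a dict that only accepts declared keys."""

    # Assumed field names, mirroring title and author from Test1Item.
    fields = ("title", "author")

    def __setitem__(self, key, value):
        # Reject any key that was not declared as a field.
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


item = FieldRestrictedItem()
item["title"] = "Example title"
item["author"] = "Example author"

# Converting to a plain dict is what the pipeline later relies on.
as_dict = dict(item)

# Assigning an undeclared key fails instead of silently adding it.
try:
    item["price"] = 10
    rejected = False
except KeyError:
    rejected = True
```

With a real Test1Item the usage looks the same: assign item['title'] and item['author'], then call dict(item) when a plain dictionary is needed.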

Next, write the page-parsing code in test_spider.py.

from scrapy.spiders import Spider
from scrapy.selector import Selector

from test1.items import Test1Item


class testSpider(Spider):
    # The spider name is what you pass to `scrapy crawl`; a mismatch there reports that the spider cannot be found.
    name = "test1"
    # allowed_domains takes bare domain names, not full URLs.
    allowed_domains = ["www.xunsee.com"]
    start_urls = ["http://www.xunsee.com/"]

    def parse(self, response):
        items = []
        sel = Selector(response)
        # Entry point for all the data: each child div holds one article title and its author.
        sites = sel.xpath('//*[@id="content_1"]/div')
        for site in sites:
            item = Test1Item()
            title = site.xpath('span[@class="title"]/a/text()').extract()
            # Link to the article; extracted here but not stored in the item.
            h = site.xpath('span[@class="title"]/a/@href').extract()
            author = site.xpath('span[@class="author"]/a/text()').extract()
            # Under Python 3, extract() already returns str, so no .encode('utf-8') is needed.
            item['title'] = title
            item['author'] = author
            items.append(item)
        return items

After the title and author are extracted, they are stored in an item, and every item is then appended to the items list.
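The extraction loop described above can be tried offline as a rough sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector (it supports only a subset of XPath), and the SAMPLE markup, titles, and author names are invented for illustration; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Invented sample markup, shaped like the structure the spider's XPath expects.
SAMPLE = """
<html><body>
<div id="content_1">
  <div>
    <span class="title"><a href="/a1.html">First article</a></span>
    <span class="author"><a href="/u1.html">Alice</a></span>
  </div>
  <div>
    <span class="title"><a href="/a2.html">Second article</a></span>
    <span class="author"><a href="/u2.html">Bob</a></span>
  </div>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Entry point: every div directly under the element with id="content_1".
rows = root.findall(".//*[@id='content_1']/div")

items = []
for row in rows:
    # Relative paths, evaluated against the current row only.
    title_link = row.find("span[@class='title']/a")
    author_link = row.find("span[@class='author']/a")
    items.append({
        "title": title_link.text,
        "href": title_link.get("href"),
        "author": author_link.text,
    })
```

The key design point carries over to the spider: select the repeating container first, then run relative XPath expressions inside each one so that a title is always paired with the author from the same row.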

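In pipelines.py, modify Test1Pipeline as follows. This class processes the items returned by testSpider; it is where the data gets stored. Here we write the items to a JSON file. Note that the pipeline only runs once it is enabled under ITEM_PIPELINES in settings.py.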

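import json


class Test1Pipeline(object):
    def __init__(self):
        # Open the output file once, as UTF-8 text.
        self.file = open('xundu.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes,
        # replacing the Python 2 decode("unicode_escape") step.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls this when the spider finishes; close the file so the output is flushed.
        self.file.close()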
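The file-writing step can be checked on its own, outside Scrapy. The sketch below assumes Python 3; write_json_lines is a hypothetical helper name and the sample item is invented. It shows how json.dumps with ensure_ascii=False writes one readable JSON object per line.

```python
import json
import os
import tempfile


def write_json_lines(items, path):
    """Write each item as one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        for item in items:
            fh.write(json.dumps(item, ensure_ascii=False) + "\n")


# Invented sample data standing in for scraped items.
sample_items = [{"title": "迅讀", "author": "作者"}]

# Write to a temporary directory so the sketch leaves the project untouched.
out_path = os.path.join(tempfile.mkdtemp(), "xundu.json")
write_json_lines(sample_items, out_path)

with open(out_path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
```

Without ensure_ascii=False, the same item would be written with \uXXXX escapes, which is still valid JSON but hard to read in a text editor.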