Scrapy Basics 2: Walking Through a Simple Spider

Date: 2021-06-05


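import scrapy

from tutorial.items import DmItem


class DmozSpider(scrapy.Spider):
    name = "dm"  # spider name, used as `scrapy crawl dm`
    # allowed_domains is the spider's boundary: only pages under these domains are crawled.
    allowed_domains = ["dmoz.org"]
    # URLs the crawl starts from
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    # parse() is the parsing callback. It is called with the Response object
    # returned for each URL as its only argument. It parses the response,
    # turns the matched data into items, and can follow further URLs.
    def parse(self, response):
        # Select every li under the ul in the third fieldset of the element with id "bd-cross"
        for li in response.xpath('//*[@id="bd-cross"]/fieldset[3]/ul/li'):
            # Create a new DmItem for this li
            item = DmItem()
            # title: text of the a tag inside the li (extract() returns a list of strings)
            item['title'] = li.xpath('a/text()').extract()
            # link: href attribute of the a tag inside the li
            item['link'] = li.xpath('a/@href').extract()
            # desc: text nodes directly inside the li
            item['desc'] = li.xpath('text()').extract()

            yield item  # hand the item back to Scrapy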

Note: a brief list of useful XPath path expressions.