Crawler 7: Scrapy - Crawling Web Pages

Date 2019-11-10


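Building a crawler with Scrapy takes four steps:

Create a project (Project): set up a new crawler project
Define the target (Items): decide exactly what you want to scrape
Build the spider (Spider): write the spider and start crawling pages
Store the content (Pipeline): design a pipeline to store what was scraped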

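The previous chapter created the project; this one uses that project to crawl web pages.

Many online tutorials use dmoz.org as the test site, so I ran my experiment on it as well.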

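Define the target

In Scrapy, items are the containers that hold the scraped content.

The content we want is:

Name (name)
Link (url)
Description (description)

The tutorial directory contains an items.py file; add our code after the default code in it.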

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field
import scrapy


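class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass  # a class body needs at least one statement


# The class below is what I added myself
class DmozItem(Item):
    title = Field()  # name
    link = Field()   # url
    desc = Field()   # description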
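
Scrapy items are used much like dictionaries that only accept their declared fields. The sketch below imitates that behaviour in plain Python so it runs without Scrapy installed. It is only a stand-in, not Scrapy's Item class: FixedFieldItem and DmozItemSketch are hypothetical names, and the sample values are made up.

```python
# Plain-Python stand-in for illustration only; this is NOT Scrapy's Item class.
class FixedFieldItem(dict):
    """Dict that only accepts the keys listed in `fields` (hypothetical helper)."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(
                "{} does not support field: {}".format(type(self).__name__, key)
            )
        super().__setitem__(key, value)


class DmozItemSketch(FixedFieldItem):
    # Same three field names as DmozItem in items.py.
    fields = ("title", "link", "desc")


item = DmozItemSketch()
item["title"] = "Example book title"       # made-up sample value
item["link"] = "http://example.com/book"   # made-up sample value
item["desc"] = "A short description."      # made-up sample value

try:
    item["author"] = "nobody"  # "author" was never declared as a field
except KeyError as exc:
    rejected = str(exc)
```

Declaring the fields up front means a mistyped field name fails immediately instead of silently creating a new key.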

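Build the spider

A crawler follows the usual routine: crawl first, then extract. In other words, fetch the whole page first, then pull out the parts you need.

Create a Python file in the tutorial\spiders directory and name it dmoz_spider.py.

The code so far is as follows: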
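
The "crawl first, then extract" idea can be sketched with the standard library alone. This is not how Scrapy does it internally. The page here is a made-up HTML string standing in for a downloaded response body, and AnchorCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Made-up sample page standing in for a downloaded response body.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li><a href="http://example.com/a">Book A</a> - first description</li>
    <li><a href="http://example.com/b">Book B</a> - second description</li>
  </ul>
</body></html>
"""


class AnchorCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Step 1 ("crawl"): the whole page is already available as SAMPLE_HTML.
# Step 2 ("extract"): keep only the parts we care about, here the links.
collector = AnchorCollector()
collector.feed(SAMPLE_HTML)
links = collector.links
```

Keeping the two steps separate lets the extraction logic be tested on saved pages without touching the network.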

from scrapy.spiders import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Use the second-to-last URL segment as the file name.
        filename = response.url.split("/")[-2]
        # Save the raw page body; the with block closes the file afterwards.
        with open(filename, "wb") as f:
            f.write(response.body)
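
The parse method above names each saved file after the second-to-last piece of the URL. The standalone sketch below applies the same expression to the two start URLs. filename_from_url is a hypothetical helper name used only for this illustration.

```python
def filename_from_url(url):
    """Return the second-to-last "/"-separated piece of the URL."""
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

# Because each URL ends with "/", the last piece is an empty string,
# so index -2 is the final directory name.
filenames = [filename_from_url(u) for u in start_urls]
print(filenames)  # ['Books', 'Resources']
```

This relies on the trailing slash: for a URL ending in ".../Python/Books" the same expression would return "Python" instead.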