Scrapy example: crawling weather and temperature data

1. Create the project

scrapy startproject weather  # weather is the project name
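Running the command generates the usual Scrapy project skeleton (the exact set of files can vary slightly between Scrapy versions). The files edited in the sections below are items.py, settings.py, pipelines.py, and the spider file we add ourselves under spiders/:

weather/
    scrapy.cfg            # deploy configuration
    weather/
        __init__.py
        items.py          # field definitions (section 3)
        middlewares.py
        pipelines.py      # item pipelines (section 5)
        settings.py       # project settings (section 4)
        spiders/
            __init__.py
            spider.py     # our spider, added by hand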

 


 

2. Determine the crawl target:

The crawl process of a spider built with Scrapy:

Running scrapy crawl spidername starts the program. It automatically builds Requests from start_urls and sends them, then calls the parse callback to parse each response. During parsing, the rules are used to extract matching links from the HTML (or XML) text, and each extracted link generates a new Request. This loop continues until the returned pages contain no more matching links, or the scheduler runs out of Request objects, at which point the program stops.

allowed_domains: as the name suggests, the domains the spider is allowed to crawl; it will only fetch URLs under these domains.

rules: define the crawl rules; the spider will only follow URLs that match a rule.

  A Rule has an allow attribute, whose matching pattern is written as a regular expression. If you are not comfortable with regular expressions, write one and check it with an online tester; after a few attempts simple patterns are easy enough, and the one we need here is not complicated (see the quick check below).

  A Rule has a callback attribute that names the callback function; the spider calls it whenever it finds a URL matching the rule. Note that it must be kept distinct from the default callback parse. (All of the scraped data can be seen in the command line.)

  A Rule has a follow attribute. When it is True, the spider keeps following every matching URL found on each page; otherwise it does not. I set it to False here, because with True the crawl takes a long time: roughly two thousand weather records.
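The URL pattern used in the spider below can be sanity-checked outside Scrapy with Python's re module; this is only a throwaway snippet, and the sample URLs are the ones that appear elsewhere in this post:

import re

# the same pattern passed to LinkExtractor(allow=...) in the spider below
pattern = r'http://www.weather.com.cn/weather1d/101\d{6}.shtml$'

print(bool(re.search(pattern, 'http://www.weather.com.cn/weather1d/101020100.shtml')))  # True: a city weather page matches
print(bool(re.search(pattern, 'http://www.weather.com.cn/forecast/')))                  # False: the forecast index page does not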

import scrapy
from weather.items import WeatherItem
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class Spider(CrawlSpider):
    name = 'weatherSpider'
    #allowed_domains = "www.weather.com.cn"
    start_urls = [
        #"http://www.weather.com.cn/weather1d/101020100.shtml#search"
        "http://www.weather.com.cn/forecast/"
    ]
    rules = (
        #Rule(LinkExtractor(allow=(r'http://www.weather.com.cn/weather1d/101\d{6}.shtml#around2')), follow=False, callback='parse_item'),
        Rule(LinkExtractor(allow=(r'http://www.weather.com.cn/weather1d/101\d{6}.shtml$')), follow=True, callback='parse_item'),
    )
    
    
    # when crawling multiple pages through rules, the callback needs a custom name and must not be called parse
    def parse_item(self, response):
        item = WeatherItem()
        #city = response.xpath("//div[@class='crumbs fl']/a[2]/text()").extract_first()
        item['city'] = response.xpath("//div[@class='crumbs fl']/a[2]/text()").extract_first()  # province or municipality name
        #if city == '>':
        #item['city'] = response.xpath("//div[@class='crumbs fl']/a[last()-1]/text()").extract_first()  # province name for non-municipalities
        #item['city'] = response.xpath("//div[@class ='crumbs fl']/a[2]/text()").extract_first()  # municipality name

        #item['city_addition'] = response.xpath("//div[@class ='crumbs fl']/a[last()]/text()").extract_first()  # municipality name
        #city_addition = response.xpath("//div[@class ='crumbs fl']/a[last()]/text()").extract_first()  # gets the '>' character
        #print("aaaaa"+city)
        #print("nnnnn"+city_addition)
        #if city_addition != city:
            #item['city_addition'] = response.xpath("//div[@class='crumbs fl']/a[2]/text()").extract_first()
        item['city_addition'] = response.xpath("//div[@class ='crumbs fl']/a[last()]/text()").extract_first()  # city name (or the municipality name again)
        #else:
            #item['city_addition'] = ''

        #item['city_addition2'] = response.xpath("//div[@class='crumbs fl']/span[3]/text()").extract_first()


        weatherData = response.xpath("//div[@class='today clearfix']/input[1]/@value").extract_first()  # hidden input whose value string starts with the date
        item['data'] = weatherData[0:6]  # the first six characters are the date
        print("data:" + item['data'])
        item['weather'] = response.xpath("//p[@class='wea']/text()").extract_first()  # weather description
        item['temperatureMax'] = response.xpath("//ul[@class='clearfix']/li[1]/p[@class='tem']/span[1]/text()").extract_first()  # maximum temperature
        item['temperatureMin'] = response.xpath("//ul[@class='clearfix']/li[2]/p[@class='tem']/span[1]/text()").extract_first()  # minimum temperature
        yield item
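Save this code in the project's spiders package, for example as weather/weather/spiders/spider.py; the file name itself does not matter, because Scrapy discovers the spider through its name attribute (weatherSpider), which is also the name used with scrapy crawl.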


spider.py is, as the name suggests, the spider file itself.

Before filling in spider.py, let's first look at how to find the information we need.

The command-line window from earlier should still be open; it does not matter if you closed it.

Press Win+R to open cmd again and type: scrapy shell http://www.weather.com.cn/weather1d/101020100.shtml#search   (the URL is the page you want to crawl)

This is Scrapy's shell command. It lets you inspect and debug a site's response without starting the spider, and is mainly used to test the XPath expressions that extract the elements you want.
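For example, once the shell has loaded the response, you can try the same XPath expressions that the spider uses; the values hinted at in the comments are only indicative and depend on the page you load:

# inside the scrapy shell session
response.url                                                                      # the page that was fetched
response.xpath("//div[@class='crumbs fl']/a[2]/text()").extract_first()          # breadcrumb: province or municipality
response.xpath("//div[@class='crumbs fl']/a[last()]/text()").extract_first()     # breadcrumb: city name
response.xpath("//p[@class='wea']/text()").extract_first()                       # weather description
response.xpath("//ul[@class='clearfix']/li[1]/p[@class='tem']/span[1]/text()").extract_first()  # maximum temperature
exit()  # leave the shell when finished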

 

 

3. Fill in items.py

items.py only holds the fields you want to extract:

Give each piece of information you want to collect a name:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()
    city_addition = scrapy.Field()
    city_addition2 = scrapy.Field()
    weather = scrapy.Field()
    data = scrapy.Field()
    temperatureMax = scrapy.Field()
    temperatureMin = scrapy.Field()
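An item defined this way behaves like a dictionary restricted to the declared fields, which is an easy way to check the definition. The snippet below is a throwaway check, not part of the project:

from weather.items import WeatherItem

item = WeatherItem()
item['city'] = '江蘇'              # declared fields are read and written like dict keys
print(item['city'], dict(item))    # an item converts cleanly to a plain dict
# item['humidity'] = '80%'         # would raise KeyError: only declared fields are allowed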

 

The pipeline classes are written in section 5 below; for Scrapy to actually run a pipeline, it also has to be enabled in the settings.py file:

4. Fill in settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'weather'

SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'weather.middlewares.WeatherSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'weather.middlewares.WeatherDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'weather.pipelines.TxtPipeline': 600,
    #'weather.pipelines.JsonPipeline': 6,
    #'weather.pipelines.ExcelPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

 

5. Fill in pipelines.py

To actually save the scraped data, however, we still need to write pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import codecs
import json
import csv
from scrapy.exporters import JsonItemExporter
from openpyxl import Workbook

base_dir = os.getcwd()
filename = base_dir + '\\' + 'weather.txt'
with open(filename, 'w+') as f:  # open the output file once at import time
    f.truncate()                 # and clear any contents from a previous run


class JsonPipeline(object):
    # save the data with Scrapy's JsonItemExporter
    def __init__(self):
        self.file = open('weather1.json','wb')
        self.exporter = JsonItemExporter(self.file,ensure_ascii =False)
        self.exporter.start_exporting()

    def process_item(self,item,spider):
        print('Write')
        self.exporter.export_item(item)
        return item

    def close_spider(self,spider):
        print('Close')
        self.exporter.finish_exporting()
        self.file.close()

        
class TxtPipeline(object):
    def process_item(self, item, spider):
        # get the current working directory
        #base_dir = os.getcwd()
        #filename = base_dir + 'weather.txt'
        #print('建立Txt')
        print("city:" + item['city'])
        print("city_addition:" + item['city_addition'])
        # open the file in append mode and write the corresponding fields
        with open(filename, 'a') as f:  # append
            if item['city'] != item['city_addition']:
                f.write('城市:' + item['city'] + '>')
                f.write(item['city_addition'] + '\n')
            else:
                f.write('城市:' + item['city'] + '\n')
                #f.write(item['city_addition'] + '\n')
            f.write('日期:' + item['data'] + '\n')
            f.write('天氣:' + item['weather'] + '\n')
            f.write('溫度:' + item['temperatureMin'] + '~' + item['temperatureMax'] + '℃\n')
        return item  # hand the item on to any later pipeline


class ExcelPipeline(object):
    # create the Excel workbook and fill in the header
    def __init__(self):
        self.wb = Workbook()
        self.ws = self.wb.active
        # set the header row
        self.ws.append(['', '', '縣(鄉)', '日期', '天氣', '最高溫', '最低溫'])
    
    def process_item(self, item, spider):
        line = [item['city'], item['city_addition'], item['city_addition2'], item['data'], item['weather'], item['temperatureMax'], item['temperatureMin']]
        self.ws.append(line)  # append the data as a row to the xlsx file
        self.wb.save('weather.xlsx')
        return item
    '''def process_item(self, item, spider):
        base_dir = os.getcwd()
        filename = base_dir + '\\' + 'weather.csv'
        print('建立EXCEL')
        with open(filename, 'w') as f:
            fieldnames = ['省', '市', '縣(鄉)', '天氣', '日期', '最高溫', '最低溫']  # define the field names
            writer = csv.DictWriter(f, fieldnames=fieldnames)  # build a DictWriter around the file
            writer.writeheader()  # write the header row
            # write the item as a dict
            writer.writerow(dict(item))
    '''
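With the spider, items, settings, and pipelines in place, the crawl is started from the project directory using the spider's name attribute (weatherSpider in spider.py above):

cd weather
scrapy crawl weatherSpider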

Crawler output:

 

 

 

Determining the crawl target:

The China Weather Network (weather.com.cn) is used as the data source here. Before crawling any page, always analyse it first: decide which pieces of information you need and how to extract them most conveniently. Only part of the page source is shown here:

<div class="ctop clearfix">
            <div class="crumbs fl">
                <a href="http://js.weather.com.cn" target="_blank">江蘇</a>
                <span>></span>
                <a href="http://www.weather.com.cn/weather/101190801.shtml" target="_blank">徐州</a><span>></span>  <span>鼓樓</span>
            </div>
            <div class="time fr"></div>
        </div>

 

 

If the page is not for a municipality, the province name is obtained with:

 //div[@class='crumbs fl']/a[last()-1]/text()

To select the last book element with XPath:

book[last()]

To select the second-to-last book element:

book[last()-1]
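As a small self-contained check of how these breadcrumb XPaths behave, the snippet below runs them against the HTML fragment shown above, using Scrapy's Selector instead of a live response:

from scrapy.selector import Selector

html = '''
<div class="crumbs fl">
    <a href="http://js.weather.com.cn" target="_blank">江蘇</a>
    <span>></span>
    <a href="http://www.weather.com.cn/weather/101190801.shtml" target="_blank">徐州</a><span>></span>  <span>鼓樓</span>
</div>
'''

sel = Selector(text=html)
print(sel.xpath("//div[@class='crumbs fl']/a[2]/text()").extract_first())         # 徐州  (the second link)
print(sel.xpath("//div[@class='crumbs fl']/a[last()]/text()").extract_first())    # 徐州  (last link: the city)
print(sel.xpath("//div[@class='crumbs fl']/a[last()-1]/text()").extract_first())  # 江蘇  (second to last: the province)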
