Pandas Tutorial Series (6): Time Series Explained

Date: 2019-12-09

Contents

1. Overview
2. Converting to timestamps
3. Generating timestamp ranges
4. DatetimeIndex
5. DateOffset objects
6. Time series methods
6.1 Shifting
6.2 Frequency conversion
6.3 Resampling

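When working with time series, we often need to do tasks such as the following:

Generate sequences of dates and time spans at a fixed frequency
Conform or convert a time series to a particular frequency
Compute "relative" dates from non-standard time increments (for example, 5 business days before the last business day of the year), or "roll" dates forward or backward

Pandas makes all of these tasks easy.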

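1. Overview

The table below lists the commonly used date and time classes in Pandas and how to create them.
Class            Description               How to create
Timestamp        A single point in time    to_datetime, Timestamp
DatetimeIndex    Index of Timestamp        to_datetime, date_range, DatetimeIndex
Period           A span of time            Period
PeriodIndex      Index of Period           period_range, PeriodIndex
The most common time series type in Pandas is the timestamp (Timestamp). There are several ways to create one, so let's look at them in turn.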

Basic methods

import pandas as pd

pd.Timestamp(2018,5,21)
Out[12]: Timestamp('2018-05-21 00:00:00')
pd.Timestamp('2018-5-21')
Out[13]: Timestamp('2018-05-21 00:00:00')
# Besides the timestamp, another common structure is the time span (Period).
pd.Period("2018-01")
Out[14]: Period('2018-01', 'M')
pd.Period("2018-05", freq="D")
Out[15]: Period('2018-05-01', 'D')
# When used as an index, Timestamp and Period values are automatically coerced to DatetimeIndex and PeriodIndex.
dates = [pd.Timestamp("2018-05-01"), pd.Timestamp("2018-05-02"), pd.Timestamp("2018-05-03"), pd.Timestamp("2018-05-04")]
ts = pd.Series(data=["Tom", "Bob", "Mary", "James"], index=dates)
ts.index
Out[16]: DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04'], dtype='datetime64[ns]', freq=None)
periods = [pd.Period("2018-01"), pd.Period("2018-02"), pd.Period("2018-03"), pd.Period("2018-4")]
ts = pd.Series(data=["Tom", "Bob", "Mary", "James"], index=periods)
ts.index
Out[17]: PeriodIndex(['2018-01', '2018-02', '2018-03', '2018-04'], dtype='period[M]', freq='M')
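
To make the difference between the two structures concrete, here is a small sketch (variable names such as `stamp` and `months` are only illustrative). It shows that a Period covers a whole span with a start and an end, while a Timestamp is a single instant, and that period_range from the table above builds a PeriodIndex directly.

```python
import pandas as pd

# A Timestamp is a single instant; a Period covers a whole span.
stamp = pd.Timestamp("2018-05-21")
month = pd.Period("2018-05", freq="M")

# A Period knows where its span starts and ends.
span_start = month.start_time   # first instant of May 2018
span_end = month.end_time       # last instant of May 2018

# The timestamp falls inside the period.
inside = span_start <= stamp <= span_end

# period_range builds a PeriodIndex directly.
months = pd.period_range("2018-01", periods=4, freq="M")
print(inside, list(months.astype(str)))
```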

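2. Converting to timestamps

Since we deal with text data (strings) all the time, you may wonder whether text can be converted to timestamps quickly.
It can: to_datetime converts strings to timestamps. Passed a Series, it returns a Series (with the same index); passed a list-like, it returns a DatetimeIndex.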

pd.to_datetime(pd.Series(["Jul 31, 2018", "2018-05-10", None]))
Out[18]: 
0   2018-07-31
1   2018-05-10
2          NaT
dtype: datetime64[ns]
pd.to_datetime(["2005/11/23", "2010.12.31"])
Out[19]: DatetimeIndex(['2005-11-23', '2010-12-31'], dtype='datetime64[ns]', freq=None)
# Besides text, to_datetime can also convert Unix epoch times to timestamps.
pd.to_datetime([1349720105, 1349806505, 1349892905], unit="s")
Out[20]: 
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
               '2012-10-10 18:15:05'],
              dtype='datetime64[ns]', freq=None)
pd.to_datetime([1349720105100, 1349720105200, 1349720105300], unit="ms")
Out[21]: 
DatetimeIndex(['2012-10-08 18:15:05.100000', '2012-10-08 18:15:05.200000',
               '2012-10-08 18:15:05.300000'],
              dtype='datetime64[ns]', freq=None)
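
Real text data is rarely clean, so it may help to see how to_datetime behaves with an explicit format and a bad entry. The sketch below uses made-up input (`raw` is not from the article) and the `format` and `errors` parameters of to_datetime.

```python
import pandas as pd

raw = pd.Series(["2018-05-10", "2018-07-31", "not a date"])

# An explicit format avoids guessing; errors="coerce" turns
# unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that could not be parsed
```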

3. Generating timestamp ranges

Sometimes we want to generate timestamps over a range. For example, how do we generate 8 daily timestamps starting on "2018-6-26"? The date_range and bdate_range functions do this.

pd.date_range("2018-6-26", periods=8)
Out[22]: 
DatetimeIndex(['2018-06-26', '2018-06-27', '2018-06-28', '2018-06-29',
               '2018-06-30', '2018-07-01', '2018-07-02', '2018-07-03'],
              dtype='datetime64[ns]', freq='D')
pd.bdate_range("2018-6-26", periods=8)
Out[23]: 
DatetimeIndex(['2018-06-26', '2018-06-27', '2018-06-28', '2018-06-29',
               '2018-07-02', '2018-07-03', '2018-07-04', '2018-07-05'],
              dtype='datetime64[ns]', freq='B')
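# As shown, date_range defaults to calendar-day frequency, while bdate_range defaults to business-day frequency. We can also specify the frequency ourselves, for example to generate a weekly range.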
pd.date_range("2018-6-26", periods=8, freq="W")
Out[24]: 
DatetimeIndex(['2018-07-01', '2018-07-08', '2018-07-15', '2018-07-22',
               '2018-07-29', '2018-08-05', '2018-08-12', '2018-08-19'],
              dtype='datetime64[ns]', freq='W-SUN')
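
The examples above fix the number of periods. As a small sketch of the other common form, the range can also be bounded by a start and an end date. The window chosen here is arbitrary.

```python
import pandas as pd

# Give both endpoints instead of periods; every calendar day in between is generated.
days = pd.date_range(start="2018-06-26", end="2018-07-03")

# The same window restricted to business days (Monday through Friday).
business_days = pd.date_range(start="2018-06-26", end="2018-07-03", freq="B")

print(len(days), len(business_days))
```

The weekend of June 30 and July 1 falls inside the window, which is why the business-day range is two entries shorter.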

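4. DatetimeIndex

One of the main uses of DatetimeIndex is as the index of Pandas objects. Besides all the basic functionality of a regular index, it offers advanced time series methods that simplify frequency handling.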

rng = pd.date_range("2018-6-24", periods=4, freq="W")
ts = pd.Series(range(len(rng)), index=rng)
ts
Out[25]: 
2018-06-24    0
2018-07-01    1
2018-07-08    2
2018-07-15    3
Freq: W-SUN, dtype: int64
# Access a value by date
ts["2018-07-08"]
Out[26]: 2
# Slice by a date range
ts["2018-07-08": "2018-07-22"]
Out[27]: 
2018-07-08    2
2018-07-15    3
Freq: W-SUN, dtype: int64
# Select by year
ts["2018"]
Out[28]: 
2018-06-24    0
2018-07-01    1
2018-07-08    2
2018-07-15    3
Freq: W-SUN, dtype: int64
# Select by year and month
ts["2018-07"]
Out[29]: 
2018-07-01    1
2018-07-08    2
2018-07-15    3
Freq: W-SUN, dtype: int64
# Besides strings, a DatetimeIndex can also be indexed with datetime objects.
from datetime import datetime
ts[datetime(2018, 7, 8) : datetime(2018, 7, 22)]
Out[30]: 
2018-07-08    2
2018-07-15    3
Freq: W-SUN, dtype: int64
# Year
ts.index.year
Out[31]: Int64Index([2018, 2018, 2018, 2018], dtype='int64')
# Day of the week (Monday is 0, Sunday is 6)
ts.index.dayofweek
Out[32]: Int64Index([6, 6, 6, 6], dtype='int64')
# Week of the year (weekofyear was removed in pandas 2.0; use ts.index.isocalendar().week there)
ts.index.weekofyear
Out[33]: Int64Index([25, 26, 27, 28], dtype='int64')
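
For readers on a recent pandas, here is a sketch of the same lookups written with `.loc` and `isocalendar()` (available since pandas 1.1). It rebuilds the same `ts` as above so it can run on its own.

```python
import pandas as pd

rng = pd.date_range("2018-6-24", periods=4, freq="W")
ts = pd.Series(range(len(rng)), index=rng)

# .loc makes label-based selection explicit; a "YYYY-MM" string selects the whole month.
july = ts.loc["2018-07"]

# isocalendar() returns a DataFrame with year, week and day columns.
iso_weeks = ts.index.isocalendar().week

print(july.tolist(), iso_weeks.tolist())
```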

5. DateOffset objects

As the name suggests, DateOffset shifts dates. Its parameters are essentially the same as those of dateutil.relativedelta. It works like this:

from pandas.tseries.offsets import DateOffset, Week, Day
d = pd.Timestamp("2018-06-25")
d + DateOffset(weeks=2, days=5)
Out[34]: Timestamp('2018-07-14 00:00:00')
# The same result can be obtained with offset instances.
d + Week(2) + Day(5)
Out[35]: Timestamp('2018-07-14 00:00:00')
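
The introduction mentioned computing a date such as "5 business days before the last business day of the year". One possible way to do that with offset objects is sketched below, using the BYearEnd and BDay offsets. Note that BDay only skips weekends, so public holidays are not taken into account.

```python
import pandas as pd
from pandas.tseries.offsets import BDay, BYearEnd

d = pd.Timestamp("2018-06-25")

# Roll forward to the last business day of the year that contains d.
last_bday = BYearEnd().rollforward(d)

# Step back 5 business days from there (weekends skipped, holidays ignored).
target = last_bday - BDay(5)

print(last_bday.date(), target.date())
```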

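6. Time series methods

Time series work often involves shifting or lagging, frequency conversion, and resampling. Let's see how each of these operations is used.

6.1 Shifting

To shift or lag a time series, use the shift method.

In the first call below, shift(2) moves every value of the Series by 2 positions. To shift the date index instead of the values, pass the freq parameter, which accepts a DateOffset class, a timedelta-like object, or an offset alias. All aliases are described under Offset Aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases).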

ts.shift(2)
Out[36]: 
2018-06-24    NaN
2018-07-01    NaN
2018-07-08    0.0
2018-07-15    1.0
Freq: W-SUN, dtype: float64
ts.shift(2, freq=Day())
Out[37]: 
2018-06-26    0
2018-07-03    1
2018-07-10    2
2018-07-17    3
Freq: W-TUE, dtype: int64
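# The date index has now moved by 2 days. tshift gives the same result (tshift was deprecated in pandas 1.1 and removed in 2.0, where shift with freq is the way to do this).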
ts.tshift(2, freq=Day())
Out[38]: 
2018-06-26    0
2018-07-03    1
2018-07-10    2
2018-07-17    3
Freq: W-TUE, dtype: int64
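
As a compact recap, the sketch below puts the two kinds of shift side by side and uses a string offset alias ("2D") instead of a Day() object. It rebuilds `ts` so it runs on its own, and the result names are only illustrative.

```python
import pandas as pd

rng = pd.date_range("2018-6-24", periods=4, freq="W")
ts = pd.Series(range(len(rng)), index=rng)

# Shift the values: the index stays put and the first two slots become NaN.
values_shifted = ts.shift(2)

# Shift the index: the alias "2D" moves every date forward by two days, values untouched.
index_shifted = ts.shift(freq="2D")

print(values_shifted.isna().sum(), index_shifted.index[0].date())
```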

6.2 Frequency conversion

Frequency conversion is done with the asfreq method. The example below converts the frequency from weekly to daily.

ts.asfreq(Day())
Out[39]: 
2018-06-24    0.0
2018-06-25    NaN
2018-06-26    NaN
2018-06-27    NaN
2018-06-28    NaN
2018-06-29    NaN
2018-06-30    NaN
2018-07-01    1.0
2018-07-02    NaN
2018-07-03    NaN
2018-07-04    NaN
2018-07-05    NaN
2018-07-06    NaN
2018-07-07    NaN
2018-07-08    2.0
2018-07-09    NaN
2018-07-10    NaN
2018-07-11    NaN
2018-07-12    NaN
2018-07-13    NaN
2018-07-14    NaN
2018-07-15    3.0
Freq: D, dtype: float64
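# Note the missing values that appear. asfreq takes a method parameter to fill them; the available fill methods are the same as those described for fillna in the Pandas missing-data article.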
ts.asfreq(Day(), method="pad")
Out[40]: 
2018-06-24    0
2018-06-25    0
2018-06-26    0
2018-06-27    0
2018-06-28    0
2018-06-29    0
2018-06-30    0
2018-07-01    1
2018-07-02    1
2018-07-03    1
2018-07-04    1
2018-07-05    1
2018-07-06    1
2018-07-07    1
2018-07-08    2
2018-07-09    2
2018-07-10    2
2018-07-11    2
2018-07-12    2
2018-07-13    2
2018-07-14    2
2018-07-15    3
Freq: D, dtype: int64
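
Forward filling is not the only option. The sketch below tries two others on the same series: back filling, and a constant passed through the `fill_value` parameter of asfreq. The constant -1 is an arbitrary choice for illustration.

```python
import pandas as pd

rng = pd.date_range("2018-6-24", periods=4, freq="W")
ts = pd.Series(range(len(rng)), index=rng)

# "D" is the string alias for the Day() offset used above.
daily = ts.asfreq("D")

# Back fill: each gap takes the next known value instead of the previous one.
daily_bfill = ts.asfreq("D", method="bfill")

# fill_value: gaps created by the conversion get a constant.
daily_const = ts.asfreq("D", fill_value=-1)

print(len(daily), daily_bfill.iloc[1], daily_const.iloc[1])
```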

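6.3 Resampling

resample aggregates data along the date dimension, which can be minutes, hours, business days, weeks, months, years, and so on. More options are listed under Offset Aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases). Here we start by aggregating by month.

# Sum of the values in each month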
ts.resample("1M").sum()
Out[41]: 
2018-06-30    0
2018-07-31    6
Freq: M, dtype: int64
# Mean of the values in each month (mean always returns floats)
ts.resample("1M").mean()
Out[42]: 
2018-06-30    0.0
2018-07-31    2.0
Freq: M, dtype: float64
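
Several aggregations can be computed in one pass. The sketch below does so with `agg`, and passes a MonthEnd offset object as the rule rather than a string, since the spelling of the month-end alias has changed between pandas versions.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

rng = pd.date_range("2018-6-24", periods=4, freq="W")
ts = pd.Series(range(len(rng)), index=rng)

# One row per month end, one column per aggregation.
monthly = ts.resample(MonthEnd()).agg(["sum", "mean", "count"])

print(monthly)
```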

Case studies

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

pd.set_option('display.max_columns',None)

df = pd.read_csv('911.csv')

df.timeStamp = pd.to_datetime(df.timeStamp)  # parse the time strings into datetimes

df.set_index('timeStamp',inplace=True)  # use the datetime column as the index
# print(df.head())

# Count the number of 911 calls in each month
count_by_month = df.resample('M').count()['title']
print(count_by_month)

# Plot
_x = count_by_month.index
_y = count_by_month.values

plt.figure(figsize=(15,8),dpi=80)

plt.plot(range(len(_x)),_y)

plt.xticks(range(len(_x)),_x.strftime('%Y-%m-%d'),rotation=45)

plt.show()

Example 1: Number of 911 calls per month over time
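
The 911.csv file is not reproduced in this article, so here is a minimal sketch of the same monthly count on a tiny made-up DataFrame. The rows are hypothetical and only mimic the two columns the example relies on (timeStamp and title). It uses `size()` and a MonthEnd offset object, which is a variation on the code above rather than the original approach.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical stand-in for 911.csv: made-up rows, not the real data.
df = pd.DataFrame({
    "timeStamp": ["2015-12-10 17:40:00", "2015-12-11 08:05:00",
                  "2016-01-02 12:30:00", "2016-01-15 09:00:00",
                  "2016-01-20 23:10:00"],
    "title": ["EMS: BACK PAINS", "Fire: GAS-ODOR", "EMS: FALL VICTIM",
              "Traffic: VEHICLE ACCIDENT", "EMS: HEAD INJURY"],
})
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
df = df.set_index("timeStamp")

# size() counts rows per month regardless of missing values in any column.
count_by_month = df.resample(MonthEnd()).size()
print(count_by_month)
```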

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np


pd.set_option('display.max_columns',None)

df = pd.read_csv('911.csv')
# Parse the time strings into datetimes (set as the index further down)
df.timeStamp = pd.to_datetime(df.timeStamp)

# Add a column holding the category, i.e. the part of the title before the colon
df['cate'] = df.title.str.split(':').str[0]

df.set_index('timeStamp',inplace=True)

plt.figure(figsize=(15, 8), dpi=80)

# Group by category
for group_name,group_data in df.groupby(by='cate'):
    # Plot one line per category
    count_by_month = group_data.resample('M').count()['title']
    # Plot
    _x = count_by_month.index
    _y = count_by_month.values
    plt.plot(range(len(_x)),_y,label=group_name)

plt.xticks(range(len(_x)), _x.strftime('%Y-%m-%d'), rotation=45)

plt.legend(loc='best')
plt.show()

Example 2: Number of 911 calls per month by category over time
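
The loop above resamples each category separately. As an alternative sketch, the month and the category can be grouped in one step with pd.Grouper, giving a table with one column per category. The rows below are hypothetical and only shaped like the 911 data.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical rows shaped like the 911 data (timeStamp, title); not the real file.
df = pd.DataFrame({
    "timeStamp": pd.to_datetime(["2015-12-10", "2015-12-11", "2016-01-02",
                                 "2016-01-15", "2016-01-20"]),
    "title": ["EMS: BACK PAINS", "Fire: GAS-ODOR", "EMS: FALL VICTIM",
              "Traffic: VEHICLE ACCIDENT", "EMS: HEAD INJURY"],
})
df["cate"] = df["title"].str.split(":").str[0]

# One row per month, one column per category; missing combinations become 0.
counts = (
    df.groupby([pd.Grouper(key="timeStamp", freq=MonthEnd()), "cate"])
      .size()
      .unstack(fill_value=0)
)
print(counts)
```

With the counts in this shape, calling `counts.plot()` would draw one line per category on a shared time axis.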

import pandas as pd
from matplotlib import pyplot as plt

pd.set_option('display.max_columns',None)

df = pd.read_csv('PM2.5/BeijingPM20100101_20151231.csv')
# print(df.head())

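# Combine the separate year/month/day/hour columns into a pandas time type with PeriodIndex
# (pandas 2.2 deprecates these keyword fields in favor of PeriodIndex.from_fields)
period = pd.PeriodIndex(year=df.year,month=df.month,day=df.day,hour=df.hour,freq='H')
df['datetime'] = period
print(df.head(10))

# Set datetime as the index
df.set_index('datetime',inplace=True)

# Downsample to 7-day bins
df = df.resample('7D').mean()

# Handle missing values by dropping them (left commented out)
# data = df['PM_US Post'].dropna()
# china_data = df['PM_Nongzhanguan'].dropna()
data = df['PM_US Post']
china_data = df['PM_Nongzhanguan']

# Plot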
_x = data.index
_y = data.values

_x_china = china_data.index
_y_china = china_data.values

plt.figure(figsize=(13,8),dpi=80)

plt.plot(range(len(_x)),_y,label='US_POST',alpha=0.7)
plt.plot(range(len(_x_china)),_y_china,label='CN_POST',alpha=0.7)

plt.xticks(range(0,len(_x_china),10),list(_x_china.strftime('%Y%m%d'))[::10],rotation=45)

plt.legend(loc='best')  # show the labels set in the plot calls above
plt.show()

Example 3: Plotting US and Chinese PM2.5 readings over time
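
The PM2.5 file is not included here either, so the sketch below reproduces only the date handling on a few made-up rows. Note that it swaps in a different technique from the example above: instead of PeriodIndex, it assembles timestamps with to_datetime, which accepts a DataFrame with columns named year, month, day and hour. The column name `pm` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical readings with the date split across columns, as in the PM2.5 file.
df = pd.DataFrame({
    "year": [2010] * 4,
    "month": [1] * 4,
    "day": [1, 1, 8, 8],
    "hour": [0, 12, 0, 12],
    "pm": [100.0, 120.0, 60.0, None],
})

# to_datetime assembles timestamps from the year, month, day and hour columns.
df["datetime"] = pd.to_datetime(df[["year", "month", "day", "hour"]])
df = df.set_index("datetime")

# mean() skips missing readings inside each 7-day bin.
weekly = df["pm"].resample("7D").mean()
print(weekly)
```

Because mean skips missing values within a bin, a gap only shows up in the downsampled series when an entire 7-day bin has no readings.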
