Pandas入門學習筆記4

5 層次化索引

層次化索引是pandas的重要功能。以低維度的形式處理高維度數據。python

In [185]: data = Series(np.random.randn(10),index=[list('aaabbbccdd'),[1,2,3,1,2,3,2,3,2,3]])

In [186]: data
Out[186]:
a  1    0.458553
   2    0.077532
   3   -1.561180
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
d  2    1.131352
   3   -1.292136
dtype: float64

獲取索引。git

In [187]: data.index
Out[187]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 1, 2, 1, 2]])

In [188]: data['b']
Out[188]:
1    2.498391
2    0.243617
3   -0.818542
dtype: float64

In [189]: data['b':'c']
Out[189]:
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
dtype: float64

In [190]: data[:,2]  # 獲取內層索引
Out[190]:
a    0.077532
b    0.243617
c   -1.222213
d    1.131352
dtype: float64


In [191]: data.unstack()  # unstack來從新安排到dataframe中。
Out[191]:
          1         2         3
a  0.458553  0.077532 -1.561180
b  2.498391  0.243617 -0.818542
c       NaN -1.222213 -0.797079
d       NaN  1.131352 -1.292136

In [192]: data.unstack().stack()  # 逆運算--stack
Out[192]:
a  1    0.458553
   2    0.077532
   3   -1.561180
b  1    2.498391
   2    0.243617
   3   -0.818542
c  2   -1.222213
   3   -0.797079
d  2    1.131352
   3   -1.292136
dtype: float64

DataFrame每條軸均可以分層索引。github

5.1 重排分級順序

能夠重排調整某條軸上的索引順序,swaplevel能夠互換兩個索引值,並範圍一個新的對象。web

In [193]: frame = DataFrame(np.random.randn(4,3),index=[list('aabb'),[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
     ...:

In [194]: frame
Out[194]:
         Ohio            Colorado
        Green       Red     Green
a 1  0.368997  0.670430  1.056365
  2 -0.352259 -0.656101  0.018544
b 1 -0.574535 -0.531988  0.295466
  2 -0.973587  0.225511 -0.250887

In [198]: frame.index.names = ['key1','key2']

In [199]: frame.columns.names = ['state','color']

In [200]: frame
Out[200]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
     2    -0.352259 -0.656101  0.018544
b    1    -0.574535 -0.531988  0.295466
     2    -0.973587  0.225511 -0.250887

In [201]: frame.swaplevel('key1','key2')
Out[201]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
2    a    -0.352259 -0.656101  0.018544
1    b    -0.574535 -0.531988  0.295466
2    b    -0.973587  0.225511 -0.250887

In [202]: frame.sortlevel(1)
Out[202]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
b    1    -0.574535 -0.531988  0.295466
a    2    -0.352259 -0.656101  0.018544
b    2    -0.973587  0.225511 -0.250887

In [203]: frame.swaplevel(0,1)
Out[203]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
2    a    -0.352259 -0.656101  0.018544
1    b    -0.574535 -0.531988  0.295466
2    b    -0.973587  0.225511 -0.250887

In [204]: frame.swaplevel(0,1).sortlevel(0)
Out[204]:
state          Ohio            Colorado
color         Green       Red     Green
key2 key1                              
1    a     0.368997  0.670430  1.056365
     b    -0.574535 -0.531988  0.295466
2    a    -0.352259 -0.656101  0.018544
     b    -0.973587  0.225511 -0.250887

5.2 根據級別彙總統計

許多DataFrame和Series彙總和統計方法都有level選項,指定在某個軸。bash

In [205]: frame
Out[205]:
state          Ohio            Colorado
color         Green       Red     Green
key1 key2                              
a    1     0.368997  0.670430  1.056365
     2    -0.352259 -0.656101  0.018544
b    1    -0.574535 -0.531988  0.295466
     2    -0.973587  0.225511 -0.250887


In [207]: frame.sum(level='key2')
Out[207]:
state      Ohio            Colorado
color     Green       Red     Green
key2                               
1     -0.205538  0.138443  1.351831
2     -1.325846 -0.430590 -0.232343

In [209]: frame.sum(level='color',axis=1)
Out[209]:
color         Green       Red
key1 key2                    
a    1     1.425362  0.670430
     2    -0.333715 -0.656101
b    1    -0.279069 -0.531988
     2    -1.224474  0.225511

5.3 使用DataFrame的列

常常須要用DataFrame的列做爲索引,或者但願將索引變成DataFrame的列。數據結構

In [210]: df = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one']*7,'d':[0,1,2,0,1,2,3]})

In [211]: df
Out[211]:
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  one  0
4  4  3  one  1
5  5  2  one  2
6  6  1  one  3

In [212]: df2 = df.set_index(['c','d']) #默認狀況下,會將轉換的這兩列刪除掉;

In [213]: df2
Out[213]:
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
    0  3  4
    1  4  3
    2  5  2
    3  6  1

    In [215]: df2 = df.set_index(['c','d'],drop=False) # 仍然保留這兩列

    In [216]: df2
    Out[216]:
           a  b    c  d
    c   d              
    one 0  0  7  one  0
        1  1  6  one  1
        2  2  5  one  2
        0  3  4  one  0
        1  4  3  one  1
        2  5  2  one  2
        3  6  1  one  3

用reset_index能夠將索引合併到DataFrame中。app

In [217]: df2 = df.set_index(['c','d'])

In [218]: df2
Out[218]:
       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
    0  3  4
    1  4  3
    2  5  2
    3  6  1

In [219]: df2.reset_index()
Out[219]:
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  one  0  3  4
4  one  1  4  3
5  one  2  5  2
6  one  3  6  1

6 其餘

6.1 整數索引

先看一個例子:咱們很難判斷是要經過位置仍是經過標籤的索引來獲取數據。dom

In [220]: ser = Series(np.arange(3))

In [221]: ser
Out[221]:
0    0
1    1
2    2
dtype: int64

In [222]: ser[-1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...

這樣對於使用字母索引的Series就不存在這個問題。python2.7

若是須要可靠的、不考慮索引類型的、基於位置的索引,能夠使用:code

  • Series:iget_value
  • DataFrame:irow和icol

新的版本有些變化:都是用iloc來經過位置準確獲取。

In [231]: ser3 = Series(np.arange(3),index=[-5,1,3])

In [232]: ser3.iget_value(2)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: iget_value(i) is deprecated. Please use .iloc[i] or .iat[i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[232]: 2

In [236]: ser3.iloc[2]
Out[236]: 2

In [237]: ser3.iat[2]
Out[237]: 2

In [239]: frame = DataFrame(np.arange(6).reshape(3,2),index=[2,0,1])

In [241]: frame
Out[241]:
   0  1
2  0  1
0  2  3
1  4  5

In [242]: frame.irow(1)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: irow(i) is deprecated. Please use .iloc[i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[242]:
0    2
1    3
Name: 0, dtype: int64

In [243]: frame.icol(1)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: icol(i) is deprecated. Please use .iloc[:,i]
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[243]:
2    1
0    3
1    5
Name: 1, dtype: int64

In [245]: frame.iloc[1]  # 按行位置獲取
Out[245]:
0    2
1    3
Name: 0, dtype: int64

In [246]: frame.iloc[:,1]  #按列位置獲取
Out[246]:
2    1
0    3
1    5
Name: 1, dtype: int64

6.2 面板數據

Panel數據結構,能夠當作是一個三維的DataFrame數據結構。
Panel中的每一項都是一個DataFrame。
一樣使用堆積式(層次化索引的)的DataFrame能夠表示一個panel。

In [247]: import pandas.io.data as web
/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py:35: FutureWarning:
The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
  FutureWarning)

In [248]: web
Out[248]: <module 'pandas.io.data' from '/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py'>

In [249]: pdata = pd.Panel(dict((stk ,web.get_data_yahoo(stk,'1/1/2009','6/1/2012')) for stk in ['AAPL','GOOG','MSFT','DELL']))

In [250]: pdata
Out[250]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 868 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: Open to Adj Close

In [252]: pdata = pdata.swapaxes('items','minor')

In [253]: pdata['Adj Close']
Out[253]:
                 AAPL      DELL        GOOG       MSFT
Date                                                  
2009-01-02  11.808505  10.39902  160.499779  16.501303
...
2012-05-30  75.362333  12.14992  293.821674  25.878448
2012-05-31  75.174961  11.92743  290.140354  25.746145
2012-06-01  72.996726  11.67592  285.205295  25.093451

[868 rows x 4 columns]

In [256]: pdata.ix[:,'6/1/2012',:]  # ix擴展爲三維
Out[256]:
            Open        High         Low       Close       Volume   Adj Close
AAPL  569.159996  572.650009  560.520012  560.989983  130246900.0   72.996726
DELL   12.150000   12.300000   12.045000   12.070000   19397600.0   11.675920
GOOG  571.790972  572.650996  568.350996  570.981000    6138700.0  285.205295
MSFT   28.760000   28.959999   28.440001   28.450001   56634300.0   25.093451

In [260]: pdata.ix[:,'5/30/2012':,:].to_frame()
Out[260]:
                        Open        High         Low       Close       Volume  \
Date       minor                                                                
2012-05-30 AAPL   569.199997  579.989990  566.559990  579.169998  132357400.0   
           DELL    12.590000   12.700000   12.460000   12.560000   19787800.0   
           GOOG   588.161028  591.901014  583.530999  588.230992    3827600.0   
           MSFT    29.350000   29.480000   29.120001   29.340000   41585500.0   
2012-05-31 AAPL   580.740021  581.499985  571.460022  577.730019  122918600.0   
           DELL    12.530000   12.540000   12.330000   12.330000   19955600.0   
           GOOG   588.720982  590.001032  579.001013  580.860990    5958800.0   
           MSFT    29.299999   29.420000   28.940001   29.190001   39134000.0   
2012-06-01 AAPL   569.159996  572.650009  560.520012  560.989983  130246900.0   
           DELL    12.150000   12.300000   12.045000   12.070000   19397600.0   
           GOOG   571.790972  572.650996  568.350996  570.981000    6138700.0   
           MSFT    28.760000   28.959999   28.440001   28.450001   56634300.0   

                   Adj Close  
Date       minor              
2012-05-30 AAPL    75.362333  
           DELL    12.149920  
           GOOG   293.821674  
           MSFT    25.878448  
2012-05-31 AAPL    75.174961  
           DELL    11.927430  
           GOOG   290.140354  
           MSFT    25.746145  
2012-06-01 AAPL    72.996726  
           DELL    11.675920  
           GOOG   285.205295  
           MSFT    25.093451  

# 能夠轉化爲DataFrame
In [261]: stacked = pdata.ix[:,'5/30/2012':,:].to_frame()

In [262]: stacked.to_panel() # 轉化爲panel
Out[262]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
相關文章
相關標籤/搜索