The pandas module

pandas

Import conventions

>>> from pandas import Series,DataFrame
>>> import pandas as pd
>>> import numpy as np

Series

A Series is a one-dimensional array-like object made up of a set of data and an associated set of data labels (its index).

>>> obj=Series([4,7,-5,3])
>>> obj
0    4
1    7
2   -5
3    3
dtype: int64

The values and index attributes return its array representation and its index object.

>>> obj.values
array([ 4,  7, -5,  3])
>>> obj.index
RangeIndex(start=0, stop=4, step=1)

You can create a Series with an index that labels each data point.

>>> obj2=Series([4,7,-5,3],index=['d','b','a','c'])
>>> obj2
d    4
b    7
a   -5
c    3
dtype: int64

>>> obj2.index
Index([u'd', u'b', u'a', u'c'], dtype='object')

Unlike a plain NumPy array, a Series lets you use index labels to select a single value or a set of values.

>>> obj2['a']
-5

>>> obj2[['a','b','c']]
a   -5
b    7
c    3
dtype: int64

A Series can be treated as a fixed-length, ordered dictionary.

>>> 'b' in obj2
True
>>> 'e' in obj2
False

If the data is stored in a Python dictionary, you can create a Series directly from that dictionary.

>>> sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
>>> obj3=Series(sdata)
>>> obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

If only a dictionary is passed, the index of the resulting Series is made up of the dictionary's keys.

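When an index is passed as well, the values in sdata whose keys match the labels in states are picked up and arranged in the order given by states; a label with no matching key, such as 'California', gets NaN.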

>>> states=['California','Ohio','Oregon','Texas']
>>> obj4=Series(sdata,index=states)
>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

The pandas isnull and notnull functions detect missing data.

>>> pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Series has equivalent instance methods.

>>> obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

One of the most important Series features is data alignment: arithmetic operations automatically align data by index label.

>>> obj3+obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a name attribute.

>>> obj4.name='population'
>>> obj4.index.name='state'
>>> obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment.

>>> obj.index=['Bob','Steve','Jeff','Ryan']
>>> obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

DataFrame

To construct a DataFrame, pass a dictionary of equal-length lists or NumPy arrays directly.

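An index is assigned automatically, and the columns are placed in sorted order (in the pandas version used here; recent releases keep the dictionary's insertion order).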

>>> data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],'year':[2000,2001,2002,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> frame=DataFrame(data)
>>> frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

If you specify a sequence of columns, the DataFrame's columns are arranged in exactly that order.

>>> DataFrame(data,columns=['year','state','pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

As with Series, if you pass a column that is not contained in the data, it appears with NA values.

>>> frame2=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
>>> frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

>>> frame2.columns
Index([u'year', u'state', u'pop', u'debt'], dtype='object')

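A column of a DataFrame can be retrieved as a Series, either with dictionary-like notation or as an attribute. The returned Series has the same index as the DataFrame, and its name attribute is set accordingly.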

>>> frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

>>> frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

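Rows can also be retrieved by position or by label, for example with the ix indexing field. Note that ix was deprecated in pandas 0.20 and removed in 1.0; loc (label-based) and iloc (position-based) replace it.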

>>> frame2.ix['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by assignment.

>>> frame2['debt']=16.5

>>> frame2['debt']=np.arange(5.)
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0

When you assign a list or an array to a column, its length must match the length of the DataFrame.

If you assign a Series, its labels are aligned exactly to the DataFrame's index, and missing values are inserted in any holes.

>>> val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])
>>> frame2['debt']=val
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

Assigning to a column that does not exist creates a new column.

>>> frame2['eastern']=frame2.state == 'Ohio'
>>> frame2
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False

The del keyword deletes a column.

>>> del frame2['eastern']
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

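In the pandas version used here, the column returned by indexing a DataFrame is a view on the underlying data, not a copy, so any in-place modification of that Series is reflected in the source DataFrame. To get an independent copy, call the Series's copy method.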
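Whether a selected column shares memory with its DataFrame has varied between pandas releases, so an explicit copy is the safest way to edit a column without touching the source. The following is a minimal sketch under that assumption; the frame and the value 99.0 are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "debt": [1.5, 2.4]})

# Take an explicit copy so later edits cannot reach the DataFrame.
debt = frame["debt"].copy()
debt.iloc[0] = 99.0

print(frame["debt"].tolist())   # the DataFrame still holds its original values
print(debt.tolist())            # only the copy was changed
```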

Another common form of data is a nested dictionary (a dictionary of dictionaries).

The outer dictionary's keys become the columns and the inner keys become the row index.

>>> pop={'Nevada':{2001:2.4,2002:2.9},
...     'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
>>> frame3=DataFrame(pop)
>>> frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

The result can be transposed.

>>> frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

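The keys of the inner dictionaries are unioned and sorted to form the index of the result, unless an explicit index is specified.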

>>> DataFrame(pop,index=[2001,2002,2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

Set the name attribute of the DataFrame's index and columns.

>>> frame3.index.name='year';frame3.columns.name='state'
>>> frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

As with Series, the values attribute returns the data contained in the DataFrame, here as a two-dimensional ndarray.

>>> frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

If the DataFrame's columns have different data types, the dtype of the values array is chosen to accommodate all of the columns.

>>> frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

Index objects

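Index objects are an important part of the pandas data model.

They hold the axis labels and other metadata. Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.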

>>> obj=Series(range(3),index=['a','b','c'])
>>> index=obj.index
>>> index
Index([u'a', u'b', u'c'], dtype='object')

Index objects are immutable, which makes it safe to share an Index among multiple data structures.
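Immutability is easy to check: item assignment on an Index raises TypeError. A small sketch follows; the variable names mutated and obj2 are only for illustration.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

try:
    index[1] = "d"          # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False

# Because it cannot change, the same labels can safely back another Series.
obj2 = pd.Series([1.5, -2.5, 0.0], index=index)

print(mutated)
print(list(obj2.index))
```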

Reindexing

>>> obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

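Calling reindex on this Series rearranges the data according to the new index, introducing missing values for any index labels that were not already present.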

>>> obj2=obj.reindex(['a','b','c','d','e'])
>>> obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

Set a default for the missing values with fill_value.

>>> obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)
>>> obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

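For ordered data such as time series, you may want to do some interpolation or filling of values when reindexing.

The method option does this.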

>>> obj3=Series(['blue','purple','yellow'],index=[0,2,4])

Values for reindex's method option (interpolation):

ffill or pad: fill (or carry) values forward

>>> obj3.reindex(range(6),method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

For a DataFrame, reindex can alter the row index, the columns, or both. When passed just a sequence, it reindexes the rows.

>>> frame=DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],columns=['Ohio','Texas','California'])
>>> frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


>>> frame2=frame.reindex(['a','b','c','d'])
>>> frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

Columns can be reindexed with the columns keyword.

>>> states=['Texas','Utah','California']
>>> frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Rows and columns can be reindexed in one call, although interpolation is only applied row-wise (along axis 0).

>>> frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Reindexing can be written more succinctly by label-indexing with ix (in pandas releases that still provide ix).

>>> frame.ix[['a','b','c','d'],states]
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0

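Arguments of the reindex function

index       New sequence to use as the index
method      Interpolation (fill) method
fill_value  Substitute value to use when reindexing introduces missing data
limit       Maximum number of consecutive elements to fill when forward- or backward-filling
level       Match a simple Index on the given level of a MultiIndex; otherwise select a subset
copy        Defaults to True, which always copies the data; if False, the data is not copied when the new index is equivalent to the old one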
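A short sketch of how fill_value and limit behave in practice. The two-element Series and the placeholder string "missing" are made up for illustration.

```python
import pandas as pd

obj = pd.Series(["blue", "yellow"], index=[0, 3])

# fill_value replaces the NaN that reindexing would otherwise introduce.
filled = obj.reindex(range(5), fill_value="missing")

# method='ffill' carries values forward; limit=1 fills at most one
# consecutive new label after each existing one, so label 2 stays missing.
limited = obj.reindex(range(5), method="ffill", limit=1)

print(filled.tolist())
print(limited.isna().tolist())
```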

Dropping entries from an axis

>>> obj=Series(np.arange(5.),index=['a','b','c','d','e'])
>>> obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

>>> new_obj=obj.drop('c')
>>> new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

For a DataFrame, index values can be deleted from either axis.

>>> data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15 


>>> data.drop(['Colorado','Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


>>> data.drop('two',axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

>>> data.drop(['two','four'],axis=1)
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

Indexing, selection, and filtering

Series indexing works like NumPy array indexing, except that you can use the Series's index values instead of only integers.

Slicing a Series with labels differs from normal Python slicing in that the endpoint is inclusive.
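The difference between label slices and positional slices can be seen side by side. This is a minimal sketch; the Series is made up for illustration, and loc and iloc are used to make the two kinds of slice explicit.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"c"]     # label slice: 'c' is included
by_position = obj.iloc[1:2]     # positional slice: the stop position is excluded

print(by_label.tolist())
print(by_position.tolist())
```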

>>> data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Slicing a DataFrame

>>> data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7



>>> data<5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

>>> data[data<5]=0
>>> data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

>>> data.ix['Colorado',['two','three']]
two      5
three    6
Name: Colorado, dtype: int64

>>> data.ix[['Colorado','Utah'],[3,0,1]]
          four  one  two
Colorado     7    0    5
Utah        11    8    9

>>> data.ix[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

>>> data.ix[:'Utah','two']
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
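Since ix is not available in current pandas releases, the same selections can be sketched with loc (labels) and iloc (integer positions). The frame here is rebuilt without the data[data<5]=0 assignment, so some values differ from the transcript above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

row_by_label = data.loc["Colorado", ["two", "three"]]       # labels on both axes
mixed = data.loc[["Colorado", "Utah"]].iloc[:, [3, 0, 1]]   # row labels, then column positions
third_row = data.iloc[2]                                    # row by integer position
up_to_utah = data.loc[:"Utah", "two"]                       # label slice, endpoint included

print(mixed)
```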

Arithmetic and data alignment

Automatic data alignment introduces NA values at the index labels that do not overlap.

For a DataFrame, alignment is performed on both the rows and the columns.

To fill those gaps with a value instead, use the add method, passing the other operand together with a fill_value argument: obj.add(obj2,fill_value=0)
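A small sketch contrasting the + operator with add and fill_value. The two Series are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

plain = s1 + s2                      # 'a' and 'c' do not overlap, so they become NaN
filled = s1.add(s2, fill_value=0)    # a label missing from one operand is treated as 0

print(plain)
print(filled)
```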

Operations between DataFrame and Series

>>> arr=np.arange(12.).reshape((3,4))
>>> arr
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
>>> arr[0]
array([ 0.,  1.,  2.,  3.])
>>> arr-arr[0]
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

This is called broadcasting.

>>> frame=DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
>>> series=frame.ix[0]
>>> frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
>>> series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and then broadcasts down the rows.

>>> frame-series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union.

>>> series2=Series(range(3),index=['b','e','f'])
>>> series2
b    0
e    1
f    2
dtype: int64

>>> frame+series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN

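To match on the rows and broadcast over the columns instead, you have to use one of the arithmetic methods.

>>> series3=frame['d']
>>> frame.sub(series3,axis=0)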
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

Function application and mapping

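Many of the most common array statistics are implemented as DataFrame methods; for anything else, apply calls a function on each column (or on each row with axis=1).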

>>> frame=DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
>>> frame
               b         d         e
Utah   -1.120701 -0.772813 -1.183221
Ohio   -0.690566  0.610834  0.382371
Texas   0.287303 -0.001705 -1.055101
Oregon  1.149945  1.056177 -0.178909

>>> def f(x):
...     return Series([x.min(),x.max()],index=['min','max'])
... 
>>> frame.apply(f)
            b         d         e
min -1.120701 -0.772813 -1.183221
max  1.149945  1.056177  0.382371

To compute a formatted string from each floating-point value in frame, use applymap.

>>> format=lambda x:'%.2f' % x
>>> frame.applymap(format)
            b      d      e
Utah    -1.12  -0.77  -1.18
Ohio    -0.69   0.61   0.38
Texas    0.29  -0.00  -1.06
Oregon   1.15   1.06  -0.18

Series has a map method for applying an element-wise function.

>>> frame['e'].map(format)
Utah      -1.18
Ohio       0.38
Texas     -1.06
Oregon    -0.18
Name: e, dtype: object

Sorting and ranking

>>> obj=Series(range(4),index=['d','a','b','c'])
>>> obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64

For a DataFrame, you can sort by the index on either axis.

Data is sorted in ascending order by default, but it can be sorted in descending order too.

>>> frame=DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
>>> frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
>>> frame.sort_index(axis=1,ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

To sort a Series by its values, use its sort_values method (early pandas releases called this method order).

>>> obj=Series([4,7,-3,2])
>>> obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64

When sorting, any missing values are placed at the end of the Series by default.
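A brief sketch of where missing values land when sorting. The Series is made up for illustration; na_position is a sort_values parameter that moves them to the front instead.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

default_order = obj.sort_values()                    # NaN entries go last
nan_first = obj.sort_values(na_position="first")     # NaN entries go first

print(default_order)
print(nan_first)
```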

>>> frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
>>> frame
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

>>> frame.sort_values(by='b')
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

Ranking

Ties are broken according to some rule; by default, rank assigns each tied group its mean rank.

>>> obj=Series([7,-5,7,4,2,0,4])
>>> obj
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
>>> obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

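Tie-breaking methods for the rank method option

average Default: assign the average rank to each entry in a group of equal values
min     Use the minimum rank for the whole group
max     Use the maximum rank for the whole group
first   Assign ranks in the order in which the values appear in the data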
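The options in the table can be compared on the same Series used for the default ranking. This is a minimal sketch; the variable names are only for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

by_min = obj.rank(method="min")       # tied values share the lowest rank of their group
by_max = obj.rank(method="max")       # tied values share the highest rank of their group
by_first = obj.rank(method="first")   # ties are ranked in order of appearance

print(by_min.tolist())
print(by_max.tolist())
print(by_first.tolist())
```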

Axis indexes with duplicate values

Many pandas functions (reindex, for example) require that the labels be unique, but unique labels are not mandatory.

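The index's is_unique attribute tells you whether its labels are unique.

>>> obj=Series(range(5),index=['a','a','b','b','c'])
>>> obj.index.is_unique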
False

Indexing a label that has multiple entries returns a Series.

>>> obj['a']
a    0
a    1
dtype: int64

A label with a single entry returns a scalar value.

>>> obj['c']
4

Summarizing and computing descriptive statistics

sum returns the column sums; passing axis=1 sums across each row instead.

NA values are excluded automatically unless the entire slice (row or column) is NA.

This can be disabled with the skipna option, for example df.mean(axis=1,skipna=False).

describe produces multiple summary statistics in one shot.
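These reductions can be sketched on a small frame containing missing values. The data and variable names are made up for illustration. mean is used for the row-wise case because it makes the effect of an all-NA row visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"], columns=["one", "two"])

col_sums = df.sum()                            # one value per column, NaN skipped
row_means = df.mean(axis=1)                    # NaN only where the whole row is NA
strict_means = df.mean(axis=1, skipna=False)   # any NaN in a row makes the result NaN
summary = df.describe()                        # count, mean, std, min, quartiles, max

print(col_sums)
print(row_means)
print(strict_means)
print(summary)
```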

Correlation and covariance
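The notes give no example for this heading, so the following is only an illustrative sketch with made-up data. It uses the corr and cov methods of Series and the corr method of DataFrame.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0, 4.0])
b = pd.Series([2.0, 4.0, 6.0, 8.0])   # exactly 2 * a, so perfectly correlated

r = a.corr(b)     # Pearson correlation coefficient
c = a.cov(b)      # sample covariance

frame = pd.DataFrame({"a": a, "b": b})
corr_matrix = frame.corr()            # pairwise correlations between columns

print(r, c)
print(corr_matrix)
```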

Unique values, value counts, and membership

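These methods extract information about the values contained in a one-dimensional Series. unique returns an array of the unique values.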

>>> obj=Series(['c','a','d','a','a','b','b','c','c'])
>>> uniques=obj.unique()
>>> uniques
array(['c', 'a', 'd', 'b'], dtype=object)

value_counts computes how often each value occurs in a Series.

>>> obj.value_counts()
c    3
a    3
b    2
d    1
dtype: int64

isin performs a vectorized set-membership check and is useful for selecting a subset of the data in a Series or in a DataFrame column.

>>> mask=obj.isin(['b','c'])
>>> mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

>>> obj[mask]
0    c
5    b
6    b
7    c
8    c
dtype: object

Handling missing data

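Missing data is common in most data analysis applications, and one of the design goals of pandas is to make working with missing data as painless as possible.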

Python's built-in None value is also treated as NA.

Filtering out missing data

>>> from numpy import nan as NA
>>> data=Series([1,NA,3.5,NA,7])
>>> data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
>>> data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64

The same result can be obtained with boolean indexing.

>>> data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64

For a DataFrame, dropna by default drops any row that contains a missing value.

Passing how='all' drops only the rows that are entirely NA.

>>> data=DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
>>> data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


>>> data[4]=NA
>>> data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

>>> data.dropna(axis=1,how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

With time series data, you may want to keep only the rows that contain a certain number of observations.

>>> df=DataFrame(np.random.randn(7,3))
>>> df
          0         1         2
0  1.367974 -0.556556  0.679336
1 -0.480919 -1.535185 -0.299710
2  0.230583  0.140626  0.604209
3  0.437830 -0.467286 -0.859989
4 -0.254706 -0.227431 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273

>>> df.ix[:4,1]=NA
>>> df.ix[:2,2]=NA

>>> df
          0         1         2
0  1.367974       NaN       NaN
1 -0.480919       NaN       NaN
2  0.230583       NaN       NaN
3  0.437830       NaN -0.859989
4 -0.254706       NaN -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273

Keep only the rows that have at least 3 non-NA values.

>>> df.dropna(thresh=3)
          0         1         2
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
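Because the frame above is built from random numbers, here is a deterministic sketch of the same idea. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan],
                   [2.0, np.nan, 5.0],
                   [3.0, 4.0, 6.0]])

complete_rows = df.dropna()          # keeps only rows with no missing value
at_least_two = df.dropna(thresh=2)   # keeps rows with 2 or more non-NA values

print(complete_rows.index.tolist())
print(at_least_two.index.tolist())
```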

Filling in missing data

The fillna method is the workhorse function for filling in missing data.

>>> df.fillna(0)
          0         1         2
0  1.367974  0.000000  0.000000
1 -0.480919  0.000000  0.000000
2  0.230583  0.000000  0.000000
3  0.437830  0.000000 -0.859989
4 -0.254706  0.000000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273

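Calling fillna with a dictionary lets you use a different fill value for each column (a key with no matching column, such as 3 here, has no effect).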

>>> df.fillna({1:0.5,3:-1})
          0         1         2
0  1.367974  0.500000       NaN
1 -0.480919  0.500000       NaN
2  0.230583  0.500000       NaN
3  0.437830  0.500000 -0.859989
4 -0.254706  0.500000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273

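fillna returns a new object by default, but you can also modify the existing object in place.

In the pandas version used here, the in-place call returns a reference to the filled object; recent releases return None instead.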

>>> _=df.fillna(0,inplace=True)
>>> df
          0         1         2
0  1.367974  0.000000  0.000000
1 -0.480919  0.000000  0.000000
2  0.230583  0.000000  0.000000
3  0.437830  0.000000 -0.859989
4 -0.254706  0.000000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273

The interpolation methods available for reindex can also be used with fillna.

>>> df=DataFrame(np.random.randn(6,3))
>>> df.ix[2:,1]=NA;df.ix[4:,2]=NA
>>> df
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467       NaN -0.067846
3 -0.223047       NaN  0.513800
4  0.306559       NaN       NaN
5  0.404265       NaN       NaN

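Forward filling works down each column, replacing each missing value with the nearest preceding value in that column; limit caps how many consecutive values are filled.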

>>> df.fillna(method='ffill')
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467 -0.629213 -0.067846
3 -0.223047 -0.629213  0.513800
4  0.306559 -0.629213  0.513800
5  0.404265 -0.629213  0.513800

>>> df.fillna(method='ffill',limit=2)
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467 -0.629213 -0.067846
3 -0.223047 -0.629213  0.513800
4  0.306559       NaN  0.513800
5  0.404265       NaN  0.513800
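The same forward fill can also be written with the ffill method, which takes the limit argument directly. The sketch below uses a made-up Series; filling with the mean of the observed values is an extra illustration that is not part of the notes above.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

carried = data.ffill(limit=2)          # forward fill, at most 2 consecutive gaps
averaged = data.fillna(data.mean())    # fill with the mean of the non-missing values

print(carried.tolist())
print(averaged.tolist())
```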

Hierarchical indexing

Hierarchical indexing lets you have multiple (two or more) index levels on a single axis.

>>> data=Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
>>> data
a  1   -0.521370
   2    0.658209
   3    0.841101
b  1    0.354237
   2   -0.426983
   3    0.835357
c  1   -0.246308
   2    0.709859
d  2   -1.215098
   3    0.400793
dtype: float64

This is the formatted output of a Series that has a MultiIndex as its index.

>>> data.index
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])


>>> data['b']
1    0.354237
2   -0.426983
3    0.835357
dtype: float64

Selection from the inner level is also possible.

>>> data[:,2]
a    0.658209
b   -0.426983
c    0.709859
d   -1.215098
dtype: float64

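Hierarchical indexing plays an important role in reshaping data and in group-based operations such as forming a pivot table. For example, unstack rearranges this data into a DataFrame.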

>>> data.unstack()
          1         2         3
a -0.521370  0.658209  0.841101
b  0.354237 -0.426983  0.835357
c -0.246308  0.709859       NaN
d       NaN -1.215098  0.400793

The inverse operation of unstack is stack.
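A round trip makes the relationship concrete. This sketch uses a small made-up Series with no missing combinations, so that stack restores exactly the original entries.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4],
                 index=[["a", "a", "b", "b"], [1, 2, 1, 2]])

wide = data.unstack()     # the inner index level becomes the columns
tall = wide.stack()       # the columns move back into the inner index level

print(wide)
print(tall)
```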

With a DataFrame, either axis can have a hierarchical index.

>>> frame=DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
>>> frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

Each level can have a name (a string or any other Python object).

>>> frame.index.names=['key1','key2']
>>> frame
           Ohio     Colorado
          Green Red    Green
key1 key2 
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


>>> frame.columns.names=['state','color']
>>> frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

With partial column indexing, you can easily select groups of columns.

A MultiIndex can be created by itself and then reused.

>>> pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names=['state','color'])

Reordering and sorting levels

swaplevel takes two level numbers or names and returns a new object with the levels interchanged.

>>> frame.swaplevel('key1','key2')
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

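sortlevel, in turn, sorts the data using the values in a single level. (sortlevel was deprecated in pandas 0.20 in favor of sort_index(level=...) and is absent from current releases.)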

>>> frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2 
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

>>> frame.sortlevel(1)
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

>>> frame.swaplevel(0,1).sortlevel(0)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
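On pandas releases without sortlevel, the two calls above can be sketched with sort_index and its level argument. The frame is rebuilt here so that the example is self-contained.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]

by_key2 = frame.sort_index(level=1)                     # counterpart of frame.sortlevel(1)
swapped = frame.swaplevel(0, 1).sort_index(level=0)     # counterpart of swaplevel(0,1).sortlevel(0)

print(by_key2)
print(swapped)
```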

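Data selection performs much better on hierarchically indexed objects if the index is sorted lexicographically starting with the outermost level, that is, the result of calling sortlevel(0) or sort_index().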

Summary statistics by level

Many descriptive and summary statistics on DataFrame and Series have a level option, which specifies the level to aggregate by on a particular axis.

>>> frame.sum(level='key2')
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16

>>> frame.sum(level='color',axis=1)
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
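The level argument of sum was removed in pandas 2.0. A sketch of the same two aggregations using groupby on an index level follows; the frame is rebuilt so that the example is self-contained, and the column-wise case transposes first as one possible way to group columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

by_key2 = frame.groupby(level="key2").sum()             # like frame.sum(level='key2')
by_color = frame.T.groupby(level="color").sum().T       # like frame.sum(level='color',axis=1)

print(by_key2)
print(by_color)
```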

Using a DataFrame's columns

It is common to want to use one or more columns of a DataFrame as the row index, or to move the row index into the DataFrame's columns.

>>> frame=DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})
>>> frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3

The set_index() function creates a new DataFrame that uses one or more of its columns as the row index.

>>> frame2=frame.set_index(['c','d'])
>>> frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

By default those columns are removed from the DataFrame, but you can keep them.

>>> frame.set_index(['c','d'],drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3

reset_index does the opposite: the hierarchical index levels are moved into the columns.

>>> frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1