>>> from pandas import Series,DataFrame
>>> import pandas as pd
>>> import numpy as np
A Series is a one-dimensional array-like object made up of an array of data and an associated array of data labels, called its index.
>>> obj=Series([4,7,-5,3])
>>> obj
0    4
1    7
2   -5
3    3
dtype: int64
The values and index attributes return the array representation and the index object of a Series.
>>> obj.values
array([ 4,  7, -5,  3])
>>> obj.index
RangeIndex(start=0, stop=4, step=1)
An index can also be supplied to label each data point.
>>> obj2=Series([4,7,-5,3],index=['d','b','a','c'])
>>> obj2
d    4
b    7
a   -5
c    3
dtype: int64
>>> obj2.index
Index([u'd', u'b', u'a', u'c'], dtype='object')
Unlike a plain NumPy array, a Series lets you select a single value or a set of values by index label.
>>> obj2['a']
-5
>>> obj2[['a','b','c']]
a   -5
b    7
c    3
dtype: int64
A Series can also be thought of as a fixed-length, ordered dict, so the in operator tests for index labels.
>>> 'b' in obj2
True
>>> 'e' in obj2
False
If the data is stored in a Python dict, a Series can be created directly from that dict.
>>> sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
>>> obj3=Series(sdata)
>>> obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
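When only a dict is passed, the index of the resulting Series is the dict's keys.
When an index is passed as well, the values in sdata whose keys match the labels in states are placed in the order given by states. A label with no matching key (here 'California') gets NaN.
>>> states=['California','Ohio','Oregon','Texas']
>>> obj4=Series(sdata,index=states)
>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64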
The pandas functions isnull and notnull detect missing data.
>>> pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
Series has equivalent instance methods.
>>> obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
One of the most important features of Series is data alignment: arithmetic operations automatically align data that is indexed differently.
>>> obj3+obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
Both the Series object itself and its index have a name attribute.
>>> obj4.name='population'
>>> obj4.index.name='state'
>>> obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
The index of a Series can be changed in place by assignment.
>>> obj.index=['Bob','Steve','Jeff','Ryan']
>>> obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
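To build a DataFrame, pass a dict of equal-length lists or NumPy arrays.
An index is assigned automatically. In the pandas version shown here the columns come out in sorted order (newer pandas releases keep the dict's insertion order instead).
>>> data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],'year':[2000,2001,2002,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> frame=DataFrame(data)
>>> frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002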
If a sequence of columns is specified, the DataFrame's columns are arranged in exactly that order.
>>> DataFrame(data,columns=['year','state','pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
As with Series, a requested column that is not found in the data appears with NA values.
>>> frame2=DataFrame(data,columns=['year','state','pop','debt'],index=['one','two','three','four','five'])
>>> frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
>>> frame2.columns
Index([u'year', u'state', u'pop', u'debt'], dtype='object')
A DataFrame column can be retrieved as a Series either with dict-like notation or as an attribute. The returned Series has the same index as the DataFrame, and its name attribute is set to the column name.
>>> frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
>>> frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
Rows can also be retrieved by position or by name, for example with the ix indexing field.
>>> frame2.ix['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
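The ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0. A minimal sketch of the same row lookup on a recent pandas, using loc for labels and iloc for integer positions; the frame is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_label = frame2.loc['three']   # select the row labelled 'three'
by_position = frame2.iloc[2]     # select the third row by integer position

print(by_label)
```

Keeping label-based and position-based access in separate indexers removes the ambiguity ix had when the index itself contained integers.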
Columns can be modified by assignment. A scalar is broadcast to every row, and an array replaces the column's values.
>>> frame2['debt']=16.5
>>> frame2['debt']=np.arange(5.)
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
When a list or array is assigned to a column, its length must match the length of the DataFrame.
If a Series is assigned, it is aligned exactly to the DataFrame's index, and any gaps are filled with missing values.
>>> val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])
>>> frame2['debt']=val
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
Assigning to a column that does not exist creates a new column.
>>> frame2['eastern']=frame2.state == 'Ohio'
>>> frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
The del keyword deletes a column.
>>> del frame2['eastern']
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
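In the pandas version used here, a column returned by indexing is a view on the underlying data, not a copy, so any in-place modification of the returned Series is reflected in the source DataFrame.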
Another common form of data is a nested dict (a dict of dicts).
The outer dict keys become the columns and the inner keys become the row index.
>>> pop={'Nevada':{2001:2.4,2002:2.9},
... 'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
>>> frame3=DataFrame(pop)
>>> frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
The result can be transposed.
>>> frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
The keys of the inner dicts are combined and sorted to form the final index, unless an explicit index is given.
>>> DataFrame(pop,index=[2001,2002,2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
If the name attributes of a DataFrame's index and columns are set, they are displayed as well.
>>> frame3.index.name='year';frame3.columns.name='state'
>>> frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
As with Series, the values attribute returns the data in the DataFrame, here as a two-dimensional ndarray.
>>> frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
If the DataFrame's columns have different data types, the dtype of the values array is chosen to accommodate all of the columns.
>>> frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)
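Index objects are an important part of the pandas data model.
They hold the axis labels and other metadata. Any array or other sequence of labels used when building a Series or DataFrame is converted to an Index.
>>> obj=Series(range(3),index=['a','b','c'])
>>> index=obj.index
>>> index
Index([u'a', u'b', u'c'], dtype='object')
Index objects are immutable, which makes it safe to share an Index among several data structures.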
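A short sketch of what immutability means in practice: item assignment on an Index raises TypeError, and an existing Index can be handed to another structure. The names other and error_message are illustrative only.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

error_message = None
try:
    index[1] = 'd'          # item assignment is not supported on an Index
except TypeError as exc:
    error_message = str(exc)
    print('TypeError:', error_message)

# An existing Index can label another structure directly.
other = pd.Series([1.5, -2.5, 0.0], index=index)
print(other.index.equals(index))
```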
>>> obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
Calling reindex on this Series rearranges the data according to the new index and introduces missing values for any index labels that were not already present.
>>> obj2=obj.reindex(['a','b','c','d','e'])
>>> obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
The fill_value option sets the value to use in place of missing data.
>>> obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)
>>> obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
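For ordered data such as time series, reindexing may need to interpolate or fill values. The method option does this.
>>> obj3=Series(['blue','purple','yellow'],index=[0,2,4])
Values of the method option of reindex:
ffill or pad         fill (carry) values forward
bfill or backfill    fill (carry) values backward
>>> obj3.reindex(range(6),method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object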
For a DataFrame, reindex can change the row index, the columns, or both. If only a sequence is passed, the rows are reindexed.
>>> frame=DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],columns=['Ohio','Texas','California'])
>>> frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
>>> frame2=frame.reindex(['a','b','c','d'])
>>> frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
The columns keyword reindexes the columns.
>>> states=['Texas','Utah','California']
>>> frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
Rows and columns can be reindexed in one call, but interpolation is applied only row-wise (along axis 0).
>>> frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
Reindexing can be written more concisely by label-indexing with ix.
>>> frame.ix[['a','b','c','d'],states]
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
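Parameters of the reindex function:
index        New sequence to use as the index
method       Interpolation (fill) method
fill_value   Substitute value to use when reindexing introduces missing data
limit        Maximum size of the gap to fill when forward- or backward-filling
level        Match a simple Index on the given level of a MultiIndex; otherwise select a subset
copy         Defaults to True, which always copies the data; if False, the data is not copied when the new index is equivalent to the old one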
The drop method returns a new object with the given entries removed from an axis.
>>> obj=Series(np.arange(5.),index=['a','b','c','d','e'])
>>> obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
>>> new_obj=obj.drop('c')
>>> new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
For a DataFrame, index values can be dropped from either axis.
>>> data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
>>> data.drop(['Colorado','Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
>>> data.drop('two',axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
>>> data.drop(['two','four'],axis=1)
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
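Series indexing (obj[...]) works like NumPy array indexing, except that the index values need not be integers.
Slicing a Series with labels differs from ordinary Python slicing in that the endpoint is included.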
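A small sketch contrasting the two kinds of slice. The Series and the variable names by_label and by_position are made up for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj['b':'c']      # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:2]  # positional slice: the stop position is excluded

print(by_label)
print(by_position)
```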
>>> data=DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Slicing and selecting from a DataFrame:
>>> data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
>>> data < 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
>>> data[data < 5]=0
>>> data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
>>> data.ix['Colorado',['two','three']]
two      5
three    6
Name: Colorado, dtype: int64
>>> data.ix[['Colorado','Utah'],[3,0,1]]
          four  one  two
Colorado     7    0    5
Utah        11    8    9
>>> data.ix[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
>>> data.ix[:'Utah','two']
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
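Automatic data alignment introduces NA values at index labels that do not overlap.
For a DataFrame, alignment happens on both the rows and the columns.
To substitute a fill value instead, use the add method and pass the other operand together with a fill_value argument: obj.add(obj2,fill_value=0)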
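A minimal sketch of both behaviours, using made-up data (s1, s2, df1 and df2 are illustrative names): plain + leaves NaN wherever a label exists in only one operand, while add with fill_value=0 treats the missing operand as zero.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4.0, 3.1], index=['a', 'c', 'e', 'f', 'g'])

plain = s1 + s2                    # 'd', 'f' and 'g' appear in one operand only: NaN
filled = s1.add(s2, fill_value=0)  # the missing operand is treated as 0

# For DataFrames the alignment happens on rows and columns at once.
df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)), index=['x', 'y'], columns=list('ab'))
df2 = pd.DataFrame(np.arange(4.).reshape((2, 2)), index=['y', 'z'], columns=list('bc'))
total = df1 + df2                  # only the cell ('y', 'b') exists in both frames

print(plain)
print(filled)
print(total)
```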
Arithmetic between a DataFrame and a Series follows the same idea as arithmetic between a two-dimensional NumPy array and one of its rows:
>>> arr=np.arange(12.).reshape((3,4))
>>> arr
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
>>> arr[0]
array([ 0.,  1.,  2.,  3.])
>>> arr-arr[0]
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])
This is called broadcasting.
>>> frame=DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
>>> series=frame.ix[0]
>>> frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
>>> series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and then broadcasts down the rows.
>>> frame-series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
If an index value is not found in either the DataFrame's columns or the Series's index, both objects are reindexed to form the union.
>>> series2=Series(range(3),index=['b','e','f'])
>>> series2
b    0
e    1
f    2
dtype: int64
>>> frame+series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
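To match on the rows and broadcast across the columns instead, use one of the arithmetic methods and pass axis=0. Here series3 is the 'd' column of frame, which is what the output below corresponds to.
>>> series3=frame['d']
>>> frame.sub(series3,axis=0)
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0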
Many of the most common array statistics (such as sum and mean) are implemented as DataFrame methods, so apply is often unnecessary. When apply is used, the function passed to it may return a Series of several values rather than a scalar.
>>> frame=DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
>>> frame
               b         d         e
Utah   -1.120701 -0.772813 -1.183221
Ohio   -0.690566  0.610834  0.382371
Texas   0.287303 -0.001705 -1.055101
Oregon  1.149945  1.056177 -0.178909
>>> def f(x):
...     return Series([x.min(),x.max()],index=['min','max'])
...
>>> frame.apply(f)
            b         d         e
min -1.120701 -0.772813 -1.183221
max  1.149945  1.056177  0.382371
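Element-wise Python functions can be used as well. To get a formatted string for each floating-point value in frame, use applymap.
>>> format=lambda x:'%.2f' % x
>>> frame.applymap(format)
            b      d      e
Utah    -1.12  -0.77  -1.18
Ohio    -0.69   0.61   0.38
Texas    0.29  -0.00  -1.06
Oregon   1.15   1.06  -0.18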
Series has a map method for applying an element-wise function.
>>> frame['e'].map(format)
Utah      -1.18
Ohio       0.38
Texas     -1.06
Oregon    -0.18
Name: e, dtype: object
To sort by row or column index, use the sort_index method, which returns a new, sorted object.
>>> obj=Series(range(4),index=['d','a','b','c'])
>>> obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64
A DataFrame can be sorted by the index on either axis.
Data is sorted in ascending order by default, but it can be sorted in descending order too.
>>> frame=DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
>>> frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
>>> frame.sort_index(axis=1,ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
To sort a Series by its values, use its sort_values method (older pandas releases called this method order).
>>> obj=Series([4,7,-3,2])
>>> obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64
When sorting, any missing values are placed at the end of the Series by default.
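A brief sketch of that default, with made-up data. The na_position argument of sort_values moves the missing values to the front instead.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

ascending = obj.sort_values()                     # NaN entries come last
nan_first = obj.sort_values(na_position='first')  # NaN entries come first

print(ascending)
print(nan_first)
```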
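On a DataFrame, pass a column name to the by option of sort_values to sort by the values in that column.
>>> frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
>>> frame
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2
>>> frame.sort_values(by='b')
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7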
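The rank method assigns ranks and breaks ties according to a rule. By default, tied values share the mean of their ranks.
>>> obj=Series([7,-5,7,4,2,0,4])
>>> obj
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
>>> obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
Values of the method option used to break ties when ranking:
average   Default; assign the average rank to each entry in the tied group
min       Use the minimum rank for the whole group
max       Use the maximum rank for the whole group
first     Assign ranks in the order the values appear in the data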
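The four tie-breaking rules can be compared side by side on the same Series. This is only a sketch; the ranks DataFrame is an illustrative container, not part of the original notes.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = pd.DataFrame({
    'average': obj.rank(),               # ties share the mean of their ranks
    'min': obj.rank(method='min'),       # ties all get the lowest rank of the group
    'max': obj.rank(method='max'),       # ties all get the highest rank of the group
    'first': obj.rank(method='first'),   # ties are ordered by position in the data
})

print(ranks)
```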
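Many pandas functions (reindex, for example) require labels to be unique, but uniqueness is not enforced. Consider a Series with duplicate index labels (the definition below is a reconstruction consistent with the outputs that follow):
>>> obj=Series(range(5),index=['a','a','b','b','c'])
The is_unique attribute of the index tells you whether its labels are unique.
>>> obj.index.is_unique
False
Indexing a label that maps to several values returns a Series.
>>> obj['a']
a    0
a    1
dtype: int64
Indexing a label that maps to a single value returns a scalar.
>>> obj['c']
4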
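The sum method adds up the values; passing axis=1 sums across each row instead.
NA values are excluded automatically unless the entire slice (row or column) is NA.
The skipna option disables this: df.mean(axis=1,skipna=False)
The describe method produces several summary statistics in one call.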
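A small sketch of these reductions on made-up data. The mean is used to show the all-NA case, because the result of summing an all-NA slice differs between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                            # one value per column, NaN skipped
row_means = df.mean(axis=1)                    # one value per row; row 'c' is all NA
strict_means = df.mean(axis=1, skipna=False)   # any NaN in a row makes the result NaN
summary = df.describe()                        # count, mean, std, min, quartiles, max

print(col_sums)
print(row_means)
print(strict_means)
print(summary)
```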
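Several related methods extract information from the values of a one-dimensional Series. The first is unique, which returns an array of the distinct values.
>>> obj=Series(['c','a','d','a','a','b','b','c','c'])
>>> uniques=obj.unique()
>>> uniques
array(['c', 'a', 'd', 'b'], dtype=object)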
value_counts computes how often each value occurs in a Series.
>>> obj.value_counts()
c    3
a    3
b    2
d    1
dtype: int64
isin performs a vectorized set-membership test and is useful for selecting a subset of the data in a Series or a DataFrame column.
>>> mask=obj.isin(['b','c'])
>>> mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
>>> obj[mask]
0    c
5    b
6    b
7    c
8    c
dtype: object
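Missing data is common in most data analysis applications, and one of the design goals of pandas is to make handling it as painless as possible.
Python's built-in None value is also treated as NA.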
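A quick sketch showing that None and NaN are both flagged as missing; string_data and mask are illustrative names.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', None, np.nan, 'avocado'])

mask = string_data.isnull()   # True for both the None entry and the NaN entry

print(mask)
```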
>>> from numpy import nan as NA
>>> data=Series([1,NA,3.5,NA,7])
>>> data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
>>> data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
The same result can be obtained with boolean indexing.
>>> data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
For a DataFrame, dropna by default drops any row that contains a missing value.
Passing how='all' drops only the rows that are entirely NA. Columns are dropped the same way by passing axis=1.
>>> data=DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
>>> data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
>>> data[4]=NA
>>> data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
>>> data.dropna(axis=1,how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
With time series data, you may want to keep only the rows that contain a certain number of observations. The thresh argument does this.
>>> df=DataFrame(np.random.randn(7,3))
>>> df
          0         1         2
0  1.367974 -0.556556  0.679336
1 -0.480919 -1.535185 -0.299710
2  0.230583  0.140626  0.604209
3  0.437830 -0.467286 -0.859989
4 -0.254706 -0.227431 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
>>> df.ix[:4,1]=NA
>>> df.ix[:2,2]=NA
>>> df
          0         1         2
0  1.367974       NaN       NaN
1 -0.480919       NaN       NaN
2  0.230583       NaN       NaN
3  0.437830       NaN -0.859989
4 -0.254706       NaN -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
With thresh=3, a row is kept only if it has at least 3 non-NA values.
>>> df.dropna(thresh=3)
          0         1         2
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
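Instead of dropping missing data, you can fill it in; fillna is the main function for this. Calling it with a constant replaces missing values with that constant.
>>> df.fillna(0)
          0         1         2
0  1.367974  0.000000  0.000000
1 -0.480919  0.000000  0.000000
2  0.230583  0.000000  0.000000
3  0.437830  0.000000 -0.859989
4 -0.254706  0.000000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273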
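Calling fillna with a dict fills each column with a different value. Column 3 does not exist in df, so that entry has no effect and column 2 keeps its NaN values.
>>> df.fillna({1:0.5,3:-1})
          0         1         2
0  1.367974  0.500000       NaN
1 -0.480919  0.500000       NaN
2  0.230583  0.500000       NaN
3  0.437830  0.500000 -0.859989
4 -0.254706  0.500000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273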
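fillna returns a new object by default, but it can also modify the existing object in place.
In the pandas version used here, the in-place call returns a reference to the filled object; recent releases return None instead.
>>> _=df.fillna(0,inplace=True)
>>> df
          0         1         2
0  1.367974  0.000000  0.000000
1 -0.480919  0.000000  0.000000
2  0.230583  0.000000  0.000000
3  0.437830  0.000000 -0.859989
4 -0.254706  0.000000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273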
The interpolation methods available for reindex can also be used with fillna.
>>> df=DataFrame(np.random.randn(6,3))
>>> df.ix[2:,1]=NA;df.ix[4:,2]=NA
>>> df
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467       NaN -0.067846
3 -0.223047       NaN  0.513800
4  0.306559       NaN       NaN
5  0.404265       NaN       NaN
With method='ffill', each missing value is filled with the nearest preceding value in the same column. The limit option caps the number of consecutive values filled.
>>> df.fillna(method='ffill')
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467 -0.629213 -0.067846
3 -0.223047 -0.629213  0.513800
4  0.306559 -0.629213  0.513800
5  0.404265 -0.629213  0.513800
>>> df.fillna(method='ffill',limit=2)
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467 -0.629213 -0.067846
3 -0.223047 -0.629213  0.513800
4  0.306559       NaN  0.513800
5  0.404265       NaN  0.513800
Hierarchical indexing
Hierarchical indexing lets a single axis carry multiple (two or more) index levels.
>>> data=Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
>>> data
a  1   -0.521370
   2    0.658209
   3    0.841101
b  1    0.354237
   2   -0.426983
   3    0.835357
c  1   -0.246308
   2    0.709859
d  2   -1.215098
   3    0.400793
dtype: float64
This is the formatted display of a Series whose index is a MultiIndex. Selecting with an outer-level label returns the matching subset.
>>> data.index
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
>>> data['b']
1    0.354237
2   -0.426983
3    0.835357
dtype: float64
Selection is also possible on the inner level.
>>> data[:,2]
a    0.658209
b   -0.426983
c    0.709859
d   -1.215098
dtype: float64
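Hierarchical indexing plays an important role in reshaping data and in group-based operations such as building pivot tables. For example, unstack rearranges the data into a DataFrame.
>>> data.unstack()
          1         2         3
a -0.521370  0.658209  0.841101
b  0.354237 -0.426983  0.835357
c -0.246308  0.709859       NaN
d       NaN -1.215098  0.400793
The inverse operation of unstack is stack.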
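A sketch of the round trip on deterministic data (np.arange replaces the random values so the result is reproducible). Whether stack keeps or drops the NaN gap cells depends on the pandas version, so dropna() is applied afterwards as a precaution.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

wide = data.unstack()                # inner level becomes the columns; gaps are NaN
round_trip = wide.stack().dropna()   # back to a Series with a two-level index

print(wide)
print(round_trip)
```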
For a DataFrame, either axis can have a hierarchical index.
>>> frame=DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
>>> frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
Each level can have a name (a string or any other Python object).
>>> frame.index.names=['key1','key2']
>>> frame
           Ohio     Colorado
          Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
>>> frame.columns.names=['state','color']
>>> frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
With partial column indexing, groups of columns can be selected easily.
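A minimal sketch of partial column indexing on the frame above, rebuilt here so the snippet is self-contained: indexing with an outer-level label returns every column under it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

ohio = frame['Ohio']   # all columns under the outer label 'Ohio'

print(ohio)
```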
A MultiIndex can be created on its own and then reused.
>>> pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],names=['state','color'])
swaplevel takes two level numbers or names and returns a new object with those levels interchanged.
>>> frame.swaplevel('key1','key2')
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
sortlevel, in turn, sorts the data by the values in a single level.
>>> frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
>>> frame.sortlevel(1)
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
>>> frame.swaplevel(0,1).sortlevel(0)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
Data selection on a hierarchically indexed object performs much better when the index is sorted lexicographically starting from the outermost level, that is, the result of calling sortlevel(0) or sort_index().
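sortlevel has been removed from later pandas releases. A sketch of the same two calls written with sort_index(level=...), assuming a recent pandas; the frame is rebuilt so the snippet stands alone.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

by_key2 = frame.sort_index(level=1)                  # replaces frame.sortlevel(1)
swapped = frame.swaplevel(0, 1).sort_index(level=0)  # replaces ...sortlevel(0)

print(by_key2)
print(swapped)
```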
Many descriptive and summary statistics on DataFrame and Series have a level option that specifies the level of an axis to aggregate over.
>>> frame.sum(level='key2')
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
>>> frame.sum(level='color',axis=1)
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
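The level argument of sum was removed in pandas 2.0. A sketch of the equivalent aggregation with groupby(level=...), assuming a recent pandas; the column-wise case is done on the transpose to avoid grouping along axis=1.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

by_key2 = frame.groupby(level='key2').sum()           # sum rows sharing a key2 value
by_color = frame.T.groupby(level='color').sum().T     # sum columns sharing a color

print(by_key2)
print(by_color)
```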
It is common to want to use one or more DataFrame columns as the row index, or to move the row index into the DataFrame's columns.
>>> frame=DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})
>>> frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
The set_index() function creates a new DataFrame that uses one or more of the columns as its row index.
>>> frame2=frame.set_index(['c','d'])
>>> frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default those columns are removed from the DataFrame, but they can be kept by passing drop=False.
>>> frame.set_index(['c','d'],drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
reset_index does the opposite: the levels of the hierarchical index are moved into the columns.
>>> frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1