A Series is a one-dimensional, array-like object made up of an array of data and an associated array of data labels, called its index.
    In [15]: obj = pd.Series([4,7,-5,3])

    In [16]: obj
    Out[16]:
    0    4
    1    7
    2   -5
    3    3
    dtype: int64

    In [17]: obj.values
    Out[17]: array([ 4,  7, -5,  3])

    In [18]: obj.index
    Out[18]: RangeIndex(start=0, stop=4, step=1)
The index is shown on the left and the array of values on the right.
    In [21]: obj2 = pd.Series([4,7,-5,3],index=['a','b','c','d'])

    In [22]: obj2
    Out[22]:
    a    4
    b    7
    c   -5
    d    3
    dtype: int64

    In [23]: obj2.index
    Out[23]: Index([u'a', u'b', u'c', u'd'], dtype='object')

    In [24]: obj2['a']
    Out[24]: 4

    In [25]: obj2[obj2>0]
    Out[25]:
    a    4
    b    7
    d    3
    dtype: int64

    In [26]: obj2*2
    Out[26]:
    a     8
    b    14
    c   -10
    d     6
    dtype: int64
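You can specify the index yourself. Filtering with a boolean condition and scalar multiplication both preserve the link between each label and its value.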
    In [27]: 'b' in obj2
    Out[27]: True

    In [28]: 'f' in obj2
    Out[28]: False

    In [29]: '4' in obj2
    Out[29]: False
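A Series can be thought of as a fixed-length, ordered dict: it is a mapping from index values to data values. The `in` operator therefore tests index labels, not values.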
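The dict-like behavior described above can be sketched as follows. The `population` data is hypothetical, chosen only for illustration; `Series.get` and `Series.to_dict` are used here as dict-style helpers.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
population = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000})

# Membership tests look at the index labels, like dict keys.
has_texas = 'Texas' in population
has_value = 71000 in population   # values are not searched

# Label lookup with a default, similar to dict.get.
utah = population.get('Utah', 0)

# Convert back to a plain dict.
as_dict = population.to_dict()
```

Because membership is checked against the labels, searching for a value needs something like `population.isin([71000]).any()` instead.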
    In [31]: sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':500}

    In [32]: bj3 = pd.Series(sdata)

    In [33]: bj3
    Out[33]:
    Ohio      35000
    Oregon    16000
    Texas     71000
    Utah        500
    dtype: int64

    In [37]: sdata_index = {'a','b','c','Texas'}

    In [38]: obj4 = pd.Series(sdata,index=sdata_index)

    In [39]: obj4
    Out[39]:
    a            NaN
    c            NaN
    b            NaN
    Texas    71000.0
    dtype: float64
    In [40]: pd.isnull(obj4)
    Out[40]:
    a         True
    c         True
    b         True
    Texas    False
    dtype: bool
`pd.isnull` checks which entries are missing. Labels requested in the index but absent from the dict get NaN.
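A minimal sketch of the missing-data check, assuming a small hypothetical `sdata` dict. It uses a list for the index (a set has no guaranteed order) and the instance methods `isnull` and `notnull`, which mirror the top-level `pd.isnull` and `pd.notnull` functions.

```python
import pandas as pd

# Hypothetical data: 'a' is requested in the index but is not a key of sdata.
sdata = {'Ohio': 35000, 'Texas': 71000}
obj = pd.Series(sdata, index=['a', 'Texas'])

missing = obj.isnull()     # same result as pd.isnull(obj)
present = obj.notnull()    # element-wise inverse of isnull

# Booleans sum as 0/1, so this counts the missing entries.
n_missing = int(missing.sum())
```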
A DataFrame is a tabular data structure. It contains an ordered collection of columns, and each column can hold a different value type.
    In [44]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
        ...:         'year':[2000,2001,2002,2001,2002],
        ...:         'pop':[1.5,1.7,3.6,2.4,2.9]}

    In [45]: data
    Out[45]:
    {'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
     'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002]}

    In [46]: fram = pd.DataFrame(data)

    In [47]: fram
    Out[47]:
       pop   state  year
    0  1.5    Ohio  2000
    1  1.7    Ohio  2001
    2  3.6    Ohio  2002
    3  2.4  Nevada  2001
    4  2.9  Nevada  2002

    In [48]: pd.DataFrame(data,columns=['year','state','pop'])
    Out[48]:
       year   state  pop
    0  2000    Ohio  1.5
    1  2001    Ohio  1.7
    2  2002    Ohio  3.6
    3  2001  Nevada  2.4
    4  2002  Nevada  2.9
Passing `columns` lets you specify the order of the columns.
    In [51]: fram2 = pd.DataFrame(data,columns=['year','state','pop','debt'],
        ...:                      index=['one','two','three','four','five'])

    In [52]: fram2
    Out[52]:
           year   state  pop debt
    one    2000    Ohio  1.5  NaN
    two    2001    Ohio  1.7  NaN
    three  2002    Ohio  3.6  NaN
    four   2001  Nevada  2.4  NaN
    five   2002  Nevada  2.9  NaN

    In [53]: fram2['state']
    Out[53]:
    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    Name: state, dtype: object

    In [54]: fram2.year
    Out[54]:
    one      2000
    two      2001
    three    2002
    four     2001
    five     2002
    Name: year, dtype: int64

    In [56]: fram2.ix['four']
    Out[56]:
    year       2001
    state    Nevada
    pop         2.4
    debt        NaN
    Name: four, dtype: object
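A DataFrame column can be retrieved by name, either dict-style or as an attribute, and a row can be retrieved by its index label. A column that is not present in the data (`debt` here) is filled with NaN. Note that the `.ix` indexer used above was removed in pandas 1.0; `.loc` is the label-based replacement.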
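For readers on a current pandas release, here is a minimal sketch of the same kinds of access using `.loc` (by label) and `.iloc` (by integer position). The three-row `frame` is a shortened, hypothetical version of the data above.

```python
import pandas as pd

# Shortened, hypothetical version of the data used in the session.
data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}
frame = pd.DataFrame(data, index=['one', 'two', 'three'])

state_col = frame['state']          # column by name
row_by_label = frame.loc['three']   # row by index label
row_by_pos = frame.iloc[2]          # row by integer position
```

Attribute access such as `frame.year` only works when the column name is a valid Python identifier and does not clash with a DataFrame method; `frame.pop`, for example, refers to the `pop` method, so `frame['pop']` is needed for that column.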
    In [58]: fram2['debt'] = 16.5

    In [59]: fram2
    Out[59]:
           year   state  pop  debt
    one    2000    Ohio  1.5  16.5
    two    2001    Ohio  1.7  16.5
    three  2002    Ohio  3.6  16.5
    four   2001  Nevada  2.4  16.5
    five   2002  Nevada  2.9  16.5

    In [60]: fram2['debt'] = np.arange(5)

    In [61]: fram2
    Out[61]:
           year   state  pop  debt
    one    2000    Ohio  1.5     0
    two    2001    Ohio  1.7     1
    three  2002    Ohio  3.6     2
    four   2001  Nevada  2.4     3
    five   2002  Nevada  2.9     4

    In [62]: va = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])

    In [64]: fram2['debt'] = va

    In [65]: fram2
    Out[65]:
           year   state  pop  debt
    one    2000    Ohio  1.5   NaN
    two    2001    Ohio  1.7  -1.2
    three  2002    Ohio  3.6   NaN
    four   2001  Nevada  2.4  -1.5
    five   2002  Nevada  2.9  -1.7

    In [67]: fram2['eastern'] = fram2.state == 'Ohio'

    In [68]: fram2
    Out[68]:
           year   state  pop  debt eastern
    one    2000    Ohio  1.5   NaN    True
    two    2001    Ohio  1.7  -1.2    True
    three  2002    Ohio  3.6   NaN    True
    four   2001  Nevada  2.4  -1.5   False
    five   2002  Nevada  2.9  -1.7   False
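Columns can be assigned by name, using a scalar, an array, or a Series. When a Series is assigned, its labels are aligned with the DataFrame's index, and NaN is inserted wherever a label is missing. Assigning to a column that does not exist creates it.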
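The three kinds of assignment can be condensed into a short sketch. The small `frame` and the `partial` Series are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

frame['debt'] = 16.5            # a scalar is broadcast to every row
frame['debt'] = np.arange(3)    # an array must match the number of rows

# A Series is aligned on its labels; 'one' has no match and becomes NaN.
partial = pd.Series([-1.2, -1.7], index=['two', 'three'])
frame['debt'] = partial
```

Alignment by label is what distinguishes the Series case from the array case: the order of the values in `partial` does not matter, only their labels do.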
    In [69]: del fram2['eastern']

    In [70]: fram2
    Out[70]:
           year   state  pop  debt
    one    2000    Ohio  1.5   NaN
    two    2001    Ohio  1.7  -1.2
    three  2002    Ohio  3.6   NaN
    four   2001  Nevada  2.4  -1.5
    five   2002  Nevada  2.9  -1.7
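The `del` keyword deletes a column. A column returned by indexing is a view on the underlying data, not a copy, so any in-place modification to it is reflected in the original DataFrame.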
    In [1]: obj = pd.Series(range(3),index=['a','b','c'])

    In [2]: obj
    Out[2]:
    a    0
    b    1
    c    2
    dtype: int64

    In [3]: index = obj.index

    In [4]: index
    Out[4]: Index([u'a', u'b', u'c'], dtype='object')

    In [5]: index[1:]
    Out[5]: Index([u'b', u'c'], dtype='object')
Index objects are immutable. This is what makes it safe to share a single Index object among multiple data structures.
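A minimal sketch of what immutability means in practice: item assignment on an Index raises `TypeError`, while slicing returns a new Index and leaves the original untouched. The `mutated` flag is a hypothetical helper used only to record the outcome.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'      # in-place modification is rejected
    mutated = True
except TypeError:
    mutated = False

# Slicing produces a new Index; the original keeps all three labels.
tail = index[1:]
```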
    In [7]: index = pd.Index(np.arange(3))

    In [8]: index
    Out[8]: Int64Index([0, 1, 2], dtype='int64')

    In [9]: obj2 = pd.Series([1.5,-2.5,0],index=index)

    In [10]: obj2
    Out[10]:
    0    1.5
    1   -2.5
    2    0.0
    dtype: float64

    In [11]: obj2.index is index
    Out[11]: True
pandas ships with built-in Index classes, such as the Int64Index above.
    In [12]: obj = pd.Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])

    In [13]: obj
    Out[13]:
    d    4.5
    b    7.2
    a   -5.3
    c    3.6
    dtype: float64

    In [14]: obj2 = obj.reindex(['a','b','c','d','e'])

    In [15]: obj2
    Out[15]:
    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    NaN
    dtype: float64

    In [16]: obj.reindex(['a','b','c','d','e'],fill_value=0)
    Out[16]:
    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    0.0
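`reindex` creates a new object with the data rearranged to match the new index. Labels that were not present get NaN by default, or the value passed as `fill_value`, such as 0.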
    In [18]: obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4])

    In [19]: obj3
    Out[19]:
    0      blue
    2    purple
    4    yellow
    dtype: object

    In [21]: obj3.reindex(range(6),method='ffill')
    Out[21]:
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object

    In [22]: obj3.reindex(range(6),method='pad')
    Out[22]:
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object
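`ffill` (or `pad`) fills values forward, carrying the last valid value ahead; `bfill` (or `backfill`) fills values backward, using the next valid value.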
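The session only shows forward filling, so here is a small sketch that puts the two directions side by side on the same `obj3` data. The variable names `forward` and `backward` are introduced only for this example.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the value of the previous existing label.
forward = obj3.reindex(range(6), method='ffill')

# Backward fill: each new label takes the value of the next existing label.
# Label 5 has no later label to borrow from, so it stays missing.
backward = obj3.reindex(range(6), method='bfill')
```

Both methods require the original index to be sorted (monotonic), which is the case for `[0, 2, 4]`.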
    In [31]: obj = pd.Series(np.arange(5),index=['a','b','c','d','e'])

    In [32]: obj
    Out[32]:
    a    0
    b    1
    c    2
    d    3
    e    4
    dtype: int64

    In [33]: new_obj = obj.drop('c')

    In [34]: new_obj
    Out[34]:
    a    0
    b    1
    d    3
    e    4
    dtype: int64

    In [35]: obj.drop(['d','c'])
    Out[35]:
    a    0
    b    1
    e    4
    dtype: int64
    In [37]: s1 = pd.Series([7.3,-2.5,3.4,1.5],index=['a','b','c','d'])

    In [38]: s2 = pd.Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])

    In [39]: s1
    Out[39]:
    a    7.3
    b   -2.5
    c    3.4
    d    1.5
    dtype: float64

    In [40]: s2
    Out[40]:
    a   -2.1
    c    3.6
    e   -1.5
    f    4.0
    g    3.1
    dtype: float64

    In [41]: s1 + s2
    Out[41]:
    a    5.2
    b    NaN
    c    7.0
    d    NaN
    e    NaN
    f    NaN
    g    NaN
    dtype: float64
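In the session above, `s1 + s2` yields NaN for every label that appears in only one of the two Series. As a sketch of one way to avoid that, `Series.add` accepts a `fill_value` that stands in for the missing side before the addition. The names `plain` and `filled` are introduced only for this example.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Plain addition: the result index is the union of both indexes,
# and labels present on only one side produce NaN.
plain = s1 + s2

# With fill_value=0, a label missing from one side is treated as 0.
filled = s1.add(s2, fill_value=0)
```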
    In [44]: df1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns=list('abcd'))

    In [46]: df2 = pd.DataFrame(np.arange(20).reshape((4,5)),columns=list('abcde'))

    In [47]: df1 + df2
    Out[47]:
          a     b     c     d   e
    0   0.0   2.0   4.0   6.0 NaN
    1   9.0  11.0  13.0  15.0 NaN
    2  18.0  20.0  22.0  24.0 NaN
    3   NaN   NaN   NaN   NaN NaN

    In [48]: df1.add(df2,fill_value=0)
    Out[48]:
          a     b     c     d     e
    0   0.0   2.0   4.0   6.0   4.0
    1   9.0  11.0  13.0  15.0   9.0
    2  18.0  20.0  22.0  24.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    In [50]: frame = pd.DataFrame(np.random.rand(4,3),columns=list('bde'),
        ...:                      index=['Uta','Ohio','Texas','Oreon'])

    In [51]: f = lambda x: x.max() - x.min()

    In [52]: frame.apply(f)
    Out[52]:
    b    0.173280
    d    0.569717
    e    0.584717
    dtype: float64

    In [53]: frame.apply(f,axis=1)
    Out[53]:
    Uta      0.558883
    Ohio     0.117209
    Texas    0.552407
    Oreon    0.595551
    dtype: float64

    In [54]: def f(x):
        ...:     return pd.Series([x.min(),x.max()],index=['min','max'])
        ...:

    In [55]: frame.apply(f)
    Out[55]:
               b         d         e
    min  0.67871  0.264895  0.239061
    max  0.85199  0.834612  0.823778
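Because the session above uses random data, its numbers cannot be reproduced. The following sketch repeats the `apply` calls on a deterministic frame so the effect of `axis` is easy to check by hand. The function name `spread` and the `np.arange` data are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Deterministic data: rows are [0,1,2], [3,4,5], [6,7,8], [9,10,11].
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def spread(x):
    """Return the difference between the largest and smallest value."""
    return x.max() - x.min()

per_column = frame.apply(spread)           # default: called once per column
per_row = frame.apply(spread, axis=1)      # axis=1: called once per row
```

Each column spans 9 (for example 0 to 9 in `b`) and each row spans 2, so `per_column` is indexed by the column names and `per_row` by the row labels.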
    In [59]: format = lambda x:'%.2f' % x

    In [60]: frame.applymap(format)
    Out[60]:
              b     d     e
    Uta    0.79  0.26  0.82
    Ohio   0.73  0.71  0.61
    Texas  0.85  0.70  0.30
    Oreon  0.68  0.83  0.24
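`applymap` applies a function to every element of a DataFrame. Recent pandas releases deprecate `applymap` in favor of `DataFrame.map`, so the sketch below uses `Series.map`, which is available in both old and new versions, to get the same element-wise formatting. The sample values and the name `two_decimals` are hypothetical.

```python
import pandas as pd

# Hypothetical sample values, fixed so the result can be checked by hand.
frame = pd.DataFrame({'b': [0.7912, 0.7349], 'e': [0.8161, 0.6104]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format a number as a string with two decimal places."""
    return '%.2f' % x

# Element-wise formatting of a single column.
formatted_e = frame['e'].map(two_decimals)

# Element-wise formatting of the whole frame, one column at a time.
formatted = frame.apply(lambda col: col.map(two_decimals))
```

Naming the function `two_decimals` instead of `format` also avoids shadowing Python's built-in `format`.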