pandas Basics (Part 1)

pandas Data Structures: Series

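A Series is a one-dimensional array-like object made up of a sequence of values and an associated set of data labels, called its index. The sessions below assume that import pandas as pd and import numpy as np have already been run.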
In [15]: obj = pd.Series([4,7,-5,3])

In [16]: obj
Out[16]: 
0    4
1    7
2   -5
3    3
dtype: int64

In [17]: obj.values
Out[17]: array([ 4,  7, -5,  3])

In [18]: obj.index
Out[18]: RangeIndex(start=0, stop=4, step=1)

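The index is displayed on the left and the values on the right. Because no index was given, pandas created a default one made up of the integers 0 through N - 1.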

In [21]: obj2 = pd.Series([4,7,-5,3],index=['a','b','c','d'])

In [22]: obj2
Out[22]: 
a    4
b    7
c   -5
d    3
dtype: int64

In [23]: obj2.index
Out[23]: Index([u'a', u'b', u'c', u'd'], dtype='object')

In [24]: obj2['a']
Out[24]: 4

In [25]: obj2[obj2>0]
Out[25]: 
a    4
b    7
d    3
dtype: int64

In [26]: obj2*2
Out[26]: 
a     8
b    14
c   -10
d     6
dtype: int64

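You can specify the index yourself. Selecting by label, filtering with a boolean condition, and arithmetic all preserve the link between each label and its value.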

In [27]: 'b' in obj2
Out[27]: True

In [28]: 'f' in obj2
Out[28]: False

In [29]: '4' in obj2
Out[29]: False

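A Series can be treated as a fixed-length, ordered dict: it is a mapping from index labels to data values. The in operator therefore tests the labels, not the values.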
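As a minimal sketch of this dict-like behavior, the snippet below rebuilds `obj2` from the session above. The variable names `has_b`, `has_4`, `value_present`, `as_dict`, and `round_trip` are illustrative and not part of the original session.

```python
import pandas as pd

# Same data as obj2 above: values with explicit string labels.
obj2 = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])

# `in` tests membership among the index labels, like dict keys.
has_b = 'b' in obj2
has_4 = 4 in obj2            # 4 is a value, not a label

# To test the values, go through the underlying array instead.
value_present = 4 in obj2.values

# A Series converts to and from a plain dict.
as_dict = obj2.to_dict()
round_trip = pd.Series(as_dict)
```

Checking `obj2.values` explicitly avoids the common mistake of expecting `in` to search the data.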

In [31]: sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':500}

In [32]: obj3 = pd.Series(sdata)

In [33]: obj3
Out[33]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah        500
dtype: int64

In [37]: sdata_index = ['a','b','c','Texas']

In [38]: obj4 = pd.Series(sdata,index=sdata_index)

In [39]: obj4
Out[39]: 
a            NaN
b            NaN
c            NaN
Texas    71000.0
dtype: float64

In [40]: pd.isnull(obj4)
Out[40]: 
a         True
b         True
c         True
Texas    False
dtype: bool

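When an index is passed along with a dict, only the labels that match a dict key receive a value; the rest become NaN. pd.isnull checks which values are missing (pd.notnull is its inverse).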
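The missing-data check can be sketched as follows. This is a small illustration using the same `sdata` dict; the names `missing`, `present`, `n_missing`, and `cleaned` are introduced here for the example only.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 500}

# Only 'Texas' matches a key in sdata; the other labels get NaN.
obj4 = pd.Series(sdata, index=['a', 'b', 'c', 'Texas'])

missing = pd.isnull(obj4)        # True where the value is NaN
present = pd.notnull(obj4)       # the inverse mask
n_missing = int(missing.sum())   # each True counts as 1

# Keep only the labels that actually carry data.
cleaned = obj4.dropna()
```

Because the missing labels force NaN into the result, the values are stored as float64 rather than int64.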

DataFrame

A DataFrame is a tabular data structure that contains an ordered collection of columns, each of which can hold a different value type.
In [44]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],'year'
    ...: :[2000,2001,2002,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9]}

In [45]: data
Out[45]: 
{'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002]}

In [46]: fram = pd.DataFrame(data)

In [47]: fram
Out[47]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

In [48]: pd.DataFrame(data,columns=['year','state','pop'])
Out[48]: 
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

Passing columns lets you specify the order in which the columns are arranged.

In [51]: fram2 = pd.DataFrame(data,columns=['year','state','pop','debt']
    ...: ,index=['one','two','three','four','five'])

In [52]: fram2
Out[52]: 
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

In [53]: fram2['state']
Out[53]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [54]: fram2.year
Out[54]: 
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [56]: fram2.loc['four']
Out[56]: 
year       2001
state    Nevada
pop         2.4
debt        NaN
Name: four, dtype: object

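A DataFrame column can be retrieved by name, either with dict-like notation or as an attribute. Rows can be retrieved by label with the label-based indexer (loc in current pandas; the older ix indexer has since been removed).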
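The access styles above can be sketched side by side. This example assumes a current pandas release where `loc` and `iloc` are the row indexers; the variable names are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_key = frame2['state']            # column by dict-like key
by_attr = frame2.year               # column by attribute access
row_by_label = frame2.loc['four']   # row by index label
row_by_position = frame2.iloc[3]    # the same row by integer position

# Attribute access fails silently for names that collide with methods:
# frame2.pop is the DataFrame.pop method, so use the key form instead.
pop_col = frame2['pop']
```

The dict-like form works for any column name, which is why it is the safer default in scripts.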

In [58]: fram2['debt'] = 16.5

In [59]: fram2
Out[59]: 
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
In [60]: fram2['debt'] = np.arange(5)

In [61]: fram2
Out[61]: 
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
In [62]: va = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
In [64]: fram2['debt'] = va

In [65]: fram2
Out[65]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
In [67]: fram2['eastern'] = fram2.state == 'Ohio'

In [68]: fram2
Out[68]: 
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False

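Columns are assigned by name. A scalar is broadcast to every row, and an array must match the length of the DataFrame. When you assign a Series, its labels are aligned with the DataFrame's index, and rows with no matching label are set to NaN.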
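A compact sketch of the three assignment cases, using a smaller three-row frame than the session above. The label `'nine'` is a deliberately unmatched label added for illustration.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002]},
                      index=['one', 'two', 'three'])

frame2['debt'] = 16.5            # scalar is broadcast to every row
frame2['debt'] = np.arange(3)    # array length must match the row count

# A Series is aligned on index labels: 'one' and 'three' have no match
# and become NaN, while the unknown label 'nine' is ignored.
val = pd.Series([-1.2, -1.5], index=['two', 'nine'])
frame2['debt'] = val
```

Alignment happens on labels only, so the order of the entries in `val` does not matter.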

In [69]: del fram2['eastern']

In [70]: fram2
Out[70]: 
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

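del removes a column. Note that the column returned by indexing a DataFrame is a view on the underlying data, not a copy, so in-place modifications to it are reflected in the DataFrame. This holds for the pandas versions this post targets; releases with Copy-on-Write enabled treat the returned column as a copy. Use the Series copy method when you need an independent copy.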
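The sketch below shows column removal together with an explicit copy, which behaves the same whether or not the installed pandas uses Copy-on-Write. The frame and the names `debt_copy`, `frame_first_debt`, `remaining`, and `without_year` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'debt': [1.0, 2.0]})

# An explicit copy is independent of the frame in every pandas version.
debt_copy = frame['debt'].copy()
debt_copy.iloc[0] = 99.0
frame_first_debt = frame['debt'].iloc[0]   # unchanged by the edit above

del frame['debt']                # removes the column in place
remaining = list(frame.columns)

# drop(columns=...) returns a new frame and leaves the original alone.
without_year = frame.drop(columns=['year'])
```

Preferring `copy()` for anything you intend to modify avoids relying on view semantics that differ between releases.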

Index Objects

In [1]: obj = pd.Series(range(3),index=['a','b','c'])

In [2]: obj
Out[2]: 
a    0
b    1
c    2
dtype: int64

In [3]: index = obj.index

In [4]: index
Out[4]: Index([u'a', u'b', u'c'], dtype='object')

In [5]: index[1:]
Out[5]: Index([u'b', u'c'], dtype='object')

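Index objects are immutable and cannot be modified by the user. Immutability is what makes it safe to share one Index object among multiple data structures.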
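Immutability can be checked directly. In this minimal sketch, the flag `mutated` and the name `extended` are illustrative.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'               # item assignment is not supported
    mutated = True
except TypeError:
    mutated = False

# Operations that appear to change an Index return a new object instead.
extended = index.append(pd.Index(['d']))
```

The original `index` still holds its three labels after both operations.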

In [7]: index = pd.Index(np.arange(3))

In [8]: index
Out[8]: Int64Index([0, 1, 2], dtype='int64')

In [9]: obj2 = pd.Series([1.5,-2.5,0],index=index)

In [10]: obj2
Out[10]: 
0    1.5
1   -2.5
2    0.0
dtype: float64

In [11]: obj2.index is index
Out[11]: True

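The pandas library has several Index classes built in. Passing an existing Index to a constructor reuses the same object rather than copying it.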

In [12]: obj = pd.Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])

In [13]: obj
Out[13]: 
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [14]: obj2 = obj.reindex(['a','b','c','d','e'])

In [15]: obj2
Out[15]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [16]: obj.reindex(['a','b','c','d','e'],fill_value=0)
Out[16]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

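reindex creates a new object with the data rearranged to match the new index. Labels that were not present before get NaN, unless you pass fill_value to substitute another value such as 0.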

In [18]: obj3 = pd.Series(['blue','purple','yellow'],index=[0,2,4])

In [19]: obj3
Out[19]: 
0      blue
2    purple
4    yellow
dtype: object
In [21]: obj3.reindex(range(6),method='ffill')
Out[21]: 
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [22]: obj3.reindex(range(6),method='pad')
Out[22]: 
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

method='ffill' (or 'pad') fills values forward; method='bfill' (or 'backfill') fills values backward.
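Since the session only demonstrates forward filling, here is a short sketch comparing it with backward filling on the same `obj3` data. The names `forward` and `backward` are illustrative.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each gap takes the last value seen before it.
forward = obj3.reindex(range(6), method='ffill')

# Backward fill: each gap takes the next value after it. Label 5 has
# nothing after it, so it stays missing.
backward = obj3.reindex(range(6), method='bfill')
```

Both methods need the original index to be sorted, which holds here because the labels are 0, 2, 4.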

Dropping Entries from an Axis

In [31]: obj = pd.Series(np.arange(5),index=['a','b','c','d','e'])

In [32]: obj
Out[32]: 
a    0
b    1
c    2
d    3
e    4
dtype: int64

In [33]: new_obj = obj.drop('c')

In [34]: new_obj
Out[34]: 
a    0
b    1
d    3
e    4
dtype: int64

In [35]: obj.drop(['d','c'])
Out[35]: 
a    0
b    1
e    4
dtype: int64
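The same `drop` method works on either axis of a DataFrame. The following is a sketch with made-up data; the frame `data` and its labels are not from the original session.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

rows_dropped = data.drop(['Colorado', 'Ohio'])      # axis=0 (rows) by default
cols_dropped = data.drop(['two', 'four'], axis=1)   # axis=1 drops columns
```

In both cases `drop` returns a new object, so `data` itself keeps all four rows and columns.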

Arithmetic and Data Alignment

In [37]: s1 = pd.Series([7.3,-2.5,3.4,1.5],index=['a','b','c','d'])

In [38]: s2 = pd.Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])

In [39]: s1
Out[39]: 
a    7.3
b   -2.5
c    3.4
d    1.5
dtype: float64

In [40]: s2
Out[40]: 
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [41]: s1 + s2
Out[41]: 
a    5.2
b    NaN
c    7.0
d    NaN
e    NaN
f    NaN
g    NaN
dtype: float64
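The alignment rule behind this result can be sketched as follows: the result index is the union of both indexes, and only labels present on both sides get a value. The names `total` and `overlap` are illustrative.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Labels are matched by name, not by position.
total = s1 + s2

# Only 'a' and 'c' exist in both Series, so only they survive dropna().
overlap = total.dropna()
```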

Arithmetic Methods with Fill Values

In [44]: df1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns=list('abcd'))

In [46]: df2 = pd.DataFrame(np.arange(20).reshape((4,5)),columns=list('abcde'))

In [47]: df1 + df2
Out[47]: 
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

In [48]: df1.add(df2,fill_value=0)
Out[48]: 
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
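One detail worth illustrating: `fill_value` substitutes for a value that is missing on one side only. Where both operands are missing, the result stays NaN. The small frames below are made up for this sketch.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})
df2 = pd.DataFrame({'a': [10.0, np.nan], 'b': [20.0, 40.0]})

# Row 0, column 'b': df1 is missing, so it is treated as 0 -> 20.0.
# Row 1, column 'a': both sides are missing, so the result stays NaN.
result = df1.add(df2, fill_value=0)
```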

Function Application and Mapping

In [50]: frame = pd.DataFrame(np.random.rand(4,3),columns=list('bde'),index=['Uta','Ohio','Texas','Oreon'])

In [51]: f = lambda x: x.max() - x.min()

In [52]: frame.apply(f)
Out[52]: 
b    0.173280
d    0.569717
e    0.584717
dtype: float64

In [53]: frame.apply(f,axis=1)
Out[53]: 
Uta      0.558883
Ohio     0.117209
Texas    0.552407
Oreon    0.595551
dtype: float64

In [54]: def f(x):
    ...:     return pd.Series([x.min(),x.max()],index=['min','max'])
    ...: 

In [55]: frame.apply(f)
Out[55]: 
           b         d         e
min  0.67871  0.264895  0.239061
max  0.85199  0.834612  0.823778

In [59]: fmt = lambda x:'%.2f' % x

In [60]: frame.applymap(fmt)
Out[60]: 
          b     d     e
Uta    0.79  0.26  0.82
Ohio   0.73  0.71  0.61
Texas  0.85  0.70  0.30
Oreon  0.68  0.83  0.24
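The session above uses random data, so its numbers cannot be reproduced. The sketch below repeats the same three operations on deterministic values. The helper `spread` and the `elementwise` fallback are illustrative; the fallback is there because `DataFrame.applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def spread(x):
    """Range of a row or column: max minus min."""
    return x.max() - x.min()

per_column = frame.apply(spread)          # one result per column
per_row = frame.apply(spread, axis=1)     # one result per row

# Element-wise application: use DataFrame.map where it exists,
# otherwise fall back to the older applymap.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(lambda x: '%.2f' % x)

# Series.map does the same for a single column.
formatted_e = frame['e'].map(lambda x: '%.2f' % x)
```

`apply` passes whole columns or rows to the function, while the element-wise variants pass one scalar at a time.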