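pandas provides high-level data structures and manipulation tools that make data analysis faster and easier. It is built on NumPy and makes NumPy-centric applications simpler to write.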
Series
In [2]: from pandas import Series, DataFrame
In [3]: obj = Series([4, 7, -5, 3])
In [4]: obj
Out[4]:
0    4
1    7
2   -5
3    3
dtype: int64
In [7]: obj[obj>2]
Out[7]:
0    4
1    7
3    3
dtype: int64
In [8]: obj*2
Out[8]:
0     8
1    14
2   -10
3     6
dtype: int64
In [9]: 0 in obj
Out[9]: True
In [10]: sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
In [11]: obj3 = Series(sdata)
In [12]: obj3
Out[12]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
In [14]: import pandas as pd
In [16]: pd.isnull(obj3)
Out[16]:
Ohio      False
Oregon    False
Texas     False
Utah      False
dtype: bool
In [17]: obj3.name = 'population'
In [18]: obj3.index.name = 'state'
In [19]: obj3
Out[19]:
state
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
Name: population, dtype: int64
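One of the most important features of Series: in arithmetic operations it automatically aligns data by index label, even when the two indexes differ.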
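A minimal sketch of this alignment behavior. The two Series and their values are made up for illustration; only the label-matching rule comes from the text above.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative values).
s1 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000})
s2 = pd.Series({'Texas': 1000, 'Oregon': 2000, 'Utah': 5000})

# Addition matches values by label, not by position.
total = s1 + s2
print(total)

# Labels present on only one side ('Ohio', 'Utah') come out as NaN.
print(total.isna())
```

The result carries the union of both indexes, which is why unmatched labels appear with NaN instead of being dropped.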
DataFrame
from pandas import Series, DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2001],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2001
frame2 = DataFrame(data,columns=['year', 'state', 'pop'])
print (frame2['state'])
print (frame2.year)
Output:
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
0    2000
1    2001
2    2002
3    2001
4    2001
Name: year, dtype: int64
frame3 = DataFrame(data,columns=['year', 'state', 'pop', 'debt'])
print (frame3)
Output:
   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2001  Nevada  2.9  NaN
Columns can be modified by assignment:
import numpy as np
frame3['debt'] = np.arange(5.0)
print (frame3)
Output:
year state pop debt
0 2000 Ohio 1.5 0.0
1 2001 Ohio 1.7 1.0
2 2002 Ohio 3.6 2.0
3 2001 Nevada 2.4 3.0
4 2001 Nevada 2.9 4.0
val = Series([-1.2, -1.5, -1.7], index=[ 0, 2, 3])
frame3['debt'] = val
print(frame3)
Output:
   year   state  pop  debt
0  2000    Ohio  1.5  -1.2
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6  -1.5
3  2001  Nevada  2.4  -1.7
4  2001  Nevada  2.9   NaN
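Note: a column returned by indexing is a view on the underlying data, not a copy, so in-place changes to it can show up in the DataFrame. With copy-on-write enabled in newer pandas versions, the returned column behaves like a copy instead.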
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 2.5, 2001: 1.7, 2002: 3.6}}
frame4 = DataFrame(pop)
print(frame4)
Output:
Nevada Ohio
2000 NaN 2.5
2001 2.4 1.7
2002 2.9 3.6
Transpose:
frame4.T
frame4 = DataFrame(pop)
frame4.index.name = 'year'
frame4.columns.name = 'state'
print(frame4)
Output:
state Nevada Ohio
year
2000 NaN 2.5
2001 2.4 1.7
2002 2.9 3.6
print (frame4.values)
# Output
[[ nan  2.5]
 [ 2.4  1.7]
 [ 2.9  3.6]]
Reindexing
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)
# Output
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
print(obj2)
# Output
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3.reindex(range(6), method='ffill'))
# Output
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
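Argument (method option of reindex) | Description |
ffill or pad | Fill forward: carry the previous value forward |
bfill or backfill | Fill backward: carry the next value backward |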
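To make the difference between the two fill directions concrete, here is a small sketch that reuses the obj3 Series from the example above; the names forward and backward are illustrative.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill: each new label takes the last known value before it.
forward = obj3.reindex(range(6), method='ffill')

# bfill: each new label takes the next known value after it;
# label 5 has nothing after it, so it stays NaN.
backward = obj3.reindex(range(6), method='bfill')

print(forward)
print(backward)
```

Both options require the original index to be sorted (monotonic), which is the case here.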
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
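frame2 = frame.reindex(['a', 'b', 'c', 'd'])
states = ['Texas', 'Utah', 'California']
frame3 = frame.reindex(columns=states)
Older pandas could reindex both axes more concisely through the label indexing of ix (frame.ix[['a', 'b', 'c', 'd'], states]). ix was deprecated in pandas 0.20 and removed in 1.0, so on current versions pass both axes to reindex:
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)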
Dropping entries from an axis
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
print(data.drop(['Colorado', 'Ohio']))          # drop rows by label
print(data.drop(['two', 'three'], axis=1))      # drop columns by label
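Indexing, selection, and filtering
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']
obj.iloc[1]              # by position; plain obj[1] is deprecated when the index holds labels
obj[2:4]
obj[['b', 'a', 'd']]     # a list of labels needs its own brackets
obj.iloc[[1, 3]]
obj[obj < 2]
obj['b':'c']             # label slices include the end point
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data[['two', 'three']]
# Output
          two  three
Ohio        1      2
Colorado    5      6
Utah        9     10
New York   13     14
data[:2]
# Output
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three'] > 5]
# Output
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15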
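Type | Description |
obj[val] | Select a single column or a group of columns from a DataFrame |
obj.loc[val] | Select a single row or a subset of rows from a DataFrame by label |
obj.loc[:, val] | Select a single column or a subset of columns |
obj.loc[val1, val2] | Select rows and columns at the same time |
Older pandas wrote these selections with obj.ix; ix was removed in pandas 1.0, and loc (labels) and iloc (integer positions) replace it.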
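The selection forms in the table can be sketched as follows on the data frame used earlier. This sketch uses loc and iloc, the accessors available in current pandas, since ix no longer exists; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

cols = data[['two', 'three']]          # a group of columns
row = data.loc['Colorado']             # a single row, by label
col = data.loc[:, 'three']             # a single column, by label
block = data.loc[['Colorado', 'Utah'], ['one', 'four']]  # rows and columns together
by_pos = data.iloc[2, [3, 0]]          # third row, columns at positions 3 and 0

print(block)
print(by_pos)
```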
Arithmetic and data alignment
df1 = DataFrame(np.arange(12.).reshape((3,4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4,5)), columns=list('abcde'))
print(df1 + df2)
print(df1.add(df2, fill_value=0))
# Output
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1.reindex(columns=df2.columns, fill_value=0)
# Output
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0
Function application and mapping
frame = DataFrame(np.random.randn(4,3), columns=list('bde'), index=[1,2,3,4])
# Example: NumPy ufuncs (element-wise array functions) also work on pandas objects
np.abs(frame)
print(frame)
f = lambda x: x.max() - x.min()
print(frame.apply(f))           # f is applied to each column
print(frame.apply(f, axis=1))   # f is applied to each row
# Output
          b         d         e
1  0.756197  2.094682 -2.139083
2  0.231391 -0.682302  0.908212
3  0.447181 -0.172543  0.221593
4 -0.500720  0.248337 -0.034403
b    1.256918
d    2.776984
e    3.047295
dtype: float64
1    4.233765
2    1.590513
3    0.619723
4    0.749058
dtype: float64
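format = lambda x: '%.2f' % x
print(frame.applymap(format))   # applymap was renamed DataFrame.map in pandas 2.1 and is deprecated there
The method is called applymap because Series has a map method for applying an element-wise function:
print(frame['e'].map(format))
# Output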
b d e
1 -0.84 0.10 1.07
2 0.56 -0.49 -1.98
3 -0.60 -1.57 -1.42
4 0.53 -0.85 -0.01
1 1.07
2 -1.98
3 -1.42
4 -0.01
Name: e, dtype: object
Sorting and ranking
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
print(obj.sort_index())
# Output
a    1
b    2
c    3
d    0
dtype: int64
Sorting is ascending by default; for descending order:
obj.sort_index(ascending=False)
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame.sort_values(by=['a', 'b']))
# Output
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
Axis indexes with duplicate values
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
The index's is_unique attribute tells you whether its values are unique:
obj.index.is_unique
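A short sketch of what a duplicated index means for selection, using the same obj as above: a label that occurs more than once returns a Series, while a label that occurs once returns a scalar. The names dup and single are illustrative.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

print(obj.index.is_unique)   # False, because 'a' and 'b' repeat

dup = obj['a']      # label appears twice: result is a Series of two values
single = obj['c']   # label appears once: result is a scalar

print(dup)
print(single)
```

Code that indexes by label may therefore need to handle both return types when the index is not guaranteed to be unique.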
Summarizing and computing descriptive statistics
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
print(df)
print(df.sum())
print(df.sum(axis=1))   # axis: the axis to reduce over; 0 for DataFrame rows, 1 for columns
# Output
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
one    9.25
two   -5.80
dtype: float64
a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64
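NA values are excluded automatically unless the entire slice (here, a row or a column) is NA. Pass skipna=False to disable this. The NaN shown for row c above comes from an older pandas release; since pandas 0.22, the sum of an all-NA slice is 0 unless min_count=1 is passed.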
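The effect of skipna can be sketched on the same df. The min_count line is an addition that assumes pandas 0.22 or later, where an all-NA sum returns 0 by default; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default: NA values are skipped, so row 'a' sums to 1.4.
row_sums = df.sum(axis=1)

# skipna=False: any NA in a row makes that row's sum NaN.
strict = df.sum(axis=1, skipna=False)

# min_count=1: require at least one non-NA value, so the all-NA row 'c' gives NaN.
na_if_empty = df.sum(axis=1, min_count=1)

print(row_sums)
print(strict)
print(na_if_empty)
```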
Correlation and covariance
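Unique values, value counts, and membership
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
print(uniques)
# Output
['c' 'a' 'd' 'b']
The unique values are not returned in sorted order; call uniques.sort() to sort them in place if needed.
print(obj.value_counts())
# Output
c    3
a    3
b    2
d    1
dtype: int64
The result is sorted by frequency in descending order.
pd.value_counts(obj.values, sort=False)   # top-level form; deprecated since pandas 2.1 in favor of obj.value_counts(sort=False)
mask = obj.isin(['b', 'c'])
print(mask)
# Output
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool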
Handling missing data
string_data = Series(['aa', 'bb', np.nan, 'cc'])
print(string_data.isnull())
# Output
0    False
1    False
2     True
3    False
dtype: bool
Python's built-in None value is also treated as NA.
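Method | Description |
dropna | Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated |
fillna | Fills missing data with a specified value or with an interpolation method (such as ffill or bfill) |
isnull | Returns an object of Boolean values indicating which values are missing/NA; the result has the same type as the source object |
notnull | Negation of isnull |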
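The four methods in the table can be sketched together as below. The Series and DataFrame contents are made up for illustration, and thresh is the dropna parameter that sets the tolerance threshold mentioned in the table.

```python
import numpy as np
import pandas as pd

# None and np.nan are both reported as missing.
s = pd.Series(['aa', None, np.nan, 'cc'])
mask = s.isnull()
present = s.notnull()

frame = pd.DataFrame({'x': [1.0, np.nan, np.nan],
                      'y': [2.0, 5.0, np.nan],
                      'z': [3.0, 6.0, 9.0]})

# thresh=2 keeps only rows with at least two non-NA values.
kept = frame.dropna(thresh=2)

# fillna returns a new object; frame itself is unchanged.
filled = frame.fillna(0)

print(mask)
print(kept)
print(filled)
```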
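Filtering out missing data
data = Series([1, np.nan, 3.5, np.nan, 7])
print(data.dropna())
# Output
0    1.0
2    3.5
4    7.0
dtype: float64
For a DataFrame (such as df above), passing how='all' drops only the rows that are entirely NA:
df.dropna(how='all')
To drop columns in the same way:
df.dropna(axis=1, how='all')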
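Filling in missing data
df.fillna(0)
df.fillna({1: 0.5, 3: -1})   # dict keys are column labels, giving a different fill value per column
Note: fillna returns a new object by default, but it can also modify the existing object in place:
df.fillna(0, inplace=True)   # modifies df itself; the return value differs across pandas versions, so do not rely on it
# df.ffill()
df.ffill(limit=2)   # limit: the maximum number of consecutive values to fill (for forward and backward fills); older code writes df.fillna(method='ffill', limit=2)
# Output
          0         1         2
0 -2.512334  0.566809 -0.310823
1  1.480002  0.089615 -0.876138
2  1.446703  0.089615 -0.705835
3 -0.241433  0.089615  0.897966
4  0.028517       NaN  0.897966
5 -0.045509       NaN  0.897966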
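A small deterministic sketch of the limit behavior shown in the output above. The frame contents are made up, and it uses DataFrame.ffill, which is the spelling current pandas prefers over fillna(method='ffill').

```python
import numpy as np
import pandas as pd

# Column 1 has four consecutive missing values after row 1.
df = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   1: [0.5, 0.1, np.nan, np.nan, np.nan, np.nan]})

# Forward-fill at most two consecutive gaps; the remaining two stay NaN.
limited = df.ffill(limit=2)

# A dict fills per column: only column 1 is touched here.
by_col = df.fillna({1: -1.0})

print(limited)
print(by_col)
```

Neither call modifies df, since both return new objects by default.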
Hierarchical indexing
data = Series(np.random.randn(10), index=[['a','a','a','b','b','b','c','c','d','d'], [1,2,3,1,2,3,1,2,1,2]])
print(data)
print(data.index)
# Output
a  1   -0.041274
   2    1.288469
   3   -0.731776
b  1    1.411813
   2   -1.108839
   3   -0.689199
c  1   -0.185918
   2    0.803436
d  1    0.094525
   2    0.692778
dtype: float64
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]])
data['b']
data['b':'c']
data.loc[['b', 'd']]   # written data.ix[['b', 'd']] in older pandas; ix was removed in 1.0
Selecting from the inner level:
data.loc[:, 2]
# Output
a    0.780424
b    0.034343
c   -1.635875
d   -1.068981
dtype: float64
Reordering levels
frame = DataFrame(np.arange(12).reshape(4,3),
                  index=[['a','a','b','b'], [1,2,1,2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
# Output
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
frame.swaplevel('key1', 'key2')
frame.sort_index(level=1)
# Output
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.swaplevel(0, 1).sort_index(level=0)
# Output
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
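Summary statistics by level
# Example (older pandas wrote frame.sum(level='key2'); the level argument was removed in pandas 2.0)
frame.groupby(level='key2').sum()
# Output
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
# Older pandas: frame.sum(level='color', axis=1)
frame.T.groupby(level='color').sum().T
# Output
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10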
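On current pandas, where sum no longer accepts a level argument, the same per-level totals can be computed with groupby. The sketch below rebuilds the frame from the previous section; transposing before grouping is one way (an assumption of this sketch, not the original author's code) to aggregate over a column level.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same value of the 'key2' index level.
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same 'color': transpose, group, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```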
Using a DataFrame's columns
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
print(frame)
# Output
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
frame2 = frame.set_index(['c', 'd'])
# Output
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
frame2.reset_index()
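A sketch of the round trip between columns and index levels, using the same frame. The drop=False variant is an addition for illustration: it keeps the columns in place while also using them as the index.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# Columns 'c' and 'd' move into a two-level index.
frame2 = frame.set_index(['c', 'd'])

# drop=False keeps 'c' and 'd' as ordinary columns as well.
kept = frame.set_index(['c', 'd'], drop=False)

# reset_index moves the index levels back into columns (placed first).
restored = frame2.reset_index()

print(frame2)
print(restored)
```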