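Introduction to pandas data structures
pandas has two main data structures: Series and DataFrame.
Series
A Series is similar to a one-dimensional array: it holds data plus an index. An integer index is created by default.
The data and the index are available through the values and index attributes.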
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj=Series([4,7,-5,3])
obj
0 4
1 7
2 -5
3 3
dtype: int64
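As a small sketch of the values and index attributes mentioned above (the variable names simply mirror the example, nothing else is assumed):

```python
from pandas import Series

obj = Series([4, 7, -5, 3])

# .values exposes the underlying data as a NumPy array
print(obj.values)

# .index exposes the index; with no index given it is the integer range 0..n-1
print(obj.index)
```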
To supply a custom index, pass a list as the index argument, for example:
obj2=Series([4,7,-5,3],index=['b','d','a','c'])
obj2
b 4
d 7
a -5
c 3
dtype: int64
Index labels select a single value or a set of values from a Series; the argument is either one label or a list of labels:
obj2[['a','b','c']]
a -5
b 4
c 3
dtype: int64
A Series is similar to a dict: there is a mapping between index labels and values. A Series can be created directly from a dict.
'b' in obj2
True
sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=Series(sdata)
obj3
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
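In the example above only a dict is passed in, so the index of the Series is the set of keys of the original dict. If an index is supplied that does not match those keys, NaN appears for the labels that have no value. Handling NaN is covered in more detail later.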
states=['California','Ohio','Oregon','Texas']
obj4=Series(sdata,index=states)
obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
pd.isnull(obj4)
California True
Ohio False
Oregon False
Texas False
dtype: bool
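A minimal sketch of the method forms of the same check, and of using the result as a filter. It rebuilds obj4 from the example above; the names missing, present, and known are only illustrative.

```python
from pandas import Series

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)

missing = obj4.isnull()    # method form of pd.isnull(obj4)
present = obj4.notnull()   # the complement of isnull
known = obj4[present]      # keep only the labels that had a value in sdata
print(known)
```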
DataFrame
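A DataFrame is a tabular data structure that holds an ordered collection of columns, and each column can have a different value type.
A DataFrame has both a row index and a column index.
The usual way to build a DataFrame is to pass a dict of equal-length lists or NumPy arrays: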
data={
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame=DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
If a sequence of columns is specified, the DataFrame's columns are arranged in that order:
DataFrame(data,columns=['year','state','pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
If a requested column has no corresponding data, it is filled with NA values:
frame2=DataFrame(data,columns=['year','state','pop','debt'],
index=['one','two','three','four','five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
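frame2['state'] or frame2.year returns a single column as a Series.
To fetch a row, use the label-based indexer loc, for example frame2.loc['three']. (The ix indexer originally used here has been removed from recent pandas releases.)
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
frame2.loc['three']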
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
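Columns can be modified by assignment. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is aligned exactly to the DataFrame's index, and every position without a match is filled with a missing value: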
frame2['debt']=16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
frame2['debt']=np.arange(5.)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
val=Series([-1.2,-1.5,-1.7],
index=['two','four','five'])
frame2['debt']=val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
Assigning to a column that does not exist creates a new column, and the del keyword deletes a column:
frame2['eastern']=frame2.state=='Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
If a nested dict is used to create a DataFrame, the keys of the outer dict become the columns and the keys of the inner dicts become the row index:
pop={
'Nevada':{2001:2.4,2002:2.9},
'Ohio':{2000:1.5,2001:1.7,2002:3.6}
}
frame3=DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
DataFrame(pop,index=[2001,2002,2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
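Index objects
When a Series or DataFrame is built, any array or other sequence of labels is converted to an Index object. Index objects are immutable, which is what makes it safe to share one Index among several data structures.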
index=pd.Index(np.arange(3))
obj2=Series([1.5,-2.5,0],index=index)
obj2.index is index
True
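Index methods and properties:
append: concatenates another Index object and produces a new Index
difference: computes the set difference and returns an Index (this was called diff in old pandas releases)
delete: removes the element at position i and returns a new Index
drop: removes the passed values and returns a new Index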
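The immutability and the methods listed above can be sketched as follows. The labels are made up for illustration, and the comments show the labels each call is expected to leave.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Index objects are immutable: item assignment raises TypeError
try:
    index[1] = 'z'
    mutated = True
except TypeError:
    mutated = False

appended = index.append(pd.Index(['d']))     # a, b, c, d
diffed = index.difference(pd.Index(['b']))   # a, c
deleted = index.delete(0)                    # b, c (removes position 0)
dropped = index.drop(['c'])                  # a, b (removes the value 'c')
```

Each call returns a new Index and leaves the original untouched.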
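Basic functionality
Reindexing
The reindex method creates a new object that conforms to a new index.
Calling reindex on a Series rearranges the data according to the new index. If an index value is not present in the original, a missing value is introduced, or the value given by fill_value.
The method option fills by interpolation: ffill or pad fills forward, bfill or backfill fills backward.
For example: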
obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0
dtype: float64
obj3=Series(['blue','purple','yellow'],index=[0,2,4])
obj3.reindex(range(6),method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
The columns keyword reindexes the columns, but interpolation can only be applied row-wise, that is, along the index axis.
frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
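The one-liner above depends on a frame whose row labels are 'a', 'c', 'd', which is not the frame defined earlier in this post, and recent pandas may reject method='ffill' when the column labels are not sorted. A hedged sketch that does the two steps separately, using a small hypothetical frame and states list:

```python
import numpy as np
from pandas import DataFrame

# Hypothetical frame for illustration only
frame = DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Step 1: reindex the rows and forward-fill the new label 'b' from 'a'
filled = frame.reindex(['a', 'b', 'c', 'd'], method='ffill')

# Step 2: reindex the columns; 'Utah' has no data, so it becomes NaN
result = filled.reindex(columns=states)
print(result)
```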
Dropping entries from an axis
Call the drop method with a single index label or a list of labels to remove those entries.
obj=Series(np.arange(5.),index=['a','b','c','d','e'])
new_obj=obj.drop(['b','c'])
new_obj
a 0.0
d 3.0
e 4.0
dtype: float64
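drop works the same way on a DataFrame, on either axis. A minimal sketch with an illustrative frame:

```python
import numpy as np
from pandas import DataFrame

data = DataFrame(np.arange(16).reshape(4, 4),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])

rows_dropped = data.drop(['Colorado', 'Ohio'])      # axis=0 (rows) is the default
cols_dropped = data.drop(['two', 'four'], axis=1)   # axis=1 drops columns

# drop returns a new object; data itself still has 4 rows and 4 columns
print(rows_dropped)
print(cols_dropped)
```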
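Indexing, selection, and filtering
Series indexing works like NumPy array indexing, except that index labels other than integers can also be used, for example:
obj=Series(np.arange(4.),index=['a','b','c','d'])
obj['b']
1.0
obj.iloc[1]  # by position; plain obj[1] is deprecated for a label-indexed Series in recent pandas
1.0
obj[2:4]  # a positional slice excludes the end point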
c 2.0
d 3.0
dtype: float64
obj[['b','a','d']]
b 1.0
a 0.0
d 3.0
dtype: float64
obj.iloc[[1,3]]  # a list of positions; recent pandas expects iloc here
b 1.0
d 3.0
dtype: float64
obj[obj>2]
d 3.0
dtype: float64
obj['b':'c']  # a slice that uses labels includes the end point
b 1.0
c 2.0
dtype: float64
obj['b':'c']=5  # assigning to a slice sets every selected value
obj
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
Indexing a DataFrame retrieves one or more columns:
data=DataFrame(np.arange(16).reshape(4,4),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data['two']  # get the column labeled two
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[:2]  # get the first two rows
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three']>5]  # select the rows where column three is greater than 5
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data<5  # boolean comparison: tests every element against 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data<5]=0  # set every element smaller than 5 to 0
data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
To index on the rows of a DataFrame, use the dedicated indexers: .loc for label-based selection and .iloc for position-based selection.
data.loc['Colorado',['two','three']]
two 5
three 6
Name: Colorado, dtype: int32
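To make the difference between the two indexers concrete, here is a small sketch on a fresh copy of the 4x4 frame (before any values were zeroed). The result names are illustrative.

```python
import numpy as np
from pandas import DataFrame

data = DataFrame(np.arange(16).reshape(4, 4),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Colorado', ['two', 'three']]   # row and columns by label
by_position = data.iloc[1, [1, 2]]                  # the same cells by position

label_slice = data.loc[:'Utah', 'two']   # label slices include the end point
pos_slice = data.iloc[:2, 1]             # positional slices exclude it
```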
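Arithmetic between DataFrame and Series
arr=np.arange(12.).reshape(3,4)
arr-arr[0]
# NumPy broadcasts arr[0] across every row. DataFrame and Series arithmetic is similar: by default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows.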
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
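The NumPy example has a direct pandas counterpart. A minimal sketch, with a hypothetical frame: subtracting a row matches on column labels, and sub with axis=0 matches on the row index instead.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(12.).reshape(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

first_row = frame.iloc[0]        # a Series indexed by b, d, e
by_columns = frame - first_row   # aligned on columns, broadcast down the rows

col_d = frame['d']                     # a Series indexed by the row labels
by_rows = frame.sub(col_d, axis=0)     # aligned on rows, broadcast across columns
```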
Function application and mapping
frame=DataFrame(np.random.randn(4,3),
columns=list('bde'),
index=['Utah','Ohio','Texas','Oregon'])
np.abs(frame)
               b         d         e
Utah    0.855613  1.696205  0.503547
Ohio    1.086818  1.448180  1.568419
Texas   0.360607  0.674741  0.590972
Oregon  1.270708  0.461014  0.427092
f=lambda x: x.max()-x.min()
frame.apply(f)  # axis=0 by default, so f is applied to each column (vertically); pass axis=1 to apply it to each row
b 0.910101
d 2.370946
e 2.071966
dtype: float64
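A short sketch of the axis=1 variant and of element-wise mapping on a single column. It swaps the random data for np.arange so that the results are predictable; that substitution is mine, not part of the original example.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(12.).reshape(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

f = lambda x: x.max() - x.min()

per_column = frame.apply(f)          # one value per column: b, d, e
per_row = frame.apply(f, axis=1)     # one value per row: Utah, Ohio, ...

# Series.map applies a function to each element of one column
formatted = frame['e'].map(lambda x: '%.2f' % x)
```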
Sorting and ranking
To sort by the row or column index, use the sort_index method:
obj=Series(range(4),index=['d','a','c','b'])
obj.sort_index()
# sorted by index
a 1
b 3
c 2
d 0
dtype: int64
frame=DataFrame(np.arange(8).reshape(2,4),
index=['three','one'],
columns=['d','a','b','c'])
frame.sort_index()
# 'three' was originally on top and 'one' comes first after sorting, so the default is to sort the row index (axis=0). Pass ascending=False to sort in descending order.
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
To sort a Series by its values, use the sort_values method:
obj=pd.Series(np.random.randn(8))
obj.sort_values()
6 -0.896499
2 -0.827439
3 -0.520070
5 -0.216063
7 0.353973
1 0.400870
0 0.902996
4 1.854120
dtype: float64
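The section title also mentions ranking, which the examples above do not show. A small sketch with made-up data, covering sort_values on a DataFrame and Series.rank with its default tie handling (ties share the average rank):

```python
from pandas import DataFrame, Series

frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

by_b = frame.sort_values(by='b')                 # sort rows by one column
by_a_then_b = frame.sort_values(by=['a', 'b'])   # sort by several columns

obj = Series([7, -5, 7, 4, 2, 0, 4])
ranks = obj.rank()   # the two 7s share rank 6.5, the two 4s share rank 4.5
```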
Summarizing and computing descriptive statistics
df=DataFrame(np.arange(8.).reshape(4,2),
index=['a','b','c','d'],
columns=['one','two'])
df.sum()
# Sums down each column by default (axis=0); pass axis=1 to sum across each row. skipna=True, which is the default, excludes NA values automatically.
one 12.0
two 16.0
dtype: float64
df.describe()
# computes summary statistics for a Series or for each column of a DataFrame
            one       two
count  4.000000  4.000000
mean   3.000000  4.000000
std    2.581989  2.581989
min    0.000000  1.000000
25%    1.500000  2.500000
50%    3.000000  4.000000
75%    4.500000  5.500000
max    6.000000  7.000000
df.cumsum()
# cumulative sum of the values
    one   two
a   0.0   1.0
b   2.0   4.0
c   6.0   9.0
d  12.0  16.0
Correlation and covariance
from pandas_datareader import data as web
all_data={}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
price=DataFrame({tic:data['Adj Close'] for tic,data in all_data.items()})
volume=DataFrame({tic:data['Volume'] for tic,data in all_data.items()})
returns=price.pct_change()
returns.tail()
# This example is not run here: the Yahoo page could not be opened.
---------------------------------------------------------------------------
RemoteDataError Traceback (most recent call last)
<ipython-input-45-5ca20168c7a5> in <module>()
2 all_data={}
3 for ticker in ['AAPL','IBM','MSFT','GOOG']:
----> 4 all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
5 price=DataFrame({tic:data['Adj Close'] for tic,data in all_data.items()})
6 volume=DataFrame({tic:data['Volume'] for tic,data in all_data.items()})
c:\py35\lib\site-packages\pandas_datareader\data.py in get_data_yahoo(*args, **kwargs)
38
39 def get_data_yahoo(*args, **kwargs):
---> 40 return YahooDailyReader(*args, **kwargs).read()
41
42
c:\py35\lib\site-packages\pandas_datareader\yahoo\daily.py in read(self)
113 """ read one data from specified URL """
114 try:
--> 115 df = super(YahooDailyReader, self).read()
116 if self.ret_index:
117 df['Ret_Index'] = _calc_return_index(df['Adj Close'])
c:\py35\lib\site-packages\pandas_datareader\base.py in read(self)
179 if isinstance(self.symbols, (compat.string_types, int)):
180 df = self._read_one_data(self.url,
--> 181 params=self._get_params(self.symbols))
182 # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
183 elif isinstance(self.symbols, DataFrame):
c:\py35\lib\site-packages\pandas_datareader\base.py in _read_one_data(self, url, params)
77 """ read one data from specified URL """
78 if self._format == 'string':
---> 79 out = self._read_url_as_StringIO(url, params=params)
80 elif self._format == 'json':
81 out = self._get_response(url, params=params).json()
c:\py35\lib\site-packages\pandas_datareader\base.py in _read_url_as_StringIO(self, url, params)
88 Open url (and retry)
89 """
---> 90 response = self._get_response(url, params=params)
91 text = self._sanitize_response(response)
92 out = StringIO()
c:\py35\lib\site-packages\pandas_datareader\base.py in _get_response(self, url, params, headers)
137 if params is not None and len(params) > 0:
138 url = url + "?" + urlencode(params)
--> 139 raise RemoteDataError('Unable to read URL: {0}'.format(url))
140
141 def _get_crumb(self, *args):
RemoteDataError: Unable to read URL: https://query1.finance.yahoo.com/v7/finance/download/IBM?crumb=%5Cu002FUftz31NJjj&period1=946656000&interval=1d&period2=1262361599&events=history
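Since the download failed, here is a sketch of the correlation and covariance calls on synthetic data instead. The prices are seeded random walks, not real market data, and the ticker names are only column labels.

```python
import numpy as np
from pandas import DataFrame

rng = np.random.RandomState(0)

# Synthetic stand-in for the 'Adj Close' columns: positive random-walk prices
steps = 0.01 * rng.randn(250, 3)
price = DataFrame(100 * np.exp(steps.cumsum(axis=0)),
                  columns=['AAPL', 'IBM', 'MSFT'])

returns = price.pct_change()   # the first row is NaN and is skipped below

pair_corr = returns['MSFT'].corr(returns['IBM'])   # correlation of two Series
pair_cov = returns['MSFT'].cov(returns['IBM'])     # covariance of two Series

corr_matrix = returns.corr()   # full correlation matrix as a DataFrame
cov_matrix = returns.cov()     # full covariance matrix as a DataFrame
```

Series.corr and Series.cov work on the overlapping, non-NA values of the two Series, while DataFrame.corr and DataFrame.cov return the full pairwise matrices.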
Handling missing data
from numpy import nan as NA
data=Series([1,NA,3.5,NA,7])
data.dropna()
# dropna returns a Series that contains only the non-null data and their index values
0 1.0
2 3.5
4 7.0
dtype: float64
data=DataFrame([
[1.,6.5,3.],[1.,NA,NA],
[NA,NA,NA],[NA,6.5,3.]
])
cleaned=data.dropna()  # for a DataFrame, dropna by default drops every row that contains a missing value;
# passing how='all' drops only the rows that are entirely NA
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
data.fillna(0)
# fill in the missing data
     0    1    2
0  1.0  6.5  3.0
1  1.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  6.5  3.0
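The output of cleaned is not shown above, so here is a small sketch of what the dropna variants keep, plus fillna with a per-column dict. The fill values 0.5 and -1 are arbitrary choices for illustration.

```python
from numpy import nan as NA
from pandas import DataFrame

data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])

any_na = data.dropna()            # keeps only row 0, the one row with no NA
all_na = data.dropna(how='all')   # drops only row 2, which is entirely NA

# A dict fills each column with its own value; column 0 is left untouched
per_col = data.fillna({1: 0.5, 2: -1})
```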
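Hierarchical indexing
Hierarchical indexing puts multiple index levels on a single axis, which makes it possible to work with higher-dimensional data in a lower-dimensional form.
Take a Series as an example: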
data=Series(np.random.randn(10),
index=[
['a','a','a','b','b','b','c','c','d','d'],
[1,2,3,1,2,3,1,2,2,3]
])
data
# the index is a MultiIndex
a 1 0.704940
2 1.034785
3 -0.575555
b 1 1.465815
2 -2.065133
3 -0.191078
c 1 2.251724
2 -1.282849
d 2 0.270976
3 1.014202
dtype: float64
data['b']
1 1.465815
2 -2.065133
3 -0.191078
dtype: float64
data.unstack()
# A Series with a multi-level index can be rearranged into a DataFrame with unstack; the inverse operation is stack
          1         2         3
a  0.704940  1.034785 -0.575555
b  1.465815 -2.065133 -0.191078
c  2.251724 -1.282849       NaN
d       NaN  0.270976  1.014202
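A sketch of partial indexing on both levels and of the unstack/stack round trip. It uses np.arange instead of random data so the values are predictable; that substitution is an assumption for illustration. The explicit dropna after stack is there because, depending on the pandas version, stack may or may not drop the NaN cells by itself.

```python
import numpy as np
from pandas import Series

data = Series(np.arange(10.),
              index=[list('aaabbbccdd'),
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer = data.loc['b':'c']   # select on the outer level with a label slice
inner = data.loc[:, 2]      # select on the inner level: every entry labeled 2

wide = data.unstack()                # 4 x 3 DataFrame with NaN for missing pairs
restacked = wide.stack().dropna()    # back to a Series with a two-level index
```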
For a DataFrame, either axis can have a hierarchical index:
frame=DataFrame(np.arange(12).reshape(4,3),
index=[
['a','a','b','b'],[1,2,1,2]
],
columns=[
['Ohio','Ohio','Colorado'],['Green','Red','Green']
])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame
# every level can have a name
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
Reordering levels
frame.swaplevel('key1','key2')
# swaplevel takes two level numbers or names and returns a new object with those levels interchanged.
frame.sort_index(level=1)
# sort_index can sort by the values in a single level.
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.groupby(level='key2').sum()
# written as frame.sum(level='key2') in older pandas; the level argument was removed in pandas 2.0
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
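Putting the level operations together, here is a self-contained sketch. Aggregating over a column level is done by transposing, grouping, and transposing back, which is one possible approach rather than the only one.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(12).reshape(4, 3),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Swap the row levels, then sort by the new outer level (key2)
swapped = frame.swaplevel('key1', 'key2').sort_index(level=0)

# Sum the rows that share the same key2 value
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share the same color
by_color = frame.T.groupby(level='color').sum().T
```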
If you found this interesting, you can follow my WeChat official account: 一步一步學Python