Getting Started with pandas

Introduction to pandas Data Structures

pandas is built around two main data structures: Series and DataFrame.

Series

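A Series is similar to a one-dimensional array, with data and an index. An integer index is created by default.
The data and the index are available through the values and index attributes.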

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj=Series([4,7,-5,3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
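
The values and index attributes mentioned above can be inspected directly. A minimal sketch reusing the same four numbers (the variable names `s`, `data` and `idx` are only for illustration):

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])

# .values returns the underlying data as a NumPy array
data = s.values
# .index returns the index object; by default it counts 0..n-1
idx = s.index

print(data)
print(list(idx))
```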

To use a custom index, pass a list as the index argument:

obj2=Series([4,7,-5,3],index=['b','d','a','c'])
obj2
b    4
d    7
a   -5
c    3
dtype: int64

You can select a single value or a group of values from a Series by index. The argument is either one index label or a list of labels:

obj2[['a','b','c']]
a   -5
b    4
c    3
dtype: int64

A Series is similar to a dict: it maps index labels to data values. You can create a Series directly from a dict.

'b' in obj2
True
sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=Series(sdata)
obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

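In the example above only a dict is passed, so the Series index consists of the dict's keys. (The sorted key order in the output reflects older pandas versions; recent versions keep the dict's insertion order.) If you pass an index whose labels differ from the keys, any label without a matching key gets NaN. Handling NaN is covered in more detail later.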

states=['California','Ohio','Oregon','Texas']
obj4=Series(sdata,index=states)
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
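
The same check is also available as methods on the Series itself. A small sketch, assuming the same dict and label list as above (the names `s`, `missing` and `present` are illustrative):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
s = pd.Series(sdata, index=states)

missing = s.isnull()     # True where the value is NaN
present = s.notnull()    # the inverse mask
print(s[present])        # keep only the labels that had data
```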

DataFrame

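A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type.
A DataFrame has both a row index and a column index.
A common way to construct a DataFrame is to pass a dict of equal-length lists or NumPy arrays: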

data={
    'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
    'year':[2000,2001,2002,2001,2002],
    'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame=DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

If you specify a sequence of columns, the DataFrame's columns are arranged in that order:

DataFrame(data,columns=['year','state','pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

If you pass a column that has no matching data, it is filled with NA values:

frame2=DataFrame(data,columns=['year','state','pop','debt'],
                 index=['one','two','three','four','five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

frame2['state'] or frame2.year returns a single column as a Series.
To get a row, index by label with loc, for example frame2.loc['three']. (The ix indexer used in older pandas versions was removed in pandas 1.0.)

frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

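Columns can be modified by assignment. When you assign a list or an array to a column, its length must match the length of the DataFrame. When you assign a Series, it is aligned exactly to the DataFrame's index, and any gaps are filled with missing values: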

frame2['debt']=16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
frame2['debt']=np.arange(5.)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
val=Series([-1.2,-1.5,-1.7],
          index=['two','four','five'])
frame2['debt']=val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

Assigning to a column that does not exist creates a new column. The del keyword deletes a column:

frame2['eastern']=frame2.state=='Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

If you build a DataFrame from a nested dict, the outer dict keys become the columns and the inner dict keys become the row index:

pop={
    'Nevada':{2001:2.4,2002:2.9},
    'Ohio':{2000:1.5,2001:1.7,2002:3.6}
}
frame3=DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
DataFrame(pop,index=[2001,2002,2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

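Index Objects

When you build a Series or DataFrame, any array or other sequence of labels you pass is converted to an Index object. Index objects are immutable, which is what makes it safe to share one Index among several data structures.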

index=pd.Index(np.arange(3))
obj2=Series([1.5,-2.5,0],index=index)
obj2.index is index
True

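Methods and properties of Index:
append: concatenates another Index object and returns a new Index
difference: computes the set difference and returns an Index (called diff in early pandas versions)
delete: removes the element at position i and returns a new Index
drop: removes the passed values and returns a new Index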
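
The methods listed above can be tried on a small Index. This is only a sketch with made-up labels; `idx` and `other` are illustrative names:

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
other = pd.Index(['c', 'd'])

appended = idx.append(other)      # concatenation; duplicate labels are kept
diffed = idx.difference(other)    # labels in idx that are not in other
deleted = idx.delete(0)           # remove the element at position 0
dropped = idx.drop(['b'])         # remove by label

# An Index is immutable: item assignment raises TypeError
try:
    idx[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

print(list(appended), list(diffed), list(deleted), list(dropped), mutated)
```

Every call returns a new Index and leaves `idx` unchanged.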

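Essential Functionality

Reindexing

The reindex method creates a new object conformed to a new index.
Calling reindex on a Series rearranges the data according to the new index. If an index value is not present, a missing value is introduced, or the value given by fill_value if you pass one.
The method option fills by interpolation: ffill or pad fills forward, bfill or backfill fills backward.
For example: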

obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
obj3=Series(['blue','purple','yellow'],index=[0,2,4])
obj3.reindex(range(6),method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

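The columns keyword reindexes the columns. Filling with method only applies along the rows, that is, in the index direction, and it requires a sorted index, so the line below reindexes the rows first and the columns second.

# assumes frame has a sorted label index such as ['a','c','d']
frame.reindex(index=['a','b','c','d'],method='ffill').reindex(columns=states)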
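
A self-contained version of this idea is sketched below. The frame contents, the label lists and the names `df`, `rows` and `result` are assumptions made for illustration; they are not the objects defined earlier in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a sorted string index and state names as columns
df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Step 1: reindex the rows, forward-filling the new label 'b' from 'a'
rows = df.reindex(index=['a', 'b', 'c', 'd'], method='ffill')
# Step 2: reindex the columns; 'Utah' has no data, so it becomes NaN
result = rows.reindex(columns=states)
print(result)
```

Splitting the call in two keeps the forward fill on the row axis only.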

Dropping Entries from an Axis

The drop method removes entries: pass it a single index label or a list of labels.

obj=Series(np.arange(5.),index=['a','b','c','d','e'])
new_obj=obj.drop(['b','c'])
new_obj
a    0.0
d    3.0
e    4.0
dtype: float64
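
drop works on a DataFrame as well, where the axis argument chooses between rows and columns. A short sketch with an illustrative 4x4 frame (`df`, `no_rows` and `no_cols` are made-up names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

no_rows = df.drop(['Colorado', 'Ohio'])      # axis=0 by default: drop rows
no_cols = df.drop(['two', 'four'], axis=1)   # axis=1: drop columns
print(no_rows)
print(no_cols)
```

Both calls return new objects; `df` itself keeps all four rows and columns.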

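Indexing, Selection, and Filtering

Indexing a Series works like indexing a NumPy array, except that you can use the index labels instead of only integers. For example: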

obj=Series(np.arange(4.),index=['a','b','c','d'])
obj['b']
1.0
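obj[1]  # positional lookup; recent pandas versions recommend obj.iloc[1] for this
1.0
obj[2:4]  # positional slices exclude the end point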
c    2.0
d    3.0
dtype: float64
obj[['b','a','d']]
b    1.0
a    0.0
d    3.0
dtype: float64
obj[[1,3]]
b    1.0
d    3.0
dtype: float64
obj[obj>2]
d    3.0
dtype: float64
obj['b':'c']  # label slices include the end point
b    1.0
c    2.0
dtype: float64
obj['b':'c']=5  # assigning to a selection sets the values
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing a DataFrame with a column label or a list of labels retrieves one or more columns:

data=DataFrame(np.arange(16).reshape(4,4),
              index=['Ohio','Colorado','Utah','New York'],
              columns=['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data['two']  # select the column labelled two
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
data[:2]  # slicing selects rows: here the first two
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three']>5]  # rows where column three is greater than 5
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data<5  # element-wise comparison with 5 gives a boolean DataFrame
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data<5]=0  # set every element smaller than 5 to 0
data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

To select rows of a DataFrame by label, use the dedicated .loc indexer. The .iloc indexer selects by integer position instead.

data.loc['Colorado',['two','three']]
two      5
three    6
Name: Colorado, dtype: int32
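
The two indexers can be compared side by side. A minimal sketch on an illustrative frame shaped like the one above (variable names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

by_label = df.loc['Colorado', ['two', 'three']]   # row label, column labels
by_position = df.iloc[1, [1, 2]]                  # the same cells by position
label_slice = df.loc[:'Utah', 'two']              # label slices include the end
pos_slice = df.iloc[:2, 1]                        # positional slices exclude the end
print(by_label, by_position, label_slice, pos_slice, sep='\n')
```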

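Operations Between DataFrame and Series

By default, arithmetic between a DataFrame and a Series matches the Series index against the DataFrame's columns and then broadcasts down the rows. NumPy arrays broadcast the same way when a row is subtracted from a two-dimensional array:

arr=np.arange(12.).reshape(3,4)
arr-arr[0]
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])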
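
The pandas counterpart of this broadcasting can be sketched as follows. The frame and the names `first_row`, `by_columns` and `by_rows` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
first_row = df.iloc[0]           # a Series indexed by 'b', 'd', 'e'

# Matches first_row's index against the columns and broadcasts down the rows
by_columns = df - first_row

# To match on the row index instead, use the method form with axis=0
col_d = df['d']
by_rows = df.sub(col_d, axis=0)
print(by_columns)
print(by_rows)
```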

Function Application and Mapping

frame=DataFrame(np.random.randn(4,3),
               columns=list('bde'),
               index=['Utah','Ohio','Texas','Oregon'])
np.abs(frame)
               b         d         e
Utah    0.855613  1.696205  0.503547
Ohio    1.086818  1.448180  1.568419
Texas   0.360607  0.674741  0.590972
Oregon  1.270708  0.461014  0.427092
f=lambda x: x.max()-x.min()
frame.apply(f)  # default axis=0 applies f to each column (down the rows); pass axis=1 to apply it to each row
b    0.910101
d    2.370946
e    2.071966
dtype: float64
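
Because the random frame above changes on every run, the effect of axis is easier to see on fixed data. A small sketch (the frame and the name `spread` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.).reshape(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

spread = lambda x: x.max() - x.min()

per_column = df.apply(spread)          # axis=0: one result per column
per_row = df.apply(spread, axis=1)     # axis=1: one result per row
print(per_column)
print(per_row)
```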

Sorting and Ranking

To sort by the row or column index, use the sort_index method:

obj=Series(range(4),index=['d','a','c','b'])
obj.sort_index()
# sorted by index label
a    1
b    3
c    2
d    0
dtype: int64
frame=DataFrame(np.arange(8).reshape(2,4),
               index=['three','one'],
               columns=['d','a','b','c'])
frame.sort_index()
# three was on top originally; after sorting, one comes first. The default is axis=0 (sort the rows); pass ascending=False for descending order.
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

To sort a Series by its values, use the sort_values method:

obj=pd.Series(np.random.randn(8))
obj.sort_values()
6   -0.896499
2   -0.827439
3   -0.520070
5   -0.216063
7    0.353973
1    0.400870
0    0.902996
4    1.854120
dtype: float64
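
sort_values also works on a DataFrame through the by argument, and rank covers the ranking half of this section's title. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
by_b = df.sort_values(by='b')                 # sort rows by column b
by_a_then_b = df.sort_values(by=['a', 'b'])   # sort by a, ties broken by b

s = pd.Series([7, -5, 7, 4])
ranks = s.rank()                        # tied values share the average rank
ranks_first = s.rank(method='first')    # ties ranked in order of appearance
print(by_b, by_a_then_b, ranks, ranks_first, sep='\n')
```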

Summarizing and Computing Descriptive Statistics

df=DataFrame(np.arange(8.).reshape(4,2),
            index=['a','b','c','d'],
            columns=['one','two'])
df.sum()
# sums down each column by default (axis=0); pass axis=1 to sum across each row. skipna=True, the default, excludes NA values.
one    12.0
two    16.0
dtype: float64
df.describe()
# computes summary statistics for a Series or for each DataFrame column
            one       two
count  4.000000  4.000000
mean   3.000000  4.000000
std    2.581989  2.581989
min    0.000000  1.000000
25%    1.500000  2.500000
50%    3.000000  4.000000
75%    4.500000  5.500000
max    6.000000  7.000000
df.cumsum()
# cumulative sum of the values
    one   two
a   0.0   1.0
b   2.0   4.0
c   6.0   9.0
d  12.0  16.0
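
The axis and skipna options described in the comment above only show an effect when the data contains missing values. A sketch with an illustrative frame that has some NaN entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan], [7.0, -4.0], [np.nan, np.nan], [2.0, -1.0]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

col_sums = df.sum()                      # per column, NaN skipped
row_sums = df.sum(axis=1)                # per row, NaN skipped
strict = df.sum(axis=1, skipna=False)    # any NaN in a row makes the result NaN
print(col_sums, row_sums, strict, sep='\n')
```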

Correlation and Covariance

from pandas_datareader import data as web
all_data={}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
price=DataFrame({tic:data['Adj Close'] for tic,data in all_data.items()})
volume=DataFrame({tic:data['Volume'] for tic,data in all_data.items()})

returns=price.pct_change()
returns.tail()
# This example is not demonstrated here: the Yahoo page could not be reached, and the call failed with the error below.
---------------------------------------------------------------------------

RemoteDataError                           Traceback (most recent call last)

<ipython-input-45-5ca20168c7a5> in <module>()
      2 all_data={}
      3 for ticker in ['AAPL','IBM','MSFT','GOOG']:
----> 4     all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
      5 price=DataFrame({tic:data['Adj Close'] for tic,data in all_data.items()})
      6 volume=DataFrame({tic:data['Volume'] for tic,data in all_data.items()})


c:\py35\lib\site-packages\pandas_datareader\data.py in get_data_yahoo(*args, **kwargs)
     38 
     39 def get_data_yahoo(*args, **kwargs):
---> 40     return YahooDailyReader(*args, **kwargs).read()
     41 
     42 


c:\py35\lib\site-packages\pandas_datareader\yahoo\daily.py in read(self)
    113         """ read one data from specified URL """
    114         try:
--> 115             df = super(YahooDailyReader, self).read()
    116             if self.ret_index:
    117                 df['Ret_Index'] = _calc_return_index(df['Adj Close'])


c:\py35\lib\site-packages\pandas_datareader\base.py in read(self)
    179         if isinstance(self.symbols, (compat.string_types, int)):
    180             df = self._read_one_data(self.url,
--> 181                                      params=self._get_params(self.symbols))
    182         # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    183         elif isinstance(self.symbols, DataFrame):


c:\py35\lib\site-packages\pandas_datareader\base.py in _read_one_data(self, url, params)
     77         """ read one data from specified URL """
     78         if self._format == 'string':
---> 79             out = self._read_url_as_StringIO(url, params=params)
     80         elif self._format == 'json':
     81             out = self._get_response(url, params=params).json()


c:\py35\lib\site-packages\pandas_datareader\base.py in _read_url_as_StringIO(self, url, params)
     88         Open url (and retry)
     89         """
---> 90         response = self._get_response(url, params=params)
     91         text = self._sanitize_response(response)
     92         out = StringIO()


c:\py35\lib\site-packages\pandas_datareader\base.py in _get_response(self, url, params, headers)
    137         if params is not None and len(params) > 0:
    138             url = url + "?" + urlencode(params)
--> 139         raise RemoteDataError('Unable to read URL: {0}'.format(url))
    140 
    141     def _get_crumb(self, *args):


RemoteDataError: Unable to read URL: https://query1.finance.yahoo.com/v7/finance/download/IBM?crumb=%5Cu002FUftz31NJjj&period1=946656000&interval=1d&period2=1262361599&events=history
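
Since the download failed, the correlation and covariance calls can still be tried on synthetic data. This is only a sketch: the random-walk prices and the tickers 'AAA' and 'BBB' are invented stand-ins, not real market data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the adjusted close prices (hypothetical tickers)
price = pd.DataFrame({
    'AAA': 200 + rng.standard_normal(250).cumsum(),
    'BBB': 100 + rng.standard_normal(250).cumsum(),
})
returns = price.pct_change()

pair_corr = returns['AAA'].corr(returns['BBB'])   # correlation of two Series
pair_cov = returns['AAA'].cov(returns['BBB'])     # covariance of two Series
corr_matrix = returns.corr()                      # full correlation matrix
cov_matrix = returns.cov()                        # full covariance matrix
print(corr_matrix)
print(cov_matrix)
```

The Series methods work on one pair of columns, while the DataFrame methods return the whole matrix at once.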

Handling Missing Data

from numpy import nan as NA
data=Series([1,NA,3.5,NA,7])
data.dropna()
# dropna returns a Series containing only the non-null data and their index values
0    1.0
2    3.5
4    7.0
dtype: float64
data=DataFrame([
    [1.,6.5,3.],[1.,NA,NA],
    [NA,NA,NA],[NA,6.5,3.]
])
cleaned=data.dropna()  # for a DataFrame, dropna by default drops every row that contains a missing value;
# passing how='all' drops only the rows that are entirely NA
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
data.fillna(0)
# fill in the missing data
     0    1    2
0  1.0  6.5  3.0
1  1.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  6.5  3.0
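
A few variations on dropna and fillna, sketched on the same 4x3 values. The variable names are illustrative, and the per-column fill values are arbitrary:

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

all_na_dropped = df.dropna(how='all')        # drop only rows where every value is NA
at_least_two = df.dropna(thresh=2)           # keep rows with at least 2 non-NA values
per_column = df.fillna({0: 0.0, 1: -1.0})    # a fill value per column; column 2 untouched
filled_fwd = df.ffill()                      # carry the last valid value downward
print(all_na_dropped, at_least_two, per_column, filled_fwd, sep='\n')
```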

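Hierarchical Indexing

Hierarchical indexing puts multiple index levels on a single axis, so higher-dimensional data can be handled in a lower-dimensional form.
Take a Series as an example: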

data=Series(np.random.randn(10),
           index=[
               ['a','a','a','b','b','b','c','c','d','d'],
               [1,2,3,1,2,3,1,2,2,3]
           ])
data
# the index is a MultiIndex
a  1    0.704940
   2    1.034785
   3   -0.575555
b  1    1.465815
   2   -2.065133
   3   -0.191078
c  1    2.251724
   2   -1.282849
d  2    0.270976
   3    1.014202
dtype: float64
data['b']
1    1.465815
2   -2.065133
3   -0.191078
dtype: float64
data.unstack()
# a Series with a MultiIndex can be rearranged into a DataFrame with unstack; the inverse operation is stack
          1         2         3
a  0.704940  1.034785 -0.575555
b  1.465815 -2.065133 -0.191078
c  2.251724 -1.282849       NaN
d       NaN  0.270976  1.014202
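
Selection on either level and the unstack/stack round trip can be sketched with fixed values. The data and names are illustrative; the explicit dropna after stack is an assumption made so the result does not depend on how a given pandas version treats NaN when stacking.

```python
import pandas as pd

s = pd.Series([1., 2., 3., 4., 5.],
              index=[['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 2, 2]])

outer = s.loc['a':'b']     # slice on the outer level
inner = s.loc[:, 2]        # inner label 2 across all outer labels

wide = s.unstack()         # inner level becomes the columns; (c, 1) is NaN
back = wide.stack().dropna()   # back to a Series, without the NaN placeholder
print(outer, inner, wide, back, sep='\n')
```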

For a DataFrame, either axis can have a hierarchical index:

frame=DataFrame(np.arange(12).reshape(4,3),
               index=[
                   ['a','a','b','b'],[1,2,1,2]
               ],
               columns=[
                   ['Ohio','Ohio','Colorado'],['Green','Red','Green']
               ])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame
# every level can have a name
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

Reordering and Sorting Levels

frame.swaplevel('key1','key2')
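# swaplevel takes two level numbers or names and returns a new object with those levels interchanged
frame.sort_index(level=1)
# sort_index can sort by the values in a single level
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.groupby(level='key2').sum()
# sums the rows that share a key2 value; older pandas versions wrote this as frame.sum(level='key2'), which pandas 2.0 removed
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16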
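
The level operations from this section can be put together in one runnable sketch. It rebuilds a frame with the same shape as above under the illustrative name `df`; summing along the column level via a transpose is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
df.index.names = ['key1', 'key2']
df.columns.names = ['state', 'color']

swapped = df.swaplevel('key1', 'key2')           # key2 becomes the outer row level
by_key2 = df.groupby(level='key2').sum()         # sum rows sharing the same key2
by_color = df.T.groupby(level='color').sum().T   # same idea along the columns
print(swapped, by_key2, by_color, sep='\n')
```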

