Pandas Basics: Creating and Accessing Data

Introduction

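Pandas is one of the best-known data analysis packages in the Python ecosystem. It is built on NumPy and provides higher-level data structures and tools for data analysis. The library is organized around two core data structures, Series and DataFrame. This article covers the basic ways to create and access both.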


Series

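A Series is a one-dimensional, array-like object made up of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).
Note: NumPy (Numerical Python) provides Python's support for multidimensional arrays through ndarray, which offers vectorized operations and is fast and memory-efficient.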

(1) The pandas documentation describes Series as follows:

""" One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
"""
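
The alignment rule quoted in the docstring can be seen in a small sketch. The Series names and values below are made up for illustration. The point is that arithmetic matches values by label rather than by position, and that a label present on only one side produces NaN.

```python
import pandas as pd

# Two Series with different lengths and differently ordered labels (illustrative values)
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['c', 'a'])

# Addition aligns on labels; 'b' exists only in s1, so its result is NaN
total = s1 + s2
print(total)

# Statistical methods such as sum() skip NaN by default
print(total.sum())
```

The result index is the union of both indexes, so the output has three rows even though s2 has only two values.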

(2) The basic way to create a Series is shown below. The data can be an array-like (list or ndarray), a dict, or a scalar value.

s = pd.Series(data, index=index)

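import numpy as np
import pandas as pd

# Cast with astype: recent pandas versions may refuse a lossy float-to-int8 conversion inside the constructor
s = pd.Series([-1.55666192,-0.75414753,0.47251231,-1.37775038,-1.64899442], index=['a', 'b', 'c', 'd', 'e']).astype('int8')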
a   -1
b    0
c    0
d   -1
e   -1
dtype: int8

s = pd.Series(['a',-0.75414753,123,66666,-1.64899442], index=['a', 'b', 'c', 'd', 'e'])
a           a
b   -0.754148
c         123
d       66666
e    -1.64899
dtype: object

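Note: Series supports NumPy dtypes such as integers, floats, complex numbers, booleans, and strings. As when creating an ndarray, if no dtype is specified pandas tries to infer a suitable one: in the example that mixes numbers and strings, the inferred dtype is object, and when int8 is requested the values are truncated and displayed as int8.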
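
A minimal sketch of dtype inference, using made-up values. The variable names are illustrative only, and the exact integer width pandas picks can depend on the platform, so the code prints the dtypes rather than assuming them.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])        # all integers -> an integer dtype
floats = pd.Series([1, 2.5, 3])    # any float present -> float64
mixed = pd.Series(['a', 1, 2.5])   # strings mixed with numbers -> object

# An explicit cast truncates toward zero: 2.5 becomes 2
small = floats.astype('int8')

print(ints.dtype, floats.dtype, mixed.dtype, small.dtype)
print(small.tolist())
```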

s = pd.Series(np.random.randn(5))
0    0.485468
1   -0.912130
2    0.771970
3   -1.058117
4    0.926649
dtype: float64

s.index
RangeIndex(start=0, stop=5, step=1)

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
a    0.485468
b   -0.912130
c    0.771970
d   -1.058117
e    0.926649
dtype: float64

Note: When no index is specified, Series creates a default integer index automatically.

s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})
a    0.0
b    1.0
c    2.0
dtype: float64

s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

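Note: A Series created from a Python dict can be viewed as a fixed-length, ordered dict. If only a dict is passed, the Series index is the dict's keys. If an index is also passed, the values are matched to the index labels and placed in that order; labels with no matching key get NaN.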
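
The matching behaviour described in the note can be checked with a short sketch. The dict name `prices` is hypothetical. As far as the result is concerned, passing an index alongside a dict behaves like building the Series from the dict and then calling reindex.

```python
import pandas as pd

prices = {'a': 0.0, 'b': 1.0, 'c': 2.0}   # hypothetical data
labels = ['b', 'c', 'd', 'a']

# Dict plus explicit index: values are looked up by label, 'd' has no key
s = pd.Series(prices, index=labels)

# Labels that ended up as NaN
missing = s[s.isna()].index.tolist()
print(missing)

# Same result via reindex
same = pd.Series(prices).reindex(labels)
print(s.equals(same))
```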

s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Note: The scalar value is repeated to match the length of the index.

(3) Accessing the elements and index of a Series

s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

s.values
array([ 1.,  2., nan,  0.])

s.index
Index(['b', 'c', 'd', 'a'], dtype='object')

Note: The values and index attributes return the array representation and the index object of a Series.

s['a']
0.0

s[['a','b']]
a    0.0
b    1.0
dtype: float64

s[['a','b','c']]
a    0.0
b    1.0
c    2.0
dtype: float64

s[:2] 
b    1.0
c    2.0
dtype: float64

Note: A single value or a group of values can be selected from a Series by indexing.
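
Square-bracket indexing on a Series can mean either a label or a position depending on the argument, so a sketch with the explicit accessors may help. It rebuilds the same Series as above. The use of loc and iloc here is a suggestion, not something the examples above require.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, float('nan'), 0.0], index=['b', 'c', 'd', 'a'])

by_label = s.loc['a']            # look up by label
by_position = s.iloc[0]          # look up by integer position
label_slice = s.loc['b':'c']     # label slices include both endpoints
position_slice = s.iloc[:2]      # position slices exclude the stop
positive = s[s > 0]              # boolean mask; NaN compares as False

print(by_label, by_position)
print(label_slice.index.tolist(), position_slice.index.tolist())
print(positive.index.tolist())
```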


DataFrame

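A DataFrame is a tabular (two-dimensional) data structure that holds an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that share the same index.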

(1) The pandas documentation describes DataFrame as follows:

""" Two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data
structure

Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
"""

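(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays, or Series (lists and ndarrays must all have the same length), a two-dimensional ndarray, a dict of dicts, and so on.

df = pd.DataFrame(data, index=index)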

df = pd.DataFrame({'one': [1., 2., 3., 5], 'two': [1., 2., 3., 4.]})
   one  two
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0

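Note: When created from a dict of lists, each list becomes a column of the DataFrame. Passing the lists without keys, as in df = pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), does not work: the braces then form a set literal, and a list is unhashable, so Python raises a TypeError.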

df = pd.DataFrame([[1., 2., 3., 5],[1., 2., 3., 4.]],index=['a', 'b'],columns=['one','two','three','four'])
   one  two  three  four
a  1.0  2.0    3.0   5.0
b  1.0  2.0    3.0   4.0

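Note: Created from a nested list, this gives a table with 2 rows and 4 columns; the index and columns parameters set the row labels and column names.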

data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'U10')])  # 'U10' (unicode) so the strings display without a b'' prefix on Python 3
[(0,  0., '') (0,  0., '')]

Note: np.zeros(shape, dtype=float, order='C') returns an array of the given shape and dtype, filled with zeros.

data[:] = [(1,2.,'Hello'), (2,3.,"World")]        
df = pd.DataFrame(data)
   A    B      C
0  1  2.0  Hello
1  2  3.0  World

df = pd.DataFrame(data, index=['first', 'second'])
        A    B      C
first   1  2.0  Hello
second  2  3.0  World

df = pd.DataFrame(data, columns=['C', 'A', 'B'])
       C  A    B
0  Hello  1  2.0
1  World  2  3.0

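Note: As with Series, a DataFrame gets a default integer index when none is specified. When columns is given, the columns are arranged in that order.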

data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

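Note: When created from a dict of Series, each Series becomes a column. If no index is passed explicitly, the indexes of the Series are merged (union) to form the row index, and NaN fills the positions where a column has no data.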
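
A short sketch of the index union described in the note, with the same data as above split into two named Series. The final dropna call is an extra step added for illustration, showing one way to keep only the rows that are complete.

```python
import pandas as pd

one = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
two = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# The row index is the union of both Series indexes
df = pd.DataFrame({'one': one, 'two': two})
print(df.index.tolist())

# 'one' has no value for 'd', so exactly one NaN appears in that column
print(df['one'].isna().sum())

# Keep only rows with no missing values
complete = df.dropna()
print(complete.index.tolist())
```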

df = pd.DataFrame(data,index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

df = pd.DataFrame(data,index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)
   a   b     c
0  1   2   NaN
1  5  10  20.0

Note: When created from a list of dicts, each dict becomes a row, and the union of the dict keys becomes the column labels.
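
One side effect of this construction is worth a sketch: a column that is missing from some rows is stored as float, because NaN is a floating-point value. The variable name `rows` is illustrative.

```python
import pandas as pd

rows = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(rows)

# Columns are the union of the keys across all dicts
print(list(df.columns))

# 'a' is complete and stays integer; 'c' contains NaN and becomes float64
print(df.dtypes)
```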

df = pd.DataFrame(data2, index=['first', 'second'])
        a   b     c
first   1   2   NaN
second  5  10  20.0

df = pd.DataFrame(data2, columns=['a', 'b'])
   a   b
0  1   2
1  5  10

df = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                 ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                 ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6}, 
                 ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},  
                 ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

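Note: When created from a dict of dicts, the outer keys become the column index and each inner dict becomes a column; the inner keys are merged to form the row index. With tuple keys, as here, both axes become hierarchical (MultiIndex).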
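
A reduced version of the example above, as a sketch of how such a frame can be read back. It uses three of the five columns to keep the output small. Selecting with tuples through loc is one way to address a single cell.

```python
import pandas as pd

df = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
    ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
})

# Tuple keys produce a two-level index on both axes
print(df.columns.nlevels, df.index.nlevels)

# One cell: row ('A', 'B'), column ('a', 'b')
cell = df.loc[('A', 'B'), ('a', 'b')]
print(cell)

# All columns under the top-level label 'a'
sub = df['a']
print(sub)
```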

(3) Accessing the elements and index of a DataFrame

data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

df['one']  # or df.one
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

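Note: A DataFrame column can be retrieved as a Series either with dict-like notation or as an attribute. The returned Series has the same index as the DataFrame, and its name attribute is set to the column label.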
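
The two access styles are not always interchangeable. The sketch below uses a made-up frame with a column called count to show that attribute access returns the DataFrame method of the same name, while bracket access always returns the column.

```python
import pandas as pd

# Hypothetical frame; 'count' clashes with the DataFrame.count method
df = pd.DataFrame({'one': [1.0, 2.0], 'count': [3, 4]}, index=['a', 'b'])

col = df['one']
print(col.name)                 # the Series carries the column label
print(df.one.equals(col))       # attribute access gives the same column

print(callable(df.count))       # df.count is the method, not the column
print(df['count'].tolist())     # brackets still reach the column
```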

df[0:1]
   one  two
a  1.0  1.0

Note: Slicing with df[0:1] selects rows by position, so this returns the first row (row a), not columns.

df.loc['a']
one    1.0
two    1.0
Name: a, dtype: float64

df.loc[:,['one','two'] ]
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

df.loc[['a',],['one','two']]
   one  two
a  1.0  1.0

df.loc['a','one']
1.0

Note: loc selects data by label.

df.iloc[0:2,0:1]  
   one
a  1.0
b  2.0

df.iloc[0:2]  
   one  two
a  1.0  1.0
b  2.0  2.0

df.iloc[[0,2],[0,1]]  # pick arbitrary row positions and column positions
   one  two
a  1.0  1.0
c  3.0  3.0

Note: iloc selects data by integer position.
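
One difference between the two indexers is easy to miss, so here is a small sketch: a label slice with loc includes its end label, while a position slice with iloc excludes its stop position. The frame is the same one used in this section, rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0, float('nan')], 'two': [1.0, 2.0, 3.0, 4.0]},
    index=['a', 'b', 'c', 'd'],
)

by_label = df.loc['a':'b']      # rows a and b: the end label is included
by_position = df.iloc[0:1]      # row a only: the stop position is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```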

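df.loc['a']
one    1.0
two    1.0
Name: a, dtype: float64

df.loc['a',['one','two']]
one    1.0
two    1.0
Name: a, dtype: float64

df.loc['a',df.columns[[0,1]]]
one    1.0
two    1.0
Name: a, dtype: float64

df.loc[['a','b'],df.columns[[0,1]]]
   one  two
a  1.0  1.0
b  2.0  2.0

df.iloc[1,[0,1]]
one    2.0
two    2.0
Name: b, dtype: float64

df.iloc[[0,1],[0,1]]
   one  two
a  1.0  1.0
b  2.0  2.0

Note: Older pandas versions offered an ix indexer that accepted labels and integer positions together (for example df.ix['a',[0,1]]). ix was deprecated in pandas 0.20 and removed in pandas 1.0; the equivalent selections use loc for labels and iloc for positions, with df.columns[[0,1]] converting column positions to labels when the two are mixed.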

df.loc[df.one>1,df.columns[:1]]
   one
b  2.0
c  3.0

Note: Selecting with a condition: the rows where column one is greater than 1, and the first column.
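
Conditions can also be combined. The sketch below rebuilds the example frame and passes a boolean mask to loc. Each condition needs its own parentheses, and & is used in place of the Python keyword and. The second condition is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0, float('nan')], 'two': [1.0, 2.0, 3.0, 4.0]},
    index=['a', 'b', 'c', 'd'],
)

# Rows where one > 1 and two < 4; a comparison against NaN is False, so row d drops out
mask = (df['one'] > 1) & (df['two'] < 4)
selected = df.loc[mask, ['two']]
print(selected)
```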

df['one']=16.8
    one  two
a  16.8  1.0
b  16.8  2.0
c  16.8  3.0
d  16.8  4.0

val = pd.Series([2,2,2],index=['b', 'c', 'd'])
df['one']=val
   one  two
a  NaN  1.0
b  2.0  2.0
c  2.0  3.0
d  2.0  4.0

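Note: A column can be modified by assignment. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is aligned exactly on the DataFrame's index, and any gaps are filled with NaN.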
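
Both rules in the note can be checked with a short sketch. The column name bad is hypothetical and exists only to trigger the length check.

```python
import pandas as pd

df = pd.DataFrame({'two': [1.0, 2.0, 3.0, 4.0]}, index=['a', 'b', 'c', 'd'])

# A Series is aligned on the index: 'a' has no match and becomes NaN
df['one'] = pd.Series([2, 2, 2], index=['b', 'c', 'd'])
print(df)

# A list must match the number of rows; three values for four rows is rejected
try:
    df['bad'] = [1, 2, 3]
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```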

df['four']=[3,3,3,3]
   one  two  four
a  NaN  1.0     3
b  2.0  2.0     3
c  2.0  3.0     3
d  2.0  4.0     3

Note: Assigning to a column that does not exist creates a new column.

df.index.get_loc('a')
0

df.index.get_loc('b')
1

df.columns.get_loc('one')
0

Note: get_loc returns the integer position of a row or column label.
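
A typical use of get_loc is translating labels into positions for iloc. The sketch below, with a small frame rebuilt for the purpose, shows that the position-based lookup and the label-based lookup reach the same cell.

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0], 'two': [1.0, 2.0, 3.0]},
    index=['a', 'b', 'c'],
)

row = df.index.get_loc('b')        # position of row label 'b'
col = df.columns.get_loc('two')    # position of column label 'two'

by_position = df.iloc[row, col]
by_label = df.loc['b', 'two']
print(row, col, by_position, by_label)
```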
