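Pandas is the best-known data analysis package in the Python ecosystem. Built on NumPy, it adds higher-level data structures and tools for data analysis. Pandas is organized around two core data structures, Series and DataFrame, and this article covers the basic ways to create and access each of them.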
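A Series is an object similar to a one-dimensional array. It consists of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).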
Note: NumPy (Numerical Python) gives Python its multidimensional array object, ndarray, which supports vectorized operations and is fast and memory-efficient.
(1) The pandas documentation describes Series as follows:

"""
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
"""
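The index alignment described in the docstring is easiest to see with two Series whose labels only partly overlap. The following is a minimal sketch; the names `s1`, `s2`, and `total` and the values are made up for illustration.

```python
import pandas as pd

# Two Series with partly overlapping labels and different lengths.
s1 = pd.Series([1.0, 2.0, 3.0], index=['c', 'a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'd'])

# Addition matches values by label, not by position.
# Labels present in only one operand produce NaN in the result.
total = s1 + s2

print(total)
```

Only the label `b` appears in both operands, so it is the only entry with a numeric result; the other labels of the union are NaN.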
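(2) The basic way to create a Series is shown below. The data can be array-like (a list or an ndarray), a dict, or a scalar value.

    import numpy as np
    import pandas as pd

    s = pd.Series(data, index=index)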
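    # The output below comes from an older pandas release. Newer releases may
    # raise ValueError here, because these floats cannot be cast to int8
    # without loss; calling .astype('int8') on the float Series truncates
    # explicitly.
    s = pd.Series([-1.55666192, -0.75414753, 0.47251231, -1.37775038, -1.64899442],
                  index=['a', 'b', 'c', 'd', 'e'], dtype='int8')
    a   -1
    b    0
    c    0
    d   -1
    e   -1
    dtype: int8

    s = pd.Series(['a', -0.75414753, 123, 66666, -1.64899442],
                  index=['a', 'b', 'c', 'd', 'e'])
    a           a
    b   -0.754148
    c         123
    d       66666
    e    -1.64899
    dtype: object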
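Note: a Series supports the NumPy dtypes, including integers, floats, complex numbers, booleans, and strings. As with ndarray creation, if no dtype is given pandas tries to infer a suitable one. In the examples above, data that mixes numbers and strings is inferred as object, and when int8 is specified the data is shown as int8.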
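To make the inference rules concrete, here is a small sketch that builds three Series without a dtype and inspects what pandas chose. The variable names are illustrative, and the explicit `astype` call is shown as one way to get a truncating integer conversion.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])            # all integers
floats = pd.Series([1, 2.5, 3])        # one float promotes the whole Series
mixed = pd.Series(['a', -0.75, 123])   # numbers and strings together

print(ints.dtype, floats.dtype, mixed.dtype)

# Converting after construction truncates toward zero.
truncated = pd.Series([-1.55, -0.75, 0.47]).astype('int8')
print(truncated.tolist())  # [-1, 0, 0]
```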
    s = pd.Series(np.random.randn(5))
    0    0.485468
    1   -0.912130
    2    0.771970
    3   -1.058117
    4    0.926649
    dtype: float64

    s.index
    RangeIndex(start=0, stop=5, step=1)

    s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    a    0.485468
    b   -0.912130
    c    0.771970
    d   -1.058117
    e    0.926649
    dtype: float64
Note: when no index is given, the Series automatically creates an integer index.
    s = pd.Series({'a': 0., 'b': 1., 'c': 2.})
    a    0.0
    b    1.0
    c    2.0
    dtype: float64

    s = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['b', 'c', 'd', 'a'])
    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64
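Note: a Series created from a Python dict can be viewed as a fixed-length, ordered dict. If only a dict is passed, the index of the Series is the dict's keys. If an index is also passed, values are looked up by matching label and placed in the corresponding positions; a label with no matching key gets NaN.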
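Since unmatched labels silently become NaN, it is often useful to check for them right after construction. A minimal sketch, assuming the same dict and index as in the example; `missing` and `present` are illustrative names.

```python
import pandas as pd

values = {'a': 0.0, 'b': 1.0, 'c': 2.0}
s = pd.Series(values, index=['b', 'c', 'd', 'a'])

missing = s.isna()       # True where the label had no matching dict key
present = s.dropna()     # keep only the labels that were matched

print(missing['d'], present.index.tolist())
```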
    s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
    a    5.0
    b    5.0
    c    5.0
    d    5.0
    e    5.0
    dtype: float64
Note: the scalar value is repeated to match the length of the index.
(3) Accessing the elements and index of a Series
    s = pd.Series({'a': 0., 'b': 1., 'c': 2.}, index=['b', 'c', 'd', 'a'])
    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64

    s.values
    [ 1.  2. nan  0.]

    s.index
    Index([u'b', u'c', u'd', u'a'], dtype='object')
Note: the values and index attributes of a Series return its array representation and its index object.
    s['a']
    0.0

    s[['a', 'b']]
    a    0.0
    b    1.0
    dtype: float64

    s[['a', 'b', 'c']]
    a    0.0
    b    1.0
    c    2.0
    dtype: float64

    s[:2]
    b    1.0
    c    2.0
    dtype: float64
Note: a single value or a group of values can be selected from a Series by indexing.
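The different selection styles can be spelled out with the explicit `loc` and `iloc` accessors, which make it unambiguous whether a label or a position is meant. The sketch below reuses the Series from the example; the result names are illustrative.

```python
import pandas as pd

s = pd.Series({'a': 0.0, 'b': 1.0, 'c': 2.0}, index=['b', 'c', 'd', 'a'])

by_label = s.loc['a']            # single label -> scalar
by_labels = s.loc[['a', 'b']]    # list of labels -> Series, in the order asked
by_position = s.iloc[:2]         # first two entries, whatever their labels
by_mask = s[s > 0]               # boolean mask; NaN compares as False
has_d = 'd' in s                 # membership tests the index, not the values
```

Note that `'d' in s` is true even though the value stored under `d` is NaN, because the test looks at labels only.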
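A DataFrame is a tabular (two-dimensional) data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be viewed as a dict of Series that share the same index.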
(1) The pandas documentation describes DataFrame as follows:

"""
Two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data
structure

Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
"""
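(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays, or Series (lists and arrays must all have the same length), a two-dimensional ndarray, a dict of dicts, and so on.

    df = pd.DataFrame(data, index=index)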
    df = pd.DataFrame({'one': [1., 2., 3., 5], 'two': [1., 2., 3., 4.]})
       one  two
    0  1.0  1.0
    1  2.0  2.0
    2  3.0  3.0
    3  5.0  4.0
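Note: when created from a dict of lists, each sequence becomes a column of the DataFrame. Dropping the keys, as in df = pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), does not work: braces without keys form a set literal, and a list is unhashable, so Python raises a TypeError before pandas is even called.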
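The failure with unkeyed braces comes from Python itself rather than from pandas, which the sketch below demonstrates alongside two forms that do work. The names `df_cols` and `df_rows` are illustrative.

```python
import pandas as pd

# Braces without keys build a set, and a set cannot hold lists.
try:
    bad = {[1., 2., 3., 5.], [1., 2., 3., 4.]}
except TypeError as exc:
    error_message = str(exc)

# A dict of lists works: each key names a column.
df_cols = pd.DataFrame({'one': [1., 2., 3., 5.], 'two': [1., 2., 3., 4.]})

# A list of lists also works: each inner list becomes a row.
df_rows = pd.DataFrame([[1., 2., 3., 5.], [1., 2., 3., 4.]])

print(df_cols.shape, df_rows.shape)  # (4, 2) (2, 4)
```

The two working forms produce transposed shapes, so the choice between them depends on whether the lists represent columns or rows.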
    df = pd.DataFrame([[1., 2., 3., 5], [1., 2., 3., 4.]],
                      index=['a', 'b'],
                      columns=['one', 'two', 'three', 'four'])
       one  two  three  four
    a  1.0  2.0    3.0   5.0
    b  1.0  2.0    3.0   4.0
Note: a nested list creates a table with 2 rows and 4 columns; the index and columns parameters set the row labels and the column names.
    data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])
    [(0, 0., '') (0, 0., '')]
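Note: zeros(shape, dtype=float, order='C') returns an array of the given shape and type, filled with zeros. Here the dtype is a structured type with three fields: A (4-byte integer), B (4-byte float), and C (a 10-byte string; 'a10' is the older spelling of 'S10').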
    data[:] = [(1, 2., 'Hello'), (2, 3., "World")]
    df = pd.DataFrame(data)
       A    B      C
    0  1  2.0  Hello
    1  2  3.0  World

    df = pd.DataFrame(data, index=['first', 'second'])
            A    B      C
    first   1  2.0  Hello
    second  2  3.0  World

    df = pd.DataFrame(data, columns=['C', 'A', 'B'])
           C  A    B
    0  Hello  1  2.0
    1  World  2  3.0
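Note: as with Series, a DataFrame adds an index automatically when none is specified; when columns is specified, the columns are arranged in the given order.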
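For readers on Python 3, the structured-array example can be sketched as follows. This version assumes the `'S10'` spelling for the byte-string field and uses bytes literals for its values; the names `records` and `reordered` are illustrative.

```python
import numpy as np
import pandas as pd

# A structured array: each element is a record with three named fields.
# 'S10' is a fixed-width byte string of up to 10 bytes.
records = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
records[:] = [(1, 2.0, b'Hello'), (2, 3.0, b'World')]

# Field names become column names; the index is optional.
df = pd.DataFrame(records, index=['first', 'second'])

# Passing columns reorders the fields.
reordered = pd.DataFrame(records, columns=['C', 'A', 'B'])

print(df)
print(reordered)
```

Under Python 3 the values in column C are bytes objects, so they print with a `b` prefix rather than as plain text.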
    data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
            'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(data)
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0
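Note: when created from a dict of Series, each Series becomes a column. If no index is given explicitly, the indexes of the Series are merged (as a union) into the row index of the result. NaN fills in the values missing from a column.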
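After building a frame from Series of unequal length, the gaps introduced by the index union can be located or filled. A minimal sketch using the same data; `rows_with_gaps` and `filled` are illustrative names, and filling with 0.0 is only one possible choice.

```python
import pandas as pd

data = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(data)

rows_with_gaps = df[df.isna().any(axis=1)]   # rows where some column is NaN
filled = df.fillna(0.0)                      # replace NaN with a default

print(rows_with_gaps)
print(filled)
```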
    df = pd.DataFrame(data, index=['d', 'b', 'a'])
       one  two
    d  NaN  4.0
    b  2.0  2.0
    a  1.0  1.0

    df = pd.DataFrame(data, index=['d', 'b', 'a'], columns=['two', 'three'])
       two three
    d  4.0   NaN
    b  2.0   NaN
    a  1.0   NaN

    data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data2)
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
Note: when created from a list of dicts, each dict becomes a row of the DataFrame, and the union of the dict keys becomes the column labels.
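One side effect of the list-of-dicts form is visible in the output above: a column that receives NaN is stored as float even when its other values are integers. The sketch below checks the column dtypes for the same `data2`.

```python
import pandas as pd

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data2)

# 'a' and 'b' are complete, so they stay integer.
# 'c' is missing from the first dict, so it holds NaN and becomes float.
print(df.dtypes)
```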
    df = pd.DataFrame(data2, index=['first', 'second'])
            a   b     c
    first   1   2   NaN
    second  5  10  20.0

    df = pd.DataFrame(data2, columns=['a', 'b'])
       a   b
    0  1   2
    1  5  10

    df = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                       ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                       ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
                       ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
                       ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
           a              b
           a    b    c    a     b
    A B  4.0  1.0  5.0  8.0  10.0
      C  3.0  2.0  6.0  7.0   NaN
      D  NaN  NaN  NaN  NaN   9.0
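Note: when created from a dict of dicts, the outer keys are merged into the column index of the result, each inner dict becomes a column, and the inner keys are merged into the row index. With tuple keys, as in this example, both indexes become hierarchical (MultiIndex).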
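The roles of the outer and inner keys are easier to follow with plain string keys first, then with tuples. This is a rough sketch with made-up data; `nested`, `tuple_keyed`, and `cell` are illustrative names.

```python
import pandas as pd

# Outer keys -> columns, inner keys -> rows.
nested = {
    'one': {'a': 1.0, 'b': 2.0},
    'two': {'b': 3.0, 'c': 4.0},
}
df = pd.DataFrame(nested)
print(df)

# Tuple keys on both levels give a MultiIndex on both axes.
tuple_keyed = pd.DataFrame({
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
})

# A cell is addressed by a (row tuple, column tuple) pair.
cell = tuple_keyed.loc[('A', 'B'), ('a', 'a')]
print(cell)
```

Column order in the tuple-keyed frame can differ between pandas versions, so the sketch addresses cells by label rather than by position.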
(3) Accessing the elements and index of a DataFrame
    data = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
            'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(data)
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0

    df['one']    # or df.one
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64
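Note: a DataFrame column can be retrieved as a Series using dict-like notation or attribute access. The returned Series has the same index as the original DataFrame, and its name attribute is set to the column name.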
    df[0:1]
       one  two
    a  1.0  1.0
Note: slicing with df[0:1] selects rows by position, so this returns the first row, not columns.
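Square-bracket indexing on a DataFrame behaves differently depending on what is passed, which is a common source of confusion. The sketch below contrasts the three cases on a small frame; the data and the result names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3.], 'two': [1., 2., 3.]},
                  index=['a', 'b', 'c'])

first_row = df[0:1]        # a slice selects rows by position
column = df['one']         # a single label selects a column as a Series
sub_frame = df[['one']]    # a list of labels selects columns as a DataFrame

print(first_row)
print(column.name)
print(sub_frame.shape)
```

When the intent is to select rows by position, `df.iloc[0:1]` says so explicitly and gives the same result as the slice.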
    df.loc['a']
    one    1.0
    two    1.0
    Name: a, dtype: float64

    df.loc[:, ['one', 'two']]
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0

    df.loc[['a'], ['one', 'two']]
       one  two
    a  1.0  1.0

    df.loc['a', 'one']
    1.0
Note: loc selects data by label.
    df.iloc[0:2, 0:1]
       one
    a  1.0
    b  2.0

    df.iloc[0:2]
       one  two
    a  1.0  1.0
    b  2.0  2.0

    df.iloc[[0, 2], [0, 1]]    # pick arbitrary row positions and column positions
       one  two
    a  1.0  1.0
    c  3.0  3.0
Note: iloc selects data by position.
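One practical difference between the two accessors is how slices treat their end point: a label slice with `loc` includes the stop label, while a position slice with `iloc` excludes the stop position. A short sketch on the same frame, with illustrative result names:

```python
import pandas as pd

data = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(data)

by_label = df.loc['a':'b', 'one']    # label slice: 'b' is included
by_position = df.iloc[0:2, 0]        # position slice: row 2 is excluded

print(by_label.equals(by_position))  # True
```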
    df.ix['a']
    one    1.0
    two    1.0
    Name: a, dtype: float64

    df.ix['a', ['one', 'two']]
    one    1.0
    two    1.0
    Name: a, dtype: float64

    df.ix['a', [0, 1]]
    one    1.0
    two    1.0
    Name: a, dtype: float64

    df.ix[['a', 'b'], [0, 1]]
       one  two
    a  1.0  1.0
    b  2.0  2.0

    df.ix[1, [0, 1]]
    one    2.0
    two    2.0
    Name: b, dtype: float64

    df.ix[[0, 1], [0, 1]]
       one  two
    a  1.0  1.0
    b  2.0  2.0
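Note: ix selects rows using a mix of labels and integer positions. The ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, so current code should use loc for labels and iloc for positions.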
    df.ix[df.one > 1, :1]
       one
    b  2.0
    c  3.0
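Note: rows can also be selected with a condition. This picks the rows where column one is greater than 1, restricted to the first column.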
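Because `ix` is no longer available in pandas 1.0 and later, the mixed label and position lookups need to be rewritten with `loc` and `iloc`. The sketch below shows one possible translation of three of the `ix` calls; converting positions to labels through `df.columns[...]` is a choice made here, not the only option.

```python
import pandas as pd

data = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(data)

# df.ix['a', [0, 1]]: row by label, columns by position.
row_a = df.loc['a', df.columns[[0, 1]]]

# df.ix[1, [0, 1]]: row and columns both by position.
row_1 = df.iloc[1, [0, 1]]

# df.ix[df.one > 1, :1]: boolean row mask, first column by position.
filtered = df.loc[df['one'] > 1, df.columns[:1]]

print(filtered)
```

The row labelled `d` is dropped by the mask because its value in column one is NaN, and NaN compares as False.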
    df['one'] = 16.8
        one  two
    a  16.8  1.0
    b  16.8  2.0
    c  16.8  3.0
    d  16.8  4.0

    val = pd.Series([2, 2, 2], index=['b', 'c', 'd'])
    df['one'] = val
       one  two
    a  NaN  1.0
    b  2.0  2.0
    c  2.0  3.0
    d  2.0  4.0
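Note: a column can be modified by assignment. A list or array assigned to a column must match the length of the DataFrame. An assigned Series is aligned exactly on the DataFrame's index, and positions with no match are filled with NaN.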
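The two assignment rules behave quite differently when the sizes do not line up: a short Series is accepted and padded with NaN, while a short list is rejected. A minimal sketch, where `length_mismatch` is an illustrative flag:

```python
import pandas as pd

data = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(data)

# A Series is aligned on the index; 'a' has no match and becomes NaN.
val = pd.Series([2, 2, 2], index=['b', 'c', 'd'])
df['one'] = val

# A plain list has no labels, so its length must equal the row count.
try:
    df['two'] = [1, 2, 3]          # 3 values for 4 rows
except ValueError:
    length_mismatch = True
else:
    length_mismatch = False

print(df)
print(length_mismatch)  # True
```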
    df['four'] = [3, 3, 3, 3]
       one  two  four
    a  NaN  1.0     3
    b  2.0  2.0     3
    c  2.0  3.0     3
    d  2.0  4.0     3
Note: assigning to a column that does not exist creates a new column.
    df.index.get_loc('a')
    0

    df.index.get_loc('b')
    1

    df.columns.get_loc('one')
    0
Note: get_loc returns the integer position of a row or column label.
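A typical use of `get_loc` is to bridge between labels and positions, for example when a label is known but the lookup has to go through `iloc`. A short sketch on the same frame; `row_pos`, `col_pos`, and `value` are illustrative names.

```python
import pandas as pd

data = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(data)

row_pos = df.index.get_loc('b')       # label -> row position
col_pos = df.columns.get_loc('two')   # label -> column position

# The positions address the same cell as the labels do.
value = df.iloc[row_pos, col_pos]
print(row_pos, col_pos, value)  # 1 1 2.0
```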