Pandas Study Notes

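Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive.

It aims to be the fundamental high-level building block for practical, real-world data analysis in Python. It also has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language.

Pandas is well suited to many different kinds of data:

Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet
Ordered and unordered time series data
Arbitrary matrix data with row and column labels
Any other form of observational/statistical data set. The data does not need to be labeled at all to be placed into a pandas data structure

The two primary data structures, Series (one-dimensional) and DataFrame (two-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and integrates well with many other third-party libraries.

Data structures

Data alignment is intrinsic: labels stay attached to their data unless you explicitly break the link.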

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

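Series is a one-dimensional labeled array capable of holding any data type. The axis labels are collectively referred to as the index. The basic way to create a Series is to call: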

s = pd.Series(data,index=index)

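data can be many different things, for example:

a Python dict
an ndarray
a scalar value

The passed index is a list of axis labels. Depending on what data is, there are several cases.

From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one is created with the values [0, ..., len(data) - 1].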

pd.Series(np.random.rand(5))

From dict

A Series can be instantiated from a dict; the keys become the index labels.

pd.Series({'b':1})

From scalar value

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

pd.Series(5.,index=['a','b'])
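The three construction cases above can be put side by side. This is a minimal sketch; the variable names are made up for illustration and the values are arbitrary.

```python
import numpy as np
import pandas as pd

# From an ndarray with explicit labels: index length must equal data length.
from_array = pd.Series(np.arange(3.0), index=['a', 'b', 'c'])

# From an ndarray without labels: a default integer index 0..n-1 is created.
default_index = pd.Series(np.arange(3.0))

# From a dict: the keys become the index labels.
from_dict = pd.Series({'b': 1, 'a': 0, 'c': 2})

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(5.0, index=['a', 'b', 'c'])

print(from_array, default_index, from_dict, from_scalar, sep='\n\n')
```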

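index refers to the row labels; columns refers to the column labels (DataFrame only).

A Series supports vectorized computation, slicing, and similar operations.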
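A short sketch of what computing on and slicing a Series looks like in practice. The data is hypothetical; the last statement shows that arithmetic between two Series aligns on labels rather than positions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

doubled = s * 2                 # vectorized arithmetic, no explicit loop
first_two = s.iloc[:2]          # positional slice; labels travel with the values
above_mean = s[s > s.mean()]    # boolean mask selection
roots = np.sqrt(s)              # NumPy ufuncs return a Series with the same index

# Arithmetic aligns on labels: 'a' and 'd' each appear on only one side,
# so the result is NaN at those labels.
aligned = s.iloc[1:] + s.iloc[:-1]
print(aligned)
```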

A Series can also have a name attribute:

s = pd.Series(np.random.randn(5),name='something')

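DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet or SQL table, or as a dict of Series objects. It is generally the most commonly used pandas object. It accepts many kinds of input:

Dict of 1D ndarrays, lists, dicts, or Series
2-D numpy.ndarray
Structured or record ndarray
A Series
Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame.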

From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these are first converted to Series.

In [34]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   ....:      'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ....:

In [35]: df = pd.DataFrame(d)

In [36]: df
Out[36]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [37]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[37]:
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

In [38]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[38]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

The row and column labels can be accessed through the index and columns attributes, respectively:

In [39]: df.index
Out[39]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [40]: df.columns
Out[40]: Index(['one', 'two'], dtype='object')

From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

In [41]: d = {'one': [1., 2., 3., 4.],
   ....:      'two': [4., 3., 2., 1.]}
   ....:

In [42]: pd.DataFrame(d)
Out[42]:
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [43]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[43]:
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

From structured or record array

This case is handled identically to a dict of arrays.

In [44]: data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])

In [45]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [46]: pd.DataFrame(data)
Out[46]:
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [47]: pd.DataFrame(data, index=['first', 'second'])
Out[47]:
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [48]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[48]:
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

From a list of dicts

In [49]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [50]: pd.DataFrame(data2)
Out[50]:
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [51]: pd.DataFrame(data2, index=['first', 'second'])
Out[51]:
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [52]: pd.DataFrame(data2, columns=['a', 'b'])
Out[52]:
   a   b
0  1   2
1  5  10

From a dict of tuples

Passing a dict whose keys are tuples automatically creates a frame with a MultiIndex on both axes.

In [53]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....:
Out[53]:
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

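Alternate constructors

DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter, which is 'columns' by default but can be set to 'index' in order to use the dict keys as row labels.

In [54]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
Out[54]:
   A  B
0  1  4
1  2  5
2  3  6

If you pass orient='index', the keys will be the row labels. In this case you can also pass the desired column names:

In [55]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
   ....:                        orient='index', columns=['one', 'two', 'three'])
   ....:
Out[55]:
   one  two  three
A    1    2      3
B    4    5      6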

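DataFrame.from_records takes a list of tuples or an ndarray with a structured dtype. It works like the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured dtype.

In [57]: pd.DataFrame.from_records(data, index='C')
Out[57]:
          A    B
C
b'Hello'  1  2.0
b'World'  2  3.0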
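from_records also accepts a plain list of tuples, as the description above says. A minimal sketch, assuming hypothetical records and column names; columns names the tuple fields and index picks one of them as the row labels.

```python
import pandas as pd

# Hypothetical records: each tuple is one row.
records = [(1, 2.0, 'Hello'), (2, 3.0, 'World')]

# Name the fields, then promote field 'C' to the row index.
frame = pd.DataFrame.from_records(records, columns=['A', 'B', 'C'], index='C')
print(frame)
```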

Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations.

In [58]: df['one']
Out[58]:
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [59]: df['three'] = df['one'] * df['two']

In [60]: df['flag'] = df['one'] > 2

In [61]: df
Out[61]:
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

Deletion

In [62]: del df['two']

In [63]: three = df.pop('three')

In [64]: df
Out[64]:
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False

Insertion (a scalar value is propagated to fill the whole column)

In [65]: df['foo'] = 'bar'

In [66]: df
Out[66]:
   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar

When inserting a Series that does not have the same index as the DataFrame, it is conformed to the DataFrame's index:

In [67]: df['one_trunc'] = df['one'][:2]

In [68]: df
Out[68]:
   one   flag  foo  one_trunc
a  1.0  False  bar        1.0
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN
d  NaN  False  bar        NaN

You can insert raw ndarrays, but their length must match the length of the DataFrame's index.
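A small sketch of the length rule for raw ndarrays, using a hypothetical three-row frame. A matching array is assigned by position; a shorter one is rejected with a ValueError.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# Same length as the index: values are assigned by position.
frame['raw'] = np.array([10, 20, 30])

# Wrong length: pandas refuses the assignment and the frame is unchanged.
try:
    frame['bad'] = np.array([1, 2])
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame)
print(type(length_error).__name__)
```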

By default, columns are inserted at the end. The insert function is available to insert at a particular location in the columns:

In [69]: df.insert(1, 'bar', df['one'])

In [70]: df
Out[70]:
   one  bar   flag  foo  one_trunc
a  1.0  1.0  False  bar        1.0
b  2.0  2.0  False  bar        2.0
c  3.0  3.0   True  bar        NaN
d  NaN  NaN  False  bar        NaN

Assigning new columns in method chains

DataFrame has an assign() method that lets you easily create new columns derived from existing columns.

In [71]: iris = pd.read_csv('data/iris.data')

In [72]: iris.head()
Out[72]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [73]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
   ....:      .head())
   ....:
Out[73]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200

Above, we inserted a precomputed value. We can also pass in a function of one argument, which is evaluated on the DataFrame being assigned to.

In [74]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth'] /
   ....:                                    x['SepalLength'])).head()
   ....:
Out[74]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200

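assign always returns a copy of the data, leaving the original DataFrame untouched.

Passing a callable, as opposed to an actual value to be inserted, is useful when you do not have a reference to the DataFrame at hand. This is common when using assign in a chain of operations.

In [75]: (iris.query('SepalLength > 5')
   ....:      .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
   ....:              PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
   ....:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
   ....: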
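The iris examples need a data file on disk. The same chaining pattern can be tried on a small in-memory frame; the numbers below are made up and only the column names mirror the iris data. The sketch also checks that the original frame is left untouched.

```python
import pandas as pd

# Hypothetical measurements standing in for the iris file.
flowers = pd.DataFrame({
    'SepalLength': [5.1, 4.9, 6.0],
    'SepalWidth': [3.5, 3.0, 3.0],
})

result = (
    flowers
    .query('SepalLength > 5')
    # The lambda receives the filtered frame, which has no variable name here.
    .assign(SepalRatio=lambda x: x['SepalWidth'] / x['SepalLength'])
)

print(result)
print(list(flowers.columns))   # the original still has only its two columns
```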

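Indexing / selection

Select a column: df[col], returns a Series

Select a row by label: df.loc[label], returns a Series

Select a row by integer position: df.iloc[loc], returns a Series

Slice rows: df[5:10], returns a DataFrame

Select rows by boolean vector: df[bool_vec], returns a DataFrame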
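The five selection forms listed above, applied to one hypothetical frame so the return types can be compared directly:

```python
import pandas as pd

df = pd.DataFrame(
    {'x': [10, 20, 30, 40], 'y': [1.5, 2.5, 3.5, 4.5]},
    index=['a', 'b', 'c', 'd'],
)

col = df['x']                # column -> Series
row_by_label = df.loc['b']   # row by label -> Series indexed by column names
row_by_pos = df.iloc[1]      # row by integer position -> Series
rows = df[1:3]               # row slice -> DataFrame
masked = df[df['x'] > 20]    # boolean vector -> DataFrame

print(type(col).__name__, type(rows).__name__)
```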

Data alignment and arithmetic

Operations between DataFrame objects automatically align on both the columns and the index (row labels). The resulting object has the union of the column and row labels.

In [82]: df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

In [83]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

In [84]: df + df2
Out[84]:
        A       B       C   D
0  0.0457 -0.0141  1.3809 NaN
1 -0.9554 -1.5010  0.0372 NaN
2 -0.6627  1.5348 -0.8597 NaN
3 -2.4529  1.2373 -0.1337 NaN
(rows 4 through 9 are not shown)
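Because the example above uses random numbers, the alignment rule is easier to verify on fixed data. A minimal sketch with two hypothetical frames that overlap in exactly one cell:

```python
import pandas as pd

left = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=[0, 1])
right = pd.DataFrame({'A': [10.0, 20.0, 30.0]}, index=[1, 2, 3])

# The result carries the union of row labels (0..3) and column labels (A, B).
# Only row 1, column 'A' exists on both sides; every other cell is NaN.
total = left + right
print(total)
```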

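When doing an operation between a DataFrame and a Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise.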

df - df.iloc[0]
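A sketch of the row-wise broadcast on a small hypothetical frame: the first row, a Series indexed by the column labels, is subtracted from every row.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

# df.iloc[0] is a Series with index ['A', 'B']; it is matched against the
# columns and subtracted from each row in turn.
row_wise = df - df.iloc[0]

# The method form with axis='columns' spells out the same alignment.
same_thing = df.sub(df.iloc[0], axis='columns')
print(row_wise)
```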

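Older pandas versions treated time series data as a special case: if the DataFrame index contained dates, a Series was broadcast column-wise. That behavior was deprecated. To broadcast column-wise explicitly, pass axis=0 to a method such as sub(), for example df.sub(df['A'], axis=0).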

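index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

# Operations with scalars work element-wise
df * 5 + 2
1 / df
df ** 4

# Boolean operators also work element-wise, on boolean frames
df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)
df1 & df2
df1 | df2
df1 ^ df2
~df1  # logical NOT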

Transposing

To transpose, access the T attribute, similar to an ndarray:

df[:5].T

DataFrame interoperability with NumPy functions

NumPy functions can be applied directly to a DataFrame, assuming the data inside is numeric:

np.exp(df)

np.asarray(df)

df.T.dot(df)
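To see what each of these calls returns, here is a sketch on a small hypothetical 2x2 frame. The element-wise ufunc keeps the labels, np.asarray drops them, and dot performs a labeled matrix product.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.0, 1.0], 'B': [2.0, 3.0]})

exp_df = np.exp(df)      # still a DataFrame, same index and columns
arr = np.asarray(df)     # plain ndarray of shape (2, 2), labels dropped
gram = df.T.dot(df)      # 2x2 matrix product, labeled by the column names

print(gram)
```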

Accessing DataFrame columns as attributes

If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:

df.A

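Essential basic functionality

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.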

long_series = pd.Series(np.random.randn(100))

long_series.head()   # first five elements

long_series.tail(3)  # last three elements

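Attributes and the raw ndarray(s)

df.columns = [x.lower() for x in df.columns]  # lowercase the column names

df.values  # access the underlying data as an ndarray (recent pandas versions recommend df.to_numpy())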

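Accelerated operations

pandas supports accelerating certain types of binary numerical and boolean operations using the numexpr and bottleneck libraries. Both are used by default when installed, and can be switched off through options:

pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)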

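Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and the related reversed functions radd(), rsub(), and so on, for carrying out binary operations.

For broadcasting behavior, Series input is of primary interest. With these functions you can match on either the index or the columns via the axis keyword.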
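A sketch of the axis keyword on a hypothetical frame. The same sub() call matches a Series against the columns or against the row index depending on axis; rsub() swaps the operands.

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
    index=['a', 'b', 'c'],
)
row = df.loc['b']      # Series indexed by column labels
column = df['two']     # Series indexed by row labels

by_columns = df.sub(row, axis='columns')   # same as df - row
by_index = df.sub(column, axis='index')    # subtract 'two' from every column
reversed_sub = df.rsub(10)                 # computes 10 - df

print(by_index)
```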

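Missing data / operations with fill values

The arithmetic functions accept an optional fill_value: a value to substitute when at most one of the two values at a location is missing. If both are missing, the result is still missing.

df.add(df2, fill_value=0)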
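The effect of fill_value is easiest to see next to the plain operator. A minimal sketch with hypothetical single-column frames covering all four cases (both present, left missing, right missing, both missing):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan]})
right = pd.DataFrame({'A': [10.0, 20.0, np.nan, np.nan]})

plain = left + right                      # NaN wherever either side is missing
filled = left.add(right, fill_value=0)    # a lone missing value is treated as 0

print(plain['A'].tolist())
print(filled['A'].tolist())
```

The last row stays NaN in both results because the value is missing on both sides.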

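Boolean reductions

The reductions empty, any(), all(), and bool() provide a way to summarize a boolean result (bool() is deprecated in recent pandas releases).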

(df>0).all()
(df>0).any()
df.empty
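A sketch of these reductions on a hypothetical frame. all() and any() reduce down each column by default, so chaining a second any() collapses the result to a single value.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [-1, 0, 5]})
positive = df > 0

all_per_column = positive.all()           # A: True, B: False
any_per_column = positive.any()           # A: True, B: True
any_overall = bool(positive.any().any())  # single True/False for the frame

is_empty = df.empty                                 # False: the frame has rows
no_rows = pd.DataFrame(columns=['A', 'B']).empty    # True: columns but no rows

print(all_per_column)
```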

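Comparing if objects are equivalent

Testing equality with (df + df == df * 2).all() is unreliable: NaN values never compare equal to each other, so any column that contains a NaN reports False.

Use (df + df).equals(df * 2) instead; equals() treats NaNs in corresponding locations as equal.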
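A sketch of the difference, using a hypothetical frame with one NaN. The element-wise test fails for column 'A' only because of the NaN, while equals() reports the two results as the same.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan], 'B': [3.0, 4.0]})

# Element-wise: NaN == NaN is False, so column 'A' does not pass.
elementwise = (df + df == df * 2).all()

# equals(): NaNs in the same location count as equal.
same = (df + df).equals(df * 2)

print(elementwise)
print(same)
```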

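Comparing array-like objects

When comparing a pandas data structure with a scalar value, the comparison is performed element-wise:

In [64]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[64]:
0     True
1    False
2    False
dtype: bool

In [65]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[65]: array([ True, False, False], dtype=bool)

pandas also handles element-wise comparisons between different array-like objects of the same length:

In [66]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[66]:
0     True
1     True
2    False
dtype: bool

In [67]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[67]:
0     True
1     True
2    False
dtype: bool

Trying to compare Index or Series objects of different lengths raises a ValueError.

Note: this differs from NumPy, where a comparison between arrays of different lengths can be broadcast.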
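The contrast with NumPy can be sketched as follows. The Series values are hypothetical; the NumPy comparison uses a length-1 array, which NumPy broadcasts against the longer one.

```python
import numpy as np
import pandas as pd

# pandas: Series of different lengths cannot be compared element-wise.
try:
    pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
    series_error = None
except ValueError as exc:
    series_error = exc

# NumPy: the length-1 array is broadcast across the length-3 array.
numpy_result = np.array([1, 2, 3]) == np.array([2])

print(type(series_error).__name__)
print(numpy_result)
```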

Combining overlapping data sets

A problem that occasionally comes up is combining two similar data sets where the values in one are preferred over the other. We want to combine two DataFrame objects so that missing values in one DataFrame are filled with like-labeled values from the other. The function implementing this is combine_first(); where both frames have a value at the same location, the value from the first (the caller) is used.

In [70]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
   ....:                     'B': [np.nan, 2., 3., np.nan, 6.]})
   ....:

In [71]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B': [np.nan, np.nan, 3., 4., 6., 8.]})
   ....:

In [72]: df1
Out[72]:
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0

In [73]: df2
Out[73]:
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0

In [74]: df1.combine_first(df2)
Out[74]:
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

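General DataFrame combine

DataFrame.combine() is the more general method. It takes another DataFrame and a combiner function, aligns the two DataFrames, and then passes the combiner function pairs of Series (that is, columns with the same name).

In [75]: combiner = lambda x, y: np.where(pd.isna(x), y, x)

In [76]: df1.combine(df2, combiner)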
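A sketch that runs the combiner on smaller hypothetical frames and sets the result beside combine_first(). The named function prefer_first is an illustrative stand-in for the lambda above; with this particular combiner the two methods should give the same values.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 2.0, 3.0]})
df2 = pd.DataFrame({'A': [5.0, 2.0, 4.0, 7.0], 'B': [np.nan, np.nan, 3.0, 8.0]})

def prefer_first(x, y):
    # x and y are aligned columns with the same name; keep x unless it is missing.
    return np.where(pd.isna(x), y, x)

combined = df1.combine(df2, prefer_first)
first = df1.combine_first(df2)

print(combined)
print(first)
```

Row 3 exists only in df2, so alignment introduces it into df1 as NaN and the combiner then takes the df2 value. Cell (0, 'B') is missing in both frames and stays NaN.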

Descriptive statistics

http://pandas.pydata.org/pandas-docs/stable/basics.html