Python Data Analysis: pandas

Date: 2019-11-24


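pandas is a powerful Python toolkit for data analysis.

pandas is built on top of NumPy.

Main features of pandas:

Data structures with built-in alignment: DataFrame and Series
Integrated time series functionality
A rich set of mathematical operations and manipulations

Installation: pip install pandas

Import: import pandas as pd

`Series:`

A Series is a labeled one-dimensional array that can hold any data type (integers, strings, floating-point numbers, Python objects, and so on). The axis labels are collectively called the index. The basic way to create a Series is to call: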

s = pd.Series(data, index=index)

data can be:

a Python dict
an ndarray
a scalar value

If data is an ndarray, the index must be the same length as the data. If no index is passed, one is created with the values [0, ..., len(data) - 1].

In [125]: import pandas as pd

In [126]: a = pd.Series([4,7,-5,3],index=['a','b','c','d'])

In [127]: a
Out[127]: 
a    4
b    7
c   -5
d    3
dtype: int64


In [130]: b = pd.Series({'a':1,'b':2})

In [131]: b
Out[131]: 
a    1
b    2
dtype: int64
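The session above covers a list and a dict. As a minimal sketch of the remaining cases (scalar data and the default index), with made-up values chosen only for illustration:

```python
import pandas as pd

# A scalar is repeated once for every label in the index.
s_scalar = pd.Series(5.0, index=['a', 'b', 'c'])

# Without an explicit index, pandas creates the integer index 0..len(data)-1.
s_default = pd.Series([4, 7, -5, 3])

print(s_scalar)
print(list(s_default.index))
```

With scalar data the index is what determines the length of the resulting Series.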

Getting the value array and the index array: the values and index attributes

In [133]: a.values
Out[133]: array([ 4,  7, -5,  3])

In [134]: a.index
Out[134]: Index(['a', 'b', 'c', 'd'], dtype='object')

Series features:

Creating a Series from an ndarray: Series(arr)

In [139]: import numpy as np

In [140]: arr = np.array([1,2,3,4,5])  # must be a one-dimensional array

In [141]: a = pd.Series(arr)

In [142]: a
Out[142]: 
0    1
1    2
2    3
3    4
4    5
dtype: int64

Arithmetic with a scalar: sr*2

In [142]: a
Out[142]: 
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [143]: a*2
Out[143]: 
0     2
1     4
2     6
3     8
4    10
dtype: int64

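Arithmetic between two Series: sr1+sr2. The indexes need to match: no error is raised when they differ, but every label that is missing from either operand gets NaN (not a number) in the result.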

In [145]: b
Out[145]: 
a    4
b    7
c   -5
d    3
e    0
dtype: int64

In [146]: a
Out[146]: 
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [147]: b+a
Out[147]: 
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
dtype: float64


In [150]: b = pd.Series(np.arange(5))

In [151]: a
Out[151]: 
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [152]: b
Out[152]: 
0    0
1    1
2    2
3    3
4    4
dtype: int64

In [153]: a+b
Out[153]: 
0    1
1    3
2    5
3    7
4    9
dtype: int64

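Indexing: sr[0], sr[[1,2,4]] (for a Series with non-integer labels, recent pandas versions deprecate positional access through sr[0]; sr.iloc[0] is the explicit form)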

In [157]: b
Out[157]: 
a    4
b    7
c   -5
d    3
e    0
dtype: int64

In [158]: b[0]
Out[158]: 4

In [159]: b['a']
Out[159]: 4

Slicing: sr[0:2] (a slice is a view onto the original data, a shallow copy rather than an independent one)

In [161]: b[1:3]
Out[161]: 
b    7
c   -5
dtype: int64

In [162]: b['b':'d']
Out[162]: 
b    7
c   -5
d    3
dtype: int64

In [163]: b['b':'c']
Out[163]: 
b    7
c   -5
dtype: int64
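The outputs above show that a positional slice excludes its end point while a label slice includes it. A minimal sketch that makes the difference explicit with iloc and loc, reusing the same values as b:

```python
import pandas as pd

b = pd.Series([4, 7, -5, 3, 0], index=['a', 'b', 'c', 'd', 'e'])

# Positional slice: positions 1 and 2, the end position 3 is excluded.
by_position = b.iloc[1:3]

# Label slice: labels 'b' through 'd', the end label is included.
by_label = b.loc['b':'d']

print(list(by_position.index))
print(list(by_label.index))
```

Writing iloc or loc explicitly avoids having to remember which rule plain square brackets apply.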

Universal functions: np.abs(sr)
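A small sketch of a NumPy universal function applied to a Series, using the same values as b elsewhere in this section:

```python
import numpy as np
import pandas as pd

b = pd.Series([4, 7, -5, 3, 0], index=['a', 'b', 'c', 'd', 'e'])

# The ufunc is applied element by element; the result is still a Series
# and keeps the original labels.
abs_b = np.abs(b)

print(abs_b)
```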

Boolean filtering: sr[sr>0]

In [164]: b
Out[164]: 
a    4
b    7
c   -5
d    3
e    0
dtype: int64

In [165]: b[b>3]
Out[165]: 
a    4
b    7
dtype: int64

Statistical functions: mean(), sum(), cumsum()

In [169]: b
Out[169]: 
a    4
b    7
c   -5
d    3
e    0
dtype: int64

In [170]: b.mean()
Out[170]: 1.8

In [171]: b.sum()
Out[171]: 9

In [173]: b.cumsum()  # running total: each value plus all the values before it
Out[173]: 
a     4
b    11
c     6
d     9
e     9
dtype: int64

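A Series also behaves like a dict keyed by its labels:
- Creating a Series from a dict: Series(dic)
- The in operator: 'a' in sr tests the labels, while for x in sr iterates over the values
- Key indexing: sr['a'], sr[['a', 'b', 'd']]
- Key slicing: sr['a':'c']
- Other methods: get('a', default=0) and so on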
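The dict-like behaviour listed above can be sketched as follows. The Series contents are invented for illustration:

```python
import pandas as pd

sr = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

has_a = 'a' in sr                 # membership tests the labels
has_1 = 1 in sr                   # 1 is a value, not a label
values = [x for x in sr]          # iteration yields the values
missing = sr.get('z', default=0)  # default is returned for an unknown label
picked = sr[['a', 'b', 'd']]      # a list of keys selects several entries
sliced = sr['a':'c']              # key slicing includes the end label

print(has_a, has_1, values, missing)
```

Note the difference from a plain dict: iterating over a dict yields keys, whereas iterating over a Series yields values.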

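Integer indexes:

If the index is of integer type, selecting with an integer (sr[n]) is always interpreted as a label, not as a position.

The loc attribute interprets its argument as a label
The iloc attribute interprets its argument as a position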

In [178]: sr = pd.Series(np.arange(4.))

In [179]: sr
Out[179]: 
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

In [185]: sr.iloc[0]
Out[185]: 0.0

In [186]: sr.loc[0]
Out[186]: 0.0
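In the session above the labels happen to equal the positions, so loc and iloc return the same value. A minimal sketch with a shuffled integer index (values invented for illustration) shows where they diverge:

```python
import pandas as pd

# The labels 2, 0, 1 deliberately do not match the positions 0, 1, 2.
sr = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

by_label = sr.loc[0]      # the entry whose label is 0
by_position = sr.iloc[0]  # the first entry
plain = sr[0]             # with an integer index, [] is label-based

print(by_label, by_position, plain)
```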

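Series data alignment

pandas aligns operands by index before computing. If the two indexes differ, the index of the result is the union of the operands' indexes.

Flexible arithmetic methods: add, sub, div, mul
sr1.add(sr2, fill_value=0) substitutes 0 for a label that is missing from one of the operands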

In [189]: sr1
Out[189]: 
c    12
a    23
d    34
dtype: int64

In [190]: sr2
Out[190]: 
d    11
c    20
a    10
dtype: int64

In [191]: sr1+sr2
Out[191]: 
a    33
c    32
d    45
dtype: int64

In [192]: sr3 = pd.Series([11,20,10,14], index=['d','c','a','b'])

In [193]: sr2+sr3
Out[193]: 
a    20.0
b     NaN
c    40.0
d    22.0
dtype: float64


In [194]: sr2.add(sr3,fill_value = 0)
Out[194]: 
a    20.0
b    14.0
c    40.0
d    22.0
dtype: float64

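Missing data: NaN (Not a Number) represents missing data. Its value is np.nan. The built-in None is also treated as NaN.
Methods for handling missing data:

dropna() filters out the entries whose value is NaN
fillna() fills in missing data
isnull() returns a boolean array that is True where a value is missing
notnull() returns a boolean array that is False where a value is missing

Filtering out missing data: sr.dropna() or sr[sr.notnull()]
Filling in missing data: sr.fillna(0)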
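The four methods above can be tried on a small Series. This is a sketch with invented values, one np.nan and one None:

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, np.nan, 3.0, None], index=['a', 'b', 'c', 'd'])

mask = sr.isnull()        # True where a value is missing
dropped = sr.dropna()     # missing entries removed
same = sr[sr.notnull()]   # equivalent to dropna() via a boolean mask
filled = sr.fillna(0)     # missing entries replaced by 0

print(mask.tolist())
print(filled.tolist())
```

None of these calls modifies sr; each returns a new object.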

DataFrame

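A DataFrame is a tabular data structure that contains an ordered collection of columns.

A DataFrame can be thought of as a dict of Series that all share one index.

Ways to create one: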

pd.DataFrame({'one':[1,2,3,4],'two':[4,3,2,1]})
pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})

In [195]: df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']), 'two'
     ...: :pd.Series([1,2,3,4],index=['b','a','c','d'])})

In [196]: df
Out[196]: 
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4

Reading and writing CSV files:

pd.read_csv('filename.csv')
df.to_csv('filename.csv')
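A minimal round-trip sketch. To stay self-contained it writes to an in-memory io.StringIO buffer instead of a file on disk, which is an assumption made for the example; a file name works the same way:

```python
import io
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 3, 2]}, index=['a', 'b', 'c'])

buffer = io.StringIO()
df.to_csv(buffer)          # to_csv is a DataFrame method
buffer.seek(0)

# read_csv is a module-level function; index_col=0 turns the first
# column back into the row index.
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```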

Common attributes and methods:

T: transpose
index: the row index
columns: the column index
values: the array of values

In [213]: df
Out[213]: 
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4

In [214]: df.index
Out[214]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [215]: df.columns
Out[215]: Index(['one', 'two'], dtype='object')

In [216]: df.values
Out[216]: 
array([[  1.,   2.],
       [  2.,   1.],
       [  3.,   3.],
       [ nan,   4.]])

Indexing and slicing:

In [196]: df
Out[196]: 
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4

In [197]: df['one']  # must be a column name
Out[197]: 
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [221]: df
Out[221]: 
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4

In [222]: df.loc['a',df.columns[1]]  # row and column
Out[222]: 2

Selecting by position:

df.iloc[3]
df.iloc[3,3]
df.iloc[0:3,4:6]
df.iloc[1:5,:]
df.iloc[[1,2,4],[0,3]]

Filtering by boolean values:

df[df['A']>0]

df[df['A'].isin([1,3,5])]

In [237]: df['one'].isin([1.0,3.0])
Out[237]: 
a     True
b    False
c     True
d    False
Name: one, dtype: bool

df[df<0] = 0
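The three boolean patterns above, sketched on a small frame. The column names A and B and the values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [-1, 2, -3, 5]})

positive_a = df[df['A'] > 0]           # rows where column A is positive
in_set = df[df['A'].isin([1, 3, 5])]   # rows where A is one of the listed values

# Assigning through a boolean frame replaces every matching cell.
clipped = df.copy()
clipped[clipped < 0] = 0

print(positive_a)
print(clipped)
```

The assignment is done on a copy here so that the original frame stays available for comparison.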

DataFrame data alignment and missing data:

Methods for handling missing data in a DataFrame:

dropna(axis=0, how='any', …)  # drops rows by default; axis=0 drops rows, axis=1 drops columns

In [261]: df
Out[261]: 
   one  two    1
a  1.0  2.0  NaN
b  2.0  1.0  1.0
c  3.0  3.0  NaN
d  NaN  NaN  NaN
3  1.0  1.0  1.0

In [262]: df.dropna()
Out[262]: 
   one  two    1
b  2.0  1.0  1.0
3  1.0  1.0  1.0

In [263]: df.dropna(axis=1)
Out[263]: 
Empty DataFrame
Columns: []
Index: [a, b, c, d, 3]

fillna()
isnull()
notnull()

In [223]: df
Out[223]: 
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4

In [224]: df.fillna(11)
Out[224]: 
    one  two
a   1.0    2
b   2.0    1
c   3.0    3
d  11.0    4

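Common pandas methods (they apply to both Series and DataFrame unless noted):

mean(axis=0, skipna=False) mean along an axis; skipna defaults to True, and skipna=False makes the result NaN when any value is missing
sum(axis=1) sum along an axis
sort_index(axis, …, ascending) sorts by the row or column index
sort_values(by, axis, ascending) sorts by value
NumPy universal functions also work on pandas objects
apply(func, axis=0) applies a custom function to each column (axis=0) or each row (axis=1); func may return a scalar or a Series
applymap(func) applies a function to every element of a DataFrame (renamed DataFrame.map in pandas 2.1)
map(func) applies a function to every element of a Series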
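A short sketch of several of the methods listed above on a small frame. The data is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'one': [3, 1, 2], 'two': [20, 10, 30]}, index=['c', 'a', 'b'])

col_means = df.mean(axis=0)    # one value per column
row_sums = df.sum(axis=1)      # one value per row
by_index = df.sort_index()     # rows ordered by label
by_value = df.sort_values(by='two', ascending=False)

# apply passes each column (axis=0) to the function as a Series.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# map works element by element on a single Series.
doubled = df['one'].map(lambda x: x * 2)

print(by_value)
print(col_range)
```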

Hierarchical indexing

Hierarchical indexing is an important pandas feature that allows more than one index level on a single axis.
Example: data=pd.Series(np.random.rand(9), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], [1,2,3,1,2,3,1,2,3]])
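A sketch of how such a two-level Series can be queried. It uses the same index as the example above, but swaps np.random.rand(9) for np.arange(9) so the results are predictable:

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9),
    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
           [1, 2, 3, 1, 2, 3, 1, 2, 3]],
)

outer = data.loc['b']         # all entries under outer label 'b'
single = data.loc[('b', 2)]   # one entry, addressed by both levels
inner = data.loc[:, 2]        # inner label 2 under every outer label
wide = data.unstack()         # inner level becomes the columns of a DataFrame

print(outer)
print(wide)
```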

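Reading from files

Reading files: data can be loaded from a file name, a URL, or a file object

read_csv: the default separator is a comma
read_table: the default separator is \t
read_excel: reads Excel files

Main parameters of the file-reading functions:

sep: the separator; a regular expression such as '\s+' is allowed
header=None: the file has no row of column names
names: the column names to use
index_col: the column to use as the index
skiprows: rows to skip
na_values: strings that should be treated as missing values
parse_dates: whether columns are parsed as dates; a boolean or a list of columns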
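A sketch that combines the parameters above. The whitespace-separated text, its banner line and the 'missing' marker are all invented for illustration, and io.StringIO stands in for a real file:

```python
import io
import pandas as pd

raw = (
    "exported by a hypothetical tool\n"
    "2007-03-01 22.074 20.657\n"
    "2007-03-02 NA     20.489\n"
    "2007-03-05 20.300 missing\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s+',                       # one or more whitespace characters
    skiprows=1,                       # skip the banner line
    header=None,                      # the data has no header row
    names=['date', 'open', 'close'],  # so the column names are supplied here
    index_col='date',
    na_values=['missing'],            # extra marker for missing values
    parse_dates=['date'],
)

print(df)
```

The string 'NA' is recognised as missing by default, so only the custom marker needs to be listed in na_values.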

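Writing to files:

to_csv

Main parameters of the file-writing function:

sep: the separator
na_rep: the string written for missing values; defaults to the empty string
header=False: do not write the row of column names
index=False: do not write the row index column

Other file types: JSON, XML, HTML, databases
Saving pandas objects in a binary format (pickle):

to_pickle (called save in very old pandas versions)
read_pickle (called load in very old pandas versions)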
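A minimal pickle round trip, sketched with the current method names to_pickle and read_pickle. The temporary directory and the file name frame.pkl are assumptions made for the example:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, None], 'two': [4, 3, 2]}, index=['a', 'b', 'c'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.pkl')  # hypothetical file name
    df.to_pickle(path)                     # write the frame in binary form
    restored = pd.read_pickle(path)        # read it back

print(restored)
```

Unlike CSV, pickle preserves dtypes and the index exactly, but a pickle file should only be loaded when it comes from a trusted source.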

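Handling time objects (often used as an index):

Third-party package: dateutil

dateutil.parser.parse()

Parsing dates in bulk: pandas

pd.to_datetime(['2001-01-01', '2002-02-02'])

Generating an array of time objects: date_range

start: the start time
end: the end time
periods: the number of periods to generate
freq: the frequency, 'D' by default; options include H (hour), W (week), B (business day), SM (semi-month), T (minute), S (second), A (year), … (recent pandas versions prefer the aliases 'h', 'min' and 's' for hour, minute and second)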
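A sketch of to_datetime and of date_range with a few frequencies. The dates are chosen to match the March 2007 data used later in this post:

```python
import pandas as pd

parsed = pd.to_datetime(['2001-01-01', '2002-02-02'])

# freq defaults to 'D', so this yields five consecutive calendar days.
days = pd.date_range(start='2007-03-01', periods=5)

# 'B' keeps business days only (Monday to Friday).
business = pd.date_range(start='2007-03-01', end='2007-03-09', freq='B')

# 'W' yields week ends, which fall on Sundays.
weekly = pd.date_range(start='2007-03-01', periods=3, freq='W')

print(parsed)
print(business)
print(weekly)
```

Any two of start, end and periods are enough to pin down the range once freq is known.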

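A time series is a Series or DataFrame whose index consists of time objects.

When datetime objects are used as an index, they are stored in a DatetimeIndex object.
Special features of time series:

Slicing by passing a year or a year and month
Slicing by passing a date range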
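Both slicing styles can be sketched on a synthetic daily series. The data is invented, and loc is used so that the selection is explicitly label-based:

```python
import numpy as np
import pandas as pd

# 100 consecutive days starting on 2006-12-30, so the series spans two years.
idx = pd.date_range(start='2006-12-30', periods=100, freq='D')
ts = pd.Series(np.arange(100), index=idx)

year_2007 = ts.loc['2007']                  # every entry in the year 2007
march = ts.loc['2007-03']                   # every entry in March 2007
window = ts.loc['2007-01-10':'2007-01-15']  # date range, both ends included

print(len(year_2007), len(march))
print(window)
```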

df = pd.read_csv('601318.csv',index_col='date',parse_dates=['date'],)

In [266]: df
Out[266]: 
            Unnamed: 0    open   close    high     low      volume    code
date
2007-03-01           0  22.074  20.657  22.503  20.220  1977633.51  601318
2007-03-02           1  20.750  20.489  20.944  20.256   425048.32  601318
2007-03-05           2  20.300  19.593  20.384  19.218   419196.74  601318
2007-03-06           3  19.426  19.977  20.308  19.315   297727.88  601318
2007-03-07           4  19.995  20.520  20.706  19.827   287463.78  601318
2007-03-08           5  20.353  20.273  20.454  20.167   130983.83  601318
2007-03-09           6  20.264  20.101  20.353  19.735   160887.79  601318
2007-03-12           7  19.999  19.739  19.999  19.646   145353.06  601318
2007-03-13           8  19.783  19.818  19.982  19.699   102319.68  601318
2007-03-14           9  19.558  19.841  19.911  19.333   173306.56  601318
2007-03-15          10  20.097  19.849  20.525  19.779   152521.90  601318
2007-03-16          11  19.863  19.960  20.286  19.602   227547.24  601318
2007-03-20          12  20.662  20.211  20.715  20.088   222026.87  601318
2007-03-21          13  20.220  19.911  20.308  19.823   136728.32  601318
2007-03-22          14  20.066  20.026  20.273  19.969   167509.84  601318
2007-03-23          15  20.017  19.938  20.101  19.739   139810.14  601318
2007-03-26          16  19.955  20.282  20.397  19.946   223266.79  601318
2007-03-27          17  20.216  20.269  20.467  20.145   139338.19  601318
2007-03-28          18  20.264  20.565  20.706  20.123   258263.69  601318

df[df['close']>df['open']]  # the benefit of a time index
