The pandas module

Official pandas documentation: https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750

pandas is built on top of NumPy and can be thought of as a tool for handling text and tabular data. pandas has two main data structures: Series, which is similar to a one-dimensional NumPy array, and DataFrame, which is similar to a multi-dimensional tabular structure.

pandas is the core module for data analysis in Python. It provides five main groups of functionality (a short sketch follows the list):

  1. File I/O: databases (SQL), html, json, pickle, csv (txt, excel), sas, stata, hdf, and more.
  2. Single-table operations: insert/delete/update/select, slicing, higher-order functions, group-by aggregation, and conversion to and from dict and list.
  3. Joining and merging multiple tables.
  4. Simple plotting.
  5. Simple statistical analysis.
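A minimal sketch of a few of these capabilities, using a small hand-made DataFrame (the column names and values below are invented purely for illustration):

import pandas as pd

df = pd.DataFrame({
    'city': ['bj', 'bj', 'sh', 'sh'],
    'year': [2018, 2019, 2018, 2019],
    'sales': [10, 12, 8, 11],
})
# single-table operation: group by city and aggregate the sales column
print(df.groupby('city')['sales'].sum())
city
bj    22
sh    19
Name: sales, dtype: int64
# simple plotting (requires matplotlib)
# df.plot(x='year', y='sales', kind='line')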

1. The Series data structure

A Series is a one-dimensional, array-like object consisting of a set of data values and an associated set of data labels (the index).

A Series behaves like a cross between a list (array) and a dict.

import numpy as np
import pandas as pd
df = pd.Series(0, index=['a', 'b', 'c', 'd'])
print(df)
a    0
b    0
c    0
d    0
dtype: int64
print(df.values)
[0 0 0 0]
print(df.index)
Index(['a', 'b', 'c', 'd'], dtype='object')

1.1 Series supports NumPy-style behaviour (positional access)

Operation Example
Create a Series from an ndarray Series(arr)
Arithmetic with a scalar df*2
Arithmetic between two Series df1+df2
Indexing df[0], df[[1,2,4]]
Slicing df[0:2]
Universal functions np.abs(df)
Boolean filtering df[df>0]
arr = np.array([1, 2, 3, 4, np.nan])
print(arr)
[ 1.  2.  3.  4. nan]
df = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
print(df)
a    1.0
b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64
print(df**2)
a     1.0
b     4.0
c     9.0
d    16.0
e     NaN
dtype: float64
print(df[0])
1.0
print(df['a'])
1.0
print(df[[0, 1, 2]])
a    1.0
b    2.0
c    3.0
dtype: float64
print(df[0:2])
a    1.0
b    2.0
dtype: float64
np.sin(df)
a    0.841471
b    0.909297
c    0.141120
d   -0.756802
e         NaN
dtype: float64
df[df > 1]
b    2.0
c    3.0
d    4.0
dtype: float64

1.2 Series supports dict-like behaviour (label access)

Operation Example
Create a Series from a dict Series(dic)
in operator 'a' in sr
Label indexing sr['a'], sr[['a', 'b', 'd']]
df = pd.Series({'a': 1, 'b': 2})
print(df)
a    1
b    2
dtype: int64
print('a' in df)
True
print(df['a'])
1

1.3 Handling missing data in a Series

Method Description
dropna() drop entries whose value is NaN
fillna() fill in missing values
isnull() return a boolean array with True where values are missing
notnull() return a boolean array with False where values are missing
df = pd.Series([1, 2, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e'])
print(df)
a    1.0
b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64
print(df.dropna())
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64
print(df.fillna(5))
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64
print(df.isnull())
a    False
b    False
c    False
d    False
e     True
dtype: bool
print(df.notnull())
a     True
b     True
c     True
d     True
e    False
dtype: bool

2. The DataFrame data structure

A DataFrame is a tabular data structure containing an ordered collection of columns.

A DataFrame can be thought of as a dict of Series that share a common index.

2.1 Generating an array of timestamps: date_range

Parameters of date_range:

Parameter Description
start start date
end end date
periods number of periods to generate
freq frequency; defaults to 'D' (calendar day); other options include H (hour), W (week), B (business day), SM (semi-month), T/min (minute), S (second), A (year), ...
dates = pd.date_range('20190101', periods=6, freq='M')
print(dates)
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30'],
              dtype='datetime64[ns]', freq='M')
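Other freq values from the table work the same way; a quick sketch:
print(pd.date_range('20190101', periods=3, freq='D'))
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03'], dtype='datetime64[ns]', freq='D')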
np.random.seed(1)
arr = 10 * np.random.randn(6, 4)
print(arr)
[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
 [  8.65407629 -23.01538697  17.44811764  -7.61206901]
 [  3.19039096  -2.49370375  14.62107937 -20.60140709]
 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
 [ -1.72428208  -8.77858418   0.42213747   5.82815214]
 [-11.00619177  11.4472371    9.01590721   5.02494339]]
df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4'])
df
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-06-30 -11.006192 11.447237 9.015907 5.024943

3. DataFrame attributes

Attribute Description
dtypes data type of each column
index the row index
columns the column labels
values the underlying data, without the index or column headers
describe summary statistics (count, mean, extremes, quartiles) for each column; numeric data only
transpose transpose the frame; T does the same thing
sort_index sort by row or column labels
sort_values sort by data values
# Check the data type of each column
print(df.dtypes)
c1    float64
c2    float64
c3    float64
c4    float64
dtype: object
df
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-06-30 -11.006192 11.447237 9.015907 5.024943
print(df.index)
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30'],
              dtype='datetime64[ns]', freq='M')
print(df.columns)
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
print(df.values)
[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
 [  8.65407629 -23.01538697  17.44811764  -7.61206901]
 [  3.19039096  -2.49370375  14.62107937 -20.60140709]
 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
 [ -1.72428208  -8.77858418   0.42213747   5.82815214]
 [-11.00619177  11.4472371    9.01590721   5.02494339]]
df.describe()
c1 c2 c3 c4
count 6.000000 6.000000 6.000000 6.000000
mean 2.022213 -5.466424 7.927203 -6.514830
std 9.580084 11.107772 8.707171 10.227641
min -11.006192 -23.015387 -5.281718 -20.601407
25% -2.849200 -8.113329 2.570580 -10.931606
50% 0.733054 -4.979054 10.176801 -9.170878
75% 7.288155 -2.830414 13.800233 1.865690
max 16.243454 11.447237 17.448118 5.828152
df.T
2019-01-31 00:00:00 2019-02-28 00:00:00 2019-03-31 00:00:00 2019-04-30 00:00:00 2019-05-31 00:00:00 2019-06-30 00:00:00
c1 16.243454 8.654076 3.190391 -3.224172 -1.724282 -11.006192
c2 -6.117564 -23.015387 -2.493704 -3.840544 -8.778584 11.447237
c3 -5.281718 17.448118 14.621079 11.337694 0.422137 9.015907
c4 -10.729686 -7.612069 -20.601407 -10.998913 5.828152 5.024943
# Sort by the row index (the dates); ascending by default
df.sort_index(axis=0)
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-06-30 -11.006192 11.447237 9.015907 5.024943
# Sort by the column labels [c1, c2, c3, c4]; ascending by default
df.sort_index(axis=1)
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-06-30 -11.006192 11.447237 9.015907 5.024943
# Sort rows by the values in column c2, in ascending order
df.sort_values(by='c2')
c1 c2 c3 c4
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
2019-06-30 -11.006192 11.447237 9.015907 5.024943
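Both sort_index and sort_values sort in ascending order by default; pass ascending=False to reverse the order:
df.sort_values(by='c2', ascending=False)
c1 c2 c3 c4
2019-06-30 -11.006192 11.447237 9.015907 5.024943
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-02-28 8.654076 -23.015387 17.448118 -7.612069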

4. Selecting values from a DataFrame

df
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-06-30 -11.006192 11.447237 9.015907 5.024943

4.1 Selecting by column

df['c2']
2019-01-31    -6.117564
2019-02-28   -23.015387
2019-03-31    -2.493704
2019-04-30    -3.840544
2019-05-31    -8.778584
2019-06-30    11.447237
Freq: M, Name: c2, dtype: float64
df[['c2', 'c3']]
c2 c3
2019-01-31 -6.117564 -5.281718
2019-02-28 -23.015387 17.448118
2019-03-31 -2.493704 14.621079
2019-04-30 -3.840544 11.337694
2019-05-31 -8.778584 0.422137
2019-06-30 11.447237 9.015907

4.2 loc (selection by row label)

# Select data by the custom row labels (the index holds month-end dates)
df.loc['2019-01-31':'2019-03-31']
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
df[0:3]
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
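loc also accepts a column selection after the row label(s); a small sketch against the same df:
df.loc['2019-01-31', ['c1', 'c2']]
c1    16.243454
c2    -6.117564
Name: 2019-01-31 00:00:00, dtype: float64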

4.3 iloc (NumPy-style positional selection)

df.values
array([[ 16.24345364,  -6.11756414,  -5.28171752, -10.72968622],
       [  8.65407629, -23.01538697,  17.44811764,  -7.61206901],
       [  3.19039096,  -2.49370375,  14.62107937, -20.60140709],
       [ -3.22417204,  -3.84054355,  11.33769442, -10.99891267],
       [ -1.72428208,  -8.77858418,   0.42213747,   5.82815214],
       [-11.00619177,  11.4472371 ,   9.01590721,   5.02494339]])
# Select data by integer position
print(df.iloc[2, 1])
-2.493703754774101
df.iloc[1:4, 1:4]
c2 c3 c4
2019-02-28 -23.015387 17.448118 -7.612069
2019-03-31 -2.493704 14.621079 -20.601407
2019-04-30 -3.840544 11.337694 -10.998913

4.4 Selection with boolean conditions

df[df['c1'] > 0]
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
df[(df['c1'] > 0) & (df['c2'] > -8)]
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-03-31 3.190391 -2.493704 14.621079 -20.601407

5. Replacing values in a DataFrame

df
c1 c2 c3 c4
2019-01-31 16.243454 -6.117564 -5.281718 -10.729686
2019-02-28 8.654076 -23.015387 17.448118 -7.612069
2019-03-31 3.190391 -2.493704 14.621079 -20.601407
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-06-30 -11.006192 11.447237 9.015907 5.024943
df.iloc[0:3, 0:2] = 0
df
c1 c2 c3 c4
2019-01-31 0.000000 0.000000 -5.281718 -10.729686
2019-02-28 0.000000 0.000000 17.448118 -7.612069
2019-03-31 0.000000 0.000000 14.621079 -20.601407
2019-04-30 -3.224172 -3.840544 11.337694 -10.998913
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-06-30 -11.006192 11.447237 9.015907 5.024943
df['c3'] > 10
2019-01-31    False
2019-02-28     True
2019-03-31     True
2019-04-30     True
2019-05-31    False
2019-06-30    False
Freq: M, Name: c3, dtype: bool
# Replace entire rows where the condition is True
df[df['c3'] > 10] = 100
df
c1 c2 c3 c4
2019-01-31 0.000000 0.000000 -5.281718 -10.729686
2019-02-28 100.000000 100.000000 100.000000 100.000000
2019-03-31 100.000000 100.000000 100.000000 100.000000
2019-04-30 100.000000 100.000000 100.000000 100.000000
2019-05-31 -1.724282 -8.778584 0.422137 5.828152
2019-06-30 -11.006192 11.447237 9.015907 5.024943
# Cast to integers, then replace entire rows where c3 is in [100]
df = df.astype(np.int32)
df[df['c3'].isin([100])] = 1000
df
c1 c2 c3 c4
2019-01-31 0 0 -5 -10
2019-02-28 1000 1000 1000 1000
2019-03-31 1000 1000 1000 1000
2019-04-30 1000 1000 1000 1000
2019-05-31 -1 -8 0 5
2019-06-30 -11 11 9 5
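Besides boolean assignment, DataFrame.replace swaps specific values directly; a minimal sketch on the df above (which now contains 1000s):
df.replace(1000, 0)
c1 c2 c3 c4
2019-01-31 0 0 -5 -10
2019-02-28 0 0 0 0
2019-03-31 0 0 0 0
2019-04-30 0 0 0 0
2019-05-31 -1 -8 0 5
2019-06-30 -11 11 9 5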

6. Reading CSV files

import pandas as pd
from io import StringIO
test_data = '''
5.1,,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,,0.2
7.0,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,
,,,
'''

test_data = StringIO(test_data)
df = pd.read_csv(test_data, header=None)
df.columns = ['c1', 'c2', 'c3', 'c4']
df
c1 c2 c3 c4
0 5.1 NaN 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 NaN 0.2
3 7.0 3.2 4.7 1.4
4 6.4 3.2 4.5 1.5
5 6.9 3.1 4.9 NaN
6 NaN NaN NaN NaN

7. Handling missing data

df.isnull()
c1 c2 c3 c4
0 False True False False
1 False False False False
2 False False True False
3 False False False False
4 False False False False
5 False False False True
6 True True True True
# Chaining sum() after isnull() gives the number of missing values in each column
print(df.isnull().sum())
c1    1
c2    2
c3    2
c4    2
dtype: int64
# axis=0: drop rows that contain NaN
df.dropna(axis=0)
c1 c2 c3 c4
1 4.9 3.0 1.4 0.2
3 7.0 3.2 4.7 1.4
4 6.4 3.2 4.5 1.5
# axis=1: drop columns that contain NaN
df.dropna(axis=1)
0
1
2
3
4
5
6
# how='all': drop only rows (or columns) in which every value is NaN
df.dropna(how='all')
c1 c2 c3 c4
0 5.1 NaN 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 NaN 0.2
3 7.0 3.2 4.7 1.4
4 6.4 3.2 4.5 1.5
5 6.9 3.1 4.9 NaN
# thresh=4: drop rows with fewer than 4 non-NaN values
df.dropna(thresh=4)
c1 c2 c3 c4
1 4.9 3.0 1.4 0.2
3 7.0 3.2 4.7 1.4
4 6.4 3.2 4.5 1.5
# Drop rows where column c2 is NaN
df.dropna(subset=['c2'])
c1 c2 c3 c4
1 4.9 3.0 1.4 0.2
2 4.7 3.2 NaN 0.2
3 7.0 3.2 4.7 1.4
4 6.4 3.2 4.5 1.5
5 6.9 3.1 4.9 NaN
# Fill NaN values
df.fillna(value=10)
c1 c2 c3 c4
0 5.1 10.0 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 10.0 0.2
3 7.0 3.2 4.7 1.4
4 6.4 3.2 4.5 1.5
5 6.9 3.1 4.9 10.0
6 10.0 10.0 10.0 10.0

8. Combining data

df1 = pd.DataFrame(np.zeros((3, 4)))
df1
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
df2 = pd.DataFrame(np.ones((3, 4)))
df2
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
# axis=0: concatenate vertically (stack the rows)
pd.concat((df1, df2), axis=0)
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
# axis=1: concatenate horizontally (place the columns side by side)
pd.concat((df1, df2), axis=1)
0 1 2 3 0 1 2 3
0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# append only stacks rows vertically (it was removed in pandas 2.0; see the concat equivalent below)
df1.append(df2)
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
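Because append has been removed from recent pandas releases, the same stacked result (with a fresh index) can be produced with concat and ignore_index=True:
pd.concat((df1, df2), ignore_index=True)
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0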

9. Importing and exporting data

Use df = pd.read_excel(filename) to read a file and df.to_excel(filename) to save one.

9.1 Reading files to import data

Main parameters of the read functions (a combined sketch follows the examples below):

Parameter Description
sep field separator; a regular expression such as '\s+' is allowed
header=None the file has no header row
names assign column names
index_col use a given column as the index
skiprows skip the given rows
na_values strings to treat as missing values
parse_dates whether to parse certain columns as dates; a boolean or a list
df = pd.read_excel(filename)
df = pd.read_csv(filename)
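A sketch combining several of these parameters in one call; the file name 'data.csv' and the column names are hypothetical:

df = pd.read_csv('data.csv',                      # hypothetical file
                 sep=',',                         # field separator; a regex such as '\s+' also works
                 header=None,                     # the file has no header row
                 names=['c1', 'c2', 'c3', 'c4'],  # assign column names
                 index_col=0,                     # use the first column as the index
                 na_values=['NULL'],              # treat the string 'NULL' as missing
                 parse_dates=True)                # try to parse the index as dates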

9.2 Writing files to export data

Main parameters of the write functions (a sketch follows the example below):

Parameter Description
sep field separator
na_rep string written for missing values; defaults to the empty string
header=False do not write the column names
index=False do not write the row index
columns list of columns to write
df.to_excel(filename)
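A sketch of exporting with several of these parameters (shown with to_csv, since sep only applies to text output); the file name 'out.csv' is hypothetical:

df.to_csv('out.csv',
          sep=',',               # field separator
          na_rep='NA',           # write missing values as 'NA'
          header=True,           # keep the column names
          index=False,           # drop the row index
          columns=['c1', 'c2'])  # export only these columns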

10. Reading JSON files with pandas

strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'

df = pd.read_json(strtext, orient='records')
df
code code1 code2 issue time ttery
0 8,4,5,2,9 297734529 NaN 20130801-3391 1013395466000 min
1 7,8,2,1,2 298058212 NaN 20130801-3390 1013395406000 min
2 5,9,1,2,9 298329129 NaN 20130801-3389 1013395346000 min
3 3,8,7,3,3 298588733 NaN 20130801-3388 1013395286000 min
4 0,8,5,2,7 298818527 NaN 20130801-3387 1013395226000 min
df.to_excel('pandas處理json.xlsx',
            index=False,
            columns=["ttery", "issue", "code", "code1", "code2", "time"])

10.1 The five forms of the orient parameter

orient specifies the expected format of the JSON string. It can take the following five values:

1.'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

This JSON format consists of an index, column labels, and a data matrix. The only allowed key names are index, columns, and data.

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
df = pd.read_json(s, orient='split')
df
a b
1 1 3
2 2 8
3 3 9

2.'records' : list like [{column -> value}, ... , {column -> value}]

This is a list whose members are dicts, as in the JSON sample handled above: the column names are the keys, the cell values are the dict values, and each dict becomes one row of the DataFrame.

strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000}]'

df = pd.read_json(strtext, orient='records')
df
code code1 code2 issue time ttery
0 8,4,5,2,9 297734529 NaN 20130801-3391 1013395466000 min
1 7,8,2,1,2 298058212 NaN 20130801-3390 1013395406000 min

3.'index' : dict like {index -> {column -> value}}

The index labels are the keys, and each value is a dict built from the column names. For example:

s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
df = pd.read_json(s, orient='index')
df
a b
0 1 2
1 9 11

4.'columns' : dict like {column -> {index -> value}}

Here each column name is a key whose value is itself a dict keyed by the index labels. For example:

s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
df = pd.read_json(s, orient='columns')
df
a b
0 1 2
1 9 11

5.'values' : just the values array.

The values form is the most familiar one: a two-level nested list whose members are the row lists.

s = '[["a",1],["b",2]]'
df = pd.read_json(s, orient='values')
df
0 1
0 a 1
1 b 2

11. Reading SQL queries with pandas

import numpy as np
import pandas as pd
import pymysql


def conn(sql):
    # Connect to the MySQL database
    conn = pymysql.connect(
        host="localhost",
        port=3306,
        user="root",
        passwd="123",
        db="db1",
    )
    try:
        data = pd.read_sql(sql, con=conn)
        return data
    except Exception as e:
        print("SQL is not correct!")
    finally:
        conn.close()


sql = "select * from test1 limit 0, 10"  # sql語句
data = conn(sql)
print(data.columns.tolist())  # list the column names
print(data)  # show the data