pandas官方文檔:https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750python
pandas基於Numpy,能夠當作是處理文本或者表格數據。pandas中有兩個主要的數據結構,其中Series數據結構相似於Numpy中的一維數組,DataFrame相似於多維表格數據結構。mysql
pandas是python數據分析的核心模塊。它主要提供了五大功能:正則表達式
- 支持文件存取操做,支持數據庫(sql)、html、json、pickle、csv(txt、excel)、sas、stata、hdf等。
- 支持增刪改查、切片、高階函數、分組聚合等單表操做,以及和dict、list的互相轉換。
- 支持多表拼接合並操做。
- 支持簡單的繪圖操做。
- 支持簡單的統計分析操做。
1、Series數據結構
Series是一種相似於一維數組的對象,由一組數據和一組與之相關的數據標籤(索引)組成。sql
Series比較像列表(數組)和字典的結合體數據庫
import numpy as np
import pandas as pd
df = pd.Series(0, index=['a', 'b', 'c', 'd'])
print(df)
a 0
b 0
c 0
d 0
dtype: int64
print(df.values)
[0 0 0 0]
print(df.index)
Index(['a', 'b', 'c', 'd'], dtype='object')
1.1 Series支持NumPy模塊的特性(下標)
從ndarray建立Series |
Series(arr) |
與標量運算 |
df*2 |
兩個Series運算 |
df1+df2 |
索引 |
df[0], df[[1,2,4]] |
切片 |
df[0:2] |
通用函數 |
np.abs(df) |
布爾值過濾 |
df[df>0] |
arr = np.array([1, 2, 3, 4, np.nan])
print(arr)
[ 1. 2. 3. 4. nan]
df = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
print(df)
a 1.0
b 2.0
c 3.0
d 4.0
e NaN
dtype: float64
print(df**2)
a 1.0
b 4.0
c 9.0
d 16.0
e NaN
dtype: float64
print(df[0])
1.0
print(df['a'])
1.0
print(df[[0, 1, 2]])
a 1.0
b 2.0
c 3.0
dtype: float64
print(df[0:2])
a 1.0
b 2.0
dtype: float64
np.sin(df)
a 0.841471
b 0.909297
c 0.141120
d -0.756802
e NaN
dtype: float64
df[df > 1]
b 2.0
c 3.0
d 4.0
dtype: float64
1.2 Series支持字典的特性(標籤)
從字典建立Series |
Series(dic), |
in運算 |
’a’ in sr |
鍵索引 |
sr['a'], sr[['a', 'b', 'd']] |
df = pd.Series({'a': 1, 'b': 2})
print(df)
a 1
b 2
dtype: int64
print('a' in df)
True
print(df['a'])
1
1.3 Series缺失數據處理
dropna() |
過濾掉值爲NaN的行 |
fillna() |
填充缺失數據 |
isnull() |
返回布爾數組,缺失值對應爲True |
notnull() |
返回布爾數組,缺失值對應爲False |
df = pd.Series([1, 2, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e'])
print(df)
a 1.0
b 2.0
c 3.0
d 4.0
e NaN
dtype: float64
print(df.dropna())
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
print(df.fillna(5))
a 1.0
b 2.0
c 3.0
d 4.0
e 5.0
dtype: float64
print(df.isnull())
a False
b False
c False
d False
e True
dtype: bool
print(df.notnull())
a True
b True
c True
d True
e False
dtype: bool
2、DataFrame數據結構
DataFrame是一個表格型的數據結構,含有一組有序的列。json
DataFrame能夠被看作是由Series組成的字典,而且共用一個索引。數組
2.1 產生時間對象數組:date_range
date_range參數詳解:數據結構
start |
開始時間 |
end |
結束時間 |
periods |
時間長度 |
freq |
時間頻率,默認爲'D',可選H(our),W(eek),B(usiness),S(emi-)M(onth),(min)T(es), S(econd), A(year),… |
dates = pd.date_range('20190101', periods=6, freq='M')
print(dates)
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
'2019-05-31', '2019-06-30'],
dtype='datetime64[ns]', freq='M')
np.random.seed(1)
arr = 10 * np.random.randn(6, 4)
print(arr)
[[ 16.24345364 -6.11756414 -5.28171752 -10.72968622]
[ 8.65407629 -23.01538697 17.44811764 -7.61206901]
[ 3.19039096 -2.49370375 14.62107937 -20.60140709]
[ -3.22417204 -3.84054355 11.33769442 -10.99891267]
[ -1.72428208 -8.77858418 0.42213747 5.82815214]
[-11.00619177 11.4472371 9.01590721 5.02494339]]
df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4'])
df
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
2019-04-30 |
-3.224172 |
-3.840544 |
11.337694 |
-10.998913 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
3、DataFrame屬性
dtype是 |
查看數據類型 |
index |
查看行序列或者索引 |
columns |
查看各列的標籤 |
values |
查看數據框內的數據,也即不含表頭索引的數據 |
describe |
查看數據每一列的極值,均值,中位數,只可用於數值型數據 |
transpose |
轉置,也可用T來操做 |
sort_index |
排序,可按行或列index排序輸出 |
sort_values |
按數據值來排序 |
# 查看數據類型
print(df2.dtypes)
0 float64
1 float64
2 float64
3 float64
dtype: object
df
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
2019-04-30 |
-3.224172 |
-3.840544 |
11.337694 |
-10.998913 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
print(df.index)
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
'2019-05-31', '2019-06-30'],
dtype='datetime64[ns]', freq='M')
print(df.columns)
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
print(df.values)
[[ 16.24345364 -6.11756414 -5.28171752 -10.72968622]
[ 8.65407629 -23.01538697 17.44811764 -7.61206901]
[ 3.19039096 -2.49370375 14.62107937 -20.60140709]
[ -3.22417204 -3.84054355 11.33769442 -10.99891267]
[ -1.72428208 -8.77858418 0.42213747 5.82815214]
[-11.00619177 11.4472371 9.01590721 5.02494339]]
df.describe()
|
c1 |
c2 |
c3 |
c4 |
count |
6.000000 |
6.000000 |
6.000000 |
6.000000 |
mean |
2.022213 |
-5.466424 |
7.927203 |
-6.514830 |
std |
9.580084 |
11.107772 |
8.707171 |
10.227641 |
min |
-11.006192 |
-23.015387 |
-5.281718 |
-20.601407 |
25% |
-2.849200 |
-8.113329 |
2.570580 |
-10.931606 |
50% |
0.733054 |
-4.979054 |
10.176801 |
-9.170878 |
75% |
7.288155 |
-2.830414 |
13.800233 |
1.865690 |
max |
16.243454 |
11.447237 |
17.448118 |
5.828152 |
df.T
|
2019-01-31 00:00:00 |
2019-02-28 00:00:00 |
2019-03-31 00:00:00 |
2019-04-30 00:00:00 |
2019-05-31 00:00:00 |
2019-06-30 00:00:00 |
c1 |
16.243454 |
8.654076 |
3.190391 |
-3.224172 |
-1.724282 |
-11.006192 |
c2 |
-6.117564 |
-23.015387 |
-2.493704 |
-3.840544 |
-8.778584 |
11.447237 |
c3 |
-5.281718 |
17.448118 |
14.621079 |
11.337694 |
0.422137 |
9.015907 |
c4 |
-10.729686 |
-7.612069 |
-20.601407 |
-10.998913 |
5.828152 |
5.024943 |
# 按行標籤[c1, c2, c3, c4]從大到小排序
df.sort_index(axis=0)
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
2019-04-30 |
-3.224172 |
-3.840544 |
11.337694 |
-10.998913 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
# 按列標籤[2019-01-01, 2019-01-02...]從大到小排序
df.sort_index(axis=1)
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
2019-04-30 |
-3.224172 |
-3.840544 |
11.337694 |
-10.998913 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
# 按c2列的值從大到小排序
df.sort_values(by='c2')
|
c1 |
c2 |
c3 |
c4 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-04-30 |
-3.224172 |
-3.840544 |
11.337694 |
-10.998913 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
4、DataFrame取值
df
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
2019-04-30 |
-3.224172 |
-3.840544 |
11.337694 |
-10.998913 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
4.1 經過columns取值
df['c2']
2019-01-31 -6.117564
2019-02-28 -23.015387
2019-03-31 -2.493704
2019-04-30 -3.840544
2019-05-31 -8.778584
2019-06-30 11.447237
Freq: M, Name: c2, dtype: float64
df[['c2', 'c3']]
|
c2 |
c3 |
2019-01-31 |
-6.117564 |
-5.281718 |
2019-02-28 |
-23.015387 |
17.448118 |
2019-03-31 |
-2.493704 |
14.621079 |
2019-04-30 |
-3.840544 |
11.337694 |
2019-05-31 |
-8.778584 |
0.422137 |
2019-06-30 |
11.447237 |
9.015907 |
4.2 loc(經過行標籤取值)
# 經過自定義的行標籤選擇數據
df.loc['2019-01-01':'2019-01-03']
df[0:3]
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
4.3 iloc(相似於numpy數組取值)
df.values
array([[ 16.24345364, -6.11756414, -5.28171752, -10.72968622],
[ 8.65407629, -23.01538697, 17.44811764, -7.61206901],
[ 3.19039096, -2.49370375, 14.62107937, -20.60140709],
[ -3.22417204, -3.84054355, 11.33769442, -10.99891267],
[ -1.72428208, -8.77858418, 0.42213747, 5.82815214],
[-11.00619177, 11.4472371 , 9.01590721, 5.02494339]])
# 經過行索引選擇數據
print(df.iloc[2, 1])
-2.493703754774101
df.iloc[1:4, 1:4]
|
c2 |
c3 |
c4 |
2019-02-28 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
-2.493704 |
14.621079 |
-20.601407 |
2019-04-30 |
-3.840544 |
11.337694 |
-10.998913 |
4.4 使用邏輯判斷取值
df[df['c1'] > 0]
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
df[(df['c1'] > 0) & (df['c2'] > -8)]
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
5、DataFrame值替換
df
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
16.243454 |
-6.117564 |
-5.281718 |
-10.729686 |
2019-02-28 |
8.654076 |
-23.015387 |
17.448118 |
-7.612069 |
2019-03-31 |
3.190391 |
-2.493704 |
14.621079 |
-20.601407 |
2019-04-30 |
-3.224172 |
-3.840544 |
11.337694 |
-10.998913 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
df.iloc[0:3, 0:2] = 0
df
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
0.000000 |
0.000000 |
-5.281718 |
-10.729686 |
2019-02-28 |
0.000000 |
0.000000 |
17.448118 |
-7.612069 |
2019-03-31 |
0.000000 |
0.000000 |
14.621079 |
-20.601407 |
2019-04-30 |
-3.224172 |
-3.840544 |
11.337694 |
-10.998913 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
df['c3'] > 10
2019-01-31 False
2019-02-28 True
2019-03-31 True
2019-04-30 True
2019-05-31 False
2019-06-30 False
Freq: M, Name: c3, dtype: bool
# 針對行作處理
df[df['c3'] > 10] = 100
df
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
0.000000 |
0.000000 |
-5.281718 |
-10.729686 |
2019-02-28 |
100.000000 |
100.000000 |
100.000000 |
100.000000 |
2019-03-31 |
100.000000 |
100.000000 |
100.000000 |
100.000000 |
2019-04-30 |
100.000000 |
100.000000 |
100.000000 |
100.000000 |
2019-05-31 |
-1.724282 |
-8.778584 |
0.422137 |
5.828152 |
2019-06-30 |
-11.006192 |
11.447237 |
9.015907 |
5.024943 |
# 針對行作處理
df = df.astype(np.int32)
df[df['c3'].isin([100])] = 1000
df
|
c1 |
c2 |
c3 |
c4 |
2019-01-31 |
0 |
0 |
-5 |
-10 |
2019-02-28 |
1000 |
1000 |
1000 |
1000 |
2019-03-31 |
1000 |
1000 |
1000 |
1000 |
2019-04-30 |
1000 |
1000 |
1000 |
1000 |
2019-05-31 |
-1 |
-8 |
0 |
5 |
2019-06-30 |
-11 |
11 |
9 |
5 |
6、讀取CSV文件
import pandas as pd
from io import StringIO
test_data = '''
5.1,,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,,0.2
7.0,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,
,,,
'''
test_data = StringIO(test_data)
df = pd.read_csv(test_data, header=None)
df.columns = ['c1', 'c2', 'c3', 'c4']
df
|
c1 |
c2 |
c3 |
c4 |
0 |
5.1 |
NaN |
1.4 |
0.2 |
1 |
4.9 |
3.0 |
1.4 |
0.2 |
2 |
4.7 |
3.2 |
NaN |
0.2 |
3 |
7.0 |
3.2 |
4.7 |
1.4 |
4 |
6.4 |
3.2 |
4.5 |
1.5 |
5 |
6.9 |
3.1 |
4.9 |
NaN |
6 |
NaN |
NaN |
NaN |
NaN |
7、處理丟失數據
df.isnull()
|
c1 |
c2 |
c3 |
c4 |
0 |
False |
True |
False |
False |
1 |
False |
False |
False |
False |
2 |
False |
False |
True |
False |
3 |
False |
False |
False |
False |
4 |
False |
False |
False |
False |
5 |
False |
False |
False |
True |
6 |
True |
True |
True |
True |
# 經過在isnull()方法後使用sum()方法便可得到該數據集某個特徵含有多少個缺失值
print(df.isnull().sum())
c1 1
c2 2
c3 2
c4 2
dtype: int64
# axis=0刪除有NaN值的行
df.dropna(axis=0)
|
c1 |
c2 |
c3 |
c4 |
1 |
4.9 |
3.0 |
1.4 |
0.2 |
3 |
7.0 |
3.2 |
4.7 |
1.4 |
4 |
6.4 |
3.2 |
4.5 |
1.5 |
# axis=1刪除有NaN值的列
df.dropna(axis=1)
# 刪除全爲NaN值得行或列
df.dropna(how='all')
|
c1 |
c2 |
c3 |
c4 |
0 |
5.1 |
NaN |
1.4 |
0.2 |
1 |
4.9 |
3.0 |
1.4 |
0.2 |
2 |
4.7 |
3.2 |
NaN |
0.2 |
3 |
7.0 |
3.2 |
4.7 |
1.4 |
4 |
6.4 |
3.2 |
4.5 |
1.5 |
5 |
6.9 |
3.1 |
4.9 |
NaN |
# 刪除行不爲4個值的
df.dropna(thresh=4)
|
c1 |
c2 |
c3 |
c4 |
1 |
4.9 |
3.0 |
1.4 |
0.2 |
3 |
7.0 |
3.2 |
4.7 |
1.4 |
4 |
6.4 |
3.2 |
4.5 |
1.5 |
# 刪除c2中有NaN值的行
df.dropna(subset=['c2'])
|
c1 |
c2 |
c3 |
c4 |
1 |
4.9 |
3.0 |
1.4 |
0.2 |
2 |
4.7 |
3.2 |
NaN |
0.2 |
3 |
7.0 |
3.2 |
4.7 |
1.4 |
4 |
6.4 |
3.2 |
4.5 |
1.5 |
5 |
6.9 |
3.1 |
4.9 |
NaN |
# 填充nan值
df.fillna(value=10)
|
c1 |
c2 |
c3 |
c4 |
0 |
5.1 |
10.0 |
1.4 |
0.2 |
1 |
4.9 |
3.0 |
1.4 |
0.2 |
2 |
4.7 |
3.2 |
10.0 |
0.2 |
3 |
7.0 |
3.2 |
4.7 |
1.4 |
4 |
6.4 |
3.2 |
4.5 |
1.5 |
5 |
6.9 |
3.1 |
4.9 |
10.0 |
6 |
10.0 |
10.0 |
10.0 |
10.0 |
8、合併數據
df1 = pd.DataFrame(np.zeros((3, 4)))
df1
|
0 |
1 |
2 |
3 |
0 |
0.0 |
0.0 |
0.0 |
0.0 |
1 |
0.0 |
0.0 |
0.0 |
0.0 |
2 |
0.0 |
0.0 |
0.0 |
0.0 |
df2 = pd.DataFrame(np.ones((3, 4)))
df2
|
0 |
1 |
2 |
3 |
0 |
1.0 |
1.0 |
1.0 |
1.0 |
1 |
1.0 |
1.0 |
1.0 |
1.0 |
2 |
1.0 |
1.0 |
1.0 |
1.0 |
# axis=0合併列
pd.concat((df1, df2), axis=0)
|
0 |
1 |
2 |
3 |
0 |
0.0 |
0.0 |
0.0 |
0.0 |
1 |
0.0 |
0.0 |
0.0 |
0.0 |
2 |
0.0 |
0.0 |
0.0 |
0.0 |
0 |
1.0 |
1.0 |
1.0 |
1.0 |
1 |
1.0 |
1.0 |
1.0 |
1.0 |
2 |
1.0 |
1.0 |
1.0 |
1.0 |
# axis=1合併行
pd.concat((df1, df2), axis=1)
|
0 |
1 |
2 |
3 |
0 |
1 |
2 |
3 |
0 |
0.0 |
0.0 |
0.0 |
0.0 |
1.0 |
1.0 |
1.0 |
1.0 |
1 |
0.0 |
0.0 |
0.0 |
0.0 |
1.0 |
1.0 |
1.0 |
1.0 |
2 |
0.0 |
0.0 |
0.0 |
0.0 |
1.0 |
1.0 |
1.0 |
1.0 |
# append只能合併列
df1.append(df2)
|
0 |
1 |
2 |
3 |
0 |
0.0 |
0.0 |
0.0 |
0.0 |
1 |
0.0 |
0.0 |
0.0 |
0.0 |
2 |
0.0 |
0.0 |
0.0 |
0.0 |
0 |
1.0 |
1.0 |
1.0 |
1.0 |
1 |
1.0 |
1.0 |
1.0 |
1.0 |
2 |
1.0 |
1.0 |
1.0 |
1.0 |
9、導入導出數據
使用df = pd.read_excel(filename)讀取文件,使用df.to_excel(filename)保存文件。app
9.1 讀取文件導入數據
讀取文件導入數據函數主要參數:
sep |
指定分隔符,可用正則表達式如'\s+' |
header=None |
指定文件無行名 |
name |
指定列名 |
index_col |
指定某列做爲索引 |
skip_row |
指定跳過某些行 |
na_values |
指定某些字符串表示缺失值 |
parse_dates |
指定某些列是否被解析爲日期,布爾值或列表 |
df = pd.read_excel(filename)
df = pd.read_csv(filename)
9.2 寫入文件導出數據
寫入文件函數的主要參數:
sep |
分隔符 |
na_rep |
指定缺失值轉換的字符串,默認爲空字符串 |
header=False |
不保存列名 |
index=False |
不保存行索引 |
cols |
指定輸出的列,傳入列表 |
df.to_excel(filename)
10、pandas讀取json文件
strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'
df = pd.read_json(strtext, orient='records')
df
|
code |
code1 |
code2 |
issue |
time |
ttery |
0 |
8,4,5,2,9 |
297734529 |
NaN |
20130801-3391 |
1013395466000 |
min |
1 |
7,8,2,1,2 |
298058212 |
NaN |
20130801-3390 |
1013395406000 |
min |
2 |
5,9,1,2,9 |
298329129 |
NaN |
20130801-3389 |
1013395346000 |
min |
3 |
3,8,7,3,3 |
298588733 |
NaN |
20130801-3388 |
1013395286000 |
min |
4 |
0,8,5,2,7 |
298818527 |
NaN |
20130801-3387 |
1013395226000 |
min |
df.to_excel('pandas處理json.xlsx',
index=False,
columns=["ttery", "issue", "code", "code1", "code2", "time"])
10.1 orient參數的五種形式
orient是代表預期的json字符串格式。orient的設置有如下五個值:
1.'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
這種就是有索引,有列字段,和數據矩陣構成的json格式。key名稱只能是index,columns和data。
s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
df = pd.read_json(s, orient='split')
df
2.'records' : list like [{column -> value}, ... , {column -> value}]
這種就是成員爲字典的列表。如我今天要處理的json數據示例所見。構成是列字段爲鍵,值爲鍵值,每個字典成員就構成了dataframe的一行數據。
strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000}]'
df = pd.read_json(strtext, orient='records')
df
|
code |
code1 |
code2 |
issue |
time |
ttery |
0 |
8,4,5,2,9 |
297734529 |
NaN |
20130801-3391 |
1013395466000 |
min |
1 |
7,8,2,1,2 |
298058212 |
NaN |
20130801-3390 |
1013395406000 |
min |
3.'index' : dict like {index -> {column -> value}}
以索引爲key,以列字段構成的字典爲鍵值。如:
s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
df = pd.read_json(s, orient='index')
df
4.'columns' : dict like {column -> {index -> value}}
這種處理的就是以列爲鍵,對應一個值字典的對象。這個字典對象以索引爲鍵,以值爲鍵值構成的json字符串。以下圖所示:
s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
df = pd.read_json(s, orient='columns')
df
5.'values' : just the values array。
values這種咱們就很常見了。就是一個嵌套的列表。裏面的成員也是列表,2層的。
s = '[["a",1],["b",2]]'
df = pd.read_json(s, orient='values')
df
11、pandas讀取sql語句
import numpy as np
import pandas as pd
import pymysql
def conn(sql):
# 鏈接到mysql數據庫
conn = pymysql.connect(
host="localhost",
port=3306,
user="root",
passwd="123",
db="db1",
)
try:
data = pd.read_sql(sql, con=conn)
return data
except Exception as e:
print("SQL is not correct!")
finally:
conn.close()
sql = "select * from test1 limit 0, 10" # sql語句
data = conn(sql)
print(data.columns.tolist()) # 查看字段
print(data) # 查看數據