Pandas Study Notes

Date: 2019-12-13

Tags: pandas, study notes

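1. Data Structures

Pandas has three main data structures:

Series (one-dimensional, size immutable)
DataFrame (two-dimensional, size mutable)
Panel (three-dimensional, size mutable; deprecated since pandas 0.20 and absent from current releases)

Series

A one-dimensional array-like structure holding homogeneous data, for example the collection 1, 3, 5, 7, ...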

...

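Key points

Homogeneous data
Size immutable
Values mutable

DataFrame

A two-dimensional structure holding heterogeneous data, for example:

Name	Age	Gender
Xiao Ming	20	Male
Xiao Hong	15	Female
Xiao Gang	18	Male

Key points

Heterogeneous data
Size mutable
Data mutable

Panel

A three-dimensional structure holding heterogeneous data; it can be thought of as a container of DataFrames.

Key points

Heterogeneous data
Size mutable
Data mutable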

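2. Series

A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, and so on).

Constructor

pandas.Series(data, index, dtype, copy)

Parameter	Description
data	The data, in one of several forms such as an ndarray, a list, or a constant.
index	Index labels; they must be hashable and the same length as the data (duplicates are allowed). Defaults to np.arange(n) when no index is passed.
dtype	The data type. If omitted, it is inferred from the data.
copy	Whether to copy the input data. Default: False.

Creating an empty Series

import pandas as pd
# Pass dtype explicitly; the default dtype of an empty Series differs between pandas versions.
s = pd.Series(dtype='float64')
print(s)

Output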

Series([], dtype: float64)

If the data is an ndarray, any index passed must have the same length as the data. If no index is passed, the default index is 0 to n-1.

import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)   # default index 0..3
print(s)

Output

0    a
1    b
2    c
3    d
dtype: object

import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])   # custom index labels
print(s)

Output

100    a
101    b
102    c
103    d
dtype: object

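Creating a Series from a dict: if no index is specified, the dictionary keys are used as the index. If an index is passed, values are looked up by those labels, and labels missing from the dict get NaN.

import pandas as pd
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data)   # keys become the index
print(s)

Output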

a    0.0
b    1.0
c    2.0
dtype: float64

import pandas as pd
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])   # 'd' is not a key, so it becomes NaN
print(s)

Output

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

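Creating a Series from a scalar: when the data is a scalar value, an index should be provided, and the value is repeated to match the length of the index.

import pandas as pd
s = pd.Series(5, index=[0, 1, 2, 3])   # the scalar 5 is repeated for each label
print(s)

Output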

0    5
1    5
2    5
3    5
dtype: int64

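Accessing data by position: data in a Series can be accessed much like data in an ndarray.

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s)
# .iloc makes the positional lookup explicit; a bare s[0] on a labeled Series is deprecated in recent pandas.
print(s.iloc[0])

Output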

a    1
b    2
c    3
d    4
e    5
dtype: int64
1

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s[:3])   # first three elements

Output

a    1
b    2
c    3
dtype: int64

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s[-3:])   # last three elements

Output

c    3
d    4
e    5
dtype: int64

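Retrieving data by label: values can be read and set through index labels.

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s['a'])   # a single label returns a scalar

Output

1

import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s[['a', 'c', 'd']])   # a list of labels returns a Series

Output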

a    1
c    3
d    4
dtype: int64

If a label is not present in the index, an exception (KeyError) is raised.
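
As a small sketch of the missing-label case (the label 'f' and the fallback value -1 are made up for illustration): a lookup with [] raises KeyError, while Series.get returns a default instead. The explicit .loc and .iloc accessors are shown alongside for comparison.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Indexing with a label that is not in the index raises KeyError.
try:
    s['f']
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)

# Series.get returns the given default instead of raising.
fallback = s.get('f', default=-1)
print(fallback)

# .loc is label-based, .iloc is position-based.
first_by_label = s.loc['a']
first_by_position = s.iloc[0]
print(first_by_label, first_by_position)
```

Using .loc and .iloc keeps label lookups and positional lookups clearly separated, which matters once the index itself contains integers.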

3. DataFrame

pandas.DataFrame(data, index, columns, dtype, copy)

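Constructor parameters:

Parameter	Description
data	The data, in one of several forms such as an ndarray, Series, map, lists, dict, constants, or another DataFrame.
index	Row labels.
columns	Column labels.
dtype	Data type of each column.
copy	Whether to copy the input data. Default: False (newer releases default to None and decide based on the input type).

Creating an empty DataFrame

import pandas as pd
df = pd.DataFrame()
print(df)

Output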

Empty DataFrame
Columns: []
Index: []

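Creating a DataFrame from a list

import pandas as pd
data = [1, 2, 3, 4, 5]
df = pd.DataFrame(data)   # a flat list becomes a single column named 0
print(df)

Output

   0
0  1
1  2
2  3
3  4
4  5

import pandas as pd
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])   # each inner list is one row
print(df)

Output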

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13

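import pandas as pd
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
# Cast only the numeric column: passing dtype=float to the constructor may fail on
# recent pandas versions because the 'Name' column holds strings.
df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})
print(df)

Output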

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0

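Creating a DataFrame from a dict of ndarrays or lists: all the arrays must have the same length. If an index is passed, its length must equal the length of the arrays; otherwise the default index is used.

import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data)   # dict keys become column names
print(df)

Output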

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

Creating a DataFrame with a custom index from arrays

import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

Output

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42

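Creating a DataFrame from a list of dicts: a list of dictionaries can be passed as input data, and the dictionary keys become the column names by default. Keys missing from a row are filled with NaN.

import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Output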

   a   b     c
0  1   2   NaN
1  5  10  20.0

Creating a DataFrame from a list of dicts with a row index and column labels

import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
# 'b1' is not a key in any dict, so that column is filled with NaN.
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Output

        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN

A dict of Series can be passed to build a DataFrame; the resulting index is the union of the indexes of all the Series.

import pandas as pd
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

Output

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

Column selection

import pandas as pd
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])   # selecting one column returns a Series

Output

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

Column addition

import pandas as pd
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Adding a new column by passing as Series:")
# The new Series is aligned on the index; row 'd' has no value and becomes NaN.
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(df)
print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one'] + df['three']
print(df)

Output

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN

Column deletion

import pandas as pd
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
print("Deleting the first column using DEL function:")
del df['one']    # removes the column in place
print(df)
print("Deleting another column using POP function:")
df.pop('two')    # removes the column in place and returns it as a Series
print(df)

Output

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN

Row selection, addition and deletion

import pandas as pd
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
print('---------')
print(df.loc['a'])     # select a row by label
print('---------')
print(df.iloc[2])      # select a row by integer position
print('---------')
print(df[2:4])         # slice rows by position
print('---------')
df2 = pd.DataFrame([[5, 6], [7, 8]], index=['e', 'f'], columns=['one', 'two'])
# DataFrame.append was removed in pandas 2.0; pd.concat adds the rows instead.
df = pd.concat([df, df2])
print(df)
df = df.drop('a')      # delete the row labeled 'a'
print('---------')
print(df)

Output

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
---------
one    1.0
two    1.0
Name: a, dtype: float64
---------
one    3.0
two    3.0
Name: c, dtype: float64
---------
   one  two
c  3.0    3
d  NaN    4
---------
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    6
f  7.0    8
---------
   one  two
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    6
f  7.0    8
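
Rows are added by concatenating frames; DataFrame.append was deprecated in pandas 1.4 and no longer exists in pandas 2.0 and later, so pd.concat is the portable way to do this. A minimal sketch, using small stand-in frames rather than the exact ones above, that also shows ignore_index and dropping several labels at once:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame([[5, 6], [7, 8]], index=['e', 'f'], columns=['one', 'two'])

# Row-wise concatenation keeps the original labels of both frames.
combined = pd.concat([df, df2])
print(combined)

# ignore_index=True discards the labels and renumbers the rows 0..n-1.
renumbered = pd.concat([df, df2], ignore_index=True)
print(renumbered)

# drop accepts a list of labels and returns a new frame.
trimmed = combined.drop(['a', 'e'])
print(trimmed)
```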

4. Panel

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

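Parameter	Description
data	The data, in one of several forms such as an ndarray, Series, map, lists, dict, constants, or another DataFrame.
items	axis=0
major_axis	axis=1
minor_axis	axis=2
dtype	Data type of each column.
copy	Whether to copy the input data.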

Creating a Panel and selecting data

# Requires an old pandas release that still ships Panel; it is absent from current versions.
import pandas as pd
import numpy as np
print('-------create an empty panel---------')
p = pd.Panel()
print(p)
print('-------------end---------------------')
print('---create a panel from 3D ndarray----')
data = np.random.rand(2, 4, 5)
p = pd.Panel(data)
print(p)
print('-------------end---------------------')
print('-create a panel from dict(DataFrame)-')
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)
print('-------------end---------------------')
print('-------select data from panel--------')
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])      # select one item (a DataFrame)
print('-------------end---------------------')
print('-----select data use major_axis------')
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.major_xs(1))   # cross-section along the major axis
print('-------------end---------------------')
print('-----select data use minor_axis------')
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p.minor_xs(1))   # cross-section along the minor axis
print('-------------end---------------------')

Output

-------create an empty panel---------
<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None
-------------end---------------------
---create a panel from 3D ndarray----
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
-------------end---------------------
-create a panel from dict(DataFrame)-
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
-------------end---------------------
-------select data from panel--------
          0         1         2
0 -0.960065 -1.114559 -0.296025
1 -0.382277 -0.585262  1.503437
2  1.315953 -0.350967 -0.711729
3  0.959712  0.800819 -0.673261
-------------end---------------------
-----select data use major_axis------
      Item1     Item2
0 -1.742578 -0.697723
1 -0.156266  0.003577
2  0.023405       NaN
-------------end---------------------
-----select data use minor_axis------
      Item1     Item2
0  1.103015  0.488929
1 -0.391214 -0.030208
2  1.783799  0.039654
3 -1.863803 -0.949056
-------------end---------------------
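
Because Panel is not available in current pandas releases, the Panel code only runs on old versions. One common substitute, sketched here under the assumption that a DataFrame with a MultiIndex is an acceptable replacement (the pandas deprecation notes also point to the xarray package, not shown), is to stack the per-item frames with pd.concat. Deterministic data replaces the random values so the result is easy to check.

```python
import numpy as np
import pandas as pd

frames = {
    'Item1': pd.DataFrame(np.arange(12).reshape(4, 3)),
    'Item2': pd.DataFrame(np.arange(8).reshape(4, 2)),
}

# Passing a dict to pd.concat uses its keys as the outer index level,
# which plays the role of the Panel's items axis.
stacked = pd.concat(frames)
print(stacked)

# Roughly p['Item1']: all rows belonging to one item.
item1 = stacked.loc['Item1']
print(item1)

# Roughly p.major_xs(1), transposed: row label 1 taken from every item.
row1 = stacked.xs(1, level=1)
print(row1)
```

Item2 has only two columns, so its third column is NaN after stacking, just as the Panel output shows NaN for the missing cell.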

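5. Basic Functionality

Series basic functionality

Attribute or method	Description
axes	Returns a list of the row axis labels.
dtype	Returns the data type of the object.
empty	Checks whether the object is empty; returns a boolean.
ndim	Returns the number of dimensions of the underlying data, which is 1 by definition.
size	Returns the number of elements in the underlying data.
values	Returns the Series as an ndarray.
head(n)	Returns the first n rows.
tail(n)	Returns the last n rows.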

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))   # four random values
print(s)
print('-------------')
print("The axes are:")
print(s.axes)
print('-------------')
print("Is the Object empty?")
print(s.empty)
print('-------------')
print("The dimensions of the object:")
print(s.ndim)
print('-------------')
print("The size of the object:")
print(s.size)
print('-------------')
print("The actual data series is:")
print(s.values)
print('-------------')
print("The first two rows of the data series:")
print(s.head(2))
print('-------------')
print("The last two rows of the data series:")
print(s.tail(2))

Output

0   -1.478084
1    0.468882
2    0.394107
3    0.682990
dtype: float64
-------------
The axes are:
[RangeIndex(start=0, stop=4, step=1)]
-------------
Is the Object empty?
False
-------------
The dimensions of the object:
1
-------------
The size of the object:
4
-------------
The actual data series is:
[-1.47808355  0.46888222  0.3941075   0.68299036]
-------------
The first two rows of the data series:
0   -1.478084
1    0.468882
dtype: float64
-------------
The last two rows of the data series:
2    0.394107
3    0.682990
dtype: float64

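DataFrame basic functionality

Attribute or method	Description
T	Transposes rows and columns.
axes	Returns a list whose only members are the row axis labels and the column axis labels.
dtypes	Returns the data types in this object.
empty	Checks whether the object is empty; returns a boolean.
ndim	Number of axes (array dimensions).
shape	Returns a tuple representing the dimensions of the DataFrame.
size	Number of elements.
values	Returns the data as an ndarray.
head()	Returns the first n rows.
tail()	Returns the last n rows.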

print('---------create a DataFrame---------')
import pandas as pd
import numpy as np
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}
df = pd.DataFrame(d)
print("Our data series is:")
print(df)
print('----------------end-----------------')
print('--the transpose of the data series--')
print(df.T)
print('----------------end-----------------')
print('-----row and column axis labels-----')
print(df.axes)
print('----------------end-----------------')
print('---the data types of each column----')
print(df.dtypes)
print('----------------end-----------------')
print('---------is the object empty--------')
print(df.empty)
print('----------------end-----------------')
print('-----------the dimension------------')
print(df.ndim)
print('----------------end-----------------')
print('--------------the shape-------------')
print(df.shape)
print('----------------end-----------------')
print('--------------the shape-------------')
print(df.shape)
print('----------------end-----------------')
print('------total number of elements------')
print(df.size)
print('----------------end-----------------')
print('-------------actual data------------')
print(df.values)
print('----------------end-----------------')
print('-------first two rows of data-------')
print(df.head(2))
print('----------------end-----------------')
print('--------last two rows of data-------')
print(df.tail(2))
print('----------------end-----------------')

Output

---------create a DataFrame---------
Our data series is:
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
2  Ricky   25    3.98
3    Vin   23    2.56
4  Steve   30    3.20
5  Minsu   29    4.60
6   Jack   23    3.80
----------------end-----------------
--the transpose of the data series--
           0      1      2     3      4      5     6
Name     Tom  James  Ricky   Vin  Steve  Minsu  Jack
Age       25     26     25    23     30     29    23
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8
----------------end-----------------
-----row and column axis labels-----
[RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]
----------------end-----------------
---the data types of each column----
Name       object
Age         int64
Rating    float64
dtype: object
----------------end-----------------
---------is the object empty--------
False
----------------end-----------------
-----------the dimension------------
2
----------------end-----------------
--------------the shape-------------
(7, 3)
----------------end-----------------
--------------the shape-------------
(7, 3)
----------------end-----------------
------total number of elements------
21
----------------end-----------------
-------------actual data------------
[['Tom' 25 4.23]
 ['James' 26 3.24]
 ['Ricky' 25 3.98]
 ['Vin' 23 2.56]
 ['Steve' 30 3.2]
 ['Minsu' 29 4.6]
 ['Jack' 23 3.8]]
----------------end-----------------
-------first two rows of data-------
    Name  Age  Rating
0    Tom   25    4.23
1  James   26    3.24
----------------end-----------------
--------last two rows of data-------
    Name  Age  Rating
5  Minsu   29     4.6
6   Jack   23     3.8
----------------end-----------------

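6. Descriptive Statistics

Function	Description
sum()	Returns the sum of the values over the requested axis; default axis=0.
mean()	Returns the mean.
std()	Returns the standard deviation.
median()	Median of the values.
mode()	Mode (most frequent value) of the values.
min()	Minimum value.
max()	Maximum value.
abs()	Absolute value.
prod()	Product of the values.
cumsum()	Cumulative sum.
cumprod()	Cumulative product.
describe()	Computes a summary of statistics. The include argument selects the columns: 'object' summarizes string columns, 'number' summarizes numeric columns, and 'all' summarizes every column.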
print('--------create a DataFrame--------')
import pandas as pd
import numpy as np
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
                        'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}
df = pd.DataFrame(d)
print(df)
print('---------------end----------------')
print('---------------sum----------------')
print(df.sum())    # column-wise; the string column is concatenated
print('---------------end----------------')
# numeric_only=True skips the 'Name' column; recent pandas raises TypeError without it.
print(df.sum(axis=1, numeric_only=True))
print('---------------end----------------')
print('--------------mean----------------')
print(df.mean(numeric_only=True))
print('---------------end----------------')
print('--------------std----------------')
print(df.std(numeric_only=True))
print('---------------end----------------')
print('------------describe--------------')
print(df.describe())
print('---------------end----------------')

Output

--------create a DataFrame--------
      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Minsu   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65
---------------end----------------
---------------sum----------------
Name      TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object
---------------end----------------
0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64
---------------end----------------
--------------mean----------------
Age       31.833333
Rating     3.743333
dtype: float64
---------------end----------------
--------------std----------------
Age       9.232682
Rating    0.661628
dtype: float64
---------------end----------------
------------describe--------------
             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
---------------end----------------
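
The example above exercises only sum, mean, std and describe. As a brief sketch of a few other entries from the table (the four-row frame is a made-up subset chosen so the results are easy to verify by hand):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'James', 'Ricky', 'Vin'],
    'Age': [25, 26, 25, 23],
    'Rating': [4.23, 3.24, 3.98, 2.56],
})

# Restrict to numeric columns before computing numeric statistics.
numeric = df[['Age', 'Rating']]
medians = numeric.median()
print(medians)

# mode() returns a Series because several values can tie for most frequent.
age_mode = df['Age'].mode()
print(age_mode)

# Running total of the Age column.
age_cumsum = df['Age'].cumsum()
print(age_cumsum)

# include='all' summarizes string and numeric columns together.
summary = df.describe(include='all')
print(summary)
```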

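7. Function Application

Table-wise function application: pipe()
Row-wise or column-wise function application: apply()
Element-wise function application: applymap() (pandas 2.1 renames it to DataFrame.map() and deprecates the old name)

pipe() performs a custom operation by passing the DataFrame, together with the appropriate number of extra arguments, to a function.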

import pandas as pd
import numpy as np
def adder(ele1, ele2):
    return ele1 + ele2
df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
print(df)
print('---------------end----------------')
print(df.pipe(adder, 2))                        # table-wise: adder(df, 2)
print('---------------end----------------')
print(df.apply(np.mean))                        # column-wise mean
print('---------------end----------------')
print(df.apply(np.mean, axis=1))                # row-wise mean
print('---------------end----------------')
print(df.apply(lambda x: x.max() - x.min()))    # range of each column
print('---------------end----------------')
print(df['col1'].map(lambda x: x * 100))        # element-wise on one column
print('---------------end----------------')
print(df.applymap(lambda x: x * 100))           # element-wise on the whole frame
print('---------------end----------------')

Output

       col1      col2      col3
0  1.689749  0.959856  1.074871
1 -0.392017  0.001075  0.806392
2 -0.484529  0.635483  0.644830
3 -0.049649  0.113976 -0.220698
4  1.413197 -0.576231 -0.075871
---------------end----------------
       col1      col2      col3
0  3.689749  2.959856  3.074871
1  1.607983  2.001075  2.806392
2  1.515471  2.635483  2.644830
3  1.950351  2.113976  1.779302
4  3.413197  1.423769  1.924129
---------------end----------------
col1    0.435350
col2    0.226832
col3    0.445905
dtype: float64
---------------end----------------
0    1.241492
1    0.138483
2    0.265261
3   -0.052123
4    0.253698
dtype: float64
---------------end----------------
col1    2.174278
col2    1.536088
col3    1.295569
dtype: float64
---------------end----------------
0    168.974915
1    -39.201732
2    -48.452922
3     -4.964864
4    141.319700
Name: col1, dtype: float64
---------------end----------------
         col1       col2        col3
0  168.974915  95.985614  107.487138
1  -39.201732   0.107497   80.639193
2  -48.452922  63.548250   64.483009
3   -4.964864  11.397646  -22.069797
4  141.319700 -57.623138   -7.587075
---------------end----------------
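
The main benefit of pipe() is that calls can be chained and read left to right. A minimal sketch, where add_constant and scale are hypothetical helper functions invented for this illustration:

```python
import pandas as pd


def add_constant(frame, value):
    """Add a constant to every element of the frame."""
    return frame + value


def scale(frame, factor):
    """Multiply every element of the frame by a factor."""
    return frame * factor


df = pd.DataFrame({'col1': [1.0, 2.0], 'col2': [3.0, 4.0]})

# Each pipe call passes the current frame as the first argument.
piped = df.pipe(add_constant, 2).pipe(scale, factor=10)
print(piped)

# The equivalent nested call reads inside-out instead.
nested = scale(add_constant(df, 2), factor=10)
print(piped.equals(nested))
```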

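8. Reindexing

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis. It can:

Reorder the existing data to match a new set of labels
Insert missing-value (NA) markers at label positions where no data existed for the label

import pandas as pd
import numpy as np
N = 20
df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
    'x': np.linspace(0, stop=N-1, num=N),
    'y': np.random.rand(N),
    'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist()
})
# Keep rows 0, 2, 5 and columns A, C; column B does not exist, so it is filled with NaN.
df_reindexed = df.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])
print(df_reindexed)

Output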

           A       C   B
0 2016-01-01    High NaN
2 2016-01-03  Medium NaN
5 2016-01-06  Medium NaN

Reindexing to align with another object

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['col1', 'col2', 'col3'])
df1 = df1.reindex_like(df2)   # df1 now has the same 7 row labels as df2
print(df1)

Output

       col1      col2      col3
0  0.533272  1.462343  1.958989
1  0.822496  1.020661 -0.958452
2  0.583271  1.100357  0.405649
3 -0.617700 -0.444208  0.921092
4 -0.883714 -0.068178  1.507545
5 -0.696816  0.729113 -0.509259
6 -0.127911 -0.255686 -1.378398

Filling while reindexing

pad/ffill - fill values forward
bfill/backfill - fill values backward
nearest - fill from the nearest index value

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])
print(df2.reindex_like(df1))                    # rows 2..5 are new and become NaN
print("Data Frame with Forward Fill:")
print(df2.reindex_like(df1, method='ffill'))    # new rows copy the last existing row

Output

       col1      col2      col3
0  0.518742  0.162080  1.606103
1 -0.355712  2.200266  1.072651
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill:
       col1      col2      col3
0  0.518742  0.162080  1.606103
1 -0.355712  2.200266  1.072651
2 -0.355712  2.200266  1.072651
3 -0.355712  2.200266  1.072651
4 -0.355712  2.200266  1.072651
5 -0.355712  2.200266  1.072651

Limiting the fill while reindexing: the limit argument gives extra control over filling by capping the number of consecutive labels that are filled.

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1', 'col2', 'col3'])
print(df2.reindex_like(df1))
print("Data Frame with Forward Fill limiting to 1:")
print(df2.reindex_like(df1, method='ffill', limit=1))   # only one row is filled

Output

       col1      col2      col3
0  0.550406  0.220336 -0.733154
1  0.372353  0.978386  1.202727
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill limiting to 1:
       col1      col2      col3
0  0.550406  0.220336 -0.733154
1  0.372353  0.978386  1.202727
2  0.372353  0.978386  1.202727
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
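
The examples above cover only ffill. As a sketch of how the three fill methods differ, here is a small Series with invented labels 0, 5 and 10 reindexed onto labels that fall between them. The method argument requires the original index to be sorted (monotonic).

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=[0, 5, 10])
new_index = [0, 2, 4, 6, 10]

# Without a method, labels that did not exist become NaN.
plain = s.reindex(new_index)
print(plain)

# ffill: take the value of the last existing label at or before the new label.
forward = s.reindex(new_index, method='ffill')
print(forward)

# bfill: take the value of the next existing label at or after the new label.
backward = s.reindex(new_index, method='bfill')
print(backward)

# nearest: take the value of the closest existing label.
nearest = s.reindex(new_index, method='nearest')
print(nearest)
```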

Renaming

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1', 'col2', 'col3'])
print(df1)
print("After renaming the rows and columns:")
# Labels not mentioned in the mappings are left unchanged.
print(df1.rename(columns={'col1': 'c1', 'col2': 'c2'},
                 index={0: 'apple', 1: 'banana', 2: 'durian'}))

Output

       col1      col2      col3
0  0.162944 -0.257846 -0.890368
1 -0.969776  1.685473 -1.330109
2 -1.271563 -0.375700  0.778564
3 -1.123660  0.849679  0.436355
4  0.321475  0.779693 -2.100270
5 -1.184636 -0.206975  0.941504
After renaming the rows and columns:
              c1        c2      col3
apple   0.162944 -0.257846 -0.890368
banana -0.969776  1.685473 -1.330109
durian -1.271563 -0.375700  0.778564
3      -1.123660  0.849679  0.436355
4       0.321475  0.779693 -2.100270
5      -1.184636 -0.206975  0.941504

9. Iteration

import pandas as pd
import numpy as np
N = 20
df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
    'x': np.linspace(0, stop=N-1, num=N),
    'y': np.random.rand(N),
    'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist()
})
# Iterating over a DataFrame directly yields its column names.
for col in df:
    print(col)

Output

A
x
y
C
D

To iterate over the contents of a DataFrame, use the following functions:

items() - iterate over (column name, Series) pairs; called iteritems() before pandas 2.0
iterrows() - iterate over the rows as (index, Series) pairs
itertuples() - iterate over the rows as namedtuples

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
print('--------------items----------------')
# items() replaces iteritems(), which was removed in pandas 2.0.
for key, value in df.items():
    print(key, value)
print('----------------end----------------')
print('-------------iterrows--------------')
for row_index, row in df.iterrows():
    print(row_index, row)
print('----------------end----------------')
print('-------------itertuples------------')
for row in df.itertuples():
    print(row)
print('----------------end----------------')

Output

--------------items----------------
col1 0   -0.453626
1   -1.555137
2    1.209289
3    0.238345
Name: col1, dtype: float64
col2 0   -0.309713
1   -0.018258
2    0.326646
3    1.584639
Name: col2, dtype: float64
col3 0   -1.746411
1    0.144020
2    0.932400
3   -0.848700
Name: col3, dtype: float64
----------------end----------------
-------------iterrows--------------
0 col1   -0.453626
col2   -0.309713
col3   -1.746411
Name: 0, dtype: float64
1 col1   -1.555137
col2   -0.018258
col3    0.144020
Name: 1, dtype: float64
2 col1    1.209289
col2    0.326646
col3    0.932400
Name: 2, dtype: float64
3 col1    0.238345
col2    1.584639
col3   -0.848700
Name: 3, dtype: float64
----------------end----------------
-------------itertuples------------
Pandas(Index=0, col1=-0.453625680715928, col2=-0.30971276978094636, col3=-1.7464111236386397)
Pandas(Index=1, col1=-1.5551365938912898, col2=-0.018257622785818713, col3=0.1440202346073698)
Pandas(Index=2, col1=1.2092886777094904, col2=0.3266461576970751, col3=0.9323998460902878)
Pandas(Index=3, col1=0.23834535595475798, col2=1.5846386089382405, col3=-0.8486996087036667)
----------------end----------------
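
One detail worth noting about these iterators: iterrows() builds a new Series for every row, so a row that mixes integers and floats is upcast to a single dtype, while itertuples() keeps each field's own value. A small sketch with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3.5, 4.5]})

# items() yields (column name, Series) pairs, one per column.
column_names = [name for name, column in df.items()]
print(column_names)

# iterrows() yields (index label, Series); the int column is upcast to float.
first_label, first_row = next(df.iterrows())
print(first_row.dtype)

# itertuples() yields namedtuples; index=False leaves the index label out.
first_tuple = next(df.itertuples(index=False))
print(first_tuple)
```

For numeric work on whole columns, vectorized operations or apply() are usually preferable to any of these loops.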

10. Sorting

sort_values() accepts a kind argument for choosing among mergesort, heapsort and quicksort.

import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10, 2), index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7], columns=['col2', 'col1'])
print(unsorted_df)
print('-------sort by label--------')
sorted_df = unsorted_df.sort_index()
print(sorted_df)
print('----reverse sort order------')
sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)
print('---sort by column label-----')
sorted_df = unsorted_df.sort_index(axis=1)
print(sorted_df)
print('-------sort by value--------')
sorted_df = unsorted_df.sort_values(by='col1')
print(sorted_df)

Output

       col2      col1
1  0.295840 -0.880007
4  0.151129  1.843255
6 -0.516764  0.195839
2 -0.040592  0.582046
3  1.806547 -0.760579
5 -1.366668  0.652985
9 -1.180956  1.198587
8 -1.621409 -0.555094
0  0.403722  0.296659
7  0.520232 -0.759177
-------sort by label--------
       col2      col1
0  0.403722  0.296659
1  0.295840 -0.880007
2 -0.040592  0.582046
3  1.806547 -0.760579
4  0.151129  1.843255
5 -1.366668  0.652985
6 -0.516764  0.195839
7  0.520232 -0.759177
8 -1.621409 -0.555094
9 -1.180956  1.198587
----reverse sort order------
       col2      col1
9 -1.180956  1.198587
8 -1.621409 -0.555094
7  0.520232 -0.759177
6 -0.516764  0.195839
5 -1.366668  0.652985
4  0.151129  1.843255
3  1.806547 -0.760579
2 -0.040592  0.582046
1  0.295840 -0.880007
0  0.403722  0.296659
---sort by column label-----
       col1      col2
1 -0.880007  0.295840
4  1.843255  0.151129
6  0.195839 -0.516764
2  0.582046 -0.040592
3 -0.760579  1.806547
5  0.652985 -1.366668
9  1.198587 -1.180956
8 -0.555094 -1.621409
0  0.296659  0.403722
7 -0.759177  0.520232
-------sort by value--------
       col2      col1
1  0.295840 -0.880007
3  1.806547 -0.760579
7  0.520232 -0.759177
8 -1.621409 -0.555094
6 -0.516764  0.195839
0  0.403722  0.296659
2 -0.040592  0.582046
5 -1.366668  0.652985
9 -1.180956  1.198587
4  0.151129  1.843255
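
The kind argument mentioned at the start of this section is not used in the example above. A short sketch with invented data containing ties: mergesort is a stable algorithm, so rows with equal keys keep their original relative order. Sorting by several columns with mixed directions is shown as well.

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# Stable sort: the three rows with col1 == 1 stay in their original order.
stable = df.sort_values(by='col1', kind='mergesort')
print(stable)

# Sort by col1 ascending, then break ties by col2 descending.
multi = df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(multi)
```

According to the pandas documentation, kind is only applied to a DataFrame when sorting on a single column or label.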

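11. Working with Strings and Text Data

Function	Description
lower()	Converts the strings in the Series/Index to lowercase.
upper()	Converts the strings in the Series/Index to uppercase.
len()	Computes the length of each string.
strip()	Strips whitespace from both sides of each string in the Series/Index.
split()	Splits each string on the given pattern.
cat()	Concatenates the Series/Index elements with the given separator.
get_dummies()	Returns a DataFrame of one-hot encoded values.
contains()	Returns a boolean for each element: True if the element contains the substring.
replace(a,b)	Replaces the value a with the value b.
repeat()	Repeats each element the specified number of times.
count()	Returns the number of occurrences of the pattern in each element.
startswith()	Returns True if the element starts with the pattern.
endswith()	Returns True if the element ends with the pattern.
find()	Returns the position of the first occurrence of the pattern.
findall()	Returns a list of all occurrences of the pattern.
swapcase()	Swaps upper and lower case.
islower()	Checks whether the string is lowercase.
isupper()	Checks whether the string is uppercase.
isnumeric()	Checks whether the string is numeric.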
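
These functions are reached through the .str accessor of a Series or Index. A minimal sketch of a few of them, using made-up sample strings:

```python
import pandas as pd

s = pd.Series(['  Tom ', 'William Rick', 'john', 'Alber@t'])

# Accessor calls can be chained: strip whitespace, then lowercase.
cleaned = s.str.strip().str.lower()
print(cleaned.tolist())

# Length of each raw string, including the surrounding spaces.
lengths = s.str.len()
print(lengths.tolist())

# regex=False treats the pattern as a literal substring.
has_space = s.str.contains(' ', regex=False)
print(has_space.tolist())

# With no other Series given, cat joins all elements into one string.
joined = s.str.strip().str.cat(sep='_')
print(joined)

# split returns a list of pieces for each element.
pieces = s.str.strip().str.split(' ')
print(pieces.tolist())

# get_dummies builds one indicator column per distinct value.
dummies = pd.Series(['a', 'b', 'a']).str.get_dummies()
print(dummies)
```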

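12. Customizing Display Options

pd.get_option(param) # show the current value
pd.set_option(param, value) # set a new value
pd.reset_option(param) # reset to the default value
pd.describe_option(param) # print the description of the option
pd.option_context(param, value) # set a value temporarily; it is restored on leaving the with block

Parameter	Description
"display.max_rows"	Maximum number of rows displayed.
"display.max_columns"	Maximum number of columns displayed.
"display.expand_frame_repr"	Whether a wide DataFrame is printed stretched across multiple lines.
"display.max_colwidth"	Maximum column width displayed.
"display.precision"	Precision of displayed decimal numbers.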
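
A short sketch of how these calls fit together. The values 20 and 2 are arbitrary choices for illustration, and the comparison with the starting value assumes the options have not been customized earlier in the session.

```python
import pandas as pd

# Read the current value, change it, then restore the default.
default_rows = pd.get_option('display.max_rows')
pd.set_option('display.max_rows', 20)
after_set = pd.get_option('display.max_rows')
pd.reset_option('display.max_rows')
after_reset = pd.get_option('display.max_rows')
print(default_rows, after_set, after_reset)

# option_context changes an option only inside the with block.
precision_before = pd.get_option('display.precision')
with pd.option_context('display.precision', 2):
    precision_inside = pd.get_option('display.precision')
    print(pd.Series([3.14159]))
precision_after = pd.get_option('display.precision')
print(precision_before, precision_inside, precision_after)
```

option_context is the safer choice inside functions or notebooks, because the previous setting comes back even if the block raises an exception.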