Advanced pandas

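  pandas is a library built on top of NumPy and, for data processing, can be thought of as an enhanced NumPy. NumPy is aimed mainly at scientific computing and is less suited to data wrangling, whereas everyday data usually comes with column labels and a row index; pandas was developed as a data analysis package to handle exactly that.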

pandas data structures (Series/DataFrame)

I. Series

1. Creating a Series

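  A Series is a one-dimensional, array-like data structure made up of a sequence of values (of any NumPy dtype) and an associated set of data labels (the index). Structurally it behaves like a fixed-length, ordered dictionary that maps index labels to values; the index and the values are held as separate objects.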

In [2]:
import pandas as pd
import numpy as np
In [3]:
# Create a Series
a1 = pd.Series([1, 2, 3])  # from a Python list
a1
Out[3]:
0    1
1    2
2    3
dtype: int64
In [4]:
a2 = pd.Series(np.array([1, 2, 3]))  # from a NumPy array
a2
Out[4]:
0    1
1    2
2    3
dtype: int32
In [5]:
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"])  # with explicit index labels
a3
Out[5]:
index1    1
index2    2
index3    3
dtype: int64
In [6]:
a4 = pd.Series({"index1": 1, "index2": 2, "index3": 3})  # from a dict; the keys become the index
a4
Out[6]:
index1    1
index2    2
index3    3
dtype: int64
In [8]:
a5 = pd.Series({"index": 1, "index2": 2, "index3": 3},
               index=["index1", "index2", "index3"])  # dict plus explicit index: labels with no matching key become NaN
a5
Out[8]:
index1    NaN
index2    2.0
index3    3.0
dtype: float64
In [9]:
a6 = pd.Series(10, index=["index1", "index2", "index3"])  # a scalar is repeated for every label
a6
Out[9]:
index1    10
index2    10
index3    10
dtype: int64
 

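2. Series attributes

  A Series can be treated as a fixed-length, ordered dictionary.

  Its properties are available through attributes such as shape (dimensions), size (number of elements), index (the keys), and values (the data).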

In [10]:
a1 = pd.Series([1, 2, 3])
a1.index  # the index of the Series
Out[10]:
RangeIndex(start=0, stop=3, step=1)
In [12]:
a1.values  # the values of the Series
Out[12]:
array([1, 2, 3], dtype=int64)
In [13]:
a1.name = "population"  # name the Series
a1.index.name = "state"  # name the index
a1
Out[13]:
state
0    1
1    2
2    3
Name: population, dtype: int64
In [14]:
a1.shape
Out[14]:
(3,)
In [15]:
a1.size
Out[15]:
3
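
To make the "ordered dictionary" view concrete, here is a small sketch (the Series `s` and the variable names are illustrative assumptions, not from the original) that uses dictionary-style operations next to the array-style attributes shown above:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Dictionary-style operations work on the index labels.
has_b = "b" in s            # membership tests the index, not the values
has_2 = 2 in s              # 2 is a value, not a label
pairs = list(s.items())     # (label, value) pairs in index order
as_dict = s.to_dict()       # convert back to a plain dict

# Array-style attributes describe the underlying data.
print(s.shape, s.size, s.dtype)
print(has_b, has_2, pairs, as_dict)
```

Note that `in` checks labels only; to test for a value, `s.isin([...])` or `value in s.values` is the usual route.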
 

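3. Selecting Series elements

loc is explicit indexing (by label); iloc is implicit indexing (by integer position). In the patterns below, indexname stands for an index label and loc for an integer position. Plain s[loc] on a label-indexed Series relies on a positional fallback that recent pandas releases deprecate, so iloc is the safer choice.

Accessing a single element

s[indexname]
s.loc[indexname] (recommended)
s[loc]
s.iloc[loc] (recommended)

Accessing multiple elements

s[[indexname1,indexname2]]
s.loc[[indexname1,indexname2]] (recommended)
s[[loc1,loc2]]
s.iloc[[loc1,loc2]] (recommended)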

In [17]:
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"])
a3
Out[17]:
index1    1
index2    2
index3    3
dtype: int64
In [18]:
a3["index1"]
Out[18]:
1
In [19]:
a3.loc['index1']
Out[19]:
1
In [20]:
a3[1]
Out[20]:
2
In [22]:
a3.iloc[1]
Out[22]:
2
In [23]:
a3[['index1','index2']]
Out[23]:
index1    1
index2    2
dtype: int64
In [24]:
a3.loc[['index1','index2']]
Out[24]:
index1    1
index2    2
dtype: int64
In [25]:
a3[[1,2]]
Out[25]:
index2    2
index3    3
dtype: int64
In [26]:
a3.iloc[[1,2]]
Out[26]:
index2    2
index3    3
dtype: int64
In [27]:
a3[a3 > np.mean(a3)]  # select with a boolean mask
Out[27]:
index3    3
dtype: int64
In [28]:
a3[0:2]  # slice by position (end excluded)
Out[28]:
index1    1
index2    2
dtype: int64
In [30]:
a3["index1":"index2"]  # slice by label (both ends included)
Out[30]:
index1    1
index2    2
dtype: int64
 

4. Modifying Series elements

In [32]:
# Modify elements
a3["index3"] = 100  # modify by label
a3
Out[32]:
index1      1
index2      2
index3    100
dtype: int64
In [33]:
a3[2] = 1000  # modify by position (a3.iloc[2] = 1000 is the explicit form)
a3
Out[33]:
index1       1
index2       2
index3    1000
dtype: int64
 

5. Adding Series elements

In [34]:
# Add an element
a3["index4"] = 10  # assigning to a new label appends it
a3
Out[34]:
index1       1
index2       2
index3    1000
index4      10
dtype: int64
 

6. Deleting Series elements

In [35]:
a3.drop(["index4", "index3"], inplace=True)  # inplace=True modifies the Series itself
a3
Out[35]:
index1    1
index2    2
dtype: int64
 

7. Series methods

In [36]:
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"])
a3["index3"] = np.nan  # overwrite a value with NaN; the dtype becomes float64
a3
Out[36]:
index1    1.0
index2    2.0
index3    NaN
dtype: float64
In [37]:
a3.isnull()  # element-wise test for missing values
Out[37]:
index1    False
index2    False
index3     True
dtype: bool
In [38]:
a3.notnull()  # element-wise test for non-missing values
Out[38]:
index1     True
index2     True
index3    False
dtype: bool
In [39]:
"index1" in a3  # check whether a label exists in the index
Out[39]:
True
In [47]:
a3.isin([1,2])  # element-wise test: is each value one of the listed values?
Out[47]:
index1     True
index2     True
index3    False
dtype: bool
In [48]:
a3.unique()  # distinct values (NaN included)
Out[48]:
array([ 1.,  2., nan])
In [49]:
a3.value_counts()  # distinct values with their counts (NaN excluded by default)
Out[49]:
2.0    1
1.0    1
dtype: int64
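
The methods above are often combined when cleaning a column. A short sketch under the assumption of a small float Series with one gap (the data and variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 2.0], index=["a", "b", "c", "d"])

n_missing = s.isna().sum()              # count of missing values (isna is an alias of isnull)
counts = s.value_counts(dropna=False)   # include NaN as its own category
cleaned = s.dropna()                    # new Series without the missing entries
filled = s.fillna(s.mean())             # replace NaN with the mean of the other values

print(n_missing)
print(counts)
print(cleaned)
print(filled)
```

Both `dropna` and `fillna` return new objects here; the original `s` still contains the NaN.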
 

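II. DataFrame

  A DataFrame is a tabular data structure that can be viewed as a dictionary of Series sharing one index. It consists of several columns of data arranged in a fixed order and was designed to extend the Series from one dimension to several. A DataFrame has both a row index and a column index.

Row index: index
Column index: columns
Values: values (a two-dimensional NumPy array)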
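
As a rough illustration of the "dictionary of Series" view, the sketch below builds a two-column frame from two Series and inspects the three parts just listed. The data and names (`color`, `price`, `r1`, `r2`) are assumptions made for the example:

```python
import pandas as pd

color = pd.Series(["green", "red"], index=["r1", "r2"])
price = pd.Series([1, 2], index=["r1", "r2"])
df = pd.DataFrame({"color": color, "price": price})

col = df["price"]                          # each column is itself a Series
same_index = col.index.equals(df.index)    # every column shares the row index
arr = df.to_numpy()                        # the values as a 2-D NumPy array

print(list(df.index), list(df.columns), arr.shape)
```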

 

1. Creating a DataFrame

1.1 From a dictionary

In [50]:
data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]}
dataFrame1 = pd.DataFrame(data=data)  # create from a dict of lists
dataFrame1
Out[50]:
 
  color price
0 green 1
1 red 2
2 blue 3
3 black 4
4 yellow 5
In [51]:
dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"])
dataFrame2
Out[51]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [52]:
dataFrame3 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"],
                          columns=["price"])  # keep only the listed columns
dataFrame3
Out[52]:
 
  price
index1 1
index2 2
index3 3
index4 4
index5 5
In [53]:
dataFrame4 = pd.DataFrame(data=np.arange(12).reshape(3, 4))  # create from a NumPy array
dataFrame4
Out[53]:
 
  0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [54]:
dic = {
    'Zhang San':[150,150,150,300],
    'Li Si':[0,0,0,0]
}
pd.DataFrame(data=dic,index=['Chinese','Math','English','Science'])
Out[54]:
 
         Zhang San  Li Si
Chinese        150      0
Math           150      0
English        150      0
Science        300      0
In [56]:
data = [[0,150],[0,150],[0,150],[0,300]]
index = ['Chinese','Math','English','Science']
columns = ['Li Si','Zhang San']
pd.DataFrame(data=data,index=index,columns=columns)
Out[56]:
 
         Li Si  Zhang San
Chinese      0        150
Math         0        150
English      0        150
Science      0        300
 

1.2 From Series

In [59]:
cars = pd.Series({"Beijing": 300000, "Shanghai": 350000, "Shenzhen": 300000, "Tianjian": 200000, "Guangzhou": 250000,
                  "Chongqing": 150000})
cars
Out[59]:
Beijing      300000
Shanghai     350000
Shenzhen     300000
Tianjian     200000
Guangzhou    250000
Chongqing    150000
dtype: int64
In [60]:
cities = {"Shanghai": 90000, "Foshan": 4500, "Dongguan": 5500, "Beijing": 6600, "Nanjing": 8000, "Lanzhou": None}
apts = pd.Series(cities, name="price")
apts
Out[60]:
Shanghai    90000.0
Foshan       4500.0
Dongguan     5500.0
Beijing      6600.0
Nanjing      8000.0
Lanzhou         NaN
Name: price, dtype: float64
In [61]:
df = pd.DataFrame({"apts": apts, "cars": cars})
df
Out[61]:
 
  apts cars
Beijing 6600.0 300000.0
Chongqing NaN 150000.0
Dongguan 5500.0 NaN
Foshan 4500.0 NaN
Guangzhou NaN 250000.0
Lanzhou NaN NaN
Nanjing 8000.0 NaN
Shanghai 90000.0 350000.0
Shenzhen NaN 300000.0
Tianjian NaN 200000.0
 

1.3 From a list of dicts

In [62]:
data = [{"Beijing": 1000, "Shanghai": 2500, "Nanjing": 9850}, {"Beijing": 5000, "Shanghai": 4600, "Nanjing": 7000}]
pd.DataFrame(data)
Out[62]:
 
  Beijing Nanjing Shanghai
0 1000 9850 2500
1 5000 7000 4600
 

2. Looking up elements in a DataFrame

In [65]:
data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]}
dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"])
dataFrame2
Out[65]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [66]:
dataFrame2.columns  # all column labels
Out[66]:
Index(['color', 'price'], dtype='object')
In [67]:
dataFrame2.index  # all row labels
Out[67]:
Index(['index1', 'index2', 'index3', 'index4', 'index5'], dtype='object')
In [68]:
dataFrame2.values  # all values as a NumPy array
Out[68]:
array([['green', 1],
       ['red', 2],
       ['blue', 3],
       ['black', 4],
       ['yellow', 5]], dtype=object)
In [72]:
dataFrame2["color"]["index1"]  # look up a value by label (column first, then row; the reverse order raises an error)
Out[72]:
'green'
In [73]:
dataFrame2.at["index1", "color"]  # look up a value by label with at (row first, then column; the reverse order raises an error)
Out[73]:
'green'
In [79]:
dataFrame2.iat[0, 1]  # look up a value by integer position (row, column)
Out[79]:
1
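
The four scalar lookups can be compared side by side. This is a minimal sketch on a two-row frame (the data mirrors the example above, but the variable names are illustrative); it also shows a single-call assignment, which is generally safer than chaining two `[]` lookups when writing:

```python
import pandas as pd

df = pd.DataFrame({"color": ["green", "red"], "price": [1, 2]},
                  index=["index1", "index2"])

v_chain = df["color"]["index1"]     # column, then row (two separate lookups)
v_loc = df.loc["index1", "color"]   # row and column in one call
v_at = df.at["index1", "color"]     # scalar access by label
v_iat = df.iat[0, 0]                # scalar access by position

# For assignment, one .at/.loc call addresses the cell directly.
df.at["index1", "price"] = 10

print(v_chain, v_loc, v_at, v_iat)
print(df)
```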
 

3. Selecting a single row/column of a DataFrame

In [89]:
data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]}
dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"])
dataFrame2
Out[89]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [91]:
dataFrame2.loc["index1"]  # select one row by label
Out[91]:
color    green
price        1
Name: index1, dtype: object
In [92]:
dataFrame2.iloc[0]  # select one row by position
Out[92]:
color    green
price        1
Name: index1, dtype: object
In [96]:
dataFrame2.iloc[0:2]  # iloc selects rows and columns by integer position, much like indexing a 2-D NumPy ndarray
Out[96]:
 
  color price
index1 green 1
index2 red 2
In [100]:
dataFrame2.loc[:, "price"]  # select one column by label
Out[100]:
index1    1
index2    2
index3    3
index4    4
index5    5
Name: price, dtype: int64
In [101]:
dataFrame2.iloc[:, 0]  # select one column by position
Out[101]:
index1     green
index2       red
index3      blue
index4     black
index5    yellow
Name: color, dtype: object
In [102]:
dataFrame2.values[0]  # the first row as a NumPy array
Out[102]:
array(['green', 1], dtype=object)
In [103]:
dataFrame2["price"]  # select one column; [] with a name looks up columns, not rows
Out[103]:
index1    1
index2    2
index3    3
index4    4
index5    5
Name: price, dtype: int64
In [104]:
dataFrame2["color"] 
Out[104]:
index1     green
index2       red
index3      blue
index4     black
index5    yellow
Name: color, dtype: object
 

4. Selecting multiple rows/columns of a DataFrame

In [106]:
dataFrame2.head(5)  # first 5 rows
Out[106]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [107]:
dataFrame2.tail(5)  # last 5 rows
Out[107]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [108]:
dataFrame2["index1":"index4"]  # slice rows by label (end included)
Out[108]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
In [109]:
dataFrame2[0:4]  # slice rows by position (end excluded)
Out[109]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
In [111]:
dataFrame2.loc[["index1", "index2"]]  # multiple rows by label
Out[111]:
 
  color price
index1 green 1
index2 red 2
In [113]:
dataFrame2.iloc[[0, 1]]  # multiple rows by position
Out[113]:
 
  color price
index1 green 1
index2 red 2
In [114]:
dataFrame2.loc[:, ["price"]]  # multiple columns by label (a list returns a DataFrame)
Out[114]:
 
  price
index1 1
index2 2
index3 3
index4 4
index5 5
In [115]:
dataFrame2.iloc[:, [0, 1]]  # multiple columns by position
Out[115]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [116]:
dataFrame2.loc[["index1", "index3"], ["price"]]  # rows and columns by label
Out[116]:
 
  price
index1 1
index3 3
In [117]:
dataFrame2.iloc[[1, 2], [0]]  # rows and columns by position
Out[117]:
 
  color
index2 red
index3 blue
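
Besides labels and positions, rows can be chosen by a condition, the DataFrame counterpart of the boolean selection shown for Series. A small sketch using the same sample data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["green", "red", "blue", "black", "yellow"],
                   "price": [1, 2, 3, 4, 5]},
                  index=["index1", "index2", "index3", "index4", "index5"])

mask = df["price"] > 3              # boolean Series aligned to the rows
expensive = df.loc[mask]            # rows where the mask is True
names = df.loc[mask, "color"]       # the same rows, one column

# Conditions are combined with & and |; each one needs its own parentheses.
both = df.loc[(df["price"] > 1) & (df["color"] != "blue")]

print(expensive)
print(list(names))
print(both)
```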
 

5. Adding a row/column

In [119]:
dataFrame2.loc["index6"] = ["pink", 3]  # assigning to a new row label appends a row
dataFrame2
Out[119]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
index6 pink 3
In [120]:
dataFrame2.loc["index6"] = 10  # a scalar is broadcast across the whole row
dataFrame2
Out[120]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
index6 10 10
In [123]:
dataFrame2.iloc[5] = 10  # iloc can overwrite an existing row but cannot add a new one
dataFrame2
Out[123]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
index6 10 10
In [125]:
dataFrame2.loc["index7"] = 100  # a new label with a scalar appends a row filled with that value
dataFrame2
Out[125]:
 
  color price
index1 green 1
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
index6 10 10
index7 100 100
In [129]:
dataFrame2.loc[:, "size"] = "small"  # assigning to a new column label adds a column
dataFrame2
Out[129]:
 
  color price size
index1 green 1 small
index2 red 2 small
index3 blue 3 small
index4 black 4 small
index5 yellow 5 small
index6 10 10 small
index7 100 100 small
In [130]:
dataFrame2.iloc[:, 2] = 10  # overwrite an existing column by position
dataFrame2
Out[130]:
 
  color price size
index1 green 1 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
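
The cells above change dataFrame2 in place. As an alternative sketch (not the approach used in the original notebook), `assign` and `pd.concat` return new frames and leave the input untouched; the data and names below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"color": ["green", "red"], "price": [1, 2]},
                  index=["index1", "index2"])

# assign returns a new DataFrame with the extra column; df is left unchanged.
with_size = df.assign(size="small")

# A new row can be built as a one-row DataFrame and concatenated.
new_row = pd.DataFrame({"color": ["pink"], "price": [3]}, index=["index3"])
with_row = pd.concat([df, new_row])

print(with_size)
print(with_row)
```

Building the new row with one value per column also keeps each column's dtype, whereas broadcasting a single scalar across a row mixes numbers into the string column.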
 

6. Modifying elements

In [131]:
dataFrame2.loc["index1", "price"] = 100
dataFrame2
Out[131]:
 
  color price size
index1 green 100 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
In [132]:
dataFrame2.iloc[0, 1] = 10
dataFrame2
Out[132]:
 
  color price size
index1 green 10 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
In [133]:
dataFrame2.at["index1", "price"] = 100
dataFrame2
Out[133]:
 
  color price size
index1 green 100 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
In [135]:
dataFrame2.iat[0, 1] = 1000
dataFrame2
Out[135]:
 
  color price size
index1 green 1000 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
index6 10 10 10
index7 100 100 10
 

7. Deleting elements

In [136]:
dataFrame2.drop(["index6", "index7"], inplace=True)  # inplace=True modifies dataFrame2 itself
dataFrame2
Out[136]:
 
  color price size
index1 green 1000 10
index2 red 2 10
index3 blue 3 10
index4 black 4 10
index5 yellow 5 10
In [141]:
# axis=1 drops a column; inplace=False returns a new DataFrame and leaves dataFrame2 unchanged
a = dataFrame2.drop(["price"], axis=1, inplace=False)
dataFrame2
Out[141]:
 
  color price
index1 green 1000
index2 red 2
index3 blue 3
index4 black 4
index5 yellow 5
In [142]:
a
Out[142]:
 
  color
index1 green
index2 red
index3 blue
index4 black
index5 yellow
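
`drop` also accepts `index=` and `columns=` keywords, which some readers find clearer than `axis`. A minimal sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"color": ["green", "red", "blue"],
                   "price": [1, 2, 3],
                   "size": [10, 10, 10]},
                  index=["index1", "index2", "index3"])

no_price = df.drop(columns=["price"])   # same as df.drop(["price"], axis=1)
no_first = df.drop(index=["index1"])    # same as df.drop(["index1"], axis=0)

print(no_price)
print(no_first)
print(df.shape)  # the original frame is unchanged
```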
 

8. Handling NaN data

In [148]:
dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])
df.iloc[1, 1], df.iloc[2, 2] = np.nan, np.nan
df
Out[148]:
 
  a b c d
2018-01-01 0 1.0 2.0 3
2018-01-02 4 NaN 6.0 7
2018-01-03 8 9.0 NaN 11
 

8.1 Dropping NaN data

In [151]:
re = df.dropna(axis=1, inplace=False)  # axis=1 drops every column that contains NaN; inplace defaults to False
df
Out[151]:
 
  a b c d
2018-01-01 0 1.0 2.0 3
2018-01-02 4 NaN 6.0 7
2018-01-03 8 9.0 NaN 11
In [152]:
re
Out[152]:
 
  a d
2018-01-01 0 3
2018-01-02 4 7
2018-01-03 8 11
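
`dropna` has a few more switches than `axis`. The sketch below, on a small illustrative frame with the same NaN pattern, tries the row-wise default plus the `how`, `subset`, and `thresh` parameters:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 4, 8],
                   "b": [1.0, np.nan, 9.0],
                   "c": [2.0, 6.0, np.nan]})

rows_any = df.dropna()                      # default axis=0: drop rows with any NaN
rows_all = df.dropna(how="all")             # drop a row only if every value is NaN
rows_b = df.dropna(subset=["b"])            # look at column "b" only
cols_thresh = df.dropna(axis=1, thresh=3)   # keep columns with at least 3 non-NaN values

print(rows_any)
print(rows_all)
print(rows_b)
print(cols_thresh)
```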
 

8.2 Filling NaN data

In [153]:
re2 = df.fillna(value='*')
re2
Out[153]:
 
  a b c d
2018-01-01 0 1 2 3
2018-01-02 4 * 6 7
2018-01-03 8 9 * 11
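
Filling with a string such as '*' is handy for display, but for further computation numeric fills are usually preferable. A sketch of a few common choices, assuming a small float frame (the data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"b": [1.0, np.nan, 9.0],
                   "c": [2.0, 6.0, np.nan]})

zeros = df.fillna(0)                          # one constant for every gap
per_col = df.fillna({"b": -1.0, "c": -2.0})   # a different constant per column
means = df.fillna(df.mean())                  # each gap gets its column's mean
carried = df.ffill()                          # carry the last valid value forward

print(zeros)
print(per_col)
print(means)
print(carried)
```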
 

8.3 Checking for NaN

In [155]:
df.isnull()
Out[155]:
 
  a b c d
2018-01-01 False False False False
2018-01-02 False True False False
2018-01-03 False False True False
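
The boolean frame returned by `isnull` is usually reduced to a summary. A minimal sketch on illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 4, 8],
                   "b": [1.0, np.nan, 9.0],
                   "c": [2.0, 6.0, np.nan]})

mask = df.isnull()                     # same shape as df, True where a value is NaN
per_column = mask.sum()                # NaN count per column
has_any = bool(mask.any().any())       # is there any NaN at all?
rows_with_nan = df[mask.any(axis=1)]   # rows containing at least one NaN

print(per_column)
print(has_any)
print(rows_with_nan)
```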
 

9. Combining DataFrames

9.1 The concat function

In [156]:
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df1
Out[156]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
In [157]:
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df2
Out[157]:
 
  a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
In [158]:
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
df3
Out[158]:
 
  a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
In [159]:
# ignore_index=True discards the original indexes and renumbers the rows 0..n-1
pd.concat([df1, df2, df3], axis=0, ignore_index=True)
Out[159]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
In [160]:
# ignore_index=False (the default) keeps each frame's original index, so labels repeat
pd.concat([df1, df2, df3], axis=0, ignore_index=False)
Out[160]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
 

Using the join parameter

In [164]:
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
# join defaults to 'outer'; columns that are not shared by every frame are filled with NaN
pd.concat([df1, df2], sort=False, join='outer')
Out[164]:
 
  a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
In [166]:
# join='inner' keeps only the columns common to all frames
pd.concat([df1, df2], sort=False, join='inner',ignore_index=True)
Out[166]:
 
  b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
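
When the original indexes repeat after concatenation, the `keys` parameter of `pd.concat` can record which frame each row came from. A sketch with two small illustrative frames (the key names "first" and "second" are arbitrary):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((2, 2)), columns=["a", "b"])
df2 = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"])

# keys adds an outer index level recording which frame each row came from.
stacked = pd.concat([df1, df2], keys=["first", "second"])

from_second = stacked.loc["second"]   # recover the rows that came from df2

print(stacked)
print(from_second)
```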
 

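Aligning on one frame's index (formerly the join_axes parameter)

In [167]:
# Align the result on df1's index. The join_axes argument was removed in pandas 1.0
# (older versions: pd.concat([df1, df2], axis=1, join_axes=[df1.index]));
# reindexing the concatenated result is the replacement.
pd.concat([df1, df2], axis=1).reindex(df1.index)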
Out[167]:
 
  a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
 

9.2 The append function

In [169]:
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])

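# DataFrame.append was removed in pandas 2.0 (older versions: df1.append(df2, ignore_index=True));
# pd.concat gives the same result.
re = pd.concat([df1, df2], ignore_index=True)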
re
Out[169]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
 

Appending a single row of data

In [170]:
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
s = pd.Series([4, 4, 4, 4], index=['a', 'b', 'c', 'd'])

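# Older versions: df1.append(s, ignore_index=True).
# With pd.concat, turn the Series into a one-row frame first.
re = pd.concat([df1, s.to_frame().T], ignore_index=True)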
re
Out[170]:
 
  a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 4.0 4.0 4.0 4.0
 

9.3 The merge function

Merging on a single column

In [171]:
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'KEY': ['K1', 'K2', 'K3']})
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3'],
                    'KEY': ['K1', 'K2', 'K3']})

df1
Out[171]:
 
  A B KEY
0 A1 B1 K1
1 A2 B2 K2
2 A3 B3 K3
In [172]:
df2
Out[172]:
 
  C D KEY
0 C1 D1 K1
1 C2 D2 K2
2 C3 D3 K3
In [173]:
re = pd.merge(df1, df2, on='KEY')
re
Out[173]:
 
  A B KEY C D
0 A1 B1 K1 C1 D1
1 A2 B2 K2 C2 D2
2 A3 B3 K3 C3 D3
 

Merging on two columns

In [175]:
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'KEY1': ['K1', 'K2', 'K0'],
                    'KEY2': ['K0', 'K1', 'K3']})
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3'],
                    'KEY1': ['K0', 'K2', 'K1'],
                    'KEY2': ['K1', 'K1', 'K0']})
# how can be 'left', 'right', 'outer', or 'inner' (the default)
re = pd.merge(df1, df2, on=['KEY1', 'KEY2'], how='inner')
re
Out[175]:
 
  A B KEY1 KEY2 C D
0 A1 B1 K1 K0 C3 D3
1 A2 B2 K2 K1 C2 D2
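
To see what the `how` options keep, the sketch below merges two small illustrative frames three ways. It also uses merge's `indicator=True` option, which adds a `_merge` column saying where each row came from; the frames `left` and `right` are assumptions made for the example:

```python
import pandas as pd

left = pd.DataFrame({"KEY": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"KEY": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

inner = pd.merge(left, right, on="KEY", how="inner")   # keys present in both frames
left_j = pd.merge(left, right, on="KEY", how="left")   # every key from the left frame
outer = pd.merge(left, right, on="KEY", how="outer", indicator=True)  # union of keys

print(inner)
print(left_j)
print(outer)
```

`how='right'` mirrors `how='left'`, keeping every key from the right frame instead.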
 

Merging on the index

In [176]:
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3']},
                   index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'],
                    'D': ['D1', 'D2', 'D3']},
                   index=['K0', 'K1', 'K3'])

re = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
re
Out[176]:
 
  A B C D
K0 A1 B1 C1 D1
K1 A2 B2 C2 D2
K2 A3 B3 NaN NaN
K3 NaN NaN C3 D3
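
For index-on-index merges, `DataFrame.join` is a shorter spelling. A minimal sketch on trimmed-down versions of the frames above (one column each, chosen for brevity):

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2", "A3"]}, index=["K0", "K1", "K2"])
df2 = pd.DataFrame({"C": ["C1", "C2", "C3"]}, index=["K0", "K1", "K3"])

merged = pd.merge(df1, df2, left_index=True, right_index=True, how="outer")
joined = df1.join(df2, how="outer")   # join aligns on the index by default

print(merged)
print(joined)
```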
 

Adding suffixes to overlapping column names

In [177]:
df_boys = pd.DataFrame({'id': ['1', '2', '3'],
                        'age': ['23', '25', '18']})
df_girls = pd.DataFrame({'id': ['1', '2', '3'],
                         'age': ['18', '18', '18']})
re = pd.merge(df_boys, df_girls, on='id', suffixes=['_boys', '_girls'])
re
Out[177]:
 
  id age_boys age_girls
0 1 23 18
1 2 25 18
2 3 18 18