任何分組(groupby)操做都涉及原始對象的如下操做之一:python
在許多狀況下,咱們將數據分紅多個集合,並在每一個子集上應用一些函數。在應用函數中,能夠執行如下操做:shell
下面來看看建立一個DataFrame對象並對其執行全部操做 -ide
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print (df)
輸出結果:函數
Points Rank Team Year 0 876 1 Riders 2014 1 789 2 Riders 2015 2 863 2 Devils 2014 3 673 3 Devils 2015 4 741 3 Kings 2014 5 812 4 kings 2015 6 756 1 Kings 2016 7 788 1 Kings 2017 8 694 2 Riders 2016 9 701 4 Royals 2014 10 804 1 Royals 2015 11 690 2 Riders 2017
Pandas對象能夠分紅任何對象。有多種方式來拆分對象,如 -spa
如今來看看如何將分組對象應用於DataFrame對象code
示例orm
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print (df.groupby('Team'))
輸出結果:對象
<pandas.core.groupby.DataFrameGroupBy object at 0x00000245D60AD518>
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print (df.groupby('Team').groups)
輸出結果:
{ 'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64') }
按多列分組blog
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print (df.groupby(['Team','Year']).groups)
輸出結果:索引
{ ('Devils', 2014): Int64Index([2], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('Kings', 2014): Int64Index([4], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64') }
使用groupby
對象,能夠遍歷相似itertools.obj
的對象。
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') for name,group in grouped: print (name) print (group) print ('\n')
輸出結果:
2014 Points Rank Team Year 0 876 1 Riders 2014 2 863 2 Devils 2014 4 741 3 Kings 2014 9 701 4 Royals 2014
2015 Points Rank Team Year 1 789 2 Riders 2015 3 673 3 Devils 2015 5 812 4 kings 2015 10 804 1 Royals 2015
2016 Points Rank Team Year 6 756 1 Kings 2016 8 694 2 Riders 2016
2017 Points Rank Team Year 7 788 1 Kings 2017 11 690 2 Riders 2017
默認狀況下,groupby
對象具備與分組名相同的標籤名稱。
使用get_group()
方法,能夠選擇一個組。參考如下示例代碼 -
import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print (grouped.get_group(2014))
輸出結果:
Points Rank Team Year 0 876 1 Riders 2014 2 863 2 Devils 2014 4 741 3 Kings 2014 9 701 4 Royals 2014
聚合函數爲每一個組返回單個聚合值。當建立了分組(group by)對象,就能夠對分組數據執行多個聚合操做。
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print (grouped['Points'].agg(np.mean))
輸出結果:
Year 2014 795.25 2015 769.50 2016 725.00 2017 739.00 Name: Points, dtype: float64
另外一種查看每一個分組的大小的方法是應用size()
函數 -
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') print (grouped.agg(np.size))
輸出結果:
Team Devils 2 2 2 Kings 3 3 3 Riders 4 4 4 Royals 2 2 2 kings 1 1 1
經過分組系列,還能夠傳遞函數的列表或字典來進行聚合,並生成DataFrame
做爲輸出
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print(df) print('\n') grouped = df.groupby('Team') agg = grouped['Points'].agg([np.sum, np.mean, np.std]) print (agg)
輸出結果:
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
2 Devils 2 2014 863
3 Devils 3 2015 673
4 Kings 3 2014 741
5 kings 4 2015 812
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
9 Royals 4 2014 701
10 Royals 1 2015 804
11 Riders 2 2017 690
sum mean std
Team
Devils 1536 768.000000 134.350288
Kings 2285 761.666667 24.006943
Riders 3049 762.250000 88.567771
Royals 1505 752.500000 72.831998
kings 812 812.000000 NaN
分組或列上的轉換返回索引大小與被分組的索引相同的對象。所以,轉換應該返回與組塊大小相同的結果。
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') score = lambda x: (x - x.mean()) / x.std()*10 print (grouped.transform(score))
輸出結果:
Points Rank Year 0 12.843272 -15.000000 -11.618950 1 3.020286 5.000000 -3.872983 2 7.071068 -7.071068 -7.071068 3 -7.071068 7.071068 7.071068 4 -8.608621 11.547005 -10.910895 5 NaN NaN NaN 6 -2.360428 -5.773503 2.182179 7 10.969049 -5.773503 8.728716 8 -7.705963 5.000000 3.872983 9 -7.071068 7.071068 -7.071068 10 7.071068 -7.071068 7.071068 11 -8.157595 5.000000 11.618950
過濾根據定義的標準過濾數據並返回數據的子集。filter()
函數用於過濾數據。
import pandas as pd import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) filter = df.groupby('Team').filter(lambda x: len(x) >= 3) print (filter)
輸出結果:
Points Rank Team Year 0 876 1 Riders 2014 1 789 2 Riders 2015 4 741 3 Kings 2014 6 756 1 Kings 2016 7 788 1 Kings 2017 8 694 2 Riders 2016 11 690 2 Riders 2017
在上述過濾條件下,要求返回三次以上參加IPL的隊伍。