機器學習的過程當中不少時候須要用到相似透視表的功能。Pandas提供了pivot和pivot_table實現透視表功能。相對比而言,pivot_table更增強大,在實現透視表的時候能夠進行聚類等操做。html
pivot_table幫助地址:api
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html機器學習
官方給的幾個例子:學習
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",htm
... "bar", "bar", "bar", "bar"],blog
... "B": ["one", "one", "one", "two", "two",ip
... "one", "one", "two", "two"],pandas
... "C": ["small", "large", "large", "small",io
... "small", "large", "small", "small",table
... "large"],
... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
>>> df
A B C D E
0 foo one small 1 2
1 foo one large 2 4
2 foo one large 2 5
3 foo two small 3 5
4 foo two small 3 6
5 bar one large 4 6
6 bar one small 5 8
7 bar two small 6 9
8 bar two large 7 9
This first example aggregates values by taking the sum.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum)
>>> table
C large small
A B
bar one 4.0 5.0
two 7.0 6.0
foo one 4.0 1.0
two NaN 6.0
We can also fill missing values using the fill_value parameter.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum, fill_value=0)
>>> table
C large small
A B
bar one 4 5
two 7 6
foo one 4 1
two 0 6
The next example aggregates by taking the mean across multiple columns.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
... aggfunc={'D': np.mean,
... 'E': np.mean})
>>> table
D E
A C
bar large 5.500000 7.500000
small 5.500000 8.500000
foo large 2.000000 4.500000
small 2.333333 4.333333
We can also calculate multiple types of aggregations for any given value column.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
... aggfunc={'D': np.mean,
... 'E': [min, max, np.mean]})
>>> table
D E
mean max mean min
A C
bar large 5.500000 9.0 7.500000 6.0
small 5.500000 9.0 8.500000 8.0
foo large 2.000000 5.0 4.500000 4.0
small 2.333333 6.0 4.333333 2.0
如今的一個問題是,處理後的dataframe的columns是多層的,例如最後一個例子的columns是這個樣子的:
table.columns:
MultiIndex(levels=[['D', 'E'], ['max', 'mean', 'min']],
labels=[[0, 1, 1, 1], [1, 0, 1, 2]])
爲了後續的運算,咱們常常但願它能簡化,便於處理。也就是說吧columns拍平。你們能夠這麼處理:
table.columns =[s1 +'_'+ str(s2) for (s1,s2) in table.columns.tolist()]
table.reset_index(inplace=True)
效果以下:
table.columns
Index(['A', 'C', 'D_mean', 'E_max', 'E_mean', 'E_min'], dtype='object')
整個案例效果: