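Method | Description |
---|---|
count | Number of non-NA values |
describe | Common summary statistics for each column |
min, max | Minimum and maximum values |
argmin, argmax | Integer positions of the minimum and maximum values |
idxmin, idxmax | Index labels of the minimum and maximum values |
quantile | Sample quantile |
sum, mean | Sum and mean of each column |
median | Median |
mad | Mean absolute deviation from the mean |
var, std | Variance and standard deviation |
skew | Skewness (third moment) |
kurt | Kurtosis (fourth moment) |
cumsum | Cumulative sum |
cummin, cummax | Cumulative minimum and cumulative maximum |
cumprod | Cumulative product |
diff | First-order difference |
pct_change | Percentage change |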
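
The difference between positional and label-based results in the table is easiest to see on a small Series. The following is a minimal sketch; the Series `s` and its index labels are made up for illustration.

```python
import pandas as pd

# Illustrative data: five values with letter labels
s = pd.Series([3, 1, 4, 1, 5], index=list("abcde"))

print(s.argmin())            # integer position of the first minimum
print(s.idxmin())            # index label of the first minimum
print(s.cumsum().tolist())   # running total
print(s.cummax().tolist())   # running maximum
print(s.diff().tolist())     # first-order difference (first element is NaN)
```

Here `argmin` answers "at which position" while `idxmin` answers "under which label", which matters as soon as the index is not the default 0..n-1 range.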
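
    # isnull() tests every value for NaN/None and returns a Boolean object of the same shape
    df.isnull()
    df.notnull()

    # dropna(): filter out missing data
    # df3.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
    df.dropna()                   # drop every row that contains at least one NaN
    df.dropna(axis=1, thresh=3)   # keep only the columns that have at least 3 non-NaN values
    df.dropna(how='all')          # drop only the rows in which every value is NaN
    # For a Series, data.dropna() has the same effect as data[data.notnull()]

    # fillna(): fill in missing data
    # axis=0 fills down each column (along the rows); axis=1 fills across each row (along the columns)
    # df3.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
    df.fillna({1: 0, 2: 0.5})     # fill NaN with 0 in column 1 and with 0.5 in column 2
    df.ffill()                    # forward fill, same as the older df.fillna(method='ffill'):
                                  # each NaN takes the previous value in its column
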
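
The `how` and `thresh` options are easy to mix up, so here is a small sketch on a made-up DataFrame (the column names `a`, `b`, `c` and the values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Illustrative data: row 3 is entirely NaN, column b has a single valid value
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, np.nan],
    "c": [7.0, 8.0, 9.0, np.nan],
})

any_dropped = df.dropna()                   # keeps only rows without any NaN
all_dropped = df.dropna(how="all")          # drops only the all-NaN row
thresh_cols = df.dropna(axis=1, thresh=2)   # keeps columns with >= 2 non-NaN values
filled = df.fillna({"a": 0, "b": 0.5})      # per-column fill values; column c is untouched
ffilled = df.ffill()                        # copy the last valid value downwards

print(any_dropped.index.tolist())
print(thresh_cols.columns.tolist())
```

Note that `thresh` counts the non-NaN values required to keep a row or column, not the number of NaN values that triggers a drop.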

    print(frame.drop(['a']))
    print(frame.drop(['Ohio'], axis=1))

By default, drop removes rows; to remove columns, pass axis=1.
Columns can be deleted with drop in any of the following three ways:

1. DF = DF.drop('column_name', axis=1)
2. DF.drop('column_name', axis=1, inplace=True)
3. DF.drop(DF.columns[[0, 1, 3]], axis=1, inplace=True)  (selects the columns by position: 0, 1 and 3)

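Note: methods that would otherwise return a modified copy of the original object usually have an optional inplace parameter. If it is set to True (the default is False), the original object is modified directly. In other words, with inplace=True the data bound to the original name changes in place (cases 2 and 3 above);
with inplace=False the data bound to the original name is left unchanged, and the result has to be assigned to a new name or back to the original name (case 1 above).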
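
The difference between the two modes can be checked directly. This is a minimal sketch on an invented three-column DataFrame; the names `result` and `ret` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})

# inplace=False (the default): a new DataFrame is returned, df is unchanged
result = df.drop("x", axis=1)
print(df.columns.tolist())       # still has x, y, z
print(result.columns.tolist())   # y, z

# inplace=True: df itself is modified and the call returns None
ret = df.drop("y", axis=1, inplace=True)
print(ret)
print(df.columns.tolist())       # x, z
```

Because the in-place call returns None, writing `df = df.drop(..., inplace=True)` would replace the DataFrame with None.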
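
    df['Name'] = df['Name'].astype('datetime64[ns]')   # give the unit explicitly; a bare np.datetime64 is rejected by recent pandas

DataFrame.astype() converts the data type of a whole DataFrame or of a single column. It accepts both Python and NumPy data types.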
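
A short sketch of the common call forms follows. The column names and values are invented for illustration; the dtype strings are standard NumPy/pandas names.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "price": [1.5, 2.0, 3.25],
    "day": ["2020-01-01", "2020-01-02", "2020-01-03"],
})

df["id"] = df["id"].astype(int)                  # one column, Python type
df = df.astype({"price": "float32"})             # dict: column name -> dtype
df["day"] = df["day"].astype("datetime64[ns]")   # strings to timestamps

print(df.dtypes)
```

Passing a dict converts only the listed columns and leaves the others as they are.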
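df.duplicated() returns a Boolean Series in which every row that repeats an earlier row is marked True.
df.drop_duplicates() removes the duplicate rows, that is, the rows marked True.
Usage: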

    from pandas import DataFrame
    from IPython.display import display

    # Check for duplicates
    data.duplicated()
    # Remove duplicate rows
    data.drop_duplicates()
    # Check for duplicates in the given columns only, then remove the duplicate rows
    data.drop_duplicates(['key1'])

    df = DataFrame({'color': ['white', 'white', 'red', 'red', 'white'],
                    'value': [2, 1, 3, 3, 2]})
    display(df, df.duplicated(), df.drop_duplicates())
    # Output:
       color  value
    0  white      2
    1  white      1
    2    red      3
    3    red      3
    4  white      2
    0    False
    1    False
    2    False
    3     True
    4     True
    dtype: bool
       color  value
    0  white      2
    1  white      1
    2    red      3

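
Both methods also take `subset` and `keep` arguments. The sketch below reuses the color/value data from the example above; the result names are made up.

```python
import pandas as pd

df = pd.DataFrame({"color": ["white", "white", "red", "red", "white"],
                   "value": [2, 1, 3, 3, 2]})

by_color = df.drop_duplicates(subset=["color"])   # compare the color column only
keep_last = df.drop_duplicates(keep="last")       # keep the last row of each duplicate set
no_dupes = df.drop_duplicates(keep=False)         # discard every row that has a duplicate

print(by_color.index.tolist())
print(keep_last.index.tolist())
print(no_dupes.index.tolist())
```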
1 replace(): replace elements. replace({old_value: new_value})

    import numpy as np

    df = DataFrame({'item': ['ball', 'mug', 'pen'],
                    'color': ['white', 'rosso', 'verde'],
                    'price': [5.56, 4.20, 1.30]})
    newcolors = {'rosso': 'red', 'verde': 'green'}
    display(df, df.replace(newcolors))
    # Output:
       color  item  price
    0  white  ball   5.56
    1  rosso   mug   4.20
    2  verde   pen   1.30
       color  item  price
    0  white  ball   5.56
    1    red   mug   4.20
    2  green   pen   1.30

    # replace is also often used to replace NaN values
    df2 = DataFrame({'math': [100, 139, np.nan], 'English': [146, None, 119]},
                    index=['張三', '李四', 'Tom'])
    newvalues = {np.nan: 100}
    display(df2, df2.replace(newvalues))
    # Output:
         English   math
    張三     146.0  100.0
    李四       NaN  139.0
    Tom    119.0    NaN
         English   math
    張三     146.0  100.0
    李四     100.0  139.0
    Tom    119.0  100.0

2 map(): create a new column
Built-in form: map(function, iterable). Series form: Series.map(function) or Series.map({old_value: new_value})
The function passed to map must return a single value for each element, not an iterable.

    df3 = DataFrame({'color': ['red', 'green', 'blue'],
                     'project': ['math', 'english', 'chemistry']})
    price = {'red': 5.56, 'green': 3.14, 'chemistry': 2.79}
    df3['price'] = df3['color'].map(price)   # 'blue' is not a key of price, so it maps to NaN
    display(df3)
    # Output:
       color    project  price
    0    red       math   5.56
    1  green    english   3.14
    2   blue  chemistry    NaN

    df3 = DataFrame({'zs': [129, 130, 34], 'ls': [136, 98, 8]},
                    index=['張三', '李四', '倩倩'])
    display(df3)
    display(df3['zs'].map({129: '你好', 130: '很是好', 34: '不錯'}))
    display(df3['zs'].map({129: 120}))

    def mapscore(score):
        if score < 90:
            return 'failed'
        elif score > 120:
            return 'excellent'
        else:
            return 'pass'

    df3['status'] = df3['zs'].map(mapscore)
    df3
    # Output:
         zs   ls
    張三  129  136
    李四  130   98
    倩倩   34    8
    張三     你好
    李四    很是好
    倩倩     不錯
    Name: zs, dtype: object
    張三    120.0
    李四      NaN
    倩倩      NaN
    Name: zs, dtype: float64
         ls   zs     status
    張三  136  129  excellent
    李四   98  130  excellent
    倩倩    8   34     failed

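
As the first output above shows, values without a matching dictionary key become NaN. One way to supply a fallback instead is to pass a function. This is a minimal sketch; the default of 0.0 is an arbitrary choice for illustration.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue"])
price = {"red": 5.56, "green": 3.14}

mapped = colors.map(price)                               # 'blue' has no key -> NaN
with_default = colors.map(lambda c: price.get(c, 0.0))   # fall back to 0.0 instead

print(mapped.tolist())
print(with_default.tolist())
```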
3 rename(): replace index labels. rename({old_label: new_label})

    df4 = DataFrame({'color': ['white', 'gray', 'purple', 'blue', 'green'],
                     'value': np.random.randint(10, size=5)})
    new_index = {0: 'first', 1: 'two', 2: 'three', 3: 'four', 4: 'five'}
    display(df4, df4.rename(new_index))
    # Output:
        color  value
    0   white      2
    1    gray      0
    2  purple      9
    3    blue      2
    4   green      0
            color  value
    first   white      2
    two      gray      0
    three  purple      9
    four     blue      2
    five    green      0

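
rename can also change column labels, and it accepts a function in place of a dict. A small sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({"color": ["white", "gray"], "value": [2, 0]})

# Rename row labels and column labels in one call
renamed = df.rename(index={0: "first", 1: "second"},
                    columns={"value": "amount"})

# A function is applied to every label
upper = df.rename(columns=str.upper)

print(renamed)
print(upper.columns.tolist())
```

Like drop, rename returns a new object by default and leaves `df` unchanged.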
1 Use describe() to view the descriptive statistics of each column

    df = DataFrame(np.random.randint(10, size=10))
    display(df.describe())
    # Output:
                   0
    count  10.000000
    mean    5.900000
    std     2.685351
    min     1.000000
    25%     6.000000
    50%     7.000000
    75%     7.750000
    max     8.000000

2 Use std() to get the standard deviation of each column of a DataFrame

    df.std()
    # Output:
    0    3.306559
    dtype: float64

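3 Filter the elements of the DataFrame by the standard deviation of each column.
The condition is evaluated in every column, and any(axis=1) then selects each row in which at least one column satisfies it.

    display(df[(df > df.std() * 3).any(axis=1)])
    df.drop(df[(np.abs(df) > (3 * df.std())).any(axis=1)].index, inplace=True)
    display(df, df.shape)
    # Output:
       0  1
    2  7  9
    6  8  8
    9  8  1
       0  1
    0  5  0
    1  3  3
    3  3  5
    4  2  4
    5  7  6
    7  1  6
    8  7  7
    (7, 2)
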
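
The same rule (absolute value greater than three times the column's standard deviation) can be written with a Boolean mask, which avoids modifying the DataFrame in place. The sketch below uses fixed, made-up data so that the result is reproducible; the names `threshold`, `mask`, `outliers` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: column a contains one extreme value at row 4
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 50, 1, 1, 1, 1, 1],
    "b": list(range(10)),
})

threshold = 3 * df.std()                       # one threshold per column
mask = (np.abs(df) > threshold).any(axis=1)    # True if any column exceeds its threshold
outliers = df[mask]
cleaned = df[~mask]                            # rows that pass the check

print(outliers)
print(cleaned.shape)
```

Note that this compares raw values against 3*std, as the text does. Measuring the distance from the column mean instead would be a different rule.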
Reordering rows with take()
np.random.permutation() can be used to generate a random order.

    df5 = DataFrame(np.arange(25).reshape(5, 5))
    new_order = np.random.permutation(5)
    display(new_order)
    display(df5, df5.take(new_order))
    # Output:
    array([4, 2, 3, 1, 0])
        0   1   2   3   4
    0   0   1   2   3   4
    1   5   6   7   8   9
    2  10  11  12  13  14
    3  15  16  17  18  19
    4  20  21  22  23  24
        0   1   2   3   4
    4  20  21  22  23  24
    2  10  11  12  13  14
    3  15  16  17  18  19
    1   5   6   7   8   9
    0   0   1   2   3   4

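
For a reproducible shuffle, the permutation can be drawn from a seeded generator. This is a sketch; the seed 0 is arbitrary, and `DataFrame.sample` is shown as an alternative way to pick random rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

rng = np.random.default_rng(0)         # seeded generator for repeatable results
order = rng.permutation(len(df))       # random arrangement of the positions 0..3
shuffled = df.take(order)              # rows in the permuted order

sample = df.sample(n=2, random_state=0)   # random subset of two rows

print(order)
print(shuffled)
```

take works with integer positions, so the same call also works on a DataFrame whose index is not 0..n-1.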
The groupby() function

    import pandas as pd

    df = pd.DataFrame([{'col1': 'a', 'col2': 1, 'col3': 'aa'},
                       {'col1': 'b', 'col2': 2, 'col3': 'bb'},
                       {'col1': 'c', 'col2': 3, 'col3': 'cc'},
                       {'col1': 'a', 'col2': 44, 'col3': 'aa'}])
    display(df)
    # Group by col1 and sum col2
    display(df.groupby(by='col1').agg({'col2': 'sum'}).reset_index())
    # Group by col1 and take the maximum and minimum of col2
    display(df.groupby(by='col1').agg({'col2': ['max', 'min']}).reset_index())
    # Group by col1 and col3 and sum col2
    display(df.groupby(by=['col1', 'col3']).agg({'col2': 'sum'}).reset_index())

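
When several statistics are needed at once, named aggregation gives flat, readable column names instead of a two-level header. The sketch reuses the data above; the output names `total`, `largest` and `smallest` are made up.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c", "a"],
                   "col2": [1, 2, 3, 44],
                   "col3": ["aa", "bb", "cc", "aa"]})

# output_name=(input_column, aggregation)
summary = (df.groupby("col1")
             .agg(total=("col2", "sum"),
                  largest=("col2", "max"),
                  smallest=("col2", "min"))
             .reset_index())

print(summary)
```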

    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    from datetime import datetime

    '''
    Grouping with groupby
    '''
    df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                       'key2': ['one', 'two', 'one', 'two', 'one'],
                       'data1': np.arange(5),
                       'data2': np.arange(5)})
    print(df)
    #   key1 key2  data1  data2
    # 0    a  one      0      0
    # 1    a  two      1      1
    # 2    b  one      2      2
    # 3    b  two      3      3
    # 4    a  one      4      4

    '''
    Computing on groups
    '''
    # Group by key1 and compute the mean of data1
    grouped = df['data1'].groupby(df['key1'])
    print(grouped.mean())
    # a    1.666667
    # b    2.500000

    # Group by key1 and key2 and compute the mean of data1
    groupedmean = df['data1'].groupby([df['key1'], df['key2']]).mean()
    print(groupedmean)
    # key1  key2
    # a     one     2.0
    #       two     1.0
    # b     one     2.0
    #       two     3.0

    # Pivot the inner index level into columns
    print(groupedmean.unstack())
    # key2  one  two
    # key1
    # a     2.0  1.0
    # b     2.0  3.0

    df['key1']  # selecting a single column returns a Series

    # Group keys can be Series, and they can also be arrays of the right length
    states = np.array(['Oh', 'Ca', 'Ca', 'Oh', 'Oh'])
    years = np.array([2005, 2005, 2006, 2005, 2006])
    print(df['data1'].groupby([states, years]).mean())
    # Ca  2005    1.0
    #     2006    2.0
    # Oh  2005    1.5
    #     2006    4.0

    # Group by a column name; numeric_only=True leaves the non-numeric column key2 out of the result
    print(df.groupby('key1').mean(numeric_only=True))
    #          data1     data2
    # key1
    # a     1.666667  1.666667
    # b     2.500000  2.500000

    # Add key2 as a second group key
    print(df.groupby(['key1', 'key2']).mean())
    #            data1  data2
    # key1 key2
    # a    one     2.0    2.0
    #      two     1.0    1.0
    # b    one     2.0    2.0
    #      two     3.0    3.0

    # size() returns a Series with the number of rows in each group
    print(df.groupby(['key1', 'key2']).size())
    # key1  key2
    # a     one     2
    #       two     1
    # b     one     1
    #       two     1

    '''
    Iterating over groups
    '''
    # Iterate over the groups a and b
    for name, group in df.groupby('key1'):
        print(name)
        print(group)
    # a
    #   key1 key2  data1  data2
    # 0    a  one      0      0
    # 1    a  two      1      1
    # 4    a  one      4      4
    # b
    #   key1 key2  data1  data2
    # 2    b  one      2      2
    # 3    b  two      3      3

    # Group by several keys: each group name is a tuple of key values
    for (k1, k2), group in df.groupby(['key1', 'key2']):
        print(k1, k2)
        print(group)
    # a one
    #   key1 key2  data1  data2
    # 0    a  one      0      0
    # 4    a  one      4      4
    # a two
    #   key1 key2  data1  data2
    # 1    a  two      1      1
    # b one
    #   key1 key2  data1  data2
    # 2    b  one      2      2
    # b two
    #   key1 key2  data1  data2
    # 3    b  two      3      3

    '''
    Selecting one column or a set of columns
    '''
    # Indexing a GroupBy object with one column name or a list of column names
    # restricts the aggregation to those columns
    print(df.groupby(df['key1'])['data1'])  # a SeriesGroupBy object: data1 grouped by key1
    print(df.groupby(['key1'])[['data1', 'data2']].mean())  # data1 and data2 grouped by key1
    #          data1     data2
    # key1
    # a     1.666667  1.666667
    # b     2.500000  2.500000
    print(df.groupby(['key1', 'key2'])['data1'].mean())
    # key1  key2
    # a     one     2.0
    #       two     1.0
    # b     one     2.0
    #       two     3.0

    '''
    Grouping with a function
    '''
    # people is a DataFrame that is not defined in this section
    # (index a to e, column names that are 2 or 3 characters long).
    # To group by the length of the names, pass the len function directly.
    # Note: the axis=1 argument of groupby is deprecated in recent pandas versions.
    print(people.groupby(len, axis=1).sum())  # the group keys are the name lengths
    #       2     3
    # a  30.0  20.0
    # b  23.0  21.0
    # c  26.0  22.0
    # d  42.0  23.0
    # e  46.0  24.0

    # Functions can be mixed with arrays, dicts, lists and Series
    key_list = ['one', 'one', 'one', 'two', 'two']
    print(people.groupby([len, key_list], axis=1).min())
    #      2           3
    #    one   two   two
    # a  0.0  15.0  20.0
    # b  1.0  16.0  21.0
    # c  2.0  17.0  22.0
    # d  3.0  18.0  23.0
    # e  4.0  19.0  24.0

    '''
    Grouping by index level
    '''
    columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
                                        names=['cty', 'tenor'])
    hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
    print(hier_df)
    # cty          US                            JP
    # tenor         1         3         5         1         3
    # 0     -1.507729  2.112678  0.841736 -0.158109 -0.645219
    # 1      0.355262  0.765209 -0.287648  1.134998 -0.440188
    # 2      1.049813  0.763482 -0.362013 -0.428725 -0.355601
    # 3     -0.868420 -1.213398 -0.386798  0.137273  0.678293

    # Group by level
    print(hier_df.groupby(level='cty', axis=1).count())
    # cty  JP  US
    # 0     2   3
    # 1     2   3
    # 2     2   3
    # 3     2   3

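
Since iterating over a GroupBy object yields (name, group) pairs, the pieces can be collected into a dict for later lookup. This is a small sketch on the same key1/key2 data; the name `pieces` is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"],
                   "key2": ["one", "two", "one", "two", "one"],
                   "data1": np.arange(5)})

# Each iteration step yields the group name and the matching sub-DataFrame
pieces = {name: group for name, group in df.groupby("key1")}
print(pieces["b"])

# size() counts the rows in each (key1, key2) combination
sizes = df.groupby(["key1", "key2"]).size()
print(sizes)
```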
1 pd.merge() can be used to add the results of an aggregation to every row of df

    d1 = {'item': ['luobo', 'baicai', 'lajiao', 'donggua', 'luobo', 'baicai', 'lajiao', 'donggua'],
          'color': ['white', 'white', 'red', 'green', 'white', 'white', 'red', 'green'],
          'weight': np.random.randint(10, size=8),
          'price': np.random.randint(10, size=8)}
    df = DataFrame(d1)
    # Totals per color, numeric columns only
    sums = df.groupby('color').sum(numeric_only=True).add_prefix('total_')
    # Select several columns with a list (double brackets)
    items = df.groupby('item')[['price', 'weight']].sum()
    means = items['price'] / items['weight']
    means = means.to_frame('means_price')
    df2 = pd.merge(df, sums, left_on='color', right_index=True)
    df3 = pd.merge(df2, means, left_on='item', right_index=True)
    display(df2, df3)
    # Output (df, sums and df2 are shown):
       color     item  price  weight
    0  white    luobo      9       2
    1  white   baicai      5       9
    2    red   lajiao      5       8
    3  green  donggua      1       1
    4  white    luobo      7       4
    5  white   baicai      8       0
    6    red   lajiao      6       8
    7  green  donggua      4       3
           total_price  total_weight
    color
    green            5             4
    red             11            16
    white           29            15
       color     item  price  weight  total_price  total_weight
    0  white    luobo      9       2           29            15
    1  white   baicai      5       9           29            15
    4  white    luobo      7       4           29            15
    5  white   baicai      8       0           29            15
    2    red   lajiao      5       8           11            16
    6    red   lajiao      6       8           11            16
    3  green  donggua      1       1            5             4
    7  green  donggua      4       3            5             4

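
The merge step above can be reproduced on fixed data to make the result checkable. This is a minimal sketch with invented prices and weights; `left_on`/`right_index` match the color column of `df` against the index of the aggregated table.

```python
import pandas as pd

df = pd.DataFrame({"item": ["luobo", "lajiao", "baicai", "donggua"],
                   "color": ["white", "red", "white", "red"],
                   "price": [9, 5, 7, 6],
                   "weight": [2, 8, 4, 8]})

# One row per color, indexed by color
sums = df.groupby("color")[["price", "weight"]].sum().add_prefix("total_")

# Attach the per-color totals to every original row
merged = pd.merge(df, sums, left_on="color", right_index=True)

print(merged)
```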
2 transform and apply can be used to achieve the same thing
Using transform

    d1 = {'item': ['luobo', 'baicai', 'lajiao', 'donggua', 'luobo', 'baicai', 'lajiao', 'donggua'],
          'color': ['white', 'white', 'red', 'green', 'white', 'white', 'red', 'green'],
          'weight': np.random.randint(10, size=8),
          'price': np.random.randint(10, size=8)}
    df = DataFrame(d1)
    # sum() gives one row per group; transform gives one row per original row
    sum1 = df.groupby('color')[['price', 'weight']].sum().add_prefix('total_')
    sums2 = df.groupby('color')[['price', 'weight']].transform(lambda x: x.sum()).add_prefix('total_')
    sums3 = df.groupby('color')[['price', 'weight']].transform('sum').add_prefix('total_')
    display(df, sum1, sums2, sums3)
    # Output:
       color     item  price  weight
    0  white    luobo      7       7
    1  white   baicai      7       7
    2    red   lajiao      2       7
    3  green  donggua      6       6
    4  white    luobo      1       2
    5  white   baicai      3       6
    6    red   lajiao      7       0
    7  green  donggua      0       2
           total_price  total_weight
    color
    green            6             8
    red              9             7
    white           18            22
       total_price  total_weight
    0           18            22
    1           18            22
    2            9             7
    3            6             8
    4           18            22
    5           18            22
    6            9             7
    7            6             8
       total_price  total_weight
    0           18            22
    1           18            22
    2            9             7
    3            6             8
    4           18            22
    5           18            22
    6            9             7
    7            6             8

Using apply

    def sum_price(x):
        return x.sum()

    sums3 = df.groupby('color')[['price', 'weight']].apply(lambda x: x.sum()).add_prefix('total_')
    sums4 = df.groupby('color')[['price', 'weight']].apply(sum_price).add_prefix('total_')
    display(df, sums3, sums4)
    # Output:
       color     item  price  weight
    0  white    luobo      4       4
    1  white   baicai      0       3
    2    red   lajiao      0       4
    3  green  donggua      7       5
    4  white    luobo      3       1
    5  white   baicai      3       3
    6    red   lajiao      0       6
    7  green  donggua      0       7
           total_price  total_weight
    color
    green            7            12
    red              0            10
    white           10            11
           total_price  total_weight
    color
    green            7            12
    red              0            10
    white           10            11

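
The practical difference between an aggregation and transform is the shape of the result: an aggregation returns one row per group, while transform returns one value per original row, so its result can be assigned back as a new column without a merge. A minimal sketch on made-up data follows; the column name `share` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"color": ["white", "red", "white", "red"],
                   "price": [9, 5, 7, 6]})

per_group = df.groupby("color")["price"].sum()             # one value per color
per_row = df.groupby("color")["price"].transform("sum")    # aligned with df's rows

# Because per_row shares df's index, it can be used in row-wise arithmetic
df["share"] = df["price"] / per_row

print(per_group)
print(df)
```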