Data Analysis: Common Data Processing with pandas (Part 4)

Date: 2019-12-11


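Common aggregation methods

Method	Description
count	Number of non-NA values
describe	Summary statistics for each column
min, max	Minimum and maximum values
argmin, argmax	Integer positions of the minimum and maximum values
idxmin, idxmax	Index labels of the minimum and maximum values
quantile	Sample quantile
sum, mean	Column sum and mean
median	Median
mad	Mean absolute deviation from the mean (removed in pandas 2.0)
var, std	Variance and standard deviation
skew	Skewness (third moment)
kurt	Kurtosis (fourth moment)
cumsum	Cumulative sum
cummin, cummax	Cumulative minimum and cumulative maximum
cumprod	Cumulative product
diff	First-order difference
pct_change	Percentage change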
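
To make the difference between position-based and label-based methods in the table concrete, here is a minimal sketch. The Series `s` and its values are made up for illustration and do not come from the original article.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
s = pd.Series([3, 1, 4, 1, 5], index=list("abcde"))

pos_of_max = s.argmax()     # integer position of the maximum
label_of_max = s.idxmax()   # index label of the maximum
running_total = s.cumsum()  # cumulative sum
step_change = s.diff()      # first-order difference (the first entry is NaN)

print(pos_of_max, label_of_max)
print(running_total.tolist())
print(step_change.tolist())
```

argmax answers "where in the sequence", while idxmax answers "under which label", which matters as soon as the index is not 0, 1, 2, ...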

1 Cleaning invalid data

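df[df.isnull()]   # isnull() flags NaN/None and returns a boolean object of the same shape; used as a mask, it keeps only the flagged cells
df[df.notnull()]  # notnull() is the inverse: it keeps the non-missing cells and leaves NaN elsewhere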

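# dropna(): filter out missing data
# df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
df.dropna()                  # drop every row that contains at least one NaN
df.dropna(axis=1, thresh=3)  # keep only the columns that have at least 3 non-NaN values
df.dropna(how='all')         # drop only the rows in which every value is NaN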

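For a Series, data.dropna() and data[data.notnull()] give the same result. For a DataFrame they differ: the boolean mask keeps every row and leaves NaN in place.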
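
The three dropna variants are easy to mix up, so here is a small sketch of how they behave. The frame and its column names are hypothetical, chosen only so that each call keeps something different.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values, for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, np.nan],
    "c": [7.0, 8.0, 9.0, np.nan],
})

any_nan_dropped = df.dropna()            # keep rows that have no NaN at all
all_nan_dropped = df.dropna(how="all")   # drop only rows that are entirely NaN
cols_kept = df.dropna(axis=1, thresh=3)  # keep columns with at least 3 non-NaN values

print(any_nan_dropped.index.tolist())
print(all_nan_dropped.index.tolist())
print(cols_kept.columns.tolist())
```

Note that thresh counts the non-missing values required to keep a row or column, not the number of NaN values that trigger a drop.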

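# fillna(): fill in missing data
# method='ffill' fills forward from the previous value; method='bfill' fills backward from the next value
# axis=0 fills down each column (row by row); axis=1 fills across each row (column by column)
# df.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
df.fillna({1: 0, 2: 0.5})  # fill NaN with 0 in the column labelled 1 and with 0.5 in the column labelled 2
df.ffill()                 # forward fill down each column; equivalent to the older fillna(method='ffill')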
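
As a rough illustration of the two filling styles, the sketch below fills a small frame once with per-column constants and once by forward filling. The column names `x` and `y` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, np.nan]})

by_column = df.fillna({"x": 0, "y": 0.5})  # a different fill value per column
forward = df.ffill()                       # carry the last valid value down each column

print(by_column)
print(forward)
```

A forward fill cannot fill a leading NaN, because there is no earlier value to copy; the first entry of `y` therefore stays missing.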

2 Using the drop function

Using drop: deleting rows and columns

print(frame.drop(['a']))             # drop the row labelled 'a'
print(frame.drop(['Ohio'], axis=1))  # drop the column labelled 'Ohio'

drop removes rows by default; to remove columns, pass axis=1.

Using drop: the inplace parameter

With the drop method, columns can be removed in any of the following three forms:

1. DF = DF.drop('column_name', axis=1)
2. DF.drop('column_name', axis=1, inplace=True)
3. DF.drop(DF.columns[[0, 1, 3]], axis=1, inplace=True)

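Note: methods that modify an array and return a new one usually accept an optional inplace parameter. When it is set to True (the default is False), the original object is modified directly. In other words, with inplace=True the data behind the original name changes in place (as in forms 2 and 3).

With inplace=False the data behind the original name is left unchanged, so the result has to be assigned to a new name or back to the original one (as in form 1).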
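
The effect of inplace can be checked directly. The following sketch uses a hypothetical two-column frame; the labels 'Ohio' and 'a' echo the earlier drop calls, and 'Texas' and 'b' are added only for illustration.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
frame = pd.DataFrame({"Ohio": [1, 2], "Texas": [3, 4]}, index=["a", "b"])

# inplace=False (the default): the original is untouched and a new object is returned.
without_ohio = frame.drop("Ohio", axis=1)
print(list(frame.columns))
print(list(without_ohio.columns))

# inplace=True: the original is modified and the call returns None.
result = frame.drop("a", inplace=True)
print(result)
print(list(frame.index))
```

Because the in-place call returns None, writing `frame = frame.drop(..., inplace=True)` would replace the frame with None, which is a common slip.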

Data type conversion with astype

df['Name'] = df['Name'].astype('datetime64[ns]')  # give the unit explicitly; newer pandas versions reject the unit-less np.datetime64

DataFrame.astype() converts the data type of a whole DataFrame or of a single column, and accepts both Python and NumPy data types.
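
A short sketch of column-wise conversion follows. The column names and values are hypothetical. For the date column it uses pd.to_datetime rather than astype, a swapped-in alternative that parses strings explicitly.

```python
import pandas as pd

# Hypothetical data: both columns start out as strings.
df = pd.DataFrame({
    "count": ["1", "2", "3"],
    "when": ["2019-12-11", "2019-12-12", "2019-12-13"],
})

df["count"] = df["count"].astype("int64")  # string -> integer
df["when"] = pd.to_datetime(df["when"])    # string -> datetime (alternative to astype)

print(df.dtypes)
```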

3 pandas data processing methods

(1) Removing duplicate data

df.duplicated() returns a boolean Series in which duplicated rows are True.

df.drop_duplicates() removes the duplicated rows, that is, the rows flagged True.

Parameters

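subset : column label or sequence of labels, optional
Columns to consider when identifying duplicates; all columns by default
keep : {'first', 'last', False}, default 'first'
'first' keeps the first occurrence of each duplicate, 'last' keeps the last, and False drops all of them
inplace : boolean, default False
Whether to modify the original data directly or return a modified copy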

# Check for duplicates
data.duplicated()
# Remove duplicate rows
data.drop_duplicates()
# Look for duplicates in the given column only, then remove the duplicate rows
data.drop_duplicates(['key1'])

from pandas import DataFrame  # the examples also rely on display() from IPython/Jupyter
df = DataFrame({'color':['white','white','red','red','white'],
               'value':[2,1,3,3,2]})
display(df,df.duplicated(),df.drop_duplicates())

# Output:
   color  value
0  white      2
1  white      1
2    red      3
3    red      3
4  white      2
0    False
1    False
2    False
3     True
4     True
dtype: bool
   color  value
0  white      2
1  white      1
2    red      3
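
The keep and subset parameters change which rows survive. The sketch below reuses the color/value data from the example above; the result variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"color": ["white", "white", "red", "red", "white"],
                   "value": [2, 1, 3, 3, 2]})

keep_first = df.drop_duplicates()                # default: keep='first'
keep_last = df.drop_duplicates(keep="last")      # keep the last occurrence instead
drop_all = df.drop_duplicates(keep=False)        # drop every row that has a twin
by_color = df.drop_duplicates(subset=["color"])  # compare on 'color' only

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(drop_all.index.tolist())
print(by_color.index.tolist())
```

With subset=['color'] the value column is ignored when deciding what counts as a duplicate, so only the first white row and the first red row remain.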

(2) Mapping

1 replace(): replace elements, called as replace({old_value: new_value})

df = DataFrame({'item':['ball','mug','pen'],
               'color':['white','rosso','verde'],
               'price':[5.56,4.20,1.30]})
newcolors = {'rosso':'red','verde':'green'}
display(df,df.replace(newcolors))

# Output:
   color  item  price
0  white  ball   5.56
1  rosso   mug   4.20
2  verde   pen   1.30
   color  item  price
0  white  ball   5.56
1    red   mug   4.20
2  green   pen   1.30

replace() is also frequently used to replace NaN elements:

df2 = DataFrame({'math':[100,139,np.nan],'English':[146,None,119]},index = ['張三','李四','Tom'])
newvalues = {np.nan:100}
display(df2,df2.replace(newvalues))

# Output:
     English   math
張三    146.0  100.0
李四      NaN  139.0
Tom    119.0    NaN
     English   math
張三    146.0  100.0
李四    100.0  139.0
Tom    119.0  100.0

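2 map(): create a new column

Built-in form: map(function, iterable). pandas form: Series.map(function) or Series.map({key: value}).

The function passed to map must return a single value for each element, not an iterable.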

df3 = DataFrame({'color':['red','green','blue'],'project':['math','english','chemistry']})
price = {'red':5.56,'green':3.14,'chemistry':2.79}
df3['price'] = df3['color'].map(price)
display(df3)

# Output:
   color    project  price
0    red       math   5.56
1  green    english   3.14
2   blue  chemistry    NaN


df3 = DataFrame({'zs':[129,130,34],'ls':[136,98,8]},index = ['張三','李四','倩倩'])
display(df3)
display(df3['zs'].map({129:'你好',130:'很是好',34:'不錯'}))
display(df3['zs'].map({129:120}))
def mapscore(score):
    if score<90:
        return 'failed'
    elif score>120:
        return 'excellent'
    else:
        return 'pass'
df3['status'] = df3['zs'].map(mapscore)
df3

# Output:
      zs   ls
張三  129  136
李四  130   98
倩倩   34    8

張三     你好
李四    很是好
倩倩     不錯
Name: zs, dtype: object

張三    120.0
李四      NaN
倩倩      NaN
Name: zs, dtype: float64

      ls   zs     status
張三  136  129  excellent
李四   98  130  excellent
倩倩    8   34     failed

3 rename(): replace index labels, called as rename({old_label: new_label})

df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
display(df4,df4.rename(new_index))

# Output:
    color  value
0   white      2
1    gray      0
2  purple      9
3    blue      2
4   green      0
        color  value
first   white      2
two      gray      0
three  purple      9
four     blue      2
five    green      0
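
rename() applies a dict to the row index by default, and the same call can also rename columns through the columns keyword. A minimal sketch, with made-up labels:

```python
import pandas as pd

# Hypothetical frame, for illustration only.
df = pd.DataFrame({"color": ["white", "gray"], "value": [2, 0]})

renamed = df.rename(index={0: "first", 1: "second"},
                    columns={"value": "amount"})

print(renamed)
print(list(df.columns))  # the original frame is unchanged
```

Labels that do not appear in the dict are left as they are, and the original frame is not modified unless inplace=True is passed.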

(3) Outlier detection and filtering

1 Use describe() to view the descriptive statistics of each column

df = DataFrame(np.random.randint(10,size = 10))
display(df.describe())

# Output:
               0
count  10.000000
mean    5.900000
std     2.685351
min     1.000000
25%     6.000000
50%     7.000000
75%     7.750000
max     8.000000

2 Use std() to get the standard deviation of each column of a DataFrame

df.std()

# Output (from a different random sample than the describe() call above):
0    3.306559
dtype: float64

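3 Filter the DataFrame's elements by each column's standard deviation.
The comparison produces a boolean frame; any(axis=1) then flags every row in which at least one column meets the condition.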

display(df[(df > df.std()*3).any(axis=1)])   # rows holding a value above 3 standard deviations
df.drop(df[(np.abs(df) > (3*df.std())).any(axis=1)].index, inplace=True)   # drop those rows in place
display(df, df.shape)

# Output:
   0  1
2  7  9
6  8  8
9  8  1
   0  1
0  5  0
1  3  3
3  3  5
4  2  4
5  7  6
7  1  6
8  7  7
(7, 2)
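
The filter above compares raw values with three standard deviations. A common variant, shown here as a sketch and not as the article's method, measures the distance from each column's mean instead. The data, the seed, and the planted outlier are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 standard-normal rows with one planted outlier.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 2)), columns=["x", "y"])
df.loc[0, "x"] = 50.0

deviation = (df - df.mean()).abs()                   # distance from each column's mean
is_outlier = (deviation > 3 * df.std()).any(axis=1)  # any column beyond 3 std flags the row
cleaned = df[~is_outlier]

print(int(is_outlier.sum()), cleaned.shape)
```

Centering on the mean matters when the data does not sit around zero; otherwise ordinary values far from zero would be flagged as outliers.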

(4) Reordering

Use take() to reorder rows.
np.random.permutation() can supply a random order.

df5 = DataFrame(np.arange(25).reshape(5,5))
new_order = np.random.permutation(5)
display(new_order)
display(df5,df5.take(new_order))

# Output:
array([4, 2, 3, 1, 0])
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2   10  11  12  13  14
3   15  16  17  18  19
4   20  21  22  23  24
    0   1   2   3   4
4   20  21  22  23  24
2   10  11  12  13  14
3   15  16  17  18  19
1   5   6   7   8   9
0   0   1   2   3   4

(5) Grouping data

The groupby() function

import pandas as pd
df = pd.DataFrame([{'col1':'a', 'col2':1, 'col3':'aa'}, {'col1':'b', 'col2':2, 'col3':'bb'}, {'col1':'c', 'col2':3, 'col3':'cc'}, {'col1':'a', 'col2':44, 'col3':'aa'}])
display(df)
# Group by col1 and sum col2
display(df.groupby(by='col1').agg({'col2': 'sum'}).reset_index())
# Group by col1 and take the maximum and minimum of col2
display(df.groupby(by='col1').agg({'col2':['max', 'min']}).reset_index())
# Group by col1 and col3 and sum col2
display(df.groupby(by=['col1', 'col3']).agg({'col2': 'sum'}).reset_index())
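
Passing a list of functions, as in the max/min call above, yields two-level column names. If flat names are preferred, named aggregation is one option. The sketch below reuses the same four rows; the output column names total, largest and smallest are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {"col1": "a", "col2": 1, "col3": "aa"},
    {"col1": "b", "col2": 2, "col3": "bb"},
    {"col1": "c", "col2": 3, "col3": "cc"},
    {"col1": "a", "col2": 44, "col3": "aa"},
])

# Named aggregation: output_column=(input_column, function)
summary = (
    df.groupby("col1")
      .agg(total=("col2", "sum"),
           largest=("col2", "max"),
           smallest=("col2", "min"))
      .reset_index()
)
print(summary)
```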

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from datetime import datetime
'''
Grouping with groupby
'''
df=pd.DataFrame({'key1':['a','a','b','b','a'],
                 'key2':['one','two','one','two','one'],
                 'data1':np.arange(5),
                 'data2':np.arange(5)})
print(df)
#   key1 key2  data1  data2
# 0    a  one      0      0
# 1    a  two      1      1
# 2    b  one      2      2
# 3    b  two      3      3
# 4    a  one      4      4
 
'''
Computing on groups
'''
# Group by key1 and compute the mean of data1
grouped=df['data1'].groupby(df['key1'])
print(grouped.mean())
# a    1.666667
# b    2.500000
 
# Group by key1 and key2 and compute the mean of data1
groupedmean=df['data1'].groupby([df['key1'],df['key2']]).mean()
print(groupedmean)
# key1  key2
# a     one     2
#       two     1
# b     one     2
#       two     3
 
# Turn the inner index level (key2) into columns
print(groupedmean.unstack())
# key2  one  two
# key1
# a       2    1
# b       2    3
 
df['key1']  # selecting a single column returns a Series
 
# The grouping keys can be Series or arrays
states=np.array(['Oh','Ca','Ca','Oh','Oh'])
years=np.array([2005,2005,2006,2005,2006])
print(df['data1'].groupby([states,years]).mean())
# Ca  2005    1.0
#     2006    2.0
# Oh  2005    1.5
#     2006    4.0
 
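# Group by column name; recent pandas no longer drops non-numeric columns such as key2 automatically, so exclude them explicitly
print(df.groupby('key1').mean(numeric_only=True))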
#          data1     data2
# key1
# a     1.666667  1.666667
# b     2.500000  2.500000
 
# Add key2 as a second grouping key
print(df.groupby(['key1','key2']).mean())
#            data1  data2
# key1 key2
# a    one       2      2
#      two       1      1
# b    one       2      2
#      two       3      3
 
# size() returns a Series holding the number of rows in each group
print(df.groupby(['key1','key2']).size())
# key1  key2
# a     one     2
#       two     1
# b     one     1
#       two     1
 
'''
Iterating over groups
'''
 
# Iterate over the groups of key1 (a and b)
for name,group in df.groupby('key1'):
    print(name)
    print(group)
# a
#   key1 key2  data1  data2
# 0    a  one      0      0
# 1    a  two      1      1
# 4    a  one      4      4
# b
#   key1 key2  data1  data2
# 2    b  one      2      2
# 3    b  two      3      3
 
# Group by multiple keys; each group name is a tuple
for (k1,k2),group in df.groupby(['key1','key2']):
    print(k1,k2)
    print(group)
# a one
#   key1 key2  data1  data2
# 0    a  one      0      0
# 4    a  one      4      4
# a two
#   key1 key2  data1  data2
# 1    a  two      1      1
# b one
#   key1 key2  data1  data2
# 2    b  one      2      2
# b two
#   key1 key2  data1  data2
# 3    b  two      3      3
 

'''
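Selecting one column or a set of columns returns a grouped Series (or DataFrame) object
'''
# Indexing a GroupBy object with a column name (or a list of names) selects what will be aggregated
print(df.groupby(df['key1'])['data1'])  # group by key1 and select data1; prints a SeriesGroupBy object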
 
 
print(df.groupby(['key1'])[['data1','data2']].mean())  # group by key1, mean of data1 and data2
#        data1     data2
# key1
# a     1.666667  1.666667
# b     2.500000  2.500000
 
print(df.groupby(['key1','key2'])['data1'].mean())
# key1  key2
# a     one     2
#       two     1
# b     one     2
#       two     3
 
 
'''
Grouping with a function
'''
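# To group by the length of the names, pass the len function directly
# (people is not defined in this article; the output implies rows a-e and column names of 2 or 3 characters)
 
print(people.groupby(len,axis=1).sum())  # a name of length 3 falls into group 3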
#       2     3
# a  30.0  20.0
# b  23.0  21.0
# c  26.0  22.0
# d  42.0  23.0
# e  46.0  24.0
 
# Functions can be mixed with arrays, dicts, lists and Series
key_list=['one','one','one','two','two']
print(people.groupby([len,key_list],axis=1).min())
#     2           3
#    one   two   two
# a  0.0  15.0  20.0
# b  1.0  16.0  21.0
# c  2.0  17.0  22.0
# d  3.0  18.0  23.0
# e  4.0  19.0  24.0
 
'''
Grouping by index level
'''
columns=pd.MultiIndex.from_arrays([['US',"US",'US','JP','JP'],[1,3,5,1,3]],names=['cty','tenor'])
hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns)
print(hier_df)
# cty          US                            JP
# tenor         1         3         5         1         3
# 0     -1.507729  2.112678  0.841736 -0.158109 -0.645219
# 1      0.355262  0.765209 -0.287648  1.134998 -0.440188
# 2      1.049813  0.763482 -0.362013 -0.428725 -0.355601
# 3     -0.868420 -1.213398 -0.386798  0.137273  0.678293
 
# Group by level
print(hier_df.groupby(level='cty',axis=1).count())
# cty  JP  US
# 0     2   3
# 1     2   3
# 2     2   3
# 3     2   3
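
Recent pandas releases deprecate the axis=1 argument of groupby. One way to get the same per-country column counts, sketched here under that assumption, is to transpose, group the rows by level, and transpose back. The values come from np.arange so that the example is deterministic; they are not the article's random data.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
hier_df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=columns)

# Transpose so the column levels become row levels, group, then transpose back.
counts = hier_df.T.groupby(level="cty").count().T
print(counts)
```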

(6) Advanced aggregation

1 Use pd.merge() to add the aggregated results to every row of df

d1={'item':['luobo','baicai','lajiao','donggua','luobo','baicai','lajiao','donggua'],
   'color':['white','white','red','green','white','white','red','green'],
   'weight':np.random.randint(10,size = 8),
   'price':np.random.randint(10,size = 8)}
df = DataFrame(d1)
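# Sum only the numeric columns per color and prefix the result columns
sums = df.groupby('color')[['price','weight']].sum().add_prefix('total_')

# A list of columns needs double brackets; a bare tuple is rejected by recent pandas
items = df.groupby('item')[['price','weight']].sum()

means = items['price']/items['weight']

means = means.to_frame('means_price')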

df2 = pd.merge(df,sums,left_on = 'color',right_index = True)

df3 = pd.merge(df2,means,left_on = 'item',right_index = True)
display(df2,df3)


# Output (df, sums, df2):
   color     item  price  weight
0  white    luobo      9       2
1  white   baicai      5       9
2    red   lajiao      5       8
3  green  donggua      1       1
4  white    luobo      7       4
5  white   baicai      8       0
6    red   lajiao      6       8
7  green  donggua      4       3
       total_price  total_weight
color
green            5             4
red             11            16
white           29            15
   color     item  price  weight  total_price  total_weight
0  white    luobo      9       2           29            15
1  white   baicai      5       9           29            15
4  white    luobo      7       4           29            15
5  white   baicai      8       0           29            15
2    red   lajiao      5       8           11            16
6    red   lajiao      6       8           11            16
3  green  donggua      1       1            5             4
7  green  donggua      4       3            5             4
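
Since the article's data is random, the merge step is easier to verify with fixed numbers. The sketch below uses a hypothetical four-row subset with hand-picked prices and weights; only the column names follow the example above.

```python
import pandas as pd

# Hypothetical fixed data, for illustration only.
df = pd.DataFrame({
    "item": ["luobo", "baicai", "lajiao", "luobo"],
    "color": ["white", "white", "red", "white"],
    "price": [9, 5, 5, 7],
    "weight": [2, 9, 8, 4],
})

# One row of totals per color, indexed by color.
sums = df.groupby("color")[["price", "weight"]].sum().add_prefix("total_")

# Attach each color's totals to every row of that color.
merged = pd.merge(df, sums, left_on="color", right_index=True)
print(merged)
```

left_on='color' matches the color column of df against the index of sums (right_index=True), which is why the totals repeat on every row of the same color.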

2 transform and apply can achieve the same result

Using transform

d1={'item':['luobo','baicai','lajiao','donggua','luobo','baicai','lajiao','donggua'],
   'color':['white','white','red','green','white','white','red','green'],
   'weight':np.random.randint(10,size = 8),
   'price':np.random.randint(10,size = 8)}
df = DataFrame(d1)
sum1 = df.groupby('color')[['price','weight']].sum().add_prefix("total_")
sums2 = df.groupby('color')[['price','weight']].transform(lambda x:x.sum()).add_prefix('total_')
sums3 = df.groupby('color')[['price','weight']].transform('sum').add_prefix('total_')
display(df,sum1,sums2,sums3)

# Output:
   color     item  price  weight
0  white    luobo      7       7
1  white   baicai      7       7
2    red   lajiao      2       7
3  green  donggua      6       6
4  white    luobo      1       2
5  white   baicai      3       6
6    red   lajiao      7       0
7  green  donggua      0       2
       total_price  total_weight
color
green            6             8
red              9             7
white           18            22
   total_price  total_weight
0           18            22
1           18            22
2            9             7
3            6             8
4           18            22
5           18            22
6            9             7
7            6             8
   total_price  total_weight
0           18            22
1           18            22
2            9             7
3            6             8
4           18            22
5           18            22
6            9             7
7            6             8

Using apply

def sum_price(x):
    return x.sum()
sums3 = df.groupby('color')[['price','weight']].apply(lambda x:x.sum()).add_prefix('total_')
sums4 = df.groupby('color')[['price','weight']].apply(sum_price).add_prefix('total_')
display(df,sums3,sums4)

# Output:
   color     item  price  weight
0  white    luobo      4       4
1  white   baicai      0       3
2    red   lajiao      0       4
3  green  donggua      7       5
4  white    luobo      3       1
5  white   baicai      3       3
6    red   lajiao      0       6
7  green  donggua      0       7
       total_price  total_weight
color
green            7            12
red              0            10
white           10            11
       total_price  total_weight
color
green            7            12
red              0            10
white           10            11
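
The practical difference between the two approaches is the shape of the result: an aggregation returns one row per group, while transform returns one value per original row. A minimal sketch with hypothetical fixed data:

```python
import pandas as pd

# Hypothetical fixed data, for illustration only.
df = pd.DataFrame({"color": ["white", "white", "red", "green"],
                   "price": [7, 1, 2, 6]})

per_group = df.groupby("color")["price"].sum()           # one value per group
per_row = df.groupby("color")["price"].transform("sum")  # one value per original row

# transform keeps the original index, so the result can be assigned directly.
df["total_price"] = per_row
print(per_group)
print(df)
```

Because transform preserves the original index, it can replace the groupby-then-merge pattern when the goal is only to attach group totals to each row.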