Data Analysis: Handling NaN Values

Date: 2019-12-09


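Goals

　　1. Find NaN values (locate the column and the index position within that column)

　　2. Fill NaN values (backward fill, forward fill, linear interpolation, and so on)

　　3. Filter out NaN values

Build a simple DataFrame to work with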

import pandas as pd
import numpy as np

# In a DataFrame, both np.nan and None are stored as NaN
df=pd.DataFrame({'a':[np.nan,1,2,3],'b':[None,5,6,7],'c':[8,9,10,11]})
print(df)

'''Result
     a    b   c
0  NaN  NaN   8
1  1.0  5.0   9
2  2.0  6.0  10
3  3.0  7.0  11

'''
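A quick way to see what happened to the missing entries is to inspect the column dtypes. The sketch below rebuilds the same DataFrame (the variable name `dtypes` is only for illustration) and shows that the columns holding NaN are stored as floats, while the complete column keeps an integer dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, 2, 3], 'b': [None, 5, 6, 7], 'c': [8, 9, 10, 11]})

# NaN is a float value, so columns a and b are upcast to a float dtype;
# column c has no missing values and keeps an integer dtype
dtypes = df.dtypes
print(dtypes)
```

This is why the values in columns a and b print as 1.0, 5.0 and so on, while column c prints plain integers.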

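Notes:

　　1. When a DataFrame is built, both None and np.nan are recognized as NaN

　　2. None and np.nan have different types

# np.nan is a special float value
print(type(np.nan))  #<class 'float'>

# None is NoneType
print(type(None))   #<class 'NoneType'>

# Print a missing value to see how it is displayed
print(df['a'][0])  #nan

# As dictionary keys, they are treated as different keys
dic={None:1, np.nan:2}
print(dic)  #{None: 1, nan: 2}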
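One practical consequence of the type difference is how the two values behave under comparison. The following is a minimal sketch (the variable names are illustrative) showing why `==` is not a reliable NaN test and why `pd.isna` is usually the safer check.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, not even to itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)   # False

# None is an ordinary singleton object, so the usual identity check works
print(None is None)        # True

# pd.isna treats both values as missing
nan_is_missing = pd.isna(np.nan)
none_is_missing = pd.isna(None)
print(nan_is_missing, none_is_missing)   # True True
```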

Goal 1: Find NaN values

Original data:

df=pd.DataFrame({'a':[0,1,np.nan,3,4],'b':[4,5,6,np.nan,8],'c':[9,10,np.nan,np.nan,13]})
print(df)

'''Result
     a    b     c
0  0.0  4.0   9.0
1  1.0  5.0  10.0
2  NaN  6.0   NaN
3  3.0  NaN   NaN
4  4.0  8.0  13.0

'''

1.1 Print the exact position of each NaN (using isnan from the math module)

from math import isnan

for i in df.columns:
    # print(df[i].values)
    for k in range(len(df[i].values)):
        if isnan(df[i].values[k]):
            print('Column %s contains NaN; index position: %s' % (i, k))
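The same lookup can be written without `math.isnan` by building a boolean mask first. This is only a sketch of an alternative; `mask` and `nan_positions` are names made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, 3, 4],
                   'b': [4, 5, 6, np.nan, 8],
                   'c': [9, 10, np.nan, np.nan, 13]})

# True wherever a value is missing
mask = df.isna()

# Collect (column name, index label) pairs, column by column
nan_positions = [(col, idx) for col in df.columns for idx in df.index[mask[col]]]
print(nan_positions)
```

Because the mask works on index labels rather than integer offsets, the same code also applies when the DataFrame has a non-numeric index.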

1.2 Find the column names and row labels of the NaN values: np.isnan()

import pandas as pd
import numpy as np
from math import isnan

# In a DataFrame, both np.nan and None are stored as NaN
df=pd.DataFrame({'a':[np.nan,1,2,3,4],'b':[4,5,6,np.nan,np.nan],'c':[9,10,np.nan,np.nan,np.nan]},index=['index_0','index_1','index_2','index_3','index_4'])
print(df)
'''Original data
           a    b     c
index_0  NaN  4.0   9.0
index_1  1.0  5.0  10.0
index_2  2.0  6.0   NaN
index_3  3.0  NaN   NaN
index_4  4.0  NaN   NaN
'''
# Use np.isnan(df) to test every value in the DataFrame for NaN
res=np.isnan(df)
print(res)
'''
             a      b      c
index_0   True  False  False
index_1  False  False  False
index_2  False  False   True
index_3  False   True   True
index_4  False   True   True

'''

nan_res=np.where(np.isnan(df))
print(nan_res)  #(array([0, 2, 3, 3, 4, 4], dtype=int32), array([0, 2, 1, 2, 1, 2], dtype=int32))
'''
# The result is a tuple
(array([0, 2, 3, 3, 4, 4], dtype=int32), array([0, 2, 1, 2, 1, 2], dtype=int32))
First element of the tuple: row positions (as integers)
Second element of the tuple: column positions (as integers)
Note: the order of the paired row and column positions shows that the data is scanned row by row
(the dtype shown may differ from int32 depending on the platform)
'''

# Integer row positions of the NaN values (a repeated value means that row contains several NaN values)
print(nan_res[0]) #[0 2 3 3 4 4]
print(nan_res[1]) #[0 2 1 2 1 2]

# Row labels of the NaN values (the actual index names)
nan_index_info=df.index[np.where(np.isnan(df))[0]]
print(nan_index_info) #Index(['index_0', 'index_2', 'index_3', 'index_3', 'index_4', 'index_4'], dtype='object')

# Column names of the NaN values
nan_columns_info=df.columns[np.where(np.isnan(df))[1]]
print(nan_columns_info) #Index(['a', 'c', 'b', 'c', 'b', 'c'], dtype='object')
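The row labels and column names come back as two separate arrays, so it can help to zip them into (row, column) pairs. A minimal sketch, assuming the same labelled DataFrame as above; `rows`, `cols` and `pairs` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, 2, 3, 4],
                   'b': [4, 5, 6, np.nan, np.nan],
                   'c': [9, 10, np.nan, np.nan, np.nan]},
                  index=['index_0', 'index_1', 'index_2', 'index_3', 'index_4'])

# Integer positions of every NaN, scanned row by row
rows, cols = np.where(df.isna().to_numpy())

# Translate positions into labels and pair them up
pairs = list(zip(df.index[rows], df.columns[cols]))
print(pairs)
```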

1.3 isnull, notnull

# df here is the 5-row original data from the start of Goal 1
isnull_res=df.isnull()
print(isnull_res)
'''
       a      b      c
0  False  False  False
1  False  False  False
2   True  False   True
3  False   True   True
4  False  False  False

'''

1.4 isnull().values

# 1.4 isnull().values
isnull_res=df.isnull().values
print(isnull_res)
'''
[[False False False]
 [False False False]
 [ True False  True]
 [False  True  True]
 [False False False]]
 
'''

1.5 isnull().sum(): count the NaN values in each column

# 1.5 isnull().sum() count
isnull_res=df.isnull().sum()
print(isnull_res)
'''
a    1
b    1
c    2
dtype: int64

'''
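The per-column counts can be reduced further to a grand total or to a fraction of missing values. The sketch below uses the original data from the start of Goal 1; `per_column`, `total` and `fraction` are names chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, 3, 4],
                   'b': [4, 5, 6, np.nan, 8],
                   'c': [9, 10, np.nan, np.nan, 13]})

# NaN count per column
per_column = df.isnull().sum()

# Summing the per-column counts gives the total for the whole DataFrame
total = int(per_column.sum())
print(total)

# The mean of a boolean mask is the fraction of True values, i.e. the share of NaN per column
fraction = df.isnull().mean()
print(fraction)
```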

1.6 isnull().any(): boolean result showing whether each column contains any NaN

df=pd.DataFrame({'a':[0,1,np.nan,3,4],'b':[4,5,6,np.nan,8],'c':[9,10,np.nan,np.nan,13]})
print(df)
'''Original data
     a    b     c
0  0.0  4.0   9.0
1  1.0  5.0  10.0
2  NaN  6.0   NaN
3  3.0  NaN   NaN
4  4.0  8.0  13.0

'''

any_res=df.isnull().any()
print(any_res)
'''
a    True
b    True
c    True
dtype: bool

'''
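By default `any()` reduces down each column. To ask the same question per row, `axis=1` can be passed instead. A short sketch on the same data; `row_has_nan` and `rows_with_nan` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, 3, 4],
                   'b': [4, 5, 6, np.nan, 8],
                   'c': [9, 10, np.nan, np.nan, 13]})

# axis=1 checks across the columns of each row
row_has_nan = df.isnull().any(axis=1)
print(row_has_nan)

# The boolean Series can be used directly to select the affected rows
rows_with_nan = df[row_has_nan]
print(rows_with_nan)
```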

Goal 2: Fill NaN values

What the original data looks like:

df=pd.DataFrame({'a':[0,1,np.nan,3,4],'b':[4,5,6,np.nan,8],'c':[9,10,np.nan,np.nan,13]})
print(df)

'''Result
     a    b     c
0  0.0  4.0   9.0
1  1.0  5.0  10.0
2  NaN  6.0   NaN
3  3.0  NaN   NaN
4  4.0  8.0  13.0

'''

2.1 Fill directly with fillna

# 2.1 Fill directly with fillna
df.fillna(0,inplace=True) # inplace=True writes the change back to the original DataFrame
print(df)
'''
     a    b     c
0  0.0  4.0   9.0
1  1.0  5.0  10.0
2  0.0  6.0   0.0
3  3.0  0.0   0.0
4  4.0  8.0  13.0

'''

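# Another way: keep values where notna() is True and use 100 everywhere else
# (apply this to data that still contains NaN; df above was already filled in place)
res=df.where(df.notna(),100)

fillna({dict}): each key is a column label and each value is the fill value for that column

df=pd.DataFrame({'a':[0,1,2,3,4],'b':[4,5,6,np.nan,8],'c':[9,10,np.nan,np.nan,13]})

# Fill NaN in column b with 0 and NaN in column c with 11
df_new=df.fillna({'b':0,'c':11})
print(df_new)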
'''
   a    b     c
0  0  4.0   9.0
1  1  5.0  10.0
2  2  6.0  11.0
3  3  0.0  11.0
4  4  8.0  13.0

'''
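Without `inplace=True`, `fillna` leaves the original DataFrame untouched and returns a filled copy. The sketch below checks this on the original data; `filled`, `nan_before` and `nan_after` are names made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, 3, 4],
                   'b': [4, 5, 6, np.nan, 8],
                   'c': [9, 10, np.nan, np.nan, 13]})

# fillna returns a new DataFrame; df itself is not modified
filled = df.fillna(0)

nan_before = int(df.isnull().sum().sum())
nan_after = int(filled.isnull().sum().sum())
print(nan_before, nan_after)
```

Keeping the original around this way makes it easy to compare several filling strategies on the same data.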

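2.2 Backward fill (use the next valid value after the NaN): bfill

# df here is the original data from the start of Goal 2
# Newer pandas versions deprecate the method= argument in favor of df.bfill()
fillna_res=df.fillna(method='bfill')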
print(fillna_res)
'''
     a    b     c
0  0.0  4.0   9.0
1  1.0  5.0  10.0
2  3.0  6.0  13.0
3  3.0  8.0  13.0
4  4.0  8.0  13.0

'''

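2.3 Forward fill (use the last valid value before the NaN): ffill

# Newer pandas versions deprecate the method= argument in favor of df.ffill()
fillna_res=df.fillna(method='ffill')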
print(fillna_res)
'''
     a    b     c
0  0.0  4.0   9.0
1  1.0  5.0  10.0
2  1.0  6.0  10.0
3  3.0  6.0  10.0
4  4.0  8.0  13.0

'''

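2.4 Replace NaN with the preceding value: pad (behaves the same as forward fill, ffill)

# 'pad' is an alias of 'ffill', so the result matches section 2.3
fillna_res=df.fillna(method='pad')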
print(fillna_res)
'''
     a    b     c
0  0.0  4.0   9.0
1  1.0  5.0  10.0
2  1.0  6.0  10.0
3  3.0  6.0  10.0
4  4.0  8.0  13.0


'''
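In newer pandas releases, passing `method=` to `fillna` is deprecated, and the dedicated `ffill()` and `bfill()` methods are the suggested replacement. The following sketch reproduces the forward and backward fills on the original data; `forward` and `backward` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, 3, 4],
                   'b': [4, 5, 6, np.nan, 8],
                   'c': [9, 10, np.nan, np.nan, 13]})

# Forward fill: each NaN takes the last valid value above it
forward = df.ffill()
print(forward)

# Backward fill: each NaN takes the next valid value below it
backward = df.bfill()
print(backward)
```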

2.5 Replace with the column mean: df.mean()

# Replace NaN with the column mean: df.mean()
fillna_res=df.fillna(df.mean())
print(fillna_res)
'''
     a     b          c
0  0.0  4.00   9.000000
1  1.0  5.00  10.000000
2  2.0  6.00  10.666667
3  3.0  5.75  10.666667
4  4.0  8.00  13.000000

'''
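The fill values above come from `df.mean()`, which skips NaN when averaging. As a quick check of the arithmetic, the sketch below compares the pandas result with a hand calculation; `col_means`, `manual_b` and `manual_c` are names made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, 3, 4],
                   'b': [4, 5, 6, np.nan, 8],
                   'c': [9, 10, np.nan, np.nan, 13]})

# mean() ignores NaN by default, so only the valid values are averaged
col_means = df.mean()
print(col_means)

# Hand calculation for columns b and c using only the non-NaN entries
manual_b = (4 + 5 + 6 + 8) / 4
manual_c = (9 + 10 + 13) / 3
print(manual_b, manual_c)
```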

2.6 Replace NaN only in specific columns

# Choose which columns get their NaN replaced (here columns a through b)
choose_fillna=df.fillna(df.mean()['a':'b'])
print(choose_fillna)
'''
     a     b     c
0  0.0  4.00   9.0
1  1.0  5.00  10.0
2  2.0  6.00   NaN
3  3.0  5.75   NaN
4  4.0  8.00  13.0

'''

2.7 Limit the number of NaN values replaced per column: limit=

limit_res=df.fillna(df.mean(),limit=1)
print(limit_res)
'''
     a     b          c
0  0.0  4.00   9.000000
1  1.0  5.00  10.000000
2  2.0  6.00  10.666667
3  3.0  5.75        NaN
4  4.0  8.00  13.000000

'''

2.8 Linear interpolation: df.interpolate(). If the first value of a column is NaN, that NaN cannot be interpolated and has to be handled separately.

df=pd.DataFrame({'a':[2,1,2,3,4],'b':[4,5,6,np.nan,2],'c':[9,10,np.nan,np.nan,4]},index=['index_0','index_1','index_2','index_3','index_4'])
print(df)
'''
         a    b     c
index_0  2  4.0   9.0
index_1  1  5.0  10.0
index_2  2  6.0   NaN
index_3  3  NaN   NaN
index_4  4  2.0   4.0

'''
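# Each NaN is placed on a straight line between the nearest valid values before and after it
# (for a single gap this is their average), because interpolate() assumes a linear function by default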
print(df.interpolate())
'''
         a    b     c
index_0  2  4.0   9.0
index_1  1  5.0  10.0
index_2  2  6.0   8.0
index_3  3  4.0   6.0
index_4  4  2.0   4.0

'''
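To make the linear assumption concrete, the sketch below recomputes column c by hand and then shows the leading-NaN case mentioned in the heading. The Series names `c` and `leading` and the helper variable `step` are made up for this example.

```python
import numpy as np
import pandas as pd

# Column c from the example: two NaN values between 10 and 4
c = pd.Series([9, 10, np.nan, np.nan, 4])

# Going from 10 to 4 takes three equal steps, so each step is (4 - 10) / 3 = -2
step = (4 - 10) / 3
manual = [10 + step * 1, 10 + step * 2]
print(manual)

c_filled = c.interpolate().tolist()
print(c_filled)

# A NaN at the very start has no earlier point to draw a line from,
# so the default interpolate() call leaves it as NaN
leading = pd.Series([np.nan, 1, np.nan, 3])
leading_filled = leading.interpolate().tolist()
print(leading_filled)
```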

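Filling in special cases:

　　Key point: suppose the first or last value of a column is NaN (the ordinary methods above can then leave the first or last value unfilled)

Approach:

# The problem being solved is the case where the start and the end of a column are NaN
1. If the first value is NaN, fill it with the nearest valid value below it
2. If the last value is NaN, fill it with the nearest valid value above it
3. Finally, replace the remaining NaN values in the middle, which can now be filled backward or forward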

Initial data:

df=pd.DataFrame({'a':[np.nan,1,np.nan,3,4],'b':[None,5,np.nan,7,None],'c':[9,10,np.nan,12,13]})
print(df)
'''
     a    b     c
0  NaN  NaN   9.0
1  1.0  5.0  10.0
2  NaN  NaN   NaN
3  3.0  7.0  12.0
4  4.0  NaN  13.0

'''

Code implementing the 3 steps:

from math import isnan

# Loop over the values of each column
for i in df.columns:
    values = df[i].to_numpy(copy=True)
    last = len(values) - 1

    # If the first value is NaN, fill it with the nearest non-NaN value below it
    if isnan(values[0]):
        for k in range(1, len(values)):
            if not isnan(values[k]):
                values[0] = values[k]
                break

    # If the last value is NaN, search upward for the nearest non-NaN value and use it
    if isnan(values[last]):
        for k in range(last - 1, -1, -1):
            if not isnan(values[k]):
                values[last] = values[k]
                break

    # Write the repaired values back to the column
    df[i] = values

# Finally, forward-fill the NaN values left in the middle of each column
# (same result as fillna(method='pad'))
df = df.ffill()

print(df)
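The three steps can also be expressed with two chained pandas calls instead of an explicit loop. This is a sketch of an alternative, not the loop from the article: a forward fill handles the middle and trailing NaN values, and a backward fill afterwards handles any NaN left at the start. The name `result` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, np.nan, 3, 4],
                   'b': [None, 5, np.nan, 7, None],
                   'c': [9, 10, np.nan, 12, 13]})

# ffill: middle and trailing NaN take the value above them
# bfill: any NaN still left at the top takes the first valid value below it
result = df.ffill().bfill()
print(result)
```

For this data the chained call gives the same values as the step-by-step version, since every interior NaN is filled from above in both.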

Goal 3: Filter out NaN values

Initial data

df=pd.DataFrame({'a':[0,1,2,3,4],'b':[4,5,6,np.nan,8],'c':[9,10,np.nan,np.nan,13]})
print(df)
'''
   a    b     c
0  0  4.0   9.0
1  1  5.0  10.0
2  2  6.0   NaN
3  3  NaN   NaN
4  4  8.0  13.0

'''

3.1 Drop rows or columns that contain NaN: dropna()

# By default, rows containing NaN are dropped:
res=df.dropna()
print(res)
'''
   a    b     c
0  0  4.0   9.0
1  1  5.0  10.0
4  4  8.0  13.0

'''

3.2 Drop columns that contain NaN: axis=1

res=df.dropna(axis=1) # drop the columns that contain NaN
print(res)
'''
   a
0  0
1  1
2  2
3  3
4  4

'''

3.3 Drop rows that are entirely NaN: how='all'

# Drop rows in which every value is NaN
all_res=df.dropna(how='all')
print(all_res)

3.4 Drop columns that are entirely NaN (axis=1,how='all')

# Drop columns in which every value is NaN (axis=1,how='all')
df=pd.DataFrame({'a':[0,1,2,np.nan,4],'b':[4,5,6,np.nan,8],'c':[np.nan,np.nan,np.nan,np.nan,np.nan]})
print(df)
'''
     a    b   c
0  0.0  4.0 NaN
1  1.0  5.0 NaN
2  2.0  6.0 NaN
3  NaN  NaN NaN
4  4.0  8.0 NaN
'''
all_loc_res=df.dropna(axis=1,how='all')
print(all_loc_res)
'''
     a    b
0  0.0  4.0
1  1.0  5.0
2  2.0  6.0
3  NaN  NaN
4  4.0  8.0

'''

3.5 dropna(thresh=...): apply a further condition when filtering out NaN

# In a DataFrame, both np.nan and None are stored as NaN
df=pd.DataFrame({'a':[np.nan,1,2,3,4],'b':[4,5,6,np.nan,np.nan],'c':[9,10,np.nan,np.nan,np.nan]},index=['index_0','index_1','index_2','index_3','index_4'])
print(df)
'''Original data
           a    b     c
index_0  NaN  4.0   9.0
index_1  1.0  5.0  10.0
index_2  2.0  6.0   NaN
index_3  3.0  NaN   NaN
index_4  4.0  NaN   NaN

'''

# Keep only the rows that have at least 2 non-NaN values: dropna(thresh=2)
res=df.dropna(thresh=2)
print(res)
'''
           a    b     c
index_0  NaN  4.0   9.0
index_1  1.0  5.0  10.0
index_2  2.0  6.0   NaN

'''

# Keep only the columns that have at least 3 non-NaN values
res=df.dropna(thresh=3,axis=1)
print(res)
'''
           a    b
index_0  NaN  4.0
index_1  1.0  5.0
index_2  2.0  6.0
index_3  3.0  NaN
index_4  4.0  NaN

'''
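To see why `thresh` keeps or drops a given row, it helps to count the non-NaN values directly. The sketch below does that for the same data; `valid_per_row`, `valid_per_col`, `kept_rows` and `kept_cols` are names made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, 2, 3, 4],
                   'b': [4, 5, 6, np.nan, np.nan],
                   'c': [9, 10, np.nan, np.nan, np.nan]},
                  index=['index_0', 'index_1', 'index_2', 'index_3', 'index_4'])

# Number of non-NaN values in each row and in each column
valid_per_row = df.notna().sum(axis=1)
valid_per_col = df.notna().sum(axis=0)
print(valid_per_row)
print(valid_per_col)

# thresh=2 keeps rows with at least 2 valid values
kept_rows = df.dropna(thresh=2).index.tolist()

# thresh=3 with axis=1 keeps columns with at least 3 valid values
kept_cols = df.dropna(thresh=3, axis=1).columns.tolist()
print(kept_rows, kept_cols)
```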