Data preparation
import pandas as pd
import numpy as np

# Use the user's name as the index.
index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
# Missing entries are written as np.nan or None; Alice's city is a blank string.
data = {
    "age": [18, 30, np.nan, 40, np.nan, 30],
    "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen", np.nan, " "],
    "sex": [None, "male", "female", "male", np.nan, "unknown"],
    "birth": ["2000-02-10", "1988-10-17", None, "1978-08-08", np.nan, "1988-10-17"]
}
user_info = pd.DataFrame(data=data, index=index)
user_info
Out[181]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female        None
James  40.0   ShenZhen     male  1978-08-08
Andy    NaN        NaN      NaN         NaN
Alice  30.0             unknown  1988-10-17
Convert the birth date column to a datetime type
user_info.birth = pd.to_datetime(user_info.birth)
user_info
Out[182]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female         NaT
James  40.0   ShenZhen     male  1978-08-08
Andy    NaN        NaN      NaN         NaT
Alice  30.0             unknown  1988-10-17
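Real data often contains strings that are not dates at all. As a minimal sketch (the `raw` Series below is made-up data, not part of `user_info`), passing `errors="coerce"` to `pd.to_datetime` turns unparseable entries into NaT instead of raising, so they end up as ordinary missing values:

```python
import pandas as pd

# Hypothetical input: one valid date, one junk string, one missing entry.
raw = pd.Series(["2000-02-10", "not a date", None])

# errors="coerce" converts anything that cannot be parsed into NaT.
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)
```

Without `errors="coerce"`, the junk string would raise an exception, which may be what you want when bad input should stop the pipeline.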
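As you can see, Tom's sex is None, while Mary's age is NaN and her birth date is NaT. pandas treats all of these as missing values, and they can be detected with the isnull() or notnull() methods (isna() and notna() are equivalent names).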
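A quick way to see how much is missing is to sum the boolean mask. The sketch below uses a small stand-in frame (hypothetical data, smaller than `user_info`) so that it runs on its own; the names `missing_per_column` and `rows_with_gaps` are just illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical data).
df = pd.DataFrame(
    {
        "age": [18, np.nan, 40],
        "sex": [None, "female", "male"],
        "birth": ["2000-02-10", None, "1978-08-08"],
    },
    index=pd.Index(["Tom", "Mary", "James"], name="name"),
)
df["birth"] = pd.to_datetime(df["birth"])

# isna() flags NaN, None and NaT alike; summing the booleans counts them.
missing_per_column = df.isna().sum()

# True for every row that has at least one missing field.
rows_with_gaps = df.isna().any(axis=1)

print(missing_per_column)
print(rows_with_gaps)
```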
1. Detecting missing values
user_info.isna()
Out[183]:
         age   city    sex  birth
name
Tom    False  False   True  False
Bob    False  False  False  False
Mary    True  False  False   True
James  False  False  False  False
Andy    True   True   True   True
Alice  False  False  False  False

user_info.isnull()
Out[184]:
         age   city    sex  birth
name
Tom    False  False   True  False
Bob    False  False  False  False
Mary    True  False  False   True
James  False  False  False  False
Andy    True   True   True   True
Alice  False  False  False  False
2. Filtering out users whose age is missing
user_info[user_info.age.notnull()]
Out[185]:
        age      city      sex       birth
name
Tom    18.0   BeiJing     None  2000-02-10
Bob    30.0  ShangHai     male  1988-10-17
James  40.0  ShenZhen     male  1978-08-08
Alice  30.0            unknown  1988-10-17
Calling dropna on a Series is straightforward. For a DataFrame, more parameters are available; the signature is:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# Drop missing values from a Series
user_info.age.dropna()
Out[187]:
name
Tom      18.0
Bob      30.0
James    40.0
Alice    30.0
Name: age, dtype: float64

# Drop a row if any of its fields is missing
user_info.dropna(axis=0, how='any')
Out[188]:
        age      city      sex       birth
name
Bob    30.0  ShangHai     male  1988-10-17
James  40.0  ShenZhen     male  1978-08-08
Alice  30.0            unknown  1988-10-17

# Drop a row only if all of its fields are missing
user_info.dropna(axis=0, how='all')
Out[189]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female         NaT
James  40.0   ShenZhen     male  1978-08-08
Alice  30.0             unknown  1988-10-17

# Drop a row if either city or sex is missing
user_info.dropna(axis=0, how="any", subset=["city", "sex"])
Out[190]:
        age       city      sex       birth
name
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female         NaT
James  40.0   ShenZhen     male  1978-08-08
Alice  30.0             unknown  1988-10-17
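The `thresh` parameter is not exercised above. As a rough sketch on a cut-down, hypothetical frame, `thresh=n` keeps only the rows that have at least `n` non-missing values. As far as I know, recent pandas releases raise an error when `how` and `thresh` are passed together, so only one of them is used here.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for user_info (hypothetical data).
df = pd.DataFrame(
    {
        "age": [18, np.nan, np.nan, 30],
        "city": ["BeiJing", "GuangZhou", np.nan, " "],
        "sex": [None, "female", np.nan, "unknown"],
    },
    index=pd.Index(["Tom", "Mary", "Andy", "Alice"], name="name"),
)

# Keep rows with at least 2 non-missing values: only Andy (0 values) is dropped.
at_least_two = df.dropna(thresh=2)

# Keep rows with at least 3 non-missing values: only Alice qualifies,
# because a blank string and "unknown" still count as present.
at_least_three = df.dropna(thresh=3)

print(at_least_two)
print(at_least_three)
```

Note that Alice survives even the strict threshold, since pandas does not regard `" "` or `"unknown"` as missing.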
Besides dropping missing values, you can also fill them in, most commonly with fillna.
A common approach is to fill with a scalar. For example, here every missing age is filled with 0.
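# Fill every missing age with the scalar 0
user_info.age.fillna(0)
Out[191]:
name
Tom      18.0
Bob      30.0
Mary      0.0
James    40.0
Andy      0.0
Alice    30.0
Name: age, dtype: float64

# Forward fill: copy the previous valid value into each gap
# (older code writes fillna(method="ffill"), deprecated since pandas 2.1)
user_info.age.ffill()
Out[192]:
name
Tom      18.0
Bob      30.0
Mary     30.0
James    40.0
Andy     40.0
Alice    30.0
Name: age, dtype: float64

# Backward fill: copy the next valid value into each gap
# (older code writes fillna(method="backfill"))
user_info.age.bfill()
Out[193]:
name
Tom      18.0
Bob      30.0
Mary     40.0
James    40.0
Andy     30.0
Alice    30.0
Name: age, dtype: float64

# Linear interpolation between the neighbouring valid values
user_info.age.interpolate()
Out[194]:
name
Tom      18.0
Bob      30.0
Mary     35.0
James    40.0
Andy     35.0
Alice    30.0
Name: age, dtype: float64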
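A single scalar rarely suits every column. One option, sketched here on a hypothetical two-column frame, is to pass `fillna` a dict that maps each column name to its own fill value. The choice of the column mean for `age` and the placeholder string `"missing"` for `city` are assumptions made for the example, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column.
df = pd.DataFrame(
    {
        "age": [18, 30, np.nan, 40],
        "city": ["BeiJing", np.nan, "GuangZhou", "ShenZhen"],
    }
)

# Each column gets its own fill value: the mean for age, a label for city.
filled = df.fillna({"age": df["age"].mean(), "city": "missing"})

print(filled)
```

Columns that are not listed in the dict are left untouched.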
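Some values are not NaN but should still count as missing. Suppose our stored user information is meant to cover young people only; an age of 40 can then be treated as an outlier. Similarly, sex is recorded as male or female, and users whose sex is not known were recorded as "unknown", so "unknown" can clearly be treated as a missing value too. Blank strings also show up occasionally and can be handled the same way. In all of these cases, the replace method can turn such values into proper missing values.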
# Replace a single value
user_info.age.replace(40, np.nan)
Out[195]:
name
Tom      18.0
Bob      30.0
Mary      NaN
James     NaN
Andy      NaN
Alice    30.0
Name: age, dtype: float64

# Same result, specified as a mapping dict
user_info.age.replace({40: np.nan})
Out[196]:
name
Tom      18.0
Bob      30.0
Mary      NaN
James     NaN
Andy      NaN
Alice    30.0
Name: age, dtype: float64

# Per-column values to replace on a DataFrame
user_info.replace({"age": 40, "birth": pd.Timestamp("1978-08-08")}, np.nan)
Out[197]:
        age       city      sex       birth
name
Tom    18.0    BeiJing     None  2000-02-10
Bob    30.0   ShangHai     male  1988-10-17
Mary    NaN  GuangZhou   female         NaT
James   NaN   ShenZhen     male         NaT
Andy    NaN        NaN      NaN         NaT
Alice  30.0             unknown  1988-10-17

# Treat "unknown" as missing
user_info.sex.replace("unknown", np.nan)
Out[198]:
name
Tom        None
Bob        male
Mary     female
James      male
Andy        NaN
Alice       NaN
Name: sex, dtype: object

# Treat whitespace strings as missing, using a regular expression
user_info.city.replace(r'\s+', np.nan, regex=True)
Out[199]:
name
Tom        BeiJing
Bob       ShangHai
Mary     GuangZhou
James     ShenZhen
Andy           NaN
Alice          NaN
Name: city, dtype: object
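An unanchored pattern such as `\s+` matches any string that contains whitespace, so it can also affect values like `"New York"`. A more defensive sketch, on a made-up Series, anchors the pattern so that only strings consisting entirely of whitespace (or empty strings) are replaced:

```python
import numpy as np
import pandas as pd

# Hypothetical city column: blank, empty, multi-word and missing entries.
city = pd.Series(["BeiJing", " ", "", "New York", np.nan], name="city")

# ^\s*$ matches only strings that are empty or contain nothing but whitespace.
cleaned = city.replace(r"^\s*$", np.nan, regex=True)

print(cleaned)
```

With the anchors in place, `"New York"` is kept as it is, while `" "` and `""` become NaN.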
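Besides manually dropping, filling, and replacing missing values, we can also fill them from another object.
For example, given two Series of user ages, one with missing values and one without, the complete Series can supply the elements that the other one lacks.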
# Build a complete copy of the age column (gaps filled with 20)
age_new = user_info.age.copy()
age_new.fillna(20, inplace=True)
age_new
Out[200]:
name
Tom      18.0
Bob      30.0
Mary     20.0
James    40.0
Andy     20.0
Alice    30.0
Name: age, dtype: float64

# Take values from user_info.age; where they are missing, fall back to age_new
user_info.age.combine_first(age_new)
Out[201]:
name
Tom      18.0
Bob      30.0
Mary     20.0
James    40.0
Andy     20.0
Alice    30.0
Name: age, dtype: float64
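`combine_first` matches values by index label rather than by position, so the two objects do not need the same shape. The sketch below uses two made-up Series (`primary` and `backup` are hypothetical names) whose indexes only partly overlap:

```python
import numpy as np
import pandas as pd

# Hypothetical data: primary has a gap for Mary and no entry for Andy.
primary = pd.Series([18.0, np.nan, 40.0], index=["Tom", "Mary", "James"], name="age")
backup = pd.Series([25.0, 33.0], index=["Mary", "Andy"], name="age")

# Values from primary win; backup only fills gaps and adds missing labels.
merged = primary.combine_first(backup)

print(merged)
```

The result covers the union of both indexes: Mary's gap is filled from `backup`, Andy is added, and Tom and James keep their original values.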