Pandas Series (5): Working with Categorical Data

Date: 2019-12-04

Table of Contents

1. Creating Objects
2. Common Operations
3. Memory Usage Pitfalls

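1. Creating Objects

1. Basic concept: categorical data is data whose values come from a limited, fixed set of possible values, for example sex or blood type.
2. Creating categorical data: taking blood type as the example, suppose each user has one of the blood types below. How do we create a categorical object for blood type?

Method 1: explicitly specify dtype="category"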

import numpy as np
import pandas as pd

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type", dtype="category")
user_info
Out[6]: 
name
Tom        A
Bob       AB
Mary     NaN
James     AB
Andy       O
Alice      B
Name: blood_type, dtype: category
Categories (4, object): [A, AB, B, O]

Method 2: use pd.Categorical to build the categorical data.

pd.Categorical(["A", "AB", np.nan, "AB", "O", "B"])
Out[7]: 
[A, AB, NaN, AB, O, B]
Categories (4, object): [A, AB, B, O]

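3. Specify all possible category values yourself.

Suppose we decide that only three blood types exist: A, B, and AB. We can then do the following; any value outside the declared categories (here O) becomes NaN.

# Declare every allowed value of the categorical data explicitly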
pd.Categorical(["A", "AB", np.nan, "AB", "O", "B"], categories=["A", "B", "AB"])
Out[8]: 
[A, AB, NaN, AB, NaN, B]
Categories (3, object): [A, B, AB]
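To make the NaN in the output above less mysterious, here is a minimal sketch of what is stored internally, assuming the same input as the example. Each value is kept as an integer code that indexes into the category list, and values without a matching category get the code -1. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Only "A", "B" and "AB" are declared as valid categories.
blood = pd.Categorical(["A", "AB", np.nan, "AB", "O", "B"], categories=["A", "B", "AB"])

# Each value is stored as an integer code indexing into `categories`;
# missing values and values outside the declared categories get code -1.
codes = blood.codes.tolist()
n_missing = int(blood.isna().sum())

print(codes)      # [0, 2, -1, 2, -1, 1]
print(n_missing)  # 2
```

Note that "O" is silently turned into a missing value, so it is worth checking the declared categories against the data before converting.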

4. Convert an existing Series to categorical data with astype

# Convert an existing Series to categorical data
user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type")
user_info = user_info.astype("category")
user_info
Out[9]: 
name
Tom        A
Bob       AB
Mary     NaN
James     AB
Andy       O
Alice      B
Name: blood_type, dtype: category
Categories (4, object): [A, AB, B, O]
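astype("category") lets pandas infer the categories from the data. If you would rather declare them yourself during the conversion, one option is to pass a CategoricalDtype to astype. The sketch below assumes the same blood type data as above; the category order chosen here is arbitrary.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

index = pd.Index(["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
raw = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type")

# Declare the categories up front instead of letting pandas infer them.
blood_dtype = CategoricalDtype(categories=["A", "B", "AB", "O"], ordered=False)
typed = raw.astype(blood_dtype)

# The categories keep the declared order rather than the sorted order.
categories = typed.cat.categories.tolist()
print(categories)  # ['A', 'B', 'AB', 'O']
```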

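5. In addition, some other functions also return categorical data, for example cut and qcut. For details, see the discretization section of the article on basic Pandas functionality ("Pandas基本功能詳解").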
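As a rough illustration of cut and qcut returning categorical data, the sketch below bins a small, made-up age column. The bin edges and labels are assumptions chosen for the example.

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 45, 70], name="age")  # made-up data

# pd.cut bins by explicit edges; intervals are right-closed by default,
# so 17 falls into (0, 18].
age_group = pd.cut(ages, bins=[0, 18, 60, 100], labels=["minor", "adult", "senior"])

# pd.qcut bins by quantiles; with q=2 the split point is the median (30 here).
age_half = pd.qcut(ages, q=2, labels=["younger", "older"])

print(age_group.dtype)    # category
print(age_group.tolist()) # ['minor', 'minor', 'adult', 'adult', 'senior']
print(age_half.tolist())  # ['younger', 'younger', 'younger', 'older', 'older']
```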

2. Common Operations

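You can call .describe() on categorical data; the result has the same fields as for string data.
In the output below, count means there are 5 non-null values, unique means there are 4 distinct non-null values, top means the most frequent value is AB,
and freq means that this value occurs 2 times.
Use .cat.categories to get all possible values of the categorical data.
Rename categories: .cat.rename_categories
Add categories: .cat.add_categories
Remove categories: .cat.remove_categories
View the distribution of values: .value_counts()
Access string methods through .str
Merge data with concat: when the inputs have different categories, the result dtype becomes object
Keep the categorical dtype when merging: union_categoricals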

user_info.describe()
Out[86]: 
count      5
unique     4
top       AB
freq       2
Name: blood_type, dtype: object
user_info.cat.rename_categories(["A+", "AB+", "B+", "O+"])
Out[87]: 
name
Tom       A+
Bob      AB+
Mary     NaN
James    AB+
Andy      O+
Alice     B+
Name: blood_type, dtype: category
Categories (4, object): [A+, AB+, B+, O+]
user_info.str.contains('A')
Out[88]: 
name
Tom       True
Bob       True
Mary       NaN
James     True
Andy     False
Alice    False
Name: blood_type, dtype: object
# Merge data (the two inputs have different categories, so the result is object)
blood_type1 = pd.Categorical(["A", "AB"])
blood_type2 = pd.Categorical(["B", "O"])
pd.concat([pd.Series(blood_type1), pd.Series(blood_type2)])
Out[89]: 
0     A
1    AB
0     B
1     O
dtype: object
# Merge while keeping the categorical dtype
from pandas.api.types import union_categoricals
union_categoricals([blood_type1, blood_type2])
Out[90]: 
[A, AB, B, O]
Categories (4, object): [A, AB, B, O]
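The session above does not show add_categories, remove_categories, or value_counts, so here is a minimal sketch of how they behave, assuming the same blood type values. The extra category "Rh-null" is only an illustrative label.

```python
import numpy as np
import pandas as pd

blood = pd.Series(["A", "AB", np.nan, "AB", "O", "B"], dtype="category")

# Adding a category only extends the set of allowed values; no rows change.
with_rare = blood.cat.add_categories(["Rh-null"])

# Removing a category turns every row that held it into NaN.
without_o = blood.cat.remove_categories(["O"])

# value_counts() on categorical data reports every category, including unused ones.
counts = with_rare.value_counts()

print(with_rare.cat.categories.tolist())  # ['A', 'AB', 'B', 'O', 'Rh-null']
print(int(without_o.isna().sum()))        # 2 (the original NaN plus the former "O")
print(int(counts["AB"]), int(counts["Rh-null"]))  # 2 0
```

These methods return a new Series, so the result has to be assigned back if the change should persist.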

All public attributes and methods of the .cat accessor

[name for name in dir(user_info.cat) if not name.startswith('_')]
Out[92]: 
['add_categories',
 'as_ordered',
 'as_unordered',
 'categories',
 'codes',
 'ordered',
 'remove_categories',
 'remove_unused_categories',
 'rename_categories',
 'reorder_categories',
 'set_categories']
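Several of the names listed above (codes, ordered, reorder_categories) relate to category order. The sketch below uses a made-up clothing size column, since blood types have no natural order, to show how declaring an order changes the codes and enables comparisons.

```python
import pandas as pd

sizes = pd.Series(["M", "S", "L", "M"], dtype="category")  # made-up data

# By default the categories are the sorted unique values: L, M, S.
default_codes = sizes.cat.codes.tolist()

# reorder_categories sets a meaningful order; ordered=True enables
# comparisons and min/max based on that order.
ordered = sizes.cat.reorder_categories(["S", "M", "L"], ordered=True)
ordered_codes = ordered.cat.codes.tolist()
at_least_m = (ordered >= "M").tolist()
largest = ordered.max()

print(default_codes)  # [1, 2, 0, 1]
print(ordered_codes)  # [1, 0, 2, 1]
print(at_least_m)     # [True, False, True, True]
print(largest)        # L
```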

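3. Memory Usage Pitfalls

The memory usage of a Categorical is proportional to the number of categories plus the length of the data, whereas object data uses a constant times the length of the data. When the number of categories approaches the length of the data, a Categorical therefore uses about as much memory as the object representation, or more, as the second example below shows.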

blood_type = pd.Series(["AB","O"]*1000)
blood_type.nbytes
Out[79]: 16000
blood_type.astype("category").nbytes
Out[80]: 2016
blood_type = pd.Series(['AB%4d' % i for i in range(2000)])
blood_type.nbytes
Out[81]: 16000
blood_type.astype("category").nbytes
Out[82]: 20000
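The numbers above can be broken down into their two parts: one integer code per row, plus one stored value per category. The following sketch, which assumes the same two Series built with an explicit object dtype, shows where the bytes go. The variable names are illustrative.

```python
import pandas as pd

few = pd.Series(["AB", "O"] * 1000, dtype=object)                   # 2 distinct values
many = pd.Series(["AB%4d" % i for i in range(2000)], dtype=object)  # 2000 distinct values

few_cat = few.astype("category")
many_cat = many.astype("category")

# Part 1: one integer code per row. pandas picks the smallest integer type
# that can index all categories (int8 for 2 categories, int16 for 2000).
few_code_bytes = few_cat.cat.codes.nbytes
many_code_bytes = many_cat.cat.codes.nbytes

# Part 2: one stored value per category.
n_few = len(few_cat.cat.categories)
n_many = len(many_cat.cat.categories)

print(few_code_bytes, n_few)    # 2000 2
print(many_code_bytes, n_many)  # 4000 2000
```

With 2000 distinct values the category list is as long as the data itself, so the codes are pure overhead on top of it. Converting to category pays off only when values repeat often.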