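Data Types in the Data-Processing Workflow
- When processing data with pandas, you often run into data type problems. When you receive data, you first need to confirm that each column has the correct type, which usually means converting some types. This article introduces the data types in pandas (data types, commonly called dtypes) and how pandas types correspond to NumPy types.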
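- It mainly covers object, int64, float64, datetime64, and bool. The category and timedelta types will be covered separately in other articles, although this article touches on them briefly.
Data type problems are usually noticed only after something has gone wrong. With some experience, you learn to check the dtypes as soon as you receive the data and confirm that they match the format you intend to work with, which avoids awkward problems from the start. Let's walk through a simple example in Jupyter Notebook to introduce the data types.
#### As usual, import the two commonly used data-processing packages, numpy and pandas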
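To make the correspondence between Python values and pandas dtypes concrete, here is a minimal sketch. The `sample` frame and its column names are hypothetical, invented for illustration; they are not the article's CSV.

```python
import pandas as pd

# Hypothetical sample data, built in memory for illustration only.
sample = pd.DataFrame({
    "name": ["a", "b"],          # Python str   -> object in the pandas version used here
    "units": [1, 2],             # Python int   -> int64
    "price": [1.5, 2.5],         # Python float -> float64
    "active": [True, False],     # Python bool  -> bool
    "start": pd.to_datetime(["2015-01-10", "2014-06-15"]),  # -> a datetime64 dtype
})

# Checking dtypes first is the habit the paragraph above recommends.
print(sample.dtypes)
```

Newer pandas releases may report a dedicated string dtype instead of object for the `name` column, so treat the comments as describing the version this article was written against.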
import numpy as np
import pandas as pd
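# Read the data from a CSV file. The table has only 5 data rows and contains three Python types (float, string, int), which correspond to the pandas dtypes float64, object, and int64
# The CSV file has six rows in total: the first row is the header and the rest are data.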
df = pd.read_csv("sales_data_types.csv")
print(df)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125,000.00 $162,500.00
1 552278 Smith Plumbing $920,000.00 $1,012,000.00
2 23477 ACME Industrial $50,000.00 $62,500.00
3 24900 Brekke LTD $350,000.00 $490,000.00
4 651029 Harbor Co $15,000.00 $12,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
df.dtypes
Customer Number int64
Customer Name object
2016 object
2017 object
Percent Growth object
Jan Units object
Month int64
Day int64
Year int64
Active object
dtype: object
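# Suppose we want the sum of the 2016 and 2017 columns. We can try adding them, but the result is not what we want: both columns have dtype object, so the operation just produces longer, concatenated strings.
# df.info() gives more detailed information about the DataFrame.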
df['2016']+df['2017']
0 $125,000.00 $162,500.00
1 $920,000.00 $1,012,000.00
2 $50,000.00 $62,500.00
3 $350,000.00 $490,000.00
4 $15,000.00 $12,750.00
dtype: object
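The behaviour above can be reproduced on a tiny scale. This is a minimal sketch with made-up Series (`as_text` and `as_float` are hypothetical names): adding two object columns of strings concatenates them, while adding float columns performs arithmetic.

```python
import pandas as pd

# Strings stored in an object column: "+" concatenates element by element.
as_text = pd.Series(["$125,000.00", "$50,000.00"], dtype=object)
print((as_text + as_text).tolist())
# ['$125,000.00$125,000.00', '$50,000.00$50,000.00']

# The same values as floats: "+" adds them numerically.
as_float = pd.Series([125000.0, 50000.0])
print((as_float + as_float).tolist())
# [250000.0, 100000.0]
```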
df.info()
# Customer Number is read as int64 here, which is already the integer type we want
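# The 2016 and 2017 columns are object, not float64 or int64
# Percent Growth and Jan Units are also object rather than numeric
# Month, Day, and Year should be combined and converted to datetime64[ns]
# Active should be a boolean column
# Without cleaning the data it is hard to move on to further analysis. pandas offers three commonly used ways to convert types:
# 1. astype() to force a conversion
# 2. A custom function that converts the data
# 3. The pandas helper functions to_numeric() and to_datetime()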
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number 5 non-null int64
Customer Name 5 non-null object
2016 5 non-null object
2017 5 non-null object
Percent Growth 5 non-null object
Jan Units 5 non-null object
Month 5 non-null int64
Day 5 non-null int64
Year 5 non-null int64
Active 5 non-null object
dtypes: int64(4), object(6)
memory usage: 480.0+ bytes
First, the most commonly used method, astype()
For example, astype() can convert the first column to the integer type int
df['Customer Number'].astype("int")
# This does not change the original DataFrame; it only returns a copy
0 10002
1 552278
2 23477
3 24900
4 651029
Name: Customer Number, dtype: int32
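A minimal sketch of the copy behaviour, using a hypothetical `frame` whose IDs were read as floats. The int32 in the output above most likely comes from asking for plain "int", whose width can follow the platform default; spelling out "int64" is an assumption made here to keep the result the same everywhere.

```python
import pandas as pd

# Hypothetical frame in which the customer IDs arrived as floats.
frame = pd.DataFrame({"Customer Number": [10002.0, 552278.0, 23477.0]})

# astype() returns a new Series and leaves the frame untouched.
converted = frame["Customer Number"].astype("int64")
print(converted.dtype)                  # int64
print(frame["Customer Number"].dtype)   # float64

# Only assigning the result back changes the column in the frame.
frame["Customer Number"] = converted
print(frame["Customer Number"].dtype)   # int64
```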
# To actually change the DataFrame, you normally assign the result back, for example
df["Customer Number"] = df["Customer Number"].astype("int")
print(df)
print("--------"*10)
print(df.dtypes)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125,000.00 $162,500.00
1 552278 Smith Plumbing $920,000.00 $1,012,000.00
2 23477 ACME Industrial $50,000.00 $62,500.00
3 24900 Brekke LTD $350,000.00 $490,000.00
4 651029 Harbor Co $15,000.00 $12,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
--------------------------------------------------------------------------------
Customer Number int32
Customer Name object
2016 object
2017 object
Percent Growth object
Jan Units object
Month int64
Day int64
Year int64
Active object
dtype: object
# The assignment converted the column in the original DataFrame; let's look at the updated DataFrame again
print(df)
Customer Number Customer Name 2016 2017 \
0 10002 Quest Industries $125,000.00 $162,500.00
1 552278 Smith Plumbing $920,000.00 $1,012,000.00
2 23477 ACME Industrial $50,000.00 $62,500.00
3 24900 Brekke LTD $350,000.00 $490,000.00
4 651029 Harbor Co $15,000.00 $12,750.00
Percent Growth Jan Units Month Day Year Active
0 30.00% 500 1 10 2015 Y
1 10.00% 700 6 15 2014 Y
2 25.00% 125 3 29 2016 Y
3 4.00% 75 10 27 2015 Y
4 -15.00% Closed 2 2 2014 N
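# Columns such as 2016, 2017, Percent Growth, and Jan Units are object columns containing special characters, so they cannot be converted directly with astype("float").
# This matches converting a string to a float in Python: the string may contain only the number itself, with no other special characters.
# We can also try converting the Active column to booleans to see what happens: all five results are True, because every non-empty string is truthy, so the conversion accomplishes nothing.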
#df["Active"].astype("bool")
df['2016'].astype('float')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-47cc9d68cd65> in <module>()
----> 1 df['2016'].astype('float')
C:\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, raise_on_error, **kwargs)
3052 # else, only a single dtype is given
3053 new_data = self._data.astype(dtype=dtype, copy=copy,
-> 3054 raise_on_error=raise_on_error, **kwargs)
3055 return self._constructor(new_data).__finalize__(self)
3056
C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
3187
3188 def astype(self, dtype, **kwargs):
-> 3189 return self.apply('astype', dtype=dtype, **kwargs)
3190
3191 def convert(self, **kwargs):
C:\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
3054
3055 kwargs['mgr'] = self
-> 3056 applied = getattr(b, f)(**kwargs)
3057 result_blocks = _extend_blocks(applied, result_blocks)
3058
C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, raise_on_error, values, **kwargs)
459 **kwargs):
460 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461 values=values, **kwargs)
462
463 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs)
502
503 # _astype_nansafe works fine with 1-d only
--> 504 values = _astype_nansafe(values.ravel(), dtype, copy=True)
505 values = values.reshape(self.shape)
506
C:\Anaconda3\lib\site-packages\pandas\types\cast.py in _astype_nansafe(arr, dtype, copy)
535
536 if copy:
--> 537 return arr.astype(dtype)
538 return arr.view(dtype)
539
ValueError: could not convert string to float: '$15,000.00 '
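The error above tells us a few things:
- If the data is clean, it can be converted to numbers.
- astype() has essentially two uses: converting numbers to plain strings, and converting purely numeric strings to numbers. Strings that contain non-numeric characters cannot be converted with astype().
- Other methods are needed for such data, which leads to the custom-function approach below.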
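The three points above can be checked with a short sketch. The Series below are hypothetical stand-ins for the article's columns and are created with `dtype=object` to match the dtypes shown earlier.

```python
import pandas as pd

# Purely numeric strings convert cleanly.
clean = pd.Series(["1", "2.5", "3"], dtype=object)
print(clean.astype("float").tolist())   # [1.0, 2.5, 3.0]

# Strings with "$" or "," make astype() raise ValueError.
dirty = pd.Series(["$15,000.00", "500"], dtype=object)
failed = False
try:
    dirty.astype("float")
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)

# Every non-empty string is truthy, so "Y" and "N" both become True.
flags = pd.Series(["Y", "N"], dtype=object)
print(flags.astype("bool").tolist())    # [True, True]
```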
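Cleaning data with custom functions
def convert_currency(var):
    """
    Convert a currency string to a float
    - Remove $
    - Remove commas
    - Convert to a float
    """
    new_value = var.replace(",", "").replace("$", "")
    return float(new_value)
# replace() strips the $ and the commas, then the string is converted to a float; pandas then picks the dtype it considers appropriate (float or int), which in this example is float64
# Use the pandas apply function to convert every value in the 2016 column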
df["2016"].apply(convert_currency)
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
# Of course, a function this simple can be written in one line with a lambda
df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
0 125000.0
1 920000.0
2 50000.0
3 350000.0
4 15000.0
Name: 2016, dtype: float64
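As an alternative to `apply` with a lambda, the same cleanup can be sketched with the vectorized `.str` accessor. This is a swapped-in technique rather than the article's method, and the `sales` Series is a hypothetical stand-in for the 2016 column.

```python
import pandas as pd

# Hypothetical stand-in for the 2016 column.
sales = pd.Series(["$125,000.00", "$15,000.00"], name="2016")

# regex=False treats "$" as a literal character instead of a regex anchor.
cleaned = (
    sales.str.replace("$", "", regex=False)
         .str.replace(",", "", regex=False)
         .astype("float64")
)
print(cleaned.tolist())   # [125000.0, 15000.0]
```

Both forms give the same values; the `.str` version avoids calling a Python function once per row from user code, which some readers find easier to scan.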
# Likewise, a lambda expression can clean the Percent Growth column
df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
0 0.30
1 0.10
2 0.25
3 0.04
4 -0.15
Name: Percent Growth, dtype: float64
# A custom function would also work here and gives the same result
# The last conversion uses np.where() to turn the Active column into booleans.
df["Active"] = np.where(df["Active"] == "Y", True, False)
df["Active"]
0 True
1 True
2 True
3 True
4 False
Name: Active, dtype: bool
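For comparison, a minimal sketch of `np.where` next to a plain comparison on a hypothetical `active` Series. The comparison alone already yields booleans; `np.where` is more useful when the two outcomes are values other than True and False.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Active column.
active = pd.Series(["Y", "Y", "N"], name="Active")

# np.where(condition, value_if_true, value_if_false) returns a NumPy array.
via_where = np.where(active == "Y", True, False)

# The comparison by itself is already a boolean Series.
via_compare = active == "Y"

print(via_where.tolist())     # [True, True, False]
print(via_compare.tolist())   # [True, True, False]
```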
# Now assign the cleaned columns back and check the dtypes
df["2016"]=df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["2017"]=df["2017"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["Percent Growth"]=df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
df.dtypes
Customer Number int32
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units object
Month int64
Day int64
Year int64
Active bool
dtype: object
# Look at the DataFrame again
# Only Jan Units still needs converting, and the month, day, and year columns need combining; pandas' built-in functions can handle both
print(df)
Customer Number Customer Name 2016 2017 Percent Growth \
0 10002 Quest Industries 125000.0 162500.0 0.30
1 552278 Smith Plumbing 920000.0 1012000.0 0.10
2 23477 ACME Industrial 50000.0 62500.0 0.25
3 24900 Brekke LTD 350000.0 490000.0 0.04
4 651029 Harbor Co 15000.0 12750.0 -0.15
Jan Units Month Day Year Active
0 500 1 10 2015 True
1 700 6 15 2014 True
2 125 3 29 2016 True
3 75 10 27 2015 True
4 Closed 2 2 2014 False
Processing with pandas built-in functions
# Use pd.to_numeric() to handle the data in Jan Units
pd.to_numeric(df["Jan Units"],errors='coerce').fillna(0)
0 500.0
1 700.0
2 125.0
3 75.0
4 0.0
Name: Jan Units, dtype: float64
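A small sketch of how the `errors` argument changes the outcome, using a hypothetical `units` Series. With the default `errors="raise"` the text value stops the conversion; with `errors="coerce"` it becomes NaN, which `fillna(0)` can then replace.

```python
import pandas as pd

# Hypothetical stand-in for the Jan Units column.
units = pd.Series(["500", "700", "Closed"], dtype=object)

# Default behaviour: an unparseable value raises ValueError.
raised = False
try:
    pd.to_numeric(units)
except ValueError:
    raised = True

# errors="coerce" turns unparseable values into NaN instead.
coerced = pd.to_numeric(units, errors="coerce")
filled = coerced.fillna(0)
print(filled.tolist())   # [500.0, 700.0, 0.0]
```

Whether 0 is the right replacement depends on the analysis; keeping NaN preserves the fact that the value was missing.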
# Finally, use pd.to_datetime() to combine the month, day, and year columns
pd.to_datetime(df[['Month', 'Day', 'Year']])
0 2015-01-10
1 2014-06-15
2 2016-03-29
3 2015-10-27
4 2014-02-02
dtype: datetime64[ns]
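The column-assembly form of `pd.to_datetime` can be sketched on a hypothetical two-row frame. It relies on the columns being named year, month, and day (the capitalized names used in this article are accepted as well).

```python
import pandas as pd

# Hypothetical frame with the date split across three integer columns.
parts = pd.DataFrame({"Month": [1, 6], "Day": [10, 15], "Year": [2015, 2014]})

# Passing a DataFrame assembles one timestamp per row from its columns.
dates = pd.to_datetime(parts[["Month", "Day", "Year"]])
print(dates.tolist())
```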
# Remember to assign the results back; otherwise the original data is unchanged
df["Jan Units"] = pd.to_numeric(df["Jan Units"],errors='coerce')
df["Start_date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
df
| | Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | Start_date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10002 | Quest Industries | 125000.0 | 162500.0 | 0.30 | 500.0 | 1 | 10 | 2015 | True | 2015-01-10 |
| 1 | 552278 | Smith Plumbing | 920000.0 | 1012000.0 | 0.10 | 700.0 | 6 | 15 | 2014 | True | 2014-06-15 |
| 2 | 23477 | ACME Industrial | 50000.0 | 62500.0 | 0.25 | 125.0 | 3 | 29 | 2016 | True | 2016-03-29 |
| 3 | 24900 | Brekke LTD | 350000.0 | 490000.0 | 0.04 | 75.0 | 10 | 27 | 2015 | True | 2015-10-27 |
| 4 | 651029 | Harbor Co | 15000.0 | 12750.0 | -0.15 | NaN | 2 | 2 | 2014 | False | 2014-02-02 |
df.dtypes
Customer Number int32
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active bool
Start_date datetime64[ns]
dtype: object
# Put all of these conversions together
def convert_percent(val):
    """
    Convert the percentage string to an actual floating point percent
    - Remove %
    - Divide by 100 to make decimal
    """
    new_val = val.replace('%', '')
    return float(new_val) / 100
# The dtype key must match the CSV header exactly ("Customer Number", with a space);
# int64 is spelled out so the result does not depend on the platform width of "int"
df_2 = pd.read_csv(
    "sales_data_types.csv",
    dtype={"Customer Number": "int64"},
    converters={
        "2016": convert_currency,
        "2017": convert_currency,
        "Percent Growth": convert_percent,
        "Jan Units": lambda x: pd.to_numeric(x, errors="coerce"),
        "Active": lambda x: np.where(x == "Y", True, False),
    },
)
df_2.dtypes
Customer Number int64
Customer Name object
2016 float64
2017 float64
Percent Growth float64
Jan Units float64
Month int64
Day int64
Year int64
Active bool
dtype: object
df_2
| | Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10002 | Quest Industries | 125000.0 | 162500.0 | 0.30 | 500.0 | 1 | 10 | 2015 | True |
| 1 | 552278 | Smith Plumbing | 920000.0 | 1012000.0 | 0.10 | 700.0 | 6 | 15 | 2014 | True |
| 2 | 23477 | ACME Industrial | 50000.0 | 62500.0 | 0.25 | 125.0 | 3 | 29 | 2016 | True |
| 3 | 24900 | Brekke LTD | 350000.0 | 490000.0 | 0.04 | 75.0 | 10 | 27 | 2015 | True |
| 4 | 651029 | Harbor Co | 15000.0 | 12750.0 | -0.15 | NaN | 2 | 2 | 2014 | False |
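To try the converters approach without the original file, here is a self-contained sketch that reads a small in-memory CSV. The CSV text, the `convert_units` helper, and the plain comparison used for Active are assumptions made for illustration; the two currency and percent converters mirror the ones defined earlier.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; quoted fields keep the thousands commas intact.
csv_text = '''Customer Number,2016,Percent Growth,Jan Units,Active
10002,"$125,000.00",30.00%,500,Y
651029,"$15,000.00",-15.00%,Closed,N
'''

def convert_currency(val):
    """Strip "$" and "," and return a float."""
    return float(val.replace(",", "").replace("$", ""))

def convert_percent(val):
    """Strip "%" and return the value as a fraction."""
    return float(val.replace("%", "")) / 100

def convert_units(val):
    """Return a float, or NaN when the cell is not numeric (hypothetical helper)."""
    try:
        return float(val)
    except ValueError:
        return float("nan")

# Each converter receives the raw cell text for its column.
demo = pd.read_csv(
    io.StringIO(csv_text),
    converters={
        "2016": convert_currency,
        "Percent Growth": convert_percent,
        "Jan Units": convert_units,
        "Active": lambda val: val == "Y",
    },
)
print(demo)
print(demo.dtypes)
```

Doing the cleanup at read time keeps the conversion rules in one place, at the cost of running a Python function for every cell in the converted columns.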
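That covers the main pandas data types. Two more remain, timedelta and category. A later article will focus on category, a type modeled on R's factor and added in pandas 0.15; other commonly used pandas methods will be written up as needed.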