pandas Data Type Conversion

Date: 2019-11-30

Tags: pandas, data type conversion


Data types in data processing

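When processing data with pandas, data type problems come up often. Once you have the data, the first step is to confirm that every column has the correct type, which usually means converting some of them. This article introduces the data types in pandas (usually called dtypes) and how they correspond to Python and numpy types.
It focuses on object, int64, float64, datetime64, and bool. The category and timedelta types will be covered separately in other articles and get only a brief mention here.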
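As a quick orientation, the sketch below builds a few small Series and prints the dtype pandas assigns to each. The sample values are made up for illustration and are not taken from the sales data used later.

```python
import pandas as pd

# Illustrative values only; they are not taken from the sales data.
samples = {
    "text": pd.Series(["a", "b"], dtype=object),
    "integers": pd.Series([1, 2], dtype="int64"),
    "floats": pd.Series([1.5, 2.5]),
    "booleans": pd.Series([True, False]),
    "dates": pd.to_datetime(pd.Series(["2019-11-30", "2019-12-01"])),
}

# Print the dtype pandas reports for each Series.
for name, series in samples.items():
    print(name, series.dtype)
```

The text and integer dtypes are requested explicitly here so the result does not depend on the pandas version or platform.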

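Data type problems are usually discovered only after something has gone wrong. With some experience, you learn to check the dtypes as soon as you receive the data and confirm they match the format you intend to work with, which avoids awkward problems from the start. The following simple example, run in a Jupyter notebook, walks through the data types.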

# By convention, import the two most common data-processing packages, numpy and pandas
import numpy as np
import pandas as pd
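# Read the data from a CSV file. The table has only 5 rows of data and contains three Python types (float, str, int), which correspond to the pandas dtypes float64, object, and int64.
# The CSV file has six lines in total: the first is the header row and the rest are data.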
df = pd.read_csv("sales_data_types.csv")
print(df)

Customer Number     Customer Name          2016            2017  \
0            10002  Quest Industries  $125,000.00     $162,500.00    
1           552278    Smith Plumbing  $920,000.00   $1,012,000.00    
2            23477   ACME Industrial   $50,000.00      $62,500.00    
3            24900        Brekke LTD  $350,000.00     $490,000.00    
4           651029         Harbor Co   $15,000.00      $12,750.00    

  Percent Growth Jan Units  Month  Day  Year Active  
0         30.00%       500      1   10  2015      Y  
1         10.00%       700      6   15  2014      Y  
2         25.00%       125      3   29  2016      Y  
3          4.00%        75     10   27  2015      Y  
4        -15.00%    Closed      2    2  2014      N

df.dtypes

Customer Number     int64
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object

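# Suppose we want the sum of the 2016 and 2017 columns. We can try adding them, but the result is not what we need: both columns have dtype object, so the operation concatenates the strings into a longer string.
# We can also call df.info() to get more detailed information about the DataFrame.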
df['2016']+df['2017']

0      $125,000.00 $162,500.00 
1    $920,000.00 $1,012,000.00 
2        $50,000.00 $62,500.00 
3      $350,000.00 $490,000.00 
4        $15,000.00 $12,750.00 
dtype: object
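The behaviour above can be reproduced in isolation. The following sketch uses the two values from the first row of the table and shows that `+` on object columns holding strings joins the strings instead of adding numbers.

```python
import pandas as pd

# Two values copied from the first row of the table above.
a = pd.Series(["$125,000.00"], dtype=object)
b = pd.Series(["$162,500.00"], dtype=object)

combined = a + b
print(combined[0])        # the two strings joined end to end
print(type(combined[0]))  # a Python str, not a number
```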

df.info()
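# Customer Number is already int64 in this output, which is the type we want; it is used below only to demonstrate astype()
# The 2016 and 2017 columns are object, not float64 or int64
# Percent Growth and Jan Units are also object rather than numeric
# Month, Day, and Year should be combined into a single datetime64[ns] column
# The Active column should be boolean
# Without cleaning the data first, further analysis is difficult. pandas offers three commonly used ways to convert data types:
# 1. astype() to force a conversion to a given dtype
# 2. A custom function that converts the data
# 3. The pandas helpers to_numeric() and to_datetime()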

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number    5 non-null int64
Customer Name      5 non-null object
2016               5 non-null object
2017               5 non-null object
Percent Growth     5 non-null object
Jan Units          5 non-null object
Month              5 non-null int64
Day                5 non-null int64
Year               5 non-null int64
Active             5 non-null object
dtypes: int64(4), object(6)
memory usage: 480.0+ bytes

First, the most commonly used method: astype()

For example, astype() can convert the first column to the integer type int

df['Customer Number'].astype("int")
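# This does not change the original DataFrame; it only returns a copy.
# The int32 in the output below reflects the default size of "int" on the machine that produced it; pass "int64" to request 64 bits explicitly.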

0     10002
1    552278
2     23477
3     24900
4    651029
Name: Customer Number, dtype: int32
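A minimal sketch of the point that astype() returns a new object. The float input is chosen on purpose so the dtype change is visible, and "int64" is spelled out so the result does not depend on the platform.

```python
import pandas as pd

s = pd.Series([10002.0, 552278.0])  # float64 on purpose
converted = s.astype("int64")

print(s.dtype)          # the original Series keeps its dtype
print(converted.dtype)  # only the returned Series is int64
```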

# To actually change the DataFrame, assign the result back, for example
df["Customer Number"] = df["Customer Number"].astype("int")
print(df)
print("--------"*10)
print(df.dtypes)

Customer Number     Customer Name          2016            2017  \
0            10002  Quest Industries  $125,000.00     $162,500.00    
1           552278    Smith Plumbing  $920,000.00   $1,012,000.00    
2            23477   ACME Industrial   $50,000.00      $62,500.00    
3            24900        Brekke LTD  $350,000.00     $490,000.00    
4           651029         Harbor Co   $15,000.00      $12,750.00    

  Percent Growth Jan Units  Month  Day  Year Active  
0         30.00%       500      1   10  2015      Y  
1         10.00%       700      6   15  2014      Y  
2         25.00%       125      3   29  2016      Y  
3          4.00%        75     10   27  2015      Y  
4        -15.00%    Closed      2    2  2014      N  
--------------------------------------------------------------------------------
Customer Number     int32
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object

# The assignment converted the column in the original DataFrame; print the updated DataFrame again
print(df)

Customer Number     Customer Name          2016            2017  \
0            10002  Quest Industries  $125,000.00     $162,500.00    
1           552278    Smith Plumbing  $920,000.00   $1,012,000.00    
2            23477   ACME Industrial   $50,000.00      $62,500.00    
3            24900        Brekke LTD  $350,000.00     $490,000.00    
4           651029         Harbor Co   $15,000.00      $12,750.00    

  Percent Growth Jan Units  Month  Day  Year Active  
0         30.00%       500      1   10  2015      Y  
1         10.00%       700      6   15  2014      Y  
2         25.00%       125      3   29  2016      Y  
3          4.00%        75     10   27  2015      Y  
4        -15.00%    Closed      2    2  2014      N

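# Columns such as 2016, 2017, Percent Growth, and Jan Units are object columns containing special characters, so they cannot be converted directly with astype("float").
# This is the same rule as converting a string to a float in Python: the string may contain only the number itself, with no other special characters.
# We can try converting the Active column to boolean and see what happens: all five results are True, because any non-empty string is truthy, so the conversion is useless here.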

#df["Active"].astype("bool")
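The reason every value comes back True can be checked on a small stand-in Series. This sketch assumes the column holds plain Python strings, as it does after read_csv in the output above.

```python
import pandas as pd

active = pd.Series(["Y", "Y", "N"], dtype=object)

# Any non-empty string is truthy in Python, so "N" also becomes True.
as_bool = active.astype("bool")
print(as_bool.tolist())

# The same rule in plain Python: only the empty string is falsy.
print(bool("N"), bool(""))
```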

df['2016'].astype('float')

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-19-47cc9d68cd65> in <module>()
----> 1 df['2016'].astype('float')


C:\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, raise_on_error, **kwargs)
   3052         # else, only a single dtype is given
   3053         new_data = self._data.astype(dtype=dtype, copy=copy,
-> 3054                                      raise_on_error=raise_on_error, **kwargs)
   3055         return self._constructor(new_data).__finalize__(self)
   3056 


C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
   3187 
   3188     def astype(self, dtype, **kwargs):
-> 3189         return self.apply('astype', dtype=dtype, **kwargs)
   3190 
   3191     def convert(self, **kwargs):


C:\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3054 
   3055             kwargs['mgr'] = self
-> 3056             applied = getattr(b, f)(**kwargs)
   3057             result_blocks = _extend_blocks(applied, result_blocks)
   3058 


C:\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, raise_on_error, values, **kwargs)
    459                **kwargs):
    460         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461                             values=values, **kwargs)
    462 
    463     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,


C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs)
    502 
    503                 # _astype_nansafe works fine with 1-d only
--> 504                 values = _astype_nansafe(values.ravel(), dtype, copy=True)
    505                 values = values.reshape(self.shape)
    506 


C:\Anaconda3\lib\site-packages\pandas\types\cast.py in _astype_nansafe(arr, dtype, copy)
    535 
    536     if copy:
--> 537         return arr.astype(dtype)
    538     return arr.view(dtype)
    539 


ValueError: could not convert string to float: '$15,000.00 '

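The error above illustrates a few points:

If the data is clean, astype() can convert it to numbers.
astype() is basically useful in two directions: converting numbers to plain strings, and converting strings that contain only a number into numbers. Strings containing other non-numeric characters cannot be converted with astype().
Those cases need a different approach, which leads to the custom functions below.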
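Both cases can be seen side by side in a short sketch. The `clean` and `dirty` Series are illustrative stand-ins, not columns from the file.

```python
import pandas as pd

clean = pd.Series(["125000.00", "920000.00"], dtype=object)
dirty = pd.Series(["$125,000.00", "$920,000.00"], dtype=object)

# Strings that contain only a number convert without trouble.
clean_floats = clean.astype("float64")
print(clean_floats.tolist())

# Strings with "$" or "," raise the same kind of error as in the traceback above.
failed = False
try:
    dirty.astype("float64")
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)
```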

Cleaning data with custom functions

The following function converts the currency values

def convert_currency(var):
    """
    convert the string number to a float
    - Remove $
    - Remove commas
    - Convert to float
    """
    new_value = var.replace(",","").replace("$","")
    return float(new_value)

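# replace() removes the $ and the commas, then float() converts the string to a floating-point number. pandas then chooses the dtype it considers suitable for the results; in this example the column becomes float64.
# Use the pandas apply() method to convert every value in the 2016 column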
df["2016"].apply(convert_currency)

0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64

# A function this simple can also be written in one line with a lambda
df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")

0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64
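As an alternative to apply(), the same cleanup can be sketched with the vectorized `.str` accessor. This is a swapped-in technique rather than the approach used in this article, and the sample Series is illustrative.

```python
import pandas as pd

# The second value keeps a trailing space, like the values in the CSV.
sales = pd.Series(["$125,000.00", "$1,012,000.00 "], dtype=object)

# regex=False treats "$" as a literal character rather than a pattern.
cleaned = (
    sales.str.replace("$", "", regex=False)
         .str.replace(",", "", regex=False)
         .astype("float64")
)
print(cleaned.tolist())
```

Passing `regex=False` matters here because `$` has a special meaning in regular expressions.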

# The Percent Growth column can be cleaned with a lambda expression in the same way
df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100

0    0.30
1    0.10
2    0.25
3    0.04
4   -0.15
Name: Percent Growth, dtype: float64

# A named custom function would solve this equally well, with the same result
# The last conversion uses np.where() to turn the Active column into boolean values.
df["Active"] = np.where(df["Active"] == "Y", True, False)

df["Active"]

0     True
1     True
2     True
3     True
4    False
Name: Active, dtype: bool
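Two other ways to get a boolean column are sketched below as a comparison with np.where(). The value "y" is a deliberately odd input added for illustration, to show how each approach treats something unexpected.

```python
import pandas as pd

flags = pd.Series(["Y", "N", "y"], dtype=object)  # "y" is a deliberately odd value

# Plain comparison: everything that is not exactly "Y" becomes False.
by_comparison = flags == "Y"

# Explicit mapping: values missing from the dict become NaN and stand out.
by_mapping = flags.map({"Y": True, "N": False})

print(by_comparison.tolist())
print(by_mapping.isna().tolist())
```

The comparison gives the same result as the np.where() call, while the mapping makes unexpected values visible instead of silently turning them into False.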

# Assign the cleaned columns back, then check the dtypes
df["2016"]=df["2016"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["2017"]=df["2017"].apply(lambda x: x.replace(",","").replace("$","")).astype("float64")
df["Percent Growth"]=df["Percent Growth"].apply(lambda x: x.replace("%","")).astype("float")/100
df.dtypes

Customer Number      int32
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active                bool
dtype: object

# Look at the DataFrame again
# Only Jan Units still needs converting, and Month, Day, and Year still need to be combined; several built-in pandas functions handle both
print(df)

Customer Number     Customer Name      2016       2017  Percent Growth  \
0            10002  Quest Industries  125000.0   162500.0            0.30   
1           552278    Smith Plumbing  920000.0  1012000.0            0.10   
2            23477   ACME Industrial   50000.0    62500.0            0.25   
3            24900        Brekke LTD  350000.0   490000.0            0.04   
4           651029         Harbor Co   15000.0    12750.0           -0.15   

  Jan Units  Month  Day  Year Active  
0       500      1   10  2015   True  
1       700      6   15  2014   True  
2       125      3   29  2016   True  
3        75     10   27  2015   True  
4    Closed      2    2  2014  False

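Converting with pandas helper functions

# Use pd.to_numeric() to convert the Jan Units column; errors='coerce' turns unparseable values into NaN, which fillna(0) then replaces with 0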
pd.to_numeric(df["Jan Units"],errors='coerce').fillna(0)

0    500.0
1    700.0
2    125.0
3     75.0
4      0.0
Name: Jan Units, dtype: float64
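The effect of errors="coerce" is easier to see step by step. The sketch below uses three values from the Jan Units column and adds an optional cast to int64 once the NaN has been filled.

```python
import pandas as pd

units = pd.Series(["500", "75", "Closed"], dtype=object)

coerced = pd.to_numeric(units, errors="coerce")
print(coerced.tolist())  # the unparseable "Closed" entry is now NaN
print(coerced.dtype)     # float64, because NaN is a float

# After filling the NaN, the column can safely be cast to an integer dtype.
filled = coerced.fillna(0).astype("int64")
print(filled.tolist())
```

Whether 0 is a sensible replacement for "Closed" depends on the analysis; keeping NaN preserves the fact that the value was missing.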

# Finally, use pd.to_datetime() to combine Month, Day, and Year into a single date
pd.to_datetime(df[['Month', 'Day', 'Year']])

0   2015-01-10
1   2014-06-15
2   2016-03-29
3   2015-10-27
4   2014-02-02
dtype: datetime64[ns]
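When to_datetime() receives a DataFrame, it assembles each date from columns identified by name. A small sketch with two rows from the table shows that the column order does not matter.

```python
import pandas as pd

parts = pd.DataFrame({"Month": [1, 6], "Day": [10, 15], "Year": [2015, 2014]})

# Column order does not matter; the names identify each component.
dates = pd.to_datetime(parts[["Year", "Month", "Day"]])
print(dates.dt.strftime("%Y-%m-%d").tolist())
```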

# Remember to assign the results back, otherwise the original data is unchanged
df["Jan Units"] = pd.to_numeric(df["Jan Units"],errors='coerce')
df["Start_date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])

df

	Customer Number	Customer Name	2016	2017	Percent Growth	Jan Units	Month	Day	Year	Active	Start_date
0	10002	Quest Industries	125000.0	162500.0	0.30	500.0	1	10	2015	True	2015-01-10
1	552278	Smith Plumbing	920000.0	1012000.0	0.10	700.0	6	15	2014	True	2014-06-15
2	23477	ACME Industrial	50000.0	62500.0	0.25	125.0	3	29	2016	True	2016-03-29
3	24900	Brekke LTD	350000.0	490000.0	0.04	75.0	10	27	2015	True	2015-10-27
4	651029	Harbor Co	15000.0	12750.0	-0.15	NaN	2	2	2014	False	2014-02-02

df.dtypes

Customer Number             int32
Customer Name              object
2016                      float64
2017                      float64
Percent Growth            float64
Jan Units                 float64
Month                       int64
Day                         int64
Year                        int64
Active                       bool
Start_date         datetime64[ns]
dtype: object

# Combine all of these conversions into one step
def convert_percent(val):
    """
    Convert the percentage string to an actual floating point percent
    - Remove %
    - Divide by 100 to make decimal
    """
    new_val = val.replace('%', '')
    return float(new_val) / 100

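# The dtype key must match the column name exactly ("Customer Number", with a space)
df_2 = pd.read_csv("sales_data_types.csv", dtype={"Customer Number": "int64"}, converters={
    "2016": convert_currency,
    "2017": convert_currency,
    "Percent Growth": convert_percent,
    "Jan Units": lambda x: pd.to_numeric(x, errors="coerce"),
    # Each converter receives one cell as a string, so a plain comparison returns a Python bool
    "Active": lambda x: x == "Y"
})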

df_2.dtypes

Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units          float64
Month                int64
Day                  int64
Year                 int64
Active                bool
dtype: object

df_2

	Customer Number	Customer Name	2016	2017	Percent Growth	Jan Units	Month	Day	Year	Active
0	10002	Quest Industries	125000.0	162500.0	0.30	500.0	1	10	2015	True
1	552278	Smith Plumbing	920000.0	1012000.0	0.10	700.0	6	15	2014	True
2	23477	ACME Industrial	50000.0	62500.0	0.25	125.0	3	29	2016	True
3	24900	Brekke LTD	350000.0	490000.0	0.04	75.0	10	27	2015	True
4	651029	Harbor Co	15000.0	12750.0	-0.15	NaN	2	2	2014	False
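To try the whole pipeline without the original file, the following sketch reads two rows from an in-memory string. The CSV layout (quoted currency fields) is an assumption about how sales_data_types.csv is stored, and `to_amount` and `to_fraction` are hypothetical helper names that mirror convert_currency and convert_percent.

```python
import io
import pandas as pd

# Two rows copied from the table above; the CSV layout itself is an assumption.
csv_text = '''Customer Number,Customer Name,2016,2017,Percent Growth,Jan Units,Month,Day,Year,Active
10002,Quest Industries,"$125,000.00","$162,500.00",30.00%,500,1,10,2015,Y
651029,Harbor Co,"$15,000.00","$12,750.00",-15.00%,Closed,2,2,2014,N
'''

def to_amount(text):
    """Hypothetical helper: strip "$" and "," and return a float."""
    return float(text.replace("$", "").replace(",", ""))

def to_fraction(text):
    """Hypothetical helper: turn "30.00%" into 0.3."""
    return float(text.replace("%", "")) / 100

frame = pd.read_csv(
    io.StringIO(csv_text),
    converters={"2016": to_amount, "2017": to_amount, "Percent Growth": to_fraction},
)
# The remaining columns are converted after loading, using the helpers shown earlier.
frame["Jan Units"] = pd.to_numeric(frame["Jan Units"], errors="coerce")
frame["Active"] = frame["Active"] == "Y"
frame["Start_date"] = pd.to_datetime(frame[["Month", "Day", "Year"]])

print(frame.dtypes)
```

Checking `frame.dtypes` right after loading is a cheap way to confirm that every conversion took effect before any analysis starts.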