How to Process Text Data with Pandas

Date: 2021-03-02


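Text data refers to any characters that cannot take part in arithmetic, also called character data: English letters, Chinese characters, digits not used as numeric values (entered with a leading single quote), and any other typeable characters.

Text data is high-dimensional, voluminous, and semantically complex, which makes it a relatively difficult data type to work with. Today, let's look at how to process text data with Pandas.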

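Contents

1. Properties of the string dtype

1.1. Differences between string and object

1.2. Converting to the string dtype

2. Splitting and concatenation

2.1. The str.split method

2.2. The str.cat method

3. Replacement

3.1. Common uses of str.replace

3.2. Subgroups and function-based replacement

3.3. Caveats for str.replace

4. Substring matching and extraction

4.1. The str.extract method

4.2. The str.extractall method

4.3. str.contains and str.match

5. Common string methods

5.1. Filtering methods

5.2. The isnumeric method

6. Questions and exercises

6.1. Questions

6.2. Exercises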

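1. Properties of the string dtype

1.1 Differences between string and object

The string dtype differs from object in three ways:

① String accessor methods (such as str.count) return the corresponding nullable dtype on string, whereas on object the return dtype changes depending on whether missing values are present;

② Some Series.str methods are unavailable on string, for example Series.str.decode(), because the data is stored as strings rather than bytes;

③ When the string dtype stores or operates on missing values, they are represented as pd.NA rather than the floating-point np.nan.

Everything else behaved identically in the pandas version this article was written against. Still, in line with where Pandas is heading, we use string for all string handling below.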
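The first difference can be checked directly. The following is a minimal sketch, not part of the original article; the sample values and variable names are made up for illustration, and the dtypes in the comments are what recent pandas releases are expected to report.

```python
import pandas as pd

# The same values stored under the two dtypes (illustrative data)
as_string = pd.Series(['aab', None, 'b'], dtype='string')
as_object = pd.Series(['aab', None, 'b'], dtype='object')

# On string, str.count returns a nullable integer dtype
count_string = as_string.str.count('a')
# On object, the missing value forces a float result
count_object = as_object.str.count('a')
# Without missing values, object gives a plain integer result
count_object_full = as_object.dropna().str.count('a')

print(count_string.dtype)       # Int64
print(count_object.dtype)       # float64
print(count_object_full.dtype)  # int64
```

The object result switches between integer and float depending on the data, while the string result keeps one nullable dtype either way.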

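1.2 Converting to the string dtype

First, import the packages we need:

import pandas as pd
import numpy as np

Converting a container of another type directly to string could raise an error in the pandas version used here (newer releases may accept these conversions directly):

#pd.Series([1,'1.']).astype('string') # raises an error
#pd.Series([1,2]).astype('string') # raises an error
#pd.Series([True,False]).astype('string') # raises an error

The approach that works there is a two-step conversion: first to a str-valued object Series, then to the string dtype: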

pd.Series([1,'1.']).astype('str').astype('string')

0     1
1    1.
dtype: string

pd.Series([1,2]).astype('str').astype('string')

0    1
1    2
dtype: string

pd.Series([True,False]).astype('str').astype('string')

0     True
1    False
dtype: string

2. Splitting and concatenation

2.1 The str.split method

(a) Separators and positional element selection via str

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
s

0    a_b_c
1    c_d_e
2     <NA>
3    f_g_h
dtype: string

Split on a given separator; the default is whitespace:

s.str.split('_')

0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

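Note that the result of split has dtype object: the elements of the Series are now lists rather than strings, and the string dtype can only hold strings.

The str accessor also supports element selection. If a cell holds a list, str[i] returns its i-th element; if it holds a single string, the string is indexed character by character, as though it had first been turned into a list of characters.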

s.str.split('_').str[1]

0       b
1       d
2    <NA>
3       g
dtype: object

pd.Series(['a_b_c', ['a','b','c']], dtype="object").str[1] # the first element is treated as ['a','_','b','_','c']

0    _
1    b
dtype: object

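(b) Other parameters

The expand parameter controls whether the pieces are spread into separate columns; n is the maximum number of splits: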

s.str.split('_',expand=True)

s.str.split('_',n=1)

0    [a, b_c]
1    [c, d_e]
2        <NA>
3    [f, g_h]
dtype: object

s.str.split('_',expand=True,n=1)
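The effect of expand and n on the shape of the result can be sketched as follows. This is an illustrative check added here, not output from the original article; the variable names full and limited are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

# expand=True spreads the pieces across the columns of a DataFrame
full = s.str.split('_', expand=True)
# n=1 stops after the first split, so only two columns remain
limited = s.str.split('_', expand=True, n=1)

print(full.shape)                 # (4, 3)
print(limited.shape)              # (4, 2)
print(limited.iloc[0].tolist())   # ['a', 'b_c']
print(full.iloc[2].isna().all())  # True: the missing row stays missing
```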

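2.2 The str.cat method

(a) Concatenation modes for different objects

cat behaves differently depending on what it is given: a single column, two columns, or multiple columns.

① For a single Series, all elements are joined into one string:

s = pd.Series(['ab',None,'d'],dtype='string')
s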

0      ab
1    <NA>
2       d
dtype: string

s.str.cat()

'abd'

Optional parameters are the separator sep and na_rep, the string substituted for missing values:

s.str.cat(sep=',')

'ab,d'

s.str.cat(sep=',',na_rep='*')

'ab,*,d'

② For two Series, elements with matching index labels are concatenated:

s2 = pd.Series(['24',None,None],dtype='string')
s2

0      24
1    <NA>
2    <NA>
dtype: string

s.str.cat(s2)

0    ab24
1    <NA>
2    <NA>
dtype: string

The same parameters apply here; note that missing values on both sides are replaced:

s.str.cat(s2,sep=',',na_rep='*')

0    ab,24
1      *,*
2      d,*
dtype: string

③ Multi-column concatenation can mean concatenating with a DataFrame or with several Series.

Concatenating with a DataFrame:

s.str.cat(pd.DataFrame({0:['1','3','5'],1:['5','b',None]},dtype='string'),na_rep='*')

0    ab15
1     *3b
2     d5*
dtype: string

Concatenating with several Series:

s.str.cat([s+'0',s*2])

0    abab0abab
1         <NA>
2        dd0dd
dtype: string

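(b) Index alignment in cat

In the current version, if the two indexes differ and the join parameter is not specified, the default is a left join, i.e. join='left':

s2 = pd.Series(list('abc'),index=[1,2,3],dtype='string')
s2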

1    a
2    b
3    c
dtype: string

s.str.cat(s2,na_rep='*')

0    ab*
1     *a
2     db
dtype: string
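To make the alignment behaviour concrete, the sketch below compares the three join modes on the same two Series. The outer and inner variants are not shown in the original article and are added here only for comparison.

```python
import pandas as pd

s = pd.Series(['ab', None, 'd'], dtype='string')
s2 = pd.Series(list('abc'), index=[1, 2, 3], dtype='string')

# left: keep the caller's index (0, 1, 2)
left = s.str.cat(s2, join='left', na_rep='*')
# outer: union of both indexes (0, 1, 2, 3)
outer = s.str.cat(s2, join='outer', na_rep='*')
# inner: only the labels present in both (1, 2)
inner = s.str.cat(s2, join='inner', na_rep='*')

print(left.tolist())   # ['ab*', '*a', 'db']
print(outer.tolist())  # ['ab*', '*a', 'db', '*c']
print(inner.tolist())  # ['*a', 'db']
```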

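3. Replacement

Replacement in the broad sense means applying str.replace; fillna is replacement specifically for missing values and was covered in the previous chapter.

Replacement inevitably involves regular expressions. We assume readers already know the common regex constructs; those who do not may want to consult an introductory regex reference first.

3.1 Common uses of str.replace

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'],dtype="string")
s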

0       A
1       B
2       C
3    Aaba
4    Baca
5
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

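The first argument is a regular expression written as a raw string (r prefix); the second is the replacement string. regex=True is passed explicitly because newer pandas versions default to literal matching:

s.str.replace(r'^[AB]','***',regex=True)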

0       ***
1       ***
2         C
3    ***aba
4    ***aca
5
6      <NA>
7      CABA
8       dog
9       cat
dtype: string

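3.2 Subgroups and function-based replacement

Subgroups are accessed by positive integers (0 returns the whole match; subgroups start at 1). A callable replacement requires regex=True:

s.str.replace(r'([ABC])(\w+)',lambda x:x.group(2)[1:]+'*',regex=True)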

0       A
1       B
2       C
3     ba*
4     ca*
5
6    <NA>
7     BA*
8     dog
9     cat
dtype: string

Subgroups can be named with the ?P<....> syntax and accessed by name:

s.str.replace(r'(?P<one>[ABC])(?P<two>\w+)',lambda x:x.group('two')[1:]+'*',regex=True)

0       A
1       B
2       C
3     ba*
4     ca*
5
6    <NA>
7     BA*
8     dog
9     cat
dtype: string
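Besides a callable, subgroups can also be referenced from a plain replacement string through backreferences such as \1 and \2, as in Python's re.sub. The sketch below is an addition for illustration (the shortened sample data and the name swapped are made up); it moves the leading letter to the end of each matched string.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', 'Baca', np.nan, 'CABA', 'dog'], dtype="string")

# \1 is the first subgroup, \2 the second; here their order is swapped
swapped = s.str.replace(r'([ABC])(\w+)', r'\2\1', regex=True)

print(swapped[1])  # abaA
print(swapped[4])  # ABAC
print(swapped[0])  # A  (no match, left unchanged)
```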

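3.3 Caveats for str.replace

First, be clear that str.replace and replace are not the same thing:

str.replace works on the object or string dtype and, in the pandas version used here, treats the pattern as a regular expression by default (from pandas 2.0 the default is regex=False, so pass regex=True explicitly); it is not available on a DataFrame;
replace works on a Series or DataFrame of any dtype; for regex replacement you must set regex=True, and it supports multi-column replacement via a dict.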
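Because the default for the regex parameter of str.replace has changed between pandas versions, it is safest to state it explicitly. The following sketch, with made-up sample data, shows how the same pattern behaves under the two settings.

```python
import pandas as pd

s = pd.Series(['a.b', 'axb'], dtype='string')

# regex=False: 'a.' is matched literally, so only 'a.b' changes
literal = s.str.replace('a.', '-', regex=False)
# regex=True: '.' matches any character, so both strings change
as_regex = s.str.replace('a.', '-', regex=True)

print(literal.tolist())   # ['-b', 'axb']
print(as_regex.tolist())  # ['-b', '-b']
```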

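However, because the string dtype had only just been introduced, a few usage problems appeared in that version; these issues were expected to be fixed in later releases.

(a) The replacement value passed to str.replace cannot be pd.NA

This sounds quite unreasonable: for example, replacing strings that satisfy some regex condition with a missing value directly raised an error in that version:

#pd.Series(['A','B'],dtype='string').str.replace(r'[A]',pd.NA) # raises an error
#pd.Series(['A','B'],dtype='O').str.replace(r'[A]',pd.NA) # raises an error

As a workaround, convert to object first and then convert back: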

pd.Series(['A','B'],dtype='string').astype('O').replace(r'[A]',pd.NA,regex=True).astype('string')

0    <NA>
1       B
dtype: string

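As for why replace with regex=True is not applied to the string dtype directly (non-regex replace on string does work), the reason is given in the next item.

(b) Regex replace on a string-dtype Series

In that version, replace could not perform regex replacement on a string-dtype Series; the bug had not been fixed at the time of writing: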

pd.Series(['A','B'],dtype='string').replace(r'[A]','C',regex=True)

0    A
1    B
dtype: string

pd.Series(['A','B'],dtype='O').replace(r'[A]','C',regex=True)

0    C
1    B
dtype: object

(c) A string-dtype Series containing missing values could not use replace

#pd.Series(['A',np.nan],dtype='string').replace('A','B') # raises an error

pd.Series(['A',np.nan],dtype='string').str.replace('A','B')

0       B
1    <NA>
dtype: string

In short: unless you need to set elements to missing values (convert to object and back), use the str.replace method.

4. Substring matching and extraction

4.1 The str.extract method

(a) Common usage

pd.Series(['10-87', '10-88', '10-89'],dtype="string").str.extract(r'([\d]{2})-([\d]{2})')

Use subgroup names as column names:

pd.Series(['10-87', '10-88', '-89'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})-(?P<name_2>[\d]{2})')

Use the ? regex quantifier to make part of the extraction optional:

pd.Series(['10-87', '10-88', '-89'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})?-(?P<name_2>[\d]{2})')

pd.Series(['10-87', '10-88', '10-'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})-(?P<name_2>[\d]{2})?')

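(b) The expand parameter (default True)

For a Series with a single subgroup, expand=False returns a Series; with more than one subgroup, expand has no effect and a DataFrame is always returned.

For an Index with a single subgroup, expand=False returns the extracted Index; with more than one subgroup and expand=False, an error is raised.

s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")
s.index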

Index(['A11', 'B22', 'C33'], dtype='object')

s.str.extract(r'([\w])')

s.str.extract(r'([\w])',expand=False)

A11    a
B22    b
C33    c
dtype: string

s.index.str.extract(r'([\w])')

s.index.str.extract(r'([\w])',expand=False)

Index(['A', 'B', 'C'], dtype='object')

s.index.str.extract(r'([\w])([\d])')

#s.index.str.extract(r'([\w])([\d])',expand=False) # raises an error

4.2 The str.extractall method

Unlike extract, which returns only the first match, extractall finds every match and builds a MultiIndex (even when only one match is found):

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)

s.str.extractall(two_groups)

s['A']='a1'
s.str.extractall(two_groups)

To look at the i-th match, use the xs method:

s = pd.Series(["a1a2", "b1b2", "c1c2"], index=["A", "B", "C"],dtype="string")
s.str.extractall(two_groups).xs(1,level='match')
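The structure of the MultiIndex that extractall builds can be inspected as follows. This sketch is added for illustration and reuses the first example Series; the names all_matches and second are made up.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r'(?P<letter>[a-z])(?P<digit>[0-9])'

# One row per match; the second index level is named 'match'
all_matches = s.str.extractall(two_groups)
print(all_matches.index.tolist())    # [('A', 0), ('A', 1), ('B', 0), ('C', 0)]
print(all_matches.columns.tolist())  # ['letter', 'digit']

# Select the second match (match == 1); only 'A' has one
second = all_matches.xs(1, level='match')
print(second.index.tolist())         # ['A']
print(second.loc['A'].tolist())      # ['a', '2']
```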

4.3 str.contains and str.match

The former checks whether a string contains a given regex pattern:

pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains(r'[0-9][a-z]')

0    False
1     <NA>
2     True
3     True
4     True
dtype: boolean

An optional parameter is na, the value to return for missing entries:

pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains('a', na=False)

0    False
1    False
2     True
3    False
4    False
dtype: boolean

str.match differs in that it relies on Python's re.match, so it checks whether the pattern matches starting from the beginning of the string:

pd.Series(['1', None, '3a_', '3b', '03c'], dtype="string").str.match(r'[0-9][a-z]',na=False)

0    False
1    False
2     True
3     True
4    False
dtype: boolean

pd.Series(['1', None, '_3a', '3b', '03c'], dtype="string").str.match(r'[0-9][a-z]',na=False)

0    False
1    False
2    False
3     True
4    False
dtype: boolean
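For comparison, the sketch below runs one pattern through contains, match, and also str.fullmatch. fullmatch is not covered in the original article; it is included here on the assumption that a reasonably recent pandas (1.1 or later) is installed, and it requires the whole string to match.

```python
import pandas as pd

s = pd.Series(['3a_', '_3a', '3a', None], dtype="string")
pat = r'[0-9][a-z]'

anywhere = s.str.contains(pat, na=False)   # pattern found anywhere
at_start = s.str.match(pat, na=False)      # pattern found at the start
whole = s.str.fullmatch(pat, na=False)     # pattern covers the whole string

print(anywhere.tolist())  # [True, True, True, False]
print(at_start.tolist())  # [True, False, True, False]
print(whole.tolist())     # [False, False, True, False]
```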

5. Common string methods

5.1 Filtering methods

(a) str.strip

Commonly used to strip whitespace:

pd.Series(list('abc'),index=[' space1  ','space2  ','  space3'],dtype="string").index.str.strip()

Index(['space1', 'space2', 'space3'], dtype='object')

(b) str.lower and str.upper

pd.Series('A',dtype="string").str.lower()

0    a
dtype: string

pd.Series('a',dtype="string").str.upper()

0    A
dtype: string

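(c) str.swapcase and str.capitalize

These swap letter case and capitalize the first letter, respectively (capitalize also lowercases the remaining letters):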

pd.Series('abCD',dtype="string").str.swapcase()

0    ABcd
dtype: string

pd.Series('abCD',dtype="string").str.capitalize()

0    Abcd
dtype: string

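5.2 The isnumeric method

Checks whether every character is a numeric character. How, then, would you test whether a value is a number? (Question 2)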

pd.Series(['1.2','1','-0.3','a',np.nan],dtype="string").str.isnumeric()

0    False
1     True
2    False
3    False
4     <NA>
dtype: boolean
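As the output shows, isnumeric rejects decimals and signed values. One possible way to test for "looks like a number", offered only as a sketch and not as the article's intended answer, is a regex with str.match; the pattern below is an assumption that covers an optional sign and plain decimals but not exponents or thousands separators.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1.2', '1', '-0.3', 'a', np.nan], dtype="string")

# Optional sign, then digits with an optional fractional part, or a bare
# fraction such as '.5'; '$' makes the pattern cover the whole string
number_pat = r'[+-]?(?:\d+(?:\.\d*)?|\.\d+)$'
looks_numeric = s.str.match(number_pat, na=False)

print(looks_numeric.tolist())  # [True, True, True, False, False]
```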

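6. Questions and exercises

6.1 Questions

[Question 1] What is the difference between methods on the str accessor and methods on df/Series objects?

[Question 2] Given a string-dtype column, how can you tell whether a cell holds numeric data?

[Question 3] What does the rsplit method do? In what situations is it appropriate?

[Question 4] Sections 2 to 4 of this chapter introduced 5 kinds of string operations; think about which scenarios each one is suited to.

6.2 Exercises

[Exercise 1] Given a dataset about strings, solve the following:

(a) Encode each person's information as a string (adding an ID column), using the format 「×××（名字）：×國人，性別×，生於×年×月×日」, i.e. "××× (name): national of country ×, sex ×, born ×年×月×日 (year, month, day)".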

# Method 1
ex1_ori = pd.read_csv('data/String_data_one.csv',index_col='人員編號')
ex1_ori.head()

          姓名      國籍  性別  出生年   出生月  出生日
人員編號
1         aesfd   2     男    1942   8      10
2         fasefa  5     女    1985   10     4
3         aeagd   4     女    1946   10     15
4         aef     4     男    1999   5      13
5         eaf     1     女    2010   6      24

ex1 = ex1_ori.copy()
ex1['冒號'] = '：'
ex1['逗號'] = '，'
ex1['國人'] = '國人'
ex1['性別2'] = '性別'
ex1['生於'] = '生於'
ex1['年'] = '年'
ex1['月'] = '月'
ex1['日'] = '日'
ID = ex1['姓名'].str.cat([ex1['冒號'],
                   ex1['國籍'].astype('str'),
                   ex1['國人'],
                   ex1['逗號'],
                   ex1['性別2'],
                   ex1['性別'],
                   ex1['逗號'],
                   ex1['生於'],
                   ex1['出生年'].astype('str'),
                   ex1['年'],
                   ex1['出生月'].astype('str'),
                   ex1['月'],
                   ex1['出生日'].astype('str'),
                   ex1['日']
                  ])
ex1_ori['ID'] = ID
ex1_ori

          姓名      國籍  性別  出生年   出生月  出生日  ID
人員編號
1         aesfd   2     男    1942   8      10     aesfd：2國人，性別男，生於1942年8月10日
2         fasefa  5     女    1985   10     4      fasefa：5國人，性別女，生於1985年10月4日
3         aeagd   4     女    1946   10     15     aeagd：4國人，性別女，生於1946年10月15日
4         aef     4     男    1999   5      13     aef：4國人，性別男，生於1999年5月13日
5         eaf     1     女    2010   6      24     eaf：1國人，性別女，生於2010年6月24日

(b) Change the birthday part of the ID from (a) to Chinese numerals (e.g. 一九七四年十月二十三日), keeping the rest of the format unchanged.

(c) Split the ID column from (b) back into the corresponding columns of the original table, and use equals to check whether they match.

# Reference answer (df_new is the result of part (b), which is not shown here)
dic_year = {i[0]:i[1] for i in zip(list('零一二三四五六七八九'),list('0123456789'))}
dic_two = {i[0]:i[1] for i in zip(list('十一二三四五六七八九'),list('0123456789'))}
dic_one = {'十':'1','二十':'2','三十':'3',None:''}
df_res = df_new['ID'].str.extract(r'(?P<姓名>[a-zA-Z]+):(?P<國籍>[\d])國人，性別(?P<性別>[\w])，生於(?P<出生年>[\w]{4})年(?P<出生月>[\w]+)月(?P<出生日>[\w]+)日')
# Callable replacements require regex=True in newer pandas versions
df_res['出生年'] = df_res['出生年'].str.replace(r'(\w)+',lambda x:''.join([dic_year[x.group(0)[i]] for i in range(4)]),regex=True)
df_res['出生月'] = df_res['出生月'].str.replace(r'(?P<one>\w?十)?(?P<two>[\w])',lambda x:dic_one[x.group('one')]+dic_two[x.group('two')],regex=True).str.replace(r'0','10')
df_res['出生日'] = df_res['出生日'].str.replace(r'(?P<one>\w?十)?(?P<two>[\w])',lambda x:dic_one[x.group('one')]+dic_two[x.group('two')],regex=True).str.replace(r'^0','10',regex=True)
df_res.head()

          姓名      國籍  性別  出生年   出生月  出生日
人員編號
1         aesfd   2     男    1942   8      10
2         fasefa  5     女    1985   10     4
3         aeagd   4     女    1946   10     15
4         aef     4     男    1999   5      13
5         eaf     1     女    2010   6      24

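[Exercise 2] Given a semi-synthetic dataset whose first column contains news headlines about the novel coronavirus, solve the following:

(a) Select all rows whose headline is about Beijing or Shanghai.

(b) Compute the mean of col2.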

ex2.col2.str.rstrip('-`').str.lstrip('/').astype(float).mean()

-0.984

(c) Compute the mean of col3.

ex2.columns = ex2.columns.str.strip(' ')
# Used to find dirty data
def is_number(x):
    try:
        float(x)
        return True
    except (SyntaxError, ValueError) as e:
        return False
ex2[~ex2.col3.map(is_number)]

ex2.col3.str.replace(r'[`\\{]', '', regex=True).astype(float).mean()
