Learning the Python Pandas Library (2): Reading and Writing Data

Date: 2019-11-26


1. I/O API Tools

Reader function	Writer function
read_csv	to_csv
read_excel	to_excel
read_hdf	to_hdf
read_sql	to_sql
read_json	to_json
read_html	to_html
read_stata	to_stata
read_clipboard	to_clipboard
read_pickle	to_pickle
read_msgpack	to_msgpack
read_gbq	to_gbq
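
The pairing in the table can be tried with any reader/writer pair. Below is a minimal sketch using to_pickle and read_pickle; the temporary path and the small frame are made up for illustration and are not part of the original examples.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'red': [0, 4], 'blue': [1, 5]}, index=['a', 'b'])

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'frame.pkl')

frame.to_pickle(path)            # writer: a DataFrame method
restored = pd.read_pickle(path)  # reader: a top-level pandas function
print(restored.equals(frame))
```

Writers are methods on the DataFrame, while readers are functions in the pandas namespace that return a new object.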

2. Reading and Writing CSV Files

A file in which the elements of each line are separated by commas is called a CSV file.

2.1. Reading Data from CSV

Simple read

excited.csv

white,read,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

code.py

>>> import pandas as pd
>>> csvframe = pd.read_csv('E:\\Python\\Codes\\excited.csv')
>>> csvframe
   white  read  blue  green animal
0      1     5     2      3    cat
1      2     7     8      5    dog
2      3     3     6      7  horse
3      2     2     8      3   duck
4      4     4     2      1  mouse
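
To try the same read without a file on disk, the CSV text can be wrapped in io.StringIO. This is only a sketch: the in-memory string stands in for excited.csv.

```python
import io

import pandas as pd

# Same content as excited.csv above, held in memory instead of on disk.
csv_text = """white,read,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
"""

csvframe = pd.read_csv(io.StringIO(csv_text))
print(csvframe.shape)           # (rows, columns)
print(list(csvframe.columns))   # header taken from the first line
print(csvframe.dtypes)          # the numeric columns are inferred as integers
```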

Specifying the header with header and names

excited.csv

1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse

code.py

>>> csvframe = pd.read_csv('E:\\Python\\Codes\\excited.csv', header=None)
>>> csvframe
   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse

>>> csvframe = pd.read_csv('E:\\Python\\Codes\\excited.csv', names=['white', 'red', 'blue', 'green', 'animal'])
>>> csvframe
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse
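
A related case is a file that already has a header row you want to replace. The sketch below, using made-up in-memory data, compares the two situations: names alone for a headerless file, and header=0 together with names to overwrite an existing header.

```python
import io

import pandas as pd

no_header = "1,5,2,3,cat\n2,7,8,5,dog\n"
with_header = "w,r,b,g,a\n" + no_header   # hypothetical short header row
names = ['white', 'red', 'blue', 'green', 'animal']

# No header row in the file: passing names alone is enough.
a = pd.read_csv(io.StringIO(no_header), names=names)

# Header row present but unwanted: header=0 marks it as the row that names replaces.
b = pd.read_csv(io.StringIO(with_header), header=0, names=names)

print(a.equals(b))
```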

Building a DataFrame with a hierarchical index

excited.csv

color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4

code.py

>>> csvframe = pd.read_csv('E:\\Python\\Codes\\excited.csv', index_col=['color', 'status'])
>>> csvframe
              item1  item2  item3
color status                     
black up          3      4      6
      down        2      6      7
white up          5      5      5
      down        3      3      2
      left        1      2      1
red   up          2      2      2
      down        1      1      4
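
Once the two columns form a hierarchical index, rows can be selected by the outer level alone or by a full (color, status) pair. A minimal sketch on the same data, read from memory; the sort_index() call is an addition here so that lookups work on a sorted index.

```python
import io

import pandas as pd

csv_text = """color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4
"""

csvframe = pd.read_csv(io.StringIO(csv_text), index_col=['color', 'status'])
csvframe = csvframe.sort_index()

white_rows = csvframe.loc['white']               # every status under 'white'
value = csvframe.loc[('red', 'down'), 'item3']   # one cell by full key

print(white_rows)
print(value)
```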

2.2. Writing Data to CSV

Simple write

code.py

>>> import numpy as np
>>> frame = pd.DataFrame(np.arange(16).reshape((4,4)), columns = ['red', 'blue', 'orange', 'black'], index = ['a', 'b', 'c', 'd'])
>>> frame
   red  blue  orange  black
a    0     1       2      3
b    4     5       6      7
c    8     9      10     11
d   12    13      14     15
>>> frame.to_csv('E:\\Python\\Codes\\excited.csv')

excited.csv

,red,blue,orange,black
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15

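Notice that the first line starts with a ','. The first column holds the index, and the index has no name, so its header field is empty.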
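
If the empty leading field is unwanted, to_csv accepts an index_label, and read_csv can turn the first column back into the index with index_col=0. The following is a sketch; the label 'id' is an arbitrary choice for illustration.

```python
import io

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     columns=['red', 'blue', 'orange', 'black'],
                     index=['a', 'b', 'c', 'd'])

# With no path argument, to_csv returns the CSV text as a string.
plain = frame.to_csv()
labelled = frame.to_csv(index_label='id')   # 'id' is a made-up label

# index_col=0 restores the first column as the index when reading back.
restored = pd.read_csv(io.StringIO(plain), index_col=0)

print(plain.splitlines()[0])
print(labelled.splitlines()[0])
print(restored)
```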

Writing without the index and the column names

code.py

>>> frame.to_csv('E:\\Python\\Codes\\excited.csv', index = False, header = False)

excited.csv

0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15

Handling NaN values

code.py

>>> frame = pd.DataFrame([[3, 2, np.nan], [np.nan, np.nan, np.nan], [2, 3, 3]], index = ['a', 'b', 'c'], columns = ['red', 'black', 'orange'])
>>> frame
   red  black  orange
a  3.0    2.0     NaN
b  NaN    NaN     NaN
c  2.0    3.0     3.0
>>> frame.to_csv('E:\\Python\\Codes\\excited.csv')

Use the na_rep parameter to replace the empty fields:
>>> frame.to_csv('E:\\Python\\Codes\\excited.csv', na_rep = 'lalala')

excited.csv

,red,black,orange
a,3.0,2.0,
b,,,
c,2.0,3.0,3.0
Notice that every NaN is written as an empty field.

After replacement with na_rep = 'lalala':
,red,black,orange
a,3.0,2.0,lalala
b,lalala,lalala,lalala
c,2.0,3.0,3.0
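The first field of the header row is still empty. It is not a NaN: it is the header of the index column, which has no name, so na_rep does not apply to it.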
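
When such a file is read back, the placeholder would normally come in as an ordinary string. A sketch of the round trip, assuming the same placeholder 'lalala' and using the na_values parameter of read_csv to map it back to NaN:

```python
import io

import numpy as np
import pandas as pd

frame = pd.DataFrame([[3, 2, np.nan], [np.nan, np.nan, np.nan], [2, 3, 3]],
                     index=['a', 'b', 'c'],
                     columns=['red', 'black', 'orange'])

text = frame.to_csv(na_rep='lalala')

# na_values tells the reader which strings should be treated as missing.
back = pd.read_csv(io.StringIO(text), index_col=0, na_values=['lalala'])

print(text)
print(back)
```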

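3. Reading and Writing TXT Files

TXT files are not necessarily delimited by commas or semicolons. In that case, use a regular expression as the separator, usually with a quantifier: '*' matches zero or more repetitions and '+' matches one or more.

For example, '\s*' (any amount of whitespace, including none) or '\s+' (at least one whitespace character).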

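Symbol	Meaning
.	any single character except a newline
\d	digit
\D	non-digit character
\s	whitespace character
\S	non-whitespace character
\n	newline
\t	tab
\uxxxx	Unicode character given by the hexadecimal number xxxx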
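
The difference between the two quantifiers can be checked with Python's re module alone. The sample line below is made up to resemble the irregular spacing used in the following examples.

```python
import re

line = " 1   5 \t2 3"   # leading space, runs of spaces, and a tab

# '+' requires at least one whitespace character per separator.
fields = re.split(r'\s+', line.strip())
print(fields)

# '*' also accepts zero characters, so '\s*' matches the empty string.
matches_empty_star = re.fullmatch(r'\s*', '') is not None
matches_empty_plus = re.fullmatch(r'\s+', '') is not None
print(matches_empty_star, matches_empty_plus)
```

Because '\s*' can match nothing at all, '\s+' is usually the safer choice for a whitespace separator.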

3.1. Reading Data from TXT

Simple read

excited.txt

(spaces and tabs added at random)
white red blue green
 1   5 2 3
2 7  8   5
2 3 3 3

code.py

>>> pd.read_table('E:\\Python\\Codes\\excited.txt', sep = '\s*')
__main__:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
E:\Python\Python3\lib\site-packages\pandas\io\parsers.py:2137: FutureWarning: split() requires a non-empty pattern match.
  yield pat.split(line.strip())
E:\Python\Python3\lib\site-packages\pandas\io\parsers.py:2139: FutureWarning: split() requires a non-empty pattern match.
  yield pat.split(line.strip())
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      2    3     3      3
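The first attempt produced warnings. Adding engine = 'python', as the message suggests, removes the ParserWarning; using '\s+' instead of '\s*' avoids the FutureWarning, because '\s*' can also match an empty string.

>>> pd.read_table('E:\\Python\\Codes\\excited.txt', sep = r'\s+', engine = 'python')
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      2    3     3      3
It worked. In a regular expression, '*' matches zero or more repetitions and '+' matches one or more.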

Excluding some lines while reading

excited.txt

12#$@!%$!$#!@$!@$!@
#$%^$^%$#!
@#%!
white red blue green
!$#$!@$#!@$
 1   5 2 3
2 7  8   5
2 3 3 3
^&##$^@FGSDQAS

code.py

>>> pd.read_table('E:\\Python\\Codes\\excited.txt', sep = r'\s+', engine = 'python', skiprows = [0, 1, 2, 4, 8])
   white  red  blue  green
0      1    5     2      3
1      2    7     8      5
2      2    3     3      3
The list gives the 0-based line numbers of the lines to skip.
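
The same read can be reproduced from memory. This sketch uses r'\s+' as the separator and keeps engine='python' as in the session above; the junk lines are copied from the sample file.

```python
import io

import pandas as pd

text = """12#$@!%$!$#!@$!@$!@
#$%^$^%$#!
@#%!
white red blue green
!$#$!@$#!@$
 1   5 2 3
2 7  8   5
2 3 3 3
^&##$^@FGSDQAS
"""

# Lines 0, 1, 2, 4 and 8 (0-based) are junk; line 3 becomes the header.
frame = pd.read_csv(io.StringIO(text), sep=r'\s+', engine='python',
                    skiprows=[0, 1, 2, 4, 8])
print(frame)
```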

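Reading part of the data

It turns out that sep can be used with read_csv as well. nrows sets how many rows of data to read; for example, nrows=3 reads 3 rows.

chunksize splits the file into chunks; chunksize=3 means each chunk holds 3 rows.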

excited.txt

white red blue green black orange golden
 1   5 2 3 111 222 233
100 7    8   5 2333 23333 233333
20 3 3 3 12222 1222 23232
2000 7   8   5 2333 23333 233333
300 3 3 3 12222 1222 23232

code.py

>>> frame = pd.read_csv('E:\\Python\\Codes\\excited.txt', sep = r'\s+', skiprows=[2], nrows = 3, engine = 'python')
>>> frame
   white  red  blue  green  black  orange  golden
0      1    5     2      3    111     222     233
1     20    3     3      3  12222    1222   23232
2   2000    7     8      5   2333   23333  233333
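nrows = 3 reads three data rows from the start, and skiprows=[2] skips the third line of the file (0-based line 2, the row starting with 100).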

>>> pieces = pd.read_csv('E:\\Python\\Codes\\excited.txt', sep = r'\s+', chunksize = 2, engine = 'python')
>>> for piece in pieces:
...   print (piece)
...   print (type(piece))
... 
   white  red  blue  green  black  orange  golden
0      1    5     2      3    111     222     233
1    100    7     8      5   2333   23333  233333
<class 'pandas.core.frame.DataFrame'>
   white  red  blue  green  black  orange  golden
2     20    3     3      3  12222    1222   23232
3   2000    7     8      5   2333   23333  233333
<class 'pandas.core.frame.DataFrame'>
   white  red  blue  green  black  orange  golden
4    300    3     3      3  12222    1222   23232
<class 'pandas.core.frame.DataFrame'>
Each chunk holds two rows (the last one holds the remainder), and every chunk is a DataFrame.
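
Chunked reading is mostly useful when a result can be accumulated piece by piece instead of loading the whole file. A sketch on the same sample data, summing one column across chunks:

```python
import io

import pandas as pd

text = """white red blue green black orange golden
 1   5 2 3 111 222 233
100 7    8   5 2333 23333 233333
20 3 3 3 12222 1222 23232
2000 7   8   5 2333 23333 233333
300 3 3 3 12222 1222 23232
"""

sizes = []
total = 0
for piece in pd.read_csv(io.StringIO(text), sep=r'\s+', chunksize=2,
                         engine='python'):
    sizes.append(len(piece))          # rows in this chunk
    total += piece['white'].sum()     # running sum over all chunks

print(sizes)
print(total)
```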

3.2. Writing Data to TXT

Writing data works the same way as for CSV.
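
The only difference is usually the separator. A minimal sketch, assuming a space-delimited target format and a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'white': [1, 2], 'red': [5, 7]})

# sep chooses the delimiter; index=False leaves the row labels out.
text = frame.to_csv(sep=' ', index=False)
print(text)
```

Passing a file path as the first argument writes the same text to disk instead of returning it.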

4. Reading and Writing HTML Files

4.1. Writing Data to an HTML File

First, a look at the to_html() method

code.py

>>> frame
   white  red  blue  green  black  orange  golden
0      1    5     2      3    111     222     233
1    100    7     8      5   2333   23333  233333
2     20    3     3      3  12222    1222   23232
3   2000    7     8      5   2333   23333  233333
4    300    3     3      3  12222    1222   23232
>>> print(frame.to_html())
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>white</th>
      <th>red</th>
      <th>blue</th>
      <th>green</th>
      <th>black</th>
      <th>orange</th>
      <th>golden</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>5</td>
      <td>2</td>
      <td>3</td>
      <td>111</td>
      <td>222</td>
      <td>233</td>
    </tr>
    <tr>
      <th>1</th>
      <td>100</td>
      <td>7</td>
      <td>8</td>
      <td>5</td>
      <td>2333</td>
      <td>23333</td>
      <td>233333</td>
    </tr>
    <tr>
      <th>2</th>
      <td>20</td>
      <td>3</td>
      <td>3</td>
      <td>3</td>
      <td>12222</td>
      <td>1222</td>
      <td>23232</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2000</td>
      <td>7</td>
      <td>8</td>
      <td>5</td>
      <td>2333</td>
      <td>23333</td>
      <td>233333</td>
    </tr>
    <tr>
      <th>4</th>
      <td>300</td>
      <td>3</td>
      <td>3</td>
      <td>3</td>
      <td>12222</td>
      <td>1222</td>
      <td>23232</td>
    </tr>
  </tbody>
</table>

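DataFrame.to_html() converts a DataFrame directly into an HTML table. So to turn a DataFrame into an HTML file that can be opened in a browser, we only need to wrap that table in a little extra markup.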

code.py

>>> s = ['<HTML>']
>>> s.append('<HEAD><TITLE>DataFrame</TITLE></HEAD>')
>>> s.append('<BODY>')
>>> s.append(frame.to_html())
>>> s.append('</BODY></HTML>')
>>> html = ''.join(s)
>>> html_file = open('E:\\Python\\Codes\\DataFrame.html', 'w')
>>> html_file.write(html)
1193
>>> html_file.close()

DataFrame.html

	white	red	blue	green	black	orange	golden
0	1	5	2	3	111	222	233
1	100	7	8	5	2333	23333	233333
2	20	3	3	3	12222	1222	23232
3	2000	7	8	5	2333	23333	233333
4	300	3	3	3	12222	1222	23232
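
The same steps can be written as a script, with a with block so the file is closed automatically. This is a sketch: the temporary directory and the two-row frame are stand-ins chosen for illustration.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'white': [1, 100], 'red': [5, 7]})

# Wrap the generated table in a minimal page skeleton.
parts = ['<html>',
         '<head><title>DataFrame</title></head>',
         '<body>',
         frame.to_html(),
         '</body></html>']
html = '\n'.join(parts)

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'DataFrame.html')
with open(path, 'w', encoding='utf-8') as html_file:
    html_file.write(html)

print(path)
```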

4.2. Reading Data from an HTML File

read_html() returns every table on the page, so the result is a list of DataFrames.

code.py

Reading the file written in the previous example:
>>> web_frames = pd.read_html('E:\\Python\\Codes\\DataFrame.html')
>>> for web_frame in web_frames:
...   print (web_frame)
... 
   Unnamed: 0  white  red  blue  green  black  orange  golden
0           0      1    5     2      3    111     222     233
1           1    100    7     8      5   2333   23333  233333
2           2     20    3     3      3  12222    1222   23232
3           3   2000    7     8      5   2333   23333  233333
4           4    300    3     3      3  12222    1222   23232

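Best of all, read_html() also accepts a URL and parses and extracts the tables from the web page directly.

So I tried it on the episode list of Your Lie in April (四月是你的謊言) on Baidu Baike.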

code.py

>>> favors = pd.read_html('http://baike.baidu.com/item/%E5%9B%9B%E6%9C%88%E6%98%AF%E4%BD%A0%E7%9A%84%E8%B0%8E%E8%A8%80/13382872#viewPageContent')
>>> now = favors[0].copy()
>>> now = now.set_index(0)
>>> now.columns = now.loc['話']
>>> now.index.name = None
>>> now.drop('話')
話             標題(日/中)               劇本  \
1    モノトーン・カラフル 單調·多彩          吉 岡 孝 夫   
2             友人A 友人A             石黑恭平   
3             春の中 春光裏              神戶守   
4              旅立ち 啓程  巖田和也 河野亞矢子 石黑恭平   
5          どんてんもよう 陰天             石濱真史   
6              帰り道 歸途             井端義秀   
7         カゲささやく 暗影低語              神戶守   
8               響け 迴響             後藤圭二   
9               共鳴 共鳴              神戶守   
10     君といた景色 與你共賞的景色             中村章子   
11           命の燈 生命之光             朝倉海鬥   
12  トゥインクル リトルスター 小星星              神戶守   
13         愛の悲しみ 愛的憂傷             倉田綾子   
14              足跡 足跡             柴山智隆   
15            うそつき 騙子              神戶守   
16        似たもの同士 類似的人             黑木美幸   
17          トワイライト 暮光              神戶守   
18          心重ねる 心心相印             石井俊匡   
19     さよならヒーロー 再見了英雄             井端義秀   
20            手と手 手與手              神戶守   
21                雪 雪        倉田綾子 柴山智隆   
22              春風 春風             石黑恭平   
23            MOMENTS             巖田和也   

話                                         分鏡  \
1                                       石黑恭平   
2                                       原田孝宏   
3                                       巖田和也   
4   三木俊明 河合拓也 牧田昌也 野野下伊織 山田慎也 菅井愛明 小泉初榮 淺賀和行   
5                                  石濱真史 小島崇史   
6                                      野野下伊織   
7                                       間島崇寬   
8                                       高橋英俊   
9                                       黑木美幸   
10                                      原田孝宏   
11                                 石黑恭平 川越崇弘   
12                                      福島利規   
13                                     野野下伊織   
14                                      小泉初榮   
15                                       矢島武   
16                 山田真也 野野下伊織 小泉初榮 三木俊明 淺賀和行   
17                                     河野亞矢子   
18                                      河合拓也   
19                                       こさや   
20                                       矢島武   
21            野野下伊織 小泉初榮 門之園惠美 高野綾 河合拓也 山田真也   
22                                 石黑恭平 黑木美幸   
23                      愛敬由紀子 奧田佳子 山田真也 伊藤香織   

話                                                  演出       做畫監督 演奏 做畫監督 總做畫監督  
1                                               愛敬由紀子       淺賀和行       -   NaN  
2                                           三木俊明 小林惠祐      愛敬由紀子     NaN   NaN  
3                                                河合拓也        NaN     NaN   NaN  
4                                           淺賀和行 倉田綾子  愛敬由紀子 高野綾     NaN   NaN  
5                                                小島崇史          -   愛敬由紀子   NaN  
6                                                淺賀和行        NaN     NaN   NaN  
7                                                山田真也          -     NaN   NaN  
8                                                河合拓也       淺賀和行     NaN   NaN  
9                                                小泉初榮        NaN     NaN   NaN  
10                                                高野綾        NaN     NaN   NaN  
11                                           山下惠 中野彰子          -     NaN   NaN  
12                                               長森佳容       淺賀和行     NaN   NaN  
13                                                NaN        NaN     NaN   NaN  
14                                                  -        NaN     NaN   NaN  
15                  北島勇樹 山下惠 C Company NAMU Animation       淺賀和行     NaN   NaN  
16                                                  -        高野綾     NaN   NaN  
17                                           三木俊明 高田晃       淺賀和行   愛敬由紀子   NaN  
18                                                NaN        NaN     NaN   NaN  
19                           小泉初榮 野野下伊織 高野綾 山田真也 河合拓也        NaN     NaN   NaN  
20  野野下伊織 小泉初榮 河合拓也 山田真也 高野綾 薗部愛子 奧田佳子 加藤萬由子 高田晃 藪本和彥        NaN     NaN   NaN  
21                                                NaN        NaN     NaN   NaN  
22       奧田桂子 河合拓也 野野下伊織 高野綾 小泉初榮 伊藤香織 淺賀和行 高田晃 愛敬由紀子        NaN     NaN   NaN  
23                                                NaN        NaN     NaN   NaN

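Very powerful. But because the table was offset by one row, with the header landing in the first data row, it took quite a while to get it to display properly.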
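
The header fix can be tried without fetching the page. In the sketch below, a small made-up frame stands in for favors[0]: like the scraped table, its real header sits in the first data row and its columns are numbered.

```python
import pandas as pd

# Placeholder data shaped like the scraped table (not the real page content).
raw = pd.DataFrame([['話', '標題', '劇本'],
                    ['1', 'A', 'X'],
                    ['2', 'B', 'Y']])

now = raw.copy()
now.columns = now.iloc[0]    # the first row holds the real header
now = now.iloc[1:]           # drop that row from the data
now = now.set_index('話')    # use the episode number as the index
now.columns.name = None      # clear the leftover label from the header row

print(now)
```

read_html() also has a header parameter, so passing header=0 may make this manual step unnecessary, depending on how the page marks up its table.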