[Pandas] 04 - Efficient I/O

Date: 2019-11-18


SQLite3 interface to Array

Loading data from a database into a DataFrame or NumPy array.

Calling the SQLite3 database

import sqlite3 as sq3
path = './data/'  # assumption: the original never defines path; adjust as needed
query = 'CREATE TABLE numbs (Date date, No1 real, No2 real)'

con = sq3.connect(path + 'numbs.db')
con.execute(query)
con.commit()

The commit command

COMMIT is the transaction command that saves the changes made by a transaction to the database.

COMMIT saves every change made since the last COMMIT or ROLLBACK.
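
To make the COMMIT/ROLLBACK behaviour concrete, here is a minimal sketch using an in-memory database. The table name demo and the inserted values are made up for illustration and are not part of the original example.

```python
import sqlite3

# In-memory database, so nothing is written to disk (illustrative only)
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE demo (No1 real, No2 real)')

# This insert is made permanent by commit()
con.execute('INSERT INTO demo VALUES (0.12, 7.3)')
con.commit()

# This insert is discarded, because rollback() undoes everything
# since the last commit
con.execute('INSERT INTO demo VALUES (0.98, -0.02)')
con.rollback()

rows_after = con.execute('SELECT * FROM demo').fetchall()
print(rows_after)
```

Only the committed row should remain in the table after the rollback.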

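Handling return values

To return all rows, use fetchall(); fetchmany(n) returns at most n rows, and fetchone() returns one row at a time.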

con.execute('SELECT * FROM numbs').fetchmany(10)

pointer = con.execute('SELECT * FROM numbs')
for i in range(3):
    print(pointer.fetchone())


Output:
-------------------------------------------------
('2017-11-18 11:18:51.443295', 0.12, 7.3)
('2017-11-18 11:18:51.466328', 0.9791, -0.01914)
('2017-11-18 11:18:51.466580', -0.88736, 0.19104)
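
The three fetch methods share one cursor position, which is easy to miss. The sketch below assumes a small throwaway table named demo (hypothetical, not from the original) and shows how each call continues where the previous one stopped.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE demo (n integer)')
con.executemany('INSERT INTO demo VALUES (?)', [(i,) for i in range(5)])
con.commit()

cursor = con.execute('SELECT n FROM demo ORDER BY n')

first = cursor.fetchone()        # one row, as a tuple
next_two = cursor.fetchmany(2)   # the following two rows, as a list
rest = cursor.fetchall()         # everything that is left
exhausted = cursor.fetchone()    # None once the cursor is used up
```

Because the cursor is consumed as it is read, rerun the query if the rows are needed a second time.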

Saving to NumPy

Step 1: pass the fetched rows directly to np.array; the conversion happens on construction.

import numpy as np

# assumes a table named numbers with real columns No1 and No2 already exists
query = 'SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0'

res = np.array(con.execute(query).fetchall()).round(3)

Step 2: visualize the data after thinning it out, that is, plot only a subset of the points.

res = res[::100]  # every 100th result

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(res[:, 0], res[:, 1], 'ro')

plt.grid(True); 
plt.xlim(-0.5, 4.5); 
plt.ylim(-4.5, 0.5)
# tag: scatter_query
# title: Plot of the query result
# size: 60
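
The conversion and thinning steps above can be tried without the plot. This is a sketch under assumptions: the numbers table is built here in memory from random data, since the original does not show how it was filled.

```python
import sqlite3

import numpy as np

# Build a stand-in numbers table with two real columns (assumed layout)
rng = np.random.default_rng(0)
values = rng.standard_normal((10_000, 2))

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE numbers (No1 real, No2 real)')
con.executemany('INSERT INTO numbers VALUES (?, ?)', values.tolist())
con.commit()

# Step 1: fetch the matching rows and convert them to a 2-D float array
query = 'SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0'
res = np.array(con.execute(query).fetchall()).round(3)

# Step 2: keep every 100th row to reduce the number of plotted points
thinned = res[::100]
print(res.shape, thinned.shape)
```

Slicing with a step returns a view, so thinning costs almost nothing even for a large result.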

SQLite3 interface to DataFrame

Reading the whole table

A single table usually fits in memory, so reading all of it at once is nothing to avoid.

import sqlite3 as sq3
import pandas as pd

filename = path + 'numbs'
con = sq3.Connection(filename + '.db')    

%time data = pd.read_sql('SELECT * FROM numbers', con)
data.head()
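
For readers without the numbs.db file, the same read can be sketched against an in-memory database. The table contents are random stand-ins; only pd.read_sql with a sqlite3 connection is taken from the text above.

```python
import sqlite3

import numpy as np
import pandas as pd

# Stand-in for the numbers table (assumed two real columns)
rng = np.random.default_rng(0)
values = rng.standard_normal((1_000, 2))

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE numbers (No1 real, No2 real)')
con.executemany('INSERT INTO numbers VALUES (?, ?)', values.tolist())
con.commit()

# Read the whole table into a DataFrame in a single call
data = pd.read_sql('SELECT * FROM numbers', con)
print(data.head())
```

The column names of the DataFrame come from the SQL result set, so they match the table definition.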

Table operations

In practice these are ndarray-style operations.

AND condition

%time data[(data['No1'] > 0) & (data['No2'] < 0)].head()

OR condition

%%time
res = data[['No1', 'No2']][((data['No1'] > 0.5) | (data['No1'] < -0.5))
                     & ((data['No2'] < -1) | (data['No2'] > 1))]
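
The two filters are easier to follow on a handful of rows. The values below are made up for illustration; the masks are the same as in the examples above.

```python
import pandas as pd

# Five hand-picked rows (illustrative values only)
data = pd.DataFrame({'No1': [0.8, -0.9, 0.2, 0.7, -0.6],
                     'No2': [-1.5, 1.2, 2.0, 0.3, -0.4]})

# AND: both comparisons must hold for a row to be kept
and_mask = (data['No1'] > 0) & (data['No2'] < 0)

# OR inside each group, AND between the groups
or_mask = (((data['No1'] > 0.5) | (data['No1'] < -0.5))
           & ((data['No2'] < -1) | (data['No2'] > 1)))

print(data[and_mask])
print(data[or_mask])
```

The parentheses around each comparison are required: & and | bind more tightly than > and <, so leaving them out changes the meaning or raises an error.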

Fast I/O with PyTables

PyTables builds on the HDF5 database/file standard.

Creating a large table without compression

Table definition

import numpy as np
import tables as tb
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline

filename = './data/tab.h5'
h5 = tb.open_file(filename, 'w') 

# number of rows: use many to build a large table
rows = 2000000

# column definitions
row_des = {
    'Date': tb.StringCol(26, pos=1),
    'No1': tb.IntCol(pos=2),
    'No2': tb.IntCol(pos=3),
    'No3': tb.Float64Col(pos=4),
    'No4': tb.Float64Col(pos=5)
}

Creating the table

filters = tb.Filters(complevel=0)  # no compression

tab = h5.create_table('/', 'ints_floats', row_des,
                      title='Integers and Floats',
                      expectedrows=rows, filters=filters)

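Adding data

At this point the table is still empty; rows added to it are buffered in memory until they are flushed.

(1) Get the table's Row object, which is used to fill in one row at a time.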

pointer = tab.row

(2) Generate random numbers to fill the table.

ran_int = np.random.randint(0, 10000, size=(rows, 2))
ran_flo = np.random.standard_normal((rows, 2)).round(5)

(3) Assign the values to the table.

The traditional approach uses a tedious loop.

%%time
for i in range(rows):
    pointer['Date'] = dt.datetime.now()
    pointer['No1']  = ran_int[i, 0]
    pointer['No2']  = ran_int[i, 1] 
    pointer['No3']  = ran_flo[i, 0]
    pointer['No4']  = ran_flo[i, 1] 
    pointer.append()
      # this appends the data and
      # moves the pointer one row forward

tab.flush()  # equivalent to the commit command in SQLite3

The array approach avoids the loop.

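# sarray is not defined in the original; a structured array matching row_des is assumed
dty = np.dtype([('Date', 'S26'), ('No1', '<i4'), ('No2', '<i4'),
                ('No3', '<f8'), ('No4', '<f8')])
sarray = np.zeros(len(ran_int), dtype=dty)

%%time
sarray['Date'] = dt.datetime.now()
sarray['No1'] = ran_int[:, 0]
sarray['No2'] = ran_int[:, 1]
sarray['No3'] = ran_flo[:, 0]
sarray['No4'] = ran_flo[:, 1]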
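
To show that the loop and the array approach fill the same data, here is a NumPy-only sketch. It uses a plain structured array in place of the PyTables table, a fixed timestamp so both results are comparable, and a small row count; all of these are assumptions made for illustration.

```python
import datetime as dt

import numpy as np

rows = 1_000  # small on purpose; the text above uses 2,000,000
rng = np.random.default_rng(0)
ran_int = rng.integers(0, 10000, size=(rows, 2))
ran_flo = rng.standard_normal((rows, 2)).round(5)

# Structured dtype mirroring the column layout of row_des
dty = np.dtype([('Date', 'S26'), ('No1', '<i4'), ('No2', '<i4'),
                ('No3', '<f8'), ('No4', '<f8')])
stamp = dt.datetime.now().isoformat().encode('ascii')

# Loop approach: one row at a time
looped = np.zeros(rows, dtype=dty)
for i in range(rows):
    looped[i] = (stamp, ran_int[i, 0], ran_int[i, 1],
                 ran_flo[i, 0], ran_flo[i, 1])

# Array approach: one assignment per column
vectorized = np.zeros(rows, dtype=dty)
vectorized['Date'] = stamp
vectorized['No1'] = ran_int[:, 0]
vectorized['No2'] = ran_int[:, 1]
vectorized['No3'] = ran_flo[:, 0]
vectorized['No4'] = ran_flo[:, 1]

same = all(np.array_equal(looped[name], vectorized[name])
           for name in dty.names)
print(same)
```

The column-wise version does the same work in five assignments, which is where the speed-up over the row loop comes from.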

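Creating a large table with compression

Creating the compressed table

Because sarray already holds the data, it is written to the file as soon as the table is created.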

filename = './data/tab.h5c'
h5c = tb.open_file(filename, 'w') 
filters = tb.Filters(complevel=4, complib='blosc')

tabc = h5c.create_table('/', 'ints_floats', sarray,
                        title='Integers and Floats',
                        expectedrows=rows, filters=filters)

Reading as ndarray

Read the table into memory; the result is a numpy.ndarray.

%time arr_com = tabc.read()
h5c.close()
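
The write and read steps above can be checked with a small round trip. This sketch assumes PyTables is installed with Blosc support, writes to a temporary directory instead of ./data, and uses a reduced two-column layout; none of these choices come from the original.

```python
import os
import tempfile

import numpy as np
import tables as tb

rows = 10_000
rng = np.random.default_rng(0)

# Reduced structured array (two columns only, for brevity)
dty = np.dtype([('No1', '<i4'), ('No3', '<f8')])
sarray = np.zeros(rows, dtype=dty)
sarray['No1'] = rng.integers(0, 10000, size=rows)
sarray['No3'] = rng.standard_normal(rows).round(5)

filename = os.path.join(tempfile.mkdtemp(), 'tab_demo.h5c')
filters = tb.Filters(complevel=4, complib='blosc')

# Passing the structured array as the description writes its data too
with tb.open_file(filename, 'w') as h5c:
    tabc = h5c.create_table('/', 'ints_floats', sarray,
                            title='Integers and Floats',
                            expectedrows=rows, filters=filters)
    n_rows_written = tabc.nrows

# Reopen the file and read the table back as a structured ndarray
with tb.open_file(filename, 'r') as h5c:
    arr_com = h5c.root.ints_floats.read()

print(n_rows_written, arr_com.dtype.names)
```

Compression is transparent on read: the array that comes back has the same fields and values as the one that was written.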

Out-of-memory computation

For example, processing an array of several GB.

Creating an on-disk array (EArray)

filename = './data/array.h5'
h5 = tb.open_file(filename, 'w') 

n = 100
ear = h5.create_earray(h5.root, 'ear',
                      atom=tb.Float64Atom(),
                      shape=(0, n))

%%time
rand = np.random.standard_normal((n, n))
for i in range(750):
    ear.append(rand)
ear.flush()

ear.size_on_disk  # check the size: this EArray is a large array

Creating a corresponding EArray

Step 1: set up the on-disk workspace.

out = h5.create_earray(h5.root, 'out', atom=tb.Float64Atom(), shape=(0, n))

Step 2: evaluate the expression on the large ear array out of memory.

expr = tb.Expr('3 * sin(ear) + sqrt(abs(ear))')  # Expr from tables (import tables as tb), not from numexpr (import numexpr as ne)
  # the numerical expression as a string object

expr.set_output(out, append_mode=True)
  # target to store results is disk-based array

%time expr.eval()
# evaluation of the numerical expression
# and storage of results in disk-based array

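Step 3: read from disk into memory. What gets read here is the input array ear (the "variable"), not the output array out (the "workspace").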

%time imarray = ear.read()
  # read whole array into memory
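
The steps above can be run end to end on a small array and compared against a plain NumPy calculation. This is a sketch under assumptions: PyTables is installed, the file lives in a temporary directory, only 5 blocks are appended instead of 750, and ear is passed to Expr explicitly through uservars rather than picked up from the calling namespace.

```python
import os
import tempfile

import numpy as np
import tables as tb

n = 100
filename = os.path.join(tempfile.mkdtemp(), 'array_demo.h5')
rng = np.random.default_rng(0)

with tb.open_file(filename, 'w') as h5:
    # Input array and on-disk output workspace, both extendable along axis 0
    ear = h5.create_earray(h5.root, 'ear',
                           atom=tb.Float64Atom(), shape=(0, n))
    out = h5.create_earray(h5.root, 'out',
                           atom=tb.Float64Atom(), shape=(0, n))

    block = rng.standard_normal((n, n))
    for _ in range(5):
        ear.append(block)
    ear.flush()

    # Evaluate chunk by chunk and append the results to out
    expr = tb.Expr('3 * sin(ear) + sqrt(abs(ear))', uservars={'ear': ear})
    expr.set_output(out, append_mode=True)
    expr.eval()

    # Read both arrays back for an in-memory comparison
    result = out.read()
    in_memory = ear.read()

expected = 3 * np.sin(in_memory) + np.sqrt(np.abs(in_memory))
print(result.shape, np.allclose(result, expected))
```

On an array this small the in-memory calculation is the simpler choice; the on-disk version pays off once the data no longer fits in RAM.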

End.
