Python: comparing pyarrow with the Parquet and Feather formats

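I recently came across the Arrow format, and its design struck me as really impressive. I won't go into the details here; instead, let's try it out hands-on.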

1. Preparing the material
I prepared a CSV file with about 590,000 rows and 14 columns, roughly 61 MB in size.

table shape row:  589680
table shape col:  14

With this CSV file in hand, we can load it into a DataFrame and then convert it to Parquet format.

2. The code

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time as t

# Read the CSV, convert it to an Arrow table, and write it out as Parquet
print("write parquet file....")
csv_path = "C:\\Users\\songroom\\Desktop\\test.csv"
time_0 = t.time()
df = pd.read_csv(csv_path)
time_1 =t.time()
print("read csv cost :", time_1-time_0)
print("type of df : ",type(df))
time_2 = t.time()
table = pa.Table.from_pandas(df)
time_3 = t.time()
print("type of table :",type(table))
print("Dataframe convert table:", time_3-time_2)
#
print("write to parquet to disk.....")
time_4 = t.time()
pq_path = "C:\\Users\\songroom\\Desktop\\test.parquet"
pq.write_table(table, pq_path)
time_5 = t.time()
print("write parquet cost :", time_5-time_4)

print("read parquet file from disk ....")
# Start the clock right before the read so the print above is not timed
time_6 = t.time()
table2 = pq.read_table(pq_path)
time_7 = t.time()
print("read parquet cost :", time_7-time_6)
print("type of table2 :",type(table2))
print("table2 shape row: ",table2.shape[0])
print("table2 shape col: ",table2.shape[1])
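
The repeated time_N bookkeeping in the script above can be folded into a small helper. This is only a sketch: `timed` and the `timings` dictionary are hypothetical names, not part of the original script, and `time.perf_counter` is swapped in for `time.time` because it is better suited to measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the elapsed seconds.

    `timed` is a hypothetical helper, not something pyarrow or pandas provides.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print(f"{label} cost : {elapsed:.4f}")


# Stand-in workload; in the script this would be pd.read_csv, pq.write_table, etc.
timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Each step of the script (read_csv, Table.from_pandas, write_table, read_table) could then be wrapped in its own `with timed(...)` block.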

3. Comparing file size

The resulting Parquet file is about 11.3 MB, roughly 20% of the original 61 MB CSV file, which is a substantial saving in space.

The read and write timings (in seconds) are as follows:

write parquet file....
read csv cost : 1.0619995594024658
type of df :  <class 'pandas.core.frame.DataFrame'>
type of table : <class 'pyarrow.lib.Table'>
Dataframe convert table: 0.08900094032287598
write to parquet to disk.....
write  parquet cost : 0.3249986171722412
read parquet file from disk ....
read  parquet cost : 0.05600690841674805
type of table2 : <class 'pyarrow.lib.Table'>
table2 shape row:  589680
table2 shape col:  14

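In other words, reading the Parquet file took about 0.056 s against about 1.06 s for the CSV, which is roughly 5% of the time and well under half, while the file is about 20% of the size. Of course, these figures will vary with the runtime environment, so treat them as a rough reference only.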

4. Comparison with Feather

Using the same CSV file, let's process it with Feather (this time in Julia) and compare the read speed.

using DataFrames
using CSV
using Feather

# raw"..." keeps the backslashes in Windows paths literal
csv_path = raw"C:\Users\songroom\Desktop\test.csv"
println("csv => DataFrame: ")
df = @time CSV.File(csv_path) |> DataFrame;
ft_path = raw"C:\Users\songroom\Desktop\ft.ft"
println("DataFrame=> Feather:")
@time ft_file = Feather.write(ft_path, df)
println("read Feather")
@time ft = Feather.read(ft_path)