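I recently came across the Arrow format and was impressed by its design. I won't go into the details here; instead, let's get a feel for it through a hands-on test.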
1. Preparing the data
I prepared a CSV file with roughly 590,000 rows and 14 columns, about 61 MB in size.
table shape row: 589680
table shape col: 14
With this CSV file in hand, we can load it into a DataFrame and then convert it to Parquet format.
2. The code
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import time as t

# Build the Arrow table from the CSV data
print("write parquet file....")
csv_path = "C:\\Users\\songroom\\Desktop\\test.csv"
time_0 = t.time()
df = pd.read_csv(csv_path)
time_1 = t.time()
print("read csv cost :", time_1-time_0)
print("type of df : ", type(df))

# Convert the DataFrame to an Arrow Table
time_2 = t.time()
table = pa.Table.from_pandas(df)
time_3 = t.time()
print("type of table :", type(table))
print("Dataframe convert table:", time_3-time_2)

# Write the table to disk as Parquet
print("write to parquet to disk.....")
time_4 = t.time()
pq_path = "C:\\Users\\songroom\\Desktop\\test.parquet"
pq.write_table(table, pq_path)
time_5 = t.time()
print("write parquet cost :", time_5-time_4)

# Read the Parquet file back; start the timer right before the read
print("read parquet file from disk ....")
time_6 = t.time()
table2 = pq.read_table(pq_path)
time_7 = t.time()
print("read parquet cost :", time_7-time_6)
print("type of table2 :", type(table2))
print("table shape row: ", table2.shape[0])
print("table shape col: ", table2.shape[1])
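The repeated timestamp bookkeeping in the script can be wrapped in a small helper. This is only a sketch and not part of the original script: the names `timed` and `results` are hypothetical, and it uses `time.perf_counter`, which is generally better suited than `time.time` to measuring short intervals.

```python
import time
from contextlib import contextmanager

# Hypothetical store: maps a label to the elapsed seconds of its block
results = {}

@contextmanager
def timed(label, store=results):
    """Time the enclosed block and record the elapsed seconds under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        store[label] = elapsed
        print(f"{label}: {elapsed:.4f} s")

# Usage with a stand-in workload instead of the real CSV/Parquet calls
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

In the script, each step such as `pd.read_csv(csv_path)` or `pq.read_table(pq_path)` could sit inside its own `with timed(...)` block, so the timer always starts immediately before the call being measured.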
3. Comparing file size
The generated Parquet file is about 11.3 MB, a bit under 20% of the original 61 MB CSV file. That is a substantial space saving.
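The size comparison is simple arithmetic and can be checked directly. The sketch below first computes the ratio from the two sizes quoted in the text, then shows how the same ratio could be read from disk with `os.path.getsize`. The helper name `size_ratio` and the two throwaway files are assumptions for illustration; the original CSV and Parquet files are not used here.

```python
import os
import tempfile

def size_ratio(small_path, large_path):
    """Size of small_path as a fraction of large_path (hypothetical helper)."""
    return os.path.getsize(small_path) / os.path.getsize(large_path)

# Ratio from the sizes quoted in the text: 11.3 MB Parquet vs. 61 MB CSV
quoted_ratio = 11.3 / 61
print(f"quoted ratio: {quoted_ratio:.1%}")

# Demonstration on two throwaway files of 200 and 1000 bytes
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.bin")
    b = os.path.join(tmp, "b.bin")
    with open(a, "wb") as f:
        f.write(b"x" * 200)
    with open(b, "wb") as f:
        f.write(b"x" * 1000)
    demo_ratio = size_ratio(a, b)
    print(f"demo ratio: {demo_ratio:.1%}")
```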
The read and write timings are as follows:
write parquet file....
read csv cost : 1.0619995594024658
type of df : <class 'pandas.core.frame.DataFrame'>
type of table : <class 'pyarrow.lib.Table'>
Dataframe convert table: 0.08900094032287598
write to parquet to disk.....
write parquet cost : 0.3249986171722412
read parquet file from disk ....
read parquet cost : 0.05600690841674805
type of table2 : <class 'pyarrow.lib.Table'>
table2 shape row: 589680
table2 shape col: 14
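In other words, reading the Parquet file took well under 50% of the time needed to read the CSV (about 0.056 s versus 1.06 s in this run, roughly 5%), and the file is about 20% of the size. These figures will of course vary with the runtime environment and are for reference only.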
4. Comparison with Feather
Using the same CSV file, let's process it with Feather, this time from Julia, and compare the read speed.
using DataFrames
using CSV
using Feather

# Raw string literals keep the Windows backslashes as-is
csv_path = raw"C:\Users\songroom\Desktop\test.csv"
println("csv => DataFrame: ")
df = @time CSV.File(csv_path) |> DataFrame;

ft_path = raw"C:\Users\songroom\Desktop\ft.ft"
println("DataFrame=> Feather:")
@time ft_file = Feather.write(ft_path, df)

println("read Feather")
@time ft = Feather.read(ft_path)