[Original] Big Data Basics: Parquet (1) Introduction

Date: 2019-11-21

http://parquet.apache.org

Hierarchy:

file -> row groups -> column chunks -> pages (data/index/dictionary)

Motivation

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

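In short: Parquet aims to let any project in the Hadoop ecosystem benefit from compression and columnar storage. It supports complex nested data structures from the start, using the record shredding and assembly algorithm described in the Dremel paper. It also supports efficient compression and encoding schemes, and many projects have shown that this greatly improves query performance.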
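To make the per-column compression idea concrete, here is a minimal sketch using only the Python standard library. It is not Parquet's actual writer: the codec names, the `encode_columns` helper, and the newline-joined serialization are all assumptions made for illustration, and real Parquet uses its own codecs and encodings.

```python
import bz2
import zlib

# Hypothetical codec table for this sketch; real Parquet codecs differ.
CODECS = {
    "none": lambda raw: raw,
    "zlib": zlib.compress,
    "bz2": bz2.compress,
}


def encode_columns(rows, codec_by_column):
    """Pivot row dicts into columns and compress each column with its own codec."""
    encoded = {}
    for name, codec in codec_by_column.items():
        # Toy serialization: the column's values as newline-joined strings.
        raw = "\n".join(str(row[name]) for row in rows).encode("utf-8")
        encoded[name] = (codec, CODECS[codec](raw))
    return encoded


rows = [{"id": i, "country": "CN" if i % 2 else "US"} for i in range(1000)]
encoded = encode_columns(rows, {"id": "none", "country": "zlib"})
for name, (codec, payload) in encoded.items():
    print(name, codec, len(payload))
```

Because the values of one column are stored together, a repetitive column such as `country` can use a codec that suits it, while another column is left uncompressed or handled differently.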

Glossary

Block (hdfs block): This means a block in hdfs and the meaning is unchanged for describing this file format. The file format is designed to work well on top of hdfs.

File: A hdfs file that must include the metadata for the file. It does not need to actually contain the data.

Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.

Column chunk: A chunk of the data for a particular column. These live in a particular row group and are guaranteed to be contiguous in the file.

Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types, which are interleaved in a column chunk.

Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages.

In other words: a file contains one or more row groups; within a row group, each column has exactly one column chunk; and a column chunk contains one or more pages.
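The hierarchy can be sketched as a small in-memory model. This is only an illustration of the containment relationships, not the on-disk format: the class names, `build_file`, and the `rows_per_group` / `values_per_page` parameters are hypothetical, and real row group and page sizes are driven by byte sizes rather than value counts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    values: list  # smallest unit; compressed and encoded as a whole


@dataclass
class ColumnChunk:
    column: str
    pages: List[Page] = field(default_factory=list)


@dataclass
class RowGroup:
    num_rows: int
    # Exactly one column chunk per column, keyed by column name.
    chunks: Dict[str, ColumnChunk] = field(default_factory=dict)


@dataclass
class ParquetLikeFile:
    row_groups: List[RowGroup] = field(default_factory=list)


def build_file(rows, columns, rows_per_group, values_per_page):
    """Split rows horizontally into row groups, then each column into pages."""
    result = ParquetLikeFile()
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        group = RowGroup(num_rows=len(group_rows))
        for col in columns:
            values = [row[col] for row in group_rows]
            pages = [
                Page(values[i:i + values_per_page])
                for i in range(0, len(values), values_per_page)
            ]
            group.chunks[col] = ColumnChunk(col, pages)
        result.row_groups.append(group)
    return result


rows = [{"id": i, "name": f"user{i}"} for i in range(10)]
toy = build_file(rows, ["id", "name"], rows_per_group=4, values_per_page=3)
for index, group in enumerate(toy.row_groups):
    print(index, group.num_rows, {c: len(ch.pages) for c, ch in group.chunks.items()})
```

With 10 rows and 4 rows per group, the sketch produces three row groups, each holding one chunk for `id` and one for `name`.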

Metadata

There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol.

The file metadata contains the start locations of all the column metadata.

Metadata is written after the data to allow for single pass writing.

Readers are expected to first read the file metadata to find all the column chunks they are interested in. The column chunks should then be read sequentially.

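To summarize: there are three types of metadata: file metadata, column metadata, and page header metadata. The file metadata contains the start locations of all the column metadata, so a reader should read the file metadata first to find the column chunks it is interested in.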
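The "metadata after data" layout can be illustrated with a toy file format. A real Parquet file ends with the file metadata, a 4-byte little-endian metadata length, and the magic bytes `PAR1`; the sketch below mimics that tail. It deliberately uses JSON as a stand-in for the Thrift TCompactProtocol structures, and `write_toy_file` and `read_column` are hypothetical helpers, so the output is not a valid Parquet file.

```python
import io
import json
import struct

MAGIC = b"PAR1"


def write_toy_file(columns):
    """Write column data first, then a footer with offsets, in a single pass."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    footer = {}
    for name, values in columns.items():
        payload = json.dumps(values).encode("utf-8")  # stand-in for encoded pages
        footer[name] = {"offset": buf.tell(), "length": len(payload)}
        buf.write(payload)
    # Offsets are only known once the data is written, so the footer goes last.
    footer_bytes = json.dumps(footer).encode("utf-8")  # assumption: JSON, not Thrift
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))
    buf.write(MAGIC)
    return buf.getvalue()


def read_column(data, name):
    """Read the footer first, then seek directly to the requested column."""
    if data[-4:] != MAGIC:
        raise ValueError("missing trailing magic bytes")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    footer = json.loads(data[footer_start:footer_start + footer_len])
    entry = footer[name]
    return json.loads(data[entry["offset"]:entry["offset"] + entry["length"]])


blob = write_toy_file({"id": [1, 2, 3], "name": ["a", "b", "c"]})
print(read_column(blob, "name"))
```

The reader never scans the `id` data when only `name` is requested, which is the point of locating column chunks through the file metadata.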

The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.