Data Lake Analytics + OSS: A Complete Guide to Processing Data File Formats

0. Introduction

Data Lake Analytics is a serverless, interactive query and analysis service on the cloud. Using standard SQL statements, users can query and analyze data stored in OSS and Table Store directly, without moving it.

The product is now officially available on Alibaba Cloud. You are welcome to apply for a trial and experience a more convenient data analysis service.
See https://help.aliyun.com/document_detail/70386.html to apply for service activation.

In the previous tutorial, we showed how to analyze the TPC-H dataset in CSV format. Beyond plain text files (e.g., CSV, TSV), data files stored on OSS in other formats can also be queried and analyzed with Data Lake Analytics, including ORC, PARQUET, JSON, RCFILE, AVRO, even Esri geographic JSON data, as well as files that can be matched with regular expressions.

This article describes in detail how to use Data Lake Analytics (DLA below) to analyze files according to their storage format on OSS. DLA has built-in SerDe (short for Serialize/Deserialize, used to serialize and deserialize data) implementations for a variety of file formats, so you do not need to write any programs yourself: in most cases one or more of DLA's SerDes will match the format of your files on OSS. If they do not cover your particular file format, please contact us and we will support it as soon as possible.

1. Storage Formats and SerDes

You can create a table over the data files stored on OSS and specify the file format with STORED AS.
For example,

CREATE EXTERNAL TABLE nation (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';

After the table is created, you can use SHOW CREATE TABLE to view the original table creation statement.

mysql> show create table nation;
+--------------------------------------------------------------------------------------------------------------+
| Result                                                                                                        |
+--------------------------------------------------------------------------------------------------------------+
| CREATE EXTERNAL TABLE `nation`( `n_nationkey` int, `n_name` string, `n_regionkey` int, `n_comment` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS `TEXTFILE` LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation' |
+--------------------------------------------------------------------------------------------------------------+
1 row in set (1.81 sec)

The table below lists the file formats DLA currently supports. When creating a table over files in one of these formats, you can simply use STORED AS, and DLA will choose the appropriate SERDE/INPUTFORMAT/OUTPUTFORMAT.

 

Storage format        Description
STORED AS TEXTFILE    Data files are stored as plain text. This is the default file type; each line in a file corresponds to one record in the table.
STORED AS ORC         Data files are stored in ORC format.
STORED AS PARQUET     Data files are stored in PARQUET format.
STORED AS RCFILE      Data files are stored in RCFILE format.
STORED AS AVRO        Data files are stored in AVRO format.
STORED AS JSON        Data files are stored in JSON format (except Esri ArcGIS geographic JSON data files).

 

In addition to STORED AS, you can also specify a SerDe (used to parse the data files and map them onto the DLA table), special column delimiters, and so on, according to the characteristics of the files.
The following sections explain this in more detail.

2. Examples

2.1 CSV files

CSV files are essentially plain text files, so STORED AS TEXTFILE can be used.
Columns are separated by commas, expressed with ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.

Plain CSV files

For example, the data file oss://bucket-for-testing/oss/text/cities/city.csv contains:

Beijing,China,010
ShangHai,China,021
Tianjin,China,022

The table creation statement can be:

CREATE EXTERNAL TABLE city (
    city STRING,
    country STRING,
    code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities';
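As a rough local illustration (plain Python, not DLA itself), the same comma-delimited rows split into the columns that the table above defines:

```python
import csv
import io

# Contents of city.csv as shown above
data = "Beijing,China,010\nShangHai,China,021\nTianjin,China,022\n"

rows = list(csv.reader(io.StringIO(data)))
for city, country, code in rows:
    print(city, country, code)
```

Note that all parsed fields are strings at this stage; since the table declares `code` as INT, DLA would cast the text to an integer when the column is queried.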

Using OpenCSVSerde to handle quoted fields

Note the following when using OpenCSVSerde:

  1. You can specify the field separator, the field quoting character, and the escape character, e.g.: WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\");
  2. Line separators embedded inside fields are not supported;
  3. All fields must be defined as STRING;
  4. Values of other data types can be obtained by converting them with functions in SQL.
    For example,
CREATE EXTERNAL TABLE test_csv_opencsvserde (
    id STRING,
    name STRING,
    location STRING,
    create_date STRING,
    create_timestamp STRING,
    longitude STRING,
    latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '"',
    'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
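The effect of separatorChar/quoteChar/escapeChar can be approximated locally with Python's csv module (a sketch of the parsing behavior, not the SerDe's actual implementation): a quoted field may contain the separator, and every field comes back as a string, matching point 3 above.

```python
import csv
import io

# One sample row: the third field is quoted and contains the separator
line = '1,jack,"Hangzhou, Zhejiang",2001-02-03\n'

reader = csv.reader(io.StringIO(line), delimiter=',', quotechar='"', escapechar='\\')
row = next(reader)
print(row)  # ['1', 'jack', 'Hangzhou, Zhejiang', '2001-02-03']
```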

Custom delimiters

To customize the column delimiter (FIELDS TERMINATED BY), escape character (ESCAPED BY), and line terminator (LINES TERMINATED BY), specify them in the table creation statement:

ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'

Ignoring the header in CSV files

CSV files sometimes carry header information that should be ignored when the data is read. In that case, define skip.header.line.count in the table creation statement.

For example, the data file oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl contains:

N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
5|ETHIOPIA|0|ven packages wake quickly. regu|

The corresponding table creation statement is:

CREATE EXTERNAL TABLE nation_header (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl'
TBLPROPERTIES ("skip.header.line.count"="1");

The value x of skip.header.line.count relates to the actual number of lines n in the data file as follows:

  • When x <= 0, DLA does not filter out anything when reading the file, i.e., the whole file is read;
  • When 0 < x < n, DLA skips the first x lines of the file when reading it;
  • When x >= n, DLA filters out the entire file content when reading it.
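The three cases can be sketched in a few lines of Python (illustrative only, not DLA's implementation):

```python
def skip_header(lines, x):
    """Mimic skip.header.line.count: drop the first x lines of a data file."""
    if x <= 0:
        return lines      # nothing is filtered out
    return lines[x:]      # for x >= len(lines), this is empty: everything skipped

lines = [
    "N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT",   # header
    "0|ALGERIA|0|...",
    "1|ARGENTINA|1|...",
]
print(skip_header(lines, 1))    # header dropped, data rows remain
print(skip_header(lines, 10))   # x >= n: everything skipped
```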

2.2 TSV files

Similar to CSV files, TSV files are plain text files, with Tab as the separator between columns.

For example, the data file oss://bucket-for-testing/oss/text/cities/city.tsv contains:

Beijing    China    010
ShangHai    China    021
Tianjin    China    022

The table creation statement can be:

CREATE EXTERNAL TABLE city (
    city STRING,
    country STRING,
    code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities';

2.3 Files with multi-character field delimiters

If your field delimiter consists of multiple characters, you can use the example table creation statement below, where the field delimiter for each row is "||"; replace it with your actual delimiter string.

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties(
"field.delim"="||"
)

Example:

CREATE EXTERNAL TABLE test_csv_multidelimit (
    id STRING,
    name STRING,
    location STRING,
    create_date STRING,
    create_timestamp STRING,
    longitude STRING,
    latitude STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
    "field.delim" = "||"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities/';
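Conceptually, MultiDelimitSerDe splits each line on the literal delimiter string, roughly like this Python sketch (not the SerDe's actual code; the sample row is hypothetical):

```python
# A sample row using "||" as the field delimiter
line = "1||jack||Hangzhou||2018-01-01||2018-01-01 00:00:00||120.19||30.26"

fields = line.split("||")
print(fields)  # 7 string fields, one per column of the table above
```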

2.4 JSON files

JSON files processed by DLA are usually stored as plain text and must be UTF-8 encoded.
Each line in a JSON file must be one complete JSON object.
For example, the following file layout is not accepted:

{"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"} {"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"} {"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"} {"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}

It needs to be rewritten as:

{"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"}
{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
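Because each line is a self-contained JSON object, a reader can parse the file line by line; in Python terms:

```python
import json

text = (
    '{"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"}\n'
    '{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}\n'
)

# Parse each non-empty line independently as one record
records = [json.loads(line) for line in text.splitlines() if line.strip()]
print(records[0]["name"])  # jack
```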

JSON data without nesting

The table creation statement can be:

CREATE EXTERNAL TABLE t1 (
    id int,
    name string,
    c3 timestamp
)
STORED AS JSON
LOCATION 'oss://path/to/t1/directory';

JSON files with nesting

Nested JSON data is defined with struct and array structures.
For example, given the following raw data (note: whether nested or not, one complete JSON record must stay on a single line for Data Lake Analytics to process it):

{       "DocId": "Alibaba", "User_1": { "Id": 1234, "Username": "bob1234", "Name": "Bob", "ShippingAddress": { "Address1": "969 Wenyi West St.", "Address2": null, "City": "Hangzhou", "Province": "Zhejiang" }, "Orders": [{ "ItemId": 6789, "OrderDate": "11/11/2017" }, { "ItemId": 4352, "OrderDate": "12/12/2017" } ] } }

After formatting with an online JSON formatting tool, the content looks like this:

{
    "DocId": "Alibaba",
    "User_1": {
        "Id": 1234,
        "Username": "bob1234",
        "Name": "Bob",
        "ShippingAddress": {
            "Address1": "969 Wenyi West St.",
            "Address2": null,
            "City": "Hangzhou",
            "Province": "Zhejiang"
        },
        "Orders": [
            {
                "ItemId": 6789,
                "OrderDate": "11/11/2017"
            },
            {
                "ItemId": 4352,
                "OrderDate": "12/12/2017"
            }
        ]
    }
}

The table creation statement can then be written as follows (note: the path specified in LOCATION must be the directory containing the JSON data files, and all JSON files in that directory are recognized as data of this table):

CREATE EXTERNAL TABLE json_table_1 (
    docid string,
    user_1 struct<
        id:INT,
        username:string,
        name:string,
        shippingaddress:struct<
            address1:string,
            address2:string,
            city:string,
            province:string
        >,
        orders:array<
            struct<
                itemid:INT,
                orderdate:string
            >
        >
    >
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/';

Query the table:

select * from json_table_1;
+---------+----------------------------------------------------------------------------------------------------------------+
| docid   | user_1                                                                                                         |
+---------+----------------------------------------------------------------------------------------------------------------+
| Alibaba | [1234, bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352, 12/12/2017]]] |
+---------+----------------------------------------------------------------------------------------------------------------+

For nested structures defined with struct, use "." to reference objects level by level; for array structures defined with array, use "[subscript]" (note: array subscripts start from 1) to reference elements.

select DocId, User_1.Id, User_1.ShippingAddress.Address1, User_1.Orders[1].ItemId
from json_table_1
where User_1.Username = 'bob1234'
  and User_1.Orders[2].OrderDate = '12/12/2017';
+---------+------+--------------------+-------+
| DocId   | id   | address1           | _col3 |
+---------+------+--------------------+-------+
| Alibaba | 1234 | 969 Wenyi West St. | 6789  |
+---------+------+--------------------+-------+
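For comparison, the same navigation over the raw document in plain Python looks like this (note that Python lists are 0-indexed, while DLA array subscripts start at 1, so DLA's Orders[1] corresponds to Orders[0] below):

```python
import json

doc = json.loads(
    '{"DocId": "Alibaba", "User_1": {"Id": 1234, "Username": "bob1234", "Name": "Bob", '
    '"ShippingAddress": {"Address1": "969 Wenyi West St.", "Address2": null, '
    '"City": "Hangzhou", "Province": "Zhejiang"}, '
    '"Orders": [{"ItemId": 6789, "OrderDate": "11/11/2017"}, '
    '{"ItemId": 4352, "OrderDate": "12/12/2017"}]}}'
)

print(doc["User_1"]["ShippingAddress"]["Address1"])  # 969 Wenyi West St.
print(doc["User_1"]["Orders"][0]["ItemId"])          # 6789 -- Orders[1] in DLA
```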

Processing data with JSON functions

For example, store the nested JSON value of "value_string" as a string:

{"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"張三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}

After formatting with an online JSON formatting tool, the content looks like this:

{
    "data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com",
    "ts": 1524550275112,
    "value_string": "{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"張三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"
}

The table creation statement is:

CREATE EXTERNAL TABLE json_table_2 (
    data_key string,
    ts bigint,
    value_string string
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/';

After the table is created, it can be queried:

select * from json_table_2;
+---------------------------------------------------+---------------+--------------+
| data_key                                          | ts            | value_string |
+---------------------------------------------------+---------------+--------------+
| com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {"appName":"","apps":[],"checksum":"50fa0540b430904ee78dff07c7350e1c","clusterMap":{"DEFAULT":{"defCkport":80,"defIPPort":80,"healthCheckTask":null,"healthChecker":{"checkCode":200,"curlHost":"","curlPath":"/status.taobao","type":"HTTP"},"name":"DEFAULT","nodegroup":"","sitegroup":"","submask":"0.0.0.0/0","syncConfig":{"appName":"trade-ma","nodegroup":"tradema","pubLevel":"publish","role":"","site":""},"useIPPort4Check":true}},"disabledSites":[],"enableArmoryUnit":false,"enableClientBeat":false,"enableHealthCheck":true,"enabled":true,"envAndSites":"","invalidThreshold":0.6,"ipDeleteTimeout":1800000,"lastModifiedMillis":1524550275107,"localSiteCall":true,"localSiteThreshold":0.8,"name":"biz.alibaba.com","nodegroup":"","owners":["junlan.zx","張三","李四","cui.yuanc"],"protectThreshold":0,"requireSameEnv":false,"resetWeight":false,"symmetricCallType":null,"symmetricType":"warehouse","tagName":"ipGroup","tenantId":"","tenants":[],"token":"1cf0ec0c771321bb4177182757a67fb0","useSpecifiedURL":false} |
+---------------------------------------------------+---------------+--------------+

The following SQL statements show how to use common JSON functions such as json_parse, json_extract_scalar, and json_extract:

mysql> select json_extract_scalar(json_parse(value_string), '$.owners[1]') from json_table_2;
+--------+
| _col0  |
+--------+
| 張三   |
+--------+

mysql> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
    -> from (select json_extract(json_parse(value_string), '$.clusterMap') as json_col from json_table_2) json_obj
    -> where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+

mysql> with json_obj as (select json_extract(json_parse(value_string), '$.clusterMap') as json_col from json_table_2)
    -> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
    -> from json_obj
    -> where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';
+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+
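The json_parse/json_extract/json_extract_scalar pipeline above amounts to deserializing the stored string and walking into it. Here is a Python sketch over an abridged, ASCII-only sample of value_string (the owner names are placeholders; note that the array index in the JSONPath $.owners[1] is 0-based, unlike DLA's SQL array subscripts):

```python
import json

# Abridged sample of the value_string column shown above (placeholder names)
value_string = (
    '{"owners": ["junlan.zx", "zhang.san", "li.si", "cui.yuanc"], '
    '"clusterMap": {"DEFAULT": {"submask": "0.0.0.0/0", '
    '"healthChecker": {"curlPath": "/status.taobao"}}}}'
)

obj = json.loads(value_string)               # ~ json_parse(value_string)
print(obj["owners"][1])                      # ~ json_extract_scalar(..., '$.owners[1]')
cluster = obj["clusterMap"]                  # ~ json_extract(..., '$.clusterMap')
if cluster["DEFAULT"]["healthChecker"]["curlPath"] == "/status.taobao":
    print(cluster["DEFAULT"]["submask"])     # 0.0.0.0/0
```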

2.5 ORC files

Optimized Row Columnar (ORC) is an optimized columnar storage file format supported by the Apache open-source project Hive. Compared with CSV files, it not only saves storage space but also delivers better query performance.

For ORC files, you only need to specify STORED AS ORC when creating the table.
For example,

CREATE EXTERNAL TABLE orders_orc_date (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
STORED AS ORC
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';

2.6 PARQUET files

Parquet is a columnar storage file format from the Apache Hadoop open-source ecosystem.
When creating a table in DLA, simply specify STORED AS PARQUET.
For example,

CREATE EXTERNAL TABLE orders_parquet_date (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
STORED AS PARQUET
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';

2.7 RCFILE files

Record Columnar File (RCFile) is a columnar storage file format that stores relational table structures efficiently on distributed systems and can be read and processed efficiently.
When creating a table in DLA, specify STORED AS RCFILE.
For example,

CREATE EXTERNAL TABLE lineitem_rcfile_date (
    L_ORDERKEY INT,
    L_PARTKEY INT,
    L_SUPPKEY INT,
    L_LINENUMBER INT,
    L_QUANTITY DOUBLE,
    L_EXTENDEDPRICE DOUBLE,
    L_DISCOUNT DOUBLE,
    L_TAX DOUBLE,
    L_RETURNFLAG STRING,
    L_LINESTATUS STRING,
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT STRING,
    L_SHIPMODE STRING,
    L_COMMENT STRING
)
STORED AS RCFILE
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile';

2.8 AVRO files

When creating a table over AVRO files in DLA, specify STORED AS AVRO; the columns you define must match the schema of the AVRO files.

If you are unsure of the schema, you can obtain it with the tools provided by Avro and create the table accordingly.
Download avro-tools-<version>.jar from the Apache Avro website, then run the following command to obtain the schema of an Avro file:

java -jar avro-tools-1.8.2.jar getschema /path/to/your/doctors.avro
{
  "type" : "record",
  "name" : "doctors",
  "namespace" : "testing.hive.avro.serde",
  "fields" : [ {
    "name" : "number",
    "type" : "int",
    "doc" : "Order of playing the role"
  }, {
    "name" : "first_name",
    "type" : "string",
    "doc" : "first name of actor playing role"
  }, {
    "name" : "last_name",
    "type" : "string",
    "doc" : "last name of actor playing role"
  } ]
}

The table creation statement is as follows, where the name of each entry in fields maps to a column name in the table, and each type must be converted to a DLA-supported type according to the table in this document:

CREATE EXTERNAL TABLE doctors (
    number int,
    first_name string,
    last_name string
)
STORED AS AVRO
LOCATION 'oss://mybucket-for-testing/directory/to/doctors';

In most cases, Avro types convert directly to the corresponding DLA types. If a type is not supported in DLA, it is converted to the closest type. See the table below:

Avro type    Corresponding DLA type
null void
boolean boolean
int int
long bigint
float float
double double
bytes binary
string string
record struct
map map
list array
union union
enum string
fixed binary

2.9 Files that can be matched with regular expressions

Files of this type are usually stored on OSS as plain text; each line represents one record in the table and can be matched with a regular expression.
For example, Apache web server log files are of this type.

The content of a log file:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"

Each line of the file can be expressed with the regular expression below, with columns separated by spaces:

([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?

For this file format, the table creation statement can be written as:

CREATE EXTERNAL TABLE serde_regex (
    host STRING,
    identity STRING,
    userName STRING,
    time STRING,
    request STRING,
    status STRING,
    size INT,
    referer STRING,
    agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/datasets/serde/regex';

Query results:

mysql> select * from serde_regex;
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| host      | identity | userName | time                         | request                                     | status | size | referer | agent                                                                                                                    |
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| 127.0.0.1 | -        | frank    | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0"               | 200    | 2326 | NULL    | NULL                                                                                                                     |
| 127.0.0.1 | -        | -        | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200    | 5864 | -       | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19" |
+-----------+----------+----------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
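Before committing a pattern to input.regex, you can try it out locally, for example with Python's re module (the same expression with the SQL-level string escaping removed):

```python
import re

# Same pattern as input.regex above, minus the SQL string escaping
pattern = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
m = pattern.match(line)
print(m.group(1), m.group(3), m.group(6), m.group(7))
# 127.0.0.1 frank 200 2326
```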

2.10 Esri ArcGIS geographic JSON data files

DLA supports SerDe processing of Esri ArcGIS geographic JSON data files. For a description of this geographic JSON data format, see: https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats

Example:

CREATE EXTERNAL TABLE IF NOT EXISTS california_counties (
    Name string,
    BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://test_bucket/datasets/geospatial/california-counties/';

3. Summary

As the examples above show, DLA supports files in most open-source storage formats. For the same data, different storage formats result in noticeably different file sizes on OSS and query speeds in DLA. The ORC format is recommended for storing and querying files.

DLA continues to be optimized for faster queries, and more data sources will be supported in the future to deliver a better big data analysis experience.

 


Original link: this article is original content from the Alibaba Cloud Yunqi Community and may not be reproduced without permission.
