Amazon Athena Study Notes

Date: 2021-01-19


Amazon Athena Overview

What is Athena, in brief? Key points:

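Interactive query service
Ad-hoc queries
Supports standard SQL
Defines tables over data stored in S3 (similar to Hive)
Fast response (on the order of seconds)
Serverless
Supports connections via JDBC and the Java API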

Amazon Athena is an interactive query service that lets you use standard SQL to analyze data directly in Amazon S3. You can point Athena at your data in Amazon S3 and run ad-hoc queries and get results in seconds. Athena is serverless, so there is no infrastructure to set up or manage. You pay only for the queries you run. Athena scales automatically, executing queries in parallel, so results are fast, even with large datasets and complex queries.

If you connect to Athena using the JDBC driver, use version 1.1.0 of the driver or later with the Amazon Athena API. Earlier version drivers do not support the API. For more information and to download the driver, see Accessing Amazon Athena with JDBC.

For code samples using the AWS SDK for Java, see Examples and Code Samples.

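Athena naming rules for databases, tables, and columns

Database names, table names, and column names must be lowercase.
The underscore (_) is the only special character supported; other special characters are not.
If a name starts with an underscore, it must be enclosed in backticks (``).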
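
The three naming rules above can be checked mechanically before a name is placed into a DDL statement. The following is a minimal Python sketch, not part of Athena or any AWS SDK: the helper names `is_valid_athena_name` and `quote_if_needed` are hypothetical, and allowing digits is an assumption, since the notes only mention lowercase letters and the underscore.

```python
import re

# Assumption: a valid name uses only lowercase letters, digits, and
# underscores (digits are not mentioned in the notes; allowed here on purpose).
_VALID_NAME = re.compile(r"[a-z0-9_]+")


def is_valid_athena_name(name: str) -> bool:
    """Return True if the whole name matches the lowercase/underscore rules."""
    return _VALID_NAME.fullmatch(name) is not None


def quote_if_needed(name: str) -> str:
    """Wrap the name in backticks when it starts with an underscore."""
    if not is_valid_athena_name(name):
        raise ValueError(f"invalid Athena name: {name!r}")
    return f"`{name}`" if name.startswith("_") else name
```

For instance, `quote_if_needed("_tmp")` yields the backtick-quoted form, while `quote_if_needed("self_learning")` returns the name unchanged.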

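Creating Athena tables and loading data

1. With the data already in S3, create an Athena table and use the LOCATION parameter to point it at the data in S3.

NOTE: it seems the table must be created as an external table for this to work; to be verified later.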

CREATE EXTERNAL TABLE IF NOT EXISTS default.self_learning_old(rowkey STRING,windspd INT,directh INT,directv INT,func STRING,value INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://com.kong.bp.cn.test/test_folder/'

2. Demo: create a partitioned table from an existing table

CREATE table self_learning
WITH (format='PARQUET',
parquet_compression='SNAPPY',
partitioned_by=array['year'],
external_location = 's3://com.kong.bp.cn.test/test_folder/self_learning_old/')
AS
SELECT
       windspd,
       directh,
       directv,
       func,
       value,
       cast(substr(split(rowkey,':')[2],1,4) AS bigint) AS year
FROM default.self_learning_old
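
The partition column in the statement above is derived from `rowkey` in three steps: split on `:`, take the second piece (array subscripts in Athena SQL start at 1), keep its first four characters, and cast them to an integer. A small Python sketch of the same steps follows. The sample rowkey is hypothetical, because the notes do not show the real rowkey format; it only assumes that the second `:`-separated field starts with a four-digit year.

```python
def year_from_rowkey(rowkey: str) -> int:
    """Mirror cast(substr(split(rowkey, ':')[2], 1, 4) AS bigint)."""
    parts = rowkey.split(":")
    second = parts[1]        # SQL subscript [2] is 1-based, so Python index 1
    return int(second[:4])   # substr(x, 1, 4) keeps the first four characters


# Hypothetical rowkey; the actual format is not given in the notes.
print(year_from_rowkey("device01:20210119093000"))  # 2021
```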

Querying JSON data with Athena

For loading JSON data in Athena, see the "Querying JSON" section of the documentation.

Sample JSON data:

{
 "name": "Bob Smith",
 "org": "engineering",
 "projects": [{
  "name": "project1",
  "completed": false
 }, {
  "name": "project2",
  "completed": true
 }]
}

1. Parsing the data with the json_extract function:

WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset

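Result: both columns come back as JSON-encoded values, so name is the quoted string "Susan Smith" and projects is the full JSON array.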

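2. Using the json_extract_scalar function

json_extract_scalar is similar to json_extract, but it returns only scalar values (Boolean, number, or string).

NOTE: this function does not work on arrays, maps, or structs. I take "scalar" here to mean the data types listed above.

For example, extracting the data with json_extract_scalar: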

WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT
json_extract_scalar(blob, '$.name') AS name,
json_extract_scalar(blob, '$.projects') AS projects
FROM dataset

Query result:

+---------------------------+
| name        | projects    |
+---------------------------+
| Susan Smith |             |
+---------------------------+

Because projects in the JSON is an array, json_extract_scalar cannot handle it here and returns an empty (NULL) value.
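
The difference between the two functions can be illustrated outside Athena. The Python sketch below only imitates the behaviour described above for top-level keys that exist; it is not how Athena implements these functions, and the helper names `extract` and `extract_scalar` are hypothetical.

```python
import json

blob = (
    '{"name": "Susan Smith", "org": "engineering", '
    '"projects": [{"name":"project1", "completed":false}, '
    '{"name":"project2", "completed":true}]}'
)


def extract(blob: str, key: str) -> str:
    """Imitate json_extract(blob, '$.key'): return the value as JSON text."""
    return json.dumps(json.loads(blob)[key], separators=(",", ":"))


def extract_scalar(blob: str, key: str):
    """Imitate json_extract_scalar: a plain string for scalars, else None."""
    value = json.loads(blob)[key]
    if value is None or isinstance(value, (dict, list)):
        return None              # arrays and objects are not scalars
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)
```

Here `extract(blob, "name")` keeps the JSON quotes around the string, `extract_scalar(blob, "name")` drops them, and `extract_scalar(blob, "projects")` gives `None`, which corresponds to the empty cell in the query result.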

3. Using the json_array_get function

For array values like this, you can use the json_array_get function, for example:

WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_array_get(json_extract(blob, '$.projects'), 0) AS item
FROM dataset

First use json_extract to get the projects item, which is an array, then use json_array_get to fetch an element by its zero-based index. The result:

+---------------------------------------+
| item                                  |
+---------------------------------------+
| {"name":"project1","completed":false} |
+---------------------------------------+
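
The index lookup can be imitated in a few lines of Python. This is only a sketch of the behaviour shown in the query above, with a hypothetical helper name `array_get`; it handles non-negative indexes only and returns `None` when the index is out of range, which is an assumption about how a missing element should be represented.

```python
import json


def array_get(json_array: str, index: int):
    """Imitate json_array_get: element at a zero-based index as JSON text."""
    items = json.loads(json_array)
    if not isinstance(items, list) or not 0 <= index < len(items):
        return None   # not an array, or index out of range
    return json.dumps(items[index], separators=(",", ":"))


projects = '[{"name":"project1","completed":false},{"name":"project2","completed":true}]'
print(array_get(projects, 0))
```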