For more background on the Data Lake concept, see:
https://en.wikipedia.org/wiki/Data_lake
and the AWS and Azure explanations of data lakes:
https://amazonaws-china.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
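https://azure.microsoft.com/en-us/solutions/data-lake/
Alibaba Cloud now has its own data lake analytics product as well: https://www.aliyun.com/product/datalakeanalytics
You can apply for access from that page (the public beta is currently invitation-only, and we will review applications as quickly as possible) and then follow this tutorial to analyze TPC-H data in CSV format.
Product documentation: https://help.aliyun.com/product/70174.html
If you have already activated the service, you can skip this step. If not, see https://help.aliyun.com/document_detail/70386.html
to apply for service activation.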
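You can download the 100 MB TPC-H dataset here:
https://public-datasets-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/tpch_100m_data.zip
Log in to the OSS console on the Alibaba Cloud website: https://oss.console.aliyun.com/overview
Decide which OSS bucket you want to use. After creating or selecting it, click "Files". Because there are 8 data files, create a directory for each one:
customer, lineitem, nation, orders, part, partsupp, region, supplier.
Open each directory and upload the corresponding data file. For example, upload customer.tbl to the customer directory.
Then upload the other 7 data files to their corresponding directories in the same way.
At this point, all 8 data files have been uploaded to your OSS bucket: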
oss://xxx/tpch_100m/customer/customer.tbl
oss://xxx/tpch_100m/lineitem/lineitem.tbl
oss://xxx/tpch_100m/nation/nation.tbl
oss://xxx/tpch_100m/orders/orders.tbl
oss://xxx/tpch_100m/part/part.tbl
oss://xxx/tpch_100m/partsupp/partsupp.tbl
oss://xxx/tpch_100m/region/region.tbl
oss://xxx/tpch_100m/supplier/supplier.tbl
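The directory layout above follows one pattern per table, which can be sketched in a few lines of Python. This is only an illustration for checking your own upload against the expected list: the bucket name `xxx` is the placeholder used above, and the helper name `expected_objects` is made up for this example.

```python
# Table names taken from the file listing above.
TABLES = ["customer", "lineitem", "nation", "orders",
          "part", "partsupp", "region", "supplier"]


def expected_objects(bucket, prefix="tpch_100m"):
    """Return the OSS URI expected for each table's data file.

    `bucket` is a placeholder for your own bucket name; each table gets
    its own directory holding a single <table>.tbl file.
    """
    return [f"oss://{bucket}/{prefix}/{name}/{name}.tbl" for name in TABLES]


if __name__ == "__main__":
    for uri in expected_objects("xxx"):
        print(uri)
```

Keeping one directory per table matters because each table's LOCATION later points at a directory rather than at a single file.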
https://openanalytics.console.aliyun.com/
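Click "Log in to database", enter the username and password assigned when the service was activated, and log in to the Data Lake Analytics console.
Enter the CREATE SCHEMA statement and click "Sync Execute".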
CREATE SCHEMA tpch_100m with DBPROPERTIES( LOCATION = 'oss://test-bucket-julian-1/tpch_100m/', catalog='oss' );
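(Note: schema names in Data Lake Analytics are currently globally unique within an Alibaba Cloud region, so choose a name that reflects your business. If a schema with the same name already exists, creation fails with an error; in that case, pick a different schema name.)
Once the schema is created, select it in the "Database" drop-down list. Then enter a CREATE TABLE statement in the SQL text box and click "Sync Execute".
For the CREATE TABLE syntax, see: https://help.aliyun.com/document_detail/72006.html
The CREATE TABLE statements for the 8 TPC-H tables are listed below. Paste each one into the text box and execute it, adjusting the data file location in the LOCATION clause to match your own OSS bucket directory. (Note: the console does not yet support executing multiple SQL statements at once, so run them one statement at a time.)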
CREATE EXTERNAL TABLE nation (
    N_NATIONKEY INT,
    N_NAME STRING,
    N_ID STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';

CREATE EXTERNAL TABLE lineitem (
    L_ORDERKEY INT,
    L_PARTKEY INT,
    L_SUPPKEY INT,
    L_LINENUMBER INT,
    L_QUANTITY DOUBLE,
    L_EXTENDEDPRICE DOUBLE,
    L_DISCOUNT DOUBLE,
    L_TAX DOUBLE,
    L_RETURNFLAG STRING,
    L_LINESTATUS STRING,
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT STRING,
    L_SHIPMODE STRING,
    L_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/lineitem';

CREATE EXTERNAL TABLE orders (
    O_ORDERKEY INT,
    O_CUSTKEY INT,
    O_ORDERSTATUS STRING,
    O_TOTALPRICE DOUBLE,
    O_ORDERDATE DATE,
    O_ORDERPRIORITY STRING,
    O_CLERK STRING,
    O_SHIPPRIORITY INT,
    O_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/orders';

CREATE EXTERNAL TABLE supplier (
    S_SUPPKEY INT,
    S_NAME STRING,
    S_ADDRESS STRING,
    S_NATIONKEY INT,
    S_PHONE STRING,
    S_ACCTBAL DOUBLE,
    S_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/supplier';

CREATE EXTERNAL TABLE partsupp (
    PS_PARTKEY INT,
    PS_SUPPKEY INT,
    PS_AVAILQTY INT,
    PS_SUPPLYCOST DOUBLE,
    PS_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/partsupp';

CREATE EXTERNAL TABLE customer (
    C_CUSTKEY INT,
    C_NAME STRING,
    C_ADDRESS STRING,
    C_NATIONKEY INT,
    C_PHONE STRING,
    C_ACCTBAL DOUBLE,
    C_MKTSEGMENT STRING,
    C_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/customer';

CREATE EXTERNAL TABLE part (
    P_PARTKEY INT,
    P_NAME STRING,
    P_MFGR STRING,
    P_BRAND STRING,
    P_TYPE STRING,
    P_SIZE INT,
    P_CONTAINER STRING,
    P_RETAILPRICE DOUBLE,
    P_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/part';

CREATE EXTERNAL TABLE region (
    R_REGIONKEY INT,
    R_NAME STRING,
    R_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/region';
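Because the console runs one statement at a time, it can help to split a pasted script into individual statements before executing them. The Python sketch below is one way to do that locally; `split_statements` is a hypothetical helper, not part of any Data Lake Analytics tooling. It assumes the script contains no SQL comments and that quotes are plain single quotes, as in the statements above.

```python
def split_statements(script):
    """Split a SQL script on semicolons that sit outside single-quoted strings.

    The '|' delimiter literal in the DDL above is quoted, so a naive
    str.split(';') would also work here; tracking quotes just keeps the
    helper safe for literals that contain a semicolon.
    """
    statements = []
    current = []
    in_quote = False
    for ch in script:
        if ch == "'":
            in_quote = not in_quote  # a doubled '' toggles twice, so state is preserved
        if ch == ";" and not in_quote:
            stmt = "".join(current).strip()
            if stmt:
                statements.append(stmt + ";")
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)  # trailing statement without a semicolon
    return statements


if __name__ == "__main__":
    demo = "CREATE SCHEMA a; CREATE EXTERNAL TABLE t (x INT) LOCATION 'oss://b/t';"
    for stmt in split_statements(demo):
        print(stmt)
```

Each returned string can then be pasted into the SQL text box and executed on its own.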
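After the tables are created, refresh the page. The 8 tables appear under the schema in the left navigation pane.
TPC-H defines 22 queries in total, as follows: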
Q1:
SELECT l_returnflag, l_linestatus, Sum(l_quantity) AS sum_qty, Sum(l_extendedprice) AS sum_base_price, Sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price, Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, Avg(l_quantity) AS avg_qty, Avg(l_extendedprice) AS avg_price, Avg(l_discount) AS avg_disc, Count(*) AS count_order FROM lineitem WHERE l_shipdate <= date '1998-12-01' - INTERVAL '93' day GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus LIMIT 1;
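To make Q1's filter and aggregates concrete, the following Python sketch evaluates the date arithmetic in the WHERE clause and the per-row expression inside `sum_charge`. The sample price, discount, and tax values are invented for illustration and are not taken from the dataset.

```python
from datetime import date, timedelta

# Q1's filter: l_shipdate <= date '1998-12-01' - INTERVAL '93' day
cutoff = date(1998, 12, 1) - timedelta(days=93)


def charge(extended_price, discount, tax):
    """Per-row term summed by sum_charge: price * (1 - discount) * (1 + tax)."""
    return extended_price * (1 - discount) * (1 + tax)


if __name__ == "__main__":
    print(cutoff)  # 1998-08-30
    # Hypothetical row: price 100.0, 10% discount, 5% tax.
    print(round(charge(100.0, 0.10, 0.05), 2))
```

In other words, Q1 aggregates every line item shipped on or before 1998-08-30, grouped by return flag and line status.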
Q2:
SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment FROM part, supplier, partsupp, nation, region WHERE p_partkey = ps_partkey AND s_suppkey = ps_suppkey AND p_size = 35 AND p_type LIKE '%NICKEL' AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'MIDDLE EAST'
Q3:
SELECT l_orderkey, Sum(l_extendedprice * (1 - l_discount)) AS revenue, o_orderdate, o_shippriority FROM customer, orders, lineitem WHERE c_mktsegment = 'AUTOMOBILE' AND c_custkey = o_custkey AND l_orderkey = o_orderkey AND o_orderdate < date '1995-03-31' AND l_shipdate > date '1995-03-31' GROUP BY l_orderkey, o_orderdate, o_shippriority ORDER BY revenue DESC, o_orderdate LIMIT 10;
Q4:
SELECT o_orderpriority, Count(*) AS order_count FROM orders, lineitem WHERE o_orderdate >= date '1997-10-01' AND o_orderdate < date '1997-10-01' + INTERVAL '3' month AND l_orderkey = o_orderkey AND l_commitdate < l_receiptdate GROUP BY o_orderpriority ORDER BY o_orderpriority LIMIT 1;
Q5:
SELECT n_name, Sum(l_extendedprice * (1 - l_discount)) AS revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1995-01-01' AND o_orderdate < date '1995-01-01' + INTERVAL '1' year GROUP BY n_name ORDER BY revenue DESC LIMIT 1;
Q6:
SELECT sum(l_extendedprice * l_discount) AS revenue FROM lineitem WHERE l_shipdate >= date '1995-01-01' AND l_shipdate < date '1995-01-01' + interval '1' year AND l_discount between 0.04 - 0.01 AND 0.04 + 0.01 AND l_quantity < 24 LIMIT 1;
Q7:
SELECT supp_nation, cust_nation, l_year, Sum(volume) AS revenue FROM ( SELECT n1.n_name AS supp_nation, n2.n_name AS cust_nation, Extract(year FROM l_shipdate) AS l_year, l_extendedprice * (1 - l_discount) AS volume FROM supplier, lineitem, orders, customer, nation n1, nation n2 WHERE s_suppkey = l_suppkey AND o_orderkey = l_orderkey AND c_custkey = o_custkey AND s_nationkey = n1.n_nationkey AND c_nationkey = n2.n_nationkey AND ( ( n1.n_name = 'GERMANY' AND n2.n_name = 'INDIA') OR ( n1.n_name = 'INDIA' AND n2.n_name = 'GERMANY') ) AND l_shipdate BETWEEN date '1995-01-01' AND date '1996-12-31' ) AS shipping GROUP BY supp_nation, cust_nation, l_year ORDER BY supp_nation, cust_nation, l_year LIMIT 1;
Q8:
SELECT o_year, Sum( CASE WHEN nation = 'INDIA' THEN volume ELSE 0 end) / Sum(volume) AS mkt_share FROM ( SELECT Extract(year FROM o_orderdate) AS o_year, l_extendedprice * (1 - l_discount) AS volume, n2.n_name AS nation FROM part, supplier, lineitem, orders, customer, nation n1, nation n2, region WHERE p_partkey = l_partkey AND s_suppkey = l_suppkey AND l_orderkey = o_orderkey AND o_custkey = c_custkey AND c_nationkey = n1.n_nationkey AND n1.n_regionkey = r_regionkey AND r_name = 'ASIA' AND s_nationkey = n2.n_nationkey AND o_orderdate BETWEEN date '1995-01-01' AND date '1996-12-31' AND p_type = 'STANDARD ANODIZED STEEL' ) AS all_nations GROUP BY o_year ORDER BY o_year LIMIT 1;
Q9:
SELECT nation, o_year, Sum(amount) AS sum_profit FROM ( SELECT n_name AS nation, Extract(year FROM o_orderdate) AS o_year, l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount FROM part, supplier, lineitem, partsupp, orders, nation WHERE s_suppkey = l_suppkey AND ps_suppkey = l_suppkey AND ps_partkey = l_partkey AND p_partkey = l_partkey AND o_orderkey = l_orderkey AND s_nationkey = n_nationkey AND p_name LIKE '%aquamarine%' ) AS profit GROUP BY nation, o_year ORDER BY nation, o_year DESC LIMIT 1;
Q10:
SELECT c_custkey, c_name, Sum(l_extendedprice * (1 - l_discount)) AS revenue, c_acctbal, n_name, c_address, c_phone, c_comment FROM customer, orders, lineitem, nation WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND o_orderdate >= date '1994-08-01' AND o_orderdate < date '1994-08-01' + INTERVAL '3' month AND l_returnflag = 'R' AND c_nationkey = n_nationkey GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment ORDER BY revenue DESC LIMIT 20;
Q11:
SELECT ps_partkey, Sum(ps_supplycost * ps_availqty) AS value FROM partsupp, supplier, nation WHERE ps_suppkey = s_suppkey AND s_nationkey = n_nationkey AND n_name = 'PERU' GROUP BY ps_partkey HAVING Sum(ps_supplycost * ps_availqty) > ( SELECT Sum(ps_supplycost * ps_availqty) * 0.0001000000 as sum_value FROM partsupp, supplier, nation WHERE ps_suppkey = s_suppkey AND s_nationkey = n_nationkey AND n_name = 'PERU' ) ORDER BY value DESC LIMIT 1;
Q12:
SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1 else 0 end) AS high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1 else 0 end) AS low_line_count FROM orders, lineitem WHERE o_orderkey = l_orderkey AND l_shipmode in ('MAIL', 'TRUCK') AND l_commitdate < l_receiptdate AND l_shipdate < l_commitdate AND l_receiptdate >= date '1996-01-01' AND l_receiptdate < date '1996-01-01' + interval '1' year GROUP BY l_shipmode ORDER BY l_shipmode LIMIT 1;
Q13:
SELECT c_count, count(*) AS custdist FROM ( SELECT c_custkey, count(o_orderkey) AS c_count FROM customer, orders WHERE c_custkey = o_custkey AND o_comment NOT LIKE '%pending%accounts%' GROUP BY c_custkey ) AS c_orders GROUP BY c_count ORDER BY custdist DESC, c_count DESC LIMIT 1;
Q14:
SELECT 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else 0 end) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenue FROM lineitem, part WHERE l_partkey = p_partkey AND l_shipdate >= date '1996-01-01' AND l_shipdate < date '1996-01-01' + interval '1' month LIMIT 1;
Q15:
WITH revenue0 AS ( SELECT l_suppkey AS supplier_no, sum(l_extendedprice * (1 - l_discount)) AS total_revenue FROM lineitem WHERE l_shipdate >= date '1993-01-01' AND l_shipdate < date '1993-01-01' + interval '3' month GROUP BY l_suppkey ) SELECT s_suppkey, s_name, s_address, s_phone, total_revenue FROM supplier, revenue0 WHERE s_suppkey = supplier_no AND total_revenue IN ( SELECT max(total_revenue) FROM revenue0 ) ORDER BY s_suppkey;
Q16:
SELECT p_brand, p_type, p_size, count(distinct ps_suppkey) AS supplier_cnt FROM partsupp, part WHERE p_partkey = ps_partkey AND p_brand <> 'Brand#23' AND p_type NOT LIKE 'PROMO BURNISHED%' AND p_size IN (1, 13, 10, 28, 21, 35, 31, 11) AND ps_suppkey NOT IN ( SELECT s_suppkey FROM supplier WHERE s_comment LIKE '%Customer%Complaints%' ) GROUP BY p_brand, p_type, p_size ORDER BY supplier_cnt DESC, p_brand, p_type, p_size LIMIT 1;
Q17:
SELECT sum(l_extendedprice) / 7.0 AS avg_yearly FROM lineitem, part WHERE p_partkey = l_partkey AND p_brand = 'Brand#44' AND p_container = 'WRAP PKG' AND l_quantity < ( SELECT 0.2 * avg(l_quantity) FROM lineitem, part WHERE l_partkey = p_partkey );
Q18:
SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity) FROM customer, orders, lineitem WHERE o_orderkey IN ( SELECT l_orderkey FROM lineitem GROUP BY l_orderkey HAVING sum(l_quantity) > 315 ) AND c_custkey = o_custkey AND o_orderkey = l_orderkey GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice DESC, o_orderdate LIMIT 100;
Q19:
SELECT sum(l_extendedprice* (1 - l_discount)) AS revenue FROM lineitem, part WHERE ( p_partkey = l_partkey and p_brand = 'Brand#12' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 6 and l_quantity <= 6 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#13' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 10 and l_quantity <= 10 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#24' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 21 and l_quantity <= 21 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) LIMIT 1;
Q20:
with temp_table as ( select 0.5 * sum(l_quantity) as col1 from lineitem, partsupp where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date '1993-01-01' and l_shipdate < date '1993-01-01' + interval '1' year ) select s_name, s_address from supplier, nation where s_suppkey in ( select ps_suppkey from partsupp, temp_table where ps_partkey in ( select p_partkey from part where p_name like 'dark%' ) and ps_availqty > temp_table.col1 ) and s_nationkey = n_nationkey and n_name = 'JORDAN' order by s_name limit 1;
Q21:
select s_name, count(*) as numwait from supplier, lineitem l1, orders, nation where s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists ( select * from lineitem l2 where l2.l_orderkey = l1.l_orderkey and l2.l_suppkey <> l1.l_suppkey ) and not exists ( select * from lineitem l3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate ) and s_nationkey = n_nationkey and n_name = 'SAUDI ARABIA' group by s_name order by numwait desc, s_name limit 100;
Q22:
with temp_table_1 as ( select avg(c_acctbal) as avg_value from customer where c_acctbal > 0.00 and substring(c_phone from 1 for 2) in ('33', '29', '37', '35', '25', '27', '43') ), temp_table_2 as ( select count(*) as count1 from orders, customer where o_custkey = c_custkey ) select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal from ( select substring(c_phone from 1 for 2) as cntrycode, c_acctbal from customer, temp_table_1, temp_table_2 where substring(c_phone from 1 for 2) in ('33', '29', '37', '35', '25', '27', '43') and c_acctbal > temp_table_1.avg_value and temp_table_2.count1 = 0) as custsale group by cntrycode order by cntrycode limit 1;
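Q22 relies on `substring(c_phone from 1 for 2)`, which takes two characters starting at position 1 (SQL positions are 1-based). The Python sketch below shows the equivalent slice; the sample phone numbers are illustrative values in a two-digit-prefix format, not rows quoted from the dataset.

```python
# Country codes listed in Q22's IN (...) clause.
CODES = {"33", "29", "37", "35", "25", "27", "43"}


def country_code(c_phone):
    """Equivalent of SQL substring(c_phone from 1 for 2): the first two characters."""
    return c_phone[0:2]  # Python slices are 0-based, so position 1 is index 0


def matches_q22(c_phone):
    """True when the phone's two-character prefix is one of Q22's codes."""
    return country_code(c_phone) in CODES


if __name__ == "__main__":
    print(country_code("25-989-741-2988"), matches_q22("25-989-741-2988"))
```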
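Data Lake Analytics supports a "Sync Execute" mode and an "Async Execute" mode. In "Sync Execute" mode, the console waits for the result to come back; in "Async Execute" mode, it returns the ID of the query task immediately.
Click "Execution Status" to see the state of the asynchronous query task. The main states are "RUNNING", "SUCCESS", and "FAILURE".
Click "Refresh". When STATUS changes to "SUCCESS", the query has succeeded, and you can also see the elapsed time ("ELAPSE_TIME") and the number of bytes scanned by the query ("SCANNED_DATA_BYTES").
Click "Execution History" to see detailed history for the queries you have run, including:
1) the query statement;
2) the elapsed time and the time of execution;
3) the number of rows returned;
4) the query status;
5) the number of bytes scanned;
6) the target OSS file that the result set was written back to (Data Lake Analytics saves the query result set in the user's bucket).
Query result files are automatically uploaded to an OSS bucket in the user's region. They include the result data file and a metadata file describing the result set.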
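The refresh-until-done pattern for an asynchronous query can be sketched as a small polling loop. In the Python below, `get_status` is a hypothetical callable that you would supply yourself, since this tutorial does not describe any programmatic status API; only the three status strings come from the text above.

```python
import time


def wait_for_query(get_status, poll_seconds=2.0, max_polls=30):
    """Poll until the status leaves RUNNING, then return the final status.

    `get_status` is a caller-supplied function returning "RUNNING",
    "SUCCESS", or "FAILURE"; how it obtains the status is up to you.
    """
    for _ in range(max_polls):
        status = get_status()
        if status != "RUNNING":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("query still RUNNING after max_polls checks")


if __name__ == "__main__":
    # Simulated status sequence standing in for a real query.
    fake = iter(["RUNNING", "RUNNING", "SUCCESS"])
    print(wait_for_query(lambda: next(fake), poll_seconds=0))
```

Capping the number of polls keeps a stuck query from blocking the caller indefinitely.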
{QueryLocation}/{query_name|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv
{QueryLocation}/{query_name|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv.metadata
where QueryLocation is:
aliyun-oa-query-results-<your_account_id>-<oss_region>
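As a rough illustration of how the result path is assembled, the Python sketch below fills in the template with placeholder values. The account ID, region, and query ID are invented, and reading the `query_name|Unsaved` segment as "the query's saved name, or Unsaved when it has none" is an assumption about the template rather than documented behavior.

```python
from datetime import date


def result_paths(account_id, oss_region, query_id, run_date,
                 query_name=None, file_name="xxx.csv"):
    """Build the result-file and metadata-file paths from the template above.

    All arguments are placeholders; `file_name` stands in for the
    generated CSV name shown as xxx.csv in the template.
    """
    location = f"aliyun-oa-query-results-{account_id}-{oss_region}"
    name = query_name or "Unsaved"  # assumed meaning of {query_name|Unsaved}
    data = (f"{location}/{name}/{run_date:%Y}/{run_date:%m}/{run_date:%d}"
            f"/{query_id}/{file_name}")
    return data, data + ".metadata"


if __name__ == "__main__":
    for path in result_paths("123456", "oss-cn-hangzhou", "q_0001", date(2018, 7, 4)):
        print(path)
```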
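This concludes the tutorial, which walked step by step through using Data Lake Analytics to analyze CSV data files stored on OSS. Besides CSV, Data Lake Analytics also supports analyzing files in Parquet, ORC, JSON, RCFile, AVRO, and other formats. Parquet and ORC in particular offer major performance and cost advantages over CSV: for the same dataset they take less storage space and query faster, which also means lower analysis cost.
More tutorials and articles will follow, showing how to use Data Lake Analytics for analysis and exploration of data in the data lake, so you can get started with low-cost, query-in-place data analysis in the cloud.
This article is original content from the Yunqi Community and may not be reproduced without permission.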