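The data lake is a popular concept in today's big data industry: https://en.wikipedia.org/wiki/Data_lake. Analysis on a data lake needs no ETL, data migration, or other preparatory steps, and it supports joint analysis across all kinds of heterogeneous data sources, which greatly reduces cost and improves the user experience.
Alibaba Cloud now has its own data lake analytics product: https://www.aliyun.com/product/datalakeanalytics
You can apply for access there (the public beta is currently invitation-only) and follow this tutorial to analyze OTS data.
Product documentation: https://help.aliyun.com/product/70174.html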
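ETL (https://en.wikipedia.org/wiki/Extract,_transform,_load) stands for Extract, Transform, Load. It is an important tool in traditional data warehousing and in big data.
Extract: pull the required data from the source systems, which may be homogeneous or heterogeneous, for example Excel spreadsheets, XML files, and relational databases.
Transform: convert the source data into the format the target system requires for the analysis at hand, or clean and process the data.
Load: load the transformed data into the target database, where it serves as the basis for online analysis, data mining, and data presentation.
The whole ETL process is like building a pipeline between the source and target systems, with data flowing through it continuously.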
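Data Placement Optimization is currently a mainstream architectural direction for business systems on cloud platforms. Architects weigh what each storage engine offers the business in terms of read/write performance, stability, strong consistency, cost, ease of use, and development efficiency, and aim for a well-balanced system overall.
Moving data between heterogeneous data sources, however, is not easy. Many ETL tools are essentially frameworks, so you have to build a good number of auxiliary tools yourself. Their expressiveness is also weak and does not cover many scenarios, and their abstraction over and compatibility with heterogeneous data sources is less than ideal.
DLA, by contrast, fits ETL scenarios well in every respect. The figure below is a simplified architecture diagram of DLA. From the start, DLA was designed around the architectural principles of "MPP compute engine + separation of storage and compute + elasticity and high availability + heterogeneous data sources", and supporting reads and writes across heterogeneous data sources is a core goal of DLA.
Extract is implemented by connecting to heterogeneous data sources and running select + join + subquery logic. Transform, that is, converting and mapping the data flow, is implemented with Filter + Project + Aggregation + Sort + Functions. Load is implemented with insert. Here is an example: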
-- Basic format
insert into target_table (col1, col2, col3, ....) -- columns to load, and their order
select c1, c2, c3, ....                           -- must be type-compatible with the target columns; double-check the order
from ...                                          -- any data you want to query
where ...

-- An example
insert into target_table (id, name, age)
select s1.pk1, s2.name, s1.age
from source_table1 s1
join source_table2 s2
on s1.sid = s2.sid
where s1.xxx = 'yyy'
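The Transform step can also aggregate before loading. The statement below is only a sketch: the target table `nation_summary` and its columns `n_name`, `customer_cnt`, and `avg_acctbal` are hypothetical names introduced for illustration, and it assumes that `group by` with `count` and `avg` is accepted inside an insert ... select in your DLA version. The source tables are the TPC-H `customer` and `nation` tables used in this tutorial.

```sql
-- Hypothetical target table: nation_summary (n_name, customer_cnt, avg_acctbal).
-- Extract: read and join the two source tables.
-- Transform: filter to one market segment, then aggregate per nation.
-- Load: write one summary row per nation into the target.
insert into nation_summary (n_name, customer_cnt, avg_acctbal)
select n.n_name,
       count(*)         as customer_cnt,
       avg(c.c_acctbal) as avg_acctbal
from tpch_50x_text.customer c
join tpch_50x_text.nation n
  on c.c_nationkey = n.n_nationkey
where c.c_mktsegment = 'BUILDING'
group by n.n_name
```

As with the basic format, the select list has to line up with the target column list in both order and type.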
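Now let's try loading data into a different data source.
Prepare a DLA account (a test account is already available);
Prepare two source tables (two TPC-H tables stored on OSS, customer and nation) for the join and the query;
Prepare a target table in TableStore (https://help.aliyun.com/document_detail/27280.html);
Run the import SQL, then verify the result after the data is written.
a) Definitions of the two source tables: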
mysql> show create database tpch_50x_text;
+---------------+--------------------------------+
| Database      | Create Database                |
+---------------+--------------------------------+
| tpch_50x_text | CREATE DATABASE `tpch_50x_text`
WITH DBPROPERTIES (
    catalog = 'hive',
    location = 'oss://${your_bucket}/datasets/tpch/50x/text_date/'
)
COMMENT '' |
+---------------+--------------------------------+
1 row in set (0.02 sec)

mysql> show tables;
+------------+
| Table_Name |
+------------+
| customer   |
| nation     |
+------------+
2 rows in set (0.03 sec)

mysql> show create table customer;
+----------+--------------------------------+
| Table    | Create Table                   |
+----------+--------------------------------+
| customer | CREATE EXTERNAL TABLE `tpch_50x_text`.`customer` (
    `c_custkey` int,
    `c_name` string,
    `c_address` string,
    `c_nationkey` int,
    `c_phone` string,
    `c_acctbal` double,
    `c_mktsegment` string,
    `c_comment` string
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
STORED AS `TEXTFILE`
LOCATION 'oss://${your_bucket}/datasets/tpch/50x/text_date/customer_text' |
+----------+--------------------------------+
1 row in set (0.90 sec)

mysql> show create table nation;
+------------+--------------------------------+
| Table      | Create Table                   |
+------------+--------------------------------+
| nation     | CREATE EXTERNAL TABLE `tpch_50x_text`.`nation` (
    `n_nationkey` int,
    `n_name` string,
    `n_regionkey` int,
    `n_comment` string
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
STORED AS `TEXTFILE`
LOCATION 'oss://${your_bucket}/datasets/tpch/50x/text_date/nation_text' |
+------------+--------------------------------+
1 row in set (0.73 sec)
b) Prepare the TableStore database and table
## The database
mysql> show create database etl_ots_test;
+--------------+--------------------------------+
| Database     | Create Database                |
+--------------+--------------------------------+
| etl_ots_test | CREATE DATABASE `etl_ots_test`
WITH DBPROPERTIES (
    catalog = 'ots',
    location = 'https://${your_instance}.cn-shanghai.ots-internal.aliyuncs.com',
    instance = '${your_instance}'
)
COMMENT '' |
+--------------+--------------------------------+
1 row in set (0.02 sec)

## Switch to the database
mysql> use etl_ots_test;
Database changed

## The table
mysql> show create table test_insert;
+-------------+--------------------------------+
| Table       | Create Table                   |
+-------------+--------------------------------+
| test_insert | CREATE EXTERNAL TABLE `test_insert` (
    `id1_int` int NOT NULL COMMENT 'customer id, primary key',
    `c_address` varchar(20) NULL COMMENT 'customer address',
    `c_acctbal` double NULL COMMENT 'customer account balance',
    PRIMARY KEY (`id1_int`)
)
COMMENT '' |
+-------------+--------------------------------+
1 row in set (0.03 sec)
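The transcript above shows the definitions as reported by show create, not the statements that created them. Assuming the reported DDL can be run as it stands (an assumption; check the DLA documentation for the exact syntax your version accepts), creating the database and table would look roughly like this. `${your_instance}` is a placeholder for your own TableStore instance name.

```sql
-- Sketch only: statements copied from the show create output above.
-- Replace ${your_instance} with the name of your TableStore instance.
CREATE DATABASE `etl_ots_test`
WITH DBPROPERTIES (
    catalog = 'ots',
    location = 'https://${your_instance}.cn-shanghai.ots-internal.aliyuncs.com',
    instance = '${your_instance}'
)
COMMENT '';

USE etl_ots_test;

-- id1_int is the primary key and will receive c_custkey during the import.
CREATE EXTERNAL TABLE `test_insert` (
    `id1_int` int NOT NULL COMMENT 'customer id, primary key',
    `c_address` varchar(20) NULL COMMENT 'customer address',
    `c_acctbal` double NULL COMMENT 'customer account balance',
    PRIMARY KEY (`id1_int`)
)
COMMENT '';
```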
[Screenshot: the actual data]
c) Import the data, making sure the column order and types are compatible:
## Check the target table: it is empty
mysql> select * from etl_ots_test.test_insert;
Empty set (0.31 sec)
mysql> use tpch_50x_text;
Database changed

## Query nation; the nationkey of CANADA is 3, which we will look for later
mysql> select n_nationkey, n_name from nation;
+-------------+----------------+
| n_nationkey | n_name         |
+-------------+----------------+
|           0 | ALGERIA        |
|           1 | ARGENTINA      |
|           2 | BRAZIL         |
|           3 | CANADA         |
|           4 | EGYPT          |
|           5 | ETHIOPIA       |
|           6 | FRANCE         |
|           7 | GERMANY        |
|           8 | INDIA          |
|           9 | INDONESIA      |
|          10 | IRAN           |
|          11 | IRAQ           |
|          12 | JAPAN          |
|          13 | JORDAN         |
|          14 | KENYA          |
|          15 | MOROCCO        |
|          16 | MOZAMBIQUE     |
|          17 | PERU           |
|          18 | CHINA          |
|          19 | ROMANIA        |
|          20 | SAUDI ARABIA   |
|          21 | VIETNAM        |
|          22 | RUSSIA         |
|          23 | UNITED KINGDOM |
|          24 | UNITED STATES  |
+-------------+----------------+
25 rows in set (0.37 sec)

## Query customer; we only care about rows with c_nationkey = 3 and c_mktsegment = 'BUILDING'
mysql> select count(*) from customer where c_nationkey = 3 and c_mktsegment = 'BUILDING';
+----------+
| count(*) |
+----------+
|    60350 |
+----------+
1 row in set (0.66 sec)

## Look at the first few of those rows
mysql> select * from customer where c_nationkey = 3 and c_mktsegment = 'BUILDING' order by c_custkey limit 3;
+-----------+--------------------+-------------------------+-------------+-----------------+-----------+--------------+----------------------------------------------------------------------------------------------------+
| c_custkey | c_name             | c_address               | c_nationkey | c_phone         | c_acctbal | c_mktsegment | c_comment                                                                                          |
+-----------+--------------------+-------------------------+-------------+-----------------+-----------+--------------+----------------------------------------------------------------------------------------------------+
|        13 | Customer#000000013 | nsXQu0oVjD7PM659uC3SRSp |           3 | 13-761-547-5974 |   3857.34 | BUILDING     | ounts sleep carefully after the close frays. carefully bold notornis use ironic requests. blithely |
|        27 | Customer#000000027 | IS8GIyxpBrLpMT0u7       |           3 | 13-137-193-2709 |   5679.84 | BUILDING     | about the carefully ironic pinto beans. accoun                                                     |
|        40 | Customer#000000040 | gOnGWAyhSV1ofv          |           3 | 13-652-915-8939 |    1335.3 | BUILDING     | rges impress after the slyly ironic courts. foxes are. blithely                                    |
+-----------+--------------------+-------------------------+-------------+-----------------+-----------+--------------+----------------------------------------------------------------------------------------------------+
3 rows in set (0.78 sec)
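Before importing, let's be clear about the requirement: find the customers whose nation is 'CANADA' and whose market segment is 'BUILDING', sort them by c_custkey, take the first 10 rows, select the three columns c_custkey, c_address, and c_acctbal, and load the cleansed result into the OTS table test_insert for later use.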
## First run the query on its own to see which rows it returns
mysql> select c.c_custkey, c.c_address, c.c_acctbal
    -> from tpch_50x_text.customer c
    -> join tpch_50x_text.nation n
    -> on c.c_nationkey = n.n_nationkey
    -> where n.n_name = 'CANADA'
    -> and c.c_mktsegment = 'BUILDING'
    -> order by c.c_custkey
    -> limit 10;
+-----------+--------------------------------+-----------+
| c_custkey | c_address                      | c_acctbal |
+-----------+--------------------------------+-----------+
|        13 | nsXQu0oVjD7PM659uC3SRSp        |   3857.34 |
|        27 | IS8GIyxpBrLpMT0u7              |   5679.84 |
|        40 | gOnGWAyhSV1ofv                 |    1335.3 |
|        64 | MbCeGY20kaKK3oalJD,OT          |   -646.64 |
|       255 | I8Wz9sJBZTnEFG08lhcbfTZq3S     |   3196.07 |
|       430 | s2yfPEGGOqHfgkVSs5Rs6 qh,SuVmR |   7905.17 |
|       726 | 4w7DOLtN9Hy,xzZMR              |   6253.81 |
|       905 | f iyVEgCU2lZZPCebx5bGp5        |   -600.73 |
|      1312 | f5zgMB4MHLMSHaX0tDduHAmVd4     |    9459.5 |
|      1358 | t23gsl4TdVXqTZha DioEHIq5w7y   |   5149.23 |
+-----------+--------------------------------+-----------+
10 rows in set (1.09 sec)

## Run the import
mysql> insert into etl_ots_test.test_insert (id1_int, c_address, c_acctbal)
    -> select c.c_custkey, c.c_address, c.c_acctbal
    -> from tpch_50x_text.customer c
    -> join tpch_50x_text.nation n
    -> on c.c_nationkey = n.n_nationkey
    -> where n.n_name = 'CANADA'
    -> and c.c_mktsegment = 'BUILDING'
    -> order by c.c_custkey
    -> limit 10;
+------+
| rows |
+------+
|   10 |
+------+
1 row in set (2.14 sec)

## Verify the result: the rows match
mysql> select * from etl_ots_test.test_insert;
+---------+--------------------------------+-----------+
| id1_int | c_address                      | c_acctbal |
+---------+--------------------------------+-----------+
|      13 | nsXQu0oVjD7PM659uC3SRSp        |   3857.34 |
|      27 | IS8GIyxpBrLpMT0u7              |   5679.84 |
|      40 | gOnGWAyhSV1ofv                 |    1335.3 |
|      64 | MbCeGY20kaKK3oalJD,OT          |   -646.64 |
|     255 | I8Wz9sJBZTnEFG08lhcbfTZq3S     |   3196.07 |
|     430 | s2yfPEGGOqHfgkVSs5Rs6 qh,SuVmR |   7905.17 |
|     726 | 4w7DOLtN9Hy,xzZMR              |   6253.81 |
|     905 | f iyVEgCU2lZZPCebx5bGp5        |   -600.73 |
|    1312 | f5zgMB4MHLMSHaX0tDduHAmVd4     |    9459.5 |
|    1358 | t23gsl4TdVXqTZha DioEHIq5w7y   |   5149.23 |
+---------+--------------------------------+-----------+
10 rows in set (0.27 sec)
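Besides reading the rows back, a couple of aggregate queries can serve as a quick check of the load. This is a sketch that assumes count, min, and max work on the TableStore-backed table the same way count(*) works on the OSS tables earlier in this walkthrough. The values mentioned in the comments follow from the ten rows listed above and from the target table having been empty before the import.

```sql
-- The row count should equal the 10 rows reported by the insert.
select count(*) from etl_ots_test.test_insert;

-- The key range should run from 13 to 1358, the smallest and largest
-- c_custkey among the ten selected customers.
select min(id1_int), max(id1_int) from etl_ots_test.test_insert;
```

Checks like these scale better than a full select when the import is larger than a handful of rows.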
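d) Points to note:
Although this approach makes import and export fast, there are still a few things to watch out for, for example:
Wasn't the whole process simple? Want to import data from other kinds of sources? To DLA, every underlying data source is handled the same way: as long as the databases and tables of the other data sources are created correctly in DLA, you can read and write them normally and run ETL on them. Give it a try!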