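1. Data warehouse
A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical change and is used to support management decisions.
- Subject-oriented: a traditional database is organized around transaction processing, whereas a data warehouse is a collection of data organized around a particular domain. A subject is a closely related set of data that users care about.
- Integrated: the data in a warehouse comes from separate business-system databases, external data, and unstructured data, and it is integrated into one whole.
- Relatively stable: data in a warehouse should not be open to DML operations; it should be processed in batches instead.
- Reflects history: a data warehouse keeps every historical version of the data.
What we introduce today is one method a data warehouse uses to retain historical versions of data: the zipper table.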
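Here is a brief introduction to the architecture used in our data warehouse. It mainly consists of a source layer, a detail layer, a summary layer, a data mart layer, a report layer, and a dimension layer:
- Source layer: data collected from each business system is first stored in the source layer. The point to watch here is how the source data is collected: incrementally or in full. A well-designed business system should support incremental collection (a question to think about: what requirements must incremental collection satisfy?). The benefit is that collection consumes fewer warehouse resources and fewer business-system resources.
- Detail layer: this layer stores data in normalized form, and most of the data it processes comes from the source layer. Its main goals are the basic tasks a warehouse has to handle: designing subject-oriented storage structures, integrating data from different business sources, unifying coding standards, and retaining historical data (zipper tables are mainly designed and implemented in this layer).
- Summary layer: for the data integrated in the detail layer, the metrics that need aggregating are computed according to business definitions, and an initial denormalization joins the normalized detail-layer data into small wide tables so that the next processing step is easier.
- Data mart layer: oriented to the different consumers, this layer follows dimensional modeling and uses star schemas. Once designed, it should make producing reports and handling routine data-extraction tasks straightforward. Some warehouse designers take another approach here: instead of star schemas they build large wide tables in the mart layer, mainly on the grounds that these are easier for people to use.
- Report layer: produces reports according to each department's needs.
- Dimension layer: stores the warehouse's dimension tables and related data in one place.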
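Data warehouse design currently has two main camps, the Kimball and Inmon architectures; I will not describe the two approaches in detail here. In the projects I have personally worked on, taking either architecture to the extreme left the warehouse project with a low chance of success. I therefore suggest combining the strengths of both (that is, build the detail layer in third normal form and design the mart layer with star schemas). Combining the two sensibly avoids the tedious work that a business-driven approach brings, as well as the later maintenance and extensibility problems that a requirements-driven approach brings.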
A fictional example illustrates how a zipper table works:
1. Suppose that on 2017-01-01 we load the initial user data into the data warehouse. We add start_date and end_date columns to the user table (customer) in the warehouse to mark each row's life cycle, as follows:
cus_id  job        start_date  end_date
--------------------------------------------
10001   oracle     2017-01-01  3000-12-31
10002   pgsql      2017-01-01  3000-12-31
10003   mysql      2017-01-01  3000-12-31
10004   java       2017-01-01  3000-12-31
10005   python     2017-01-01  3000-12-31
2. On 2017-01-02, user 10004 is deleted, users 10006 and 10007 are added, and user 10003's job changes from mysql to mongodb. The detail data becomes:
cus_id  job        start_date  end_date
--------------------------------------------
10001   oracle     2017-01-01  3000-12-31
10002   pgsql      2017-01-01  3000-12-31
10003   mysql      2017-01-01  2017-01-02
10003   mongodb    2017-01-02  3000-12-31
10004   java       2017-01-01  2017-01-02
10005   python     2017-01-01  3000-12-31
10006   docker     2017-01-02  3000-12-31
10007   redis      2017-01-02  3000-12-31
3. On 2017-01-03, user 10007 is deleted, user 10006's job changes from docker to openstack, user 10003's job changes from mongodb to hive, and user 10008 is added. The detail data becomes:
cus_id  job        start_date  end_date
--------------------------------------------
10001   oracle     2017-01-01  3000-12-31
10002   pgsql      2017-01-01  3000-12-31
10003   mysql      2017-01-01  2017-01-02
10003   mongodb    2017-01-02  2017-01-03
10003   hive       2017-01-03  3000-12-31
10004   java       2017-01-01  2017-01-02
10005   python     2017-01-01  3000-12-31
10006   docker     2017-01-02  2017-01-03
10006   openstack  2017-01-03  3000-12-31
10007   redis      2017-01-02  2017-01-03
10008   hadoop     2017-01-03  3000-12-31
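The three steps above can be sketched in a few lines of Python. This is only an in-memory illustration, not the warehouse implementation: the names Row, apply_changes and OPEN_END are made up for this sketch, the dates follow the 2017-01-01 to 2017-01-03 timeline described in the steps, and 3000-12-31 is assumed as the open-ended end_date, as in the queries later in this post.

```python
from dataclasses import dataclass

OPEN_END = "3000-12-31"  # assumed sentinel end_date marking the current version


@dataclass
class Row:
    cus_id: int
    job: str
    start_date: str
    end_date: str


def apply_changes(table, changes, load_date):
    """Return a new zipper table after applying one day's changes.

    changes is a list of (cus_id, job, status) tuples, where status is
    'I' (insert), 'U' (update) or 'D' (delete).
    """
    closing = {cus_id for cus_id, _, status in changes if status in ("U", "D")}
    result = []
    for row in table:
        if row.end_date == OPEN_END and row.cus_id in closing:
            # Close the current version instead of overwriting it.
            result.append(Row(row.cus_id, row.job, row.start_date, load_date))
        else:
            result.append(row)
    for cus_id, job, status in changes:
        if status in ("I", "U"):
            # Open a new current version for inserted and updated users.
            result.append(Row(cus_id, job, load_date, OPEN_END))
    return sorted(result, key=lambda r: (r.cus_id, r.start_date))


day1 = [Row(i, j, "2017-01-01", OPEN_END) for i, j in [
    (10001, "oracle"), (10002, "pgsql"), (10003, "mysql"),
    (10004, "java"), (10005, "python")]]
day2 = apply_changes(day1, [(10003, "mongodb", "U"), (10004, "java", "D"),
                            (10006, "docker", "I"), (10007, "redis", "I")],
                     "2017-01-02")
day3 = apply_changes(day2, [(10003, "hive", "U"), (10007, "redis", "D"),
                            (10006, "openstack", "U"), (10008, "hadoop", "I")],
                     "2017-01-03")
for row in day3:
    print(row.cus_id, row.job, row.start_date, row.end_date)
```

A deleted user gets no new row; its last version is simply closed, which is why 10004 and 10007 end up with no row ending on 3000-12-31.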
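How the zipper table works: taking user 10003 as an example, recording the timeline of changes to this user's data reveals the following pattern:
2017-01-01 first registration, job is mysql;
2017-01-02 job change, job becomes mongodb;
2017-01-03 job change, job becomes hive.
On this timeline of job changes, user 10003 has exactly one job at any point in time. From 2017-01-01 to 2017-01-02 the job is mysql, from 2017-01-02 to 2017-01-03 it is mongodb, and from 2017-01-03 to 3000-12-31 it is hive (each interval includes its start date and excludes its end date). Every record in a zipper table follows this rule. So how do we access zipper-table data accurately?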
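To make the "one job at any point in time" rule concrete, here is a small sketch of a point-in-time lookup over user 10003's timeline. The function name job_as_of and the tuple layout are assumptions made for this illustration; the intervals are treated as including start_date and excluding end_date.

```python
def job_as_of(history, day):
    """Return the job valid on `day`, or None if no version covers it.

    Each interval is half-open: start_date <= day < end_date. ISO dates
    (YYYY-MM-DD) compare correctly as plain strings.
    """
    matches = [job for job, start, end in history if start <= day < end]
    # A well-formed zipper table never has two versions valid on one day.
    assert len(matches) <= 1, "overlapping versions"
    return matches[0] if matches else None


history_10003 = [
    ("mysql", "2017-01-01", "2017-01-02"),
    ("mongodb", "2017-01-02", "2017-01-03"),
    ("hive", "2017-01-03", "3000-12-31"),
]
print(job_as_of(history_10003, "2017-01-02"))  # mongodb
print(job_as_of(history_10003, "2016-12-31"))  # None
```

Because the end date is excluded, 2017-01-02 matches only the mongodb version even though the mysql version ends on that same date.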
Accessing a zipper table:
1. Access the latest data in the zipper table:
select * from customer t where t.end_date = '3000-12-31';
2. Access the historical snapshot for 2017-01-01:
select * from customer t where t.start_date <= '2017-01-01' and t.end_date > '2017-01-01';
3. Access the historical snapshot for 2017-01-02:
select * from customer t where t.start_date <= '2017-01-02' and t.end_date > '2017-01-02';
4. Access the full history of user 10003:
select * from customer t where t.cus_id = '10003';
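The snapshot queries differ only in the date, so they can be parameterized. The sketch below uses SQLite through Python's standard sqlite3 module purely as a stand-in database so that it runs on its own; the helper name snapshot is made up for this example, and the rows are the state of the customer table after the 2017-01-03 changes.

```python
import sqlite3

# State of the example customer table after the 2017-01-03 changes.
ROWS = [
    (10001, "oracle", "2017-01-01", "3000-12-31"),
    (10002, "pgsql", "2017-01-01", "3000-12-31"),
    (10003, "mysql", "2017-01-01", "2017-01-02"),
    (10003, "mongodb", "2017-01-02", "2017-01-03"),
    (10003, "hive", "2017-01-03", "3000-12-31"),
    (10004, "java", "2017-01-01", "2017-01-02"),
    (10005, "python", "2017-01-01", "3000-12-31"),
    (10006, "docker", "2017-01-02", "2017-01-03"),
    (10006, "openstack", "2017-01-03", "3000-12-31"),
    (10007, "redis", "2017-01-02", "2017-01-03"),
    (10008, "hadoop", "2017-01-03", "3000-12-31"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table customer (cus_id integer, job text, start_date text, end_date text)"
)
conn.executemany("insert into customer values (?, ?, ?, ?)", ROWS)


def snapshot(conn, day):
    """Rows valid on `day`: start_date <= day < end_date."""
    sql = (
        "select cus_id, job from customer "
        "where start_date <= ? and end_date > ? order by cus_id"
    )
    return conn.execute(sql, (day, day)).fetchall()


print(snapshot(conn, "2017-01-01"))
print(snapshot(conn, "2017-01-02"))
```

Storing the dates as YYYY-MM-DD strings keeps plain string comparison equivalent to date comparison, which is what the original queries rely on as well.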
1. Prepare the data:
1) Initial data for 2017-01-01:
cus_id  job     start_date  end_date    dtype  dw_status  dw_ins_date
----------------------------------------------------------------------
10001   oracle  2017-01-01  3000-12-31  C      I          2017-01-01
10002   pgsql   2017-01-01  3000-12-31  C      I          2017-01-01
10003   mysql   2017-01-01  3000-12-31  C      I          2017-01-01
10004   java    2017-01-01  3000-12-31  C      I          2017-01-01
10005   python  2017-01-01  3000-12-31  C      I          2017-01-01
2) Incremental data for 2017-01-02:
cus_id  job      dw_status  dw_ins_date
----------------------------------------
10003   mongodb  U          2017-01-02
10004   java     D          2017-01-02
10006   docker   I          2017-01-02
10007   redis    I          2017-01-02
3) Incremental data for 2017-01-03:
cus_id  job        dw_status  dw_ins_date
------------------------------------------
10003   hive       U          2017-01-03
10007   redis      D          2017-01-03
10006   openstack  U          2017-01-03
10008   hadoop     I          2017-01-03
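The incremental tables above carry a dw_status flag (I, U, D). This post does not say how the flags are produced; as one possible illustration, assuming the source system can only deliver full daily snapshots, they could be derived by comparing two consecutive snapshots. The function name diff_snapshots and the dictionary layout are hypothetical.

```python
def diff_snapshots(previous, current, load_date):
    """Compare two {cus_id: job} snapshots and emit incremental rows.

    Returns (cus_id, job, dw_status, dw_ins_date) tuples ordered by cus_id.
    """
    changes = []
    for cus_id in sorted(previous.keys() | current.keys()):
        if cus_id not in previous:
            changes.append((cus_id, current[cus_id], "I", load_date))
        elif cus_id not in current:
            # Deleted rows carry the last known job, as in the tables above.
            changes.append((cus_id, previous[cus_id], "D", load_date))
        elif previous[cus_id] != current[cus_id]:
            changes.append((cus_id, current[cus_id], "U", load_date))
    return changes


snap_0101 = {10001: "oracle", 10002: "pgsql", 10003: "mysql",
             10004: "java", 10005: "python"}
snap_0102 = {10001: "oracle", 10002: "pgsql", 10003: "mongodb",
             10005: "python", 10006: "docker", 10007: "redis"}
for change in diff_snapshots(snap_0101, snap_0102, "2017-01-02"):
    print(change)
```

For the two snapshots shown, this yields the same four rows as the 2017-01-02 incremental table. A source system that logs its own changes would make this comparison unnecessary.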
2. Data loading:
1) Initialize the customer table:
drop table customer;
create table customer(
cus_id int,
job varchar2(20),
start_date varchar2(10),
end_date varchar2(10),
dtype varchar2(1),
dw_status varchar2(1),
dw_ins_date varchar2(10)
)
partition by list(end_date)
(
partition cus_par20170101 values('2017-01-01') tablespace users,
partition cus_par20170102 values('2017-01-02') tablespace users,
partition cus_par20170103 values('2017-01-03') tablespace users,
partition cus_par30001231 values('3000-12-31') tablespace users
);
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10001,'oracle','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10002,'pgsql','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10003,'mysql','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10004,'java','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10005,'python','2017-01-01','3000-12-31','C','I','2017-01-01');
2) Initialize the incremental table for 2017-01-02:
create table customer_inc(
cus_id int,
job varchar2(20),
dw_status varchar2(1),
dw_ins_date varchar2(10)
);
truncate table customer_inc;
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10003,'mongodb','U','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10004,'java','D','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10006,'docker','I','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10007,'redis','I','2017-01-02');
3) Create the intermediate table:
drop table customer_tmp0;
create table customer_tmp0(
cus_id int,
job varchar2(20),
start_date varchar2(10),
end_date varchar2(10),
dtype varchar2(1),
dw_status varchar2(1),
dw_ins_date varchar2(10)
)
partition by list(dtype)
(
partition cus_dtype_H values('H') tablespace users,
partition cus_dtype_C values('C') tablespace users
);
3. Refresh customer from customer_inc (2017-01-02):
1) Join the latest partition of customer with the update and delete rows in customer_inc, and process the changed rows in the latest partition:
insert into customer_tmp0
select
t1.cus_id,
t1.job,
t1.start_date,
case when t2.cus_id is null then t1.end_date else '2017-01-02' end as end_date,
case when t2.cus_id is null then 'C' else 'H' end dtype,
case when t2.cus_id is null then t1.dw_status else t2.dw_status end dw_status,
case when t2.cus_id is null then t1.dw_ins_date else t2.dw_ins_date end as dw_ins_date
from customer t1 left join customer_inc t2 on t1.cus_id = t2.cus_id and t2.dw_status in ('D','U')
where t1.end_date = '3000-12-31'
order by cus_id asc
;
2) Insert the updated and newly inserted rows from customer_inc into the temporary table customer_tmp0:
insert into customer_tmp0
select
t1.cus_id,
t1.job,
'2017-01-02' as start_date,
'3000-12-31' as end_date,
'C' as dtype,
t1.dw_status,
'2017-01-03' as dw_ins_date
from customer_inc t1
where t1.dw_status in ('I','U')
;
3) Synchronize the result into the customer fact table; a partition exchange can be used for this step:
alter table customer truncate partition cus_par30001231;
insert into customer
select * from customer_tmp0;
4) Check the result:
SQL> select * from customer order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10003 mongodb 2017-01-02 3000-12-31 C U 2017-01-03
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 3000-12-31 C I 2017-01-03
10007 redis 2017-01-02 3000-12-31 C I 2017-01-03
8 rows selected
SQL>
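Steps 1) to 3) depend only on the load date, so they can be wrapped in one routine. The sketch below is an approximation under stated assumptions, not the Oracle procedure itself: it runs on SQLite through Python's sqlite3 module, it replaces the partition truncate with a delete of the rows whose end_date is 3000-12-31, it leaves out the dw_ins_date column for brevity, and the function name refresh is made up for this example.

```python
import sqlite3

OPEN_END = "3000-12-31"


def refresh(conn, load_date):
    """Apply customer_inc to customer for one load date (steps 1 to 3)."""
    conn.execute("delete from customer_tmp0")
    # Step 1: copy current rows, closing those updated or deleted today.
    conn.execute(
        """
        insert into customer_tmp0
        select t1.cus_id, t1.job, t1.start_date,
               case when t2.cus_id is null then t1.end_date else ? end,
               case when t2.cus_id is null then 'C' else 'H' end,
               case when t2.cus_id is null then t1.dw_status else t2.dw_status end
        from customer t1
        left join customer_inc t2
          on t1.cus_id = t2.cus_id and t2.dw_status in ('D', 'U')
        where t1.end_date = ?
        """,
        (load_date, OPEN_END),
    )
    # Step 2: open a new current version for inserted and updated rows.
    conn.execute(
        """
        insert into customer_tmp0
        select cus_id, job, ?, ?, 'C', dw_status
        from customer_inc where dw_status in ('I', 'U')
        """,
        (load_date, OPEN_END),
    )
    # Step 3: swap the rebuilt rows in for the old current rows.
    conn.execute("delete from customer where end_date = ?", (OPEN_END,))
    conn.execute("insert into customer select * from customer_tmp0")
    conn.commit()


conn = sqlite3.connect(":memory:")
cols = ("cus_id integer, job text, start_date text, end_date text, "
        "dtype text, dw_status text")
conn.execute(f"create table customer ({cols})")
conn.execute(f"create table customer_tmp0 ({cols})")
conn.execute("create table customer_inc (cus_id integer, job text, dw_status text)")
conn.executemany(
    "insert into customer values (?, ?, '2017-01-01', '3000-12-31', 'C', 'I')",
    [(10001, "oracle"), (10002, "pgsql"), (10003, "mysql"),
     (10004, "java"), (10005, "python")],
)
conn.executemany(
    "insert into customer_inc values (?, ?, ?)",
    [(10003, "mongodb", "U"), (10004, "java", "D"),
     (10006, "docker", "I"), (10007, "redis", "I")],
)
refresh(conn, "2017-01-02")
for row in conn.execute("select * from customer order by cus_id, start_date"):
    print(row)
```

Only rows with end_date 3000-12-31 are ever rewritten, so closed history stays untouched however large it grows; that is also why the original design partitions customer by end_date.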
4. Refresh customer from customer_inc (2017-01-03):
1) Initialize the incremental table for 2017-01-03:
truncate table customer_inc;
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10003,'hive','U','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10008,'hadoop','I','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10006,'openstack','U','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10007,'redis','D','2017-01-03');
2) Join the latest partition of customer with the update and delete rows in customer_inc, and process the changed rows in the latest partition:
truncate table customer_tmp0;
insert into customer_tmp0
select
t1.cus_id,
t1.job,
t1.start_date,
case when t2.cus_id is null then t1.end_date else '2017-01-03' end as end_date,
case when t2.cus_id is null then 'C' else 'H' end dtype,
case when t2.cus_id is null then t1.dw_status else t2.dw_status end dw_status,
case when t2.cus_id is null then t1.dw_ins_date else t2.dw_ins_date end as dw_ins_date
from customer t1 left join customer_inc t2 on t1.cus_id = t2.cus_id and t2.dw_status in ('D','U')
where t1.end_date = '3000-12-31'
order by cus_id asc
;
3) Insert the updated and newly inserted rows from customer_inc into the temporary table customer_tmp0:
insert into customer_tmp0
select
t1.cus_id,
t1.job,
'2017-01-03' as start_date,
'3000-12-31' as end_date,
'C' as dtype,
t1.dw_status,
'2017-01-04' as dw_ins_date
from customer_inc t1
where t1.dw_status in ('I','U')
;
4) Synchronize the result into the customer fact table; a partition exchange can be used for this step:
alter table customer truncate partition cus_par30001231;
insert into customer
select * from customer_tmp0;
5) Check the result:
SQL> select * from customer order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
----------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 2017-01-03 H U 2017-01-03
10006 openstack 2017-01-03 3000-12-31 C U 2017-01-04
10007 redis 2017-01-02 2017-01-03 H D 2017-01-03
10008 hadoop 2017-01-03 3000-12-31 C I 2017-01-04
11 rows selected
SQL>
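After a load it can be worth checking that the table still obeys the zipper rules. The following is a minimal sketch of such a check, not part of the original procedure: the function name check_zipper is made up, and it assumes a user's versions should be contiguous, which holds for this example but would need relaxing if a deleted user could be re-registered later.

```python
from collections import defaultdict

OPEN_END = "3000-12-31"


def check_zipper(rows):
    """Return a list of problems found in (cus_id, job, start, end) rows."""
    problems = []
    by_id = defaultdict(list)
    for cus_id, _job, start, end in rows:
        by_id[cus_id].append((start, end))
    for cus_id, spans in by_id.items():
        spans.sort()
        # Each version must end exactly where the next one starts.
        for (_start, end), (next_start, _next_end) in zip(spans, spans[1:]):
            if end != next_start:
                problems.append(f"{cus_id}: gap or overlap at {end}/{next_start}")
        for start, end in spans:
            if start >= end:
                problems.append(f"{cus_id}: empty interval {start}..{end}")
        if sum(1 for _s, end in spans if end == OPEN_END) > 1:
            problems.append(f"{cus_id}: more than one current row")
    return problems


rows = [
    (10001, "oracle", "2017-01-01", "3000-12-31"),
    (10002, "pgsql", "2017-01-01", "3000-12-31"),
    (10003, "mongodb", "2017-01-02", "2017-01-03"),
    (10003, "hive", "2017-01-03", "3000-12-31"),
    (10003, "mysql", "2017-01-01", "2017-01-02"),
    (10004, "java", "2017-01-01", "2017-01-02"),
    (10005, "python", "2017-01-01", "3000-12-31"),
    (10006, "docker", "2017-01-02", "2017-01-03"),
    (10006, "openstack", "2017-01-03", "3000-12-31"),
    (10007, "redis", "2017-01-02", "2017-01-03"),
    (10008, "hadoop", "2017-01-03", "3000-12-31"),
]
print(check_zipper(rows))
```

The rows are listed in the order the query above returned them, which shows that the check does not depend on row order.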
5. Query the zipper table:
1) Query the latest data in the zipper table:
SQL> select * from customer where end_date = '3000-12-31' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
--------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 openstack 2017-01-03 3000-12-31 C U 2017-01-04
10008 hadoop 2017-01-03 3000-12-31 C I 2017-01-04
6 rows selected
SQL>
2) Query the historical snapshot for 2017-01-01:
SQL> select * from customer where start_date <= '2017-01-01' and end_date > '2017-01-01' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
--------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
SQL>
3) Query the historical snapshot for 2017-01-02:
SQL> select * from customer where start_date <= '2017-01-02' and end_date > '2017-01-02' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 2017-01-03 H U 2017-01-03
10007 redis 2017-01-02 2017-01-03 H D 2017-01-03
6 rows selected
SQL>
4) View all data for user 10003:
SQL> select * from customer where cus_id = '10003';
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
SQL>