Summarization patterns are about obtaining a big-picture view of the data. There are three main kinds:
-- HSQL
SELECT MIN(num), MAX(num), COUNT(*) FROM table GROUP BY groupcol;
-- Pig
b = GROUP a BY groupcol;
c = FOREACH b GENERATE group, MIN(a.num), MAX(a.num), COUNT_STAR(a);
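The grouped aggregate above can be pictured as a map step that emits one partial summary per record and a reduce step that merges the partial summaries for each key. The following is a minimal Python sketch of that idea, not an actual Hadoop job; the function names `map_record`, `combine`, and `summarize` are illustrative assumptions.

```python
def map_record(record):
    """Mapper: emit (group key, (min, max, count)) for one input record."""
    group, num = record
    return group, (num, num, 1)

def combine(left, right):
    """Reducer/combiner: merge two partial (min, max, count) tuples."""
    return (min(left[0], right[0]), max(left[1], right[1]), left[2] + right[2])

def summarize(records):
    """Simulate the shuffle by keying on the group, folding summaries as we go."""
    summaries = {}
    for record in records:
        key, value = map_record(record)
        summaries[key] = combine(summaries[key], value) if key in summaries else value
    return summaries

rows = [("a", 3), ("b", 7), ("a", 1), ("b", 9), ("a", 5)]
print(summarize(rows))  # {'a': (1, 5, 3), 'b': (7, 9, 2)}
```

Because `combine` is associative and commutative, the same function could serve as a combiner on the map side, which is what keeps the amount of shuffled data small for this kind of query.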
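Filtering patterns leave the original records unchanged and instead look for a subset of them. They are mainly used in the following ways: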
-- HSQL
SELECT * FROM table WHERE value < 3;
-- Pig
b = FILTER a BY value < 3;
-- HSQL
SELECT * FROM table ORDER BY col DESC LIMIT 10;
-- Pig
b = ORDER a BY col DESC;
c = LIMIT b 10;
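A full sort is not strictly needed to get the largest N records. One common way to do it in MapReduce terms is for each mapper to keep only its local top N and for a single reducer to merge those candidates. The sketch below simulates that in plain Python; `local_top_n` and `global_top_n` are hypothetical names, and the input splits are made-up sample data.

```python
import heapq

def local_top_n(split, n, key):
    """Mapper side: keep only the n largest records of one input split."""
    return heapq.nlargest(n, split, key=key)

def global_top_n(splits, n, key):
    """Single reducer: merge the local candidates and take the n largest again."""
    candidates = [rec for split in splits for rec in local_top_n(split, n, key)]
    return heapq.nlargest(n, candidates, key=key)

splits = [[5, 1, 9, 3], [8, 2, 7], [6, 4, 10]]
print(global_top_n(splits, 3, key=lambda x: x))  # [10, 9, 8]
```

Each split contributes at most N records to the final step, so the single reducer only ever sees N times the number of splits, regardless of the total input size.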
-- HSQL
SELECT DISTINCT * FROM table;
-- Pig
b = DISTINCT a;
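Deduplication maps naturally onto the shuffle: if the whole record is used as the key, all duplicates arrive at the same reducer, which emits the key once. A minimal Python simulation, assuming records are hashable tuples and using illustrative function names:

```python
def map_record(record):
    """Mapper: use the whole record as the key; the value carries no information."""
    return record, None

def reduce_key(key, values):
    """Reducer: emit each key once, however many duplicates were grouped under it."""
    return key

def distinct(records):
    """Simulate the shuffle with a dict, then run the reducer once per key."""
    groups = {}
    for record in records:
        key, value = map_record(record)
        groups.setdefault(key, []).append(value)
    return [reduce_key(key, values) for key, values in groups.items()]

rows = [("x", 1), ("y", 2), ("x", 1), ("z", 3), ("y", 2)]
print(distinct(rows))  # [('x', 1), ('y', 2), ('z', 3)]
```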
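Data organization patterns restructure a set of data; the emphasis is on raising the value of individual records by placing them in the context of the whole data set. The main design patterns are: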
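-- HSQL
-- Uncommon in relational databases: an RDBMS usually solves a similar problem by joining the data first and then analyzing the result.
-- Pig
-- Pig has some support for hierarchical data structures, including nested bags and tuples.
a = LOAD '/data/a' USING PigStorage('|');
b = LOAD '/data/b' USING PigStorage(',');
group_c = COGROUP a BY $2, b BY $1;
analyzed = FOREACH group_c GENERATE udfs.analyze(group, $1, $2);
...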
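To make the COGROUP step more concrete, here is a rough Python sketch of what it produces: for every key, one bag of matching records from each input, which a later step can nest into a hierarchical record. The sample data, the field positions (mirroring `$2` and `$1` above), and the `to_hierarchy` function (a stand-in for the unspecified `udfs.analyze`) are all assumptions made for illustration.

```python
from collections import defaultdict

def cogroup(left, left_key, right, right_key):
    """Group two record lists by key, keeping each side in its own bag."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[rec[left_key]][0].append(rec)
    for rec in right:
        groups[rec[right_key]][1].append(rec)
    return dict(groups)

def to_hierarchy(key, left_bag, right_bag):
    """Hypothetical analysis step: nest both bags under their shared key."""
    return {"key": key, "a": left_bag, "b": right_bag}

# Made-up sample data: the key is field 2 of `a` and field 1 of `b`.
a = [("p1", "Hello", "u1"), ("p2", "Bye", "u2")]
b = [("c1", "u1", "nice"), ("c2", "u1", "+1")]

grouped = cogroup(a, 2, b, 1)
analyzed = [to_hierarchy(key, bags[0], bags[1]) for key, bags in grouped.items()]
print(analyzed)
```

Unlike a join, a key that appears on only one side still yields a group; the bag for the other side is simply empty.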
-- HSQL
SELECT * FROM table ORDER BY col DESC;
-- Pig
b = ORDER a BY col DESC;
-- HSQL
SELECT * FROM table ORDER BY RAND();
-- Pig
b = GROUP a BY RANDOM();          -- group on a random key
c = FOREACH b GENERATE FLATTEN(a); -- flatten the groups back into records
Join patterns are a way of combining data that lives in several places. The main ones are:
-- HSQL
SELECT column_name FROM table_name1 LEFT JOIN table_name2 ON table_name1.column_name = table_name2.column_name;
-- Pig
C = JOIN A BY a1 LEFT OUTER, B BY b1; -- left outer join; RIGHT OUTER and FULL OUTER are also available
C = JOIN A BY a1, B BY b1;            -- inner join
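A general-purpose join of this kind is typically carried out on the reduce side: mappers tag each record with its source, the shuffle groups records by join key, and the reducer pairs up the two sides. The Python sketch below simulates those three phases for inner and left outer joins only; the function name, the `how` parameter, and the sample data are illustrative assumptions.

```python
from collections import defaultdict

def reduce_side_join(left, right, how="inner"):
    """Join two lists of (key, value) pairs; how is 'inner' or 'left'."""
    # Map phase: tag every record with the side it came from.
    tagged = [(k, "L", v) for k, v in left] + [(k, "R", v) for k, v in right]
    # Shuffle phase: bring all records with the same key together.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, side, value in tagged:
        groups[key][side].append(value)
    # Reduce phase: pair up the two sides for each key.
    joined = []
    for key, sides in groups.items():
        rights = sides["R"]
        if not rights and how == "left":
            rights = [None]  # keep unmatched left records, padded with None
        for left_value in sides["L"]:
            for right_value in rights:
                joined.append((key, left_value, right_value))
    return joined

A = [(1, "a1"), (2, "a2")]
B = [(1, "b1"), (3, "b3")]
print(reduce_side_join(A, B, "inner"))  # [(1, 'a1', 'b1')]
print(reduce_side_join(A, B, "left"))   # [(1, 'a1', 'b1'), (2, 'a2', None)]
```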
-- Pig
-- Only inner joins and left outer joins support the replicated join optimization.
-- Every data set except the first must fit in memory.
big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
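The memory requirement follows from how a replicated join works: each mapper loads the small data sets into in-memory hash tables and streams the large data set past them, so no shuffle or reduce phase is needed. A minimal Python sketch of the inner-join case, assuming every table is joined on its first field (as in the Pig example) and using a hypothetical function name:

```python
def replicated_join(big, small_tables):
    """Map-only inner join: stream `big`, probe in-memory indexes of the small tables."""
    # Setup: build one hash index per small table, keyed on its first field.
    indexes = []
    for table in small_tables:
        index = {}
        for row in table:
            index.setdefault(row[0], []).append(row)
        indexes.append(index)
    # Map: for each big row, extend it with every matching row of each small table.
    for row in big:
        partial = [row]
        for index in indexes:
            matches = index.get(row[0], [])
            partial = [p + m for p in partial for m in matches]
        yield from partial  # empty when any small table has no match (inner join)

big = [(1, "b"), (2, "b"), (3, "b")]
tiny = [(1, "t"), (2, "t")]
mini = [(1, "m")]
print(list(replicated_join(big, [tiny, mini])))  # [(1, 'b', 1, 't', 1, 'm')]
```

Only the large input is streamed; the indexes are built once per mapper, which is why the small data sets have to be small enough to hold in memory.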
-- HSQL
SELECT * FROM a, b;
-- Pig
c = CROSS a, b;