You can use the Pig Latin LOAD operator to load data into Apache Pig from a file system (HDFS or local).
A LOAD statement consists of two parts separated by the = operator. On the left we give the name of the relation in which we want to store the data; on the right we define how the data is loaded. The syntax of the LOAD operator is given below.
Relation_name = LOAD 'Input file path' USING function as schema;
Where:
relation_name - the relation in which we want to store the data. Leave a space between the name and the = that follows, otherwise an error is reported.
Input file path - the HDFS directory where the file is stored (in MapReduce mode).
function - a function chosen from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
Schema - the schema of the data, which can be defined as follows:
(column1 : data type, column2 : data type, column3 : data type);
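For instance, a minimal sketch (the path, relation name and fields here are hypothetical, not the session data used below):
grunt> users = LOAD 'hdfs://localhost:9000/data/users.txt' USING PigStorage(',') as (id:int, name:chararray, age:int);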
The HDFS file to be loaded:
[root@host ~]# hdfs dfs -cat hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
600,null,2017-11-15 14:50:05.0,hunan changsha,0,91
650,null,2017-11-01 17:24:34.0,null,1,29
Pig executes Pig Latin statements as follows:
1. Pig first validates the syntax and semantics of all statements.
2. When it encounters a DUMP or STORE command, Pig executes all of the preceding statements in order.
So some Pig statements are not executed automatically; they have to be triggered by another command, and once triggered the whole chain runs to completion in one pass.
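A minimal sketch of this behavior (path and relation names hypothetical):
grunt> a = LOAD 'hdfs://localhost:9000/data/users.txt' USING PigStorage(',') as (id:int, age:int);
grunt> adults = filter a by age >= 18;  -- parsed and validated, but nothing runs yet
grunt> dump adults;  -- only now is a job launched, executing both statements above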
Loading data
grunt> customer =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as ( roleid:Int,name:chararray,dateid:Datetime,addr:chararray,sex:Int,level:Int );
grunt> b =foreach customer generate roleid; -- foreach transforms data based on the columns of a relation
grunt> dump b
..................................
2018-06-15 14:54:22,671 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2018-06-15 14:54:22,672 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600)
(650)
(600)
(650)
(600)
(650)
(600)
(650)
grunt> b =foreach customer generate roleid,sex;
...................
2018-06-15 15:14:20,495 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
grunt> dump customer;
...........................
2018-06-15 14:59:53,355 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2018-06-15 14:59:53,355 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
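Note how the datetime column now prints in ISO-8601 form with a +08:00 time zone (2017-11-15T14:50:05.000+08:00) rather than the raw text of the source file; this comes from casting the field to datetime in the load schema.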
Storing data
grunt> store customer into 'hdfs://localhost:9000/pig' USING PigStorage(',');
......................................
2018-06-15 15:23:24,887 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2018-06-15 15:23:24,888 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.4 0.17.0 root 2018-06-15 15:23:18 2018-06-15 15:23:24 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_local1251771877_0005 1 0 n/a n/a n/a n/a 0 0 0 0 customer MAP_ONLY hdfs://localhost:9000/pig,
Input(s): Successfully read 8 records (28781246 bytes) from: "hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003"
Output(s): Successfully stored 8 records (28779838 bytes) in: "hdfs://localhost:9000/pig"
Counters:
Total records written : 8
Total bytes written : 28779838
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_local1251771877_0005
2018-06-15 15:23:24,892 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Checking HDFS (the second listing below shows the output of a similar store into /pig/20180615):
[root@host ~]# hdfs dfs -ls -R /pig
-rw-r--r-- 1 root supergroup 0 2018-06-15 15:23 /pig/_SUCCESS
-rw-r--r-- 1 root supergroup 432 2018-06-15 15:23 /pig/part-m-00000
[root@host ~]# hdfs dfs -ls -R /pig/20180615/
-rw-r--r-- 1 root supergroup 0 2018-06-15 15:27 /pig/20180615/_SUCCESS
-rw-r--r-- 1 root supergroup 432 2018-06-15 15:27 /pig/20180615/part-m-00000
[root@host ~]# hdfs dfs -cat /pig/20180615/part-m-00000
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91
650,null,2017-11-01T17:24:34.000+08:00,null,1,29
The LOAD statement simply loads the data into the specified relation in Apache Pig. To verify that a LOAD statement executed as intended, you use the diagnostic operators. Pig Latin provides four different diagnostic operators:
The DUMP operator runs a Pig Latin statement and displays the results on the screen; it is generally used for debugging, and has already been demonstrated above.
The DESCRIBE operator is used to view the schema of a relation.
grunt> describe b
b: {roleid: int,sex: int}
grunt> describe customer
customer: {roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int}
The EXPLAIN operator displays the logical, physical, and MapReduce execution plans of a relation.
grunt> explain b
2018-06-15 16:08:01,235 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2018-06-15 16:08:01,236 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2018-06-15 16:08:01,237 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for customer: $1, $2, $3, $5
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
b: (Name: LOStore Schema: roleid#453:int,sex#457:int)ColumnPrune:OutputUids=[453, 457]ColumnPrune:InputUids=[453, 457]
|
|---b: (Name: LOForEach Schema: roleid#453:int,sex#457:int)
| |
| (Name: LOGenerate[false,false] Schema: roleid#453:int,sex#457:int)
| | |
| | (Name: Cast Type: int Uid: 453)
| | |
| | |---roleid:(Name: Project Type: bytearray Uid: 453 Input: 0 Column: (*))
| | |
| | (Name: Cast Type: int Uid: 457)
| | |
| | |---sex:(Name: Project Type: bytearray Uid: 457 Input: 1 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: roleid#453:bytearray)
| |
| |---(Name: LOInnerLoad[1] Schema: sex#457:bytearray)
|
|---customer: (Name: LOLoad Schema: roleid#453:bytearray,sex#457:bytearray)ColumnPrune:OutputUids=[453, 457]ColumnPrune:InputUids=[453, 457]ColumnPrune:RequiredColumns=[0, 4]RequiredFields:[0, 4]
#-----------------------------------------------
# Physical Plan:
#-----------------------------------------------
b: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-98
|
|---b: New For Each(false,false)[bag] - scope-97
| |
| Cast[int] - scope-92
| |
| |---Project[bytearray][0] - scope-91
| |
| Cast[int] - scope-95
| |
| |---Project[bytearray][1] - scope-94
|
|---customer: Load(hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003:PigStorage(',')) - scope-90
2018-06-15 16:08:01,239 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2018-06-15 16:08:01,240 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2018-06-15 16:08:01,240 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-99
Map Plan
b: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-98
|
|---b: New For Each(false,false)[bag] - scope-97
| |
| Cast[int] - scope-92
| |
| |---Project[bytearray][0] - scope-91
| |
| Cast[int] - scope-95
| |
| |---Project[bytearray][1] - scope-94
|
|---customer: Load(hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003:PigStorage(',')) - scope-90
--------
Global sort: false
----------------
The ILLUSTRATE operator steps through the execution of a sequence of Pig Latin statements.
grunt> illustrate b
....................................
2018-06-15 16:14:59,291 [main] INFO org.apache.pig.impl.util.SpillableMemoryManager - Selected heap (PS Old Gen) of size 699400192 to monitor. collectionUsageThreshold = 489580128, usageThreshold = 489580128
2018-06-15 16:14:59,292 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2018-06-15 16:14:59,292 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: customer[1,10],customer[-1,-1],b[4,3] C: R:
-----------------------------------------------------------------------------------------------------------------------------------------
| customer | roleid:int | name:chararray | dateid:datetime | addr:chararray | sex:int | level:int |
-----------------------------------------------------------------------------------------------------------------------------------------
| | 600 | null | 2017-11-15T14:50:05.000+08:00 | hunan changsha | 0 | 91 |
-----------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------
| b | roleid:int | sex:int |
----------------------------------------
| | 600 | 0 |
----------------------------------------
In Pig Latin, relation names, field names, and function names are case-sensitive, while parameter names and keywords are not.
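A quick sketch of this rule (relation names and path hypothetical): here a and A are two different relations, while LOAD/load and AS/as are interchangeable.
grunt> a = LOAD 'data.txt' USING PigStorage(',') as (f1:int);
grunt> A = load 'data.txt' using PigStorage(',') AS (f1:int);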
The GROUP operator groups the data in one or more relations, collecting together data that shares the same key.
grunt> dump b
....
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
(600,0)
(650,1)
grunt> group_id =group b by sex;
grunt> describe group_id
group_id: {group: int,b: {(roleid: int,sex: int)}}
grunt> dump group_id
...................................................
2018-06-15 16:27:20,311 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(0,{(600,0),(600,0),(600,0),(600,0)})
(1,{(650,1),(650,1),(650,1),(650,1)})
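A typical next step is to aggregate each bag; a minimal sketch using the built-in COUNT function (the relation name cnt is hypothetical). Given the data above, this should produce (0,4) and (1,4).
grunt> cnt = foreach group_id generate group as sex, COUNT(b) as total;
grunt> dump cnt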
You can group by multiple columns:
grunt> group_idsex =group b by (sex,roleid);
grunt> describe group_idsex
group_idsex: {group: (sex: int,roleid: int),b: {(roleid: int,sex: int)}}
grunt> dump group_idsex
((0,600),{(600,0),(600,0),(600,0),(600,0)})
((1,650),{(650,1),(650,1),(650,1),(650,1)})
You can also place all records into a single group with GROUP ALL:
grunt> group_all =group b ALL;
grunt> describe group_all;
group_all: {group: chararray,b: {(roleid: int,sex: int)}}
grunt> dump group_all;
...........
(all,{(650,1),(600,0),(650,1),(600,0),(650,1),(600,0),(650,1),(600,0)})
The COGROUP operator works the same way as the GROUP operator. The only difference between the two is that GROUP is normally applied to a single relation, while COGROUP is used in statements involving two or more relations; each output tuple holds the group key followed by one bag per input relation.
grunt> distinctcustid =distinct b;
grunt> describe distinctcustid
distinctcustid: {roleid: int,sex: int}
grunt> dump distinctcustid
...........
(600,0)
(650,1)
grunt> cogroup1 =cogroup b by sex,distinctcustid by sex;
grunt> describe cogroup1;
cogroup1: {group: int,b: {(roleid: int,sex: int)},distinctcustid: {(roleid: int,sex: int)}}
grunt> dump cogroup1
(0,{(600,0),(600,0),(600,0),(600,0)},{(600,0)})
(1,{(650,1),(650,1),(650,1),(650,1)},{(650,1)})
grunt> cogroup2 =cogroup customer by sex,distinctcustid by sex;
grunt> describe cogroup2
cogroup2: {group: int,customer: {(roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int)},distinctcustid: {(roleid: int,sex: int)}}
grunt> dump cogroup2
............................
(0,{(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91),(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)},{(600,0)})
(1,{(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29),(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)},{(650,1)})
The JOIN operator combines records from two or more relations. When performing a join we declare one field (or a group of fields) from each relation as the key; when the keys match, the two tuples are matched, otherwise the records are dropped. Joins can be of the following types:
A self-join joins a table with itself, as if it were two relations, by temporarily renaming at least one of them. In Apache Pig we usually perform a self-join by loading the same data several times under different aliases (names), as in the sketch below.
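A minimal sketch of that approach, reusing the file from this session (the aliases cust_a, cust_b and selfjoin are hypothetical):
grunt> cust_a = LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as (roleid:int,name:chararray,dateid:datetime,addr:chararray,sex:int,level:int);
grunt> cust_b = LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as (roleid:int,name:chararray,dateid:datetime,addr:chararray,sex:int,level:int);
grunt> selfjoin = join cust_a by roleid, cust_b by roleid;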
The inner join is used most frequently; it is also called an equijoin. An inner join returns rows when there is a match in both tables. It creates a new relation by combining the column values of two relations (say A and B) based on a join predicate: the query compares each row of A with each row of B to find all pairs of rows that satisfy the predicate, and the column values of each matched pair are combined into a result row.
grunt> join1 =join distinctcustid by roleid,b by roleid;
grunt> describe join1;
join1: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
grunt> dump join1
......
(600,0,600,0)
(600,0,600,0)
(600,0,600,0)
(600,0,600,0)
(650,1,650,1)
(650,1,650,1)
(650,1,650,1)
(650,1,650,1)
A left outer join returns all rows from the left table, even when there is no match in the right table.
grunt> joinleft =join distinctcustid by roleid left,b by roleid;
grunt> describe joinleft
joinleft: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
A right outer join returns all rows from the right table, even when there is no match in the left table.
grunt> joinright =join distinctcustid by roleid right,b by roleid;
grunt> describe joinright
joinright: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
A full outer join returns rows when there is a match in either relation.
grunt> joinfull =join distinctcustid by roleid full,b by roleid;
grunt> describe joinfull
joinfull: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
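With this data every roleid in distinctcustid also appears in b, so all three outer joins return the same 8 matched rows as the inner join above; null-padded rows would appear only for keys present on one side but missing on the other.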
A JOIN can use multiple keys; the keys must appear in the same order on both sides.
grunt> joinbykeys =join distinctcustid by (roleid,sex),b by (roleid,sex);
grunt> describe joinbykeys
joinbykeys: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
The CROSS operator computes the cross (Cartesian) product of two or more relations.
grunt> crosstest =cross distinctcustid,b;
grunt> describe crosstest
crosstest: {distinctcustid::roleid: int,distinctcustid::sex: int,b::roleid: int,b::sex: int}
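dump crosstest is not shown here, but since distinctcustid holds 2 tuples and b holds 8, the cross product pairs every tuple of the first relation with every tuple of the second, yielding 2 × 8 = 16 tuples.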
The Pig Latin UNION operator merges the contents of two relations. To perform a UNION on two relations, their columns and domains must be identical.
grunt> customer1 =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00002' USING PigStorage(',') as ( roleid:Int,name:chararray,dateid:Datetime,addr:chararray,sex:Int,level:Int );
grunt> customer =LOAD 'hdfs://localhost:9000/sqoop/sqoop1/2018060407/part-m-00003' USING PigStorage(',') as ( roleid:Int,name:chararray,dateid:Datetime,addr:chararray,sex:Int,level:Int );
grunt> union1 =union customer1,customer;
grunt> describe union1
union1: {roleid: int,name: chararray,dateid: datetime,addr: chararray,sex: int,level: int}
grunt> dump union1
..............................
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
The SPLIT operator partitions a relation into two or more relations.
grunt> split union1 into customer1 if(sex==1),customer0 if(sex==0);
grunt> dump customer0;
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
grunt> dump customer1
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
grunt> cust1 =distinct customer0;
grunt> dump cust1
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
The FILTER operator selects the required tuples from a relation based on a condition.
grunt> uniondis =distinct union1;
grunt> dump uniondis
.....
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
grunt> filter_level =filter uniondis by (level<50);
grunt> dump filter_level
.................
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
grunt> filter_level2 =filter uniondis by (level>=50);
grunt> dump filter_level2
........................
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
The DISTINCT operator removes redundant (duplicate) tuples from a relation; examples appear above.
The FOREACH operator generates specified data transformations based on the columns of a relation.
grunt> foreach1 =foreach uniondis generate(name,sex,level);
grunt> dump foreach1
...........
((null,0,4))
((null,0,91))
((null,1,29))
Note that wrapping the fields in parentheses produces a single tuple-valued column, so each row above holds one nested tuple; listing the fields without parentheses projects them as separate columns:
grunt> foreach1 =foreach uniondis generate name,sex,level;
grunt> dump foreach1
.............
(null,0,4)
(null,0,91)
(null,1,29)
The ORDER BY operator sorts and displays the contents of a relation based on one or more fields.
grunt> orderby1 =order uniondis by level desc;
grunt> dump orderby1
.....
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
(650,null,2017-11-01T17:24:34.000+08:00,null,1,29)
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
The LIMIT operator returns a limited number of tuples from a relation:
grunt> limit1 =limit uniondis 2;
grunt> dump limit1
......
(400,null,2017-11-15T14:49:56.000+08:00,anhui hefei,0,4)
(600,null,2017-11-15T14:50:05.000+08:00,hunan changsha,0,91)
--------------------------------------------------------------------------------------------
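Fields can also be referenced by position rather than by name: $0 is the first field of the tuple, $4 the fifth, and AS renames the projected columns.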
grunt> B = foreach customer generate $0 as id,$4 as sex,level;
grunt> dump B
.........
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
(600,0,91)
(650,1,29)
----------------------------------------------------------------------------------------------------------
Pig allows you to transform data in many ways. As a starting point, become familiar with these operators:
Use the FILTER operator to work with tuples or rows of data. Use the FOREACH operator to work with columns of data.
Use the GROUP operator to group data in a single relation. Use the COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations.
Use the UNION operator to merge the contents of two or more relations. Use the SPLIT operator to partition the contents of a relation into multiple relations.
Pig Latin provides operators that can help you debug your Pig Latin statements:
Use the DUMP operator to display results to your terminal screen.
Use the DESCRIBE operator to review the schema of a relation.
Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.
Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
Pig provides shortcuts for the frequently used debugging operators (DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE). These shortcuts can be used in the Grunt shell or within Pig scripts. The shortcuts supported by Pig are listed below; a small usage sketch follows the list.
\d alias - shortcut for the DUMP operator. If the alias is omitted, the last defined alias is used.
\de alias - shortcut for the DESCRIBE operator. If the alias is omitted, the last defined alias is used.
\e alias - shortcut for the EXPLAIN operator. If the alias is omitted, the last defined alias is used.
\i alias - shortcut for the ILLUSTRATE operator. If the alias is omitted, the last defined alias is used.
\q - quit the Grunt shell
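A quick usage sketch in a Grunt session, reusing the alias b defined earlier: \de b is equivalent to describe b, \d b to dump b, a bare \d dumps the last defined alias, and \q leaves the shell.
grunt> \de b
grunt> \d b
grunt> \q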