SQL裏面一般都會用Join來鏈接兩個表,作複雜的關聯查詢。好比用戶表和訂單表,能經過join獲得某個用戶購買的產品;或者某個產品被購買的人羣....html
Hive也支持這樣的操做,並且因爲Hive底層運行在hadoop上,所以有不少地方能夠進行優化。好比小表到大表的鏈接操做、小表進行緩存、大表進行避免緩存等等...apache
下面就來看看hive裏面的鏈接操做吧!其實跟SQL仍是差很少的...緩存
首先建立兩個原始數據的文件,這兩個文件分別有三列,第一列是id、第二列是名稱、第三列是另一個表的id。經過第二列能夠明顯的看到兩個表作鏈接查詢的結果:ide
[xingoo@localhost tmp]$ cat aa.txt 1 a 3 2 b 4 3 c 1 [xingoo@localhost tmp]$ cat bb.txt 1 xxx 2 2 yyy 3 3 zzz 5
接下來建立兩個表,須要注意的是表的字段分隔符爲空格,另外一個表能夠直接基於當前的表建立。oop
hive> create table aa > (a string,b string,c string) > row format delimited > fields terminated by ' '; OK Time taken: 0.19 seconds hive> create table bb like aa; OK Time taken: 0.188 seconds
查看兩個表的結構:優化
hive> describe aa; OK a string b string c string Time taken: 0.068 seconds, Fetched: 3 row(s) hive> describe bb; OK a string b string c string Time taken: 0.045 seconds, Fetched: 3 row(s)
下面能夠基於本地的文件,導入數據spa
hive> load data local inpath '/usr/tmp/aa.txt' overwrite into table aa; Loading data to table test.aa OK Time taken: 0.519 seconds hive> load data local inpath '/usr/tmp/bb.txt' overwrite into table bb; Loading data to table test.bb OK Time taken: 0.321 seconds
內鏈接即基於on語句,僅列出表1和表2符合鏈接條件的數據。code
hive> select * from aa a join bb b on a.c=b.a; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20160824161233_f9ecefa2-e5d7-416d-8d90-e191937e7313 Total jobs = 1 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2016-08-24 16:12:44 Starting to launch local task to process map join; maximum memory = 518979584 2016-08-24 16:12:47 Dump the side-table for tag: 0 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable 2016-08-24 16:12:47 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (332 bytes) 2016-08-24 16:12:47 End of local task; Time Taken: 3.425 sec. Execution completed successfully MapredLocal task succeeded Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2016-08-24 16:12:50,222 Stage-3 map = 100%, reduce = 0% Ended Job = job_local944389202_0007 MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1264 HDFS Write: 90 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 3 c 1 1 xxx 2 1 a 3 3 zzz 5 Time taken: 17.083 seconds, Fetched: 2 row(s)
左鏈接是顯示左邊的表的全部數據,若是有右邊表與之對應,則顯示;不然顯示nullorm
ive> select * from aa a left outer join bb b on a.c=b.a; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20160824161637_6d540592-13fd-4f59-a2cf-0a91c0fc9533 Total jobs = 1 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2016-08-24 16:16:48 Starting to launch local task to process map join; maximum memory = 518979584 2016-08-24 16:16:51 Dump the side-table for tag: 1 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable 2016-08-24 16:16:51 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (338 bytes) 2016-08-24 16:16:51 End of local task; Time Taken: 2.634 sec. Execution completed successfully MapredLocal task succeeded Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2016-08-24 16:16:53,843 Stage-3 map = 100%, reduce = 0% Ended Job = job_local1670258961_0008 MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1282 HDFS Write: 90 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 1 a 3 3 zzz 5 2 b 4 NULL NULL NULL 3 c 1 1 xxx 2 Time taken: 16.048 seconds, Fetched: 3 row(s)
相似左鏈接,同理。htm
hive> select * from aa a right outer join bb b on a.c=b.a; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20160824162227_5d0f0090-1a9b-4a3f-9e82-e93c4d180f4b Total jobs = 1 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2016-08-24 16:22:37 Starting to launch local task to process map join; maximum memory = 518979584 2016-08-24 16:22:40 Dump the side-table for tag: 0 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable 2016-08-24 16:22:40 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (332 bytes) 2016-08-24 16:22:40 End of local task; Time Taken: 2.368 sec. Execution completed successfully MapredLocal task succeeded Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2016-08-24 16:22:43,060 Stage-3 map = 100%, reduce = 0% Ended Job = job_local2001415675_0009 MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1306 HDFS Write: 90 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 3 c 1 1 xxx 2 NULL NULL NULL 2 yyy 3 1 a 3 3 zzz 5 Time taken: 15.483 seconds, Fetched: 3 row(s)
至關於表1和表2的數據都顯示,若是沒有對應的數據,則顯示Null.
hive> select * from aa a full outer join bb b on a.c=b.a; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20160824162252_c71b2fae-9768-4b9a-b5ad-c06d7cdb60fb Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Job running in-process (local Hadoop) 2016-08-24 16:22:54,111 Stage-1 map = 100%, reduce = 100% Ended Job = job_local1766586034_0010 MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 4026 HDFS Write: 270 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 3 c 1 1 xxx 2 NULL NULL NULL 2 yyy 3 1 a 3 3 zzz 5 2 b 4 NULL NULL NULL Time taken: 1.689 seconds, Fetched: 4 row(s)
這個比較特殊,SEMI-JOIN僅僅會顯示錶1的數據,即左邊表的數據。可是效率會比左鏈接快,由於他會先拿到表1的數據,而後在表2中查找,只要查找到結果立馬就返回數據。
hive> select * from aa a left semi join bb b on a.c=b.a; WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20160824162327_e7fc72a7-ef91-4d39-83bc-ff8159ea8816 Total jobs = 1 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2016-08-24 16:23:37 Starting to launch local task to process map join; maximum memory = 518979584 2016-08-24 16:23:41 Dump the side-table for tag: 1 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable 2016-08-24 16:23:41 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable (317 bytes) 2016-08-24 16:23:41 End of local task; Time Taken: 3.586 sec. Execution completed successfully MapredLocal task succeeded Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2016-08-24 16:23:43,798 Stage-3 map = 100%, reduce = 0% Ended Job = job_local521961878_0011 MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1366 HDFS Write: 90 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 1 a 3 3 c 1 Time taken: 16.811 seconds, Fetched: 2 row(s)
笛卡爾積會針對表1和表2的每條數據作鏈接...
hive> select * from aa join bb; Warning: Map Join MAPJOIN[9][bigTable=?] in task 'Stage-3:MAPRED' is a cross product WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. Query ID = root_20160824162449_20e4b5ec-768f-48cf-a840-7d9ff360975f Total jobs = 1 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2016-08-24 16:25:00 Starting to launch local task to process map join; maximum memory = 518979584 2016-08-24 16:25:02 Dump the side-table for tag: 0 with group count: 1 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable 2016-08-24 16:25:02 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable (305 bytes) 2016-08-24 16:25:02 End of local task; Time Taken: 2.892 sec. Execution completed successfully MapredLocal task succeeded Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2016-08-24 16:25:05,677 Stage-3 map = 100%, reduce = 0% Ended Job = job_local2068422373_0012 MapReduce Jobs Launched: Stage-Stage-3: HDFS Read: 1390 HDFS Write: 90 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK 1 a 3 1 xxx 2 2 b 4 1 xxx 2 3 c 1 1 xxx 2 1 a 3 2 yyy 3 2 b 4 2 yyy 3 3 c 1 2 yyy 3 1 a 3 3 zzz 5 2 b 4 3 zzz 5 3 c 1 3 zzz 5
上面就是hive中的鏈接查詢,其實與SQL同樣的。