[Hive Study, Part 8] Hive Tuning [Important]

Environment
  VM: VMware 10
  Linux: CentOS-6.5-x86_64
  Client: Xshell4
  FTP: Xftp4
  jdk8
  hadoop-3.1.1
  apache-hive-3.1.1

1. Execution Plan
Core idea: optimize Hive SQL by treating it as the MapReduce program it compiles to.
The following kinds of SQL are not converted into MapReduce jobs (they run as simple fetch tasks):
  - SELECT reading only the table's own columns
  - WHERE filtering only on the table's own columns

EXPLAIN shows the execution plan: EXPLAIN [EXTENDED] query
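A minimal sketch, reusing the psn2 table from the output below (the comments describe the expected plan shapes, not captured output):

  -- reads only psn2's own columns and filters only on them, so the plan
  -- should contain just a Fetch Operator stage (no Map Reduce stage):
  explain select id, name from psn2 where id > 1;
  -- an aggregation, by contrast, compiles to a Map Reduce stage:
  explain select count(*) from psn2;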

hive> explain select count(*) from psn2;
OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: psn2
            Statistics: Num rows: 2 Data size: 7440 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              Statistics: Num rows: 2 Data size: 7440 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count()
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col0 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 2.7 seconds, Fetched: 43 row(s)
hive>

 

hive> explain extended select count(*) from psn2;
OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: psn2
            Statistics: Num rows: 2 Data size: 7440 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Select Operator
              Statistics: Num rows: 2 Data size: 7440 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count()
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  null sort order:
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  tag: -1
                  value expressions: _col0 (type: bigint)
                  auto parallelism: false
      Execution mode: vectorized
      Path -> Alias:
        hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10 [psn2]
        hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20 [psn2]
      Path -> Partition:
        hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10
          Partition
            base file name: age=10
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            partition values:
              age 10
            properties:
              bucket_count -1
              collection.delim -
              column.name.delimiter ,
              columns id,name,likes,address
              columns.comments
              columns.types int:string:array<string>:map<string,string>
              field.delim ,
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              line.delim
              location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=10
              mapkey.delim :
              name default.psn2
              numFiles 1
              numRows 0
              partition_columns age
              partition_columns.types int
              rawDataSize 0
              serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
              serialization.format ,
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 372
              transient_lastDdlTime 1548986286
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                bucketing_version 2
                collection.delim -
                column.name.delimiter ,
                columns id,name,likes,address
                columns.comments
                columns.types int:string:array<string>:map<string,string>
                field.delim ,
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                line.delim
                location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2
                mapkey.delim :
                name default.psn2
                partition_columns age
                partition_columns.types int
                serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
                serialization.format ,
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1548915997
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.psn2
            name: default.psn2
        hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20
          Partition
            base file name: age=20
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            partition values:
              age 20
            properties:
              bucket_count -1
              collection.delim -
              column.name.delimiter ,
              columns id,name,likes,address
              columns.comments
              columns.types int:string:array<string>:map<string,string>
              field.delim ,
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              line.delim
              location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2/age=20
              mapkey.delim :
              name default.psn2
              numFiles 1
              numRows 0
              partition_columns age
              partition_columns.types int
              rawDataSize 0
              serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
              serialization.format ,
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 372
              transient_lastDdlTime 1548986540
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                bucketing_version 2
                collection.delim -
                column.name.delimiter ,
                columns id,name,likes,address
                columns.comments
                columns.types int:string:array<string>:map<string,string>
                field.delim ,
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                line.delim
                location hdfs://PCS102:9820/root/hive_remote/warehouse/psn2
                mapkey.delim :
                name default.psn2
                partition_columns age
                partition_columns.types int
                serialization.ddl struct psn2 { i32 id, string name, list<string> likes, map<string,string> address}
                serialization.format ,
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                transient_lastDdlTime 1548915997
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.psn2
            name: default.psn2
      Truncated Path -> Alias:
        /psn2/age=10 [psn2]
        /psn2/age=20 [psn2]
      Needs Tagging: false
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            GlobalTableId: 0
            directory: hdfs://PCS102:9820/tmp/hive/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_10-58-42_159_2637812497308639143-1/-mr-10001/.hive-staging_hive_2019-02-19_10-58-42_159_2637812497308639143-1/-ext-10002
            NumFilesPerFileSink: 1
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
            Stats Publishing Key Prefix: hdfs://PCS102:9820/tmp/hive/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_10-58-42_159_2637812497308639143-1/-mr-10001/.hive-staging_hive_2019-02-19_10-58-42_159_2637812497308639143-1/-ext-10002/
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                properties:
                  columns _col0
                  columns.types bigint
                  escape.delim \
                  hive.serialization.extend.additional.nesting.levels true
                  serialization.escape.crlf true
                  serialization.format 1
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            TotalFiles: 1
            GatherStats: false
            MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.142 seconds, Fetched: 192 row(s)
hive>

2. Execution Modes
(1) Hive runs in either local mode or cluster mode.
(2) Enable local mode (for queries over small amounts of data):
  set hive.exec.mode.local.auto=true;
Note:
  hive.exec.mode.local.auto.inputbytes.max defaults to 128M.
  It is the maximum input size allowed for local mode; if the input files exceed this value, the query still runs in cluster mode!

hive> set hive.exec.mode.local.auto=true;
hive> select count(*) from psn21;
Automatically selecting local only mode for query
Query ID = root_20190219144810_0bafff9e-1c40-45f6-b687-60c5d13c9f0c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-19 14:48:11,839 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1827024396_0002
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 4702 HDFS Write: 623 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
6
Time taken: 1.376 seconds, Fetched: 1 row(s)
hive> set hive.exec.mode.local.auto=false;
hive> select count(*) from psn21;
Query ID = root_20190219144841_6fd11106-5db1-4335-8b0b-884697b558df
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0013, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0013/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0013
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-02-19 14:48:49,046 Stage-1 map = 0%,  reduce = 0%
2019-02-19 14:48:54,210 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.87 sec
2019-02-19 14:48:59,328 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.28 sec
MapReduce Total cumulative CPU time: 6 seconds 280 msec
Ended Job = job_1548397153910_0013
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.28 sec   HDFS Read: 13976 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 280 msec
OK
_c0
6
Time taken: 18.923 seconds, Fetched: 1 row(s)
hive>

3. Parallel Execution
Enable parallel execution of independent stages with the following parameter (local mode must be turned off):
set hive.exec.parallel=true;

Also relevant: hive.exec.parallel.thread.number, the maximum number of jobs allowed to run in parallel within a single SQL query.

hive> set hive.exec.parallel;
hive.exec.parallel=false
hive> select t1.cnt1,t2.cnt2 from
    > (select count(id) cnt1 from psn21) t1,
    > (select count(name) cnt2 from psn21)t2;
Warning: Map Join MAPJOIN[27][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
Warning: Map Join MAPJOIN[34][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
Warning: Shuffle Join JOIN[14][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
Query ID = root_20190219145608_b4f3d4e9-b858-41be-9ddc-6eccff0ec9d9
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0014, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0014/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-02-19 14:56:15,027 Stage-1 map = 0%,  reduce = 0%
2019-02-19 14:56:20,148 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.85 sec
2019-02-19 14:56:25,277 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.04 sec
MapReduce Total cumulative CPU time: 6 seconds 40 msec
Ended Job = job_1548397153910_0014
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0015, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0015/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0015
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2019-02-19 14:56:37,661 Stage-3 map = 0%,  reduce = 0%
2019-02-19 14:56:42,786 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.8 sec
2019-02-19 14:56:46,875 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 5.92 sec
MapReduce Total cumulative CPU time: 5 seconds 920 msec
Ended Job = job_1548397153910_0015
Stage-7 is selected by condition resolver.
Stage-8 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2019-02-19 14:56:55    Starting to launch local task to process map join;    maximum memory = 239075328
2019-02-19 14:56:56    Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_14-56-08_997_6748376838876035123-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2019-02-19 14:56:56    Uploaded 1 File to: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_14-56-08_997_6748376838876035123-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (278 bytes)
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1548397153910_0016, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0016/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0016
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2019-02-19 14:57:02,370 Stage-4 map = 0%,  reduce = 0%
2019-02-19 14:57:07,478 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 3.07 sec
MapReduce Total cumulative CPU time: 3 seconds 70 msec
Ended Job = job_1548397153910_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.04 sec   HDFS Read: 13637 HDFS Write: 114 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 5.92 sec   HDFS Read: 13641 HDFS Write: 114 SUCCESS
Stage-Stage-4: Map: 1   Cumulative CPU: 3.07 sec   HDFS Read: 6091 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 30 msec
OK
t1.cnt1 t2.cnt2
6    6
Time taken: 59.527 seconds, Fetched: 1 row(s)
hive> set hive.exec.parallel=true;
hive> (select count(name) cnt2 from psn21)t2;
FAILED: ParseException line 1:36 extraneous input 't2' expecting EOF near '<EOF>'
hive> select t1.cnt1,t2.cnt2 from
    > (select count(id) cnt1 from psn21) t1,
    > (select count(name) cnt2 from psn21)t2;
Warning: Map Join MAPJOIN[27][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
Warning: Map Join MAPJOIN[34][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
Warning: Shuffle Join JOIN[14][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
Query ID = root_20190219145918_2f98437b-7070-41a4-905b-4c6a3a160d46
Total jobs = 5
Launching Job 1 out of 5
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0018, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0018/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0018
Starting Job = job_1548397153910_0017, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0017/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0017
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2019-02-19 14:59:25,322 Stage-3 map = 0%,  reduce = 0%
2019-02-19 14:59:29,510 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.0 sec
2019-02-19 14:59:34,612 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 6.25 sec
MapReduce Total cumulative CPU time: 6 seconds 250 msec
Ended Job = job_1548397153910_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-02-19 14:59:46,581 Stage-1 map = 0%,  reduce = 0%
2019-02-19 14:59:51,687 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.74 sec
2019-02-19 14:59:56,796 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.9 sec
MapReduce Total cumulative CPU time: 5 seconds 900 msec
Ended Job = job_1548397153910_0017
Stage-7 is selected by condition resolver.
Stage-8 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2019-02-19 15:00:07    Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_14-59-18_586_8660726948780795909-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile21--.hashtable
2019-02-19 15:00:07    Uploaded 1 File to: file:/tmp/root/6f8ff71f-87bd-4d46-9f9a-516708d65459/hive_2019-02-19_14-59-18_586_8660726948780795909-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile21--.hashtable (278 bytes)
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1548397153910_0019, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0019/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job  -kill job_1548397153910_0019
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2019-02-19 15:00:16,324 Stage-4 map = 0%,  reduce = 0%
2019-02-19 15:00:20,426 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 2.95 sec
MapReduce Total cumulative CPU time: 2 seconds 950 msec
Ended Job = job_1548397153910_0019
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 6.25 sec   HDFS Read: 13641 HDFS Write: 114 SUCCESS
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.9 sec   HDFS Read: 13637 HDFS Write: 114 SUCCESS
Stage-Stage-4: Map: 1   Cumulative CPU: 2.95 sec   HDFS Read: 6091 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 100 msec
OK
t1.cnt1 t2.cnt2
6    6
Time taken: 64.206 seconds, Fetched: 1 row(s)
hive>

4. Strict Mode
Enable strict mode with:
set hive.mapred.mode=strict;
(the default is nonstrict)

Strict mode places the following restrictions on queries:
1) For partitioned tables, the WHERE clause must filter on a partition column;

hive> set hive.mapred.mode=nonstrict;
hive> select * from psn22;
OK
psn22.id  psn22.name  psn22.likes  psn22.address  psn22.age  psn22.sex
1    小明1    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    18    boy
2    小明2    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    20    man
5    小明5    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    21    boy
3    小明3    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    21    boy
6    小明6    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    21    man
4    小明4    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    21    man
Time taken: 0.186 seconds, Fetched: 6 row(s)
hive> set hive.mapred.mode=strict;
hive> select * from psn22;
FAILED: SemanticException [Error 10056]: Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.no.partition.filter to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features. No partition predicate for Alias "psn22" Table "psn22"
hive> select * from psn22 where age=18 and sex='boy';
OK
psn22.id  psn22.name  psn22.likes  psn22.address  psn22.age  psn22.sex
1    小明1    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}    18    boy
Time taken: 0.282 seconds, Fetched: 1 row(s)
hive>

2) An ORDER BY statement must include a LIMIT clause;

hive> set hive.mapred.mode=strict;
hive> select * from psn21 order by id;
FAILED: SemanticException 1:29 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'id'
hive> select * from psn21 order by id limit 2;
Automatically selecting local only mode for query
Query ID = root_20190219143842_b465a76f-a890-4bdc-aa76-b713c3ea13c0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2019-02-19 14:38:43,896 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1585589360_0001
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 1696 HDFS Write: 261 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
psn21.id  psn21.name  psn21.age  psn21.sex  psn21.likes  psn21.address
1    小明1    18    boy    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}
2    小明2    20    man    ["lol","book","movie"]    {"beijing":"shangxuetang","shanghai":"pudong"}
Time taken: 1.89 seconds, Fetched: 2 row(s)
hive>

3) Queries that perform a Cartesian product (a join without a join condition) are rejected.
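For example, a minimal sketch of restriction 3, reusing psn21 from above (not captured output):

  -- a join without an ON condition is a Cartesian product and is rejected in strict mode:
  select a.id, b.id from psn21 a, psn21 b;
  -- adding an explicit join key avoids the restriction:
  select a.id, b.id from psn21 a join psn21 b on a.id = b.id;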

 

5. Hive Sorting
- Order By: global sort over the entire result set, handled by a single reducer
  (use with caution on large data sets; in strict mode it must be combined with LIMIT)
- Sort By: sorts the data within each individual reducer
- Distribute By: controls which reducer each row goes to; often combined with Sort By
- Cluster By: equivalent to Distribute By + Sort By on the same column
  (Cluster By cannot take asc/desc; to control the direction, use "distribute by column sort by column asc|desc" instead, as in the sketch below)
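A minimal sketch of the four variants, reusing psn21 from above (the reducer count is illustrative):

  set mapreduce.job.reduces=3;
  -- global order, forced through a single reducer:
  select id, name, age from psn21 order by age limit 10;
  -- each of the 3 reducers sorts only its own slice (no global order):
  select id, name, age from psn21 sort by age;
  -- rows with the same sex land on the same reducer and are sorted by age within it:
  select id, name, age, sex from psn21 distribute by sex sort by age desc;
  -- shorthand for: distribute by age sort by age (ascending only):
  select id, name, age from psn21 cluster by age;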


6. Hive Join (eliminating the MapReduce shuffle)
When writing a join, put the small table (the driving table) on the left side of the JOIN.
Map Join: the join is completed entirely on the map side.
There are two ways to get a map join:
1) SQL hint: add a MapJoin marker (mapjoin hint) to the statement
Syntax:
SELECT /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value
FROM smallTable JOIN bigTable ON smallTable.key = bigTable.key;
2) Automatic map join
Enable automatic conversion with:
set hive.auto.convert.join = true;
(when true, Hive checks the size of the left-hand table; if it qualifies as a small table, it is loaded into memory and a map join is used)

Related parameters (see the configuration sketch below):
hive.mapjoin.smalltable.filesize
(the threshold separating big from small tables; a table smaller than this value is loaded into memory)
hive.ignore.mapjoin.hint
(default: true; whether to ignore the mapjoin hint)
hive.auto.convert.join.noconditionaltask
(default: true; when converting ordinary joins to map joins, whether to merge multiple map joins into a single one)
hive.auto.convert.join.noconditionaltask.size
(the maximum combined table size allowed when merging multiple map joins into one)
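A minimal configuration sketch (the sizes are illustrative, not tuned recommendations):

  set hive.auto.convert.join=true;
  -- treat tables under ~25 MB as small enough to build into an in-memory hash table:
  set hive.mapjoin.smalltable.filesize=25000000;
  -- merge several map joins into one task when the combined small-table size fits:
  set hive.auto.convert.join.noconditionaltask=true;
  set hive.auto.convert.join.noconditionaltask.size=10000000;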


7. Map-Side Aggregation (analogous to the combiner in MapReduce)
Enable aggregation on the map side with:
set hive.map.aggr=true;

Related parameters (see the sketch below):
hive.groupby.mapaggr.checkinterval:
number of rows processed per check when GROUP BY aggregates on the map side (default: 100000)
hive.map.aggr.hash.min.reduction:
minimum aggregation ratio (the first 100000 rows are pre-aggregated; if (rows after aggregation) / 100000 exceeds this value, default 0.5, map-side aggregation is abandoned)
hive.map.aggr.hash.percentmemory:
maximum fraction of memory that map-side aggregation may use
hive.map.aggr.hash.force.flush.memory.threshold:
maximum usable memory of the hash table used for map-side aggregation; exceeding it triggers a flush
hive.groupby.skewindata:
whether to optimize GROUP BY against data skew (default: false)
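A minimal sketch, reusing psn21 from above (the numeric values are the defaults described above):

  set hive.map.aggr=true;
  -- pre-aggregate the first 100000 rows on the map side; keep aggregating only if
  -- (rows after aggregation) / 100000 stays at or below 0.5:
  set hive.groupby.mapaggr.checkinterval=100000;
  set hive.map.aggr.hash.min.reduction=0.5;
  -- two-stage job that first spreads skewed groups randomly across reducers:
  set hive.groupby.skewindata=true;
  select sex, count(*) from psn21 group by sex;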


8. Controlling the Number of Map and Reduce Tasks in Hive
Map-related parameters:
mapred.max.split.size
maximum size of one split, i.e. the maximum amount of file data each map task processes
mapred.min.split.size.per.node
minimum split size on one node
mapred.min.split.size.per.rack
minimum split size on one rack

Reduce-related parameters (see the sketch below):
mapred.reduce.tasks
force a fixed number of reduce tasks
hive.exec.reducers.bytes.per.reducer
amount of data each reduce task processes
hive.exec.reducers.max
maximum number of reduce tasks per job
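A minimal configuration sketch (sizes in bytes; the values are illustrative):

  -- cap each map task at roughly 256 MB of input:
  set mapred.max.split.size=256000000;
  set mapred.min.split.size.per.node=128000000;
  set mapred.min.split.size.per.rack=128000000;
  -- either let Hive derive the reducer count from data volume:
  set hive.exec.reducers.bytes.per.reducer=256000000;
  set hive.exec.reducers.max=128;
  -- or force a fixed number (this overrides the derivation above):
  set mapred.reduce.tasks=8;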

9. Hive JVM Reuse
When it helps:
1) a large number of small files
2) a large number of tasks

Enable it with set mapred.job.reuse.jvm.num.tasks=n; (n is the number of task slots). Drawback: once enabled, the occupied task slots hold their resources whether or not any task is running, and they are only released when all tasks of the entire job have finished!
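A minimal sketch (n=10 is illustrative):

  -- let each JVM run up to 10 tasks before it is torn down:
  set mapred.job.reuse.jvm.num.tasks=10;
  -- revert to the default of one JVM per task:
  set mapred.job.reuse.jvm.num.tasks=1;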
