Programming Hive: Notes, Part 1

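Note: some material from "Practical Hive: A Guide to Hadoop's Data Warehouse System" (referred to below as Practical Hive) is also included.
Chapter 2: Basic Operations
2.7 Command-line interface (pay close attention to which commands are typed at the shell and which are typed at the hive> prompt; lines annotated "shell" are run from the shell, everything else is run inside Hive)
2.7.1 CLI options
hive --help --service cli    # shell
2.7.2 Variables and properties
There are four namespaces: hivevar (user-defined variables), hiveconf (Hive configuration properties), system (configuration properties defined by Java), and env (environment variables defined by the shell).
set env:HADOOP_HOME; -- shows an environment variable, here HADOOP_HOME (HIVE_HOME works the same way); the env prefix is the namespace
set with no arguments lists the variables in all four namespaces; set -v also lists the Hadoop properties.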

hive --define foo=bar    # shell
        hive> set foo;              -- show foo
              set hivevar:foo;      -- same value: --define puts foo in the hivevar namespace
              set hivevar:foo=bar2; -- change the value of foo
              set hivevar:foo;
            Usage: create table toss1(i int, ${hivevar:foo} string);  create table toss2(i int, ${foo} string);
            set hivevar:foo=900920;
            select  * from stock_basic  where    stock_id=${hivevar:foo}  limit 100;

        Show the current database name in the prompt
            hive --hiveconf hive.cli.print.current.db=true    # shell
            or, inside Hive:  set hiveconf:hive.cli.print.current.db=true;

            hive --hiveconf y=900920    # shell;  or, inside Hive:  set hiveconf:y=900920;
            select  * from stock_basic  where    stock_id=${hiveconf:y}  limit 100;

        system variables can be read and written, but env variables are read-only
            set system:user.name;
            set system:user.name=hadoop;  # works

            set env:HOME;
            set env:HOME=/home; # fails with an error
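
The ${...} references above behave like plain text replacement that happens before the statement is parsed. The Python sketch below only illustrates that idea: the function `substitute` and the `namespaces` dictionary are hypothetical and not part of Hive, the values are made up, and it assumes that a bare ${name} is looked up in the hivevar namespace, as the toss2 example suggests.

```python
import re

# Hypothetical model of Hive's four variable namespaces (not a Hive API).
namespaces = {
    "hivevar": {"foo": "900920"},
    "hiveconf": {"y": "900920"},
    "system": {"user.name": "hadoop"},
    "env": {"HOME": "/home/hadoop"},
}

PATTERN = re.compile(r"\$\{(?:(hivevar|hiveconf|system|env):)?([^}]+)\}")


def substitute(statement: str) -> str:
    """Replace ${ns:name} references with their values as plain text."""
    def lookup(match: re.Match) -> str:
        # Assumption: a reference without a prefix is looked up in hivevar.
        namespace = match.group(1) or "hivevar"
        name = match.group(2)
        # Unknown variables are left untouched in this sketch.
        return namespaces[namespace].get(name, match.group(0))
    return PATTERN.sub(lookup, statement)


query = "select * from stock_basic where stock_id=${hivevar:foo} limit 100"
print(substitute(query))
```

Because the replacement is textual, the same variable can stand in for a value, a column name, or a path, which is what the toss1 and toss2 statements rely on.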

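    2.7.3 One-shot commands: run one or more statements, then exit Hive immediately

        hive -e   " select  * from stock_basic  where    stock_id=900920  limit 100";  # shell
        The output can also be redirected:  hive -e   " select  * from stock_basic  where    stock_id=900920  limit 100" > /tmp/stock_id.txt
        Usage: look up a property whose exact name you only half remember
            hive  -S -e   "set" | grep warehouse  # -S is silent mode, which drops lines such as "OK" and "Time taken"; here it made no visible difference, because those lines did not show up in the output without it either
    2.7.4 Running HiveQL from a script
        hive -f    /tmp/hsql.hql  # shell
        or, inside Hive:  source  /tmp/hsql.hql;

    2.7.5 The .hiverc file
    Create a .hiverc file under $HOME. Every time the CLI starts, it first runs the statements in .hiverc, so putting system variables and other property settings there makes them apply by default.
    2.7.6 Tab auto-completion
    2.7.7 Command history: Hive records entered lines in $HOME/.hivehistory; the up and down arrow keys also scroll through them
    2.7.8 Running shell commands:  ! pwd;
    2.7.9 Running Hadoop dfs commands:  dfs -ls /;
    2.7.10 Comments:  --
    2.7.11 Printing column headers in query output
        set hive.cli.print.header=true;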

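Chapter 3: Data Types and File Formats

3.1 Primitive data types
    Hive's primitive types are implemented in Java, so their detailed behavior matches the corresponding Java types.
3.2 Collection data types: struct, map, array
3.3 Text file encoding (delimiters): Ctrl+A (between fields), Ctrl+B (between elements of an array or struct, and between map entries), Ctrl+C (between a map key and its value), and the newline \n, which separates rows.
3.4 Schema on read: understood, not repeated here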

Chapter 4: Data Definition
4.1 Databases in Hive
    List all databases:  show databases;     or filter by pattern:
    show databases like  'financials*';
    Create a database
    create  database if not exists  financials4  comment 'all financials  table   '   
    location  'hdfs://master:9000/hive_dir/financials4.db'
    with dbproperties('creator'='zhangyt','date'='2020-10-08');     
    # hdfs://master:9000 is the default filesystem configured for Hadoop (see core-site.xml); hive_dir is the directory Hive is configured to use on HDFS (see hive-site.xml)

    describe  database  financials2;  or  describe  database  extended  financials2;    # extended shows more detail, notably the dbproperties set when the database was created

    Use a database:   use  financials;
    Drop a database:  drop   database financials2;  # fine when the database has no tables     drop   database  financials2 cascade;   # cascade is required when it contains tables
4.2 Altering a database: only the dbproperties can be changed; the other metadata cannot
    alter database financials3 set dbproperties('edited-by'='zhangyt1');
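4.3 Creating tables
    create table  stand_table_1  like  stand_table;  # the table definition is copied: if the source table is partitioned, the new table is partitioned too
    create [external] table  [if not exists ]stand_table(stand_a string comment 'column a',stand_b int  comment 'column b')  
    comment  'this is a stand table '
    row format delimited    fields  terminated  by   '\t'
    lines terminated by '\n' 
    stored as  textfile
    location '/hive_dir/stand_table';
    # row format delimited must come before the other delimiter clauses (fields terminated by, lines terminated by)
    # location may point anywhere; if it is omitted, the table goes under the warehouse path configured in hive-site.xml
    # when both stored as textfile and location are present, stored as textfile must come before location
    # I could not get tblproperties('creator'='zhangyt','date'='2020-10-08') accepted here; in Hive's create table grammar tblproperties comes after location, so it most likely has to be the last clause

    List tables:  show tables;  or  show tables  in   financials2;  # the name after in is a database     show tables 'stock*';  # pattern match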
    describe      stand_table;    
    describe   extended  stand_table;

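    load   data  [local]   inpath '/tmp/hive_txt/ss.txt'   [overwrite] into  table ss_l;  
    load data    inpath '/input/ss.txt'   overwrite into  table ss;   
    # without local, the path is an HDFS path. This is a move: the file leaves its original directory and ends up in the table's directory under Hive

    load   data  local   inpath '/tmp/hive_txt/ss.txt'   overwrite into  table ss_l;  
    # with local, the path is on the local filesystem. The file is copied, so the local file is still there afterwards

    The difference in whether the source file survives comes from the fact that a file on HDFS already lives in the distributed filesystem, so there is no need to keep a second copy of it.
4.4 Partitioned tables and managed tables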
    create table   if not exists  stock_basic_partition(
    stock_name  string  comment 'stock name'  
    ,stock_date  string   comment 'stock date'
    ,stock_start_price  DECIMAL(15,3)   comment 'opening price'
    ,stock_max_price   DECIMAL(15,3)   comment 'highest price'
    ,stock_min_price   DECIMAL(15,3)   comment 'lowest price'
    ,stock_end_price   DECIMAL(15,3)   comment 'closing price'
    ,stock_volume   DECIMAL(15,3)   comment 'trading volume'
    ,stock_amount   DECIMAL(15,3)   comment 'turnover')  
    comment  'stock basic information'
    partitioned by   (stock_id  string )
    row format delimited    fields  terminated  by   ','
    lines terminated by '\n' 
    stored as  textfile
    location '/hive_dir/stock_basic_partition';
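    Note from Practical Hive: partitioning can improve the performance of certain queries, because partitions that cannot contribute to the result are pruned. This is called partition elimination.
    Guidelines for choosing partitions (not fully understood yet):
    Pick a column as the partition key whose number of distinct values is low to medium.
    Avoid partitions smaller than 1 GB.
    When there are many partitions, tune the memory of HiveServer2 and the Hive Metastore.
    With a multi-column partition key, a nested tree of subdirectories is created for every combination of key values. Avoid deep nesting, because it leads to too many partitions.
    When inserting data through Hive streaming, several sessions writing to the same partition can cause lock contention. For streaming see 6.5, Hive streaming: the Hive streaming API is mainly used as a Hive bolt together with Storm. Not needed so far.

    # the table comment must come before partitioned by (stock_id string)

    Show partitions:  show partitions  stock_basic_partition;       describe    extended  stock_basic_partition;
    # mind the format: the partition column must come last in the select list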
    insert  overwrite  table    stock_basic_partition partition ( stock_id ) select  
     stock_name                
    ,stock_date                
    ,stock_start_price         
    ,stock_max_price           
    ,stock_min_price           
    ,stock_end_price           
    ,stock_volume              
    ,stock_amount 
    ,stock_id       
    from  stock_basic  
    limit  200000;
    This only works in non-strict mode.

    load data  local inpath   '/tmp/stock_id.txt'    into   table   stock_basic_partition   partition (stock_id='900920');

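    Strict mode:  set hive.exec.dynamic.partition.mode=strict;      non-strict mode:  set   hive.exec.dynamic.partition.mode=nonstrict;
    In strict mode a dynamic-partition insert must name at least one static partition; non-strict mode lets every partition column be dynamic. This setting does not force queries to filter on a partition, which is why a partitioned table can still be queried without specifying one in strict mode; that restriction belongs to hive.mapred.mode=strict.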
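
Each partition of a partitioned table is stored in its own subdirectory, by default named key=value under the table location. The Python sketch below models how a dynamic-partition insert groups rows by the trailing partition column. The function `partition_rows` and the sample rows are made up for illustration, and the directory naming is the default layout rather than something these notes demonstrate.

```python
from collections import defaultdict

TABLE_LOCATION = "/hive_dir/stock_basic_partition"  # location used in the DDL above


def partition_rows(rows, partition_key="stock_id"):
    """Group rows by the last column, which plays the role of the partition column."""
    partitions = defaultdict(list)
    for *data_columns, partition_value in rows:
        # The partition value is encoded in the directory name, not stored in the file.
        directory = f"{TABLE_LOCATION}/{partition_key}={partition_value}"
        partitions[directory].append(tuple(data_columns))
    return dict(partitions)


# Made-up sample rows: (stock_name, stock_date, stock_id), partition column last.
rows = [
    ("AAA", "2020-10-08", "900920"),
    ("AAA", "2020-10-09", "900920"),
    ("BBB", "2020-10-08", "600908"),
]

for directory, data in sorted(partition_rows(rows).items()):
    print(directory, len(data))
```

This is also why the partition column has to come last in the select list: the trailing value decides the directory, and only the remaining columns are written to the data files.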
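4.5 Dropping a table:  drop table  stock_basic_partition;  For an external table only the metadata is removed; the data files stay in place
4.6 Altering tables
Important: alter table changes only the metadata. The underlying data is not modified, so the data files have to be changed to match by hand.
    Rename a table:  alter   table   stock_basic_test  rename  to  stock_basic_test1;
    Add partitions:  alter  table stock_basic_partition add  if not exists   partition (stock_id='00000001') location '/hive_dir/stock_basic_partition/00000001' 
              partition (stock_id='03000001') location '/hive_dir/stock_basic_partition/03000001';
    Drop a partition (so far I have only managed to drop one partition per statement)
    alter  table stock_basic_partition drop   if   exists   partition (stock_id='00000001') ;
    Change a column's position: the original attempt failed with an error. One cause is that CHANGE COLUMN needs the column name twice (old name, then new name), as written below
    alter  table  stock_basic_test  CHANGE COLUMN stock_amount  stock_amount  DECIMAL(15,3) comment  'turnover' AFTER  stock_id ;

    Add columns
    alter  table  stock_basic_test add  COLUMNS (  stock_other string  COMMENT 'other information');
    Drop or replace columns
    The statement below removes stock_volume: list only the columns to keep, and make sure that once the later columns shift forward their types still match the data. The column's values also have to be removed from the data files.
    In short, dropping a column is rarely worth it; too much trouble.
    alter  table   stock_basic_test replace  columns (
    stock_id string 
    ,stock_name  string  comment 'stock name'  
    ,stock_date  string   comment 'stock date'
    ,stock_start_price  DECIMAL(15,3)   comment 'opening price'
    ,stock_max_price   DECIMAL(15,3)   comment 'highest price'
    ,stock_min_price   DECIMAL(15,3)   comment 'lowest price'
    ,stock_end_price   DECIMAL(15,3)   comment 'closing price'
     ,stock_amount   DECIMAL(15,3)   comment 'turnover');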

    create table  stock_basic_test as select  * from  stock_basic  limit  2000;

    Alter table properties
    alter  table    stock_basic_test set tblproperties('notes'='hahaha');
    Addendum: from Practical Hive, section 4.4.6 on table properties.
    Important table properties: last_modified_user, last_modified_time, immutable, orc.compress, skip.header.line.count
    1. immutable: when this property is true, data cannot be inserted into the table unless the table is empty.
        create table  stock_basic_test_immutable  like   stock_basic   ; 
        insert into  stock_basic_test_immutable select * from stock_basic limit 100;
        alter  table    stock_basic_test_immutable set tblproperties('immutable'='true');
        ## try another insert
        insert into  stock_basic_test_immutable select * from stock_basic limit 100;  ## fails
        alter  table    stock_basic_test_immutable set tblproperties('immutable'='false');
        insert into  stock_basic_test_immutable select * from stock_basic limit 100;  ## succeeds
    2. skip.header.line.count
        create table  stock_basic_test_skip(a string ,b string) row format delimited    fields  terminated  by   ',' lines terminated by '\n'  ; 
        alter  table    stock_basic_test_skip set tblproperties('skip.header.line.count'='1');
        load  data  local     inpath   '/tmp/stock_basic_test_skip.txt'  overwrite  into table     stock_basic_test_skip;   # tested, works
    skip.header.line.count skips the header lines of the data files; it is one of the important features for Hive external tables

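    Alter storage properties
    alter  table  stock_basic_test  set  fileformat sequencefile ;  # sequencefile can be replaced with textfile
    alter  table  stock_basic_partition  partition (stock_id='600909' )set  fileformat sequencefile ; 

        4.6.8 Miscellaneous alter table statements
        Hooks: did not understand them (skipped)
        Archive the files of a partition (this works only on partitions) into a Hadoop archive (HAR file). This lowers the file count and so the load on the NameNode, but it does not compress anything, so no storage space is saved
        Prerequisite: enable archiving with  set hive.archive.enabled=true; 
        If it fails with java.lang.NoClassDefFoundError: org/apache/hadoop/tools/HadoopArchives 
        copy hadoop-archives-3.1.2.jar from Hadoop's lib directory into Hive's lib directory  
        alter  table   stock_basic_partition  archive  partition ( stock_id ='600908');
        and the reverse
        alter  table   stock_basic_partition  unarchive  partition ( stock_id ='600908');
        After archiving, the HDFS web UI shows how the partition's files are now stored under the Hive directory, which is interesting to look at

        Protect a partition from being dropped or queried. Both statements fail with an error, and searching (Baidu) turned up no fix; revisit later. A likely cause is that the NO_DROP and OFFLINE protections were removed in Hive 2.0.0
        alter  table  stock_basic_partition partition( stock_id ='600908')enable no_drop;

        alter  table  stock_basic_partition partition ( stock_id ='600908')  enable   offline ;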

Chapter 5: Data Manipulation
5.1 Loading data into managed tables
Note how the variable is used here
hive -e " select * from stock_basic where stock_id=900920 " > /tmp/stock_id_equal_900920.txt # shell: generate the file first
set hiveconf:loc_txt=/tmp/stock_id_equal_900920.txt; # do not quote the value here
load data local inpath '${hiveconf:loc_txt}' overwrite into table stock_basic_partition partition (stock_id='900920');

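Note: without overwrite, if the new file has the same name as a file already in the table, a numbered suffix is added, e.g. stock_id_equal_900920.txt   stock_id_equal_900920_copy_1.txt
5.2 Inserting data into a table from a query
    insert into  table  stock_basic_partition 
    partition( stock_id=900922)
    select * from  stock_basic where   stock_id=900922; 
    # This is wrong, take special note: select * also returns stock_id, so the select list has one column more than the table's non-partition columns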

    insert into  table  stock_basic_partition 
    partition( stock_id=900922)
    select 
    stock_name   
    ,stock_date   
    ,stock_start_price
    ,stock_max_price  
    ,stock_min_price  
    ,stock_end_price  
    ,stock_volume    
    ,stock_amount   
    from  stock_basic where   stock_id=900922;
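    # This confirms a Huawei exam question: with a static partition, the select list has fewer columns than the partitioned table, and the missing one is the partition column
    Dynamic partition insert: writing one statement per partition as above is far too slow
    set hive.exec.dynamic.partition.mode=nonstrict;  # switch to non-strict mode
    # mind the format: the partition column must come last in the select list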
    insert  overwrite  table    stock_basic_partition partition ( stock_id ) select  
     stock_name                
    ,stock_date                
    ,stock_start_price         
    ,stock_max_price           
    ,stock_min_price           
    ,stock_end_price           
    ,stock_volume              
    ,stock_amount 
    ,stock_id       
    from  stock_basic     ;
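    This only works in non-strict mode.

    Other dynamic partition properties (look them up as needed)
    hive.exec.dynamic.partition   
    hive.exec.dynamic.partition.mode 
5.3 Creating a table and loading data in a single query (CTAS)
     create table ... as select ... from ...
     Practical Hive says the target of a CTAS cannot be an external, partitioned, or bucketed table. Verdict: correct. Even though the source table is partitioned, the target table is not; this differs from create table ... like
    create  table stock_basic_partition_ctas  as select  * from  stock_basic_partition;
5.4 Exporting data (note: the paths here are on HDFS, not local)
    If the files are already in a suitable format, copy them directly
    hadoop fs -cp  source_path  target_path  
    hadoop fs -cp  hdfs://master:9000/hive_dir/date_stock  /tmp/data_from_hive   # shell
    # did not work for me. Note that hadoop fs -cp copies within HDFS, so /tmp/data_from_hive is an HDFS path; hadoop fs -get would copy to the local filesystem
    insert overwrite local  directory  '/tmp/data_from_hive' select  * from  date_stock;  # inside Hive; the directory this produces looks odd

    As with multi-table inserts, the output can also go to several directories in one pass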
    from  stock_basic_partition  sp
    insert overwrite  directory  '/tmp/600896'
        select  * where  stock_id=600896
    insert overwrite  directory  '/tmp/600897'
        select  * where  stock_id=600897
    insert overwrite  directory  '/tmp/600898'
        select  * where  stock_id=600898;

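Chapter 6: HiveQL Queries
6.1 select ... from
6.1.1 Specifying columns with regular expressions: of marginal use; it is meant for columns whose names follow a pattern
select a, `b.*` from aa;
6.1.4 Using functions
Aggregate functions: set hive.map.aggr=true; performs aggregation in the map phase, at the cost of more memory
6.1.9 When Hive can avoid MapReduce
Most queries trigger a MapReduce job on the cluster, which is often unnecessary. With the setting below, Hive tries to run suitable queries in local mode instead:
set hive.exec.mode.local.auto=true;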
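6.2 where
6.2.2 Comparing floating-point numbers
0.2 cannot be represented exactly in binary. Roughly speaking, a FLOAT 0.2 is stored as 0.2000001, which becomes 0.200000100000 when widened to DOUBLE, while a DOUBLE 0.2 is stored as 0.200000000001.
So if stock_amount is a FLOAT column, select * from stock_basic_partition where stock_amount>0.2 can return rows whose value is 0.2, because the literal 0.2 is a DOUBLE.
It should be rewritten as select * from stock_basic_partition where stock_amount> cast(0.2 as float)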
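
The FLOAT versus DOUBLE mismatch can be reproduced outside Hive. The following Python sketch is only an illustration of the rounding effect, not of Hive's own code: it rounds 0.2 to 32-bit precision with the standard struct module (Python floats are 64-bit doubles) and compares the result with the 64-bit value. The helper name `as_float32` is made up for this example.

```python
import struct


def as_float32(value: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


stored = as_float32(0.2)   # what a FLOAT column holding 0.2 contains
literal = 0.2              # a literal 0.2 is treated as a DOUBLE

print(stored > literal)              # the widened FLOAT is slightly larger
print(stored > as_float32(literal))  # rounding the literal to FLOAT removes the gap
```

The second comparison corresponds to cast(0.2 as float): once both sides are rounded the same way, a stored 0.2 is no longer greater than the literal.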
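6.4 join
Hive supports only equi-joins, and the on clause cannot contain or. An extra condition such as on a.a=b.a and a.b>b.b is allowed, though.
Put the largest table last (rightmost) in the join: Hive processes the tables from left to right, buffering the earlier tables and streaming the last one.
left semi join is an optimized form of an in or exists subquery.
mapjoin does not support right join or full join.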
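
To make the difference between left semi join and an ordinary join concrete, here is a small Python sketch of the semantics only; it says nothing about how Hive executes the join. The functions `left_semi_join` and `inner_join` and the sample rows are made up for illustration.

```python
def left_semi_join(left_rows, right_rows, key):
    """Return each left row once if its key appears anywhere in the right rows."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]


def inner_join(left_rows, right_rows, key):
    """Plain inner join, for comparison: one output row per matching pair."""
    return [
        {**left, **right}
        for left in left_rows
        for right in right_rows
        if left[key] == right[key]
    ]


stocks = [{"stock_id": "600896"}, {"stock_id": "600897"}]
trades = [
    {"stock_id": "600896", "price": 1},
    {"stock_id": "600896", "price": 2},
]

print(left_semi_join(stocks, trades, "stock_id"))
print(len(inner_join(stocks, trades, "stock_id")))
```

The semi join returns only left-side columns and never duplicates a left row, which is exactly what an in or exists subquery asks for.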
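6.5 order by and sort by
order by is a global sort: all rows pass through a single reducer. sort by sorts only within each reducer.
6.6 distribute by with sort by
distribute by controls how the map output is divided among the reducers. By default the key-value pairs emitted by the mappers are spread evenly over the reducers,
so with sort by alone the reducers' outputs can overlap. distribute by guarantees that rows with the same key go to the same reducer.
For example, distribute by stock_id sort by stock_id,stock_amount keeps all rows of one stock ID together.
select * from stock_basic_partition distribute by stock_id sort by stock_id,stock_amount;
Note: distribute by must come before sort by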
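
The combined effect of distribute by and sort by can be simulated in a few lines of Python. This is a sketch under stated assumptions: `distribute_and_sort` is a made-up helper, zlib.crc32 stands in for whatever hash the real partitioner uses, and the rows are invented sample data.

```python
from collections import defaultdict
import zlib


def distribute_and_sort(rows, num_reducers, distribute_key, sort_keys):
    """Send rows with the same distribute key to one reducer, then sort per reducer."""
    reducers = defaultdict(list)
    for row in rows:
        # crc32 stands in for the real partitioner hash (an assumption of this sketch).
        bucket = zlib.crc32(str(row[distribute_key]).encode()) % num_reducers
        reducers[bucket].append(row)
    return {
        bucket: sorted(part, key=lambda r: tuple(r[k] for k in sort_keys))
        for bucket, part in reducers.items()
    }


rows = [
    {"stock_id": "600897", "stock_amount": 30},
    {"stock_id": "600896", "stock_amount": 20},
    {"stock_id": "600897", "stock_amount": 10},
    {"stock_id": "600896", "stock_amount": 5},
]

result = distribute_and_sort(rows, 3, "stock_id", ["stock_id", "stock_amount"])
for bucket, part in sorted(result.items()):
    print(bucket, part)
```

Each reducer's output is sorted on its own, and no stock_id is split across reducers, but there is no global order across reducers; that is what order by adds at the cost of a single reducer.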
6.7 cluster by
When the sort by and distribute by columns in 6.6 are the same, cluster by can be used instead
select * from stock_basic_partition cluster by stock_id;
6.9 Sampling queries (not fully understood yet)
Random sampling with rand()
select * from stock_basic_partition tablesample( bucket 3 out of 10 on rand());
Sampling by bucketing on a column
select * from stock_basic_partition tablesample( bucket 3 out of 51 on stock_id );
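
Bucket sampling can be read as: hash the chosen expression into y buckets numbered from 1, and keep the rows that land in bucket x. The Python sketch below illustrates that reading only. The function `tablesample` is hypothetical, zlib.crc32 is a stand-in for Hive's own hash function, and the stock IDs are generated sample data.

```python
import zlib


def tablesample(rows, x, y, column):
    """Keep rows that fall into bucket x when hashing `column` into y buckets (1-based)."""
    def bucket_of(row):
        # crc32 is a stand-in hash; Hive uses its own hash function.
        return zlib.crc32(str(row[column]).encode()) % y + 1
    return [row for row in rows if bucket_of(row) == x]


rows = [{"stock_id": str(600000 + i)} for i in range(1000)]
sample = tablesample(rows, 3, 51, "stock_id")
print(len(sample), "of", len(rows))
```

Sampling on a column is repeatable, since the same value always lands in the same bucket, whereas sampling on rand() picks a different set of rows on every run.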
Block sampling
select * from stock_basic tablesample(0.1 percent);
