Understanding Hive in Depth: Transaction Processing

Date: 2019-11-21


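The four properties of a transaction

1. Atomicity

2. Consistency

3. Isolation

4. Durability

For Hive to support transactions, several conditions must be met: (1) every transaction is auto-committed; (2) only data stored in ORC format is supported; (3) the table must be bucketed.

To make Hive support transactions, configure the following parameters in hive-site.xml: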

<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
</property>
<property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
</property>
<property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
</property>
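
For quick experiments, the client-side settings above can also be applied per session instead of editing hive-site.xml. The following is a minimal sketch under two assumptions: the deployment allows these properties to be changed at runtime, and the two compactor properties (hive.compactor.initiator.on, hive.compactor.worker.threads) stay in hive-site.xml because they are read by the metastore service rather than by the client session.

```sql
-- Session-level equivalents of the client-side properties shown above.
-- Assumption: the cluster does not restrict runtime changes to these settings.
set hive.support.concurrency=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Relevant on Hive 1.x; later releases are understood to enforce bucketing always.
set hive.enforce.bucketing=true;

-- Giving only the property name prints its current value, which is a
-- convenient way to check that the setting took effect.
set hive.txn.manager;
```

Settings applied this way last only for the current session, so the hive-site.xml approach is still the one to use for a permanent configuration.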

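Notes on a few of the important properties:

hive.exec.dynamic.partition.mode
Possible values: strict, nonstrict.
In strict mode, at least one partition must be specified statically, which guards against accidentally operating on other partitions.
A single transaction may update more than one partition, and there is no way to control in advance which partitions will be touched, so nonstrict is required for transaction support.

hive.compactor.initiator.on
The default is false, because transactions are not enabled by default.
This property matters because it answers an earlier question: if too many delta files accumulate and put pressure on the NameNode, how can system performance be improved? Enabling this property (on the Thrift metastore server) runs the Initiator and the Cleaner on the metastore instance. The Initiator finds the tables or partitions whose delta files need to be compacted, and the Cleaner deletes delta files that are no longer needed.

Next, a few transactional operations in Hive: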

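$hive>create table tx(id int,name string,age int) clustered by (id) into 3 buckets row format delimited fields terminated by ',' stored as orc tblproperties('transactional'='true');    -- create a bucketed ORC table and mark it transactional so that it supports transactions

$hive>desc formatted tx;    -- show the structure of table tx

$hive>insert into tx values(1,'tom',23);    -- insert a row into the bucketed table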
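
Row-level updates and deletes are where transaction support becomes visible. The statements below are a sketch that assumes tx was created as a transactional table (table property 'transactional'='true') and that the hive-site.xml settings above are in effect; the new name value is purely illustrative.

```sql
-- Update a non-bucketing column; the bucketing column (id) cannot be updated.
update tx set name = 'tomas' where id = 1;

-- Delete a row; like the update, this is written as a new delta file.
delete from tx where id = 1;

-- List open and aborted transactions known to the metastore.
show transactions;

-- Request a minor compaction, which merges the accumulated delta files,
-- then check the compaction queue.
alter table tx compact 'minor';
show compactions;
```

Each update or delete adds delta files under the table directory, which is why the Initiator and Cleaner described earlier matter once such statements run frequently.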

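Hive partitions

The concept of a partition in Hive differs from partitioning in a traditional relational database.

Partitioning in a traditional database: in Oracle, for example, each partition exists independently as a segment that stores the actual data, and rows are assigned to a partition automatically when they are inserted.

Partitioning in Hive: because Hive is really an abstraction over data stored on HDFS, a partition name corresponds to a directory name and a sub-partition name to a subdirectory name; the partition is not an actual field in the data. Specifying a partition when inserting data therefore means creating a new directory or subdirectory, or adding data to an existing directory. Hive partitions fall into two kinds: static partitions and dynamic partitions.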

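1. Static partitions

$hive>create table customers(id int,name string,age int) partitioned by(year int,month int) row format delimited fields terminated by ',';    -- create a partitioned table

$hive>alter table customers add partition(year=2014,month=11) partition(year=2014,month=12);    -- add partitions to the table

$hive>desc customers;    -- show the table structure

$hive>show partitions customers;    -- list the partitions of the customers table

$hive>load data local inpath '/data/customers.txt' into table customers partition(year=2014,month=11);    -- load a local file into the specified partition; with LOCAL this is a file copy

$hive>dfs -lsr /;    -- show the directory structure of the file system

$hive>select * from customers where year=2014 and month=11;

The order in which the partition columns are declared when the table is created determines the directory order (which is the parent directory and which is the child). Because of this hierarchy, a query for year=2014 returns the data under every month directory beneath 2014. If a query filters only on month, and that month exists under several year directories, Hive prunes the input paths so that only the matching month directories are scanned; the year partition is not filtered, so the result covers every year.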
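
The relationship between partition columns, directories, and pruning can be sketched as follows. The warehouse path in the comments assumes the default location /user/hive/warehouse and the default database; the actual path depends on hive.metastore.warehouse.dir.

```sql
-- Directory layout implied by "partitioned by(year int, month int)"
-- (path assumes the default warehouse location):
--   /user/hive/warehouse/customers/year=2014/month=11/customers.txt
--   /user/hive/warehouse/customers/year=2014/month=12/

-- Filter on the parent partition only: every month directory under
-- year=2014 is read.
select * from customers where year = 2014;

-- Filter on the child partition only: the month=11 directory under
-- every year is read, and all other month directories are skipped.
select * from customers where month = 11;

-- The query plan can be inspected to see how the partition filter is applied.
explain select * from customers where month = 11;
```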

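2. Dynamic partitions

With static partitions, we first have to know which partitions exist and then load the data partition by partition, which is tedious. Dynamic partitioning avoids this extra work: rows are assigned to partitions dynamically based on the data returned by a query. The biggest difference from static partitioning is that the partition directory is not specified; the system chooses it.

Dynamic partitioning has a strict mode and a non-strict (nonstrict) mode. The difference is that strict mode requires at least one static partition to be specified on insert, whereas non-strict mode allows an insert with no static partition at all.

First enable dynamic partitioning:

$hive>set hive.exec.dynamic.partition=true;

Then set the partition mode to non-strict:

$hive>set hive.exec.dynamic.partition.mode=nonstrict;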
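
With both settings in place, a dynamic-partition insert might look like the sketch below. The staging table customers_staging and its columns are hypothetical and only serve as a source of rows; the target is the customers table defined earlier.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical staging table that carries year and month as ordinary columns.
create table customers_staging(id int, name string, age int, year int, month int)
row format delimited fields terminated by ',';

-- Fully dynamic insert (requires nonstrict mode): the partition columns come
-- last in the select list, in the order declared in "partitioned by".
insert into table customers partition(year, month)
select id, name, age, year, month from customers_staging;

-- Mixed insert (also allowed in strict mode): year is static, month is dynamic.
insert into table customers partition(year=2014, month)
select id, name, age, month from customers_staging where year = 2014;
```

Hive creates whichever year=.../month=... directories the selected rows require, so no alter table ... add partition step is needed beforehand.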
