Hive Tools, Data Model, Java API, and Common Problems

Introduction

Hive runs on top of Hadoop, so version mismatches between the two cause plenty of problems. The general rule is that Hive 2.x goes with Hadoop 2.x and Hive 3.x goes with Hadoop 3.x.

Even so, various compatibility problems still show up in practice. For the installation itself see the separate Hive installation article; here we go through how to resolve the problems you are likely to hit.

If the log shows a java.lang.NoClassDefFoundError, a jar is usually missing. The class name is right there in the log, and the matching version can be checked against the pom dependencies of the corresponding source; find that jar and drop it into the appropriate lib directory.
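
A quick way to find which jar actually contains the missing class is to scan the jars under Hive's lib directory and Hadoop's share directory. This is only a sketch: the class path below is a placeholder for whatever your log reports, and HIVE_HOME/HADOOP_HOME must point at your own installation.

# scan the Hive and Hadoop jars for the class named in the NoClassDefFoundError
for j in "$HIVE_HOME"/lib/*.jar "$HADOOP_HOME"/share/hadoop/*/*.jar; do
  jar tf "$j" | grep -q 'org/example/MissingClass.class' && echo "$j"
done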

A NoSuchMethodError in the log usually means a jar conflict, i.e. several versions of the same jar on the classpath. Go through Hive's lib directory and every subdirectory under Hadoop's share directory and replace the lower version with the higher one, since higher versions are generally backward compatible with the lower ones. If HBase is involved, check HBase's lib directory as well.

When Hive 3.1.1 is used together with Hadoop 3.0.2, the disruptor and guava jars are present in multiple versions; switching to the higher version fixes it.

hive

hive是一個操做hive數據庫的命令行接口(CLI,Client Line Interface)數據庫

beeline

beeline是一個操做hive數據庫的新命令行接口(New Client Line Interface)apache

beeline -u jdbc:hive2://localhost:10000
beeline -u jdbc:hive2://localhost:10000 -n user -p password
beeline
!connect jdbc:hive2://localhost:10000

hiveserver

HiveServer is a server interface that lets remote clients run queries against Hive and get the results back. The current implementation, HiveServer2, is an improved version based on Thrift RPC that supports concurrent clients and authentication.

hive-site.xml

<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>127.0.0.1</value>
</property>
<property>
    <name>hive.server2.webui.host</name>
    <value>127.0.0.1</value>
</property>
<property>
    <name>hive.server2.webui.port</name>
    <value>10002</value>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

hive.server2.enable.doAs exists to deal with HDFS permission problems: when it is true, HiveServer2 executes statements as the user who submitted them; when it is false, statements are executed as the admin user that the HiveServer2 process itself runs as.

hive --service hiveserver
hive --service hiveserver2
hiveserver2 --hiveconf hive.server2.thrift.port=10000

MetaStoreServer

MetaStoreServer exposes the metadata through a Thrift service, so remote clients can read metadata without talking to the backing database directly; Spark, for example, accesses this service.

hive --service metastore
<property>
 <name>hive.metastore.uris</name>
 <value>thrift://192.168.10.7:9083,thrift://192.168.10.8:9083</value>
 <description></description>
</property>


Hive Data Types and Table Creation

  1. Primitive types

TINYINT,SMALLINT,INT,BIGINT,BOOLEAN,FLOAT,DOUBLE,STRING,BINARY,TIMESTAMP,DECIMAL,CHAR,VARCHAR,DATE

Sizes: TINYINT 1 byte

SMALLINT 2 bytes

INT 4 bytes

BIGINT 8 bytes

FLOAT 4 bytes

DOUBLE 8 bytes

  2. Complex types

ARRAY,MAP,STRUCT,UNION

array is an array type, map is a set of key/value pairs, and struct is a composite of other types.

The separator between array and struct elements is ^B (Ctrl+B, octal \002 in the CREATE TABLE statement).

The separator between a map key and its value is ^C (Ctrl+C, octal \003 in the CREATE TABLE statement).

Hive's default column separator is ^A (Ctrl+A, octal \001 in the CREATE TABLE statement).

Hive's default row separator is \n.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
create table goods(id int,name string,amount int)
partitioned by (ctime date)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
;
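
The goods table above uses only primitive columns. As a sketch (the table and column names are made up for illustration), a table combining the complex types with the default delimiters could look like this:

create table user_profile(
  id int,
  tags array<string>,
  scores map<string,int>,
  address struct<city:string,street:string>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n';

A raw data line for this layout separates columns with ^A, the elements inside tags and address with ^B, and each key in scores from its value with ^C.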

Multiple partition columns can be specified, and partition columns must not be repeated among the regular columns of the table; a two-level example follows.
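
As an illustrative sketch (the table and column names are made up), a table partitioned by both day and country:

create table orders(id int, amount double)
partitioned by (dt string, country string)
row format delimited
fields terminated by '\001';

On HDFS this produces nested partition directories such as .../orders/dt=2019-07-04/country=cn/.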

Hive Data

  1. Hive essentially translates SQL statements into MapReduce jobs.
  2. Hive's underlying data is stored on HDFS.
  3. Hive's storage structures include databases, tables, views, partitions and table data.
  4. Hive databases, tables, partitions and so on each correspond to a directory on HDFS.
  5. Table data corresponds to the files under the table's HDFS directory.
  6. Hive data has no special storage format of its own on HDFS.
  7. Hive only needs to be told the column and row separators when the table is created.
  8. Hive's metadata is stored in an RDBMS; all other data is stored on HDFS.
  9. By default the metadata lives in the embedded Derby database, which allows only a single session and is only suitable for simple testing.
  10. In production an independent metadata database such as MySQL is used so that multiple user sessions are supported; a configuration sketch follows this list.
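
A minimal hive-site.xml sketch for a MySQL-backed metastore, assuming a local MySQL instance and a hive database/user that already exist (host, database name and credentials are placeholders; the MySQL JDBC driver jar also has to be placed in Hive's lib directory):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>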

Hive Data Model

  1. database: appears on HDFS as a subdirectory under ${hive.metastore.warehouse.dir}
  2. table: appears on HDFS as a subdirectory under its database directory
  3. external table: like a table, except its data can be placed in any HDFS path you specify
  4. partition: appears on HDFS as a subdirectory under the table directory
  5. bucket: appears on HDFS as multiple files under a table or partition directory, split by hashing on some column
  6. view: read-only, defined on top of base tables (see the sketch below)
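
As a sketch (the view name is made up; goods is the table created earlier), a view is simply a stored query over base tables:

create view big_orders as
select id, name, amount from goods where amount > 10000;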

Differences between internal (managed) and external tables:

  1. Dropping an internal table deletes both the metadata and the data.
  2. Dropping an external table deletes only the metadata; the data is left in place.

If all processing of the data happens inside Hive, use internal tables; if Hive shares the data set with other tools, external tables are the better choice, as in the sketch below.
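
A minimal external-table sketch, assuming the data files already live under an HDFS directory of your choosing (the path is a placeholder):

create external table ext_goods(id int, name string, amount int)
row format delimited
fields terminated by '\001'
location '/data/ext_goods';

Dropping ext_goods only removes its metadata; the files under /data/ext_goods stay on HDFS.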

Partitioned tables vs. bucketed tables:

A Hive table can be partitioned on selected columns, which refines data management and makes queries that filter on those columns faster. Tables and partitions can additionally be divided into buckets, as in the sketch below.
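
A bucketed-table sketch (the names are illustrative): rows are hashed on id into a fixed number of files per partition, which helps with sampling and with joins on the bucketing column. On Hive 1.x you may also need to set hive.enforce.bucketing=true before inserting; from Hive 2.0 onward bucketing is always enforced.

create table goods_bucketed(id int, name string, amount int)
partitioned by (ctime string)
clustered by (id) into 4 buckets
row format delimited
fields terminated by '\001';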

Java API

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.sql.*;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Random;


public class HiveTest {

//    public static final String HIVE_URL = "jdbc:hive://127.0.0.1:10000/default";//hiveserver1
    private static final String HIVE_URL = "jdbc:hive2://127.0.0.1:10000/test";//hiveserver2

    private static String DRIVER_NAME = "org.apache.hive.jdbc.HiveDriver";

    private Connection connection;

    @Before
    public void setUp() throws ClassNotFoundException, SQLException {
        Class.forName(DRIVER_NAME);
        connection = DriverManager.getConnection(HIVE_URL,"hive","");
    }

    @Test
    public void showDbs() throws SQLException {
        Statement statement = connection.createStatement();
        String sql = "show databases";
        ResultSet rs = statement.executeQuery(sql);
        while (rs.next()){
            System.out.println(rs.getString(1));
        }
    }

    @Test
    public void createDB() throws SQLException {
        //create database dbName
        //create schema dbName
        Statement statement = connection.createStatement();
        String sql = "create schema mytable";
        statement.execute(sql);
    }

    @Test
    public void createTable() throws SQLException {
        String sql = "create table goods(id int,name string,amount int) " +
                "partitioned by (ctime date) " +
                "row format delimited " +
                "fields terminated by '\\001' " +
                "collection items terminated by '\\002' " +
                "map keys terminated by '\\003' " +
                "lines terminated by '\\n'";
        Statement statement = connection.createStatement();
        statement.execute(sql);
    }

    @Test
    public void createNewGoodsTable() throws SQLException {
        String sql = "create table new_goods(id int,name string,amount int) " +
                "partitioned by (ctime string) " +
                "row format delimited " +
                "fields terminated by '\\001' " +
                "collection items terminated by '\\002' " +
                "map keys terminated by '\\003' " +
                "lines terminated by '\\n'";
        Statement statement = connection.createStatement();
        statement.execute(sql);
    }

    @Test
    public void insertGoods() throws SQLException {
        String sql = "insert into goods(id,name,amount,ctime) values(?,?,?,?)";
        PreparedStatement ps = connection.prepareStatement(sql);
        Random random = new Random();
        String[] names = {"allen","alice","bob","tony","ribon"};
        Calendar instance = Calendar.getInstance();
        for(int i=0;i<10;i++){
            ps.setInt(1,random.nextInt(1000));
            ps.setString(2,names[random.nextInt(names.length)]);
            ps.setInt(3,random.nextInt(100000));
            instance.add(Calendar.DAY_OF_MONTH,random.nextInt(3));
            ps.setDate(4,new Date(instance.getTimeInMillis()));
            ps.executeUpdate();
        }
    }

    @Test
    public void insertGood() throws SQLException {
        String sql = "insert into new_goods(id,name,amount,ctime) values (1,'allen',100,'2019-07-04')";
        Statement statement = connection.createStatement();
        statement.executeUpdate(sql);
        System.out.println("insert done");
        statement.close();
    }

    @Test
    public void selectGood() throws SQLException {
        String sql = "select id,name,amount,ctime from new_goods";
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery(sql);
        while (rs.next()){
            System.out.println("id:" + rs.getInt(1));
            System.out.println("name:" + rs.getString(2));
            System.out.println("amount:" + rs.getInt(3));
            System.out.println("ctime:" + rs.getString(4));
        }
    }

    @Test
    public void selectGoods() throws SQLException {
        String sql = "select id,name,amount,ctime from goods";
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery(sql);
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        while (rs.next()){
            System.out.println("id:" + rs.getInt(1));
            System.out.println("name:" + rs.getString(2));
            System.out.println("amount:" + rs.getInt(3));
            System.out.println("ctime:" + sdf.format(rs.getDate(4)));
        }
    }

    @Test
    public void descTable() throws SQLException {
        //desc tableName
        //describe tableName
        String sql = "desc goods";
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery(sql);
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
    }

//    @Test
    public void dropTable() throws SQLException {
        //drop table
        String sql = "drop table goods";
        Statement statement = connection.createStatement();
        statement.execute(sql);
    }

    @Test
    public void showTable() throws SQLException {
        //show tables;
        String sql = "show tables";
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery(sql);
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
    }


    @After
    public void tearDown() throws SQLException {
        connection.close();
    }
}

When using hiveserver (HiveServer1), the JDBC URL is:

jdbc:hive://127.0.0.1:10000/default

When using hiveserver2, the JDBC URL is:

jdbc:hive2://127.0.0.1:10000/default

(figures: hive-partition, hive-partition-dir, hive-running)

If you run into an error like the following:

User: xxxx is not allowed to impersonate hive

add the following to Hadoop's core-site.xml, where curitis is the name of the user that needs to do the impersonation (the user named in the error message; on Windows this is simply the Windows login name):

<property>
      <name>hadoop.proxyuser.curitis.groups</name>
      <value>*</value>
      <description></description>
 </property>
 <property>
      <name>hadoop.proxyuser.curitis.hosts</name>
      <value>*</value>
      <description></description>
  </property>

If the Windows username contains a special character such as '.', the user has to be renamed, which is slightly more work: rename the account first, then enable the Administrator account, use it to rename the old user's profile directory, and finally edit the registry under:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList

find the entry for the old username there and change it to the new one.

If you run into an error like the following:

AccessControlException Permission denied: user=hive, access=WRITE, inode="/user/hive/warehouse/test.db":admin:supergroup:drwxr-xr-x

you can run:

hadoop fs -chmod -R 777 /user

If you run into an error like the following:

ipc.Client: Retrying connect to server: account.jetbrains.com/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

add the following to Hadoop's yarn-site.xml:

<property>
    <name>yarn.resourcemanager.address</name>
    <value>127.0.0.1:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>127.0.0.1:8030</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>127.0.0.1:8031</value>
</property>

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
        http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.curitis</groupId>
    <artifactId>hive-learn</artifactId>
    <version>1.0.0</version>

    <properties>
        <spring.version>5.1.3.RELEASE</spring.version>
        <junit.version>4.11</junit.version>
        <hive.version>3.1.1</hive.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-common</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>${hive.version}</version>
        </dependency>

        <!--test-->
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-test</artifactId>
            <version>${spring.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>${junit.version}</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

log4j2.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration status="INFO" monitorInterval="600">

    <Properties>
<!--        <property name="LOG_HOME">${sys:user.home}/hive/test</property>-->
        <property name="LOG_HOME">F:/logs/hive/test</property>
        <!-- log pattern reference:
         %d{yyyy-MM-dd HH:mm:ss, SSS} : time the log event was produced
         %p : log level (priority)
         %c : logger name
         %m : log message, i.e. the argument of logger.info("message")
         %n : line separator
         %C : Java class name
         %L : line number of the logging call
         %M : method name of the logging call
         hostName : local host name
         hostAddress : local IP address -->
        <Property name="PATTERN_ONE">%5p [%t] %d{yyyy-MM-dd HH:mm:ss} (%F:%L) %m%n</Property>
        <Property name="PATTERN_TWO">%d{HH:mm:ss.SSS} %-5level %class{36} %L %M - %msg%xEx%n</Property>
    </Properties>

    <appenders>
        <console name="Console" target="SYSTEM_OUT">
            <ThresholdFilter level="INFO" onMatch="ACCEPT" onMismatch="DENY" />
            <PatternLayout pattern="${PATTERN_ONE}" />
        </console>

        <File name="FileLog" fileName="${LOG_HOME}/hive.log" append="false">
            <ThresholdFilter level="DEBUG" onMatch="ACCEPT" onMismatch="DENY"/>
            <PatternLayout pattern="${PATTERN_TWO}"/>
        </File>

        <RollingFile name="RollingFileInfo" fileName="${sys:user.home}/logs/info.log"
                     filePattern="${sys:user.home}/logs/$${date:yyyy-MM}/info-%d{yyyy-MM-dd}-%i.log">
            <ThresholdFilter level="info" onMatch="ACCEPT" onMismatch="DENY"/>
            <PatternLayout pattern="[%d{HH:mm:ss:SSS}] [%p] - %l - %m%n"/>
            <Policies>
                <TimeBasedTriggeringPolicy/>
                <SizeBasedTriggeringPolicy size="100 MB"/>
            </Policies>
        </RollingFile>

        <RollingFile name="RollingFileWarn" fileName="${sys:user.home}/logs/warn.log"
                     filePattern="${sys:user.home}/logs/$${date:yyyy-MM}/warn-%d{yyyy-MM-dd}-%i.log">
            <ThresholdFilter level="warn" onMatch="ACCEPT" onMismatch="DENY"/>
            <PatternLayout pattern="[%d{HH:mm:ss:SSS}] [%p] - %l - %m%n"/>
            <Policies>
                <TimeBasedTriggeringPolicy/>
                <SizeBasedTriggeringPolicy size="100 MB"/>
            </Policies>
            <DefaultRolloverStrategy max="20"/>
        </RollingFile>

        <RollingFile name="RollingFileError" fileName="${sys:user.home}/logs/error.log"
                     filePattern="${sys:user.home}/logs/$${date:yyyy-MM}/error-%d{yyyy-MM-dd}-%i.log">
            <ThresholdFilter level="error" onMatch="ACCEPT" onMismatch="DENY"/>
            <PatternLayout pattern="[%d{HH:mm:ss:SSS}] [%p] - %l - %m%n"/>
            <Policies>
                <TimeBasedTriggeringPolicy/>
                <SizeBasedTriggeringPolicy size="100 MB"/>
            </Policies>
        </RollingFile>
    </appenders>

    <loggers>
        <logger name="org.springframework" level="INFO"></logger>
        <root level="all">
            <appender-ref ref="Console"/>
            <appender-ref ref="FileLog"/>
<!--            <appender-ref ref="RollingFileInfo"/>-->
<!--            <appender-ref ref="RollingFileWarn"/>-->
<!--            <appender-ref ref="RollingFileError"/>-->
        </root>
    </loggers>
</configuration>

HWI (Hive Web Interface)

Only available before Hive 2.2.0 (HWI was removed from later releases).


hive-site.xml

<property>
    <name>hive.hwi.listen.host</name>
    <value>0.0.0.0</value>
    <description>address to listen on</description>
</property>
<property>
    <name>hive.hwi.listen.port</name>
    <value>9999</value>
    <description>port to listen on</description>
</property>
<property>
    <name>hive.hwi.war.file</name>
    <value>${HIVE_HOME}/lib/hive-hwi-2.1.0.war</value>
    <description>path to the HWI war file</description>
</property>
hive --service hwi

localhost:9999/hwi
