Hadoop Notes Part 6 - Data Warehouse: Hive

1. Hive Overview

   Hive is a big-data analysis tool built on top of the Hadoop file system. Using traditional SQL syntax, it makes data summarization, ad-hoc queries, and analysis of large data sets easy, and it provides UDFs (user-defined functions) for statistical analysis.
Hive's data is organized into:
     Databases: similar to MySQL databases, namespaces that keep different tables separate;
     Tables: DDL predefines the column names and data format, DML operates on data sets with rows and columns;
     Partitions: data is partitioned by specific columns, and rows are written to different partition directories in the file system;
     Buckets (or Clusters): you define the number of hash buckets and the bucketing column; when inserting, the column value hashed modulo the number of buckets decides which bucket file a row is written to (see the sketch below).
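
To make partitions and buckets concrete, here is a minimal HiveQL sketch; the table and column names (access_log, user_id, url, dt) are made up for illustration:

-- Hypothetical log table: partitioned by day, bucketed by user_id into 4 buckets.
-- Each distinct dt value gets its own directory under the table's HDFS location;
-- within a partition, rows are hashed on user_id modulo 4 to choose a bucket file.
CREATE TABLE access_log (
  user_id INT,
  url     STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Inserting with a static partition value writes under .../access_log/dt=2017-10-27/
INSERT INTO TABLE access_log PARTITION (dt = '2017-10-27')
VALUES (1, '/index.html'), (2, '/login');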

 Hive architecture:

    When Hive starts, it stores the table schema in Derby or another relational database and stores the data rows in the HDFS file system. When a client issues a query through JDBC or the corresponding client, Hive parses the SQL, optimizes it against the metastore (table schema), converts it into MapReduce tasks, executes them, and returns the results (diagram taken from the web).

Hive is suited to a small number of users running statistical queries; it is not suited to serving large volumes of user requests.

2. Installing Hive

1》 Standalone Hive installation (see the official wiki: https://cwiki.apache.org/confluence/display/Hive/Home)

   Hive stores its data in Hadoop, so a Hadoop environment is required, and it stores the table schema in a relational database. Hive ships with the Derby database and uses it by default. A Hadoop cluster needs to be set up first.

 For the environment, refer to http://blog.csdn.net/liaomin416100569/article/details/78360734

 Host information:

 

/etc/hosts      
192.168.58.147 node1      
192.168.58.149 node2      
192.168.58.150 node3      
192.168.58.151 node4

Node information:

namenode      
   node1  nameserviceid:node1      
   node2  nameserviceid:node2    
secondary namenode    
   node1    
   node2    
DataNodes      
   node2      
   node3      
   node4      
        
Resource Manager  
  node1  
NodeManager  
  node2  
  node3  
  node4

My setup uses HDFS federation, so in node1's core-site.xml I changed the default filesystem to point at this node instead of viewfs:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1</value>
</property>

The standalone Hive is installed on the node1 node.

 Install JDK 1.7 or later (already present since Hadoop is installed).
 Download the Hive release apache-hive-2.2.0-bin.tar.gz (hive.apache.org).
 Extract the Hive package to /soft/hive-2.2.0.
 Add the following to /etc/profile:

HIVE_HOME=/soft/hive-2.2.0
export HIVE_HOME
PATH=$PATH:${HIVE_HOME}/bin
export PATH

Run the following command to make it take effect immediately:

source /etc/profile

Create the key directories on node1's HDFS for storing data:

 hdfs dfs -mkdir       /tmp
 hdfs dfs  -mkdir  -p     /user/hive/warehouse
 hdfs dfs  -chmod g+w   /tmp
 hdfs dfs -chmod g+w   /user/hive/warehouse

In standalone mode the metadata files can only be accessed by one user at a time. This mode does not open any port; a connection goes straight to the metadata and the file system (Hive reads the Hadoop configuration to learn the default HDFS address, so HADOOP_HOME must be configured).
 The embedded Derby database manages the metadata.
Delete $HIVE_HOME/conf/metastore_db or back it up under another name:

mv $HIVE_HOME/conf/metastore_db $HIVE_HOME/conf/metastore_db_tmp

The following four options can be seen in hive-default.xml.template under the conf directory; they are the JDBC URL, the driver class, the username, and the password:

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value> <!-- Derby manages the metadata by default -->
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mine</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>APP</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

 If you need to store the metadata in another database, place the corresponding JDBC driver jar in ${HIVE_HOME}/lib and change these four values to match that database.

 

The databases Hive supports can be listed as follows:

[root@node1 upgrade]# pwd
/soft/hive-2.2.0/scripts/metastore/upgrade
[root@node1 upgrade]# ll
total 24
drwxr-xr-x 2 root root   89 Oct 27 01:39 azuredb
drwxr-xr-x 2 root root 4096 Oct 27 01:39 derby
drwxr-xr-x 2 root root 4096 Oct 27 01:39 mssql
drwxr-xr-x 2 root root 4096 Oct 27 01:39 mysql
drwxr-xr-x 2 root root 4096 Oct 27 01:39 oracle
drwxr-xr-x 2 root root 4096 Oct 27 01:39 postgres

Initialize Hive using the embedded Derby to manage the metadata (that is, the table schema data):

schematool -dbType derby -initSchema

After successful initialization, run the hive command to enter the client console:

hive

You can now use MySQL-like syntax to create databases, work with tables, and so on.
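
For example, a minimal session might look like this; the database and table names (test, emp) are hypothetical:

create database test;
use test;
create table emp(id int, name string, dept string);
insert into emp values(1, 'zs', 'sales');
-- the aggregation below is compiled into a MapReduce job
select dept, count(*) from emp group by dept;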

 

 If you hit connection problems along the way, check hive-log4j2.properties.template to see where the log file lives.
 The default location is property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name},
 i.e. the /tmp/root directory, which contains a file named hive.log.
 

2》 Hive server (hiveserver2 + beeline) installation

  In server mode Hive opens a port and exposes it to the outside; multiple clients log in to this port through the JDBC API to run statistical queries. This mode is suitable for multi-user use.

Along the way, this also demonstrates storing the metadata in MySQL.

  For example, a MySQL database on the local Windows machine: 192.168.58.1, port 3306, user root, password 123456.
Create a new file hive-site.xml under $HIVE_HOME/conf, copy in the four properties, and change them to the MySQL settings:

 

<configuration>

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.58.1:3306/metadb</value>  
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
</property> 
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>
  <property> 
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>
   <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
	
</configuration>

Copy the MySQL driver jar into ${HIVE_HOME}/lib.
On the 58.1 MySQL server, create the database metadb:

create database metadb

Initialize metadb from node1:

[root@node1 lib]# schematool -dbType mysql -initSchema  
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:mysql://192.168.58.1:3306/metadb
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       root
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Initialization script completed

On the Windows machine, check whether the metadb database now contains a set of new tables:

mysql> use metadb
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_metadb          |
+---------------------------+
| aux_table                 |
| bucketing_cols            |
| cds                       |
| columns_v2                |
| compaction_queue          |
| completed_compactions     |
| completed_txn_components  |
| database_params           |
| db_privs                  |
| dbs                       |
| delegation_tokens         |
| func_ru                   |
| funcs                     |
| global_privs              |
| hive_locks                |
| idxs                      |
| index_params              |
| key_constraints           |
| master_keys               |
| next_compaction_queue_id  |
| next_lock_id              |
| next_txn_id               |
| notification_log          |
| notification_sequence     |
| nucleus_tables            |
| part_col_privs            |
| part_col_stats            |
| part_privs                |
| partition_events          |
| partition_key_vals        |
| partition_keys            |
| partition_params          |
| partitions                |
| role_map                  |
| roles                     |
| sd_params                 |
| sds                       |
| sequence_table            |
| serde_params              |
| serdes                    |
| skewed_col_names          |
| skewed_col_value_loc_map  |
| skewed_string_list        |
| skewed_string_list_values |
| skewed_values             |
| sort_cols                 |
| tab_col_stats             |
| table_params              |
| tbl_col_privs             |
| tbl_privs                 |
| tbls                      |
| txn_components            |
| txns                      |
| type_fields               |
| types                     |
| version                   |
| write_set                 |
+---------------------------+
57 rows in set (0.01 sec)

Make sure Hadoop is installed on this machine and that Hadoop and YARN are fully started. The following configuration is Hadoop's proxy-user (impersonation) mechanism.

Modify Hadoop's core-site.xml:

 

<property>
    <name>hadoop.proxyuser.USERNAME.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.USERNAME.groups</name>
    <value>*</value>
</property>

USERNAME can be anything you choose; * means users from every machine in the Hadoop cluster may be impersonated through that user. It is like carrying a work badge: you no longer have to announce your name. Hive by default connects as an anonymous account with no permissions at all, so when connecting you pass a proxy user with -n <user> and can then access all files under Hadoop. For proxy users, see http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/Superusers.html

 

Here I assume Hive will connect as root:

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>

Start the service on node1 (58.147):

hiveserver2

 

By default it listens on port 10000 (for JDBC connections); 10002 serves the web monitoring UI.
Install Hive on the other machines as well.

 

beeline -u jdbc:hive2://192.168.58.147:10000 -n root    (connect as the root user)
Create a table and run queries to test.

Note: YARN and Hadoop must all be running (check with jps), because inserting test data invokes MapReduce.
Client test:

 

 

[root@node2 bin]# beeline -u jdbc:hive2://192.168.58.147:10000 -n root
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://192.168.58.147:10000
Connected to: Apache Hive (version 2.2.0)
Driver: Hive JDBC (version 2.2.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.2.0 by Apache Hive
0: jdbc:hive2://192.168.58.147:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| test           |
+----------------+--+
2 rows selected (4.615 seconds)
0: jdbc:hive2://192.168.58.147:10000> drop database test;
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database test is not empty. One or more tables exist.) (state=08S01,code=1)
0: jdbc:hive2://192.168.58.147:10000> create database hello;
No rows affected (0.96 seconds)
0: jdbc:hive2://192.168.58.147:10000> use hello;
No rows affected (0.252 seconds)
0: jdbc:hive2://192.168.58.147:10000> create table tt(id int,name string);
No rows affected (1.403 seconds)
0: jdbc:hive2://192.168.58.147:10000> insert into tt values(1,'zs');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (45.633 seconds)
0: jdbc:hive2://192.168.58.147:10000> select * from tt;
+--------+----------+--+
| tt.id  | tt.name  |
+--------+----------+--+
| 1      | zs       |
+--------+----------+--+
1 row selected (1.515 seconds)
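
The drop above failed because the test database still contains tables. If you really do want to remove a database together with all of its tables, HiveQL accepts a CASCADE clause (use with care):

-- drops the database and every table inside it
drop database test cascade;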

In the hiveserver2 console you can see that inserting data launches a MapReduce job.

 


Test successful.

Use help to list all of beeline's commands:

 

Beeline version 2.2.0 by Apache Hive
beeline> help
!addlocaldriverjar  Add driver jar file in the beeline client side.
!addlocaldrivername Add driver name that needs to be supported in the beeline
                    client side.
!all                Execute the specified SQL against all the current connections
!autocommit         Set autocommit mode on or off
!batch              Start or execute a batch of statements
!brief              Set verbose mode off
!call               Execute a callable statement
!close              Close the current connection to the database
!closeall           Close all current open connections
!columns            List all the columns for the specified table
!commit             Commit the current transaction (if autocommit is off)
!connect            Open a new connection to the database.
!dbinfo             Give metadata information about the database
!describe           Describe a table
!dropall            Drop all tables in the current database
!exportedkeys       List all the exported keys for the specified table
!go                 Select the current connection
!help               Print a summary of command usage
!history            Display the command history
!importedkeys       List all the imported keys for the specified table
!indexes            List all the indexes for the specified table
!isolation          Set the transaction isolation for this connection
!list               List the current connections
!manual             Display the BeeLine manual
!metadata           Obtain metadata information
!nativesql          Show the native SQL for the specified statement
!nullemptystring    Set to true to get historic behavior of printing null as
                    empty string. Default is false.
!outputformat       Set the output format for displaying results
                    (table,vertical,csv2,dsv,tsv2,xmlattrs,xmlelements, and
                    deprecated formats(csv, tsv))
!primarykeys        List all the primary keys for the specified table
!procedures         List all the procedures
!properties         Connect to the database specified in the properties file(s)
!quit               Exits the program
!reconnect          Reconnect to the database
!record             Record all output to the specified file
!rehash             Fetch table and column names for command completion
!rollback           Roll back the current transaction (if autocommit is off)
!run                Run a script from the specified file
!save               Save the current variabes and aliases
!scan               Scan for installed JDBC drivers
!script             Start saving a script to a file
!set                Set a beeline variable
!sh                 Execute a shell command
!sql                Execute a SQL command
!tables             List all the tables in the database
!typeinfo           Display the type map for the current connection
!verbose            Set verbose mode on

Comments, bug reports, and patches go to ???


Exit with the !quit command.

 

3》 HCatalog non-interactive client

First, reuse the hiveserver2 metastore from part 2.

HCatalog can execute SQL directly from the command line; it is mainly used for DDL statements.
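
Besides -e, hcat can also run a script file with -f (for example, hcat -f create_tables.hql; the file name here is made up). The script itself is plain HiveQL DDL; a sketch:

-- create_tables.hql: hypothetical DDL executed through HCatalog
create table if not exists logs(id int, msg string)
partitioned by (dt string);
alter table logs add partition (dt='2017-10-27');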

Create the log directory:

 

mkdir -p /soft/hive-2.2.0/hcatalog/var/log

 

 Start the service:

 

[root@node1 sbin]# ./hcat_server.sh start
Started metastore server init, testing if initialized correctly...
Metastore initialized successfully on port[9083].

Run the following command on this machine:

 

 

[root@node1 bin]# ./hcat -e "create table ggg(id int,name string)"

Use hive to check whether the table was created:

 

 

[root@node1 log]# hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/mongodb-linux-x86_64-rhel70-3.4.9/bin:/soft/hadoop-2.7.4/bin:/soft/hadoop-2.7.4/sbin:/soft/hive-2.2.0/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/soft/hive-2.2.0/lib/hive-common-2.2.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
hello
test
Time taken: 3.325 seconds, Fetched: 3 row(s)
hive> show tables;
OK
ggg
Time taken: 0.209 seconds, Fetched: 1 row(s)
hive> desc ggg;
OK
id                      int                                         
name                    string                                      
Time taken: 0.926 seconds, Fetched: 2 row(s)

For other usages, see https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat

HCatalog only supports local access; for remote access it must be paired with the WebHCat service.

WebHCat supports operating Hive through REST-style URLs.

For details, see:

  Installation: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ConfigurationManagementOverview

  Introduction: https://cwiki.apache.org/confluence/display/Hive/WebHCat+UsingWebHCat
