After successfully building an Apache Hadoop 2.8 distributed cluster in a Docker environment, with NameNode HA and ResourceManager HA (see my other post: Apache Hadoop 2.8 distributed cluster detailed setup), the next step is to set up the latest stable release, Apache Hive 2.1.1, so that Hive configurations and jobs can be tested conveniently on my own machine; the same configuration can also be applied on servers. Below is the detailed installation and configuration process for Apache Hive 2.1.1.
1. Read the Apache Hive documentation and download the latest release
Hive is a data warehouse tool built on Hadoop. It maps structured data in HDFS to tables and translates SQL-like scripts into MapReduce jobs, so that users only need to provide SQL statements, as with a traditional relational database, to analyze and process data on Hadoop. The entry barrier is low, which makes it well suited for moving from relational-database-based analysis to Hadoop-based analysis. Hive is therefore a very important tool in the Hadoop ecosystem.
The most direct way to install and configure Apache Hive is to read the documentation on the Apache Hive website, which contains a lot of useful information. Apache Hive requires JDK 1.7 or later and Hadoop 2.x (Hadoop 1.x is no longer supported since Hive 2.0.0), and Hive can be deployed on Linux, Mac, or Windows.
Download the latest stable release, Apache Hive 2.1.1, from the official website.
2. Install and configure Apache Hive
(1) Extract the Hive archive
tar -zxvf apache-hive-2.1.1-bin.tar.gz
(2) Configure environment variables
vi ~/.bash_profile

# Hive environment
export HIVE_HOME=/home/ahadoop/apache-hive-2.1.1-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$PATH:$HIVE_HOME/bin

# make the configuration take effect
source ~/.bash_profile
(3) Configure hive-site.xml
The Apache Hive website explains the configuration options. Hive supports several ways of setting a property, mainly the following (using the map-reduce scratch directory property hive.exec.scratchdir as an example):
set hive.exec.scratchdir=/tmp/mydir;
bin/hive --hiveconf hive.exec.scratchdir=/tmp/mydir
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/mydir</value>
  <description>Scratch space for Hive jobs</description>
</property>
When a property is set in more than one place, the settings take effect with the following precedence (the further to the right, the higher the priority):
hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> '--hiveconf' command-line parameters
$HIVE_HOME/conf also contains a default configuration template, hive-default.xml.template, which stores the default values. Copy this template, name it hive-site.xml, and use it to configure new values:
cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml
Before editing hive-site.xml, some preparation is needed.
First, create directories on HDFS.
a. Temporary directory: by default the map-reduce intermediate data goes to /tmp/hive-<username> on HDFS, so if /tmp does not exist on HDFS, create it and grant write permission:
hadoop fs -mkdir /tmp
hadoop fs -chmod g+w /tmp
b. Create the Hive warehouse directory, which by default is /user/hive/warehouse on HDFS, and grant write permission:
hadoop fs -mkdir /user/hive
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /user/hive
hadoop fs -chmod g+w /user/hive/warehouse
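To confirm the directories and permissions before moving on, a quick listing can be done (an optional check, not required by the official docs):
# list the newly created directories and check the group write bit (drwxrwxr-x)
hadoop fs -ls /
hadoop fs -ls /user/hive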
Second, install MySQL to store the Hive metadata.
Hive uses Derby as the metastore database by default, but Derby is mainly intended for unit testing and supports only a single user connection. In production it is recommended to switch to a more capable relational database. According to the official documentation, the databases supported for storing Hive metadata are:
| Supported metastore databases | Minimum required version |
|---|---|
| MySQL | 5.6.17 |
| Postgres | 9.1.13 |
| Oracle | 11g |
| MS SQL Server | 2008 R2 |
Since the metadata volume is fairly small, MySQL is usually installed to store it. The MySQL installation and configuration are described below.
a. Open the MySQL Community Edition download page on the MySQL website and download the MySQL rpm packages listed below.
b. The MySQL website describes how to install the rpm packages. In general you need mysql-community-server, mysql-community-client, mysql-community-libs, mysql-community-common, and mysql-community-libs-compat. On the MySQL server install at least the mysql-community-{server,client,common,libs}-* packages; on a MySQL client install at least the mysql-community-{client,common,libs}-* packages.
Before installing, check whether any MySQL-related packages are already present on the system, and remove them if so. Query with:
rpm -qa|grep mysql
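If the query returns anything, the old packages can be removed before installing the new rpms; a hedged sketch (the package name below is only an example, replace it with whatever the query actually lists):
# remove an old package reported by the query above (example name)
rpm -e --nodeps mysql-libs-5.1.73-8.el6_8.x86_64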
In this experiment the installation is done inside a Docker container running a minimal CentOS 6 image, so a few extra dependency packages need to be installed as well:
yum install -y perl libaio numactl.x86_64
Next, install the MySQL rpm packages. Since they depend on each other, install them one by one in the following order:
rpm -ivh mysql-community-common-5.7.18-1.el6.x86_64.rpm
rpm -ivh mysql-community-libs-5.7.18-1.el6.x86_64.rpm
rpm -ivh mysql-community-libs-compat-5.7.18-1.el6.x86_64.rpm
rpm -ivh mysql-community-client-5.7.18-1.el6.x86_64.rpm
rpm -ivh mysql-community-devel-5.7.18-1.el6.x86_64.rpm
rpm -ivh mysql-community-server-5.7.18-1.el6.x86_64.rpm
c. After everything is installed, start the MySQL service with service mysqld start. On the first start MySQL also initializes the database and generates an initial password for root.
[root@31d48048cb1e ahadoop]# service mysqld start
Initializing MySQL database:        [ OK ]
Installing validate password plugin: [ OK ]
Starting mysqld:                     [ OK ]
d. Retrieve the initial root password from the log with the following command:
[root@31d48048cb1e ahadoop]# grep 'temporary password' /var/log/mysqld.log
2017-06-23T04:04:40.322567Z 1 [Note] A temporary password is generated for root@localhost: g1hK=pYBo(x9
The trailing g1hK=pYBo(x9 is the initial password (randomly generated, so it differs for every installation).
Log in with the initial password and change the root password to Test.123:
mysql -u root -p
mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'Test.123';
[Note] MySQL's validate_password plugin is installed by default and enforces strong passwords: at least 8 characters, including at least one uppercase letter, one lowercase letter, one digit, and one special character.
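The current policy can be inspected, and relaxed on a throwaway test box if desired; a hedged sketch (the relaxed values below are only examples):
# inspect the current password policy
mysql -u root -p -e "SHOW VARIABLES LIKE 'validate_password%';"

# relax the policy for a test environment only
mysql -u root -p -e "SET GLOBAL validate_password_policy=LOW; SET GLOBAL validate_password_length=6;"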
e. Change the database character set. First look at the defaults:
mysql> SHOW VARIABLES like 'character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
As you can see, the database and server character sets are latin1; if databases and tables are later created without specifying utf8, Chinese input will turn into garbage. The MySQL website describes how to change the character set by editing the MySQL configuration file:
vi /etc/my.cnf

# add this under [mysqld]
[mysqld]
character-set-server=utf8

# if the client default is not utf8 and you want utf8, add this under [client]
[client]
default-character-set=utf8
After saving the configuration file, restart MySQL:
service mysqld restart
Checking the character sets again shows that they are now utf8:
mysql> SHOW VARIABLES like 'character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
f. Create the database and account for storing the Hive metadata.
Log in to MySQL as root and create the database hivedb with user hive and password Test.123:
create database hivedb;
GRANT ALL ON hivedb.* TO 'hive'@'localhost' IDENTIFIED BY 'Test.123';
flush privileges;
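A quick login as the new account confirms that the grant works (an optional check on the MySQL host itself):
# log in as the hive account and touch the metadata database
mysql -u hive -pTest.123 -e "show databases; use hivedb; show tables;"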
At this point the MySQL instance that will store the Hive metadata is ready.
Next, continue configuring hive-site.xml to use MySQL for storage; the Hive website documents this as well.
# copy the default Hive configuration template to hive-site.xml
cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml

# copy the MySQL JDBC driver into the lib directory
mv mysql-connector-java-6.0.6-bin.jar $HIVE_HOME/lib/
Edit hive-site.xml and set the JDBC connection string (hd1 is the MySQL server hostname), driver, user, password, and default warehouse path (on HDFS), as follows:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hd1:3306/hivedb?createDatabaseIfNotExist=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Test.123</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
After the basic configuration is done, a few more points need attention.
a. The official documentation strongly recommends also setting datanucleus.autoStartMechanism in hive-site.xml to avoid failures under concurrent reads (HIVE-4762):
<property>
  <name>datanucleus.autoStartMechanism</name>
  <value>SchemaTable</value>
</property>
b. Also set metastore schema verification to false in hive-site.xml, otherwise the following exception is thrown at startup:
Caused by: MetaException(message:Version information not found in metastore. )
Configure it as follows:
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
  <description>
    Enforce metastore schema version consistency.
    True: Verify that version information stored in metastore matches with one from Hive jars.
          Also disable automatic schema migration attempt. Users are required to manully migrate schema
          after Hive upgrade which ensures proper metastore schema migration. (Default)
    False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
  </description>
</property>
When this is set to true, the metastore version information is required to match the Hive jars exactly. (This is a bit odd: the package was downloaded from the official Hive site, so the metastore version should match the Hive jars, yet setting it to true still raises the exception.)
c. Configure the io temporary directory, otherwise the following exception is thrown:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
        at org.apache.hadoop.fs.Path.initialize(Path.java:254)
        at org.apache.hadoop.fs.Path.<init>(Path.java:212)
        at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:644)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:563)
        at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:531)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
        at java.net.URI.checkPath(URI.java:1823)
        at java.net.URI.<init>(URI.java:745)
        at org.apache.hadoop.fs.Path.initialize(Path.java:251)
        ... 12 more
This happens because hive-site.xml uses the variable ${system:java.io.tmpdir} for the io temporary directory, but no value is assigned to it, i.e. no io temporary path is specified, hence the exception.
Create the io temporary directory:
mkdir /home/ahadoop/hive-data
mkdir /home/ahadoop/hive-data/tmp
Then configure the io temporary path in hive-site.xml (the Linux account used in this experiment is ahadoop; adjust to your situation):
<property>
  <name>system:java.io.tmpdir</name>
  <value>/home/ahadoop/hive-data/tmp</value>
</property>
<property>
  <name>system:user.name</name>
  <value>ahadoop</value>
</property>
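An alternative, instead of defining the two system:* properties, is to replace the variable references in hive-site.xml directly with concrete paths; a minimal sketch using sed, assuming the same paths as in this experiment:
# substitute every ${system:java.io.tmpdir} / ${system:user.name} reference with fixed values
sed -i 's#${system:java.io.tmpdir}#/home/ahadoop/hive-data/tmp#g' $HIVE_HOME/conf/hive-site.xml
sed -i 's#${system:user.name}#ahadoop#g' $HIVE_HOME/conf/hive-site.xml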
With this, the configuration of hive-site.xml is complete.
(4) Initialize the Hive metastore database
Run the following command to initialize the metastore schema; otherwise the MySQL database holding the Hive metadata stays empty, Hive cannot start, and it reports errors:
schematool -dbType mysql -initSchema
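After initialization, the connection info and schema version recorded in MySQL can also be checked with schematool (an optional check):
# show the metastore connection info and schema version
schematool -dbType mysql -info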
(5) Start Hive
# type hive to start the CLI
$ hive

# list databases
hive> show databases;
OK
default
Time taken: 1.221 seconds, Fetched: 1 row(s)

# create a database
hive> create database testdb;
OK
Time taken: 0.362 seconds

# switch database
hive> use testdb;
OK
Time taken: 0.032 seconds

# create a table
hive> create table tmp_test(a int,b string) row format delimited fields terminated by '|';
OK
Time taken: 0.485 seconds

# load data
hive> LOAD DATA LOCAL INPATH 'tmp_test' OVERWRITE INTO TABLE tmp_test;
Loading data to table testdb.tmp_test
OK
Time taken: 2.926 seconds

# query the table
hive> select * from tmp_test;
OK
1       fdsfds
2       dddd
3       4fdss
Time taken: 1.376 seconds, Fetched: 3 row(s)
Creating databases and tables, loading data, and querying all work, so Hive is configured successfully.
3. Configure and use Beeline
Beeline is a newer command-line tool shipped with Hive, a JDBC client based on SQLLine CLI. It works together with HiveServer2 and supports both embedded and remote mode: it can access the local Hive service just like the Hive client, or connect to a remote Hive service by IP and port. The Hive website recommends Beeline, and it also provides friendlier output (similar to the mysql client).
a. To use Beeline, first start HiveServer2; the default port is 10000.
# start hiveserver2
$ hiveserver2
b. Use Beeline.
# 1. specify the host and port of the hiveserver2 to connect to
beeline -u jdbc:hive2://hd1:10000

# 2. for a local hiveserver2 the host and port can be omitted
beeline -u jdbc:hive2://
When connecting, Beeline throws an exception:
[ahadoop@31d48048cb1e ~]$ beeline -u jdbc:hive2://hd1:10000
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ahadoop/bin:/usr/java/jdk1.8.0_131/bin:/home/ahadoop/hadoop-2.8.0/bin:/home/ahadoop/hadoop-2.8.0/sbin:/home/ahadoop/zookeeper-3.4.10/bin:/home/ahadoop/apache-hive-2.1.1-bin/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ahadoop/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ahadoop/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://hd1:10000
17/06/23 10:35:14 [main]: WARN jdbc.HiveConnection: Failed to connect to hd1:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://hd1:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: ahadoop is not allowed to impersonate anonymous (state=08S01,code=0)
Beeline version 2.1.1 by Apache Hive
It says the local Linux user ahadoop is not allowed access: (org.apache.hadoop.security.authorize.AuthorizationException): User: ahadoop is not allowed to impersonate anonymous
This is because Hadoop 2.0 introduced an impersonation (proxy user) mechanism: Hadoop does not let a higher-level system such as Hive pass the real end user straight through to Hadoop. Instead, the real user is handed to a super proxy, which performs the operations on Hadoop on its behalf, preventing arbitrary clients from manipulating Hadoop at will. The Linux user of this experiment, ahadoop, therefore has to be configured as a proxy user.
Configure the following in $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
  <name>hadoop.proxyuser.ahadoop.hosts</name>
  <value>*</value>
  <description>* means any host. Specific hosts can also be listed (comma separated); if hosts are listed, proxying only works from those hosts and still fails on other nodes</description>
</property>
<property>
  <name>hadoop.proxyuser.ahadoop.groups</name>
  <value>*</value>
  <description>* means any group. Specific groups can also be listed (comma separated); if groups are listed, only users in those groups can act as the proxy user</description>
</property>
After the change, restart the Hadoop cluster, or refresh the configuration with the following commands:
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration

# with NameNode HA, refresh the configuration on both NameNodes
hadoop dfsadmin -fs hdfs://hd1:8020 -refreshSuperUserGroupsConfiguration
hadoop dfsadmin -fs hdfs://hd2:8020 -refreshSuperUserGroupsConfiguration
After modifying Hadoop's core-site.xml, reconnecting with beeline -u jdbc:hive2:// still fails, this time with:
17/06/24 03:48:58 [main]: WARN Datastore.Schema: Exception thrown obtaining schema column information from datastore
java.sql.SQLException: Column name pattern can not be NULL or empty.
        at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:545) ~[mysql-connector-java-6.0.6-bin.jar:6.0.6]
        at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:513) ~[mysql-connector-java-6.0.6-bin.jar:6.0.6]
        at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:505) ~[mysql-connector-java-6.0.6-bin.jar:6.0.6]
        at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:479) ~[mysql-connector-java-6.0.6-bin.jar:6.0.6]
        at com.mysql.cj.jdbc.DatabaseMetaData.getColumns(DatabaseMetaData.java:2074) ~[mysql-connector-java-6.0.6-bin.jar:6.0.6]
The cause turned out to be the relatively new MySQL JDBC driver used in this experiment (mysql-connector-java-6.0.6-bin.jar); replace it with an older version (mysql-connector-java-5.1.30-bin.jar):
mv mysql-connector-java-5.1.30-bin.jar $HIVE_HOME/lib/
rm $HIVE_HOME/lib/mysql-connector-java-6.0.6-bin.jar
After this change, connecting again with beeline -u jdbc:hive2:// works and Beeline starts normally:
[ahadoop@31d48048cb1e ~]$ beeline -u jdbc:hive2://
0: jdbc:hive2://> show databases;
OK
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| testdb         |
+----------------+--+
2 rows selected (1.89 seconds)
0: jdbc:hive2://> use testdb;
OK
No rows affected (0.094 seconds)
0: jdbc:hive2://> show tables;
OK
+------------+--+
|  tab_name  |
+------------+--+
| tmp_test   |
| tmp_test2  |
+------------+--+
2 rows selected (0.13 seconds)
0: jdbc:hive2://> desc tmp_test;
OK
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| a         | int        |          |
| b         | string     |          |
+-----------+------------+----------+--+
2 rows selected (0.308 seconds)
0: jdbc:hive2://>
As you can see, the output format is quite similar to the mysql client.
When you are done, exit Beeline with !quit.
With that, Beeline is configured successfully.
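Beeline can also be used non-interactively, which is handy for scripting; a small usage sketch (the -n user and the query file name are only examples):
# run a single statement against the remote hiveserver2 and exit
beeline -u jdbc:hive2://hd1:10000 -n ahadoop -e "show databases;"

# run all statements in a file
beeline -u jdbc:hive2://hd1:10000 -n ahadoop -f my_queries.sql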
4. Configure and use the Hive Web Interface (HWI)
The Hive Web Interface (HWI) is a web GUI bundled with Hive. Its functionality is limited; it can be used for demonstrations, to browse tables, and to run HQL scripts. The official website has a fairly detailed introduction.
The hive-bin package does not include the HWI pages, only the jar with the compiled Java code (hive-hwi-2.1.1.jar), so you also have to download the Hive source code, extract the JSP pages, package them into a war file, and put it into the Hive lib directory.
(1) Download the Hive source archive from Apache Hive: apache-hive-2.1.1-src.tar.gz
(2) Extract it and package the HWI pages into a war
# extract the source archive
tar -zxvf apache-hive-2.1.1-src.tar.gz

# switch to the hwi directory
cd apache-hive-2.1.1-src/hwi

# package the jsp pages under the web directory into a war; the output file is named hive-hwi-${version}.war
jar cfM hive-hwi-2.1.1.war -C web .
(3) Copy the war into the Hive lib directory
cp apache-hive-2.1.1-src/hwi/hive-hwi-2.1.1.war $HIVE_HOME/lib/
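To confirm that the war actually contains the JSP pages, its contents can be listed (an optional check):
# list the files packaged into the war
jar tf $HIVE_HOME/lib/hive-hwi-2.1.1.war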
(4) Start HWI
Start HWI with:
hive --service hwi &
It fails with an exception and does not start:
[ahadoop@31d48048cb1e ~]$ hive --service hwi &
[1] 3192
[ahadoop@31d48048cb1e ~]$ which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ahadoop/bin:/usr/java/jdk1.8.0_131/bin:/home/ahadoop/hadoop-2.8.0/bin:/home/ahadoop/hadoop-2.8.0/sbin:/home/ahadoop/zookeeper-3.4.10/bin:/home/ahadoop/apache-hive-2.1.1-bin/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ahadoop/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ahadoop/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
[1]+  Exit 1                  hive --service hwi
[ahadoop@31d48048cb1e ~]$ tail /tmp/ahadoop/hive.log
2017-06-25T00:37:39,289  INFO [main] hwi.HWIServer: HWI is starting up
2017-06-25T00:37:39,328  INFO [main] conf.HiveConf: Found configuration file file:/home/ahadoop/apache-hive-2.1.1-bin/conf/hive-site.xml
2017-06-25T00:37:40,651 ERROR [main] hwi.HWIServer: HWI WAR file not found at /home/ahadoop/apache-hive-2.1.1-bin//home/ahadoop/apache-hive-2.1.1-bin/lib/hive-hwi-2.1.1.war
The exception says the war file cannot be found. The cause is a bug in the HWI startup script ($HIVE_HOME/bin/ext/hwi.sh): the script uses a full path, while the code in the HWIServer class expects a path relative to HIVE_HOME:
String hwiWAR = conf.getVar(HiveConf.ConfVars.HIVEHWIWARFILE);
String hivehome = System.getenv().get("HIVE_HOME");
File hwiWARFile = new File(hivehome, hwiWAR);
if (!hwiWARFile.exists()) {
  l4j.fatal("HWI WAR file not found at " + hwiWARFile.toString());
  System.exit(1);
}
Hence the error. Changing hive.hwi.war.file in hive-site.xml to a relative path, matching what the HWIServer code expects, avoids the problem:
vi $HIVE_HOME/conf/hive-site.xml

<property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-2.1.1.war</value>
</property>
Starting again, it still fails:
When initializing hive with no arguments, the CLI is invoked. Hive has an extension architecture used to start other hive demons. Jetty requires Apache Ant to start HWI. You should define ANT_LIB as an environment variable or add that to the hive invocation.
The reason is that HWI uses Jetty as its web container, and Jetty needs Apache Ant to start, so Apache Ant must be installed as well.
a. Download the archive from the Apache Ant website (apache-ant-1.10.1-bin.tar.gz)
b. Extract it
tar -zxvf apache-ant-1.10.1-bin.tar.gz
c. Configure environment variables
vi .bash_profile

export ANT_HOME=/home/ahadoop/apache-ant-1.10.1
export ANT_LIB=$ANT_HOME/lib
export CLASSPATH=$CLASSPATH:$ANT_LIB
export PATH=$PATH:$ANT_HOME/bin
Run source .bash_profile to make the variables take effect.
With Apache Ant installed, start HWI again:
hive --service hwi &
This time it starts, but opening http://172.17.0.1:9999/hwi in a browser still shows an error, as in the screenshot below.
The error on the page is:
Unable to find a javac compiler; com.sun.tools.javac.Main is not on the classpath. Perhaps JAVA_HOME does not point to the JDK
JAVA_HOME is clearly set in the environment, so why the error? It turns out that tools.jar is missing under $HIVE_HOME/lib, so create a symlink from $JAVA_HOME/lib/tools.jar into $HIVE_HOME/lib:
ln -s $JAVA_HOME/lib/tools.jar $HIVE_HOME/lib/
After restarting with hive --service hwi & and opening http://172.17.0.1:9999/hwi again, there is still an error, as shown below.
The exception reported is:
The following error occurred while executing this line: jar:file:/home/ahadoop/apache-hive-2.1.1.bin/lib/ant-1.9.1.jar!/org/apache/tools/ant/antlib.xml:37:Could not create task or type of type:componentdef. Ant could not find the task or a class this task relies upon.
The culprit is an Ant version mismatch: the ant.jar under $HIVE_HOME/lib is version 1.9.1, while the freshly installed Ant is 1.10.1, so copy the ant.jar from ${ANT_HOME}/lib into ${HIVE_HOME}/lib:
cp ${ANT_HOME}/lib/ant.jar ${HIVE_HOME}/lib/ant-1.10.1.jar
After restarting hive --service hwi & once more, the page http://172.17.0.1:9999/hwi finally loads normally.
Clicking Browse Schema in the left-hand menu shows the tables.
Clicking Create Session creates a session (named test here), as shown below.
Clicking List Sessions lists the sessions, as shown below.
Clicking the Manager button opens the management page for that session, as shown below.
On the session management page you can run HQL scripts: enter a result file name in Result File, put the HQL script into the Query box (in this experiment: select * from testdb.tmp_test4), set Start Query to YES, and click Submit to run the session. Afterwards, click the View File button next to the Result File box to see the result, as shown below.
As this walkthrough shows, running an HQL script through HWI gives no feedback at all: you cannot tell when a query has finished, the execution log is only visible on the server side, and overall it is not very convenient. For data analysts, the CLI is generally still the recommended and more efficient way to work with Hive.
(5) About the Hive metastore service
According to the official documentation, whether you access Hive through the Hive CLI, a client, or HWI, the Hive metastore service must be started first; otherwise the Hive databases cannot be accessed and the following exception is raised:
15/01/09 16:37:58 INFO hive.metastore: Trying to connect to metastore with URI thrift://172.17.0.1:9083
15/01/09 16:37:58 WARN hive.metastore: Failed to connect to the MetaStore Server...
15/01/09 16:37:58 INFO hive.metastore: Waiting 1 seconds before next connection attempt.
Start the metastore service with hive --service metastore:
[ahadoop@31d48048cb1e ~]$ hive --service metastore
Starting Hive Metastore Server
15/01/09 16:38:52 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/09 16:38:52 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/01/09 16:38:52 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/01/09 16:38:52 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/01/09 16:38:52 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
In this experiment, however, Hive data could be accessed normally through HWI even without starting the metastore service, which is probably related to how newer versions behave.
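For reference, clients are pointed at a remote metastore service through the hive.metastore.uris property in hive-site.xml; a minimal sketch, assuming the metastore service runs on hd1 on the default port 9083:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hd1:9083</value>
</property>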
5. Install and configure HCatalog
HCatalog is a table and metadata management system for Hadoop. Built on the Hive metastore, it exposes the relational view of Hadoop data through an SQL-like language and lets Hive, Pig, MapReduce, and others share data and metadata, so application writers do not need to care how or where data is stored and are insulated from changes in schema and storage format. This flexibility means data producers can add new columns without breaking the applications that read the data, and administrators can migrate data or change its storage format without affecting producers or consumers, as shown in the figure below.
As the figure shows, at the bottom HCatalog supports data stored in multiple file formats, and at the top it supports applications such as Pig, MapReduce, Hive, and Streaming.
The benefit is that different tool sets and systems can work together. A data analysis team may start with a single tool (Hive, Pig, or MapReduce), and as the work deepens it needs to combine several: users who started with Hive for analytical queries may later need Pig for ETL processing or data modelling, while users who started with Pig may find they would rather use Hive for analysis. In these cases HCatalog provides metadata sharing, making it easy to switch between tools: for example, data can be loaded and normalised with MapReduce or Pig and then analysed with Hive. Because the tools share one metastore, users of each tool can immediately access data created by the others, without any extra load or transfer step, which is very efficient and convenient.
The Apache Hive site documents HCatalog in detail. Since Hive 0.11.0, HCatalog has been shipped inside the Hive distribution, i.e. once Hive is installed HCatalog is already there (per the official docs). The following shows how to configure and use it.
(1) Configure environment variables
vi ~/.bash_profile

export PATH=$PATH:$HIVE_HOME/hcatalog/bin:$HIVE_HOME/hcatalog/sbin

# make the environment variables take effect
source ~/.bash_profile
(2) Start HCatalog
hcat_server.sh start &
After starting, it reports an error:
Started metastore server init, testing if initialized correctly...
/usr/local/hive/hcatalog/sbin/hcat_server.sh: line 91: /usr/local/hive/hcatalog/sbin/../var/log/hcat.out: No such file or directory.
Metastore startup failed, see /usr/local/hive/hcatalog/sbin/../var/log/hcat.err
The error is that the log path does not exist (similar to the relative/absolute path issue in the HWI section). Pick a path the hive user can write to and export the HCAT_LOG_DIR variable in the startup script, as follows:
# create a log directory, e.g. /tmp/ahadoop/hcat
mkdir /tmp/ahadoop/hcat

# set the log directory at the top of hcat_server.sh (add the export line below)
vi ${HIVE_HOME}/hcatalog/sbin/hcat_server.sh
export HCAT_LOG_DIR=/tmp/ahadoop/hcat
After restarting with hcat_server.sh start, it still fails.
It turned out to be a port conflict: if hive --service metastore & is already running, it occupies port 9083, and hcat_server.sh start launches another metastore service on the same port 9083. This is because HCatalog exists to manage metadata, so it starts its own metastore service; if another application has also started a metastore, the two can conflict.
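To see which process is holding port 9083 before starting hcat_server.sh, the port can be inspected (a quick check; netstat may require the net-tools package on a minimal CentOS image):
# show the process listening on the metastore port 9083
netstat -tnlp | grep 9083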
After stopping the hive --service metastore service and running hcat_server.sh start & again, hcat_server starts cleanly:
[ahadoop@31d48048cb1e ~]$ hcat_server.sh start &
[2] 6763
[ahadoop@31d48048cb1e ~]$ Started metastore server init, testing if initialized correctly...
Metastore initialized successfully on port[9083].
[2]+  Done                    hcat_server.sh start
hcat is the client entry point for HCatalog; the official site documents it in detail (HCatalog CLI documentation).
hcat -e runs metadata commands (plain SQL statements, very convenient), for example:
hcat -e "show databases"
hcat -e "show tables"
hcat -e "create table tmp_test5(a string,b int)"
hcat -e "desc tmp_test5"
hcat -e "drop table tmp_test5"
Creating, altering, and dropping tables, describing table structures, and other metadata-related DDL statements are all supported. The following statements are not supported:
ALTER INDEX ... REBUILD
CREATE TABLE ... AS SELECT
ALTER TABLE ... CONCATENATE
ALTER TABLE ARCHIVE/UNARCHIVE PARTITION
ANALYZE TABLE ... COMPUTE STATISTICS
IMPORT FROM ...
EXPORT TABLE
Also, hcat -e "select * from tmp_test5" is not supported either, because HCatalog is meant for managing metadata, not for analysis, so it is not equivalent to Hive.
Stop the HCatalog service with hcat_server.sh stop:
hcat_server.sh stop
6. WebHCat
WebHCat is the service that provides a REST API for HCatalog. Since Hive 0.11.0, WebHCat has also been bundled with Hive (see the official docs). As the figure below shows, WebHCat lets programs securely connect to and operate the services provided by HCatalog over REST, which is convenient for Hive, Pig, MapReduce, and other applications (similar to how WebHDFS exposes HDFS over the web).
Start WebHCat with the following command; the default web port is 50111 (hcat_server.sh start must be running first):
webhcat_server.sh start &
(1) Open http://172.17.0.1:50111/templeton/v1/status in a browser to check the HCatalog status, as shown below.
(2) Open http://172.17.0.1:50111/templeton/v1/version/hive to check the Hive version, as shown below.
(3) Open http://172.17.0.1:50111/templeton/v1/version/hadoop to check the Hadoop version, as shown below.
(4) Open http://172.17.0.1:50111/templeton/v1/ddl/database to list the databases in the metastore, but this fails, as shown below.
This is because DDL operations require a user to be specified, so specify the account (ahadoop, the Linux account on the server where hcat runs).
Opening http://172.17.0.1:50111/templeton/v1/ddl/database?user.name=ahadoop still fails, as shown below.
The error message is {"error":"Unable to access program:${env.PYTHON_CMD}"}, meaning the python path cannot be resolved.
So configure the environment variable:
vi ~/.bash_profile

export PYTHON_CMD=/usr/bin/python

# make the environment variable take effect
source ~/.bash_profile
Restart WebHCat:
[ahadoop@31d48048cb1e ~]$ webhcat_server.sh stop
Lenght of string is non zero
webhcat: stopping ...
webhcat: stopping ... stopped
webhcat: done
[ahadoop@31d48048cb1e ~]$ webhcat_server.sh start &
[2] 3758
[ahadoop@31d48048cb1e ~]$ Lenght of string is non zero
webhcat: starting ...
webhcat: /home/ahadoop/hadoop-2.8.0/bin/hadoop jar /home/ahadoop/apache-hive-2.1.1-bin/hcatalog/sbin/../share/webhcat/svr/lib/hive-webhcat-2.1.1.jar org.apache.hive.hcatalog.templeton.Main
webhcat: starting ... started.
webhcat: done
[2]+  Done                    webhcat_server.sh start
Opening http://172.17.0.1:50111/templeton/v1/ddl/database?user.name=ahadoop again still fails, with the error message:
/home/ahadoop/hadoop-2.8.0/bin/hadoop: line 27: /home/ahadoop/../libexec/hadoop-config.sh: No such file or directory
/home/ahadoop/hadoop-2.8.0/bin/hadoop: line 166: exec:: not found
The message says the libexec path cannot be found. Following the hint, edit the hadoop executable around line 27 and make the following change:
# edit the hadoop executable
vi $HADOOP_HOME/bin/hadoop

# locate the HADOOP_LIBEXEC_DIR assignment and comment it out
# HADOOP_LIBEXEC_DIR=${HADOOP_LIBEXEC_DIR:-$DEFAULT_LIBEXEC_DIR}

# then add a line that sets HADOOP_LIBEXEC_DIR explicitly
HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
Restart WebHCat again:
[ahadoop@31d48048cb1e ~]$ webhcat_server.sh stop
Lenght of string is non zero
webhcat: stopping ...
webhcat: stopping ... stopped
webhcat: done
[ahadoop@31d48048cb1e ~]$ webhcat_server.sh start &
[2] 3758
[ahadoop@31d48048cb1e ~]$ Lenght of string is non zero
webhcat: starting ...
webhcat: /home/ahadoop/hadoop-2.8.0/bin/hadoop jar /home/ahadoop/apache-hive-2.1.1-bin/hcatalog/sbin/../share/webhcat/svr/lib/hive-webhcat-2.1.1.jar org.apache.hive.hcatalog.templeton.Main
webhcat: starting ... started.
webhcat: done
[2]+  Done                    webhcat_server.sh start
Then open http://172.17.0.1:50111/templeton/v1/ddl/database?user.name=ahadoop once more to list the databases, and it still fails (exasperating, to put it mildly):
{"statement":"show databases like 't*';","error":"unable to show databases for: t*","exec":{"stdout":"","stderr":"which: no /home/ahadoop/hadoop-2.8.0/bin/hadoop in ((null))\ndirname: missing operand\nTry `dirname --help' for more information.\nSLF4J: Class path contains multiple SLF4J bindings.\nSLF4J: Found binding in [jar:file:/home/ahadoop/hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: Found binding in [jar:file:/home/ahadoop/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]\nSLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.\nSLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]\n17/06/28 03:03:50 INFO conf.HiveConf: Found configuration file file:/home/ahadoop/apache-hive-2.1.1-bin/conf/hive-site.xml\n17/06/28 03:03:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n17/06/28 03:03:54 INFO metastore.HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore\n17/06/28 03:03:54 INFO metastore.ObjectStore: ObjectStore, initialize called\n17/06/28 03:03:54 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored\n17/06/28 03:03:54 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored\n17/06/28 03:03:56 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MSerDeInfo since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MPartition since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MColumnDescriptor since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MTablePrivilege since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MGlobalPrivilege since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MTable since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MRole since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MTableColumnStatistics since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MStringList since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MFunction since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MDatabase since it was managed previously\n17/06/28 03:03:57 INFO DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MStorageDescriptor since it was managed previously\n17/06/28 03:03:57 INFO 
DataNucleus.Persistence: Managing Persistence of org.apache.hadoop.hive.metastore.model.MVersionTable since it was managed previously\n17/06/28 03:03:57 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=\"Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order\"\n17/06/28 03:03:57 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL\n17/06/28 03:03:57 INFO metastore.ObjectStore: Initialized ObjectStore\n17/06/28 03:03:58 INFO metastore.HiveMetaStore: Added admin role in metastore\n17/06/28 03:03:58 INFO metastore.HiveMetaStore: Added public role in metastore\n17/06/28 03:03:58 INFO metastore.HiveMetaStore: No user is added in admin role, since config is empty\n17/06/28 03:03:58 INFO metastore.HiveMetaStore: 0: get_all_functions\n17/06/28 03:03:58 INFO HiveMetaStore.audit: ugi=ahadoop\tip=unknown-ip-addr\tcmd=get_all_functions\t\n Command was terminated due to timeout(10000ms). See templeton.exec.timeout property","exitcode":143}}
The message points to a user/proxy problem, so, similar to the earlier proxy-user configuration, set up a superuser proxy for WebHCat (using ahadoop, the Linux account on the server running WebHCat); the execution timeout is raised in the same file:
cp $HIVE_HOME/hcatalog/etc/webhcat/webhcat-default.xml $HIVE_HOME/hcatalog/etc/webhcat/webhcat-site.xml
vi $HIVE_HOME/hcatalog/etc/webhcat/webhcat-site.xml

# change the following two properties, putting the user name ahadoop after proxyuser
<property>
  <name>webhcat.proxyuser.ahadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>webhcat.proxyuser.ahadoop.groups</name>
  <value>*</value>
</property>

# increase the timeout; the default is 10 seconds, which will otherwise be hit again later
<property>
  <name>templeton.exec.timeout</name>
  <value>60000</value>
  <description>
    How long in milliseconds a program is allowed to run on the Templeton box.
  </description>
</property>
After the configuration, restart WebHCat once more:
[ahadoop@31d48048cb1e ~]$ webhcat_server.sh stop
Lenght of string is non zero
webhcat: stopping ...
webhcat: stopping ... stopped
webhcat: done
[ahadoop@31d48048cb1e ~]$ webhcat_server.sh start &
[2] 3758
[ahadoop@31d48048cb1e ~]$ Lenght of string is non zero
webhcat: starting ...
webhcat: /home/ahadoop/hadoop-2.8.0/bin/hadoop jar /home/ahadoop/apache-hive-2.1.1-bin/hcatalog/sbin/../share/webhcat/svr/lib/hive-webhcat-2.1.1.jar org.apache.hive.hcatalog.templeton.Main
webhcat: starting ... started.
webhcat: done
[2]+  Done                    webhcat_server.sh start
Open http://172.17.0.1:50111/templeton/v1/ddl/database?user.name=ahadoop to list the databases in the metastore, and it finally works.
To filter, append a like parameter to the URL; for example, to list databases whose names start with t, open:
http://172.17.0.1:50111/templeton/v1/ddl/database?user.name=ahadoop&like=t*
(5) To view a specific database, append its name after database (here the testdb database): http://172.17.0.1:50111/templeton/v1/ddl/database/testdb?user.name=ahadoop
(6) To list all tables in a database, add the table keyword (here the testdb database): http://172.17.0.1:50111/templeton/v1/ddl/database/testdb/table?user.name=ahadoop
(7) To view a specific table in a database, append the table name (here database testdb, table tmp_test3): http://172.17.0.1:50111/templeton/v1/ddl/database/testdb/table/tmp_test3?user.name=ahadoop
(8) Stop the WebHCat service with webhcat_server.sh stop:
webhcat_server.sh stop
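For scripted use, the same REST endpoints shown above can be called with curl instead of a browser while the service is running; a small sketch (note the quotes, since & and * are special to the shell):
# status and database list
curl -s 'http://172.17.0.1:50111/templeton/v1/status'
curl -s 'http://172.17.0.1:50111/templeton/v1/ddl/database?user.name=ahadoop'

# databases starting with t, and the details of one table
curl -s 'http://172.17.0.1:50111/templeton/v1/ddl/database?user.name=ahadoop&like=t*'
curl -s 'http://172.17.0.1:50111/templeton/v1/ddl/database/testdb/table/tmp_test3?user.name=ahadoop'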
7. Conclusion
With the configuration above, the installation and use of Hive, Beeline, HWI, HCatalog, and WebHCat have all been covered, along with what each of them does. Hive keeps releasing new versions, and hopefully future releases will bring more useful features and more efficient tools for people doing data analysis on Hadoop.
Feel free to follow my WeChat official account 「大數據與人工智能Lab」 (BigdataAILab) for more updates.