Application scenario:
In projects you often run into jobs whose input data set is very large but whose output is quite small, such as PV/UV statistics. To serve real-time query or OLAP needs, we want MapReduce to exchange data with MySQL directly, which is an area where HBase and Hive still fall short.
1. Reading data from MySQL:
Hadoop accesses relational databases mainly through the DBInputFormat class, which lives in the org.apache.hadoop.mapred.lib.db package. Inside a Hadoop application, DBInputFormat talks to the database through the vendor's JDBC driver and can read records with standard SQL. There are two prerequisites to be aware of before using DBInputFormat.
Before using DBInputFormat, the JDBC driver jar must be copied into $HADOOP_HOME/lib/ on every node of the cluster. A minimal read job is sketched below.
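A minimal sketch of such a read job against the old mapred API; the MySQL table users, its columns (id, name), the connection URL and the credentials are illustrative assumptions, not something given above.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class MySQLReadJob {

    // One row of the assumed "users" table; the record class must implement
    // both Writable (for Hadoop serialization) and DBWritable (for JDBC).
    public static class UserRecord implements Writable, DBWritable {
        long id;
        String name;

        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            name = Text.readString(in);
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            Text.writeString(out, name);
        }
        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong(1);
            name = rs.getString(2);
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, name);
        }
    }

    // Map each database row to (id, name); DBInputFormat supplies LongWritable keys.
    public static class ReadMapper extends MapReduceBase
            implements Mapper<LongWritable, UserRecord, LongWritable, Text> {
        public void map(LongWritable key, UserRecord value,
                        OutputCollector<LongWritable, Text> out, Reporter reporter)
                throws IOException {
            out.collect(new LongWritable(value.id), new Text(value.name));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MySQLReadJob.class);
        conf.setInputFormat(DBInputFormat.class);
        // JDBC driver class, connection URL, user and password (illustrative values).
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost:3306/testdb", "dbuser", "dbpwd");
        // Read table "users", no WHERE condition, ordered by id, two columns.
        DBInputFormat.setInput(conf, UserRecord.class, "users", null, "id", "id", "name");
        conf.setMapperClass(ReadMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path("/tmp/users_out"));
        JobClient.runJob(conf);
    }
}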
Writing data:
The results of a data-processing job are usually not very large, so it is often appropriate to have Hadoop write them directly into the database. Hadoop provides a corresponding output format for writing computation results straight to a database; a sketch of the job setup follows.
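A corresponding sketch of the write side, again with the old mapred API and DBOutputFormat; the target table stats and its columns (stat_date, pv) are assumed names used only for illustration.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

public class MySQLWriteJob {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MySQLWriteJob.class);
        conf.setOutputFormat(DBOutputFormat.class);
        // Same kind of DB connection settings as on the read side (illustrative values).
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost:3306/testdb", "dbuser", "dbpwd");
        // Results go into the assumed table "stats" with columns (stat_date, pv);
        // the reduce output key must be a DBWritable whose write(PreparedStatement)
        // fills these two columns.
        DBOutputFormat.setOutput(conf, "stats", "stat_date", "pv");
        // ... set mapper/reducer and the input path as usual, then JobClient.runJob(conf);
    }
}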
2. Common Hive commands
Frequently used Hive SQL operations
Export query results from Hive: INSERT OVERWRITE LOCAL DIRECTORY '/tmp/result.txt' select id,name from t_test;
hive -e"select id,name from t_test;"> result.txt
Three ways to connect to Hive:
1. CLI: each connection essentially keeps its own copy of the metadata state, different from every other connection, so it is not suitable for product development or applications.
2. JDBC: easily knocked over by large data volumes; unstable.
3. Use Hive's Driver class to connect directly: Driver driver = new Driver(new HiveConf(SessionState.class)); (a fuller sketch follows this list).
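A rough sketch of the Driver-class approach against the old Hive 0.x API; the query and the result handling are illustrative, and newer Hive releases change these signatures.

import java.util.ArrayList;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class EmbeddedHiveQuery {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf(SessionState.class);
        SessionState.start(conf);                      // start an embedded Hive session
        Driver driver = new Driver(conf);
        driver.run("select id, name from t_test");     // run the query in-process
        ArrayList<String> results = new ArrayList<String>();
        driver.getResults(results);                    // fetch the result rows as strings
        for (String row : results) {
            System.out.println(row);
        }
    }
}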
Connecting to Hive remotely
hive --service hiveserver -p 50000 &
This opens port 50000, after which Java programs can connect to it; make a note of the jars that are required (a connection sketch follows).
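A minimal sketch of connecting to HiveServer1 from Java over JDBC on the port opened above; the host, database name and query are assumptions, and the jar list in the comment is the usual set rather than an exact requirement.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer1Client {
    public static void main(String[] args) throws Exception {
        // HiveServer1 JDBC driver; typically needs hive-jdbc, hive-exec, hive-metastore,
        // libthrift, libfb303, hadoop-core and slf4j jars on the classpath.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Host and database are illustrative; port 50000 matches the command above.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://192.168.30.201:50000/default", "", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("select id, name from t_test");
        while (rs.next()) {
            System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}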
Writing HQL results directly into MySQL
1. First download the mysql-connector-java jar.
2. Add the required jars in the Hive session:
add jar /home/hadoop/hive-0.12.0/lib/hive-contrib-0.12.0.jar;
add jar /home/hadoop/hive-0.12.0/lib/mysql-connector-java-5.1.27-bin.jar;
3. Create a short alias (a temporary function) for the UDF:
CREATE TEMPORARY FUNCTION dboutput AS 'org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput';
4. Execute:
select dboutput('jdbc:mysql://localhost:port/dbname','db_username','db_pwd','INSERT INTO mysql_table(field1,field2,field3) VALUES (6,?,?)',substr(field_i,1,10),count(field_j)) from hive_table group by substr(field_i,1,10) limit 10;
Problem:
Hive keeps reporting that org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput cannot be found.
Solution:
After some digging it became clear that the org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput part has to be written and built yourself; compile it, package it into a jar, register it with add jar, and the error goes away. A rough sketch of such a UDF is shown below.
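For reference, a rough sketch of what such a dboutput-style UDF could look like when written from scratch; it follows the Hive GenericUDF API, but argument type checking, error handling and connection pooling are omitted, and this is not the actual hive-contrib source.

package org.apache.hadoop.hive.contrib.genericudf.example;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;

// dboutput(jdbcUrl, user, password, insertSql, arg1, arg2, ...) returns 0 on success, 1 on failure.
public class GenericUDFDBOutput extends GenericUDF {

    private final IntWritable result = new IntWritable();

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 4) {
            throw new UDFArgumentException(
                "dboutput needs at least url, user, password and an INSERT statement");
        }
        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        String url = arguments[0].get().toString();
        String user = arguments[1].get().toString();
        String pass = arguments[2].get().toString();
        String sql = arguments[3].get().toString();
        try {
            Class.forName("com.mysql.jdbc.Driver");   // driver jar must be added with add jar
            Connection conn = DriverManager.getConnection(url, user, pass);
            PreparedStatement ps = conn.prepareStatement(sql);
            // Bind the remaining UDF arguments to the '?' placeholders of the INSERT statement.
            for (int i = 4; i < arguments.length; i++) {
                ps.setString(i - 3, arguments[i].get().toString());
            }
            ps.executeUpdate();
            ps.close();
            conn.close();
            result.set(0);
        } catch (Exception e) {
            result.set(1);
        }
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "dboutput(...)";
    }
}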
Example: querying Hive from Python over Thrift (HiveServer1):

import sys

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('192.168.30.201', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHive.Client(protocol)
    transport.open()

    hql = '''CREATE TABLE people(a STRING, b INT, c DOUBLE) row format delimited fields terminated by ',' '''
    print hql

    client.execute(hql)
    client.execute("LOAD DATA LOCAL INPATH '/home/diver/data.txt' INTO TABLE people")
    #client.execute("SELECT * FROM people")
    #while (1):
    #    row = client.fetchOne()
    #    if (row == None):
    #        break
    #    print row
    client.execute("SELECT count(*) FROM people")
    print client.fetchAll()

    transport.close()
except Thrift.TException, tx:
    print '%s' % (tx.message)
A fuller example: export a Hive aggregation to a local CSV and load it into MySQL:

#!/usr/bin/python
#-*-coding:UTF-8 -*-
import sys
import os
import string
import re
import MySQLdb

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

def hiveExe(hsql, dbname):
    # Run a Hive query over Thrift and return all result rows.
    try:
        transport = TSocket.TSocket('192.168.10.1', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = ThriftHive.Client(protocol)
        transport.open()

        client.execute('ADD jar /opt/modules/hive/hive-0.7.1/lib/hive-contrib-0.7.1.jar')
        client.execute("use " + dbname)
        row = client.fetchOne()      # switching databases only needs a single fetchOne()

        client.execute(hsql)
        result = client.fetchAll()   # fetch all rows of the real query with fetchAll()
        transport.close()
        return result
    except Thrift.TException, tx:
        print '%s' % (tx.message)

def mysqlExe(sql):
    # Run a statement against MySQL (create the daily table and load the CSV).
    try:
        conn = MySQLdb.connect(user="test", passwd="test123", host="127.0.0.1", db="active2_ip", port=5029)
    except Exception, data:
        print "Could not connect to MySQL server.:", data
        return
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        conn.commit()
        cursor.close()
        conn.close()
    except Exception, data:
        print "Could not Fetch anything:", data

dbname = "active2"
# Get yesterday's date via the shell and strip the trailing newline.
date = os.popen("date -d '1 day ago' +%Y%m%d").read().strip()

# Create the daily table from the template table and load the CSV into it.
sql = "create table IF NOT EXISTS "+dbname+"_group_ip_"+date+" like "+dbname+"_group_ip;load data infile '/tmp/"+dbname+"_"+date+".csv' into table "+dbname+"_group_ip_"+date+" FIELDS TERMINATED BY ','"

# Hive query; the result is exported to a local directory such as /tmp/active2_20111129
# and may be spread over several files.
hsql = "insert overwrite local directory '/tmp/"+dbname+"_"+date+"' select count(version) as vc,stat_hour,type,version,province,city,isp from "+dbname+"_"+date+" group by province,city,version,type,stat_hour,isp"

hiveExe(hsql, dbname)                                              # run the Hive query

os.system("sudo cat /tmp/"+dbname+"_"+date+"/* > /tmp/tmplog ")    # merge the output files into one via the shell

file1 = open("/tmp/tmplog", 'r')                                   # open the merged temporary file
file2 = open("/tmp/"+dbname+"_"+date+".csv", 'w')                  # rewrite it as CSV: Hive's export uses a special delimiter
sep = ','
for line in file1:
    tmp = line[:-1].split('\x01')   # Hive's field delimiter is ASCII 001; \x01 is hex for decimal 1
    replace = sep.join(tmp)
    file2.write(replace + "\n")
file1.close()
file2.close()

os.system("sudo rm -f /tmp/tmplog")       # remove the temporary tmplog file
mysqlExe(sql)                             # run the MySQL statements: create the table and load the data
os.system("sudo rm -f /tmp/"+dbname+"_"+date)
Thrift is an open-source cross-language service development framework from Apache. It provides a code-generation engine for building services and supports C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi and many other languages.
通常來講,使用Thrift來開發應用程序,主要創建在兩種場景下:
Python就是用Thrift來鏈接Hive的
#!/bin/sh
# One-shot installation script for thrift-0.9.0
# thrift depends on boost, openssl and libevent
# The variable values below can be adjusted as needed
PROJECT_HOME=$HOME/iflow   # project source root
# Directory holding the thrift and third-party source tarballs and the install prefix;
# this script must sit in the same directory as the tarballs
THIRD_PARTY_HOME=$PROJECT_HOME/third-party
boost=boost_1_52_0
openssl=openssl-1.0.1c
libevent=libevent-2.0.19-stable
thrift=thrift-0.9.0

#
# install boost
#
printf "\n\033[0;32;34minstalling boost\033[m\n"
tar xzf $boost.tar.gz
cd $boost
./bootstrap.sh
if test $? -ne 0; then
    exit 1
fi
./b2 install --prefix=$THIRD_PARTY_HOME/boost
printf "\n\033[0;32;34m./b2 install return $?\033[m\n"
cd -

#
# install openssl
#
printf "\n\033[0;32;34minstalling openssl\033[m\n"
tar xzf $openssl.tar.gz
cd $openssl
./config --prefix=$THIRD_PARTY_HOME/openssl shared threads
if test $? -ne 0; then
    exit 1
fi
make
if test $? -ne 0; then
    exit 1
fi
make install
cd -

#
# install libevent
#
printf "\n\033[0;32;34minstalling libevent\033[m\n"
tar xzf $libevent.tar.gz
cd $libevent
./configure --prefix=$THIRD_PARTY_HOME/libevent
if test $? -ne 0; then
    exit 1
fi
make
if test $? -ne 0; then
    exit 1
fi
make install
cd -

#
# install thrift
#
printf "\n\033[0;32;34minstalling thrift\033[m\n"
tar xzf $thrift.tar.gz
cd $thrift
# A plain configure with --with-openssl runs into
# "Error: libcrypto required."; use CPPFLAGS and LDFLAGS instead
./configure --prefix=$THIRD_PARTY_HOME/thrift --with-boost=$THIRD_PARTY_HOME/boost --with-libevent=$THIRD_PARTY_HOME/libevent CPPFLAGS="-I$THIRD_PARTY_HOME/openssl/include" LDFLAGS="-ldl -L$THIRD_PARTY_HOME/openssl/lib" --with-qt4=no --with-c_glib=no --with-csharp=no --with-java=no --with-erlang=no --with-python=no --with-perl=no --with-ruby=no --with-haskell=no --with-go=no --with-d=no
if test $? -ne 0; then
    exit 1
fi
# configure now succeeds, but without the edits below make still fails,
# complaining that malloc is undeclared
sed -i -e 's!#define HAVE_MALLOC 0!#define HAVE_MALLOC 1!' config.h
sed -i -e 's!#define HAVE_REALLOC 0!#define HAVE_REALLOC 1!' config.h
sed -i -e 's!#define malloc rpl_malloc!/*#define malloc rpl_malloc*/!' config.h
sed -i -e 's!#define realloc rpl_realloc!/*#define realloc rpl_realloc*/!' config.h
make
if test $? -ne 0; then
    exit 1
fi
make install
cd -
# report success
printf "\n\033[0;32;34minstall SUCCESS\033[m\n"
Errors when importing Hive results into MySQL; see also: the differences between HiveServer and HiveServer2.
1. Sqoop depends on ZooKeeper, so ZOOKEEPER_HOME must be configured in the environment variables.
2. sqoop-1.2.0-CDH3B4 depends on hadoop-core-0.20.2-CDH3B4.jar, so download hadoop-0.20.2-CDH3B4.tar.gz, extract it, and copy hadoop-0.20.2-CDH3B4/hadoop-core-0.20.2-CDH3B4.jar into sqoop-1.2.0-CDH3B4/lib.
3. Importing MySQL data with Sqoop depends on mysql-connector-java-.jar at runtime, so download that jar and copy it into sqoop-1.2.0-CDH3B4/lib as well.
Inserting Hive statistics directly into MySQL with a UDF
http://www.linuxidc.com/Linux/2013-04/82878.htm
Saving Hive results to MySQL with a Python script
http://pslff.diandian.com/post ... 08648
Summary of Hive insert operations: partitions and exports
The insert syntax:
1. Basic insert syntax:
insert overwrite table tablename [partition(partcol1=val1,partclo2=val2)] select_statement;
insert into table tablename [partition(partcol1=val1,partclo2=val2)] select_statement;
e.g.:
insert overwrite table test_insert select * from test_table;
insert into table test_insert select * from test_table;
Note: overwrite replaces the existing data, into appends to it.
2. Inserting into several tables in one pass:
from source_table
insert overwrite table tablename1 [partition (partcol1=val1,partclo2=val2)] select_statement1
insert overwrite table tablename2 [partition (partcol1=val1,partclo2=val2)] select_statement2
e.g.:
from test_table
insert overwrite table test_insert1 select key
insert overwrite table test_insert2 select value;
Note: Hive does not support inserting rows one at a time with insert statements, nor does it support update. Data is loaded into already-created tables with load, and once imported it cannot be modified.
3. Saving query results to a filesystem:
insert overwrite [local] directory 'directory' select_statement;
e.g.:
(1) Export data to a local directory:
insert overwrite local directory '/home/hadoop/data' select * from test_insert1;
The generated files replace everything else in the target directory, i.e. files already in the directory are deleted. Only overwrite works here; into is an error.
(2) Export data to HDFS:
insert overwrite directory '/user/hive/warehouse/table' select value from test_table;
Only overwrite works here; into is an error.
(3) The same query result can be inserted into several tables or directories at the same time:
from source_table
insert overwrite local directory '/home/hadoop/data' select *
insert overwrite directory '/user/hive/warehouse/table' select value;
4. Summary:
(1) The insert command is mainly used to export data from Hive; the destination can be HDFS or the local filesystem, and what gets exported depends on the select statement you write.
(2) overwrite vs. into: with insert ... table, both overwrite and into work; with insert ... directory, only overwrite works.
Hive installation in detail (the approach matters most); Beeline
Looking ahead: most people have only ever touched the same familiar pieces; those with more foresight find the shortcuts. Data is not about size, the key is its value.
1. Installation
yum install the Hive-related packages. The packages are:
hive – base package that provides the complete language and runtime (required)
hive-metastore – provides scripts for running the metastore as a standalone service (optional)
hive-server – provides scripts for running the original HiveServer as a standalone service (optional)
hive-server2 – provides scripts for running the new HiveServer2 as a standalone service (optional)
2. Configuring MySQL as the Hive metastore database
1) Create the database:
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
2) Create the user and grant privileges:
mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
...
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'metastorehost';
mysql> FLUSH PRIVILEGES;
mysql> quit;
3. Configuring hive-site.xml
a) Basic configuration (remote mode):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.1.52:3306/metastore</value>
  <description>the URL of the MySQL database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>
<property>
  <name>datanucleus.autoStartMechanism</name>
  <value>SchemaTable</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://192.168.1.57:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
The hive.metastore.uris setting is what selects the third way of using Hive (remote mode).
Note: from Hive 0.10 on, hive.metastore.local no longer needs to be set as long as the parameter above is configured.
4. Configuring HiveServer2
Set the following options in hive-site.xml:
<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>zk1.myco.com,zk2.myco.com,zk3.myco.com</value>
</property>
Note: leaving hive.zookeeper.quorum unconfigured prevents Hive QL requests from running concurrently and can corrupt data. Enabling the Table Lock Manager without specifying a list of valid Zookeeper quorum nodes will result in unpredictable behavior. Make sure that both properties are properly configured.
5. Installing ZooKeeper
HiveServer2's table lock manager depends on ZooKeeper, so ZooKeeper has to be installed and started (see the article "Zookeeper installation" for details). Start the ZooKeeper cluster; if ZooKeeper is not listening on its default port, the parameter hive.zookeeper.client.port must be set explicitly.
6. Starting the services
1) Start hive-metastore:
sudo service hive-metastore start
or: hive --service metastore
The default port after startup is 9083.
2) Start hive-server2:
sudo service hive-server2 start
3) Test
Connect to hive-server2 with the Beeline console:
/usr/bin/beeline
> !connect jdbc:hive2://localhost:10000 -n hive -p hive org.apache.hive.jdbc.HiveDriver
Run show tables and similar commands to check the results.
Appendix 1: beeline parameters
Usage: java org.apache.hive.cli.beeline.BeeLine
-u  the JDBC URL to connect to
-n  the username to connect as
-p  the password to connect as
-d  the driver class to use
-e  query that should be executed
-f  script file that should be executed
--color=[true/false]       control whether color is used for display
--showHeader=[true/false]  show column names in query results
--headerInterval=ROWS      the interval between which headers are displayed
--fastConnect=[true/false] skip building table/column list for tab-completion
Particularly useful parameters:
--fastConnect=true  Building list of tables and columns for tab-completion (set fastconnect to true to skip)... (this really does help)
--isolation         set the transaction isolation level
Examples:
Run a SQL statement:
beeline -u jdbc:hive2://localhost:10000 -n hdfs -p hdfs -e "show tables"
Run a SQL file:
beeline -u jdbc:hive2://localhost:10000 -n hdfs -p hdfs -f hiveql_test.sql
Appendix 2: differences between hive-server1 and hive-server2
JDBC differences between HiveServer1 and HiveServer2:
HiveServer version | Connection URL | Driver Class
HiveServer2 | jdbc:hive2://<host>:<port> | org.apache.hive.jdbc.HiveDriver
HiveServer1 | jdbc:hive://<host>:<port> | org.apache.hadoop.hive.jdbc.HiveDriver
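To round this off, a minimal sketch of connecting to HiveServer2 from Java with the URL and driver class from the table above; host, port, user and query are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer2Client {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver class from Appendix 2.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Same host/port/credentials as in the beeline test above.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "hive");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("show tables");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}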