Notes Organized by Tag

1. Summary: connecting MapReduce with MySQL

Use case:

  In projects you often meet jobs whose input data set is very large but whose output is small, for example PV/UV statistics. To serve real-time queries or OLAP needs, we want MapReduce to exchange data with MySQL directly, which is something HBase and Hive still handle poorly.

1. Reading data from MySQL:

  Hadoop accesses relational databases mainly through the DBInputFormat class, located in the org.apache.hadoop.mapred.lib.db package. DBInputFormat lets a Hadoop application talk to the database through the JDBC driver supplied by the database vendor, and records can be read from the database with standard SQL. Before using DBInputFormat you need to know two prerequisites; a read-side sketch follows the two notes below.

  1. Before using DBInputFormat, the JDBC driver jar must be copied to $HADOOP_HOME/lib/ on every node of the cluster.

  2. When MapReduce accesses a relational database, the large number of frequent queries and reads issued from the MapReduce program puts a heavy load on the database. The DBInputFormat interface is therefore only suitable for reading small amounts of data and is not suitable for data-warehouse workloads. To process a data warehouse, use the database's dump tool to export the data to be analysed as text, upload it to HDFS and process it there; see: http://www.cnblogs.com/liqizhou/archive/2012/05/15/2501835.html
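
  A minimal read-side sketch (old mapred API). The table t_pv(id, url), the database testdb and the user/password are hypothetical placeholders, not part of the original text; treat this as an illustration of the DBInputFormat wiring rather than a drop-in implementation.

import java.io.*;
import java.sql.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.db.*;

public class PvFromMysql {

    // One row of the hypothetical t_pv(id BIGINT, url VARCHAR) table.
    public static class PvRecord implements Writable, DBWritable {
        long id;
        String url;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong(1);
            url = rs.getString(2);
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, url);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            url = Text.readString(in);
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            Text.writeString(out, url);
        }
    }

    // DBInputFormat hands each table row to the mapper as the value.
    public static class PvMapper extends MapReduceBase
            implements Mapper<LongWritable, PvRecord, Text, LongWritable> {
        public void map(LongWritable key, PvRecord row,
                        OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
            out.collect(new Text(row.url), new LongWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PvFromMysql.class);
        conf.setInputFormat(DBInputFormat.class);
        // The MySQL JDBC driver jar must already be in $HADOOP_HOME/lib on every node.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://192.168.1.1:3306/testdb", "user", "passwd");
        // Read the id and url columns of t_pv, splitting on id.
        DBInputFormat.setInput(conf, PvRecord.class, "t_pv", null, "id",
                new String[] { "id", "url" });
        conf.setMapperClass(PvMapper.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(LongWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        // A reducer that sums the counts is omitted to keep the sketch short.
        FileOutputFormat.setOutputPath(conf, new Path("/tmp/pv_out"));
        JobClient.runJob(conf);
    }
}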

Writing data to MySQL:

   The result of the data processing is usually not very large, so it is often practical for Hadoop to write it directly into the database. Hadoop provides the corresponding classes for outputting computation results straight to a database (a write-side sketch follows the list below):

    1. DBOutputFormat: provides the database output (write) interface.
    2. DBRecordWriter: provides the interface for writing data records into the database.
    3. DBConfiguration: provides the interface for database configuration and connection creation.
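
A minimal write-side sketch (old mapred API). The target table pv_result(url, pv), the input path and the connection details are hypothetical; the idea it illustrates is that the reducer's output key implements DBWritable and becomes one INSERT per record, while the value is ignored by DBOutputFormat.

import java.io.*;
import java.sql.*;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.db.*;

public class PvToMysql {

    // Output record: one row of the hypothetical pv_result(url VARCHAR, pv BIGINT) table.
    public static class PvResult implements Writable, DBWritable {
        String url;
        long pv;

        public PvResult() {}
        public PvResult(String url, long pv) { this.url = url; this.pv = pv; }

        public void write(PreparedStatement ps) throws SQLException {
            ps.setString(1, url);
            ps.setLong(2, pv);
        }
        public void readFields(ResultSet rs) throws SQLException {
            url = rs.getString(1);
            pv = rs.getLong(2);
        }
        public void write(DataOutput out) throws IOException {
            Text.writeString(out, url);
            out.writeLong(pv);
        }
        public void readFields(DataInput in) throws IOException {
            url = Text.readString(in);
            pv = in.readLong();
        }
    }

    // Map side: parse "url<TAB>count" text lines (purely illustrative input).
    public static class LineMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable key, Text line,
                        OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
            String[] parts = line.toString().split("\t");
            out.collect(new Text(parts[0]), new LongWritable(Long.parseLong(parts[1])));
        }
    }

    // Reduce side: the key (a DBWritable) is written to MySQL; the NullWritable value is ignored.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, LongWritable, PvResult, NullWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<PvResult, NullWritable> out, Reporter reporter) throws IOException {
            long sum = 0;
            while (values.hasNext()) sum += values.next().get();
            out.collect(new PvResult(key.toString(), sum), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PvToMysql.class);
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://192.168.1.1:3306/testdb", "user", "passwd");
        conf.setOutputFormat(DBOutputFormat.class);
        // Generates INSERT INTO pv_result (url, pv) VALUES (?, ?) under the hood.
        DBOutputFormat.setOutput(conf, "pv_result", "url", "pv");
        FileInputFormat.setInputPaths(conf, new Path("/tmp/pv_in"));
        conf.setMapperClass(LineMapper.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(LongWritable.class);
        conf.setReducerClass(SumReducer.class);
        conf.setOutputKeyClass(PvResult.class);
        conf.setOutputValueClass(NullWritable.class);
        JobClient.runJob(conf);
    }
}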

2. Common Hive commands

  Commonly used Hive SQL commands

  Exporting query results from Hive:  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/result.txt' select id,name from t_test;

             hive -e "select id,name from t_test;" > result.txt

Three ways to connect to Hive:

  1. CLI: essentially each connection holds its own metastore state, different from all the others, so it is not suitable for product development or applications.

  2. JDBC: this kind of connection is easily knocked over by large data volumes and is unstable.

  3. Using Hive's Driver class directly:  Driver driver = new Driver(new HiveConf(SessionState.class)); (a sketch follows this list)
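
A rough sketch of option 3, the embedded Driver, assuming an old Hive 0.x classpath; the exact run()/getResults() signatures vary between Hive versions, so treat this as illustrative only.

import java.util.ArrayList;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Driver;
import org.apache.hadoop.hive.ql.session.SessionState;

public class EmbeddedHiveQuery {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf(SessionState.class);
        SessionState.start(conf);                  // open an in-process Hive session
        Driver driver = new Driver(conf);
        driver.run("select id, name from t_test"); // compile and execute inside this JVM
        ArrayList<String> rows = new ArrayList<String>();
        driver.getResults(rows);                   // rows come back as delimited strings
        for (String row : rows) {
            System.out.println(row);
        }
    }
}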

Connecting to Hive remotely

  hive --service hiveserver -p 50000 &  

  This opens port 50000, after which Java programs can connect over JDBC; make a note of the jars that are required (a connection sketch follows).
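
A minimal JDBC sketch against that HiveServer1 port. The host and table names are illustrative; the client classpath needs the hive-jdbc, hive-exec, hive-service, libthrift/libfb303 and Hadoop core jars.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer1JdbcClient {
    public static void main(String[] args) throws Exception {
        // HiveServer1 driver class and jdbc:hive:// URL scheme.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Port 50000 matches the "hive --service hiveserver -p 50000" command above.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://192.168.30.201:50000/default", "", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("select id, name from t_test");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}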

 

Writing HQL results directly into MySQL

1. First download the mysql-connector-java jar.

2. Add the required jars in the Hive CLI:

add jar /home/hadoop/hive-0.12.0/lib/hive-contrib-0.12.0.jar;

add jar /home/hadoop/hive-0.12.0/lib/mysql-connector-java-5.1.27-bin.jar;

3. Give the UDF method a short alias:

CREATE TEMPORARY FUNCTION dboutput AS 'org.apache.hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput';

4. Run:

select dboutput('jdbc:mysql://localhost:port/dbname','db_username','db_pwd','INSERT INTO mysql_table(field1,field2,field3) VALUES (6,?,?)',substr(field_i,1,10),count(field_j)) from hive_table group by substr(field_i,1,10) limit 10;

Problem:

It keeps reporting that org.apache.Hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput cannot be found.

Solution:

After some digging it became clear that the org.apache.Hadoop.hive.contrib.genericudf.example.GenericUDFDBOutput part has to be written by yourself; once written, package it into a jar and register it with add jar (a sketch of such a UDF follows).
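
For reference, a simplified sketch of what such a dboutput-style GenericUDF could look like. The package and class names here are your own choice (and note that hive-contrib itself ships a GenericUDFDBOutput under the all-lowercase org.apache.hadoop... package, so it is worth double-checking the spelling before rewriting it). Proper ObjectInspector handling is omitted to keep it short.

package my.hive.udf;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;

// Takes (jdbc_url, user, password, insert_sql, arg1, arg2, ...) and runs the
// prepared INSERT once per row, returning 0 on success and 1 on failure.
public class GenericUDFDBOutput extends GenericUDF {

    private final IntWritable result = new IntWritable();

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length < 4) {
            throw new UDFArgumentException("dboutput(url, user, password, sql, args...) needs at least 4 arguments");
        }
        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        String url = String.valueOf(arguments[0].get());
        String user = String.valueOf(arguments[1].get());
        String passwd = String.valueOf(arguments[2].get());
        String sql = String.valueOf(arguments[3].get());
        try {
            Class.forName("com.mysql.jdbc.Driver"); // the mysql-connector jar must also be added with ADD JAR
            Connection conn = DriverManager.getConnection(url, user, passwd);
            PreparedStatement ps = conn.prepareStatement(sql);
            for (int i = 4; i < arguments.length; i++) {
                // fill the ?'s in the INSERT; everything is passed as a string for simplicity
                ps.setString(i - 3, String.valueOf(arguments[i].get()));
            }
            ps.executeUpdate();
            ps.close();
            conn.close();
            result.set(0);
        } catch (Exception e) {
            result.set(1); // swallow the error and signal failure for this row
        }
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "dboutput(...)";
    }
}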

 

Connecting to Hive from Python
import sys
from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('192.168.30.201', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = ThriftHive.Client(protocol)
    transport.open()
    hql = '''CREATE TABLE people(a STRING, b INT, c DOUBLE) row format delimited fields terminated by ',' '''
    print hql

    client.execute(hql)
    client.execute("LOAD DATA LOCAL INPATH '/home/diver/data.txt' INTO TABLE people")
    #client.execute("SELECT * FROM people")
    #while (1):
    #  row = client.fetchOne()
    #  if (row == None):
    #    break
    #  print row
    client.execute("SELECT count(*) FROM people")
    print client.fetchAll()

    transport.close()

except Thrift.TException, tx:
    print '%s' % (tx.message)

  

#!/usr/bin/python
#-*-coding:UTF-8 -*-
import sys
import os
import string
import re
import MySQLdb

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

def hiveExe(hsql, dbname):
    # run a Hive query over Thrift and return all result rows
    try:
        transport = TSocket.TSocket('192.168.10.1', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)

        client = ThriftHive.Client(protocol)
        transport.open()

        client.execute('ADD jar /opt/modules/hive/hive-0.7.1/lib/hive-contrib-0.7.1.jar')

        client.execute("use "+dbname)
        row = client.fetchOne()
        # selecting the database only needs a single fetchOne()

        client.execute(hsql)
        result = client.fetchAll()
        # fetchAll() pulls back the complete result set

        transport.close()
        return result

    except Thrift.TException, tx:
        print '%s' % (tx.message)

def mysqlExe(sql):
    # run the ';'-separated statements against MySQL
    try:
        conn = MySQLdb.connect(user="test", passwd="test123", host="127.0.0.1", db="active2_ip", port=5029)
    except Exception, data:
        print "Could not connect to MySQL server.:", data
        return
    try:
        cursor = conn.cursor()
        # MySQLdb does not run multi-statement strings by default, so execute them one by one
        for stmt in sql.split(';'):
            if stmt.strip():
                cursor.execute(stmt)
        conn.commit()
        cursor.close()
        conn.close()
    except Exception, data:
        print "Could not Fetch anything:", data

dbname = "active2"
pipe = os.popen("date -d '1 day ago' +%Y%m%d")
date = pipe.read().strip()
pipe.close()
# get yesterday's date via the shell and strip the trailing newline

sql = "create table IF NOT EXISTS "+dbname+"_group_ip_"+date+" like "+dbname+"_group_ip;load data infile '/tmp/"+dbname+"_"+date+".csv' into table "+dbname+"_group_ip_"+date+" FIELDS TERMINATED BY ','"
# create the dated table from the template table, then load the csv into it

hsql = "insert overwrite local directory '/tmp/"+dbname+"_"+date+"' select count(version) as vc,stat_hour,type,version,province,city,isp from "+dbname+"_"+date+" group by province,city,version,type,stat_hour,isp"
# hive query; the result is exported to a local directory such as /tmp/active2_20111129 and may consist of several files

hiveExe(hsql, dbname)
# run the hive query

os.system("sudo cat /tmp/"+dbname+"_"+date+"/* > /tmp/tmplog ")
# merge the multiple result files into a single file, tmplog, via the shell

file1 = open("/tmp/tmplog", 'r')
# open the merged temporary file
file2 = open("/tmp/"+dbname+"_"+date+".csv", 'w')
# open the output file and replace the delimiter: hive exports use a special
# separator, so rewrite the data as comma-separated csv
sep = ','
for line in file1:
    tmp = line[:-1].split('\x01')
    # hive export files are delimited by ASCII 001; \x01 is its hex escape (decimal 1)
    replace = sep.join(tmp)
    file2.write(replace+"\n")


file1.close()
file2.close()

os.system("sudo rm -f /tmp/tmplog")
# remove the temporary tmplog

mysqlExe(sql)
# run the mysql statements: create the table and load the data
os.system("sudo rm -rf /tmp/"+dbname+"_"+date)
# clean up the hive export directory (rm -rf, since it is a directory)

 Thrift is an open-source cross-language service development framework from Apache. It provides a code-generation engine for building services and supports many programming languages, including C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi.

通常來講,使用Thrift來開發應用程序,主要創建在兩種場景下:

  • First, in a large project that requires several teams to collaborate, the members of each team may not have the same programming skills; Thrift is used to build the services so that development can cross language boundaries.
  • Second, when companies cooperate with each other, cross-language programming environments are unavoidable; Thrift provides cross-platform capabilities similar to Web Services.

Python connects to Hive through Thrift.

#!/bin/sh
# One-shot script to install thrift-0.9.0
# thrift depends on boost, openssl and libevent
# The variable values below can be adjusted to match your environment
PROJECT_HOME=$HOME/iflow # project source root
# Directory holding the thrift and third-party source tarballs and the install prefix;
# this script must sit in the same directory as those tarballs
THIRD_PARTY_HOME=$PROJECT_HOME/third-party
boost=boost_1_52_0
openssl=openssl-1.0.1c
libevent=libevent-2.0.19-stable
thrift=thrift-0.9.0
#
# install boost
#
printf "\n\033[0;32;34minstalling boost\033[m\n"
tar xzf $boost.tar.gz
cd $boost
./bootstrap.sh
if test $? -ne 0; then
exit 1
fi
./b2 install --prefix=$THIRD_PARTY_HOME/boost
printf "\n\033[0;32;34m./b2 install return $?\033[m\n"
cd -
#
# install openssl
#
printf "\n\033[0;32;34minstalling openssl\033[m\n"
tar xzf $openssl.tar.gz
cd $openssl
./config --prefix=$THIRD_PARTY_HOME/openssl shared threads
if test $? -ne 0; then
exit 1
fi
make
if test $? -ne 0; then
exit 1
fi
make install
cd -
#
# install libevent
#
printf "\n\033[0;32;34minstalling libevent\033[m\n"
tar xzf $libevent.tar.gz
cd $libevent
./configure --prefix=$THIRD_PARTY_HOME/libevent
if test $? -ne 0; then
exit 1
fi
make
if test $? -ne 0; then
exit 1
fi
make install
cd -
#
# install thrift
#
printf "\n\033[0;32;34minstalling thrift\033[m\n"
tar xzf $thrift.tar.gz
cd $thrift
# A plain configure with --with-openssl hits the "Error: libcrypto required."
# error, so CPPFLAGS and LDFLAGS are used here instead
./configure --prefix=$THIRD_PARTY_HOME/thrift \
           --with-boost=$THIRD_PARTY_HOME/boost \
           --with-libevent=$THIRD_PARTY_HOME/libevent \
           CPPFLAGS="-I$THIRD_PARTY_HOME/openssl/include" \
           LDFLAGS="-ldl -L$THIRD_PARTY_HOME/openssl/lib" \
           --with-qt4=no --with-c_glib=no --with-csharp=no \
           --with-java=no --with-erlang=no --with-python=no \
           --with-perl=no --with-ruby=no --with-haskell=no \
           --with-go=no --with-d=no
if test $? -ne 0; then
exit 1
fi
# With the changes above configure succeeds, but the edits below are still needed,
# otherwise make reports that malloc is undeclared
sed -i -e 's!#define HAVE_MALLOC 0!#define HAVE_MALLOC 1!' config.h
sed -i -e 's!#define HAVE_REALLOC 0!#define HAVE_REALLOC 1!' config.h
sed -i -e 's!#define malloc rpl_malloc!/*#define malloc rpl_malloc*/!' config.h
sed -i -e 's!#define realloc rpl_realloc!/*#define realloc rpl_realloc*/!' config.h
make
if test $? -ne 0; then
exit 1
fi
make install
cd -
# print a success message
printf "\n\033[0;32;34minstall SUCCESS\033[m\n"

If importing Hive results into MySQL reports errors, refer to the difference between HiveServer and HiveServer2 (see Appendix 2 below).

1. Sqoop depends on ZooKeeper, so ZOOKEEPER_HOME must be added to the environment variables.

2. sqoop-1.2.0-CDH3B4 depends on hadoop-core-0.20.2-CDH3B4.jar, so you need to download hadoop-0.20.2-CDH3B4.tar.gz, unpack it, and copy hadoop-0.20.2-CDH3B4/hadoop-core-0.20.2-CDH3B4.jar into sqoop-1.2.0-CDH3B4/lib.

3. When Sqoop imports MySQL data it depends on mysql-connector-java-.jar at runtime, so you need to download mysql-connector-java-.jar and copy it into sqoop-1.2.0-CDH3B4/lib.

Using a UDF to insert Hive statistics directly into MySQL:
http://www.linuxidc.com/Linux/2013-04/82878.htm

A Python script that saves Hive results into MySQL:
http://pslff.diandian.com/post ... 08648

 

Summary of Hive insert operations: partitioning and exporting

The insert syntax is as follows:

1. Basic insert syntax:
insert overwrite table tablename [partition(partcol1=val1,partclo2=val2)] select_statement;
insert into table tablename [partition(partcol1=val1,partclo2=val2)] select_statement;
eg:
insert overwrite table test_insert select * from test_table;
insert into table test_insert select * from test_table;
Note:
overwrite replaces the existing data; into appends to it.

2. Inserting into multiple tables:
from source_table
insert overwrite table tablename1 [partition (partcol1=val1,partclo2=val2)] select_statement1
insert overwrite table tablename2 [partition (partcol1=val1,partclo2=val2)] select_statement2
eg:
from test_table                     
insert overwrite table test_insert1 
select key
insert overwrite table test_insert2
select value;
Note: Hive does not support inserting rows one by one with insert statements, nor does it support update. Data is loaded into already-created tables with load, and once imported it cannot be modified.

3. Saving query results to the filesystem:
insert overwrite [local] directory 'directory' select_statement;
eg:
(1) Export to a local directory:
insert overwrite local directory '/home/hadoop/data' select * from test_insert1;
The generated files overwrite any other files in the target directory, i.e. files already in the directory are deleted.
Only overwrite can be used here; into is an error!
(2) Export to HDFS:
insert overwrite directory '/user/hive/warehouse/table' select value from test_table;
Only overwrite can be used here; into is an error!
(3) The same query result can be inserted into several tables or directories at once:
from source_table
insert overwrite local directory '/home/hadoop/data' select * 
insert overwrite directory '/user/hive/warehouse/table' select value;

4. Summary:
(1) The insert command is mainly used to export data from Hive; the destination can be HDFS or the local filesystem, and the select statement you write determines which data gets exported.
(2) overwrite vs. into:
insert overwrite/into table: both combinations work;
insert overwrite directory: works, but insert into directory does not.

Hive installation in detail (the approach is what matters); Beeline

Look ahead: most people have really only touched the same few things; whoever has more foresight can take the shortcut. Data is not about size; the key is its value.

1. Installation

Use yum to install the Hive-related packages.
The Hive-related packages are as follows:
hive – base package that provides the complete language and runtime (required)
hive-metastore – provides scripts for running the metastore as a standalone service (optional)
hive-server – provides scripts for running the original HiveServer as a standalone service (optional)
hive-server2 – provides scripts for running the new HiveServer2 as a standalone service (optional)

2. Configure MySQL as the Hive metastore database
1) Create the database
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
2) Create the user and grant privileges
mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
…
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'metastorehost';
mysql> FLUSH PRIVILEGES;
mysql> quit;

3. Configure hive-site.xml
a) Basic configuration (remote metastore mode)

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.1.52:3306/metastore</value>
  <description>the URL of the MySQL database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>
<property>
  <name>datanucleus.autoStartMechanism</name>
  <value>SchemaTable</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://192.168.1.57:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
The hive.metastore.uris setting indicates that Hive is used in the third mode (remote metastore mode).
Note: from Hive 0.10 onwards hive.metastore.local does not need to be set as long as the parameter above is configured.

4. Configure HiveServer2
Configure the following options in hive-site.xml:

 <property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>zk1.myco.com,zk2.myco.com,zk3.myco.com</value>
</property>
Note: not configuring hive.zookeeper.quorum means concurrent Hive QL requests cannot be executed and can lead to data corruption.

Enabling the Table Lock Manager without specifying a list of valid Zookeeper quorum nodes will result in unpredictable behavior. Make sure that both properties are properly configured.
5. Install ZooKeeper
Because HiveServer2's Table Lock Manager depends on ZooKeeper, ZooKeeper must be installed and started; see the article "ZooKeeper installation" for details.
Start the cluster's ZooKeeper; if ZooKeeper is not on its default port, the parameter hive.zookeeper.client.port must be set explicitly.

6. Start the services
1) Start hive-metastore
Start the metastore service:
sudo service hive-metastore start, or: hive --service metastore
After startup the default port is 9083.
2) Start hive-server2
Start hiveserver2:
sudo service hive-server2 start
3) Test
Connect to hive-server2 with the Beeline console:
/usr/bin/beeline
>!connect jdbc:hive2://localhost:10000 -n hive -p hive org.apache.hive.jdbc.HiveDriver
Run commands such as show tables to check the result.

Appendix 1: Beeline options

Usage: java org.apache.hive.cli.beeline.BeeLine 
   -u <database url>               the JDBC URL to connect to
   -n <username>                   the username to connect as
   -p <password>                   the password to connect as
   -d <driver class>               the driver class to use
   -e <query>                      query that should be executed
   -f <file>                       script file that should be executed
   --color=[true/false]            control whether color is used for display
   --showHeader=[true/false]       show column names in query results
   --headerInterval=ROWS;          the interval between which headers are displayed
   --fastConnect=[true/false]      skip building table/column list for tab-completion
Useful options:
--fastConnect=true: Building list of tables and columns for tab-completion (set fastconnect to true to skip)... (this really does work)
--isolation: set the transaction isolation level
Examples:
Running a SQL statement:
beeline -u jdbc:hive2://localhost:10000 -n hdfs -p hdfs -e "show tables"
Running a SQL script file:
beeline -u jdbc:hive2://localhost:10000 -n hdfs -p hdfs -f hiveql_test.sql

Appendix 2: differences between hive-server1 and hive-server2
JDBC differences between HiveServer1 and HiveServer2:
HiveServer version    Connection URL                 Driver Class

HiveServer2           jdbc:hive2://<host>:<port>     org.apache.hive.jdbc.HiveDriver
HiveServer1           jdbc:hive://<host>:<port>      org.apache.hadoop.hive.jdbc.HiveDriver
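
A minimal HiveServer2 client matching the table above, assuming the hive-jdbc jar is on the classpath and HiveServer2 is listening on localhost:10000 with the hive/hive credentials used in the Beeline examples:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer2JdbcClient {
    public static void main(String[] args) throws Exception {
        // HiveServer2 uses the new driver class and the jdbc:hive2:// URL scheme.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "hive");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("show tables");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}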