Docker + Hadoop + Hive + Presto: Deploying a Hadoop Environment and Presto with Docker

 

Background

I. What is Presto?

By distributing queries across a cluster, Presto can query massive data sets quickly and efficiently. If you need to process terabytes or petabytes of data, you are probably already relying on Hadoop and HDFS to store and process it. As an alternative to Hive and Pig (both of which query data in HDFS through MapReduce pipelines), Presto can not only access HDFS but also work with other data sources, including relational databases and systems such as Cassandra.

Presto is designed as a data warehousing and analytics tool: data analysis, large-scale data aggregation, and report generation. These workloads are generally classified as online analytical processing (OLAP).

Presto is an open-source project that originated at Facebook, and it is maintained and improved jointly by Facebook engineers and the open-source community.

 

II. Environment and Software

  • Environment

  MacBook Pro

  • Applications

  Docker for Mac: https://docs.docker.com/docker-for-mac/#check-versions

  jdk-1.8: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

  hadoop-2.7.5

  hive-2.3.3

  presto-server-0.198.tar.gz

  presto-cli-0.198-executable.jar

 

III. Building the Images

We use Docker to start three CentOS 7 containers and install Java and Hadoop on each of them.

1. Install Docker on the MacBook and log in with your registry account.

docker login

2. Verify the installation

docker version

3. Pull the CentOS 7 image

docker pull centos

4. Build a CentOS image with SSH support

mkdir ~/centos-ssh
cd centos-ssh
vi Dockerfile
# Use an existing OS image as the base
FROM centos 

# Image author
MAINTAINER crxy 

# Install openssh-server and sudo, and set sshd's UsePAM option to no
RUN yum install -y openssh-server sudo  
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config  
# Install openssh-clients
RUN yum  install -y openssh-clients

# Set the test user root's password to root and add it to sudoers
RUN echo "root:root" | chpasswd  
RUN echo "root   ALL=(ALL)       ALL" >> /etc/sudoers  
# The next two lines are required on CentOS 6; without them sshd in the resulting container will refuse logins
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key  
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key  

# Start the sshd service and expose port 22
RUN mkdir /var/run/sshd  
EXPOSE 22  
CMD ["/usr/sbin/sshd", "-D"]

Build the image:

docker build -t="centos-ssh" .
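
Before moving on, it can help to confirm that sshd in the new image actually accepts logins. A minimal sketch (the container name ssh-test and host port 2222 are arbitrary choices; the root/root password comes from the Dockerfile above):

# start a throwaway container from the new image, mapping its port 22 to host port 2222
docker run -d --name ssh-test -p 2222:22 centos-ssh

# log in as root (password: root, as set in the Dockerfile)
ssh root@localhost -p 2222

# remove the test container afterwards
docker rm -f ssh-test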

5. Build an image with the JDK and Hadoop on top of centos-ssh

mkdir ~/hadoop
cd hadoop
vi Dockerfile
FROM centos-ssh
ADD jdk-8u161-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.8.0_161 /usr/local/jdk1.8
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH

ADD hadoop-2.7.5.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.7.5 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH

The JDK and Hadoop tarballs must be placed in the ~/hadoop directory (the build context).

docker build -t="centos-hadoop" .
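
Optionally, a quick smoke test of the new image (a sketch; the commands rely only on the ENV settings in the Dockerfile above):

# check that the JDK and Hadoop are on the PATH inside the image
docker run --rm centos-hadoop java -version
docker run --rm centos-hadoop hadoop version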

 

IV. Setting Up the Hadoop Cluster

1. Cluster layout

We build a three-node Hadoop cluster with one master and two slaves:

Master node:  hadoop0  ip: 172.18.0.2
Slave node 1: hadoop1  ip: 172.18.0.3
Slave node 2: hadoop2  ip: 172.18.0.4

However, a container's IP address changes every time it is restarted, so we need to assign fixed IPs.

After installation, Docker creates the following three networks by default:

docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
569f368d1561        none                null                local

When starting a container, you can choose the network type with the --network flag, for example:

~ docker run -itd --name test1 --network bridge --ip 172.17.0.10 centos:latest /bin/bash

bridge: bridged network

By default, containers use the bridge network that Docker creates at installation time. On each restart a container is assigned the next available IP address in order, which means its IP can change after a restart.

none: no network

With --network=none, the container is not assigned a LAN IP address.

host: host network

With --network=host, the container shares the host's network stack, so the container and the host can reach each other directly.

For example, a web service listening on port 8080 inside the container is reachable on the host's port 8080 without any explicit port mapping.

Creating a custom network (to assign fixed IPs)

The default networks do not support assigning a fixed IP when starting a container:

~ docker run -itd --net bridge --ip 172.17.0.10 centos:latest /bin/bash
6eb1f228cf308d1c60db30093c126acbfd0cb21d76cb448c678bab0f1a7c0df6
docker: Error response from daemon: User specified IP address is supported on user defined networks only.

Therefore we need to create a user-defined network. The steps are as follows:

Step 1: Create a custom network

Create a custom network and specify the subnet 172.18.0.0/16:

➜ ~ docker network create --subnet=172.18.0.0/16 mynetwork
➜ ~ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
620ebbc09400        mynetwork           bridge              local
569f368d1561        none                null                local

Step 2: Create the containers. Start three containers, to serve as hadoop0, hadoop1, and hadoop2:

➜  ~ docker run --name hadoop0 --hostname hadoop0 --net mynetwork --ip 172.18.0.2 -d -P -p 50070:50070 -p 8088:8088  centos-hadoop
➜  ~ docker run --name hadoop1 --hostname hadoop1 --net mynetwork --ip 172.18.0.3 -d -P centos-hadoop
➜  ~ docker run --name hadoop2 --hostname hadoop2 --net mynetwork --ip 172.18.0.4 -d -P centos-hadoop

Use docker ps to check the three containers we just started:

5e0028ed6da0        hadoop              "/usr/sbin/sshd -D"      16 hours ago        Up 3 hours          0.0.0.0:32771->22/tcp                                                     hadoop2
35211872eb20        hadoop              "/usr/sbin/sshd -D"      16 hours ago        Up 4 hours          0.0.0.0:32769->22/tcp                                                     hadoop1
0f63a870ef2b        hadoop              "/usr/sbin/sshd -D"      16 hours ago        Up 5 hours          0.0.0.0:8088->8088/tcp, 0.0.0.0:50070->50070/tcp, 0.0.0.0:32768->22/tcp   hadoop0

Now the three machines have fixed IP addresses. To verify, ping each of the three IPs; if they all respond, everything is fine, for example:
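
A sketch of the check, run from hadoop0 (if ping is not present in the CentOS image, install iputils first, as shown):

# from hadoop0, ping the other two containers
docker exec -it hadoop0 yum install -y iputils
docker exec -it hadoop0 ping -c 2 172.18.0.3
docker exec -it hadoop0 ping -c 2 172.18.0.4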

 

V. Configuring the Hadoop Cluster

1. First connect to hadoop0 with:

docker exec -it hadoop0 /bin/bash

The steps below configure the Hadoop cluster.
1: Map hostnames to IPs. On all three containers, run vi /etc/hosts
and add the following entries:

172.18.0.2    hadoop0
172.18.0.3    hadoop1
172.18.0.4    hadoop2

2: Set up passwordless SSH.
On hadoop0, run:

cd  ~
mkdir .ssh
cd .ssh
ssh-keygen -t rsa        # press Enter at every prompt
ssh-copy-id -i localhost
ssh-copy-id -i hadoop0
ssh-copy-id -i hadoop1
ssh-copy-id -i hadoop2

On hadoop1, run:

cd  ~
cd .ssh
ssh-keygen -t rsa        # press Enter at every prompt
ssh-copy-id -i localhost
ssh-copy-id -i hadoop1

On hadoop2, run:

cd  ~
cd .ssh
ssh-keygen -t rsa        # press Enter at every prompt
ssh-copy-id -i localhost
ssh-copy-id -i hadoop2
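
A quick way to confirm that passwordless SSH works, run from hadoop0 (the hostnames come from the /etc/hosts entries above):

# each command should print the remote hostname without asking for a password
ssh hadoop1 hostname
ssh hadoop2 hostname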

3: Edit the Hadoop configuration files on hadoop0.
Go to the /usr/local/hadoop/etc/hadoop directory
and modify hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml.
(1) hadoop-env.sh

export JAVA_HOME=/usr/local/jdk1.8

(2)core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop0:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/usr/local/hadoop/tmp</value>
        </property>
         <property>
                 <name>fs.trash.interval</name>
                 <value>1440</value>
        </property>
</configuration>

(3)hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>

(4)yarn-site.xml

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property> 
                <name>yarn.log-aggregation-enable</name> 
                <value>true</value> 
        </property>
</configuration>

(5) Rename the file: mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

(6) Format the NameNode
Go to the /usr/local/hadoop directory
and run the format command:

bin/hdfs namenode -format

Note: this command may fail because the which command is missing; install it first:

yum install -y which

The format operation must not be repeated. If you really need to reformat, add the -force flag.

(7) Start Hadoop in pseudo-distributed mode

Command: sbin/start-all.sh

The first start asks you to type yes to confirm. Then use jps to check whether the processes started correctly; seeing the following processes means the pseudo-distributed start succeeded:

3267 SecondaryNameNode
3003 NameNode
3664 Jps
3397 ResourceManager
3090 DataNode
3487 NodeManager

(8) Stop pseudo-distributed Hadoop

Command: sbin/stop-all.sh

(9) Specify the ResourceManager address (which the NodeManagers connect to) by editing yarn-site.xml:

<property>
    <description>The hostname of the RM.</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop0</value>
  </property>

(10) Edit the Hadoop configuration file etc/hadoop/slaves on hadoop0.
Delete all existing content and replace it with:

hadoop1
hadoop2

(11) On hadoop0, copy the Hadoop directory to the two slave nodes:

scp  -rq /usr/local/hadoop   hadoop1:/usr/local
scp  -rq /usr/local/hadoop   hadoop2:/usr/local

(12) Start the distributed Hadoop cluster

Run sbin/start-all.sh

Note: this may fail because the two slave nodes are missing the which command; install it first.

Run the following command on each of the two slave nodes:

yum install -y which

Then start the cluster again (if it is already running, stop it first).

(13) Verify that the cluster is healthy
First check the processes:

hadoop0 should have the following processes:

4643 Jps
4073 NameNode
4216 SecondaryNameNode
4381 ResourceManager

hadoop1 should have the following processes:

715 NodeManager
849 Jps
645 DataNode

hadoop2 should have the following processes:

456 NodeManager
589 Jps
388 DataNode

Next verify the cluster with a test job.
Create a local file:

vi a.txt
hello you
hello me

Upload a.txt to HDFS:

hdfs dfs -put a.txt /

Run the wordcount example:

cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.7.5.jar wordcount /a.txt /out

Check the program output, for example:
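
A sketch of reading the result directly from HDFS (part-r-00000 is the default reducer output file name):

hdfs dfs -cat /out/part-r-00000
# expected word counts for a.txt:
# hello   2
# me      1
# you     1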

If the word counts appear, the cluster is working correctly.

Access the cluster's web UIs from a browser.
Because ports 50070 and 8088 were mapped to the corresponding host ports when the hadoop0 container was started,

the Hadoop web UIs in the containers can be reached directly from the host.

 

VI. Installing Hive

We will use Presto's Hive connector to query the data in Hive, so Hive must be installed first.

1. Download Hive locally and copy it into hadoop0 with:

docker cp ~/Download/apache-hive-2.3.3-bin.tar.gz <container ID>:/

2. Extract it to the target directory:

tar -zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin /hive
cd /hive

3. Configure /etc/profile: add the following lines, then source it

export HIVE_HOME=/hive
export PATH=$HIVE_HOME/bin:$PATH
source /etc/profile

4. Install the MySQL database

We run MySQL in a Docker container as well. First pull the mysql image:

docker pull mysql

Start the mysql container:

docker run --name mysql -e MYSQL_ROOT_PASSWORD=111111 --net mynetwork --ip 172.18.0.5 -d mysql

Log in to the mysql container.
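
The login command itself is not shown in the original; a minimal sketch, assuming the container name mysql and the root password set above:

# open a MySQL shell inside the container (root password 111111, as set above)
docker exec -it mysql mysql -uroot -p111111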

5. Create the metastore database and grant privileges on it

create database metastore;

6. Download the JDBC connector

Download: Connector/J 5.1.43

After downloading, extract it and copy the mysql-connector-java-5.1.41-bin.jar file (the exact file name depends on the version you downloaded) into the $HIVE_HOME/lib directory.
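
A sketch of getting the jar into the container and into Hive's lib directory (the jar file name, the local download path, and the /hive install location follow the assumptions made above):

# on the Mac: copy the extracted connector jar into the hadoop0 container
docker cp ~/Download/mysql-connector-java-5.1.41-bin.jar hadoop0:/tmp/

# inside hadoop0: move it into Hive's lib directory ($HIVE_HOME/lib, i.e. /hive/lib here)
mv /tmp/mysql-connector-java-5.1.41-bin.jar /hive/lib/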

7. Edit the Hive configuration files

cd /hive/conf

7.1 Copy the template files and rename them

cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties

7.2 Edit hive-env.sh

export JAVA_HOME=/usr/local/jdk1.8      ## Java installation path
export HADOOP_HOME=/usr/local/hadoop    ## Hadoop installation path
export HIVE_HOME=/hive                  ## Hive installation path
export HIVE_CONF_DIR=/hive/conf         ## Hive configuration directory

7.3 Create the following directories in HDFS and grant permissions on them

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -mkdir -p /user/hive/log
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/tmp
hdfs dfs -chmod -R 777 /user/hive/log

7.4 Edit hive-site.xml

<property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>
</property>
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
</property>

<!-- MySQL connection settings -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://172.18.0.5:3306/metastore?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>111111</value>
  </property>

7.5 Create the tmp directory

mkdir /home/hadoop/hive/tmp

Then make the following substitutions in hive-site.xml, either by hand or with sed as sketched below:

replace every ${system:java.io.tmpdir} with /home/hadoop/hive/tmp/

replace every ${system:user.name} with ${user.name}
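
A sketch of the substitutions with sed (run in /hive/conf; back up hive-site.xml first):

cd /hive/conf
cp hive-site.xml hive-site.xml.bak
# replace ${system:java.io.tmpdir} with /home/hadoop/hive/tmp and ${system:user.name} with ${user.name}
sed -i 's#${system:java.io.tmpdir}#/home/hadoop/hive/tmp#g' hive-site.xml
sed -i 's#${system:user.name}#${user.name}#g' hive-site.xml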

8. Initialize the Hive metastore schema

schematool -dbType mysql -initSchema

9. Start Hive

hive

10. Create tables in Hive

Create a file named create_table containing the DDL below and pipe it into Hive:

CREATE TABLE IF NOT EXISTS `default`.`d_abstract_event` ( `id` BIGINT, `network_id` BIGINT, `name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:49:25' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_bumper` ( `front_bumper_id` BIGINT, `end_bumper_id` BIGINT, `content_item_type` STRING, `content_item_id` BIGINT, `content_item_name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:05' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tracking` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `creative_id` BIGINT, `creative_name` STRING, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `placement_id` BIGINT, `placement_name` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_status` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `agency_id` BIGINT, `agency_name` STRING, `status` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_frequency_cap` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `frequency_cap` INT, `frequency_period` INT, `frequency_cap_type` STRING, `frequency_cap_scope` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_skippable` ( `id` BIGINT, `skippable` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `internal_id` STRING, `staging_internal_id` STRING, `budget_exempt` INT, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `ad_unit_type` STRING, `ad_unit_size` STRING, `placement_id` BIGINT, `placement_name` STRING, `placement_internal_id` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `io_internal_id` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_internal_id` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `advertiser_internal_id` STRING, `agency_id` BIGINT, `agency_name` STRING, `agency_internal_id` STRING, `price_model` STRING, `price_type` STRING, `ad_unit_price` DECIMAL(16,2), `status` STRING, `companion_ad_package_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_staging` ( `ad_tree_node_id` BIGINT, `adapter_status` STRING, `primary_ad_tree_node_id` BIGINT, `production_ad_tree_node_id` BIGINT, `hide` INT, `ignore` INT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_trait` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `trait_type` STRING, `parameter` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit_ad_slot_assignment` ( `id` BIGINT, `ad_unit_id` BIGINT, `ad_slot_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit` ( `id` BIGINT, `name` STRING, `ad_unit_type` STRING, `height` INT, `width` INT, `size` STRING, `network_id` BIGINT, `created_type` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_advertiser` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `agency_id` BIGINT, `agency_name` STRING, `advertiser_company_id` BIGINT, `agency_company_id` BIGINT, `billing_contact_company_id` BIGINT, `address_1` STRING, `address_2` STRING, `address_3` STRING, `city` STRING, `state_region_id` BIGINT, `country_id` BIGINT, `postal_code` STRING, `email` STRING, `phone` STRING, `fax` STRING, `url` STRING, `notes` STRING, `billing_term` STRING, `meta_data` STRING, `internal_id` STRING, `active` INT, `budgeted_imp` BIGINT, `num_of_campaigns` BIGINT, `adv_category_name_list` STRING, `adv_category_id_name_list` STRING, `updated_at` TIMESTAMP, `created_at` TIMESTAMP) COMMENT 'Imported by sqoop on 2017/06/27 09:31:22' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
cat create_table | hive
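
To confirm that the DDL ran, the tables can be listed non-interactively:

# should print d_abstract_event, d_ad_bumper, d_ad_tracking, ...
hive -e 'show tables;'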

11. Start the metastore service

Presto needs Hive's metastore service:

nohup hive --service metastore &

This completes the Hive installation.

 

VII. Installing Presto

1. Download presto-server-0.198.tar.gz

2. Extract it and create the etc directory

tar -zxvf presto-server-0.198.tar.gz
cd presto-server-0.198
mkdir etc
cd etc

3. Edit the configuration files:

Node Properties 

etc/node.properties

node.environment=production
node.id=ffffffff-0000-0000-0000-ffffffffffff
node.data-dir=/opt/presto/data/discovery/

JVM Config

etc/jvm.config

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Config Properties

etc/config.properties

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://hadoop0:8080

Catalog configuration:

etc/catalog/hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://hadoop0:9083
hive.config.resources=/usr/local/hadoop/etc/hadoop/core-site.xml,/usr/local/hadoop/etc/hadoop/hdfs-site.xml

4. Start the Presto server

./bin/launcher start

5. Download presto-cli-0.198-executable.jar, rename it to presto, make it executable with chmod +x, then run it:

./presto --server localhost:8080 --catalog hive --schema default

That completes the whole setup. Let's take a look at the result: run show tables to see the tables we created in Hive.
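
A short example session against the hive catalog configured above (the d_ad_unit table comes from the DDL in section VI; the actual output depends on what data has been loaded):

presto:default> show tables;
presto:default> describe d_ad_unit;
presto:default> select count(*) from d_ad_unit;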

 

References:

https://blog.csdn.net/xu470438000/article/details/50512442

http://www.jb51.net/article/118396.htm

https://prestodb.io/docs/current/installation/cli.html
