Background
I. What is Presto
Presto runs distributed queries, so it can scan huge volumes of data quickly and efficiently. If you need to work with terabytes or petabytes of data, you will most likely rely on Hadoop and HDFS to store it. As an alternative to Hive and Pig (both of which query HDFS data through MapReduce pipelines), Presto can not only read HDFS but also talk to other data sources, including relational databases and stores such as Cassandra.
Presto is designed for data warehousing and analytics: data analysis, large-scale aggregation, and report generation. These workloads are usually classified as online analytical processing (OLAP).
Presto is an open-source project that originated at Facebook, and it is maintained and improved jointly by Facebook engineers and the open-source community.
II. Environment and prerequisites
MacBook Pro
Docker for Mac: https://docs.docker.com/docker-for-mac/#check-versions
jdk-1.8: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
hadoop-2.7.5
hive-2.3.3
presto-cli-0.198-executable.jar
III. Building the images
We use Docker to start three CentOS 7 containers and install Hadoop and Java on all three.
1. Install Docker on the MacBook and log in with a registry account.
docker login
2. Verify the installation
docker version
3. Pull the CentOS 7 image
docker pull centos
4. Build a CentOS image with SSH
mkdir ~/centos-ssh
cd ~/centos-ssh
vi Dockerfile
# Start from an existing OS image
FROM centos
# Image author
MAINTAINER crxy
# Install openssh-server and sudo, and set sshd's UsePAM to no
RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
# Install openssh-clients
RUN yum install -y openssh-clients
# Add a test user root with password root, and add it to sudoers
RUN echo "root:root" | chpasswd
RUN echo "root ALL=(ALL) ALL" >> /etc/sudoers
# These two lines are required on CentOS 6; without them sshd in the container refuses logins
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
# Start sshd and expose port 22
RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]
Build the image:
docker build -t centos-ssh .
5. Build an image with the JDK and Hadoop on top of centos-ssh
mkdir ~/hadoop
cd ~/hadoop
vi Dockerfile
FROM centos-ssh
ADD jdk-8u161-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.8.0_161 /usr/local/jdk1.8
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH
ADD hadoop-2.7.5.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.7.5 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH
The JDK and Hadoop tarballs have to sit in the hadoop directory next to the Dockerfile.
docker build -t centos-hadoop .
IV. Setting up the Hadoop cluster
1. Cluster plan
We build a three-node Hadoop cluster: one master and two workers.
Master node:   hadoop0  ip: 172.18.0.2
Worker node 1: hadoop1  ip: 172.18.0.3
Worker node 2: hadoop2  ip: 172.18.0.4
However, a Docker container gets a different IP every time it restarts, so we need to give the containers fixed IPs.
After installation, Docker creates the following three network types by default:
docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
569f368d1561        none                null                local
When starting a container, the --network flag selects the network type, for example:
docker run -itd --name test1 --network bridge --ip 172.17.0.10 centos:latest /bin/bash
bridge: bridged network
Containers started without any network option use bridge, the bridged network created when Docker is installed. On every restart a container is assigned the next free IP in order, so its IP address changes across restarts.
none: no network
With --network=none, the container is not assigned an IP on the local network at all.
host: host network
With --network=host, the container shares the host's network stack, so the two are directly interconnected. For example, a web service listening on port 8080 inside the container is reachable on the host's port 8080 without any explicit port mapping.
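As a quick illustration (a sketch only, on a Linux Docker host; the nginx image and port 80 are just an example and are not used anywhere else in this walkthrough):
docker run -d --name web-host --network host nginx
# nginx listens on port 80 inside the container; with host networking it is
# reachable at http://<host-ip>:80 directly, without any -p port mapping.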
Creating a custom network (to assign fixed IPs)
The default networks do not support assigning a fixed IP when starting a container:
docker run -itd --net bridge --ip 172.17.0.10 centos:latest /bin/bash
6eb1f228cf308d1c60db30093c126acbfd0cb21d76cb448c678bab0f1a7c0df6
docker: Error response from daemon: User specified IP address is supported on user defined networks only.
So we create a custom network. The steps are:
Step 1: create a custom network
Create a custom network and give it the subnet 172.18.0.0/16:
docker network create --subnet=172.18.0.0/16 mynetwork
docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
620ebbc09400        mynetwork           bridge              local
569f368d1561        none                null                local
Step 2: create the Docker containers. Start three containers to act as hadoop0, hadoop1 and hadoop2:
docker run --name hadoop0 --hostname hadoop0 --net mynetwork --ip 172.18.0.2 -d -P -p 50070:50070 -p 8088:8088 centos-hadoop
docker run --name hadoop1 --hostname hadoop1 --net mynetwork --ip 172.18.0.3 -d -P centos-hadoop
docker run --name hadoop2 --hostname hadoop2 --net mynetwork --ip 172.18.0.4 -d -P centos-hadoop
Use docker ps to check the three containers that were just started:
5e0028ed6da0   hadoop   "/usr/sbin/sshd -D"   16 hours ago   Up 3 hours   0.0.0.0:32771->22/tcp                                                     hadoop2
35211872eb20   hadoop   "/usr/sbin/sshd -D"   16 hours ago   Up 4 hours   0.0.0.0:32769->22/tcp                                                     hadoop1
0f63a870ef2b   hadoop   "/usr/sbin/sshd -D"   16 hours ago   Up 5 hours   0.0.0.0:8088->8088/tcp, 0.0.0.0:50070->50070/tcp, 0.0.0.0:32768->22/tcp   hadoop0
The three machines now have fixed IP addresses. To verify, ping each of the three IPs; if they all respond, everything is in order.
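For example, a quick check from inside hadoop0 (assuming ping is present in the image; if not, install it with yum install -y iputils first):
docker exec -it hadoop0 /bin/bash
ping -c 3 172.18.0.3   # hadoop1
ping -c 3 172.18.0.4   # hadoop2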
V. Configuring the Hadoop cluster
1. First attach to hadoop0 with:
docker exec -it hadoop0 /bin/bash
The steps below configure the Hadoop cluster.
1: Map hostnames to IPs. On all three containers edit the hosts file: vi /etc/hosts
and add the following entries:
172.18.0.2 hadoop0
172.18.0.3 hadoop1
172.18.0.4 hadoop2
2: Set up passwordless SSH login
On hadoop0 run:
cd ~
mkdir .ssh
cd .ssh
ssh-keygen -t rsa    (press Enter through every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop0
ssh-copy-id -i hadoop1
ssh-copy-id -i hadoop2
On hadoop1 run:
cd ~
cd .ssh
ssh-keygen -t rsa    (press Enter through every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop1
On hadoop2 run:
cd ~
cd .ssh
ssh-keygen -t rsa    (press Enter through every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop2
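To confirm that passwordless login works, a quick check from hadoop0 might look like this:
ssh hadoop1 hostname   # should print "hadoop1" without asking for a password
ssh hadoop2 hostname   # should print "hadoop2"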
3: Edit the Hadoop configuration files on hadoop0
Go to the /usr/local/hadoop/etc/hadoop directory
and modify hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml as follows.
(1) hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8
(2) core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop0:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
</configuration>
(3) hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
(4) yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
(5) Rename the template file: mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
(6) Format the NameNode
Go to the /usr/local/hadoop directory
and run the format command:
bin/hdfs namenode -format
Note: this fails if the which command is missing; install it first with:
yum install -y which
The format operation must not be run repeatedly. If you really do need to re-format, add the -force flag.
(7) Start Hadoop in pseudo-distributed mode
Command: sbin/start-all.sh
The first start asks you to confirm with yes. Use jps to check whether the processes came up; the pseudo-distributed setup is running if you see the following processes:
3267 SecondaryNameNode
3003 NameNode
3664 Jps
3397 ResourceManager
3090 DataNode
3487 NodeManager
(8) Stop pseudo-distributed Hadoop
Command: sbin/stop-all.sh
(9) Specify the ResourceManager host by editing yarn-site.xml:
<property>
  <description>The hostname of the RM.</description>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop0</value>
</property>
(10) Edit etc/hadoop/slaves in the Hadoop directory on hadoop0
Delete its existing contents and replace them with:
hadoop1
hadoop2
(11) On hadoop0, run:
scp -rq /usr/local/hadoop hadoop1:/usr/local
scp -rq /usr/local/hadoop hadoop2:/usr/local
(12) Start the distributed Hadoop cluster
Run sbin/start-all.sh
Note: this will fail at first because the two worker nodes are missing the which command; install it
by running the following on both worker nodes:
yum install -y which
Then start the cluster again (stop it first if it is already running).
(13) Verify the cluster
First check the processes.
hadoop0 should have these processes:
4643 Jps
4073 NameNode
4216 SecondaryNameNode
4381 ResourceManager
hadoop1 should have these processes:
715 NodeManager
849 Jps
645 DataNode
hadoop2 should have these processes:
456 NodeManager
589 Jps
388 DataNode
Verify the cluster with a test job.
Create a local file:
vi a.txt
hello you
hello me
Upload a.txt to HDFS:
hdfs dfs -put a.txt /
Run the wordcount example:
cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.7.5.jar wordcount /a.txt /out
Check the job output.
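For example (the part-r-00000 file name assumes the default single reducer):
hdfs dfs -ls /out
hdfs dfs -cat /out/part-r-00000
# Expected counts for a.txt:
# hello  2
# me     1
# you    1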
If this works, the cluster is healthy.
Accessing the cluster web UIs from a browser:
Because ports 50070 and 8088 were mapped to the host when the hadoop0 container was started,
the Hadoop web UIs inside the container can be reached directly from the host.
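For example, from the MacBook the two UIs should be reachable at:
open http://localhost:50070   # HDFS NameNode web UI
open http://localhost:8088    # YARN ResourceManager web UI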
VI. Installing Hive
We query the data stored in Hive through Presto's Hive connector, so Hive has to be installed first.
1. Download Hive locally and copy it onto hadoop0:
docker cp ~/Download/apache-hive-2.3.3-bin.tar.gz <container ID>:/
2. Extract it and move it into place:
tar -zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin /usr/local/hive
cd /usr/local/hive
3. Configure /etc/profile by adding the following lines:
export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH
source /etc/profile
4. Install the MySQL database
We run MySQL in a Docker container as well. First pull the mysql image:
docker pull mysql
Start the mysql container:
docker run --name mysql -e MYSQL_ROOT_PASSWORD=111111 --net mynetwork --ip 172.18.0.5 -d mysql
Log in to the mysql container.
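For example, with the official mysql image a shell inside the database can be opened like this:
docker exec -it mysql mysql -uroot -p111111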
5. Create the metastore database and grant privileges on it
create database metastore;
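The same can be done from the host in one step; note that the official mysql image already creates root@'%' with full privileges (the MYSQL_ROOT_PASSWORD account), so an extra GRANT is normally not needed for connections coming from hadoop0:
docker exec -i mysql mysql -uroot -p111111 -e "CREATE DATABASE IF NOT EXISTS metastore;"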
6. Download the JDBC connector
After downloading, extract the archive and copy the mysql-connector-java-5.1.41-bin.jar file into the $HIVE_HOME/lib directory.
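A sketch of the download-and-copy step, run on hadoop0 (the exact download URL is an assumption; any MySQL Connector/J 5.1.x archive works, and curl or wget must be installed in the container):
curl -L -O https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.41.tar.gz
tar -zxvf mysql-connector-java-5.1.41.tar.gz
cp mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar $HIVE_HOME/lib/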
7. Edit the Hive configuration files
cd /usr/local/hive/conf
7.1 Copy the template files and rename them:
cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties
7.2 Edit hive-env.sh:
export JAVA_HOME=/usr/local/jdk1.8          ## Java installation path
export HADOOP_HOME=/usr/local/hadoop        ## Hadoop installation path
export HIVE_HOME=/usr/local/hive            ## Hive installation path
export HIVE_CONF_DIR=/usr/local/hive/conf   ## Hive configuration directory
7.3 Create the following directories in HDFS and open up their permissions:
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -mkdir -p /user/hive/log
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/tmp
hdfs dfs -chmod -R 777 /user/hive/log
7.4 Edit hive-site.xml:
<property>
  <name>hive.exec.scratchdir</name>
  <value>/user/hive/tmp</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/user/hive/log</value>
</property>
<!-- MySQL connection settings -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://172.18.0.5:3306/metastore?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>111111</value>
</property>
7.5 Create the tmp directory
mkdir -p /home/hadoop/hive/tmp
Then in hive-site.xml:
replace every occurrence of ${system:java.io.tmpdir} with /home/hadoop/hive/tmp/
replace every occurrence of ${system:user.name} with ${user.name}
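Since hive-site.xml contains many occurrences, a sed one-liner per substitution saves time (a sketch, assuming GNU sed in the container; back the file up first):
cd /usr/local/hive/conf
cp hive-site.xml hive-site.xml.bak
sed -i 's#${system:java.io.tmpdir}#/home/hadoop/hive/tmp#g' hive-site.xml
sed -i 's#${system:user.name}#${user.name}#g' hive-site.xml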
8. Initialize the Hive metastore schema
schematool -dbType mysql -initSchema
9. Start Hive
hive
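A quick sanity check once the CLI starts:
hive -e "show databases;"   # should at least list the default database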
10. Create tables in Hive
Create a file named create_table with the following statements:
CREATE TABLE IF NOT EXISTS `default`.`d_abstract_event` ( `id` BIGINT, `network_id` BIGINT, `name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:49:25' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_bumper` ( `front_bumper_id` BIGINT, `end_bumper_id` BIGINT, `content_item_type` STRING, `content_item_id` BIGINT, `content_item_name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:05' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tracking` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `creative_id` BIGINT, `creative_name` STRING, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `placement_id` BIGINT, `placement_name` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_status` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `agency_id` BIGINT, `agency_name` STRING, `status` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_frequency_cap` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `frequency_cap` INT, `frequency_period` INT, `frequency_cap_type` STRING, `frequency_cap_scope` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_skippable` ( `id` BIGINT, `skippable` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `internal_id` STRING, `staging_internal_id` STRING, `budget_exempt` INT, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `ad_unit_type` STRING, `ad_unit_size` STRING, `placement_id` BIGINT, `placement_name` STRING, `placement_internal_id` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `io_internal_id` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_internal_id` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `advertiser_internal_id` STRING, `agency_id` BIGINT, `agency_name` STRING, `agency_internal_id` STRING, `price_model` STRING, `price_type` STRING, `ad_unit_price` DECIMAL(16,2), `status` STRING, `companion_ad_package_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_staging` ( `ad_tree_node_id` BIGINT, `adapter_status` STRING, `primary_ad_tree_node_id` BIGINT, `production_ad_tree_node_id` BIGINT, `hide` INT, `ignore` INT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_trait` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `trait_type` STRING, `parameter` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit_ad_slot_assignment` ( `id` BIGINT, `ad_unit_id` BIGINT, `ad_slot_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit` ( `id` BIGINT, `name` STRING, `ad_unit_type` STRING, `height` INT, `width` INT, `size` STRING, `network_id` BIGINT, `created_type` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_advertiser` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `agency_id` BIGINT, `agency_name` STRING, `advertiser_company_id` BIGINT, `agency_company_id` BIGINT, `billing_contact_company_id` BIGINT, `address_1` STRING, `address_2` STRING, `address_3` STRING, `city` STRING, `state_region_id` BIGINT, `country_id` BIGINT, `postal_code` STRING, `email` STRING, `phone` STRING, `fax` STRING, `url` STRING, `notes` STRING, `billing_term` STRING, `meta_data` STRING, `internal_id` STRING, `active` INT, `budgeted_imp` BIGINT, `num_of_campaigns` BIGINT, `adv_category_name_list` STRING, `adv_category_id_name_list` STRING, `updated_at` TIMESTAMP, `created_at` TIMESTAMP) COMMENT 'Imported by sqoop on 2017/06/27 09:31:22' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
cat create_table | hive
11. Start the metastore service
Presto talks to Hive's metastore service, so start it in the background:
nohup hive --service metastore &
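The metastore listens on port 9083 by default; a quick check that it is up (assuming net-tools is installed, e.g. via yum install -y net-tools):
netstat -nlt | grep 9083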
That completes the Hive installation.
VII. Installing Presto
1. Download presto-server-0.198.tar.gz
2. Extract it and create the configuration directory:
tar -zxvf presto-server-0.198.tar.gz
cd presto-server-0.198
mkdir etc
cd etc
3. Edit the configuration files:
etc/node.properties
node.environment=production
node.id=ffffffff-0000-0000-0000-ffffffffffff
node.data-dir=/opt/presto/data/discovery/
etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://hadoop0:8080
Catalog configuration:
etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hadoop0:9083
hive.config.resources=/usr/local/hadoop/etc/hadoop/core-site.xml,/usr/local/hadoop/etc/hadoop/hdfs-site.xml
4. Start the Presto server
./bin/launcher start
5. Download presto-cli-0.198-executable.jar, rename it to presto, make it executable with chmod +x, then run it:
./presto --server localhost:8080 --catalog hive --schema default
That completes the whole setup. Time to try it out: use show tables to list the tables we created in Hive.
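For example, statements can also be run non-interactively through the CLI's --execute flag (table names depend on which of the CREATE TABLE statements above were actually run):
./presto --server localhost:8080 --catalog hive --schema default --execute "show tables;"
./presto --server localhost:8080 --catalog hive --schema default --execute "select count(*) from d_ad_unit;"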
References:
https://blog.csdn.net/xu470438000/article/details/50512442
http://www.jb51.net/article/118396.htm
https://prestodb.io/docs/current/installation/cli.html