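Building a data analysis platform on Presto and Superset. Presto can act as the data warehouse layer: it connects to many relational databases and NoSQL stores and offers high query performance. Superset supports Presto connections, which makes data visualization and dashboard creation convenient.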
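Basic concepts
## Data warehouse
A data warehouse integrates data from multiple databases and organizes it by subject so that it is easy to analyze. It stores metadata, model information, and the data itself (with indexing, caching, partitioning, pre-aggregation, and so on). Examples include: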
- greenplum
- hive
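## OLAP
A family of data analysis operations such as pivoting, slicing, dicing, and drilling. OLAP can run against a data warehouse or even against plain file data.
- Mondrian: an open-source OLAP engine
- MOLAP: data lives in the data warehouse, stored in a multidimensional format
- ROLAP: data is stored in a relational database
- In the big data space, many SQL-on-Hadoop systems can be regarded as OLAP engines: Drill, Impala, Kylin, Phoenix, Druid, Greenplum, HAWQ, Pinot, Presto, Spark SQL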
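To make the OLAP operations named in this section concrete, here is a minimal sketch in plain Python over a tiny in-memory fact table. The rows, the dimension names (`country`, `year`, `device`), and the helper functions are all hypothetical and exist only for illustration. A real OLAP engine would run these operations against a warehouse, not a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, year, device, visits). Values are made up.
FACTS = [
    ("US", 2016, "web", 120),
    ("US", 2016, "mobile", 80),
    ("US", 2017, "web", 150),
    ("DE", 2016, "web", 40),
    ("DE", 2017, "mobile", 60),
]
# Position of each dimension inside a fact row; the measure (visits) is index 3.
DIMS = {"country": 0, "year": 1, "device": 2}


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[DIMS[dim]] == value]


def dice(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in rows
            if all(r[DIMS[d]] in vals for d, vals in allowed.items())]


def aggregate(rows, *dims):
    """Sum visits grouped by the given dimensions.

    Fewer dimensions gives a coarser view (roll-up);
    adding a dimension gives a finer view (drill-down).
    """
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[DIMS[d]] for d in dims)] += r[3]
    return dict(totals)


def pivot(rows, row_dim, col_dim):
    """Pivot: one dimension on rows, another on columns, cells hold visits."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[DIMS[row_dim]]][r[DIMS[col_dim]]] += r[3]
    return {k: dict(v) for k, v in table.items()}
```

For example, `aggregate(FACTS, "country")` gives totals per country, and `aggregate(FACTS, "country", "year")` drills down from that view to country and year.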
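## MDX
OLAP operations are usually expressed in MDX, a query language for multidimensional databases. The OLAP server translates MDX into SQL queries.
## MPP: massively parallel processing
In contrast to SQL-on-Hadoop, an MPP architecture does not depend on the Hadoop/Spark runtime; an MPP system has its own native distributed execution engine.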
Presto w/ Hive and mysql
Presto is an analytical system built on an MPP architecture. From the official introduction:
Presto is a tool designed to efficiently query vast amounts of data using distributed queries. ... Presto can be and has been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Cassandra. Presto was designed to handle data warehousing and analytics: data analysis, aggregating large amounts of data and producing reports. These workloads are often classified as Online Analytical Processing (OLAP).
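Like a data warehouse, Presto can join and analyze data across multiple data sources, including common relational databases and big data stores.
Example: http://getindata.com/tutorial-presto-combine-data-hive-mysql-one-sql-like-query/
Deployed components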
- download hadoop 2.6 (deploy hdfs)
- hive 1.2.2 (deploy metastore service)
- mysql
- deploy presto w/ catalog hive and mysql
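Presto reads each catalog from a properties file under `etc/catalog/`, and the file name becomes the catalog name used in queries. The sketch below renders the two files this setup needs. It is an illustration under stated assumptions: the hostnames, ports, user name, and password are placeholders, the helper names are hypothetical, and the Hive connector name (`hive-hadoop2` here) depends on the Presto version, so check it against the documentation for the release you deploy.

```python
from pathlib import Path

# Placeholder values: adjust hosts, ports, and credentials to your environment.
CATALOGS = {
    "hive": {
        "connector.name": "hive-hadoop2",  # assumption: varies by Presto version
        "hive.metastore.uri": "thrift://localhost:9083",
    },
    "mysql": {
        "connector.name": "mysql",
        "connection-url": "jdbc:mysql://localhost:3306",
        "connection-user": "presto_user",
        "connection-password": "change-me",
    },
}


def render_properties(props):
    """Render a dict as key=value lines in insertion order."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def write_catalogs(etc_dir, catalogs=CATALOGS):
    """Write <etc_dir>/catalog/<name>.properties for every catalog."""
    catalog_dir = Path(etc_dir) / "catalog"
    catalog_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, props in catalogs.items():
        path = catalog_dir / f"{name}.properties"
        path.write_text(render_properties(props))
        paths.append(path)
    return paths
```

Calling `write_catalogs("etc")` from the Presto installation directory would create `etc/catalog/hive.properties` and `etc/catalog/mysql.properties`. Presto needs a restart to pick up new catalog files.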
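Test data
In the example, Presto connects to MySQL and Hive at the same time. MySQL holds structured user information, and Hive holds log data. The Hive table is relatively large at 19.15 million rows, while the MySQL table has 900+ rows.
Compute the visit counts by user country, from which each country's share of traffic follows: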
SELECT u.country, COUNT(*) AS cnt FROM hive.tutorial.stream s JOIN mysql.tutorial.user u ON s.userid = u.userid GROUP BY u.country
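The query above returns a raw count per country. Getting each country's share still requires dividing by the total. A minimal sketch of that post-processing step follows, assuming the result rows arrive as `(country, cnt)` tuples from whatever client runs the query. The function name and the sample numbers in the usage note are hypothetical.

```python
def country_shares(rows):
    """Convert (country, cnt) rows into a dict of country -> percentage share."""
    rows = list(rows)
    total = sum(cnt for _, cnt in rows)
    if total == 0:
        # No visits at all: there is nothing to apportion.
        return {}
    return {country: 100.0 * cnt / total for country, cnt in rows}
```

With made-up input such as `[("US", 3), ("DE", 1)]`, the function returns `{"US": 75.0, "DE": 25.0}`.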
Superset
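An open-source BI system with a browser/server (B/S) architecture.
## Configuring Presto
presto://192.168.56.101:8080/hive/tutorial
## SQL Lab
Select Presto as the Database; queries can then join across all data sources in the Presto catalogs.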
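The connection string used in the Presto configuration follows the pattern `presto://host:port/catalog/schema`. Here is a small sketch of a helper that assembles it. The function name is hypothetical, and treating the schema as optional is an assumption about what the underlying SQLAlchemy dialect accepts.

```python
def presto_uri(host, port, catalog, schema=None):
    """Build a presto://host:port/catalog[/schema] connection string."""
    uri = f"presto://{host}:{port}/{catalog}"
    if schema:
        uri += f"/{schema}"
    return uri
```

`presto_uri("192.168.56.101", 8080, "hive", "tutorial")` reproduces the URI shown in this section. The catalog in the URI only sets the default: queries in SQL Lab can still use fully qualified names such as `mysql.tutorial.user` to reach other catalogs.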