數據質量模塊是大數據平臺中必不可少的一個功能組件,Apache Griffin(如下簡稱Griffin)是一個開源的大數據數據質量解決方案,它支持批處理和流模式兩種數據質量檢測方式,能夠從不一樣維度(好比離線任務執行完畢後檢查源端和目標端的數據數量是否一致、源表的數據空值數量等)度量數據資產,從而提高數據的準確度、可信度。java
在Griffin的架構中,主要分爲Define、Measure和Analyze三個部分,以下圖所示:node
各部分的職責以下:mysql
基於以上功能,咱們大數據平臺計劃引入Griffin做爲數據質量解決方案,實現數據一致性檢查、空值統計等功能。如下是安裝步驟總結:git
初始化操做具體請參考Apache Griffin Deployment Guide,因爲個人測試環境中Hadoop集羣、Hive集羣已搭好,故這裏省略Hadoop、Hive安裝步驟,只保留拷貝配置文件、配置Hadoop配置文件目錄步驟。github
一、MySQL:spring
在MySQL中建立數據庫quartz,而後執行Init_quartz_mysql_innodb.sql腳本初始化表信息:sql
mysql -u <username> -p <password> < Init_quartz_mysql_innodb.sql
二、Hadoop和Hive:數據庫
從Hadoop服務器拷貝配置文件到Livy服務器上,這裏假設將配置文件放在/usr/data/conf目錄下。express
在Hadoop服務器上建立/home/spark_conf目錄,並將Hive的配置文件hive-site.xml上傳到該目錄下:apache
#建立/home/spark_conf目錄 hadoop fs -mkdir -p /home/spark_conf #上傳hive-site.xml hadoop fs -put hive-site.xml /home/spark_conf/
三、設置環境變量:
#!/bin/bash export JAVA_HOME=/data/jdk1.8.0_192 #spark目錄 export SPARK_HOME=/usr/data/spark-2.1.1-bin-2.6.3 #livy命令目錄 export LIVY_HOME=/usr/data/livy/bin #hadoop配置文件目錄 export HADOOP_CONF_DIR=/usr/data/conf
四、Livy配置:
更新livy/conf下的livy.conf配置文件:
livy.server.host = 127.0.0.1 livy.spark.master = yarn livy.spark.deployMode = cluster livy.repl.enable-hive-context = true
啓動livy:
livy-server start
五、Elasticsearch配置:
在ES裏建立griffin索引:
curl -XPUT http://es:9200/griffin -d ' { "aliases": {}, "mappings": { "accuracy": { "properties": { "name": { "fields": { "keyword": { "ignore_above": 256, "type": "keyword" } }, "type": "text" }, "tmst": { "type": "date" } } } }, "settings": { "index": { "number_of_replicas": "2", "number_of_shards": "5" } } } '
在這裏我使用源碼編譯打包的方式來部署Griffin,Griffin的源碼地址是:https://github.com/apache/griffin.git,這裏我使用的源碼tag是griffin-0.4.0,下載完成在idea中導入並展開源碼的結構圖以下:
Griffin的源碼結構很清晰,主要包括griffin-doc、measure、service和ui四個模塊,其中griffin-doc負責存放Griffin的文檔,measure負責與spark交互,執行統計任務,service使用spring boot做爲服務實現,負責給ui模塊提供交互所需的restful api,保存統計任務,展現統計結果。
源碼導入構建完畢後,須要修改配置文件,具體修改的配置文件以下:
一、service/src/main/resources/application.properties:
# Apache Griffin應用名稱 spring.application.name=griffin_service # MySQL數據庫配置信息 spring.datasource.url=jdbc:mysql://10.104.20.126:3306/griffin_quartz?useSSL=false spring.datasource.username=xnuser spring.datasource.password=Xn20!@n0oLk spring.jpa.generate-ddl=true spring.datasource.driver-class-name=com.mysql.jdbc.Driver spring.jpa.show-sql=true # Hive metastore配置信息 hive.metastore.uris=thrift://namenodetest01.bi:9083 hive.metastore.dbname=default hive.hmshandler.retry.attempts=15 hive.hmshandler.retry.interval=2000ms # Hive cache time cache.evict.hive.fixedRate.in.milliseconds=900000 # Kafka schema registry,按需配置 kafka.schema.registry.url=http://namenodetest01.bi:8081 # Update job instance state at regular intervals jobInstance.fixedDelay.in.milliseconds=60000 # Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds jobInstance.expired.milliseconds=604800000 # schedule predicate job every 5 minutes and repeat 12 times at most #interval time unit s:second m:minute h:hour d:day,only support these four units predicate.job.interval=5m predicate.job.repeat.count=12 # external properties directory location external.config.location= # external BATCH or STREAMING env external.env.location= # login strategy ("default" or "ldap") login.strategy=default # ldap,登陸策略爲ldap時配置 ldap.url=ldap://hostname:port ldap.email=@example.com ldap.searchBase=DC=org,DC=example ldap.searchPattern=(sAMAccountName={0}) # hdfs default name fs.defaultFS= # elasticsearch配置 elasticsearch.host=griffindq02-test1-rgtj1-tj1 elasticsearch.port=9200 elasticsearch.scheme=http # elasticsearch.user = user # elasticsearch.password = password # livy配置 livy.uri=http://10.104.110.116:8998/batches # yarn url配置 yarn.uri=http://10.104.110.116:8088 # griffin event listener internal.event.listeners=GriffinJobEventHook
二、service/src/main/resources/quartz.properties
# # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY # KIND, either express or implied. See the License for the # specific language governing permissions and limitations # under the License. # org.quartz.scheduler.instanceName=spring-boot-quartz org.quartz.scheduler.instanceId=AUTO org.quartz.threadPool.threadCount=5 org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX # If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate # If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate # If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate org.quartz.jobStore.useProperties=true org.quartz.jobStore.misfireThreshold=60000 org.quartz.jobStore.tablePrefix=QRTZ_ org.quartz.jobStore.isClustered=true org.quartz.jobStore.clusterCheckinInterval=20000
三、service/src/main/resources/sparkProperties.json:
{ "file": "hdfs:///griffin/griffin-measure.jar", "className": "org.apache.griffin.measure.Application", "name": "griffin", "queue": "default", "numExecutors": 2, "executorCores": 1, "driverMemory": "1g", "executorMemory": "1g", "conf": { "spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml" }, "files": [ ] }
四、service/src/main/resources/env/env_batch.json:
{ "spark": { "log.level": "INFO" }, "sinks": [ { "type": "CONSOLE", "config": { "max.log.lines": 10 } }, { "type": "HDFS", "config": { "path": "hdfs://namenodetest01.bi.10101111.com:9001/griffin/persist", "max.persist.lines": 10000, "max.lines.per.file": 10000 } }, { "type": "ELASTICSEARCH", "config": { "method": "post", "api": "http://10.104.110.119:9200/griffin/accuracy", "connection.timeout": "1m", "retry": 10 } } ], "griffin.checkpoint": [] }
配置文件修改好後,在idea裏的terminal裏執行以下maven命令進行編譯打包:
mvn -Dmaven.test.skip=true clean install
命令執行完成後,會在service和measure模塊的target目錄下分別看到service-0.4.0.jar和measure-0.4.0.jar兩個jar,將這兩個jar分別拷貝到服務器目錄下。這兩個jar的使用方式以下:
一、使用以下命令將measure-0.4.0.jar這個jar上傳到HDFS的/griffin文件目錄裏:
#改變jar名稱 mv measure-0.4.0.jar griffin-measure.jar #上傳griffin-measure.jar到HDFS文件目錄裏 hadoop fs -put measure-0.4.0.jar /griffin/
這樣作的目的主要是由於spark在yarn集羣上執行任務時,須要到HDFS的/griffin目錄下加載griffin-measure.jar,避免發生類org.apache.griffin.measure.Application找不到的錯誤。
二、運行service-0.4.0.jar,啓動Griffin管理後臺:
nohup java -jar service-0.4.0.jar>service.out 2>&1 &
幾秒鐘後,咱們能夠訪問Apache Griffin的默認UI(默認狀況下,spring boot的端口是8080)。
http://IP:8080
UI操做文檔連接:Apache Griffin User Guide。經過UI操做界面,咱們能夠建立本身的統計任務,部分結果展現界面以下:
一、在hive裏建立表demo_src和demo_tgt:
--create hive tables here. hql script --Note: replace hdfs location with your own path CREATE EXTERNAL TABLE `demo_src`( `id` bigint, `age` int, `desc` string) PARTITIONED BY ( `dt` string, `hour` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 'hdfs:///griffin/data/batch/demo_src'; --Note: replace hdfs location with your own path CREATE EXTERNAL TABLE `demo_tgt`( `id` bigint, `age` int, `desc` string) PARTITIONED BY ( `dt` string, `hour` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION 'hdfs:///griffin/data/batch/demo_tgt';
二、生成測試數據:
從http://griffin.apache.org/data/batch/地址下載全部文件到Hadoop服務器上,而後使用以下命令執行gen-hive-data.sh腳本:
nohup ./gen-hive-data.sh>gen.out 2>&1 &
注意觀察gen.out日誌文件,若是有錯誤,視狀況進行調整。這裏個人測試環境Hadoop和Hive安裝在同一臺服務器上,所以直接運行腳本。
三、經過UI界面建立統計任務,具體按照Apache Griffin User Guide 一步步操做。
# 啓動service.jar nohup java -jar service-0.4.0.jar>service.out 2>&1 & # 運行測試數據生成腳本 nohup ./gen-hive-data.sh>gen.out 2>&1 & # 查詢後臺任務 jobs -l # 查看livy日誌 tail -f /usr/data/livy/logs/livy-root-server.out # 啓動livy-server livy-server start # 中止livy-server livy-server stop
一、gen-hive-data.sh腳本生成數據失敗,報no such file or directory錯誤。
錯誤緣由:HDFS中的/griffin/data/batch/demo_src/和/griffin/data/batch/demo_tgt/目錄下"dt=時間"目錄不存在,如dt=20190113。
解決辦法:給腳本中增長hadoop fs -mkdir建立目錄操做,修改完後以下:
#!/bin/bash #create table hive -f create-table.hql echo "create table done" #current hour sudo ./gen_demo_data.sh cur_date=`date +%Y%m%d%H` dt=${cur_date:0:8} hour=${cur_date:8:2} partition_date="dt='$dt',hour='$hour'" sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql hive -f insert-data.hql src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour} hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour} hadoop fs -touchz ${src_done_path} hadoop fs -touchz ${tgt_done_path} echo "insert data [$partition_date] done" #last hour sudo ./gen_demo_data.sh cur_date=`date -d '1 hour ago' +%Y%m%d%H` dt=${cur_date:0:8} hour=${cur_date:8:2} partition_date="dt='$dt',hour='$hour'" sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql hive -f insert-data.hql src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour} hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour} hadoop fs -touchz ${src_done_path} hadoop fs -touchz ${tgt_done_path} echo "insert data [$partition_date] done" #next hours set +e while true do sudo ./gen_demo_data.sh cur_date=`date +%Y%m%d%H` next_date=`date -d "+1hour" '+%Y%m%d%H'` dt=${next_date:0:8} hour=${next_date:8:2} partition_date="dt='$dt',hour='$hour'" sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql hive -f insert-data.hql src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour} hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour} hadoop fs -touchz ${src_done_path} hadoop fs -touchz ${tgt_done_path} echo "insert data [$partition_date] done" sleep 3600 done set -e
二、HDFS的/griffin/persist目錄下沒有統計結果文件,檢查該目錄的權限,設置合適的權限便可。
三、ES中的metric數據爲空,有兩種可能:
四、啓動service-0.4.0.jar以後,訪問不到UI界面,查看啓動日誌無異常。檢查打包時是否是執行的mvn package命令,將該命令替換成mvn -Dmaven.test.skip=true clean install命令從新打包啓動便可。