Getting Started with Apache Griffin

A data quality module is an essential component of any big data platform. Apache Griffin (hereafter Griffin) is an open-source data quality solution for big data. It supports both batch and streaming modes of data quality checking, and it can measure data assets from different dimensions (for example, checking whether source and target row counts match after an offline job finishes, or counting null values in a source table), thereby improving the accuracy and trustworthiness of the data.

Griffin's architecture is divided into three parts, Define, Measure and Analyze, as shown in the figure below:

The responsibilities of each part are as follows:

  • Define: defines the dimensions of data quality measurement, such as the time span of the statistics and the targets to be measured (whether source and target row counts match, the number of non-null values, distinct values, the maximum, minimum and top-5 values of a given field in a data source, and so on); a rough sketch of such a definition is shown after this list
  • Measure: executes the measurement jobs and produces the results
  • Analyze: stores and presents the measurement results
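
To make the Define part concrete, here is a rough sketch of what a batch accuracy measure definition might look like. This is only an illustration based on the demo_src/demo_tgt tables used later in this guide; the exact field names and layout can differ between Griffin versions:

{
  "name": "demo_accuracy",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        { "type": "hive", "config": { "database": "default", "table.name": "demo_src" } }
      ]
    },
    {
      "name": "tgt",
      "connectors": [
        { "type": "hive", "config": { "database": "default", "table.name": "demo_tgt" } }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "rule": "src.id = tgt.id AND src.age = tgt.age AND src.desc = tgt.desc"
      }
    ]
  }
}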

Based on these capabilities, our big data platform plans to adopt Griffin as its data quality solution, covering data consistency checks, null-value statistics and similar features. The installation steps are summarized below:

Installation and Deployment

Prerequisites

  • JDK (1.8 or later)
  • MySQL (5.6 or later)
  • Hadoop (2.6.0 or later)
  • Hive (2.x)
  • Spark (2.2.1)
  • Livy (livy-0.5.0-incubating)
  • Elasticsearch (5.0 or later)

Initialization

For the detailed initialization steps, refer to the Apache Griffin Deployment Guide. Since the Hadoop and Hive clusters are already set up in my test environment, the Hadoop and Hive installation steps are omitted here; only the steps for copying the configuration files and setting the Hadoop configuration directory are kept.

1. MySQL:

Create a database named quartz in MySQL, then run the Init_quartz_mysql_innodb.sql script to initialize the tables:

mysql -u <username> -p quartz < Init_quartz_mysql_innodb.sql
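
If you want to confirm the script ran, the Quartz tables (QRTZ_TRIGGERS, QRTZ_JOB_DETAILS and so on) should now exist in the quartz database; a quick check:

# list the tables created by the init script
mysql -u <username> -p -e "SHOW TABLES" quartz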

2. Hadoop and Hive:

Copy the Hadoop configuration files from the Hadoop server to the Livy server; here they are assumed to be placed under /usr/data/conf.

On the Hadoop server, create the /home/spark_conf directory in HDFS and upload Hive's configuration file hive-site.xml to it:

# create the /home/spark_conf directory
hadoop fs -mkdir -p /home/spark_conf
# upload hive-site.xml
hadoop fs -put hive-site.xml /home/spark_conf/

3. Set the environment variables:

#!/bin/bash
export JAVA_HOME=/data/jdk1.8.0_192

# Spark home directory
export SPARK_HOME=/usr/data/spark-2.1.1-bin-2.6.3
# Livy command directory
export LIVY_HOME=/usr/data/livy/bin
# Hadoop configuration file directory
export HADOOP_CONF_DIR=/usr/data/conf
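
These variables need to be visible to the shell that later starts Livy and the Griffin service. A minimal sketch, assuming the exports above are saved in a file named griffin_env.sh (the file name is just an example):

# load the variables into the current shell and confirm they took effect
source ./griffin_env.sh
echo $SPARK_HOME
echo $HADOOP_CONF_DIR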

4. Livy configuration:

Update the livy.conf configuration file under livy/conf:

livy.server.host = 127.0.0.1
livy.spark.master = yarn
livy.spark.deployMode = cluster
livy.repl.enable-hive-context = true

Start Livy:

livy-server start
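
Livy listens on port 8998 by default. If the server started correctly, its REST API should respond; for example, querying the sessions endpoint on a fresh server returns an empty session list (host and port here assume the default settings):

# expected output is roughly {"from":0,"total":0,"sessions":[]}
curl http://127.0.0.1:8998/sessions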

5. Elasticsearch configuration:

Create the griffin index in Elasticsearch:

curl -XPUT http://es:9200/griffin -d '
{
    "aliases": {},
    "mappings": {
        "accuracy": {
            "properties": {
                "name": {
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    },
                    "type": "text"
                },
                "tmst": {
                    "type": "date"
                }
            }
        }
    },
    "settings": {
        "index": {
            "number_of_replicas": "2",
            "number_of_shards": "5"
        }
    }
}
'
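
To verify that the index was created with the expected mapping, you can query it back (same es host as above; note that on Elasticsearch 6 and later the PUT above may also need a -H 'Content-Type: application/json' header):

# returns the aliases, mappings and settings of the griffin index
curl -XGET http://es:9200/griffin?pretty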

Building and Deploying from Source

Here I deploy Griffin by building and packaging it from source. The source repository is https://github.com/apache/griffin.git, and the tag I use is griffin-0.4.0. After downloading, import the project into IDEA to browse the source layout, which is described below.
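
For reference, fetching that tag from the command line looks like this (the checkout directory name is arbitrary):

# clone the Griffin repository and switch to the 0.4.0 release tag
git clone https://github.com/apache/griffin.git
cd griffin
git checkout griffin-0.4.0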

Griffin's source layout is clear and consists of four main modules: griffin-doc, measure, service and ui. griffin-doc holds the documentation; measure interacts with Spark to execute the measurement jobs; service is a Spring Boot application that provides the RESTful APIs required by the ui module, persists the measurement jobs and serves the results for display.

After the source has been imported and built, a few configuration files need to be modified:

1. service/src/main/resources/application.properties:

# Apache Griffin application name
spring.application.name=griffin_service
# MySQL database configuration
spring.datasource.url=jdbc:mysql://10.104.20.126:3306/griffin_quartz?useSSL=false
spring.datasource.username=xnuser
spring.datasource.password=Xn20!@n0oLk
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore configuration
hive.metastore.uris=thrift://namenodetest01.bi:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry, configure if needed
kafka.schema.registry.url=http://namenodetest01.bi:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of a job instance: 7 days, i.e. 604800000 milliseconds. The time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
# interval time unit: s=second, m=minute, h=hour, d=day; only these four units are supported
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# LDAP settings, used when login.strategy is ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=
# Elasticsearch configuration
elasticsearch.host=griffindq02-test1-rgtj1-tj1
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# Livy configuration
livy.uri=http://10.104.110.116:8998/batches
# YARN URL configuration
yarn.uri=http://10.104.110.116:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook

2. service/src/main/resources/quartz.properties:

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
# 
#   http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#
org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# If you use postgresql as your database, set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
# If you use mysql as your database, set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000

3. service/src/main/resources/sparkProperties.json:

{
  "file": "hdfs:///griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "name": "griffin",
  "queue": "default",
  "numExecutors": 2,
  "executorCores": 1,
  "driverMemory": "1g",
  "executorMemory": "1g",
  "conf": {
    "spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml"
  },
  "files": [
  ]
}

4. service/src/main/resources/env/env_batch.json:

{
  "spark": {
    "log.level": "INFO"
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://namenodetest01.bi.10101111.com:9001/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://10.104.110.119:9200/griffin/accuracy",
        "connection.timeout": "1m",
        "retry": 10
      }
    }
  ],
  "griffin.checkpoint": []
}

Once the configuration files have been updated, run the following Maven command in IDEA's terminal to build and package:

mvn -Dmaven.test.skip=true clean install

When the command finishes, service-0.4.0.jar and measure-0.4.0.jar will appear under the target directories of the service and measure modules respectively. Copy both jars to the server. They are used as follows:

1. Upload measure-0.4.0.jar to the /griffin directory in HDFS with the following commands:

# rename the jar
mv measure-0.4.0.jar griffin-measure.jar
# upload griffin-measure.jar to the HDFS directory
hadoop fs -put griffin-measure.jar /griffin/

This is necessary because when Spark runs the job on the YARN cluster it loads griffin-measure.jar from the /griffin directory in HDFS; without it, the job fails with an error that the class org.apache.griffin.measure.Application cannot be found.
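
A quick way to confirm the jar is where Spark expects it (the path must match the "file" entry in sparkProperties.json above):

# the jar must exist at this exact path, otherwise the Spark job cannot find org.apache.griffin.measure.Application
hadoop fs -ls /griffin/griffin-measure.jar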

2. Run service-0.4.0.jar to start the Griffin management service:

nohup java -jar service-0.4.0.jar>service.out 2>&1 &

After a few seconds, we can access Apache Griffin's default UI (by default, Spring Boot listens on port 8080):

http://IP:8080
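
If port 8080 is already in use, Spring Boot's standard server.port property can be set in application.properties or overridden at startup; for example (8090 is just an illustrative value):

# start the Griffin service on an alternative port
nohup java -jar service-0.4.0.jar --server.port=8090 > service.out 2>&1 &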

The UI documentation is available in the Apache Griffin User Guide. Through the UI we can create our own measurement jobs; some of the result screens look like this:

Trying It Out

1. Create the tables demo_src and demo_tgt in Hive:

--create hive tables here. hql script
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_src`(
  `id` bigint,
  `age` int,
  `desc` string) 
PARTITIONED BY (
  `dt` string,
  `hour` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
LOCATION
  'hdfs:///griffin/data/batch/demo_src';

--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_tgt`(
  `id` bigint,
  `age` int,
  `desc` string) 
PARTITIONED BY (
  `dt` string,
  `hour` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
LOCATION
  'hdfs:///griffin/data/batch/demo_tgt';

2. Generate test data:

Download all the files at http://griffin.apache.org/data/batch/ to the Hadoop server, then run the gen-hive-data.sh script as follows:

nohup ./gen-hive-data.sh>gen.out 2>&1 &

Keep an eye on the gen.out log file and adjust as needed if errors appear. In my test environment Hadoop and Hive are installed on the same server, so the script is run there directly.

3. Create a measurement job through the UI, following the Apache Griffin User Guide step by step.

Useful Commands

# start service.jar
nohup java -jar service-0.4.0.jar > service.out 2>&1 &
# run the test data generation script
nohup ./gen-hive-data.sh > gen.out 2>&1 &
# list background jobs
jobs -l

# tail the Livy log
tail -f /usr/data/livy/logs/livy-root-server.out
# start livy-server
livy-server start
# stop livy-server
livy-server stop

Troubleshooting Notes

1. The gen-hive-data.sh script fails to generate data with a "no such file or directory" error.

Cause: the "dt=<date>" directories (for example dt=20190113) do not exist under /griffin/data/batch/demo_src/ and /griffin/data/batch/demo_tgt/ in HDFS.

Fix: add hadoop fs -mkdir commands to the script to create those directories. The modified script looks like this:

#!/bin/bash

#create table
hive -f create-table.hql
echo "create table done"

#current hour
sudo ./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"

#last hour
sudo ./gen_demo_data.sh
cur_date=`date -d '1 hour ago' +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"

#next hours
set +e
while true
do
  sudo ./gen_demo_data.sh
  cur_date=`date +%Y%m%d%H`
  next_date=`date -d "+1hour" '+%Y%m%d%H'`
  dt=${next_date:0:8}
  hour=${next_date:8:2}
  partition_date="dt='$dt',hour='$hour'"
  sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
  hive -f insert-data.hql
  src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
  tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
  hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
  hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
  hadoop fs -touchz ${src_done_path}
  hadoop fs -touchz ${tgt_done_path}
  echo "insert data [$partition_date] done"
  sleep 3600
done
set -e

2. No result files appear under /griffin/persist in HDFS: check the permissions on that directory and grant appropriate ones, as sketched below.
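
For example, assuming the Spark job runs as a different user than the one that created the directory, something like the following makes the directory writable (adjust the owner and mode to your own security policy rather than opening it up blindly):

# make sure the persist directory exists and is writable by the user running the Spark job
hadoop fs -mkdir -p /griffin/persist
hadoop fs -chmod -R 777 /griffin/persist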

3. The metric data in ES is empty. There are two likely causes (a query for checking what has actually landed in ES is sketched after this list):

  • The Elasticsearch configuration in service/src/main/resources/env/env_batch.json is incorrect
  • The YARN node that runs the Spark job cannot resolve the Elasticsearch server's hostname, so the connection fails
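
In either case, you can check what has been written to Elasticsearch by querying the index directly (host and port as configured in env_batch.json above):

# an empty "hits" list means the measure job has not sinked any metrics yet
curl -XGET 'http://10.104.110.119:9200/griffin/accuracy/_search?pretty&size=5'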

4. After starting service-0.4.0.jar the UI cannot be reached, although the startup log shows no errors. Check whether the package was built with mvn package; if so, rebuild with mvn -Dmaven.test.skip=true clean install and restart.
