Oozie(HUE) 調度 Spark2

環境

JDK    1.8.0 
Hadoop 2.6.0
Scala  2.11.8  
Spark  2.1.2
Oozie  4.1
Hue    3.9

yarn local 模式

  • 進入 Workspace

workspace.png

  • 進入 lib 目錄,並上傳 jar 和 配置文件

lib.png

  • 拖拽 Spark Program

drag_spark.png

  • 選擇剛纔的 lib 目錄

select.png

  • 填入 jar 名稱,點擊 add 確認

add.png

  • 填寫業務主類名稱,並配置參數

main_opt.png

  • 點擊小齒輪,查看其餘參數

enter.png


see.png

  • 保存配置

save.png

  • 提交運行

submit.png

yarn cluster 模式

  • 進入 Workspace

yarn_cluster

  • 進入 lib 目錄,並上傳 jar 和 配置文件

upload

  • 拖拽 Spark Program

drag_spark

  • Files 隨便填,等會兒要刪除,Jar name 填入完整 HDFS 路徑
hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570773494.4/lib/DataWarehouse-1.0-SNAPSHOT.jar

add.png

  • 填寫業務主類名稱,點擊減號刪除 FILES,配置參數

image.png

hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570773494.4/lib/DataWarehouse-1.0-SNAPSHOT.jar
dw.user.qhy.wc.WordCount
--properties-file spark.properties
  • 點擊小齒輪,查看其餘參數

check

  • 將 client 改成 cluster

rename

  • 保存配置

save

  • 提交運行

submit

Oozie HTTP 接口

<workflow-app name="data_warehouse.test" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-2d66"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-2d66">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>data_warehouse.workflow</name>
              <class>dw.update.JobStream</class>
            <jar>hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1578979482.24/lib/DataWarehouse-1.0.jar</jar>
              <spark-opts>--properties-file spark.properties</spark-opts>
              <arg>${CustomArgs}</arg>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
  • 示例代碼(Python)
import requests
import json
import time
from pprint import pprint

HEADER = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Content-Type':
    'application/xml;charset=UTF-8',
}
# oozie.wf.application.path 裏面存放 workflow.xml
XML = '''
<configuration>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1578979482.24</value>
  </property>
  <property>
    <name>oozie.use.system.libpath</name>
    <value>True</value>
  </property>
  <property>
    <name>user.name</name>
    <value>walker</value>
  </property>
  <property>
    <name>jobTracker</name>
    <value>rm1</value>
  </property>
  <property>
    <name>mapreduce.job.user.name</name>
    <value>walker</value>
  </property>
  <property>
    <name>nameNode</name>
    <value>hdfs://localcluster</value>
  </property>
  <property>
    <name>CustomArgs</name>
    <value>%s</value>
  </property>
</configuration>
'''

CustomArgs = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Content-Type':
    'application/xml;charset=UTF-8',
    'hello':
    'world'
}

XML = XML % json.dumps(CustomArgs)
# 提交任務
r = requests.post('http://oozie.walker:11000/oozie/v1/jobs?action=start', data=XML, headers=HEADER)
print(r.text)   # {"id":"0000034-191012235641226-oozie-vipc-W"}
# 獲取任務 ID
jobid = json.loads(r.text, encoding='utf8')['id']
# 查看運行狀態
url = 'http://oozie.walker:11000/oozie/v1/job/%s?show=info&timezone=GMT' % jobid 
while True:
    time.sleep(5)
    r = requests.get(url)
    pprint(r.text)    # 查看運行狀態

FAQ

  • 報相似以下錯誤(Attempt to add ... multiple times to the distributed cache)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, 
Attempt to add (hdfs://localcluster/user/hue/oozie/workspaces/hue-oozie-1570758098.65/lib/DataWarehouse-1.0-SNAPSHOT.jar) multiple times to the distributed cache.

能夠參考這篇文章的處理方式: java.lang.IllegalArgumentException: Attempt to add (custom-jar-with-spark-code.jar) multiple times to the distributed cachehtml


  • 報相似以下錯誤(kryo)
java.io.IOException: java.lang.NullPointerException
java.io.EOFException
com.esotericsoftware.kryo.KryoException

多是由於不當的使用了 kryo 序列化器,最簡單的解決方法是將java

spark.serializer=org.apache.spark.serializer.KryoSerializer

換回默認的node

spark.serializer=org.apache.spark.serializer.JavaSerializer

進一步可參考這篇文章的解決方案:Spark2 的序列化(JavaSerializer/KryoSerializer)python

本文出自 walker snapshot
相關文章
相關標籤/搜索