Hive解析多重嵌套JSON數組

最近在工做中遇到一個很難解析的JSON,他是一個嵌套的JSON數組的JSON,要使用Hive來進行解析,用Presto寫了一次,邏輯就很清晰,由於Presto自帶了JSON數據類型,轉換數組就很方便,而Hive解析完JSON數組後是一個字符串,只能使用split方法來對string類型的數據進行切分,因此若是遇到多層嵌套的數組,要注意切分方法,否則就會亂套。sql

須要解析的JSON

{
	"base": {
		"code": "xm",
		"name": "project"
	},
	"list": [{
		"ACode": "cp1",
		"AName": "Product1",
		"BList": [{
			"BCode": "gn1",
			"BName": "Feature1"
		}, {
			"BCode": "gn2",
			"BName": "Feature2"
		}]
	}, {
		"ACode": "cp2",
		"AName": "Product2",
		"BList": [{
			"BCode": "gn1",
			"BName": "Feature1"
		}]
	}]
}

解析出來的結果應該以下表所示json

code name ACode Aname Bcode Bname
xm project cp1 Product1 gn1 Feature1
xm project cp1 Product1 gn2 Feature2
xm project cp2 Product2 gn1 Feature1

解決方案

首先使用get_json_object方法,把須要解析的數組解析出來,而後使用regexp_replace}]},{替換成}]}||{,而後再使用split方法對||進行分割,分割成數組後,使用lateral view explode方法對其進行展開成多列即刻。數組

SELECT
    code
  , name
  , ai.ACode
  , ai.AName
  , bi.BCode
  , bi.BName
FROM
    (
        SELECT
            get_json_object(t.value, '$.base.code') AS code
          , get_json_object(t.value, '$.base.name') AS name
          , get_json_object(t.value, '$.list')      AS list
        FROM
            (
                SELECT
                    '{"base":{"code":"xm","name":"project"},"list":[{"ACode":"cp1","AName":"Product1","BList":[{"BCode":"gn1","BName":"Feature1"},{"BCode":"gn2","BName":"Feature2"}]},{"ACode":"cp2","AName":"Product2","BList":[{"BCode":"gn1","BName":"Feature1"}]}]}' as value
            )
            t
    )
    t lateral view explode(split(regexp_replace(regexp_extract(list,'^\\[(.+)\\]$',1),'\\}\\]\\}\\,\\{', '\\}\\]\\}\\|\\|\\{'),'\\|\\|')) list as a 
	lateral view json_tuple(a,'ACode','AName','BList') ai as ACode
    , AName
    , BList lateral view explode(split(regexp_replace(regexp_extract(BList,'^\\[(.+)\\]$',1),'\\}\\,\\{', '\\}\\|\\|\\{'),'\\|\\|')) BList as b 
    lateral view json_tuple(b,'BCode','BName') bi as BCode
    , BName
;

執行完rest

xm	project	cp1	Product1	gn1	Feature1
xm	project	cp1	Product1	gn2	Feature2
xm	project	cp2	Product2	gn1	Feature1
Time taken: 0.787 seconds, Fetched: 3 row(s)

hive

總結

  1. 嘗試切分爲數組後,使用lateral view posexplode方案,逐層解析,但這樣會致使笛卡爾。因此必須一次性所有解析好,而不是套用多個子查詢逐層解析;
  2. 使用OUTER字段,能使LATERAL VIEW不忽略NULL

include OUTER in the query to get rows with NULL valuescode

something like,regexp

select *  FROM table LATERAL VIEW OUTER explode (  split (  email  ,','  ) ) email AS email_id;
相關文章
相關標籤/搜索