Parsing JSON in Hive (Plain JSON and JSON Arrays)

1. Data preparation

Prepare the raw JSON data (test.json) as follows:

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}

The goal is to load this data into Hive and ultimately extract each JSON field into its own column.

This can be done either with a built-in function (get_json_object) or with a user-defined function.

2. get_json_object(string json_string, string path)

Return type: string

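Description: parses the JSON string json_string and returns the content specified by path. If the input JSON string is invalid, it returns NULL. Each call to this function returns only one data item.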

0: jdbc:hive2://hadoop3:10000> select get_json_object('{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}','$.movie');

(1) Create the json table and load the data

0: jdbc:hive2://master:10000> create table json(data string);
No rows affected (0.572 seconds)
0: jdbc:hive2://master:10000> load data local inpath '/home/hadoop/json.txt' into table json; 
No rows affected (1.046 seconds)

0: jdbc:hive2://master:10000> select get_json_object(data,'$.movie') as movie from json;

3. json_tuple(jsonStr, k1, k2, ...)

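Takes a JSON string and a set of keys k1, k2, ... and returns a tuple of the corresponding values. It is more efficient than get_json_object because several keys can be passed in a single call.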

0: jdbc:hive2://master:10000> select b.b_movie,b.b_rate,b.b_timeStamp,b.b_uid from json a lateral view 
json_tuple(a.data,'movie','rate','timeStamp','uid') b as b_movie,b_rate,b_timeStamp,b_uid;

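Note:

The advantage of json_tuple over get_json_object is that it can parse several JSON fields in one call. However, if the input is a JSON array, neither function on its own can expand it into one row per element.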

4. Parsing JSON arrays

(1) Parsing JSON arrays with Hive's built-in functions

Hive's built-in explode() function takes an array or a map as input and outputs its elements one per row. It can be used together with LATERAL VIEW.

hive> select explode(array('A','B','C'));
OK
A
B
C
Time taken: 4.879 seconds, Fetched: 3 row(s)
hive> select explode(map('A',10,'B',20,'C',30));
OK
A	10
B	20
C	30
Time taken: 0.261 seconds, Fetched: 3 row(s)
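To make the row-per-element behaviour concrete, here is a minimal Java sketch of what explode combined with LATERAL VIEW does conceptually: every source row is paired with each element of its array. The class name ExplodeSketch, the method lateralExplode, and the row ids r1..r3 are hypothetical and only illustrate the idea; this is not how Hive implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of LATERAL VIEW explode semantics, not Hive code.
final class ExplodeSketch {

    // Pairs each row id with every element of that row's array:
    // one output row per (id, element).
    static List<String[]> lateralExplode(List<String> ids, List<List<String>> arrays) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            for (String element : arrays.get(i)) {
                rows.add(new String[] {ids.get(i), element});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> rows = lateralExplode(
                Arrays.asList("r1", "r2", "r3"),
                Arrays.asList(
                        Arrays.asList("A", "B", "C"),
                        Collections.<String>emptyList(),
                        Arrays.asList("D")));
        for (String[] row : rows) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

In this sketch the row with an empty array (r2) contributes no output rows, which matches how a plain LATERAL VIEW behaves; Hive's LATERAL VIEW OUTER keeps such rows with NULL values instead.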

The explode function is relevant to parsing JSON data: we can use it to output the elements of a JSON array one per row:

hive> SELECT explode(split(regexp_replace(regexp_replace('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;'));
OK
{"website":"www.baidu.com","name":"百度"}
{"website":"google.com","name":"谷歌"}
Time taken: 0.14 seconds, Fetched: 2 row(s)

Explanation:

SELECT explode(split(
    regexp_replace(
        regexp_replace(
            '[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]',
            '\\[|\\]', ''),          -- remove the square brackets around the JSON array
        '\\}\\,\\{', '\\}\\;\\{'),   -- replace the commas between array elements with semicolons
    '\\;'));                         -- split on the semicolon
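Hive's regexp_replace and split follow Java regular expression syntax, so the three steps above can be mirrored in plain Java to see what each one does. The following is only a sketch: the class name JsonArraySplitSketch and the method splitJsonArray are hypothetical, and the sample values are ASCII stand-ins for the data used in the queries.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the regexp_replace + split chain, not Hive code.
final class JsonArraySplitSketch {

    static List<String> splitJsonArray(String jsonArray) {
        // Step 1: remove every '[' and ']' (the brackets around the array).
        String noBrackets = jsonArray.replaceAll("\\[|\\]", "");
        // Step 2: turn the "},{" between elements into "};{".
        String delimited = noBrackets.replaceAll("\\}\\,\\{", "};{");
        // Step 3: split on the semicolon, giving one string per element.
        return Arrays.asList(delimited.split("\\;"));
    }

    public static void main(String[] args) {
        String input = "[{\"website\":\"www.baidu.com\",\"name\":\"Baidu\"},"
                + "{\"website\":\"google.com\",\"name\":\"Google\"}]";
        for (String element : splitJsonArray(input)) {
            System.out.println(element);
        }
    }
}
```

The sketch also makes the limits of the approach visible: step 1 removes all square brackets, including those of arrays nested inside an element, and step 3 splits on any semicolon, so a value such as "x;y" is cut in two. For data that may contain these characters, a real JSON parser is the safer choice.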

Combine this with get_json_object or json_tuple to parse the fields inside each element:

hive> select json_tuple(json, 'website', 'name') from (SELECT explode(split(regexp_replace(regexp_replace('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) as json) test;
OK
www.baidu.com	百度
google.com	谷歌
Time taken: 0.283 seconds, Fetched: 2 row(s)

(2) Parsing JSON arrays with a user-defined function

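Although Hive's built-in functions can parse a JSON array, they are somewhat cumbersome to use. Hive provides a powerful user-defined function (UDF) interface, which we can use to write a UDF that parses JSON arrays. The test procedure is as follows: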

<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONArray;
import org.json.JSONException;

import java.util.ArrayList;

@Description(name = "json_array", value = "_FUNC_(array_string) - Convert a string of a JSON-encoded array to a Hive array of strings.")
public class JsonArray extends UDF {
    // Returns each element of the JSON array as a string; returns null for null or invalid input.
    public ArrayList<String> evaluate(String jsonString) {
        if (jsonString == null) {
            return null;
        }
        try {
            JSONArray extractObject = new JSONArray(jsonString);
            ArrayList<String> result = new ArrayList<String>();
            for (int ii = 0; ii < extractObject.length(); ++ii) {
                result.add(extractObject.get(ii).toString());
            }
            return result;
        } catch (JSONException e) {
            return null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}

Compile and package the code above; the resulting jar is named HiveJsonTest-1.0-SNAPSHOT.jar.

hive> add jar /mnt/HiveJsonTest-1.0-SNAPSHOT.jar;
Added [/mnt/HiveJsonTest-1.0-SNAPSHOT.jar] to class path
Added resources: [/mnt/HiveJsonTest-1.0-SNAPSHOT.jar]

hive> create temporary function json_array as 'JsonArray';
OK
Time taken: 0.111 seconds

hive> select explode(json_array('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]'));
OK
{"website":"www.baidu.com","name":"百度"}
{"website":"google.com","name":"谷歌"}
Time taken: 10.427 seconds, Fetched: 2 row(s)

hive> select json_tuple(json, 'website', 'name') from (SELECT explode(json_array('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]')) as json) test;
OK
www.baidu.com	百度
google.com	谷歌
Time taken: 0.265 seconds, Fetched: 2 row(s)

(3) Parsing a JSON object with a user-defined function

package com.laotou;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.JSONTokener;

/**
 * UDF that parses a JSON object.
 *
 * add jar jar/bdp_udf_demo-1.0.0.jar;
 * create temporary function getJsonObject as 'com.laotou.JsonObjectParsing';
 *
 * @Author:
 * @Date: 2019/8/9
 */
public class JsonObjectParsing extends UDF {
    public static String evaluate(String jsonStr, String keyName) throws JSONException {
        // Blank input or blank key: nothing to look up.
        if (StringUtils.isBlank(jsonStr) || StringUtils.isBlank(keyName)) {
            return null;
        }
        JSONObject jsonObject = new JSONObject(new JSONTokener(jsonStr));
        // opt() returns null for a missing key, whereas get() would throw a JSONException
        // and the null check below would never be reached.
        Object objValue = jsonObject.opt(keyName);
        if (objValue == null) {
            return null;
        }
        return objValue.toString();
    }
}

(3.1) Prepare the data

(3.2) Test