[Hive] JsonSerde User Guide

Date: 2019-11-12


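Note:

Each line must be a complete JSON object. A JSON object cannot span multiple lines; the SerDe does not work on multi-line JSON. This follows from how Hadoop processes files: they must be splittable, and Hadoop splits text files at line endings.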

// this will work
{ "key" : 10 }
 
// this will not work
{
  "key" : 10
}

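If the source file contains pretty-printed JSON, it can be flattened to one object per line before loading. The following is a minimal Python sketch of that idea, not part of the SerDe; the function name `to_json_lines` is hypothetical, and it assumes the whole input fits in memory.

```python
import json


def to_json_lines(text):
    """Re-serialize concatenated (possibly multi-line) JSON values, one per line."""
    decoder = json.JSONDecoder()
    lines = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        # Compact separators guarantee the output contains no line breaks
        lines.append(json.dumps(obj, separators=(",", ":"), ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    pretty = '{\n  "key" : 10\n}\n{ "key" : 11 }'
    print(to_json_lines(pretty))
```

Each output line is then a complete JSON object, which is the shape the SerDe expects.
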
Download the jar

Download the jar before use:

http://www.congiu.net/hive-json-serde/json

To use JsonSerde in Hive, add the jar to the Hive classpath:

add jar json-serde-1.3.7-jar-with-dependencies.jar;

Using with arrays

Source data:

{"country":"Switzerland","languages":["German","French","Italian"]}
{"country":"China","languages":["chinese"]}

Hive table:

CREATE TABLE tmp_json_array (
    country string,
    languages array<string> 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
-- Load the data into the table
LOAD DATA LOCAL INPATH '/home/xiaosi/a.txt' OVERWRITE INTO TABLE  tmp_json_array;

Usage:

hive> select languages[0] from tmp_json_array;
OK
German
chinese
Time taken: 0.096 seconds, Fetched: 2 row(s)

Nested structures

Source data:

{"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[6,7]}}
{"country":"China","languages":["chinese"],"religions":{"catholic":[10,20],"protestant":[40,50]}}

Hive table:

CREATE TABLE tmp_json_nested (
    country string,
    languages array<string>,
    religions map<string,array<int>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
-- Load the data
LOAD DATA LOCAL INPATH '/home/xiaosi/a.txt' OVERWRITE INTO TABLE  tmp_json_nested ;

Usage:

hive> select * from tmp_json_nested;
OK
Switzerland	["German","French","Italian"]	{"catholic":[6,7]}
China	["chinese"]	{"catholic":[10,20],"protestant":[40,50]}
Time taken: 0.113 seconds, Fetched: 2 row(s)
hive> select languages[0] from tmp_json_nested;
OK
German
chinese
Time taken: 0.122 seconds, Fetched: 2 row(s)
hive> select religions['catholic'][0] from tmp_json_nested;
OK
6
10
Time taken: 0.111 seconds, Fetched: 2 row(s)

Malformed data

By default, malformed data causes an exception. For example, take this malformed JSON (the ':' after languages is missing):

{"country":"Italy","languages"["Italian"],"religions":{"protestant":[40,50]}}

Usage:

hive> LOAD DATA LOCAL INPATH '/home/xiaosi/a.txt' OVERWRITE INTO TABLE  tmp_json_nested ;
Loading data to table default.tmp_json_nested
OK
Time taken: 0.23 seconds
hive> select * from tmp_json_nested;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: 
Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 31 [character 32 line 1]
Time taken: 0.096 seconds

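Failing the whole query is not a good strategy, because real data will inevitably contain bad records. The following setting makes the SerDe ignore malformed data: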

ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");

After changing the setting:

hive> ALTER TABLE tmp_json_nested SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
OK
Time taken: 0.122 seconds
hive> select * from tmp_json_nested;
OK
Switzerland	["German","French","Italian"]	{"catholic":[6,7]}
China	["chinese"]	{"catholic":[10,20],"protestant":[40,50]}
NULL	NULL	NULL
Time taken: 0.103 seconds, Fetched: 3 row(s)

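The query no longer fails, but the malformed record becomes NULL NULL NULL.

Note: if the JSON is well-formed but does not match the Hive table schema, the record is not skipped and an error is still raised: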

{"country":"Italy","languages":"Italian","religions":{"catholic":"90"}}

Usage:

hive> ALTER TABLE tmp_json_nested SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
OK
Time taken: 0.081 seconds
hive> select * from tmp_json_nested;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException:
java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONArray
Time taken: 0.097 seconds
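Because ignore.malformed.json does not cover well-formed records of the wrong shape, it can help to check a file before loading it. The sketch below is a hypothetical pre-load check in Python, not a feature of the SerDe. The names `classify_line` and `matches_schema` are made up for illustration, and the check is hard-coded to the tmp_json_nested columns. It is deliberately strict and may reject records the SerDe would tolerate, such as a scalar where an array is declared.

```python
import json


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def matches_schema(record):
    """Check country string, languages array<string>, religions map<string,array<int>>."""
    country = record.get("country")
    languages = record.get("languages")
    religions = record.get("religions")
    # Missing attributes are accepted here; they would simply be NULL columns
    if country is not None and not isinstance(country, str):
        return False
    if languages is not None:
        if not isinstance(languages, list) or not all(isinstance(x, str) for x in languages):
            return False
    if religions is not None:
        if not isinstance(religions, dict):
            return False
        for counts in religions.values():
            if not isinstance(counts, list) or not all(_is_int(c) for c in counts):
                return False
    return True


def classify_line(line):
    """Return 'malformed', 'schema_mismatch' or 'ok' for one input line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return "malformed"
    if not isinstance(record, dict):
        # Choice of this sketch: a row that is not a JSON object counts as malformed
        return "malformed"
    return "ok" if matches_schema(record) else "schema_mismatch"
```

Lines classified as schema_mismatch are the ones that would still fail the query even with ignore.malformed.json set to true.
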

Promoting a scalar to an array

A common problem is a field that is sometimes a scalar and sometimes an array, for example:

{ field: "hello", .. }
{ field: [ "hello", "world" ], ...

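In this case, if the column is declared as array<string> and the SerDe finds a scalar, it returns a single-element array, effectively promoting the scalar to an array. The scalar must still be of the correct type.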
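The promotion rule can be modelled in a few lines of Python. This is only an illustration of the described behaviour, not the SerDe's actual code, and the function name `promote_to_array` is hypothetical.

```python
import json


def promote_to_array(value):
    """Model of scalar promotion: wrap a scalar in a single-element list."""
    if value is None:
        return None          # a missing value stays NULL
    if isinstance(value, list):
        return value         # already an array, keep as-is
    return [value]           # scalar becomes a one-element array


if __name__ == "__main__":
    for line in ('{"field": "hello"}', '{"field": ["hello", "world"]}'):
        record = json.loads(line)
        print(promote_to_array(record.get("field")))
```

Both input shapes end up as arrays, so a query such as field[0] behaves the same for either record.
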

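Mapping Hive keywords

Sometimes the JSON data has an attribute whose name is a reserved word in Hive. For example, you may have a JSON attribute named "timestamp", which is a reserved word in Hive, so the CREATE TABLE statement fails. This SerDe can use SerDe properties to map a Hive column to a JSON attribute with a different name.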

{"country":"Switzerland","exec_date":"2017-03-14 23:12:21"}
{"country":"China","exec_date":"2017-03-16 03:22:18"}

CREATE TABLE tmp_json_mapping (
    country string,
    dt string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.dt"="exec_date")
STORED AS TEXTFILE;

hive> select * from tmp_json_mapping;
OK
Switzerland	2017-03-14 23:12:21
China	2017-03-16 03:22:18
Time taken: 0.081 seconds, Fetched: 2 row(s)

The property "mapping.dt" means that the dt column reads the value of the JSON attribute exec_date.
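As a conceptual model of this lookup, the Python sketch below resolves each column name through a mapping before reading the JSON attribute. It only illustrates the idea and is not how the SerDe is implemented; `read_row` and its parameters are hypothetical names.

```python
import json


def read_row(line, columns, mapping):
    """Read one row: each column uses its mapped JSON attribute, or its own name."""
    record = json.loads(line)
    # mapping.get(col, col) falls back to the column name when no mapping is set
    return [record.get(mapping.get(col, col)) for col in columns]


if __name__ == "__main__":
    line = '{"country":"Switzerland","exec_date":"2017-03-14 23:12:21"}'
    print(read_row(line, ["country", "dt"], {"dt": "exec_date"}))
```

Here the dictionary {"dt": "exec_date"} plays the role of the "mapping.dt"="exec_date" SerDe property.
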
