Accessing values of Spark SQL collection data types: array and map


Copyright notice: this is an original post by the author and may not be reproduced without permission.

Writing all of this up by hand takes real effort; please respect the work. Thank you.

Author: http://blog.csdn.net/wang_wbq

This article discusses indexing into the two collection data types: array (arrays/lists) and map (dictionaries). As before, we start by constructing the data structures and a DataFrame:

scala> case class A(a: String, b: Int)
defined class A

scala> case class B(c: List[A], d: Map[String, A], e: Map[Int, String], f: Map[A, String])
defined class B

scala> def a_gen(i: Int) = A(s"str_$i", i)
a_gen: (i: Int)A                                                                                                                                                                    

scala> def b_gen(i: Int) = B((1 to 10).map(a_gen).toList, (1 to 10).map(j => s"key_$j" -> a_gen(j)).toMap, (1 to 10).map(j => j -> s"value_$j").toMap, (1 to 10).map(j => a_gen(j) -> s"value_$j").toMap)
b_gen: (i: Int)B

scala> val data = (1 to 10).map(b_gen)

scala> val df = spark.createDataFrame(data)
df: org.apache.spark.sql.DataFrame = [c: array<struct<a:string,b:int>>, d: map<string,struct<a:string,b:int>> ... 2 more fields]

scala> df.show
+--------------------+--------------------+--------------------+--------------------+
|                   c|                   d|                   e|                   f|
+--------------------+--------------------+--------------------+--------------------+
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
|[[str_1, 1], [str...|[key_2 -> [str_2,...|[5 -> value_5, 10...|[[str_8, 8] -> va...|
+--------------------+--------------------+--------------------+--------------------+


scala> df.printSchema
root
 |-- c: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: integer (nullable = false)
 |-- d: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: integer (nullable = false)
 |-- e: map (nullable = true)
 |    |-- key: integer
 |    |-- value: string (valueContainsNull = true)
 |-- f: map (nullable = true)
 |    |-- key: struct
 |    |-- value: string (valueContainsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: integer (nullable = false)

Indexing arrays/lists (array)

Let's first look at how an array (array/list) column is indexed:

//Column c is of type array; we can use plain dot notation to pull one field of the struct out of every array element
//The same result can be obtained with expr("c['a']") or col("c")("a").
scala> df.select("c.a").show(10, false)
+-----------------------------------------------------------------------+
|a                                                                      |
+-----------------------------------------------------------------------+
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
+-----------------------------------------------------------------------+


scala> df.select("c.a").printSchema
root
 |-- a: array (nullable = true)
 |    |-- element: string (containsNull = true)
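
As a quick check of the equivalence noted in the comment above, here is a minimal sketch (against the same df) that selects the same nested field through all three forms; only the generated column names differ:

import org.apache.spark.sql.functions.{col, expr}

df.select("c.a")              //dot notation on the array of structs
df.select(expr("c['a']"))     //SQL-style bracket expression
df.select(col("c")("a"))      //applying the Column to a field name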


//Here we introduce a very useful expression, explode, which expands the elements of an array into separate rows
//For example:
//> SELECT explode(array(10, 20));
// 10
// 20
//Another handy function is posexplode which, as the name suggests, also adds a column with each element's index in the original array (see the sketch after the explode examples below)
scala> df.select(expr("explode(c.a)")).show
+------+
|   col|
+------+
| str_1|
| str_2|
| str_3|
| str_4|
| str_5|
| str_6|
| str_7|
| str_8|
| str_9|
|str_10|
| str_1|
| str_2|
| str_3|
| str_4|
| str_5|
| str_6|
| str_7|
| str_8|
| str_9|
|str_10|
+------+
only showing top 20 rows

scala> df.select(expr("explode(c.a)")).printSchema
root
 |-- col: string (nullable = true)

scala> df.select(expr("explode(c)")).show
+------------+
|         col|
+------------+
|  [str_1, 1]|
|  [str_2, 2]|
|  [str_3, 3]|
|  [str_4, 4]|
|  [str_5, 5]|
|  [str_6, 6]|
|  [str_7, 7]|
|  [str_8, 8]|
|  [str_9, 9]|
|[str_10, 10]|
|  [str_1, 1]|
|  [str_2, 2]|
|  [str_3, 3]|
|  [str_4, 4]|
|  [str_5, 5]|
|  [str_6, 6]|
|  [str_7, 7]|
|  [str_8, 8]|
|  [str_9, 9]|
|[str_10, 10]|
+------------+
only showing top 20 rows

scala> df.select(expr("explode(c)")).printSchema
root
 |-- col: struct (nullable = true)
 |    |-- a: string (nullable = true)
 |    |-- b: integer (nullable = false)
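
As promised above, posexplode works just like explode but also emits each element's position. A minimal sketch against the same df; the resulting columns are named pos and col:

df.select(expr("posexplode(c)")).show
df.select(expr("posexplode(c)")).printSchema
//expect pos: integer plus col: struct<a: string, b: int>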

//inline is another very useful function: it expands an array[struct[XXX]] column directly into the XXX columns
scala> df.select(expr("inline(c)")).show
+------+---+
|     a|  b|
+------+---+
| str_1|  1|
| str_2|  2|
| str_3|  3|
| str_4|  4|
| str_5|  5|
| str_6|  6|
| str_7|  7|
| str_8|  8|
| str_9|  9|
|str_10| 10|
| str_1|  1|
| str_2|  2|
| str_3|  3|
| str_4|  4|
| str_5|  5|
| str_6|  6|
| str_7|  7|
| str_8|  8|
| str_9|  9|
|str_10| 10|
+------+---+
only showing top 20 rows

scala> df.select(expr("inline(c)")).printSchema
root
 |-- a: string (nullable = true)
 |-- b: integer (nullable = false)
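
As a side note, the same call can be written a bit more compactly with selectExpr, which parses SQL expression strings directly; this sketch is equivalent to the expr form above:

df.selectExpr("inline(c)").show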

Indexing maps (map)

Next let's look at how a map column is indexed. It really comes down to the same few forms we have been using all along:
1. dot expression: a.b
2. bracket expression: expr("a['b']")
3. parenthesis expression: col("a")("b")
They differ only in the resulting column name, as the sketch below shows.
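
A minimal sketch of the three forms applied to the map column d (same df as above); each returns the value stored under key_1, just under a different column name:

df.select("d.key_1")             //dot expression
df.select(expr("d['key_1']"))    //bracket expression
df.select(col("d")("key_1"))     //parenthesis expression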

scala> df.select(expr("posexplode(d)")).printSchema
root
 |-- pos: integer (nullable = false)
 |-- key: string (nullable = false)
 |-- value: struct (nullable = true)
 |    |-- a: string (nullable = true)
 |    |-- b: integer (nullable = false)


scala> df.select(expr("posexplode(e)")).printSchema
root
 |-- pos: integer (nullable = false)
 |-- key: integer (nullable = false)
 |-- value: string (nullable = true)

scala> df.select(expr("posexplode(f)")).show
+---+------------+--------+
|pos|         key|   value|
+---+------------+--------+
|  0|  [str_8, 8]| value_8|
|  1|[str_10, 10]|value_10|
|  2|  [str_3, 3]| value_3|
|  3|  [str_1, 1]| value_1|
|  4|  [str_6, 6]| value_6|
|  5|  [str_5, 5]| value_5|
|  6|  [str_7, 7]| value_7|
|  7|  [str_2, 2]| value_2|
|  8|  [str_4, 4]| value_4|
|  9|  [str_9, 9]| value_9|
|  0|  [str_8, 8]| value_8|
|  1|[str_10, 10]|value_10|
|  2|  [str_3, 3]| value_3|
|  3|  [str_1, 1]| value_1|
|  4|  [str_6, 6]| value_6|
|  5|  [str_5, 5]| value_5|
|  6|  [str_7, 7]| value_7|
|  7|  [str_2, 2]| value_2|
|  8|  [str_4, 4]| value_4|
|  9|  [str_9, 9]| value_9|
+---+------------+--------+

scala> df.select(expr("posexplode(f)")).printSchema
root
 |-- pos: integer (nullable = false)
 |-- key: struct (nullable = false)
 |    |-- a: string (nullable = true)
 |    |-- b: integer (nullable = false)
 |-- value: string (nullable = true)
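
For comparison, the plain explode function also accepts a map; it produces the same key and value columns, just without the position. A quick sketch against the same df:

df.select(expr("explode(d)")).printSchema
//expect key: string plus value: struct<a: string, b: int>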

//We can use the dot expression to look up a value by the map's key
//If the key does not exist, the value for that row will be null
scala> df.select("d.key_1").show
+----------+
|     key_1|
+----------+
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
+----------+


scala> df.select("d.key_1").printSchema
root
 |-- key_1: struct (nullable = true)
 |    |-- a: string (nullable = true)
 |    |-- b: integer (nullable = false)

//This works just as well when the key is a number
//For numeric keys, all four expressions work: expr("e[1]"), expr("e['1']"), col("e")(1) and col("e")("1")
//They differ only in the resulting column name (see the sketch below)
scala> df.select("e.1").show
+-------+
|      1|
+-------+
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
+-------+


scala> df.select("e.1").printSchema
root
 |-- 1: string (nullable = true)
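
Spelled out, the four equivalent forms mentioned in the comment above look like this (a sketch; for this df they all return value_1 and differ only in the generated column name):

df.select(expr("e[1]"))
df.select(expr("e['1']"))
df.select(col("e")(1))
df.select(col("e")("1"))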

Now that we have seen how to access struct and array values, map access should look quite simple. Let's finish with a slightly harder example.

The most interesting column is the map f, which uses a struct as its key.
In this case we can use a namedExpressionSeq expression to construct the struct key:

scala> df.select(expr("f[('str_1' AS a, 1 AS b)]")).show
+---------------------------------------------+
|f[named_struct(a, str_1 AS `a`, b, 1 AS `b`)]|
+---------------------------------------------+
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
+---------------------------------------------+


scala> df.select(expr("f[('str_1' AS a, 1 AS b)]")).printSchema
root
 |-- f[named_struct(a, str_1 AS `a`, b, 1 AS `b`)]: string (nullable = true)

This construction was of course not pulled out of thin air; it follows from the way of reading the grammar that I introduced in another post: http://www.javashuo.com/article/p-fnnqcoao-cz.html

咱們能夠在SqlBase.g4文件中找到如下詞法描述.net

primaryExpression
    : #earlier alternatives omitted
    | '(' namedExpression (',' namedExpression)+ ')'         #rowConstructor
    #middle alternatives omitted
    | value=primaryExpression '[' index=valueExpression ']'  #subscript
    #remaining alternatives omitted
    ;

valueExpression
    : primaryExpression                                                                      
    #remaining alternatives omitted
    ;

namedExpression
    : expression (AS? (identifier | identifierList))?
    ;

From these rules we can see that:
1. the brackets must contain a valueExpression
2. a valueExpression can be a primaryExpression
3. a primaryExpression can be a '(' namedExpression (',' namedExpression)+ ')' construct
4. a namedExpression is in turn an exp AS alias construct

So, clearly, we can use this syntax to construct a struct literal that matches the map's key.
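
Judging from the generated column name shown above (f[named_struct(...)]), the parenthesized form is just sugar for named_struct, so the same lookup can also be written explicitly; the alias is only there to give the result a friendlier column name. A sketch against the same df:

df.select(expr("f[named_struct('a', 'str_1', 'b', 1)]").alias("f_value")).show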
