Spark: The Definitive Guide, Chapter 5: Basic Structured API Operations

Date: 2019-11-29


Chapter 5: Basic Structured API Operations

Preface

Schemas

Here I use the 2015-summary.csv file from the data sets that accompany the book.

scala> val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/2015-summary.csv")
df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> df.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)

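The printSchema method prints the Schema of df. A Schema can be built in two ways: it can be inferred from the data at read time, as above (schema-on-read), or you can define it yourself. Which one to pick depends on your use case. If you do not know the format of the input data, use inference. If you do know it, or you are cleaning data in an ETL job, define the Schema yourself, because an inferred Schema changes whenever the format of the incoming data changes.

Let's look at what a Schema actually is. The output below shows that a Schema is a StructType made up of StructField entries, and each StructField carries a field name, a type, and a flag saying whether the column may contain null or missing values.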

scala> spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/2015-summary.csv").schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))

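Here is an example of a custom Schema. First import the classes StructType and StructField plus the built-in data types you need (the Spark types mentioned in Chapter 4), then define your own Schema, and finally pass it to the schema method when reading the data.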

scala> import org.apache.spark.sql.types.{StructType,StructField,StringType,LongType}
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

scala> val mySchema = StructType(Array(
     |  StructField("DEST_COUNTRY_NAME",StringType,true),
     |  StructField("ORIGIN_COUNTRY_NAME",StringType,true),
     |  StructField("count",LongType,true)
     | ))
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,LongType,true))

scala> val df = spark.read.format("csv").option("header","true").schema(mySchema).load("data/2015-summary.csv")
df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> df.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

Note that StringType and LongType here are simply the Spark types discussed in Chapter 4. Also, where a custom Schema is really needed is when converting an RDD to a DataFrame; see my earlier notes.

Columns and Expressions

I think the book spends too long on this. The point is simply how to reference a column in Spark SQL. The options are listed below.

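import org.apache.spark.sql.functions.{col, column, expr}

df.select("count").show
df.select(df("count")).show
df.select(df.col("count")).show // col can be replaced by column; you can also drop df and call col directly
df.select($"count").show // Scala-only syntax with no performance benefit, just be aware of it (the book also mentions the ' symbol, as in 'count)
df.select(expr("count")).show
df.select(expr("count"),expr("count")+1 as "count+1").show(5) // as assigns an alias
df.select(expr("count+1")+1).show(5)
df.select(col("count")+1).show(5)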

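That is more or less all of them. The ones to remember are col and expr. The difference is that expr accepts a whole expression as a string, so expr("count+1") is equivalent to expr("count")+1 and to col("count")+1.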
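To check that equivalence for yourself, here is a minimal sketch. The local SparkSession and the two-row toy DataFrame are my own assumptions for illustration, not the book's data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("expr-vs-col").getOrCreate()
import spark.implicits._

val flights = Seq(("United States", 15), ("Croatia", 1)).toDF("dest", "count")

// Three spellings of "count plus one".
val viaString = flights.select(expr("count + 1")).collect().map(_.getInt(0))
val viaExpr   = flights.select(expr("count") + 1).collect().map(_.getInt(0))
val viaCol    = flights.select(col("count") + 1).collect().map(_.getInt(0))

// All three arrays hold the same values.
println(viaString.mkString(","), viaExpr.mkString(","), viaCol.mkString(","))
```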

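One more thing: SQL's select * from xxx can be written in Spark SQL as df.select("*"), df.select(expr("*")), or df.select(col("*")).

The book also explains why the three expressions above are the same: Spark compiles them into the same logical tree, which is evaluated in the same order. If you have taken a compilers course, think of top-down parsing with a leftmost derivation, as in LL(1). For example, (((col("someCol") + 5) * 200) - 6) < col("otherCol") becomes a tree whose root is the < comparison. Its right child is col("otherCol") and its left child is the subtraction, which contains the multiplication, which in turn contains the addition of col("someCol") and 5.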

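Records and Rows

As discussed in Chapter 4, DataFrame = Dataset[Row]: each record in a DataFrame is an object of type Row. Spark manipulates Row objects with column expressions to produce usable values. Internally a Row is represented as an array of bytes. Because we only ever work with Rows through column expressions, that byte array is never exposed to the end user.

Let's build a Row object by hand.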

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val myRow = Row("China",null,1,true)
myRow: org.apache.spark.sql.Row = [China,null,1,true]

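First import the Row class, then list the values in the order required by the Schema you have in mind. Why "the Schema you have in mind"? To be clear, only a DataFrame has a Schema; a Row does not. The whole reason to build a Row by hand is to turn it into a DataFrame (converting an RDD to a DataFrame comes up later). Otherwise you could just keep using the plain RDD.

Accessing data in a Row object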

scala> myRow(0)
res12: Any = China

scala> myRow.get
get          getByte    getDecimal   getInt       getLong   getShort    getTimestamp   
getAs        getClass   getDouble    getJavaMap   getMap    getString   getValuesMap   
getBoolean   getDate    getFloat     getList      getSeq    getStruct                  

scala> myRow.get(1)
res13: Any = null

scala> myRow.getBoolean(3)
res14: Boolean = true

scala> myRow.getString(0)
res15: String = China

scala> myRow(0).asInstanceOf[String]
res16: String = China

In the code above, note the second prompt: typing myRow.get and using tab completion lists the many type-specific getter methods.
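A few more accessors are handy when a Row may contain nulls. This is a small sketch of my own using the same Row as above; it only needs the Spark SQL classes on the classpath, no running session.

```scala
import org.apache.spark.sql.Row

val myRow = Row("China", null, 1, true)

// Number of values held by the Row.
val fieldCount = myRow.length

// Check for null before calling a primitive getter such as getInt.
val secondIsNull = myRow.isNullAt(1)

// Typed access by position, an alternative to asInstanceOf.
val country = myRow.getAs[String](0)

// Fall back to a default when the slot is null (the default 0 is an arbitrary choice).
val count = if (myRow.isNullAt(2)) 0 else myRow.getInt(2)
```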

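DataFrame Transformations

Corresponding docs: spark.apache.org/docs/2.4.0/… (the book links to 2.2.0, so I updated the version)

The book names the core operations when working with a single DataFrame:

Add rows or columns
Remove rows or columns
Transform a row into a column (or a column into a row)
Sort rows by column values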

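Creating DataFrames

Several ways of creating a DataFrame have come up already, such as reading from data sources like JSON, CSV, and Parquet, or from JDBC and Hadoop-format files. The other route is converting an RDD to a DataFrame. The book does not go into detail here, but there are two approaches: building the DataFrame from a hand-written StructType (the programmatic interface), and building it from a case class through reflection (the book is not explicit about this one, since its only example uses Row objects).

See my earlier post: How to Convert an RDD to a DataFrame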
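For reference, the case class route can be sketched like this. The Flight class, the sample rows, and the local session are hypothetical and only for illustration. The StructType route appears later in the union example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type: Spark derives column names and types from its fields.
case class Flight(dest: String, origin: String, count: Long)

// Local session: an assumption for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("rdd-to-df").getOrCreate()
import spark.implicits._

val flightRdd = spark.sparkContext.parallelize(Seq(
  Flight("United States", "Romania", 15L),
  Flight("United States", "Croatia", 1L)
))

// toDF comes from spark.implicits._ and uses reflection on the case class.
val flightDf = flightRdd.toDF()
flightDf.printSchema()
```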

Another big advantage of a DataFrame is that it can be registered as a temporary view and then queried directly with SQL:

df.createOrReplaceTempView("dfTable") // create or replace a temporary view
spark.sql("select * from dfTable where count>50").show

select and selectExpr

These two are simple: they are the select statement from SQL. The difference is that select takes Column objects or expressions, while selectExpr takes expressions written as strings.

df.select(col("DEST_COUNTRY_NAME") as "dest_country").show(2)

spark.sql("select DEST_COUNTRY_NAME as `dest_country` from dfTable limit 2").show

You can swap col("DEST_COUNTRY_NAME") for any of the other column syntaxes listed earlier, but note that Column objects and plain Strings cannot be mixed in the same call.

scala> df.select(col("DEST_COUNTRY_NAME"),"EST_COUNTRY_NAME").show(2).show
<console>:26: error: overloaded method value select with alternatives:
  [U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
  
 cannot be applied to (org.apache.spark.sql.Column, String)
       df.select(col("DEST_COUNTRY_NAME"),"EST_COUNTRY_NAME").show(2).show

// the key part: cannot be applied to (org.apache.spark.sql.Column, String)

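You can also select several columns; just separate them with commas. To give a column an alias, either write col("DEST_COUNTRY_NAME") as "dest_country" as above, or write expr("DEST_COUNTRY_NAME as dest_country") (as mentioned earlier, expr accepts an expression string).

In Scala you can also alias a column and then alias it back to its original name: df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME")).show(2). It is enough to know this exists.

selectExpr is simply shorthand for select(expr(xxx)) and is a convenient way to build complex expressions. Which one to use is hard to say; pick whatever suits your situation, since both work.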

df.selectExpr("DEST_COUNTRY_NAME as destination","ORIGIN_COUNTRY_NAME").show(2)

// aggregation
scala> df.selectExpr("avg(count)","count(distinct(DEST_COUNTRY_NAME))").show(5)
+-----------+---------------------------------+                                 
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+
// equivalent select version
scala> df.select(avg("count"),countDistinct("DEST_COUNTRY_NAME")).show()
+-----------+---------------------------------+                                 
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625|                              132|
+-----------+---------------------------------+
// equivalent SQL
scala> spark.sql("SELECT avg(count), count(distinct(DEST_COUNTRY_NAME)) FROM dfTable LIMIT 2")

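Converting to Spark Types (Literals)

I don't quite see the point of this one. The book says you need it when comparing a column against a constant or against a variable created in your program, and then gives an example that adds a constant column of 1.

import org.apache.spark.sql.functions.{expr, lit}
df.select(expr("*"), lit(1).as("One")).show(2)

// in SQL
spark.sql("SELECT *, 1 as One FROM dfTable LIMIT 2")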

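I really could not see the point. For example, here I query all records whose count is above the column average:

val result = df.select(avg("count")).collect()(0).getDouble(0)
df.where(col("count") > lit(result)).show() // works without lit as well, since the comparison operators on Column wrap plain values in a literal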
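One case where lit does seem necessary is a function that only accepts Column arguments. The sketch below uses concat, which is my own pick rather than the book's example, on a hypothetical toy DataFrame with a local session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("lit-demo").getOrCreate()
import spark.implicits._

val flights = Seq(("United States", "Romania", 15), ("Croatia", "Croatia", 1))
  .toDF("dest", "origin", "count")

// concat takes Column arguments only, so the separator has to be wrapped in lit;
// passing the bare string " -> " here would not compile.
val routes = flights.select(concat(col("origin"), lit(" -> "), col("dest")).as("route"))
routes.show(false)
```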

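Adding or Removing Columns

DataFrame provides the withColumn method for adding columns. For example, df.withColumn("numberOne",lit(1)) adds a column whose value is always 1, much like pd_df['numberOne'] = 1 in pandas, except that withColumn returns a new DataFrame.

The column's values can also come from an actual expression.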

scala> df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME ==DEST_COUNTRY_NAME")).show(2)
+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
|    United States|            Romania|   15|        false|
|    United States|            Croatia|    1|        false|
+-----------------+-------------------+-----+-------------+
only showing top 2 rows

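DataFrame provides a drop method for removing columns. If you know R or Python this is easy to pick up, since pandas has a method of the same name. drop also returns a new DataFrame. Frankly it feels redundant, since select gives the same result. In the session below, df1 is df with the numberOne column added as above.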

scala> df1.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)
 |-- numberOne: integer (nullable = false)

// to drop several columns, pass their names separated by commas
scala> df1.drop("numberOne").columns
res52: Array[String] = Array(DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count)

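Renaming Columns

Use the withColumnRenamed method, e.g. df.withColumnRenamed("DEST_COUNTRY_NAME","dest_country").columns. This too returns a new DataFrame.

Reserved Characters and Keywords

When a column name contains spaces or dashes, wrap it in backticks (`), as below.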

dfWithLongColName.selectExpr("`This Long Column-Name`","`This Long Column-Name` as `new col`").show(2)

spark.sql("SELECT `This Long Column-Name`, `This Long Column-Name` as `new col` FROM dfTableLong LIMIT 2")
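The snippet above does not show where dfWithLongColName comes from. The following self-contained sketch builds one. The one-row DataFrame and the local session are my own assumptions, and I believe the withColumn step mirrors how the book sets it up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("backticks").getOrCreate()
import spark.implicits._

val flights = Seq(("United States", "Romania", 15))
  .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

// No escaping needed here: the first argument of withColumn is a plain name, not an expression.
val dfWithLongColName = flights.withColumn("This Long Column-Name", expr("ORIGIN_COUNTRY_NAME"))

// Inside an expression string the same name must be backtick-quoted.
val renamed = dfWithLongColName.selectExpr("`This Long Column-Name` as `new col`")
renamed.show()
```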

Case Sensitivity

Spark is case-insensitive by default, but you can make it case-sensitive by setting the spark.sql.caseSensitive property to true.

spark.sqlContext.setConf("spark.sql.caseSensitive","true")

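The real takeaway is not this particular setting but how to read and set whatever configuration property you need from within a program. With a SparkSession, spark.conf.set and spark.conf.get are all you need, since SparkSession wraps SparkContext, SQLContext, and HiveContext.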
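A minimal sketch of reading and writing a setting through spark.conf follows. The local session and the made-up key spark.some.hypothetical.key are assumptions used only to show the lookup of a value that is not set.

```scala
import org.apache.spark.sql.SparkSession

// Local session: an assumption for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("conf-demo").getOrCreate()

// Set a runtime SQL property and read it back (values are strings).
spark.conf.set("spark.sql.caseSensitive", "true")
val caseSensitive = spark.conf.get("spark.sql.caseSensitive")

// getOption avoids an exception when a key has no value; this key is made up.
val missing = spark.conf.getOption("spark.some.hypothetical.key")

// Put the default back so later queries behave as usual.
spark.conf.set("spark.sql.caseSensitive", "false")
```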

Changing a Column's Type

Same as changing types in Hive: use the cast method.

scala> df1.withColumn("LongOne",col("numberOne").cast("Long")).printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)
 |-- numberOne: integer (nullable = false)
 |-- LongOne: long (nullable = false)

// equivalent SQL, run against a temp view registered on df1: SELECT *, cast(numberOne as long) AS LongOne FROM ...

Filtering Rows

There are two methods, where and filter. They are interchangeable, so pick either.

scala> df.filter(col("DEST_COUNTRY_NAME")==="United States").filter($"count">2000).show
+-----------------+-------------------+------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME| count|
+-----------------+-------------------+------+
|    United States|      United States|370002|
|    United States|             Mexico|  7187|
|    United States|             Canada|  8483|
+-----------------+-------------------+------+

// SQL version
spark.sql("select * from dfTable where DEST_COUNTRY_NAME=='United States' and count>2000").show

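One thing to watch is how equality and inequality are written: === and =!=

The book also mentions here that, when using the Dataset API in Scala or Java, filter also accepts an arbitrary function that Spark applies to every record in the Dataset.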
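The book's remark about passing an arbitrary function to filter can be sketched as follows. The Flight case class, the sample rows, and the local session are hypothetical and only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical typed record for a Dataset[Flight].
case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: Long)

// Local session: an assumption for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("typed-filter").getOrCreate()
import spark.implicits._

val flights = Seq(
  Flight("United States", "Canada", 8483L),
  Flight("United States", "Romania", 15L)
).toDS()

// filter with an ordinary Scala function instead of a Column expression.
val busy = flights.filter(f => f.count > 2000)
busy.show()
```

The function is opaque to Spark's optimizer, so where a Column expression can say the same thing it is usually the safer default.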

One addition: the column-based filter example earlier is an AND condition. How do you write OR?

// easy in SQL
spark.sql("select * from dfTable where DEST_COUNTRY_NAME=='United States' and (count>200 or count<10)").show
// DataFrame equivalent
df.filter(col("DEST_COUNTRY_NAME")==="United States").filter(expr("count>200").or(expr("count<10"))).show

// another example: you can also build Column objects and combine them
val countFilter = col("count") > 2000
val destCountryFilter1 = col("DEST_COUNTRY_NAME") === "United States"
val destCountryFilter2 = col("DEST_COUNTRY_NAME") === "China"
// negate with !
df.where(!countFilter).where(destCountryFilter1.or(destCountryFilter2)).groupBy("DEST_COUNTRY_NAME").count().show
+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|    United States|  122|
|            China|    1|
+-----------------+-----+

Distinct Rows

The heading may be ambiguous; this is just distinct from SQL.

//SQL
spark.sql("select COUNT(DISTINCT(ORIGIN_COUNTRY_NAME,DEST_COUNTRY_NAME)) FROM dfTable")
//df
df.select("ORIGIN_COUNTRY_NAME","DEST_COUNTRY_NAME").distinct.count

Random Sampling

scala> df.count
res1: Long = 256
// seed
scala> val seed = 5
seed: Int = 5
// whether to sample with replacement (true lets the same row be drawn more than once)
scala> val withReplacement = false
withReplacement: Boolean = false
// sampling fraction (approximate, not an exact row count)
scala> val fraction = 0.5
fraction: Double = 0.5
// sample
scala> df.sample(withReplacement,fraction,seed).count
res4: Long = 126

Random Splits

This is commonly used in machine learning to split data into training and test sets, much like train_test_split in sklearn.

def randomSplit(weights:Array[Double]):Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
def randomSplit(weights:Array[Double],seed:Long):Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]

// pass an Array of split weights; seed is the random seed
// the return value is also an Array
scala> val result = df.randomSplit(Array(0.25,0.75),5)
scala> result(0).count
res12: Long = 60

scala> result(1).count
res13: Long = 196
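For a train/test split, destructuring the returned Array reads a little better. This is a sketch under my own assumptions: a local session, a made-up DataFrame of 1000 ids, and an arbitrary 80/20 split with seed 42.

```scala
import org.apache.spark.sql.SparkSession

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("random-split").getOrCreate()

val data = spark.range(0, 1000).toDF("id")

// Weights are relative; Spark normalizes them if they do not sum to 1.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), 42L)

val trainCount = train.count()
val testCount = test.count()

// The split sizes are approximate, but every row lands in exactly one split.
println(s"train=$trainCount test=$testCount")
```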

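Joins

How to put it: the methods Spark SQL provides are not as flexible or concise as writing plain SQL. Here is an example:

// df1 is the DataFrame derived above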
df.join(df1,df.col("count")===df1.col("count")).show

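The default is an inner join. With this form the identically named columns are not merged (the result keeps both copies), and renaming them afterwards is awkward. You can also write it as follows

df.join(df1,"count").show
// join on multiple columns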
df.join(df1,Seq("count","DEST_COUNTRY_NAME")).show

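The benefit is that the identically named columns are merged.

There are also left joins, right joins, outer joins and so on. Just name the type in the join method, as below

// left outer join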
df.join(df1,Seq("count","DEST_COUNTRY_NAME"),"leftouter").show

The available join types are:

Supported join types include: 'inner', 'outer', 'full', 'fullouter', 'full_outer', 'leftouter', 'left', 'left_outer', 'rightouter', 'right', 'right_outer', 'leftsemi', 'left_semi', 'leftanti', 'left_anti', 'cross'.

I would rather register temporary views and write joins in SQL, which is more concise.
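If you stay in the DataFrame API, aliasing both sides makes the duplicated columns addressable after an expression join. The sketch below shows that next to the SQL version. The two small DataFrames, the view names, and the local session are my own assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("join-alias").getOrCreate()
import spark.implicits._

val left = Seq(("United States", 15), ("Croatia", 1)).toDF("DEST_COUNTRY_NAME", "count")
val right = Seq((15, "busy"), (1, "quiet")).toDF("count", "label")

// Expression join: both count columns survive, so qualify them through the aliases.
val joined = left.as("l")
  .join(right.as("r"), col("l.count") === col("r.count"))
  .select(col("l.DEST_COUNTRY_NAME"), col("l.count").as("left_count"), col("r.label"))

// The same join written in SQL over temporary views.
left.createOrReplaceTempView("leftTable")
right.createOrReplaceTempView("rightTable")
val joinedSql = spark.sql(
  """SELECT l.DEST_COUNTRY_NAME, l.count AS left_count, r.label
    |FROM leftTable l JOIN rightTable r ON l.count = r.count""".stripMargin)
```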

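Unions

This appends one DataFrame (or Dataset) to another. Columns are matched by position, not by name (so the column names may differ, and whatever sits at the same position is merged). It is also not equivalent to SQL's UNION, which removes duplicates: union here does not deduplicate, so it behaves like UNION ALL.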

scala> val rows = Seq(
     | Row("New Country","Other Country",5),
     | Row("New Country2","Other Country3",1)
     | )
scala> val rdd = spark.sparkContext.parallelize(rows)
scala> import org.apache.spark.sql.types.{StructType,StructField}
scala> import org.apache.spark.sql.types.{StringType,IntegerType}
scala> val schema = StructType(Array(
     | StructField("dest_country",StringType,true),
     | StructField("origin_country",StringType,true),
     | StructField("count",IntegerType,true)
     | ))
scala> val newDF = spark.createDataFrame(rdd,schema)
scala> newDF.show
+------------+--------------+-----+
|dest_country|origin_country|count|
+------------+--------------+-----+
| New Country| Other Country|    5|
|New Country2|Other Country3|    1|
+------------+--------------+-----+

scala> newDF.printSchema
root
 |-- dest_country: string (nullable = true)
 |-- origin_country: string (nullable = true)
 |-- count: integer (nullable = true)


scala> df.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
// schema after the union: column names play no part
scala> df.union(newDF).printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

scala> df.union(newDF).where(col("DEST_COUNTRY_NAME").contains("New Country")).show
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|      New Country|      Other Country|    5|
|     New Country2|     Other Country3|    1|
+-----------------+-------------------+-----+

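union does not check whether the two Schemas line up by name; it only requires the same number of columns. Column types are widened to a common type where possible (in the output below, a long column unioned with a string column comes out as string).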

scala> val df3 = df.select("ORIGIN_COUNTRY_NAME","count")
scala> df3.printSchema
root
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
// the number of columns must match
scala> df.union(df3)
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 3 columns and the second table has 2 columns;;

scala> val df4 = df3.withColumn("newCol",lit("spark"))
scala> df4.printSchema
root
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
 |-- newCol: string (nullable = false)
// note the names and types of the resulting columns
scala> df.union(df4).printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: string (nullable = true)
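When the two DataFrames have the same column names in a different order, positional matching silently mixes the data up. Spark 2.3 and later also offer unionByName, which the notes above do not cover. Here is a sketch contrasting the two on made-up data with a local session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("union-by-name").getOrCreate()
import spark.implicits._

val a = Seq(("United States", "Romania")).toDF("dest", "origin")
val b = Seq(("Croatia", "Ireland")).toDF("origin", "dest") // same names, swapped order

// union pairs columns by position, so b's origin value lands in the dest column.
val byPosition = a.union(b)

// unionByName pairs columns by name instead.
val byName = a.unionByName(b)

byPosition.show()
byName.show()
```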

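Sorting

Spark SQL provides two methods, sort and orderBy. Both accept column-name strings, expressions, and Column objects, and both sort in ascending order (asc) by default.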

import org.apache.spark.sql.functions.{asc,desc}
df.sort("count").show(2)
df.sort(desc("count")).show(2)
df.sort(col("count").desc).show(2)
df.sort(expr("count").desc_nulls_first).show(2)
df.orderBy(desc("count"), asc("DEST_COUNTRY_NAME")).show(2)
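// this one had no effect when I tried it, apparently because "count desc" is parsed as the column count aliased as desc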
df.orderBy(expr("count desc")).show(2)

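Note the desc_nulls_first method used above. There is also desc_nulls_last, and asc has the same two variants. They control whether null values appear first or last in the sorted result.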
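The effect of the null-ordering variants is easy to see on a tiny DataFrame that contains a null. The three rows and the local session below are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("null-ordering").getOrCreate()
import spark.implicits._

// Option[Int] gives a nullable integer column; None becomes null.
val counts = Seq[(String, Option[Int])](("a", Some(3)), ("b", None), ("c", Some(1)))
  .toDF("name", "count")

// Descending, with the null row placed first.
val nullsFirst = counts.sort(col("count").desc_nulls_first).collect()

// Ascending, with the null row placed last.
val nullsLast = counts.sort(col("count").asc_nulls_last).collect()
```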

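For optimization purposes it is sometimes advisable to sort within each partition before another set of transformations. You can do this with the sortWithinPartitions method: spark.read.format("json").load("/data/flight-data/json/*-summary.json").sortWithinPartitions("count")

First n Rows (limit)

This works like taking the first n rows in MySQL, select * from table limit 10;. Spark SQL offers the same thing as a method: df.limit(10).show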

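Repartition and Coalesce

When a Spark job suffers from data skew, first check in the UI whether the data is unevenly distributed. If it is, you can adjust the number of partitions to raise parallelism, so that keys crowded into the same partition are spread across more partitions. See my earlier post: A Summary of Fixes for Data Skew in MapReduce, Hive, and Spark. The repartition and coalesce methods are useful here.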

def repartition(partitionExprs: org.apache.spark.sql.Column*)
def repartition(numPartitions: Int,partitionExprs: org.apache.spark.sql.Column*)
def repartition(numPartitions: Int): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]

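In these three signatures, the Column arguments name the columns to partition by and numPartitions is the number of partitions. Note that repartition always performs a full shuffle of the data.

// repartition by a column
df.repartition(col("DEST_COUNTRY_NAME"))
// also specify the number of partitions
df.repartition(5, col("DEST_COUNTRY_NAME"))
// check the number of partitions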
df.rdd.getNumPartitions

coalesce, by contrast, does not trigger a full shuffle; it tries to merge existing partitions.

df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)
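The partition counts can be checked directly with getNumPartitions. The following sketch uses a made-up DataFrame of 100 ids and a local session, both assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("partitions").getOrCreate()

val data = spark.range(0, 100).toDF("id")

// repartition shuffles everything into exactly the requested number of partitions.
val five = data.repartition(5)

// coalesce merges existing partitions without a full shuffle.
val two = five.coalesce(2)

// coalesce cannot increase the count; asking for more leaves it unchanged.
val stillFive = five.coalesce(10)

println(Seq(five, two, stillFive).map(_.rdd.getNumPartitions))
```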

Collecting Rows to the Driver

The methods collect, take, and show bring some data back to the driver so that you can inspect or process it locally.

scala> df.take
   def take(n: Int): Array[org.apache.spark.sql.Row]
scala> df.takeAsList
   def takeAsList(n: Int): java.util.List[org.apache.spark.sql.Row]
scala> df.collectAsList
   def collectAsList(): java.util.List[org.apache.spark.sql.Row]
scala> df.collect
   def collect(): Array[org.apache.spark.sql.Row]

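One caveat: use collect with care. It returns all of the data to the driver, and if the data is too large to fit in memory the driver will crash. show also accepts a Boolean truncate parameter that controls whether strings longer than 20 characters are cut off (pass false to print long values in full).

Another method, toLocalIterator, returns the data to the driver one partition at a time as an iterator, so that you can iterate over the whole dataset. It can still crash the driver if a single partition is too large.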
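To make the difference concrete, here is a small sketch of take, show with truncation turned off, and toLocalIterator. The 1000-row DataFrame and the local session are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session and toy data: assumptions for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("to-driver").getOrCreate()

val data = spark.range(0, 1000).toDF("id")

// take brings only the first n rows to the driver.
val firstTwo = data.take(2)

// show(numRows, truncate): false prints long values in full.
data.show(2, false)

// toLocalIterator returns a java.util.Iterator that pulls one partition at a time.
val it = data.toLocalIterator()
var seen = 0L
while (it.hasNext) {
  it.next()
  seen += 1
}
```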