SparkSession - The Entry Point to Spark SQL

Date: 2019-11-20



Translated from: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-SparkSession.html

Overview

SparkSession is the entry point to Spark SQL. It is the first object you create when writing a Spark SQL application with the Dataset or DataFrame API.

Note: As of Spark 2.0, SparkSession merges SQLContext and HiveContext into a single object.

You create a SparkSession instance with SparkSession.builder and stop it with the stop method.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .appName("My Spark Application")  // optional; a name is autogenerated if not specified
  .master("local[*]")               // local mode for this example; avoid hardcoding the master in production code
  .enableHiveSupport()              // needs Hive classes on the classpath
  .config("spark.sql.warehouse.dir", "target/spark-warehouse")
  .getOrCreate()

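You can use several SparkSessions in a single Spark application. Each SparkSession keeps its own set of relational entities, so they stay isolated from one another (see the catalog attribute).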

scala> spark.catalog.listTables.show
+------------------+--------+-----------+---------+-----------+
|              name|database|description|tableType|isTemporary|
+------------------+--------+-----------+---------+-----------+
|my_permanent_table| default|       null|  MANAGED|      false|
|              strs|    null|       null|TEMPORARY|       true|
+------------------+--------+-----------+---------+-----------+
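The isolation between sessions can be sketched as follows. This is a minimal example, not taken from the original text: it assumes a local master, and the application name and the view name strs are chosen only for illustration. A temporary view registered in one session is expected to be absent from the catalog of a session created with newSession.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local master, so the sketch runs without a cluster.
val spark = SparkSession.builder
  .appName("session-isolation-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Register a temporary view in the first session only.
Seq("a", "b").toDF("value").createOrReplaceTempView("strs")

// newSession shares the SparkContext but keeps its own temporary views.
val other = spark.newSession()

// Look the view up by name in each session's catalog.
val visibleInFirst = spark.catalog.listTables().collect().exists(_.name == "strs")
val visibleInOther = other.catalog.listTables().collect().exists(_.name == "strs")
println(s"first session: $visibleInFirst, new session: $visibleInOther")

spark.stop()
```

Listing the catalog rather than querying the view keeps the check from throwing when the view is missing.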

Internally, a SparkSession holds a SparkContext, a SharedState, and a SessionState. The table below outlines what each of them does:

Name	Type	Description
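sparkContext	SparkContext	The main entry point to Spark functionality. Use sparkContext to create RDDs, accumulators, and broadcast variables on a cluster.
existingSharedState	Option[SharedState]	An internal class that holds the state shared across sessions.
parentSessionState	Option[SessionState]	The parent session's state, which the new session copies.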

The table below lists the methods of SparkSession. They cover creating Datasets and DataFrames, reading streaming data, and more.

Method	Description
builder	"Opens" a builder to get or create a SparkSession instance.
version	Returns the current version of Spark.
implicits	Use import spark.implicits._ to import the implicit conversions and create Datasets from (almost arbitrary) Scala objects.
emptyDataset[T]	Creates an empty Dataset[T].
range	Creates a Dataset[Long].
sql	Executes a SQL query (and returns a DataFrame).
udf	Access to user-defined functions (UDFs).
table	Creates a DataFrame from a table.
catalog	Access to the catalog of the entities of structured queries.
read	Access to DataFrameReader to read a DataFrame from external files and storage systems.
conf	Access to the current runtime configuration.
readStream	Access to DataStreamReader to read streaming datasets.
streams	Access to StreamingQueryManager to manage structured streaming queries.
newSession	Creates a new SparkSession.
stop	Stops the SparkSession.
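
A few of these methods can be tried together in a short session. The sketch below is illustrative only: the local master, the application name, the view name ids, and the choice of spark.sql.shuffle.partitions as the option to set are all assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("sparksession-methods-sketch")
  .master("local[*]")   // assumption: local run
  .getOrCreate()
import spark.implicits._

// range creates a Dataset of longs with a single column named "id".
val ids = spark.range(5)
val idCount = ids.count()

// emptyDataset needs an implicit Encoder, supplied by spark.implicits._
val emptyCount = spark.emptyDataset[String].count()

// sql returns a DataFrame; here it queries a temporary view built from the range.
ids.createOrReplaceTempView("ids")
val evenIds = spark.sql("SELECT id FROM ids WHERE id % 2 = 0").as[Long].collect().sorted

// conf reads and writes the runtime configuration of this session.
spark.conf.set("spark.sql.shuffle.partitions", "4")
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions")

println(s"Spark ${spark.version}: $idCount ids, even ids = ${evenIds.mkString(",")}")

spark.stop()
```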

Builder

Builder is the builder for SparkSession. Use it to set configuration options before the session is created.
Builder offers the following methods:

Method	Description
getOrCreate	Gets an existing SparkSession or creates a new one.
enableHiveSupport	Enables Hive support.
appName	Sets the application name.
config	Sets a configuration option.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .appName("My Spark Application")  // optional; a name is autogenerated if not specified
  .master("local[*]")               // local mode for this example; avoid hardcoding the master in production code
  .enableHiveSupport()              // needs Hive classes on the classpath
  .getOrCreate()
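
The "get or create" behaviour is worth a quick check. The following sketch assumes a local master and uses spark.sql.shuffle.partitions purely as an illustrative option. It builds a session, calls getOrCreate a second time, and compares the two references.

```scala
import org.apache.spark.sql.SparkSession

val first = SparkSession.builder
  .appName("builder-sketch")
  .master("local[*]")                           // assumption: local run
  .config("spark.sql.shuffle.partitions", "8")  // illustrative option
  .getOrCreate()

// A second getOrCreate call finds the existing session instead of building a new one.
val second = SparkSession.builder.getOrCreate()
val sameSession = first eq second

// The option set through the builder is readable from the session's runtime configuration.
val partitions = second.conf.get("spark.sql.shuffle.partitions")
println(s"same session: $sameSession, shuffle partitions: $partitions")

first.stop()
```

Because the second call reuses the session, options passed to a later builder are applied to the existing session rather than producing a fresh one.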

SharedState

SharedState is an internal class of SparkSession that holds the state shared across multiple active sessions. The table below describes its properties.

Name	Type	Description
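cacheManager	CacheManager	A support class for SQLContext that automatically caches query results, so that later queries can reuse them during execution.
externalCatalog	ExternalCatalog	Holds the catalog of the external system.
globalTempViewManager	GlobalTempViewManager	A thread-safe class that manages global temporary views and provides atomic create, update, and remove operations on them.
jarClassLoader	NonClosableMutableURLClassLoader	Loads the JAR files added by the user.
listener	SQLListener	A listener class.
sparkContext	SparkContext	The core entry point to Spark.
warehousePath	String	Location of the warehouse directory. It can be set with spark.sql.warehouse.dir or with hive.metastore.warehouse.dir in hive-site.xml; the Spark setting takes precedence over the Hive one.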
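
One visible effect of this shared state is that global temporary views and the SparkContext are common to all sessions. The sketch below is a hedged illustration: the local master, the application name, and the view name numbers are assumptions, and global_temp is the default name of the global temporary database (it can be changed through configuration).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("shared-state-sketch")
  .master("local[*]")   // assumption: local run
  .getOrCreate()
import spark.implicits._

val other = spark.newSession()

// A global temporary view is registered in shared state,
// under the global_temp database by default.
Seq(1, 2, 3).toDF("n").createGlobalTempView("numbers")

// The view is queried from a different session than the one that created it.
val seenFromOther = other.sql("SELECT n FROM global_temp.numbers").count()

// Both sessions run on the same SparkContext.
val sameContext = spark.sparkContext eq other.sparkContext
println(s"rows seen from the other session: $seenFromOther, same context: $sameContext")

spark.stop()
```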