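Spark SQL Data Sources: Hive
Author: Yin Zhengjie
Copyright notice: this is an original work. Reproduction is not permitted, and violations will be pursued legally.
I. Using Hive
1>. Using the embedded Hive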
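Apache Hive is a SQL engine on Hadoop. Spark SQL can be compiled with or without Hive support. When built with Hive support, Spark SQL can access Hive tables, use UDFs (user-defined functions), and run the Hive query language (HiveQL/HQL).
Note that including the Hive libraries in Spark SQL does not require Hive to be installed beforehand. In general it is best to build Spark SQL with Hive support so that these features are available. If you downloaded a binary distribution of Spark, it should already have been built with Hive support.
To connect Spark SQL to an existing Hive deployment, you must copy hive-site.xml into Spark's configuration directory ($SPARK_HOME/conf). Spark SQL also runs when no Hive deployment exists.
In that case, Spark SQL creates its own Hive metastore, named metastore_db, in the current working directory.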
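In addition, if you create a table with the HiveQL CREATE TABLE statement (as opposed to CREATE EXTERNAL TABLE), the table is placed under /user/hive/warehouse on your default file system (HDFS if a configured hdfs-site.xml is on your classpath, otherwise the local file system).
Using the embedded Hive requires no setup at all; it works out of the box.
You can also specify the warehouse location on first start by passing a parameter:
--conf spark.sql.warehouse.dir=hdfs://hadoop101.yinzhengjie.org.cn:9000/spark-wearhouse
Tip:
If you use the embedded Hive, then from Spark 2.0 onwards spark.sql.warehouse.dir specifies the warehouse location. If the path must be on HDFS, add core-site.xml and hdfs-site.xml to the Spark conf directory. Otherwise the warehouse directory is created only on the master node, and queries fail with file-not-found errors. If you hit this problem and need to switch to HDFS, delete the metastore and restart the cluster.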
[root@hadoop105.yinzhengjie.org.cn ~]# vim /tmp/id.txt
[root@hadoop105.yinzhengjie.org.cn ~]#
[root@hadoop105.yinzhengjie.org.cn ~]# cat /tmp/id.txt
100
200
3
400
500
[root@hadoop105.yinzhengjie.org.cn ~]#
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

scala> spark.sql("create table test(id int)")
20/07/15 04:10:36 WARN HiveMetaStore: Location: file:/root/spark-warehouse/test specified for non-external table:test
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|     test|      false|
+--------+---------+-----------+

scala> spark.sql("load data local inpath '/tmp/id.txt' into table test")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test").show
+---+
| id|
+---+
|100|
|200|
|  3|
|400|
|500|
+---+

scala>
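To see where the embedded Hive actually stored the test table, a helper along the following lines could be pasted into the same spark-shell session (using :paste). This is only a minimal sketch: the object WarehouseInspector and its method names are hypothetical and not part of Spark, while spark.conf.get, DESCRIBE FORMATTED and Dataset.show are standard Spark SQL features.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration; not part of Spark.
object WarehouseInspector {
  // Returns the warehouse directory the session is configured with.
  def warehouseDir(spark: SparkSession): String =
    spark.conf.get("spark.sql.warehouse.dir")

  // Prints the table's metadata (including its Location row) without truncating values.
  def describe(spark: SparkSession, table: String): Unit =
    spark.sql(s"DESCRIBE FORMATTED $table").show(100, false)
}
```

In spark-shell, where `spark` is the pre-built session, `println(WarehouseInspector.warehouseDir(spark))` followed by `WarehouseInspector.describe(spark, "test")` should report a location consistent with the path in the HiveMetaStore warning above.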
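2>. Using an external Hive
To connect to an external, already deployed Hive, follow these steps.
(1) Copy or symlink Hive's hive-site.xml into the conf directory of the Spark installation.
(2) Start spark-shell and pass the JDBC driver for the Hive metastore database, as shown below (if the driver JAR is already in the jars directory of the Spark installation, the "--jars" option can be omitted).
[root@hadoop105.yinzhengjie.org.cn ~]# spark-shell --jars mysql-connector-java-5.1.36-bin.jar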
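Once spark-shell is up, one way to check that the session really sees the external Hive is to look at the catalog. The following is a minimal sketch under stated assumptions: the object HiveConnectionCheck and its method names are hypothetical, whereas spark.sql.catalogImplementation and spark.catalog.listDatabases() are existing Spark features. A "hive" catalog only shows that Hive support is enabled; whether the external metastore was reached is better judged by whether databases created in that Hive appear in the listing.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper for illustration; not part of Spark.
object HiveConnectionCheck {
  // True when the session uses the Hive catalog rather than the in-memory one.
  def usesHiveCatalog(spark: SparkSession): Boolean =
    spark.conf.get("spark.sql.catalogImplementation", "in-memory") == "hive"

  // Names of all databases visible through the session's catalog.
  def databaseNames(spark: SparkSession): Seq[String] =
    spark.catalog.listDatabases().collect().map(_.name).toSeq
}
```

In spark-shell this could be used as `HiveConnectionCheck.usesHiveCatalog(spark)` and `HiveConnectionCheck.databaseNames(spark).foreach(println)`.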
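II. Running the Spark SQL CLI
The Spark SQL CLI makes it easy to run the Hive metastore service locally and to execute queries from the command line. The result is equivalent to running the same SQL statements through spark.sql("...") in spark-shell.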
[root@hadoop105.yinzhengjie.org.cn ~]# spark-sql
20/07/15 04:28:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/15 04:28:35 INFO SparkContext: Running Spark version 2.4.6
20/07/15 04:28:35 INFO SparkContext: Submitted application: SparkSQL::172.200.4.105
20/07/15 04:28:35 INFO SecurityManager: Changing view acls to: root
20/07/15 04:28:35 INFO SecurityManager: Changing modify acls to: root
20/07/15 04:28:35 INFO SecurityManager: Changing view acls groups to:
20/07/15 04:28:35 INFO SecurityManager: Changing modify acls groups to:
20/07/15 04:28:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/07/15 04:28:35 INFO Utils: Successfully started service 'sparkDriver' on port 33260.
20/07/15 04:28:35 INFO SparkEnv: Registering MapOutputTracker
20/07/15 04:28:35 INFO SparkEnv: Registering BlockManagerMaster
20/07/15 04:28:35 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/15 04:28:35 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/15 04:28:35 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ae5de0c5-5282-4cc6-8ce6-d1dbe34e82e9
20/07/15 04:28:35 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/15 04:28:35 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/15 04:28:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/15 04:28:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hadoop105.yinzhengjie.org.cn:4040
20/07/15 04:28:36 INFO Executor: Starting executor ID driver on host localhost
20/07/15 04:28:36 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 29320.
20/07/15 04:28:36 INFO NettyBlockTransferService: Server created on hadoop105.yinzhengjie.org.cn:29320
20/07/15 04:28:36 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/15 04:28:36 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hadoop105.yinzhengjie.org.cn, 29320, None)
20/07/15 04:28:36 INFO BlockManagerMasterEndpoint: Registering block manager hadoop105.yinzhengjie.org.cn:29320 with 366.3 MB RAM, BlockManagerId(driver, hadoop105.yinzhengjie.org.cn, 29320, None)
20/07/15 04:28:36 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hadoop105.yinzhengjie.org.cn, 29320, None)
20/07/15 04:28:36 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, hadoop105.yinzhengjie.org.cn, 29320, None)
20/07/15 04:28:36 INFO EventLoggingListener: Logging events to hdfs://hadoop101.yinzhengjie.org.cn:9000/yinzhengjie/spark/jobhistory/local-1594758516077
20/07/15 04:28:36 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/root/spark-warehouse/').
20/07/15 04:28:36 INFO SharedState: Warehouse path is 'file:/root/spark-warehouse/'.
20/07/15 04:28:37 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/15 04:28:37 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/07/15 04:28:37 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is file:/root/spark-warehouse/
20/07/15 04:28:37 INFO metastore: Mestastore configuration hive.metastore.warehouse.dir changed from /user/hive/warehouse to file:/root/spark-warehouse/
20/07/15 04:28:37 INFO HiveMetaStore: 0: Shutting down the object store...
20/07/15 04:28:37 INFO audit: ugi=root ip=unknown-ip-addr cmd=Shutting down the object store...
20/07/15 04:28:37 INFO HiveMetaStore: 0: Metastore shutdown complete.
20/07/15 04:28:37 INFO audit: ugi=root ip=unknown-ip-addr cmd=Metastore shutdown complete.
20/07/15 04:28:37 INFO HiveMetaStore: 0: get_database: default
20/07/15 04:28:37 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
20/07/15 04:28:37 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
20/07/15 04:28:37 INFO ObjectStore: ObjectStore, initialize called
20/07/15 04:28:37 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
20/07/15 04:28:37 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
20/07/15 04:28:37 INFO ObjectStore: Initialized ObjectStore
Spark master: local[*], Application Id: local-1594758516077
20/07/15 04:28:37 INFO SparkSQLCLIDriver: Spark master: local[*], Application Id: local-1594758516077
spark-sql> show tables;        #list the existing tables
20/07/15 04:29:19 INFO HiveMetaStore: 0: get_database: global_temp
20/07/15 04:29:19 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: global_temp
20/07/15 04:29:19 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
20/07/15 04:29:19 INFO HiveMetaStore: 0: get_database: default
20/07/15 04:29:19 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
20/07/15 04:29:19 INFO HiveMetaStore: 0: get_database: default
20/07/15 04:29:19 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
20/07/15 04:29:19 INFO HiveMetaStore: 0: get_tables: db=default pat=*
20/07/15 04:29:19 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_tables: db=default pat=*
20/07/15 04:29:19 INFO CodeGenerator: Code generated in 184.459346 ms
default	test	false        #clearly, there is only one table at the moment
Time taken: 1.518 seconds, Fetched 1 row(s)
20/07/15 04:29:19 INFO SparkSQLCLIDriver: Time taken: 1.518 seconds, Fetched 1 row(s)
spark-sql>
III. Using Hive in Code
1>. Add the dependencies
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
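2>. Enable Hive support when creating the SparkSession
import java.io.File

import org.apache.spark.sql.SparkSession

// Absolute path used as the default location for managed databases and tables.
val warehouseLocation: String = new File("spark-warehouse").getAbsolutePath

/**
 * When using an external Hive, hive-site.xml must be added to the classpath.
 */
val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation) // The embedded Hive needs a warehouse location; an external Hive does not.
  .enableHiveSupport() // Enable Hive support.
  .getOrCreate()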
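With a Hive-enabled session such as the `spark` value built above, a quick round trip can confirm that tables can be created and queried from code. The following is a minimal sketch, not a definitive test: the object HiveSmokeTest and the table name demo_ids are hypothetical, and the sketch drops any existing table of that name, so it should only be pointed at a scratch environment.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical smoke test for illustration; not part of Spark.
object HiveSmokeTest {
  // Hypothetical scratch table; it is dropped and recreated on every run.
  val TableName = "demo_ids"

  // Creates the table, inserts three rows, prints them, and returns the row count.
  def run(spark: SparkSession): Long = {
    spark.sql(s"DROP TABLE IF EXISTS $TableName")
    spark.sql(s"CREATE TABLE $TableName (id INT)") // requires enableHiveSupport()
    spark.sql(s"INSERT INTO $TableName VALUES (100), (200), (3)")
    spark.sql(s"SELECT id FROM $TableName ORDER BY id").show()
    spark.table(TableName).count()
  }
}
```

Calling `HiveSmokeTest.run(spark)` on that session should print the three inserted rows; if the session was built without enableHiveSupport(), the CREATE TABLE statement is expected to fail instead.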