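My environment: Hadoop 2.7.1, Spark 1.6.0, Hive 2.0, Java 1.7.
Goal: submit and run a Spark application with java -jar xxx.jar and have it execute a Hive SQL query.
Problem 1: First of all, launching with a plain java -jar fails with java.lang.OutOfMemoryError: PermGen space, so the JVM has to be started with the following options: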
java -Xms1024m -Xmx1024m -XX:MaxNewSize=256m -XX:MaxPermSize=256m -jar spark.jar
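To confirm that these options actually reached the JVM, one option is to print the non-heap memory pools at startup. The following is only a sketch using the standard java.lang.management API; the class name MemoryPoolReport is made up for illustration and is not part of the original project. On Java 7 the output should include a permanent-generation pool (its exact name depends on the garbage collector); on Java 8 and later PermGen no longer exists and Metaspace is listed instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
class MemoryPoolReport {

    // Returns one "name: max=<bytes>" line per non-heap memory pool.
    static List<String> nonHeapPools() {
        List<String> lines = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.NON_HEAP) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the pool has no defined limit
            lines.add(pool.getName() + ": max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : nonHeapPools()) {
            System.out.println(line);
        }
    }
}
```

Calling nonHeapPools() at the top of main in the Spark driver and logging the result makes it easy to see whether -XX:MaxPermSize was picked up.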
Problem 2: If the three datanucleus jars are not added to the classpath, the error below is reported (see http://zengzhaozheng.blog.51cto.com/8219051/1597902?utm_source=tuicool&utm_medium=referral):
javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
    at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
    ...
NestedThrowablesStackTrace:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018)
    at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
    at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
    at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
    at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
    at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
    ...
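Because this failure only surfaces deep inside the Hive metastore initialization, it can help to check for the missing class before creating any Spark context. The sketch below uses only Class.forName; the class name DataNucleusCheck is hypothetical, and the only class it probes is the one named in the exception above. It does not verify the datanucleus-core or datanucleus-rdbms jars, which would need similar checks against classes from those jars.

```java
// Hypothetical preflight check, not part of the original project.
class DataNucleusCheck {

    // The class named in the JDOFatalUserException.
    static final String JDO_PMF =
            "org.datanucleus.api.jdo.JDOPersistenceManagerFactory";

    // Returns true if the class can be located by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, DataNucleusCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // the class was found but one of its dependencies is missing
            return false;
        }
    }

    public static void main(String[] args) {
        if (!isOnClasspath(JDO_PMF)) {
            System.err.println("Missing " + JDO_PMF
                    + ": add the datanucleus jars to the classpath");
            System.exit(1);
        }
        System.out.println("datanucleus-api-jdo found");
    }
}
```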
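Problem 3: The master set on SparkConf in the Java code determines which Spark mode you run in. I use yarn-client mode here; setting yarn-cluster instead causes an error. See http://stackoverflow.com/questions/31327275/pyspark-on-yarn-cluster-mode, whose answer can be summarized as follows:
1. If you want to embed the Spark code directly in your web app, you need to use yarn-client.
2. If you want your Spark code decoupled loosely enough that yarn-cluster mode is actually usable, you can spawn a Python subprocess that calls spark-submit in yarn-cluster mode.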
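The second option can also be sketched in Java with ProcessBuilder, to stay in the same language as the rest of this post. This is an illustration only: the class name SparkSubmitLauncher and the spark-submit location /data/spark/bin/spark-submit are assumptions, while --master and --class are standard spark-submit options. An application submitted this way would likely need its hard-coded setMaster("yarn-client") removed so that the master comes from the command line.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical launcher, not part of the original project.
class SparkSubmitLauncher {

    // Builds the spark-submit command line for yarn-cluster mode.
    static List<String> buildCommand(String sparkSubmit, String mainClass, String appJar) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(sparkSubmit);
        cmd.add("--master");
        cmd.add("yarn-cluster");
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add(appJar);
        return cmd;
    }

    // Runs the command, forwarding its output, and returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.inheritIO();
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand(
                "/data/spark/bin/spark-submit",      // assumed install location
                "cn.centaur.test.spark.SimpleDemo",
                "/data/houxm/spark/spark.jar");
        System.exit(run(cmd));
    }
}
```

The web application then only depends on the exit code of the child process, not on any Spark classes.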
Problem 4: Three configuration files have to be added: core-site.xml, hdfs-site.xml and hive-site.xml. Without them, the java -jar command fails immediately.
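A quick way to see whether the three files ended up inside the jar is to look them up as classpath resources before starting Spark. This is a minimal sketch with a made-up class name, ConfigCheck; it only checks that the files are visible to the class loader, not that their contents are valid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical preflight check, not part of the original project.
class ConfigCheck {

    // Returns the names that cannot be found as classpath resources.
    static List<String> missingResources(List<String> names) {
        ClassLoader loader = ConfigCheck.class.getClassLoader();
        List<String> missing = new ArrayList<String>();
        for (String name : names) {
            if (loader.getResource(name) == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingResources(
                Arrays.asList("core-site.xml", "hdfs-site.xml", "hive-site.xml"));
        if (!missing.isEmpty()) {
            System.err.println("Missing from classpath: " + missing);
            System.exit(1);
        }
        System.out.println("All configuration files found");
    }
}
```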
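With all of that in mind, the correct way to call Spark from Java and run Hive SQL is as follows.
Create a Java project and add spark-assembly-1.6.0-hadoop2.6.0.jar to it. This jar is in the lib directory of the Spark installation; at 178 MB it is really large.
The Java code is shown below. It will later be packaged as spark.jar and stored at /data/houxm/spark/spark.jar: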
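package cn.centaur.test.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class SimpleDemo {

    public static void main(String[] args) {
        // Application jar that is shipped to the executors
        String[] jars = new String[]{"/data/houxm/spark/spark.jar"};
        SparkConf conf = new SparkConf()
                .setAppName("simpledemo")
                .setMaster("yarn-client")
                // SparkConf expects property names; "executor-memory" and
                // "driver-class-path" are spark-submit flag names, not properties
                .set("spark.executor.memory", "2g")
                .setJars(jars)
                // Has no effect in yarn-client mode because the driver JVM is already
                // running; the driver classpath comes from the manifest Class-Path
                .set("spark.driver.extraClassPath", "/data/spark/lib/mysql-connector-java-5.1.21.jar");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveCtx = new HiveContext(sc);
        testHive(hiveCtx);
        // close() only calls stop() again, so a single stop() is enough
        sc.stop();
    }

    // Test a Spark SQL query against a Hive table
    public static void testHive(HiveContext hiveCtx) {
        hiveCtx.sql("create table temp_spark_java as select mobile,num from default.mobile_id_num02 limit 10");
    }
}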
In the root directory of the Java project, create a MANIFEST.MF file with the following contents:
Manifest-Version: 1.0
Class-Path: /data/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
  /data/spark/lib/mysql-connector-java-5.1.21.jar
  /data/spark/lib/datanucleus-api-jdo-3.2.6.jar
  /data/spark/lib/datanucleus-core-3.2.10.jar
  /data/spark/lib/datanucleus-rdbms-3.2.9.jar
Main-Class: cn.centaur.test.spark.SimpleDemo
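Hand-written manifests are easy to get wrong, since the format limits line length and uses leading spaces for continuation lines. As an alternative, the same content can be generated with the standard java.util.jar.Manifest class, which handles the line wrapping itself. The class name ManifestWriter below is hypothetical; this is a sketch, not the way the original jar was built.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper, not part of the original project.
class ManifestWriter {

    // Builds a manifest with Manifest-Version, Main-Class and a space-separated Class-Path.
    static Manifest build(String mainClass, String... classPathEntries) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise write() emits nothing
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.MAIN_CLASS, mainClass);
        StringBuilder classPath = new StringBuilder();
        for (String entry : classPathEntries) {
            if (classPath.length() > 0) {
                classPath.append(' ');
            }
            classPath.append(entry);
        }
        main.put(Attributes.Name.CLASS_PATH, classPath.toString());
        return manifest;
    }

    // Renders the manifest as text, with long lines wrapped by the library.
    static String render(Manifest manifest) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            manifest.write(out);
            return out.toString("UTF-8");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Manifest manifest = build(
                "cn.centaur.test.spark.SimpleDemo",
                "/data/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar",
                "/data/spark/lib/mysql-connector-java-5.1.21.jar",
                "/data/spark/lib/datanucleus-api-jdo-3.2.6.jar",
                "/data/spark/lib/datanucleus-core-3.2.10.jar",
                "/data/spark/lib/datanucleus-rdbms-3.2.9.jar");
        System.out.print(render(manifest));
    }
}
```

The printed text can be saved as MANIFEST.MF and used for packaging in the same way as a hand-written file.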
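Put the three configuration files core-site.xml, hdfs-site.xml and hive-site.xml into the resources directory (mine is a Maven project; in a plain Java project, placing the files under src is enough).
In Eclipse, package the Java code using this manifest file. Upload the generated jar to the server and it is ready to run.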