Notes on Setting Up a Spark and Python Development Environment on CentOS 6

1. Starting pyspark from $SPARK_HOME/bin/ fails with the following error:

Traceback (most recent call last):
  File "/home/joy/spark/spark/python/pyspark/shell.py", line 28, in <module>
    import py4j
zipimport.ZipImportError: can't decompress data; zlib not available

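Following the search results, I first ran yum install -y zlib* to install the missing packages, but the error persisted. Running ./pyspark with sudo then worked. At that point pyspark only ran under sudo, which looked like an environment problem. The cause turned out to be file ownership: Spark had been installed with sudo, so its files were owned by root, and changing the owner with chown fixed it.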
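To see which files a chown would need to touch, a small script can list everything under the Spark directory that is not owned by the current user. This is only a sketch: the function name files_not_owned_by is made up for illustration, and it assumes a POSIX system where os.lstat reports the owner uid.

```python
import os


def files_not_owned_by(root_dir, uid):
    """Return the paths under root_dir whose owner is not the given uid."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects the entry itself and does not follow symlinks
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)


# Example use (the path is whatever SPARK_HOME points to on your machine):
#   for path in files_not_owned_by(os.environ["SPARK_HOME"], os.geteuid()):
#       print(path)
```

An empty result means chown has nothing left to fix and the need for sudo has another cause.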
However, this still means pip has to be installed with sudo. To fix the problem once and for all, I recompiled Python, following:
http://blog.csdn.net/woszsj/article/details/16848871
Solution:

1. Install the dependencies zlib and zlib-devel

2. Recompile and install Python

./configure
Edit the Modules/Setup file.
Find the following line and uncomment it:

zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz
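The uncommenting step can also be scripted. The sketch below works on the text of Modules/Setup and assumes the line is commented out with a leading # as shown above; uncomment_zlib is a hypothetical helper name, not part of the Python build system.

```python
import re


def uncomment_zlib(setup_text):
    """Remove the leading '#' from the zlib line in the text of Modules/Setup."""
    # [ \t]* (not \s*) so the match never runs across a line break
    pattern = re.compile(r"^#[ \t]*(zlib[ \t]+zlibmodule\.c\b.*)$", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1), setup_text)


# Example use, run from the Python source directory:
#   with open("Modules/Setup") as fh:
#       text = fh.read()
#   with open("Modules/Setup", "w") as fh:
#       fh.write(uncomment_zlib(text))
```

All other commented lines are left untouched, so running it twice is harmless.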

Recompile and install: make && make install
After the build, some modules were still reported as not built:
Python build finished, but the necessary bits to build these modules were not found:
_bsddb _curses _curses_panel
_sqlite3 _ssl _tkinter
bsddb185 bz2 dbm
dl gdbm imageop
Reference:
http://blog.csdn.net/huanle0610/article/details/41174943

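Whatever the exact error message, the meaning is clear: at build time the system could not find what it needed for these modules. To resolve the errors, install the dependency packages beforehand. The modules and their dependencies are listed below (the list is not necessarily complete):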

Module | Dependency | Description
_bsddb | bsddb | Interface to Berkeley DB library. bsddb is deprecated since 2.6; the recommendation is to use the bsddb3 module.
_curses | ncurses | Terminal handling for character-cell displays.
_curses_panel | ncurses | A panel stack extension for curses.
_sqlite3 | sqlite | DB-API 2.0 interface for SQLite databases. On CentOS, install sqlite-devel.
_ssl | openssl-devel.i686 | TLS/SSL wrapper for socket objects.
_tkinter | N/A | A thin object-oriented layer on top of Tcl/Tk. Can be ignored if you do not write desktop programs.
bsddb185 | old bsddb module | The old bsddb module; can be ignored.
bz2 | bzip2-devel.i686 | Compression compatible with bzip2.
dbm | bsddb | Simple "database" interface.
dl | N/A | Call C functions in shared objects. Deprecated since Python 2.6.
gdbm | gdbm-devel.i686 | GNU's reinterpretation of dbm.
imageop | N/A | Manipulate raw image data. Deprecated.
readline | readline-devel | GNU readline interface.
sunaudiodev | N/A | Access to Sun audio hardware. Specific to Sun platforms; can be ignored on CentOS.
zlib | Zlib | Compression compatible with gzip.

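On CentOS, the following dependency packages can be installed: readline-devel, sqlite-devel, bzip2-devel.i686, openssl-devel.i686, gdbm-devel.i686, libdbi-devel.i686, ncurses-libs, zlib-devel.i686. After installing them, build again. Errors for modules that the table above marks as deprecated or ignorable can be ignored.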

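Once the build completes, continue with the install step (step 6 in the referenced article) and install Python into the chosen directory. After installation, check the installation directory to confirm that Python was installed correctly.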
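One way to confirm the rebuilt interpreter is usable is to run a short check with it. The following is a minimal sketch: the helper names missing_modules and zlib_works are made up here, and the module list you pass in is up to you (for example the names from the build report above).

```python
import importlib


def missing_modules(names):
    """Return the subset of names that cannot be imported by this interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def zlib_works():
    """Return True if zlib is importable and can round-trip some data."""
    try:
        import zlib
    except ImportError:
        return False
    data = b"spark" * 100
    return zlib.decompress(zlib.compress(data)) == data


# Example use with the newly installed interpreter:
#   print(missing_modules(["zlib", "bz2", "sqlite3", "ssl", "readline"]))
#   print(zlib_works())
```

If zlib_works() returns False, the ZipImportError from section 1 will most likely come back when pyspark tries to import py4j from its zip file.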

3. Preparing SparkSQL

Reference article: http://www.2cto.com/database/201504/392307.html

First, check what HiveContext requires. This follows the article at http://www.cnblogs.com/byrhuangqiang/p/4012087.html
The article lists three requirements:
1. Check that $SPARK_HOME/lib contains the jars datanucleus-api-jdo-3.2.1.jar, datanucleus-rdbms-3.2.1.jar, and datanucleus-core-3.2.2.jar.
2. Check that $SPARK_HOME/conf contains the hive-site.xml copied from $HIVE_HOME/conf.
3. When submitting a program, put the database driver jar on the driver classpath, e.g. bin/spark-submit --driver-class-path *.jar, or set SPARK_CLASSPATH in spark-env.sh.

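Following the article, I copied the jars whose names start with datanucleus from $HIVE_HOME/lib to $SPARK_HOME/lib, copied hive-site.xml from $HIVE_HOME/conf to $SPARK_HOME/conf, and copied the mysql-connector jar from $HIVE_HOME/lib to $SPARK_HOME/jars. (Spark 2.x keeps its bundled jars in $SPARK_HOME/jars, which is where the startup log in the next section finds the datanucleus jars.)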
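The three requirements can be checked with a few lines of Python before starting spark-shell. This is a sketch under assumptions: check_hive_prereqs is a hypothetical helper, the jar directory defaults to jars (pass "lib" for a Spark 1.x layout), and the file name patterns are guesses based on the jar names mentioned above.

```python
import glob
import os


def check_hive_prereqs(spark_home, jar_dir="jars"):
    """Report which of the Hive-related files are present under spark_home."""
    jars = os.path.join(spark_home, jar_dir)
    return {
        # Assumed naming pattern: datanucleus-<part>-<version>.jar
        "datanucleus jars": bool(glob.glob(os.path.join(jars, "datanucleus-*.jar"))),
        "hive-site.xml": os.path.isfile(os.path.join(spark_home, "conf", "hive-site.xml")),
        # Assumed naming pattern for the MySQL JDBC driver jar
        "mysql connector": bool(glob.glob(os.path.join(jars, "mysql-connector*.jar"))),
    }


# Example use:
#   for item, present in check_hive_prereqs(os.environ["SPARK_HOME"]).items():
#       print(item, "ok" if present else "MISSING")
```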

2. Error when starting spark-shell

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/01/17 11:42:58 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/01/17 11:43:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/17 11:43:00 WARN Utils: Your hostname, node1 resolves to a loopback address: 127.0.0.1; using 192.168.85.128 instead (on interface eth1)
17/01/17 11:43:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/01/17 11:43:11 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.server2.thrift.http.min.worker.threads does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.mapjoin.optimized.keys does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.mapjoin.lazy.hashtable does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.datampi.maxslots does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.attempts does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.server2.thrift.http.max.worker.threads does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.datampi.sendqueue does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.optimize.multigroupby.common.distincts does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.interval does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.datampi.parallelism does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.stats.map.parallelism does not exist
17/01/17 11:43:11 WARN HiveConf: HiveConf of name hive.datampi.memusedpercent does not exist
17/01/17 11:43:12 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/joy/spark/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/joy/spark/spark/jars/datanucleus-rdbms-3.2.9.jar."
17/01/17 11:43:12 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/joy/spark/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/joy/spark/spark/jars/datanucleus-api-jdo-3.2.6.jar."
17/01/17 11:43:12 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/home/joy/spark/spark/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/home/joy/spark/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
17/01/17 11:43:16 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.server2.thrift.http.min.worker.threads does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.mapjoin.optimized.keys does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.mapjoin.lazy.hashtable does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.datampi.maxslots does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.attempts does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.server2.thrift.http.max.worker.threads does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.datampi.sendqueue does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.optimize.multigroupby.common.distincts does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.interval does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.datampi.parallelism does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.stats.map.parallelism does not exist
17/01/17 11:43:16 WARN HiveConf: HiveConf of name hive.datampi.memusedpercent does not exist
17/01/17 11:43:22 ERROR ObjectStore: Version information found in metastore differs 0.13.0 from expected schema version 1.2.0. Schema verififcation is disabled hive.metastore.schema.verification so setting version.
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
  ... 58 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
  at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
  at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
  at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
  at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
  at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
  at org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
  at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
  ... 63 more
Caused by: java.lang.reflect.InvocationTargetException: java.lang.reflect.InvocationTargetException: java.lang.RuntimeException: java.io.FileNotFoundException: File /hive/tmp does not exist
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)
  ... 71 more
**Caused by: java.lang.reflect.InvocationTargetException: java.lang.RuntimeException: java.io.FileNotFoundException: File /hive/tmp does not exist**
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:366)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:270)
  at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:65)
  ... 76 more
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File /hive/tmp does not exist
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
  at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:192)
  ... 84 more
Caused by: java.io.FileNotFoundException: File /hive/tmp does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
  at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
  at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
  ... 85 more

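The error output shows that the failure is a FileNotFoundException: /hive/tmp does not exist. In hive-site.xml, this path is the value of hive.exec.scratchdir. hive.exec.scratchdir is an HDFS path used to store the execution plans of the different map/reduce stages and the intermediate output of those stages.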
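Looking up a property such as hive.exec.scratchdir in hive-site.xml can be done with the standard library XML parser. The sketch below assumes the usual Hadoop-style layout of property elements with name and value children; get_hive_property is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def get_hive_property(xml_text, name):
    """Return the value of the named property in hive-site.xml text, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


# Example use (for a file on disk, ET.parse(path).getroot() can replace fromstring):
sample = """
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/hive/tmp</value>
  </property>
</configuration>
"""
print(get_hive_property(sample, "hive.exec.scratchdir"))
```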

Running hadoop fs -ls /hive in a terminal gives:

Found 2 items
drwxr-xr-x   - joy supergroup          0 2016-06-12 21:35 /hive/log
drwxr-xr-x   - joy supergroup          0 2017-01-16 14:17 /hive/tmp

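The permissions looked wrong and group write (g+w) needed to be added, so I ran hadoop fs -chmod g+w /hive/tmp and hadoop fs -chmod g+w /hive/log. The error still said the directory did not exist.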

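The stack trace goes through RawLocalFileSystem, which suggests that /hive/tmp was being looked up on the local filesystem rather than on HDFS. After adding HADOOP_CONF_DIR to spark-env.sh under $SPARK_HOME/conf, the error changed to: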

java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
  ... 58 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
  at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
  at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
  at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
  at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
  at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
  at org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
  at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
  ... 63 more
Caused by: java.lang.reflect.InvocationTargetException: java.lang.reflect.InvocationTargetException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /hive/tmp on HDFS should be writable. Current permissions are: rwxrwxr-x
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)
  ... 71 more
Caused by: java.lang.reflect.InvocationTargetException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /hive/tmp on HDFS should be writable. Current permissions are: rwxrwxr-x
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:366)
  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:270)
  at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:65)
  ... 76 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /hive/tmp on HDFS should be writable. Current permissions are: rwxrwxr-x
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
  at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:192)
  ... 84 more
Caused by: java.lang.RuntimeException: The root scratch dir: /hive/tmp on HDFS should be writable. Current permissions are: rwxrwxr-x
  at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
  at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
  at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
  ... 85 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql

The error says the directory permissions are not sufficient. Running hadoop fs -ls /hive again:

drwxrwxr-x   - joy supergroup          0 2016-06-12 21:35 /hive/log
drwxrwxr-x   - joy supergroup          0 2017-01-16 14:17 /hive/tmp

After changing the directory permissions to 777, spark-shell finally started successfully.
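The permission strings in the listings can be read mechanically. The sketch below converts a string such as drwxrwxr-x to octal and lists which classes lack write permission. It assumes, based only on the fact that 777 fixed the problem, that the scratch directory check wants write permission for user, group, and other; the exact rule Hive applies is not shown in this post. Both function names are made up for illustration.

```python
def to_octal(perm):
    """Convert a permission string like 'drwxrwxr-x' or 'rwxrwxr-x' to '775'."""
    bits = perm[-9:]  # drop the leading file-type character if present
    digits = []
    for start in range(0, 9, 3):
        r, w, x = bits[start:start + 3]
        value = (4 if r == "r" else 0) + (2 if w == "w" else 0) + (1 if x == "x" else 0)
        digits.append(str(value))
    return "".join(digits)


def classes_without_write(perm):
    """Return which of user, group, other lack the write bit."""
    bits = perm[-9:]
    names = ("user", "group", "other")
    return [name for name, index in zip(names, (1, 4, 7)) if bits[index] != "w"]


print(to_octal("drwxr-xr-x"), classes_without_write("drwxr-xr-x"))
print(to_octal("rwxrwxr-x"), classes_without_write("rwxrwxr-x"))
```

The second call shows why adding g+w was not enough: the "other" class still had no write bit, which matches the "Current permissions are: rwxrwxr-x" message.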

4. KeyError: u'y'

The error looks like this:

Traceback (most recent call last):
  File "/Users/lyj/Programs/kiseliugit/MyPysparkCodes/test/spark2.0.py", line 5, in <module>
    spark = SparkSession.builder.master("local").appName('test 2.0').config(conf=SparkConf()).getOrCreate()
  File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/context.py", line 243, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/Users/lyj/Programs/Apache/Spark2/python/pyspark/java_gateway.py", line 116, in launch_gateway
    java_import(gateway.jvm, "org.apache.spark.SparkConf")
  File "/Library/Python/2.7/site-packages/py4j/java_gateway.py", line 90, in java_import
    return_value = get_return_value(answer, gateway_client, None, None)
  File "/Library/Python/2.7/site-packages/py4j/protocol.py", line 306, in get_return_value
    value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
KeyError: u'y'

The cause is that the installed py4j version is too old. Upgrading it with pip install --upgrade py4j fixes the problem.
Reference: http://stackoverflow.com/questions/38637988/how-could-i-write-the-right-entry-point-in-spark-2-0-program-actually-pyspark-2
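To see which py4j version a Spark distribution expects, you can look at the py4j source zip that Spark ships under python/lib and compare it with what pip show py4j reports. This is a sketch: bundled_py4j_version and version_tuple are hypothetical helpers, and the file name pattern py4j-<version>-src.zip is an assumption about the distribution layout.

```python
import glob
import os
import re


def bundled_py4j_version(spark_home):
    """Return the version in $SPARK_HOME/python/lib/py4j-<version>-src.zip, or None."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    for path in sorted(glob.glob(pattern)):
        match = re.search(r"py4j-(\d+(?:\.\d+)*)-src\.zip$", os.path.basename(path))
        if match:
            return match.group(1)
    return None


def version_tuple(version):
    """Turn '0.10.4' into (0, 10, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Example use:
#   expected = bundled_py4j_version(os.environ["SPARK_HOME"])
#   installed = "0.9"  # replace with the version printed by: pip show py4j
#   if expected and version_tuple(installed) < version_tuple(expected):
#       print("installed py4j is older than the one bundled with Spark")
```

Comparing tuples instead of strings matters here, because as strings "0.9" sorts after "0.10.4".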
