Today I submitted a job in Spark on YARN cluster mode and ran into quite a few problems, so I am writing them down here. Submitting with --master yarn-client worked fine, but submitting with --master yarn-cluster kept failing, and with all sorts of different errors.
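For reference, a minimal sketch of the two ways of submitting (the class name, jar, and paths are placeholders, not the actual job); the older --master yarn-client / yarn-cluster spellings are equivalent to --master yarn plus the matching --deploy-mode:

# client mode: the driver runs on the submitting machine (this worked)
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# cluster mode: the driver runs inside a YARN container (this kept failing)
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar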
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1405)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2284)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
    ...
This error is usually taken as a hint to check whether the jar has been deployed to all of the slave nodes, but yarn-cluster mode normally distributes the jars through its RPC framework, and deploying the jar to every slave machine by hand made no difference at all: the same error kept appearing.

So I started searching Google and Stack Overflow. The causes behind this problem are quite tangled: some answers point at the classloader, others at UDFs. One of them caught my attention:

If the code references Java code, it is best to place the built jar under $SPARK_HOME/jars so that it is guaranteed to be on the classpath.
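A rough sketch of that arrangement, assuming SPARK_HOME points to the same path on every node (host names and the jar name are placeholders):

# copy the application jar into Spark's jars directory on each node
for host in master slave1 slave2; do
  scp my-app.jar ${host}:${SPARK_HOME}/jars/
done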
I arranged the jars this way and re-ran the job. Watching the logs through the YARN web UI, that error was gone, but the task still failed, this time with a different error:
java.io.FileNotFoundException: File does not exist: hdfs://master:9000/xxx/xxxx/xxxx/application_1495996836198_0003/__spark_libs__1200479165381142167.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    ...
This error I recognized immediately. When creating the SparkSession in code I had set the master, and that master address was the URL of the Spark standalone master. So even though the job was submitted on YARN, it ended up following the in-code configuration and starting in standalone mode, which creates a mess and produces all kinds of inexplicable bugs.

Changing the code and rebuilding the jar is all it takes.

Solution:
val spark = SparkSession.builder()
  // .master("spark://master:7077")  // comment out the master setting
  .appName("xxxxxxx")
  .getOrCreate()
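With the hard-coded master gone, the master can be chosen at submit time instead, so the same jar runs on YARN or locally without a rebuild. A sketch with placeholder class and jar names:

# production run on YARN
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

# quick local test with the very same jar
spark-submit --master "local[*]" --class com.example.MyApp my-app.jar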
Along the way I ran into many other bugs as well, such as deserialization failures:
SerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Or this kind of class cast error:
org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.Tuple2 cannot be cast to com.xxx.xxxxx.ResultMerge
All of these errors went away once the master setting was commented out. With so many different exceptions showing up in turn, it is easy to get confused; luckily the last one reported was the familiar java.io.FileNotFoundException, and only then could the problem be solved.

One more error that came up, reported as follows:
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1729427003-192.168.1.219-1527744820505:blk_1073742492_1669; getBlockSize()=24; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.1.219:50010,DS-e478076c-c3aa-4870-adce-7ffd6a49efe4,DISK], DatanodeInfoWithStorage[192.168.1.21:50010,DS-af806575-7404-45fd-bae0-0fcc59de7598,DISK]]}
This happens when operating on an HDFS file that is still being written to. It typically shows up when a file written by Flume was not closed properly, or when an HDFS restart left files in a broken state. The following command shows which files are OPENFORWRITE or MISSING:
hadoop fsck / -openforwrite | egrep "MISSING|OPENFORWRITE"
The command above pinpoints the exact files; deleting them is enough to clear the error.
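For example, a sketch of the cleanup (the path is a placeholder; substitute the file reported by fsck):

# remove the file that is stuck open-for-write (placeholder path)
hdfs dfs -rm /path/to/the/open/file

# on newer Hadoop versions, recovering the lease is an alternative if the data must be kept
hdfs debug recoverLease -path /path/to/the/open/file -retries 3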