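Assorted Bugs with Spark on YARN

Today I submitted my code in Spark on YARN cluster mode and ran into quite a few problems, so I am writing them down here.

The code ran fine when submitted with --master yarn-client, but with --master yarn-cluster it always failed, and with a different error each time.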

1.ClassCastException

java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1405)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2284)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
...

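This error usually raises the question of whether the jar has been deployed to every slave node. In yarn-cluster mode, however, the jar is normally distributed automatically (spark-submit uploads it to a staging directory on HDFS, and YARN localizes it for each container). Copying the jar to each slave machine by hand made no difference: the same error was still raised.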

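I started searching Google and Stack Overflow. The suggested causes were all over the place: some blamed the class loader, others blamed UDFs. One answer caught my attention:

If your code references Java code, it is best to put the packaged jar in the $SPARK_HOME/jars directory so that the jar is guaranteed to be on the classpath.

I placed the jar as that answer suggested and reran the job. In the logs on the YARN web UI this error was gone, but the job still failed, now with a different error: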

2.FileNotFoundException

java.io.FileNotFoundException: File does not exist: hdfs://master:9000/xxx/xxxx/xxxx/application_1495996836198_0003/__spark_libs__1200479165381142167.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
...

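This error was familiar. When creating the SparkSession in code I had set the master to the URL of a Spark standalone master. A master set in code takes precedence over the --master flag passed to spark-submit, so even though the job was submitted to YARN, it ended up trying to start in standalone mode as configured in the code. The two modes conflict, which produces all sorts of baffling errors.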

Changing the code and repackaging fixed it.

Solution:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    // .master("spark://master:7077")  // commented out: let spark-submit's --master decide
    .appName("xxxxxxx")
    .getOrCreate()
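
With the master no longer hard-coded, the deploy mode is chosen entirely at submit time. The following is a minimal sketch of such a submission, not the exact command used for this job: the class name com.example.Main, the jar name app.jar, the submit_job helper and the SPARK_SUBMIT override are all hypothetical. It uses the --master yarn --deploy-mode form, which Spark 2.0 and later recommend over the older yarn-client and yarn-cluster master names.

```shell
# Hypothetical helper: submit the same jar to YARN in either deploy mode.
# SPARK_SUBMIT may point at a specific spark-submit binary (assumed override).
submit_job() {
  mode="$1"   # "client" or "cluster"
  "${SPARK_SUBMIT:-spark-submit}" \
    --master yarn \
    --deploy-mode "$mode" \
    --class com.example.Main \
    app.jar
}

# Usage: submit_job client    or    submit_job cluster
```

Because the same jar is used for both modes, a failure that appears only with cluster mode points at the environment or configuration rather than at the packaging.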

Along the way I hit many other errors as well, such as a deserialization failure:

SerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Or a type cast error like this one:

org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.Tuple2 cannot be cast to com.xxx.xxxxx.ResultMerge

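All of these errors disappeared once the master setting was commented out.

With so many different exceptions showing up in turn, it is easy to get confused.

Fortunately the last one was a familiar java.io.FileNotFoundException, which finally led me to the fix.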

3.HDFS bug

The error is as follows:

java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1729427003-192.168.1.219-1527744820505:blk_1073742492_1669; getBlockSize()=24; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.1.219:50010,DS-e478076c-c3aa-4870-adce-7ffd6a49efe4,DISK], DatanodeInfoWithStorage[192.168.1.21:50010,DS-af806575-7404-45fd-bae0-0fcc59de7598,DISK]]}

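This happens when accessing an HDFS file that is still open for writing. It typically occurs when a file written by Flume was not closed properly, or when an HDFS restart left files in a bad state.

You can use the following command to see which files are OPENFORWRITE or MISSING: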

hadoop fsck / -openforwrite | egrep "MISSING|OPENFORWRITE"

The command above identifies the specific files; then simply delete them.
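
Deleting is only reasonable when the data can be regenerated. As a possible alternative, here is a minimal sketch, assuming Hadoop 2.7 or later, where hdfs debug recoverLease is available: it asks the NameNode to close the file rather than removing it. The helper names, the HDFS_BIN override and the retry count of 3 are hypothetical choices, and the exact layout of fsck output varies between Hadoop versions, so inspect the listed lines before acting on them.

```shell
# Hypothetical helpers; HDFS_BIN may point at a specific hdfs binary (assumed override).

# Show the fsck lines for files that are still open for write under a path.
list_open_files() {
  "${HDFS_BIN:-hdfs}" fsck "${1:-/}" -openforwrite | grep 'OPENFORWRITE'
}

# Ask the NameNode to recover the lease on one file so that it gets closed.
recover_file() {
  "${HDFS_BIN:-hdfs}" debug recoverLease -path "$1" -retries 3
}

# Usage: list_open_files /flume    then    recover_file <path from the listing>
```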
