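While integrating Spark Streaming with Kafka recently, I ran into a problem.
It can be abstracted as follows: a stateful wordCount whose key is the first letter of each word, but whose output must still have the form (word, count).
For example:
The first batch of data is: hello how when hello
The required output is: (hello,1) (how,2) (when,1) (hello,3)
The second batch of data is: hello how when what hi
The required output is: (hello,4) (how,5) (when,2) (what,3) (hi,6)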
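To make the requirement concrete, here is a minimal Spark-free sketch that replays the two batches above. The names `FirstLetterCountSketch` and `processBatch` are made up for illustration, and the sketch processes words strictly in input order within a batch, which is an assumption a distributed job does not guarantee.

```scala
// Spark-free model of the requirement: the count is kept per first letter
// across batches, but each emitted pair carries the full word.
object FirstLetterCountSketch {
  // Processes one batch in order; returns the emitted (word, count) pairs
  // together with the updated per-letter counts.
  def processBatch(
      batch: Seq[String],
      counts: Map[String, Int]
  ): (Seq[(String, Int)], Map[String, Int]) =
    batch.filter(_.nonEmpty).foldLeft((Vector.empty[(String, Int)], counts)) {
      case ((out, state), word) =>
        val key = word.substring(0, 1)        // the first letter is the state key
        val cnt = state.getOrElse(key, 0) + 1 // running count for that letter
        (out :+ (word -> cnt), state.updated(key, cnt))
    }

  def main(args: Array[String]): Unit = {
    val (out1, state1) = processBatch("hello how when hello".split(" ").toSeq, Map.empty)
    println(out1.mkString(" ")) // (hello,1) (how,2) (when,1) (hello,3)
    val (out2, _) = processBatch("hello how when what hi".split(" ").toSeq, state1)
    println(out2.mkString(" ")) // (hello,4) (how,5) (when,2) (what,3) (hi,6)
  }
}
```

The map passed from one call to the next plays the role that the per-key State plays in mapWithState.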
First, a look at the usual way mapWithState is used:
ref: https://www.jianshu.com/p/a54b142067e5
http://sharkdtu.com/posts/spark-streaming-state.html
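For orientation, the usual shape of a mapWithState word count is roughly the sketch below. It is not taken from the references above; the object name `PlainWordCountSketch`, the helper `nextCount`, the host/port and the checkpoint directory are all placeholders chosen for illustration, and it assumes the Spark Streaming library is on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object PlainWordCountSketch {
  // Pure helper (hypothetical): new total = this record's increment + previous total.
  def nextCount(value: Option[Int], previous: Option[Int]): Int =
    value.getOrElse(0) + previous.getOrElse(0)

  // mapWithState invokes this once per (key, value) record in each batch.
  def mappingFunc(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
    val total = nextCount(one, state.getOption)
    state.update(total) // persist the new total for this key
    (word, total)       // the element emitted into the mapped stream
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PlainWordCountSketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint") // placeholder path; stateful streams need checkpointing
    val words = ssc.socketTextStream("localhost", 9999) // placeholder host and port
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
    val counts = words.map(w => (w, 1)).mapWithState(StateSpec.function(mappingFunc _))
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Here the key, the emitted word and the counted unit are all the same string, which is exactly what the problem in this post breaks: the key is only the first letter, so the full word has to travel in the value.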
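A brief summary of a few mapWithState tips:
Approach to solving the problem:
The State holds a (String, Int) tuple, where the String is the full word and the Int is the running count for that word's first letter.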
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.MapWithStateDStream
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("MapWithStateApp")
    val ssc = new StreamingContext(conf, Seconds(5))
    // mapWithState keeps state across batches, so a checkpoint directory is required
    ssc.checkpoint("C:\\Users\\hylz\\Desktop\\checkpoint")
    val lines = ssc.socketTextStream("192.168.100.11", 8888)
    // Drop empty tokens (e.g. from double spaces) so substring(0, 1) below cannot throw
    val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)

    // key:   first letter of the word
    // value: (full word, 1) for the current record
    // state: (last word seen for this key, running count for this key)
    def mappingFunc(key: String, value: Option[(String, Int)], state: State[(String, Int)]): (String, Int) = {
      // New count = increment carried by this record + count already stored for the key
      val cnt: Int = value.getOrElse((null, 0))._2 + state.getOption.getOrElse((null, 0))._2
      // The full word comes from the record itself, not from the key
      val allField: String = value.getOrElse((null, 0))._1
      state.update((allField, cnt))
      (allField, cnt)
    }

    val cnt: MapWithStateDStream[String, (String, Int), (String, Int), (String, Int)] =
      words.map(x => (x.substring(0, 1), (x, 1))).mapWithState(StateSpec.function(mappingFunc _))
    cnt.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The test results are as follows:
input: hello how when hello
input: hello how when what hi