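We now need to collect user behavior logs. We originally planned to use services provided by AWS, with roughly the following architecture:
Build a REST service to collect data from servers or clients (mobile apps, web pages), put that data into a Kinesis stream, and then use AWS Firehose to deliver it to S3 or Redshift. There are two problems with this, however. First, Amazon China does not yet offer the Firehose service. Second, we may want to write the incoming data stream to HDFS or some other destination, and we may need to run some fairly simple computations on it. With efficiency and extensibility in mind, we chose Spark Streaming to replace Firehose (for Spark Streaming's performance, see here).
Why not write it ourselves? I think this quote answers it best: "You can spend more time focusing on your application and less time on your infrastructure." It depends on which layer you care about. If your job is data analysis, most of your effort should go into the core business.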
For Spark Streaming and how to integrate it with Kinesis, see the official documentation:
1. Spark Streaming Programming Guide
2. Spark Streaming + Kinesis Integration
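As a rough illustration of the pipeline described above (Kinesis into Spark Streaming, a simple computation, then HDFS), here is a minimal sketch against the Spark 1.4.x `KinesisUtils.createStream` API. The application name, stream name, endpoint, region, output path, and batch interval are all placeholder assumptions rather than values from our setup, and AWS credentials are assumed to come from the default credential provider chain.

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object BehaviorLogConsumer {
  def main(args: Array[String]): Unit = {
    // Every value below is a hypothetical placeholder; replace with your own.
    val appName      = "behavior-log-consumer"   // also used by the KCL for its checkpoint table
    val streamName   = "user-behavior"
    val endpointUrl  = "https://kinesis.cn-north-1.amazonaws.com.cn"
    val regionName   = "cn-north-1"
    val outputPrefix = "hdfs:///data/behavior/batch"
    val batchInterval = Seconds(10)

    // The master URL is expected to be supplied by spark-submit.
    val conf = new SparkConf().setAppName(appName)
    val ssc  = new StreamingContext(conf, batchInterval)

    // One receiver reading raw records (byte arrays) from the Kinesis stream.
    val records = KinesisUtils.createStream(
      ssc, appName, streamName, endpointUrl, regionName,
      InitialPositionInStream.LATEST, batchInterval,
      StorageLevel.MEMORY_AND_DISK_2)

    // Assumes each record is a UTF-8 encoded text line.
    val lines = records.map(bytes => new String(bytes, "UTF-8"))

    // A deliberately simple computation: print the record count of each batch.
    lines.count().print()

    // Write each batch to HDFS under a directory named <prefix>-<batch time in ms>.
    lines.saveAsTextFiles(outputPrefix)

    ssc.start()
    ssc.awaitTermination()
  }
}
```

A job like this would normally be packaged as an assembly jar and launched with `spark-submit`. A single receiver is enough for a small stream, while a stream with many shards may call for several receivers combined with `union`.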
Note that spark-streaming-kinesis-asl_2.10 can run into version conflicts with spark-core. My dependency list is given below for reference:
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk-kinesis" % "1.10.4",
  "com.amazonaws" % "amazon-kinesis-client" % "1.4.0",
  "org.apache.spark" % "spark-core_2.10" % "1.4.1" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "2.6.0",
  "org.apache.hbase" % "hbase-client" % "1.0.0",
  "org.apache.hbase" % "hbase-common" % "1.0.0",
  "org.apache.spark" % "spark-streaming_2.10" % "1.4.1",
  "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.4.1"
)