Ten Frontline Experts on the Present and Future of Spark ---- Summit Highlights

Source: CSDN Big Data Technology

Ten Frontline Experts on the Present and Future of Spark (Part 1)

Ten Frontline Experts on the Present and Future of Spark (Part 2)

Ten Frontline Experts on the Present and Future of Spark (Part 3)

 


Selected excerpts:

Matei Zaharia, PhD student at UC Berkeley's AMPLab: The State and Future of Spark ---- (Matei Zaharia is a PhD student in the AMP Lab at UC Berkeley and a co-founder and current CTO of Databricks. He works on systems and algorithms for large-scale data-intensive computing; his research projects include Spark, Shark, Multi-Resource Fairness, MapReduce Scheduling, and the SNAP Sequence Aligner)

Spark is a cluster computing platform that originated in the AMPLab at UC Berkeley. Built on in-memory computing and starting from multi-pass batch processing, it also embraces data warehousing, stream processing, and graph computation in a single engine, making it a rare all-rounder.

Project History:

  Spark started as a research project in 2009

  Open sourced in 2010

  Growing community since

  Entered Apache Incubator in June 2013

Release Growth:

  Spark 0.6 ---- Java API, Maven, standalone mode; 17 contributors

  Spark 0.7 ---- Python API, Spark Streaming; 31 contributors

  Spark 0.8 ---- YARN, MLlib, monitoring UI; 67 contributors ---- High availability for standalone mode (0.8.1)

  Spark 0.9 ---- Scala 2.10 support, configuration system, Spark Streaming improvements

Projects Built on Spark:

  Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning)

 

Ion Stoica, CEO of Databricks: Turning Data into Value ---- (Ion Stoica is a professor of computer science at UC Berkeley and a co-founder of AMPLab. The resilient P2P protocol Chord, the in-memory cluster computing framework Spark, and the cluster resource management platform Mesos all came out of his work)

Turning Data into Value 

What Do We Need?

  Interactive queries ---- enable faster decisions

  Queries on streaming data ---- enable decisions on real-time data ---- e.g., fraud detection, DDoS attack detection

  Sophisticated data processing ---- enables "better" decisions

Our Goal:

  Support batch, streaming, and interactive computation ... in a unified framework

  Easy to develop sophisticated algorithms (e.g., graph, ML algos)

 

 Big Data Challenges: Time, Money, Answer Quality

The trade-off between processing speed and accuracy: the two are inversely related.

 

 

Tim Tully: Integrating Spark/Shark into Yahoo's Data Analytics Platform

Ryan Weald, data specialist at Sharethrough: Productionizing Spark Streaming

Keys to Fault Tolerance:

  Receiver fault tolerance ---- use actors with supervisors; use self-healing connection pools

  Monitoring job progress

RDDs: Resilient Distributed Datasets

  Low latency & large scale

  Iterative and interactive computation
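The RDD ideas above can be sketched in a few lines of plain Python. This is a toy illustration, not Spark's real API or implementation; `ToyRDD` and its fields are invented names. It shows a dataset split into partitions (processed independently, which is what enables scale), with each transformation recorded as lineage so that a lost partition could be recomputed rather than restored from a checkpoint:

```python
# Toy model of an RDD: partitioned data plus recorded lineage.
# Not Spark's actual API -- names here are invented for illustration.

class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = partitions     # list of lists, one per worker
        self.lineage = lineage or []     # transformations applied so far

    def map(self, fn):
        # Real Spark is lazy; here we apply eagerly but still record the
        # lineage, which is what would let a lost partition be rebuilt
        # from the original input.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, self.lineage + [("map", fn)])

    def collect(self):
        # Gather all partitions back into one list.
        return [x for part in self.partitions for x in part]

rdd = ToyRDD([[1, 2], [3, 4]])        # 2 partitions
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())              # [2, 4, 6, 8]
```

Iterative workloads (the second bullet above) benefit because each `map` stays in memory instead of being written back to disk between passes.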

 

Patrick Wendell, Databricks co-founder: Understanding the Performance of Spark Applications ---- (He focuses on large-scale data-intensive computing and works on Spark performance benchmarking; he is a co-author of spark-perf. At this summit he covered a deep dive into Spark, an overview of the UI and instrumentation, and common performance issues and pitfalls)

Summary of Components:

  Task: fundamental unit of work

  Stage: set of tasks that run in parallel

  DAG: logical graph of RDD operations

  RDD: parallel dataset with partitions
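As a rough illustration of how these components relate, the sketch below (hypothetical helper names, not Spark's internal scheduler API) cuts a logical chain of operations into stages at wide (shuffle) dependencies; each stage then runs as one task per partition:

```python
# Toy scheduler sketch: narrow dependencies stay within one stage,
# a wide (shuffle) dependency starts a new stage.

def plan_stages(ops):
    """ops: list of (name, "narrow" | "wide") in logical DAG order."""
    stages, current = [], []
    for name, dep in ops:
        if dep == "wide" and current:
            stages.append(current)   # shuffle boundary closes the stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

ops = [("map", "narrow"), ("filter", "narrow"),
       ("groupByKey", "wide"), ("mapValues", "narrow")]
stages = plan_stages(ops)
print(stages)   # [['map', 'filter'], ['groupByKey', 'mapValues']]

# One task per partition per stage:
NUM_PARTITIONS = 4
tasks_per_stage = {i: NUM_PARTITIONS for i in range(len(stages))}
```

The key design idea this mirrors is that everything between shuffles can be pipelined inside a single stage, so only the shuffle boundaries force materializing data between stages.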

Demo of perf UI ---- Problems:

  Scheduling and launching tasks

  Execution of tasks

  Writing data between stages

  Collecting results

 

Pat McDonough, director of customer solutions at Databricks: Parallel Programming with Spark ---- (a broad introduction to Spark's strengths, covering its performance, components, and more)

 

 

Tathagata Das, PhD student at UC Berkeley: Real-Time Big Data Processing with Spark Streaming ---- (what Spark Streaming is, why choose it, and its performance and fault-tolerance mechanisms)

DStreams + RDDs = Power

Fault-tolerance:

  Batches of input data are replicated in memory for fault tolerance

  Data lost due to worker failure can be recomputed from the replicated input data

  All transformations are fault-tolerant and provide exactly-once semantics
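The recovery scheme above can be illustrated with a toy sketch (plain Python, invented names, not Spark code): the input batch lives on two nodes, and when the worker holding the computed results fails, the deterministic transformation is simply re-run on a surviving replica, yielding the same result as the lost computation:

```python
# Toy model of Spark Streaming's recovery: replicated input + recompute.

replicas = {"node-a": [1, 2, 3], "node-b": [1, 2, 3]}  # batch, 2 copies
transform = lambda batch: [x * 10 for x in batch]      # deterministic op

results = transform(replicas["node-a"])  # normal processing on node-a
del replicas["node-a"]                   # node-a fails; its results are lost

surviving = next(iter(replicas.values()))
recovered = transform(surviving)         # recompute from the replica
assert recovered == results              # same answer: exactly-once effect
print(recovered)                         # [10, 20, 30]
```

Determinism is what makes recomputation equivalent to replication of results, which is why Spark Streaming can replicate only the (cheap) input rather than every intermediate dataset.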

Higher throughput than Storm (nearly 6x per node):

  Spark Streaming: 670K records/sec/node

  Storm: 115K records/sec/node

Fast Fault Recovery:

  Recovers from faults/stragglers within 1 sec

Spark 0.9 in Jan 2014 ---- out of alpha

  Automated master fault recovery

  Performance optimizations

  Web UI and better monitoring capabilities

    Cluster Manager UI ---- standalone mode: <master>:8080

    Executor Logs ---- stored by the cluster manager on each worker

    Spark Driver Logs ---- Spark initializes log4j when the driver is created; include a log4j.properties file on the classpath

    Application Web UI ---- http://spark-application-host:4040 ---- for executor / task / stage / memory status, etc.
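For the driver-log note above, a minimal `log4j.properties` sketch (log4j 1.x properties syntax, which Spark of this era bundled; the appender name and pattern are illustrative choices) placed on the classpath might look like:

```properties
# Minimal log4j.properties for a Spark driver (illustrative)
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Raising `log4j.rootLogger` to WARN is a common way to quiet Spark's verbose INFO output during interactive use.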
