CSDN Big Data Technology
Partial excerpts:
Matei Zaharia, PhD student at UC Berkeley's AMP Lab: The Present and Future of Spark ---- (Matei Zaharia is a PhD student at UC Berkeley's AMP Lab, and co-founder and current CTO of Databricks. He works on systems and algorithms for large-scale data-intensive computing; his research projects include Spark, Shark, Multi-Resource Fairness, MapReduce Scheduling, and the SNAP Sequence Aligner)
Spark is a cluster computing platform that originated at UC Berkeley's AMPLab. Built on in-memory computing, it started from multi-pass batch processing and has grown to embrace data warehousing, stream processing, and graph computation ---- a rare all-round player.
Project History:
Spark started as a research project in 2009
Open sourced in 2010
Growing community since
Entered Apache Incubator in June 2013
Release Growth:
Spark 0.6 ---- Java API、Maven、standalone mode ,17 contributors
Spark 0.7 ---- Python API、Spark Streaming ,31 contributors
Spark 0.8 ---- YARN、MLlib、monitoring UI ,67 contributors ---- High availability for standalone mode (0.8.1)
Spark 0.9 ---- Scala 2.10 support、Configuration system、Spark Streaming improvements
Projects Built on Spark:
Shark(SQL)、Spark Streaming(real-time)、GraphX(graph)、MLbase(machine learning)
Ion Stoica, CEO of Databricks: Turning Data into Value ---- (Ion Stoica is a professor of computer science at UC Berkeley and a co-founder of AMPLab. The P2P protocol Chord, the in-memory cluster computing framework Spark, and the cluster resource management platform Mesos all came out of his work)
Turning Data into Value
What do We Need?
Interactive queries ---- enable faster decisions
Queries on streaming data ---- enable decisions on real-time data ---- e.g., fraud detection、detecting DDoS attacks
Sophisticated data processing ---- enable "better" decisions
Our Goal:
Support batch、streaming、and interactive computation ...... in a unified framework
Easy to develop sophisticated algorithms (e.g., graph, ML algos)
Big Data Challenges: Time、Money、Answer Quality
The trade-off between processing speed and accuracy: they vary inversely.
Ryan Weald, data expert at Sharethrough: Productionizing Spark Streaming
Keys to Fault Tolerance:
Receiver fault tolerance ---- Use actors with a supervisor、Use self-healing connection pools
Monitoring job progress
RDDs: Resilient Distributed Datasets
Low latency & scale
Iterative and interactive computation
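The "resilient" part of an RDD comes from lineage: a dataset remembers the transformations that produced it, so a lost partition can be recomputed instead of restored from a replica. A minimal toy model of that idea in Python (the `ToyRDD` class and its methods are illustrative inventions, not Spark's API):

```python
# Toy model of lineage-based recovery: not Spark's API, just the idea.
class ToyRDD:
    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent   # upstream RDD in the lineage chain
        self.fn = fn           # transformation applied to one partition
        self._data = data      # materialized partitions (source RDD only)

    def map(self, fn):
        # Record the transformation lazily instead of running it now.
        return ToyRDD(parent=self, fn=lambda part: [fn(x) for x in part])

    def compute(self, i):
        # (Re)compute partition i by replaying the lineage chain.
        if self._data is not None:
            return self._data[i]
        return self.fn(self.parent.compute(i))

source = ToyRDD(data=[[1, 2], [3, 4]])   # two partitions
squared = source.map(lambda x: x * x)

# Partition 1 of `squared` is never cached; "recovering" it is just
# replaying the lineage against the surviving source partition.
print(squared.compute(1))   # [9, 16]
```

This is also why RDDs suit iterative workloads: keeping data in memory is an optimization, not a correctness requirement, since anything lost can be rebuilt from lineage.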
Patrick Wendell, co-founder of Databricks: Understanding the Performance of Spark Applications ---- (Focuses on large-scale data-intensive computing; works on Spark performance benchmarking and is a co-author of spark-perf. At this summit he covered Spark deep dives, an overview of the UI and test setup, and common performance issues and errors)
Summary of Components:
Tasks:Fundamental unit of work
Stage:Set of tasks that run in parallel
DAG:Logical graph of RDD operations
RDD:Parallel dataset with partitions
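How these components relate can be sketched with a toy scheduler: narrow transformations fuse into a single stage, while each wide (shuffle) dependency closes the current stage and starts the next, and every stage then runs as one set of parallel tasks, one per partition. The operation names and the `split_into_stages` helper below are illustrative, not Spark internals:

```python
# Toy stage splitter: narrow ops fuse into one stage; a wide (shuffle)
# dependency ends the current stage and starts the next.
def split_into_stages(ops):
    stages, current = [], []
    for name, is_wide in ops:
        current.append(name)
        if is_wide:
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

# A logical DAG for a word-count-like job: (op, requires_shuffle)
dag = [("textFile", False), ("flatMap", False), ("map", False),
       ("reduceByKey", True), ("saveAsTextFile", False)]

stages = split_into_stages(dag)
print(stages)
# [['textFile', 'flatMap', 'map', 'reduceByKey'], ['saveAsTextFile']]
```

In real Spark the shuffle operation is split across the stage boundary (map-side in the first stage, reduce-side in the second), but the key point survives in the toy: the DAG of RDD operations determines the stages, and stages determine the sets of tasks.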
Demo of perf UI ---- Problems:
Scheduling and launching tasks
Execution of tasks
Writing data between stages
Collecting results
Pat McDonough, head of customer solutions at Databricks: Parallel Programming with Spark ---- (A comprehensive introduction to Spark's performance, components, and other strengths)
Tathagata Das, PhD student at UC Berkeley: Real-Time Big Data Processing with Spark Streaming ---- (What Spark Streaming is, why choose Spark Streaming, and its performance and fault-tolerance mechanisms)
DStreams+RDDs=Power
Fault-tolerance:
Batches of input data are replicated in memory for fault tolerance
Data lost due to worker failure can be recomputed from the replicated input data
All transformations are fault-tolerant and provide exactly-once semantics
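The "DStreams + RDDs" idea is that a stream is just a sequence of small RDDs: the input is chopped into micro-batches and the same deterministic transformation is applied to each, which is exactly why a lost result can be recomputed once from replicated input. A toy sketch of the model (function names here are illustrative, not the Spark Streaming API):

```python
# Toy DStream: chop a stream into micro-batches and apply one
# deterministic transformation per batch, mimicking how Spark
# Streaming treats a stream as a sequence of small RDDs.
def micro_batches(stream, batch_size):
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

def transform(batch):
    # Deterministic per-batch logic: count events per key.
    counts = {}
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
    return counts

events = ["a", "b", "a", "c", "a", "b"]
results = [transform(b) for b in micro_batches(events, 3)]

# Determinism is the recovery story: if the result of batch 1 is lost,
# re-running transform() on the replicated input reproduces it exactly.
recovered = transform(["c", "a", "b"])
assert recovered == results[1]
```

Because the transformation is a pure function of the batch, replaying it after a failure yields the same answer, giving the exactly-once semantics noted above.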
Higher throughput than Storm:
Spark Streaming:670K records/sec/node
Storm:115K records/sec/node
Fast Fault Recovery:
Recovers from faults/stragglers within 1 sec
Spark 0.9 in Jan 2014 ---- out of alpha
Automated master fault recovery
Performance optimizations
Web UI,and better monitoring capabilities
Cluster Manager UI ---- Standalone mode:<master>:8080
Executor Logs ---- Stored by cluster manager on each worker
Spark Driver Logs ---- Spark initializes log4j when the driver is created; include a log4j.properties file on the classpath
Application Web UI ---- http://spark-application-host:4040 ---- for executor / task / stage / memory status, etc.
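For the driver-side log4j configuration mentioned above, a minimal log4j.properties placed on the classpath might look like the following (the INFO level and console appender are example choices, not a recommendation):

```properties
# Minimal example log4j.properties for the Spark driver.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Spark ships a log4j.properties.template in its conf/ directory that can be copied and adjusted instead of writing one from scratch.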