CSDN Big Data Technology
Partial excerpts:
Matei Zaharia, PhD student at UC Berkeley's AMP Lab: The Present and Future of Spark ---- (Matei Zaharia is a PhD student at UC Berkeley's AMP Lab, and co-founder and current CTO of Databricks. He works on systems and algorithms for large-scale data-intensive computing; his research projects include Spark, Shark, Multi-Resource Fairness, MapReduce Scheduling, and the SNAP Sequence Aligner)
Spark is a cluster computing platform that originated at UC Berkeley's AMPLab. Built on in-memory computing, it started from multi-pass batch processing and has grown to embrace data warehousing, stream processing, and graph computation ---- a rare all-round player.
Project History:
Spark started as a research project in 2009
Open sourced in 2010
Growing community since
Entered Apache Incubator in June 2013
Release Growth:
Spark 0.6 ---- Java API、Maven、standalone mode ,17 contributors
Spark 0.7 ---- Python API、Spark Streaming ,31 contributors
Spark 0.8 ---- YARN、MLlib、monitoring UI ,67 contributors ---- High availability for standalone mode (0.8.1)
Spark 0.9 ---- Scala 2.10 support、Configuration system、Spark Streaming improvements
Projects Built on Spark:
Shark(SQL)、Spark Streaming(real-time)、GraphX(graph)、MLbase(machine learning)
Ion Stoica, CEO of Databricks: Turning Data into Value ---- (Ion Stoica is a professor of computer science at UC Berkeley and a co-founder of AMPLab. The P2P protocol Chord, the in-memory cluster computing framework Spark, and the cluster resource management platform Mesos all came out of his work)
Turning Data into Value
What do We Need?
Interactive queries ---- enable faster decisions
Queries on streaming data ---- enable decisions on real-time data ---- e.g., fraud detection、detecting DDoS attacks
Sophisticated data processing ---- enable "better" decisions
Our Goal:
Support batch、streaming、and interactive computation ...... in a unified framework
Easy to develop sophisticated algorithms (e.g., graph, ML algos)
Big Data Challenges: Time、Money、Answer Quality
The trade-off between processing speed and accuracy: they vary inversely.
Ryan Weald, data expert at Sharethrough: Productionizing Spark Streaming
Keys to Fault Tolerance:
Receiver fault tolerance ---- Use actors with a supervisor、Use self-healing connection pools
Monitoring job progress
RDDs: Resilient Distributed Datasets
Low latency & scale
Iterative and interactive computation
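The "resilient" part of an RDD comes from lineage: a dataset remembers the transformations that produced it, so a lost partition can be recomputed instead of restored from a replica. A minimal toy model of that idea in Python (the `ToyRDD` class and its methods are illustrative inventions, not Spark's API):

```python
# Toy model of lineage-based recovery: not Spark's API, just the idea.
class ToyRDD:
    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent   # upstream RDD in the lineage chain
        self.fn = fn           # transformation applied to one partition
        self._data = data      # materialized partitions (source RDD only)

    def map(self, fn):
        # Record the transformation lazily instead of running it now.
        return ToyRDD(parent=self, fn=lambda part: [fn(x) for x in part])

    def compute(self, i):
        # (Re)compute partition i by replaying the lineage chain.
        if self._data is not None:
            return self._data[i]
        return self.fn(self.parent.compute(i))

source = ToyRDD(data=[[1, 2], [3, 4]])   # two partitions
squared = source.map(lambda x: x * x)

# Partition 1 of `squared` is never cached; "recovering" it is just
# replaying the lineage against the surviving source partition.
print(squared.compute(1))   # [9, 16]
```

This is also why RDDs suit iterative workloads: keeping data in memory is an optimization, not a correctness requirement, since anything lost can be rebuilt from lineage.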
Patrick Wendell, co-founder of Databricks: Understanding the Performance of Spark Applications ---- (Focuses on large-scale data-intensive computing; works on Spark performance benchmarking and is a co-author of spark-perf. At this summit he covered Spark deep dives, an overview of the UI and test setup, and common performance issues and errors)
Summary of Components:
Tasks:Fundamental unit of work
Stage:Set of tasks that run in parallel
DAG:Logical graph of RDD operations
RDD:Parallel dataset with partitions
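How these components relate can be sketched with a toy scheduler: narrow transformations fuse into a single stage, while each wide (shuffle) dependency closes the current stage and starts the next, and every stage then runs as one set of parallel tasks, one per partition. The operation names and the `split_into_stages` helper below are illustrative, not Spark internals:

```python
# Toy stage splitter: narrow ops fuse into one stage; a wide (shuffle)
# dependency ends the current stage and starts the next.
def split_into_stages(ops):
    stages, current = [], []
    for name, is_wide in ops:
        current.append(name)
        if is_wide:
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

# A logical DAG for a word-count-like job: (op, requires_shuffle)
dag = [("textFile", False), ("flatMap", False), ("map", False),
       ("reduceByKey", True), ("saveAsTextFile", False)]

stages = split_into_stages(dag)
print(stages)
# [['textFile', 'flatMap', 'map', 'reduceByKey'], ['saveAsTextFile']]
```

In real Spark the shuffle operation is split across the stage boundary (map-side in the first stage, reduce-side in the second), but the key point survives in the toy: the DAG of RDD operations determines the stages, and stages determine the sets of tasks.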
Demo of perf UI ---- Problems:
Scheduling and launching tasks
Execution of tasks
Writing data between stages
Collecting results
Pat McDonough, head of customer solutions at Databricks: Parallel Programming with Spark ---- (A comprehensive introduction to Spark's performance, components, and other strengths)
Tathagata Das, PhD student at UC Berkeley: Real-Time Big Data Processing with Spark Streaming ---- (What Spark Streaming is, why choose Spark Streaming, and its performance and fault-tolerance mechanisms)
DStreams+RDDs=Power
Fault-tolerance:
Batches of input data are replicated in memory for fault tolerance
Data lost due to worker failure can be recomputed from the replicated input data
All transformations are fault-tolerant and provide exactly-once semantics
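The "DStreams + RDDs" idea is that a stream is just a sequence of small RDDs: the input is chopped into micro-batches and the same deterministic transformation is applied to each, which is exactly why a lost result can be recomputed once from replicated input. A toy sketch of the model (function names here are illustrative, not the Spark Streaming API):

```python
# Toy DStream: chop a stream into micro-batches and apply one
# deterministic transformation per batch, mimicking how Spark
# Streaming treats a stream as a sequence of small RDDs.
def micro_batches(stream, batch_size):
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

def transform(batch):
    # Deterministic per-batch logic: count events per key.
    counts = {}
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
    return counts

events = ["a", "b", "a", "c", "a", "b"]
results = [transform(b) for b in micro_batches(events, 3)]

# Determinism is the recovery story: if the result of batch 1 is lost,
# re-running transform() on the replicated input reproduces it exactly.
recovered = transform(["c", "a", "b"])
assert recovered == results[1]
```

Because the transformation is a pure function of the batch, replaying it after a failure yields the same answer, giving the exactly-once semantics noted above.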
Higher throughput than Storm:
Spark Streaming:670K records/sec/node
Storm:115K records/sec/node
Fast Fault Recovery:
Recovers from faults/stragglers within 1 sec
Spark 0.9 in Jan 2014 ---- out of alpha
Automated master fault recovery
Performance optimizations
Web UI,and better monitoring capabilities
Cluster Manager UI ---- Standalone mode:<master>:8080
Executor Logs ---- Stored by cluster manager on each worker
Spark Driver Logs ---- Spark initializes log4j when the driver is created; include a log4j.properties file on the classpath
Application Web UI ---- http://spark-application-host:4040 ---- for executor / task / stage / memory status, etc.
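For the driver-side log4j configuration mentioned above, a minimal log4j.properties placed on the classpath might look like the following (the INFO level and console appender are example choices, not a recommendation):

```properties
# Minimal example log4j.properties for the Spark driver.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Spark ships a log4j.properties.template in its conf/ directory that can be copied and adjusted instead of writing one from scratch.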