『 Spark 』1. An Introduction to Spark

Preface

This series combines my own notes from learning Spark, my understanding of the reference articles, and lessons from my hands-on practice with Spark. I am writing it purely to organize my personal study notes, not as a tutorial, so everything is organized around my own understanding and unnecessary details are left out. For a deeper understanding, it is best to read the reference articles and the official documentation.

Second, this series is based on the latest release at the time of writing, Spark 1.6.0. Spark is evolving quickly, so it is worth recording the version number.
Finally, if you think anything here is wrong, please leave a comment; all comments will be answered within 24 hours. Many thanks.
Tips: if an illustration is hard to read, you can: 1. zoom in on the page; 2. open the image in a new tab to view it at full size.

1. How to introduce Spark to others

Apache Spark™ is a fast and general engine for large-scale data processing.

Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools, including:

  • Spark SQL for SQL and structured data processing, which extends to DataFrames and Datasets (see the sketch after this list)
  • MLlib for machine learning
  • GraphX for graph processing
  • Spark Streaming for stream processing
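
To make the first bullet concrete, here is a minimal sketch of the DataFrame / Spark SQL API as it looks in Spark 1.6 (Scala). The file path `people.json` and the `name`/`age` fields are made up for illustration; `local[*]` is only there so the sketch runs on a single machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs the sketch on a single machine; use spark-submit for a cluster.
    val sc = new SparkContext(new SparkConf().setAppName("sql-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Load semi-structured JSON into a DataFrame (hypothetical file).
    val people = sqlContext.read.json("people.json")

    // The same query expressed through the DataFrame DSL and through plain SQL.
    people.filter(people("age") > 21).select("name").show()
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()

    sc.stop()
  }
}
```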

2. Some background on Spark's origins

(Figures: introduction-to-spark-1.jpg, introduction-to-spark-2.jpg)

Spark started in 2009 and was open-sourced in 2010. Unlike the various specialized systems (Hadoop MapReduce, Storm), Spark’s goal was to:

  • generalize MapReduce to support new apps within the same engine

    • it is fully compatible with the Hadoop ecosystem: it can run on Hadoop YARN, Mesos, standalone, or in the cloud, and it can access diverse data sources including HDFS, Cassandra, HBase, and S3.
  • speed up iterative computation compared with Hadoop MapReduce

    • use memory + disk, instead of disk alone, as the data storage medium
    • design a new programming model, the RDD, which makes data processing more elegant (RDD transformations, actions, distributed jobs, stages and tasks); see the sketch after this list
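
A minimal RDD sketch of these last two points, assuming a hypothetical log file at `hdfs:///logs/app.log`: transformations are lazy, `cache()` keeps the intermediate result in memory for reuse, and each action triggers a distributed job that is split into stages and tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // Transformations are lazy: nothing is computed until an action runs.
    val errors = sc.textFile("hdfs:///logs/app.log")   // hypothetical path
      .filter(_.contains("ERROR"))
      .cache()                                         // keep in memory for reuse

    // Each action triggers a job; the second one reuses the cached RDD.
    println("total errors:   " + errors.count())
    println("timeout errors: " + errors.filter(_.contains("timeout")).count())

    sc.stop()
  }
}
```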

(Figures: introduction-to-spark-4.jpg, introduction-to-spark-5.jpg)

3. Why choose Spark

  • designed, implemented and used as libraries on a single engine, instead of as separate specialized systems;
    • this makes them far more useful and maintainable

(Figure: introduction-to-spark-3.jpg)

  • historically, it was designed as an improvement over Hadoop MapReduce and Storm, so it inherits strong genes;
  • good documentation, an active community, a growing product ecosystem and strong adoption trends;
  • it provides SQL, DataFrames, Datasets, a machine learning library, a graph computing library and an actively growing set of third-party libraries; it is easy to use and covers many use cases across many fields;
  • it supports ad-hoc exploration, which speeds up data exploration and pre-processing and helps you build your ETL and processing jobs (a short spark-shell sketch follows this list);
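
As an illustration of the ad-hoc style, here is a short spark-shell session (Scala); `sc` is pre-created by the shell, and the CSV path and column layout are assumptions for the example.

```scala
// Inside spark-shell; `sc` already exists.
val raw = sc.textFile("data/events.csv")          // hypothetical file

// Peek at a few records before committing to a full ETL job.
raw.take(5).foreach(println)

// Quick aggregation: count events per type, assuming the type is the 2nd column.
val byType = raw.map(_.split(","))
  .map(fields => (fields(1), 1))
  .reduceByKey(_ + _)

byType.collect().foreach(println)
```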

4. Next

The next post will briefly introduce the basic concepts in Spark that must be understood in depth.

References

Links to the posts in this series
