We are excited to announce that, starting today, a preview of Apache Spark 1.5.0 is available in Databricks. Our users can now choose to provision clusters with Spark 1.5 or earlier Spark versions in just a few clicks.
Officially, Spark 1.5 is expected to be released in a few weeks, once the community has finished QA testing of the release. Given Spark's fast pace of development, we feel it is important to let our users develop with and take advantage of new features as soon as possible. With traditional on-premise software deployments, it can take months or even years to receive software updates from a vendor. With Databricks' cloud model, we can push updates within hours and let users try the Spark version of their choice.
The last few releases of Spark focus on making data science more accessible, through high-level programming APIs such as DataFrames, machine learning pipelines, and R language support. A large part of Spark 1.5, on the other hand, focuses on under-the-hood changes to improve Spark’s performance, usability, and operational stability.
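To make the "high-level programming APIs" concrete, here is a minimal sketch of the DataFrame API mentioned above. The input path and column names ("dept", "age") are hypothetical; in Databricks, `sc` and `sqlContext` are already provided, but they are created here so the snippet is self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Local context for illustration; Databricks notebooks provide these already.
val conf = new SparkConf().setAppName("DataFrameSketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Hypothetical JSON input with "dept" and "age" fields.
val people = sqlContext.read.json("/path/to/people.json")

// A declarative aggregation; the engine plans and optimizes the execution.
people.groupBy("dept").avg("age").show()
```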
Spark 1.5 delivers the first phase of Project Tungsten, a new execution backend for DataFrames/SQL. Through code generation and cache-aware algorithms, Project Tungsten improves the runtime performance with out-of-the-box configurations. Through explicit memory management and external operations, the new backend also mitigates the inefficiency in JVM garbage collection and improves robustness in large-scale workloads.
Over the next few weeks, we will be writing about Project Tungsten. To give you a sneak peek, the above chart compares the out-of-the-box (i.e. no configuration changes) performance of an aggregation query (16 million records and 1 million composite keys) using Spark 1.4 and Spark 1.5 on my laptop.
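The exact benchmark query is not published in this post, but a hypothetical reconstruction of a similar workload looks like the sketch below: 16 million rows grouped by a two-column composite key with about 1 million distinct values. In Spark 1.5, Tungsten is enabled out of the box (it can be toggled via `spark.sql.tungsten.enabled`), so no configuration change is needed to benefit from it.

```scala
import org.apache.spark.sql.functions._

// 16 million synthetic rows with a composite key of ~1 million distinct values
// (1000 x 1000 combinations). Column names are illustrative only.
val df = sqlContext.range(0, 16000000L)
  .withColumn("key1", col("id") % 1000)
  .withColumn("key2", (col("id") / 1000).cast("long") % 1000)

// Hash aggregation over the composite key; count() forces execution.
df.groupBy("key1", "key2")
  .agg(count("id"), avg("id"))
  .count()
```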
Streaming workloads typically run 24/7 and have stringent stability requirements. In this release, Typesafe has introduced Backpressure in Spark Streaming. With this feature, Spark Streaming can dynamically control the data ingest rates to adapt to unpredictable variations in processing load. This allows streaming applications to be more robust against bursty workloads and downstream delays.
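Backpressure is opt-in and controlled by a single configuration flag, `spark.streaming.backpressure.enabled`. The sketch below shows how it might be enabled in a simple word-count streaming job; the socket source (localhost:9999) and the 1-second batch interval are placeholder values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable dynamic rate control so ingest adapts to the processing rate.
val conf = new SparkConf()
  .setAppName("BackpressureSketch")
  .setMaster("local[2]")
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(1))

// Placeholder source: a text socket stream.
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```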
Of course, Spark 1.5 is the work of more than 220 open source contributors from over 80 organizations, and includes a lot more than the above two. Some examples include:
New machine learning algorithms: multilayer perceptron classifier, PrefixSpan for sequential pattern mining, association rule generation, etc. (a sketch of the perceptron classifier follows this list).
Improved R language support and GLMs with R formula.
Better instrumentation and reporting of memory usage in web UI.
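As a taste of the new algorithms, here is a minimal sketch of the multilayer perceptron classifier in the spark.ml pipeline API. The LibSVM file path is a placeholder, and the layer sizes assume 4 input features, two hidden layers, and 3 output classes.

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

// Placeholder multiclass dataset in LibSVM format.
val data = MLUtils
  .loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt")
  .toDF()
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

// Layers: 4 input features -> two hidden layers -> 3 classes (assumed shape).
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))
  .setBlockSize(128)
  .setMaxIter(100)

val model = mlp.fit(train)
val predictions = model.transform(test)

val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision: " + evaluator.evaluate(predictions))
```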
Stay tuned for future blog posts covering the release as well as deep dives into specific improvements.
Launching a Spark 1.5 cluster is as easy as selecting Spark 1.5 experimental version in the cluster creation interface in Databricks.
Once you hit confirm, you will get a Spark cluster ready to go with Spark 1.5.0 and start testing the new release. Multiple Spark version support in Databricks also enables users to run Spark 1.5 canary clusters side-by-side with existing production Spark clusters.
You can find the work-in-progress documentation for Spark 1.5.0 here. Please be aware that just like any other preview software, Spark 1.5.0 support is experimental. There will be bugs and quirks that we find and fix in the next couple of weeks. The good news is that you don’t have to worry about following the development or upgrading yourself. As we discover and fix bugs in the open source project, the Spark 1.5 option in Databricks will also be updated automatically. If you encounter a bug, please report it by filing a JIRA ticket.
To try Databricks, sign up for a free 30-day trial.
At the last Spark meetup in Beijing, a Spark committer mentioned that they had been busy with Spark 1.5, the core of which is Tungsten, a new execution backend for DataFrames/SQL. Through code generation and cache-aware algorithms, Tungsten improves runtime performance with out-of-the-box configurations. Through explicit memory management and external operations, the new backend also mitigates the inefficiency of JVM garbage collection and improves robustness in large-scale workloads.
For now, the first phase of Spark 1.5 appears to be complete; there will likely be many more optimizations and fixes later, but you can already get a taste of it. If you want to look at the 1.5 code, check the Spark 1.5 branch on GitHub. Personally, I feel the main gain is in Spark SQL, since most companies run Spark on YARN and most of the hoped-for improvements to their jobs are on the Spark SQL side.