Apache Beam's Ambitious Goal: Unify Big Data Development
Alex Woodie
If you’re tired of using multiple technologies to accomplish various big data tasks, you may want to consider Apache Beam, a new distributed processing tool from Google that’s now incubating at the ASF.
One of the challenges of big data development is the need to use lots of different technologies, frameworks, APIs, languages, and software development kits. Depending on what you’re trying to do, and where you’re trying to do it, you may choose MapReduce for batch processing, Apache Spark SQL for interactive queries, Apache Flink for real-time streaming, or a machine learning framework running on the cloud.
While the open source movement has provided an abundance of riches for big data developers, it has increased pressure on the developer to pick “the right” tool for what she is trying to accomplish. This can be a bit overwhelming for those new to big data application development, and it could slow or even hinder adoption of open source tools. (Indeed, the complexity of having to manually stitch everything together is perhaps the most common rallying cry heard from backers of proprietary big data platforms.)
Enter Google (NASDAQ: GOOG). The Web giant is hoping to eliminate some of this second-guessing and painful tool-jumping with Apache Beam, which it’s positioning as a single programming and runtime model that not only unifies development for batch, interactive, and streaming workflows, but also provides a single model for both cloud and on-premise development.
The software is based on the technologies Google uses with its Cloud Dataflow service, which the company launched in 2014 as the second coming of MapReduce for the current generation of distributed data processing challenges. (It’s worth noting that FlumeJava and MillWheel also influenced the Dataflow model.)
Source: beam.incubator.apache.org/presentation-materials/
The open source Apache Beam project essentially is the combination of the Dataflow Software Development Kit (SDK) and the Dataflow model, along with a series of “runners” that extend out to run-time frameworks, namely Apache Spark, Apache Flink, and Cloud Dataflow itself, which Google lets you try out for free and will charge you money to use in production.
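To make the runner concept concrete, here is a minimal sketch of how a single Beam pipeline picks its execution backend at launch time. It is written against the current org.apache.beam.sdk package layout (the incubating code still shipped under the original Dataflow SDK namespace), and the class name is illustrative. The backend is selected with a command-line flag such as --runner=FlinkRunner or --runner=SparkRunner, so the pipeline code itself never changes:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerPortability {
      public static void main(String[] args) {
        // Parse --runner (plus any runner-specific flags, such as a Flink
        // master address or a Cloud Dataflow project) from the command line.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        // The pipeline definition that follows is identical no matter
        // which backend ends up executing it.
        Pipeline pipeline = Pipeline.create(options);
        // ... apply the same transforms regardless of the chosen runner ...
        pipeline.run();
      }
    }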
Apache Beam provides a unified model for not only designing but also executing (via runners) a variety of data-oriented workflows, including data processing, data ingestion, and data integration, according to the Apache Beam project page. “Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow,” the project says.
The project, which was originally named Apache Dataflow before taking the Apache Beam moniker, is being championed by Jean-Baptiste Onofré, who is currently an SOA software architect at French data integration toolmaker Talend and works on many Apache Software Foundation projects. Joining Google in the project are data Artisans, which developed and maintains the Beam runner for Flink, and Cloudera, which developed and maintains the runner for Spark. Developers from Cask and PayPal are also involved.
Onofré describes the impetus behind the technology in a recent post to his blog:
“Imagine, you have a Hadoop cluster where you used MapReduce jobs,” he writes. “Now, you want to ‘migrate’ these jobs to Spark: you have to refactore [sic] all your jobs which requires lot of works and cost a lot. And after that, see the effort and cost if you want to change for a new platform like Flink: you have to refactore [sic] your jobs again.
“Dataflow aims to provide an abstraction layer between your code and the execution runtime,” he continues. “The SDK allows you to use an unified programming model: you implement your data processing logic using the Dataflow SDK, the same code will run on different backends. You don’t need to refactore [sic] and change the code anymore!”
There are four main constructs in the Apache Beam SDK, according to the Apache Beam proposal posted to the ASF’s website. These constructs include:

- Pipelines: the data processing jobs themselves, each made up of a series of computations including input, processing, and output;
- PCollections: bounded (or unbounded) datasets, which represent the input, intermediate, and output data in pipelines;
- PTransforms: the data processing steps in a pipeline, which take one or more PCollections as input and produce one or more PCollections as output;
- I/O Sources and Sinks: APIs for reading and writing data, which serve as the roots and endpoints of the pipeline.
“Beam can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation,” the proposal states. The underlying programming model for Beam provides MapReduce-like parallelism, combined with support for powerful data windowing and fine-grained correctness control.
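As a concrete illustration of those four constructs, here is a hedged sketch of a line-counting pipeline written against the Beam Java SDK’s current API; the class name and bucket paths are illustrative, not taken from the proposal:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.ToString;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class FourConstructs {
      public static void main(String[] args) {
        // Pipeline: the container for the whole data processing job.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // I/O source feeding a PCollection: reading yields an immutable,
        // potentially distributed dataset of lines.
        PCollection<String> lines =
            p.apply(TextIO.read().from("gs://my-bucket/input.txt"));

        // PTransform: Count.perElement() is a built-in aggregate step that
        // maps each distinct line to the number of times it occurs.
        PCollection<KV<String, Long>> counts = lines.apply(Count.perElement());

        // I/O sink: format the key/value results and write them back out.
        counts.apply(ToString.kvs()).apply(TextIO.write().to("gs://my-bucket/output"));

        p.run();
      }
    }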
Source: beam.incubator.apache.org/presentation-materials/
Many of the concepts behind Beam are similar to those found in Spark. However, there are important differences, as Google engineers discussed in a recent article.
“Spark has had a huge and positive impact on the industry thanks to doing a number of things much better than other systems had done before,” the engineers write. “But Dataflow holds distinct advantages in programming model flexibility, power, and expressiveness, particularly in the out-of-order processing and real-time session management arenas…. The fact is: no other massive-scale data parallel programming model provides the depth-of-capability and ease-of-use that Dataflow/Beam does.”
Portability of code is a key feature of Beam. “Beam was designed from the start to provide a portable programming layer,” Onofré and others write in the Beam proposal. “When you define a data processing pipeline with the Beam model, you are creating a job which is capable of being processed by any number of Beam processing engines.”
Beam’s Java-based SDK is currently available at GitHub (as well as on Stack Overflow), and a second SDK for Python is currently in the works. The developers have an ambitious set of goals, including creating additional Beam runners (Apache Storm and MapReduce are possible contenders), as well as support for other programming languages.
Beam developers note that the project is also closely related to Apache Crunch, a Java-based framework for Hadoop and Spark that simplifies the programming of data pipelines for common tasks such as joining and aggregations, which are tedious to implement in MapReduce.
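To see why that matters, here is a hedged sketch of such an aggregation in the Beam Java SDK: a per-key sum that would require a full mapper/reducer/driver cycle in MapReduce collapses into a single built-in transform (the class name and sample data are illustrative):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class AggregationSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        // The map/shuffle/reduce boilerplate of a MapReduce job becomes a
        // single transform: group by key and sum the integer values.
        PCollection<KV<String, Integer>> totals =
            p.apply(Create.of(KV.of("clicks", 3), KV.of("clicks", 5), KV.of("views", 7)))
             .apply(Sum.integersPerKey());

        p.run().waitUntilFinish();
      }
    }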
Google announced in January that it wanted to donate Dataflow to the ASF, and the ASF accepted the proposal in early February, when it was renamed Apache Beam. The project, which is in the process of moving from GitHub to Apache, is currently incubating.
“In the long term, we believe Beam can be a powerful abstraction layer for data processing,” the Beam proposal says. “By providing an abstraction layer for data pipelines and processing, data workflows can be increasingly portable, resilient to breaking changes in tooling, and compatible across many execution engines, runtimes, and open source projects.”
Related Items:
Apache Flink Creators Get $6M to Simplify Stream Processing