Flink入門（五）——DataSet Api編程指南

時間 2020-03-02

標籤 flink 入門 dataset api 編程指南简体版

原文原文鏈接

Apache Flink

Apache Flink 是一個兼顧高吞吐、低延遲、高性能的分佈式處理框架。在實時計算崛起的今天，Flink正在飛速發展。因爲性能的優點和兼顧批處理，流處理的特性，Flink可能正在顛覆整個大數據的生態。html

DataSet API

首先要想運行Flink，咱們須要下載並解壓Flink的二進制包，下載地址以下：https://flink.apache.org/down... java

咱們能夠選擇Flink與Scala結合版本，這裏咱們選擇最新的1.9版本Apache Flink 1.9.0 for Scala 2.12進行下載。linux

下載成功後，在windows系統中能夠經過Windows的bat文件或者Cygwin來運行Flink。c++

在linux系統中分爲單機，集羣和Hadoop等多種狀況。git

請參考：Flink入門（三）——環境與部署github

Flink的編程模型，Flink提供了不一樣的抽象級別以開發流式或者批處理應用，本文咱們來介紹DataSet API ，Flink最經常使用的批處理編程模型。算法

Flink中的DataSet程序是實現數據集轉換的常規程序（例如，Filter，映射，鏈接，分組）。數據集最初是從某些來源建立的（例如，經過讀取文件或從本地集合建立）。結果經過接收器返回，接收器能夠例如將數據寫入（分佈式）文件或標準輸出（例如命令行終端）。Flink程序能夠在各類環境中運行，獨立運行或嵌入其餘程序中。執行能夠在本地JVM中執行，也能夠在許多計算機的集羣上執行。apache

示例程序編程

如下程序是WordCount的完整工做示例。您能夠複製並粘貼代碼以在本地運行它。segmentfault

Java

public class WordCountExample {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> text = env.fromElements(
            "Who's there?",
            "I think I hear them. Stand, ho! Who's there?");

        DataSet<Tuple2<String, Integer>> wordCounts = text
            .flatMap(new LineSplitter())
            .groupBy(0)
            .sum(1);

        wordCounts.print();
    }

    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

Scala

import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]) {

    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")

    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)

    counts.print()
  }
}

數據集轉換

數據轉換將一個或多個DataSet轉換爲新的DataSet。程序能夠將多個轉換組合到複雜的程序集中。

DataSet API 中最重要的就是這些算子，咱們將數據接入後，經過這些算子對數據進行處理，獲得咱們想要的結果。

Java版算子以下：

轉換	描述
Map	採用一個數據元並生成一個數據元。`data.map(new MapFunction<String, Integer>() { public Integer map(String value) { return Integer.parseInt(value); } });`
FlatMap	採用一個數據元並生成零個，一個或多個數據元。`data.flatMap(new FlatMapFunction<String, String>() { public void flatMap(String value, Collector<String> out) { for (String s : value.split(" ")) { out.collect(s); } } });`
MapPartition	在單個函數調用中轉換並行分區。該函數將分區做爲`Iterable`流來獲取，而且能夠生成任意數量的結果值。每一個分區中的數據元數量取決於並行度和先前的算子操做。`data.mapPartition(new MapPartitionFunction<String, Long>() { public void mapPartition(Iterable<String> values, Collector<Long> out) { long c = 0; for (String s : values) { c++; } out.collect(c); } });`
Filter	計算每一個數據元的布爾函數，並保存函數返回true的數據元。重要信息：系統假定該函數不會修改應用謂詞的數據元。違反此假設可能會致使錯誤的結果。`data.filter(new FilterFunction<Integer>() { public boolean filter(Integer value) { return value > 1000; } });`
Reduce	經過將兩個數據元重複組合成一個數據元，將一組數據元組合成一個數據元。Reduce能夠應用於完整數據集或分組數據集。`data.reduce(new ReduceFunction<Integer> { public Integer reduce(Integer a, Integer b) { return a + b; } });`若是將reduce應用於分組數據集，則能夠經過提供`CombineHint`to 來指定運行時執行reduce的組合階段的方式 `setCombineHint`。在大多數狀況下，基於散列的策略應該更快，特別是若是不一樣鍵的數量與輸入數據元的數量相比較小（例如1/10）。
ReduceGroup	將一組數據元組合成一個或多個數據元。ReduceGroup能夠應用於完整數據集或分組數據集。`data.reduceGroup(new GroupReduceFunction<Integer, Integer> { public void reduce(Iterable<Integer> values, Collector<Integer> out) { int prefixSum = 0; for (Integer i : values) { prefixSum += i; out.collect(prefixSum); } } });`
Aggregate	將一組值聚合爲單個值。聚合函數能夠被認爲是內置的reduce函數。聚合能夠應用於完整數據集或分組數據集。`Dataset<Tuple3<Integer, String, Double>> input = // [...] DataSet<Tuple3<Integer, String, Double>> output = input.aggregate(SUM, 0).and(MIN, 2);`您還可使用簡寫語法進行最小，最大和總和聚合。 `Dataset<Tuple3<Integer, String, Double>> input = // [...] DataSet<Tuple3<Integer, String, Double>> output = input.sum(0).andMin(2);`
Distinct	返回數據集的不一樣數據元。它相對於數據元的全部字段或字段子集從輸入DataSet中刪除重複條目。`data.distinct();`使用reduce函數實現Distinct。您能夠經過提供`CombineHint`to 來指定運行時執行reduce的組合階段的方式 `setCombineHint`。在大多數狀況下，基於散列的策略應該更快，特別是若是不一樣鍵的數量與輸入數據元的數量相比較小（例如1/10）。
Join	經過建立在其鍵上相等的全部數據元對來鏈接兩個數據集。可選地使用JoinFunction將數據元對轉換爲單個數據元，或使用FlatJoinFunction將數據元對轉換爲任意多個（包括無）數據元。請參閱鍵部分以瞭解如何定義鏈接鍵。`result = input1.join(input2) .where(0) // key of the first input (tuple field 0) .equalTo(1); // key of the second input (tuple field 1)`您能夠經過Join Hints指定運行時執行鏈接的方式。提示描述了經過分區或廣播進行鏈接，以及它是使用基於排序仍是基於散列的算法。有關可能的提示和示例的列表，請參閱「轉換指南」。若是未指定提示，系統將嘗試估算輸入大小，並根據這些估計選擇最佳策略。`// This executes a join by broadcasting the first data set // using a hash table for the broadcast data result = input1.join(input2, JoinHint.BROADCAST_HASH_FIRST) .where(0).equalTo(1);`請注意，鏈接轉換僅適用於等鏈接。其餘鏈接類型須要使用OuterJoin或CoGroup表示。
OuterJoin	在兩個數據集上執行左，右或全外鏈接。外鏈接相似於常規（內部）鏈接，並建立在其鍵上相等的全部數據元對。此外，若是在另外一側沒有找到匹配的Keys，則保存「外部」側（左側，右側或二者都滿）的記錄。匹配數據元對（或一個數據元和`null`另外一個輸入的值）被賦予JoinFunction以將數據元對轉換爲單個數據元，或者轉換爲FlatJoinFunction以將數據元對轉換爲任意多個（包括無）數據元。請參閱鍵部分以瞭解如何定義鏈接鍵。`input1.leftOuterJoin(input2) // rightOuterJoin or fullOuterJoin for right or full outer joins .where(0) // key of the first input (tuple field 0) .equalTo(1) // key of the second input (tuple field 1) .with(new JoinFunction<String, String, String>() { public String join(String v1, String v2) { // NOTE: // - v2 might be null for leftOuterJoin // - v1 might be null for rightOuterJoin // - v1 OR v2 might be null for fullOuterJoin } });`
CoGroup	reduce 算子操做的二維變體。將一個或多個字段上的每一個輸入分組，而後關聯組。每對組調用轉換函數。請參閱keys部分以瞭解如何定義coGroup鍵。`data1.coGroup(data2) .where(0) .equalTo(1) .with(new CoGroupFunction<String, String, String>() { public void coGroup(Iterable<String> in1, Iterable<String> in2, Collector<String> out) { out.collect(...); } });`
Cross	構建兩個輸入的笛卡爾積（交叉乘積），建立全部數據元對。可選擇使用CrossFunction將數據元對轉換爲單個數據元`DataSet<Integer> data1 = // [...] DataSet<String> data2 = // [...] DataSet<Tuple2<Integer, String>> result = data1.cross(data2);`注：交叉是一個潛在的很是計算密集型算子操做它甚至能夠挑戰大的計算集羣！建議使用crossWithTiny（）和crossWithHuge（）來提示系統的DataSet大小。
Union	生成兩個數據集的並集。`DataSet<String> data1 = // [...] DataSet<String> data2 = // [...] DataSet<String> result = data1.union(data2);`
Rebalance	均勻地Rebalance 數據集的並行分區以消除數據誤差。只有相似Map的轉換可能會遵循Rebalance 轉換。`DataSet<String> in = // [...] DataSet<String> result = in.rebalance() .map(new Mapper());`
Hash-Partition	散列分區給定鍵上的數據集。鍵能夠指定爲位置鍵，表達鍵和鍵選擇器函數。`DataSet<Tuple2<String,Integer>> in = // [...] DataSet<Integer> result = in.partitionByHash(0) .mapPartition(new PartitionMapper());`
Range-Partition	Range-Partition給定鍵上的數據集。鍵能夠指定爲位置鍵，表達鍵和鍵選擇器函數。`DataSet<Tuple2<String,Integer>> in = // [...] DataSet<Integer> result = in.partitionByRange(0) .mapPartition(new PartitionMapper());`
Custom Partitioning	手動指定數據分區。注意：此方法僅適用於單個字段鍵。`DataSet<Tuple2<String,Integer>> in = // [...] DataSet<Integer> result = in.partitionCustom(Partitioner<K> partitioner, key)`
Sort Partition	本地按指定順序對指定字段上的數據集的全部分區進行排序。能夠將字段指定爲元組位置或字段表達式。經過連接sortPartition（）調用來完成對多個字段的排序。`DataSet<Tuple2<String,Integer>> in = // [...] DataSet<Integer> result = in.sortPartition(1, Order.ASCENDING) .mapPartition(new PartitionMapper());`
First-n	返回數據集的前n個（任意）數據元。First-n能夠應用於常規數據集，分組數據集或分組排序數據集。分組鍵能夠指定爲鍵選擇器函數或字段位置鍵。`DataSet<Tuple2<String,Integer>> in = // [...] // regular data set DataSet<Tuple2<String,Integer>> result1 = in.first(3); // grouped data set DataSet<Tuple2<String,Integer>> result2 = in.groupBy(0) .first(3); // grouped-sorted data set DataSet<Tuple2<String,Integer>> result3 = in.groupBy(0) .sortGroup(1, Order.ASCENDING) .first(3);`

數據源

數據源建立初始數據集，例如來自文件或Java集合。建立數據集的通常機制是在InputFormat後面抽象的。Flink附帶了幾種內置格式，能夠從通用文件格式建立數據集。他們中的許多人在ExecutionEnvironment上都有快捷方法。

基於文件的：

readTextFile(path)/ TextInputFormat- 按行讀取文件並將其做爲字符串返回。
readTextFileWithValue(path)/ TextValueInputFormat- 按行讀取文件並將它們做爲StringValues返回。StringValues是可變字符串。
readCsvFile(path)/ CsvInputFormat- 解析逗號（或其餘字符）分隔字段的文件。返回元組或POJO的DataSet。支持基本java類型及其Value對應做爲字段類型。
readFileOfPrimitives(path, Class)/ PrimitiveInputFormat- 解析新行（或其餘字符序列）分隔的原始數據類型（如String或）的文件Integer。
readFileOfPrimitives(path, delimiter, Class)/ PrimitiveInputFormat- 解析新行（或其餘字符序列）分隔的原始數據類型的文件，例如String或Integer使用給定的分隔符。
readSequenceFile(Key, Value, path)/ SequenceFileInputFormat- 建立一個JobConf並從類型爲SequenceFileInputFormat，Key class和Value類的指定路徑中讀取文件，並將它們做爲Tuple2 <Key，Value>返回。

基於集合：

fromCollection(Collection) - 從Java Java.util.Collection建立數據集。集合中的全部數據元必須屬於同一類型。
fromCollection(Iterator, Class) - 從迭代器建立數據集。該類指定迭代器返回的數據元的數據類型。
fromElements(T ...) - 根據給定的對象序列建立數據集。全部對象必須屬於同一類型。
fromParallelCollection(SplittableIterator, Class) - 並行地從迭代器建立數據集。該類指定迭代器返回的數據元的數據類型。
generateSequence(from, to) - 並行生成給定間隔中的數字序列。

通用：

readFile(inputFormat, path)/ FileInputFormat- 接受文件輸入格式。
createInput(inputFormat)/ InputFormat- 接受通用輸入格式。

例子

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// read text file from local files system
DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile");

// read text file from a HDFS running at nnHost:nnPort
DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");

// read a CSV file with three fields
DataSet<Tuple3<Integer, String, Double>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
                           .types(Integer.class, String.class, Double.class);

// read a CSV file with five fields, taking only two of them
DataSet<Tuple2<String, Double>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
                               .includeFields("10010")  // take the first and the fourth field
                           .types(String.class, Double.class);

// read a CSV file with three fields into a POJO (Person.class) with corresponding fields
DataSet<Person>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
                         .pojoType(Person.class, "name", "age", "zipcode");

// read a file from the specified path of type SequenceFileInputFormat
DataSet<Tuple2<IntWritable, Text>> tuples =
 env.readSequenceFile(IntWritable.class, Text.class, "hdfs://nnHost:nnPort/path/to/file");

// creates a set from some given elements
DataSet<String> value = env.fromElements("Foo", "bar", "foobar", "fubar");

// generate a number sequence
DataSet<Long> numbers = env.generateSequence(1, 10000000);

// Read data from a relational database using the JDBC input format
DataSet<Tuple2<String, Integer> dbData =
    env.createInput(
      JDBCInputFormat.buildJDBCInputFormat()
                     .setDrivername("org.apache.derby.jdbc.EmbeddedDriver")
                     .setDBUrl("jdbc:derby:memory:persons")
                     .setQuery("select name, age from persons")
                     .setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO))
                     .finish()
    );

// Note: Flink's program compiler needs to infer the data types of the data items which are returned
// by an InputFormat. If this information cannot be automatically inferred, it is necessary to
// manually provide the type information as shown in the examples above.

收集數據源和接收器

經過建立輸入文件和讀取輸出文件來完成分析程序的輸入並檢查其輸出是很麻煩的。Flink具備特殊的數據源和接收器，由Java集合支持以簡化測試。一旦程序通過測試，源和接收器能夠很容易地被讀取/寫入外部數據存儲（如HDFS）的源和接收器替換。

在開發中，咱們常常直接使用接收器對數據源進行接收。

final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();

// Create a DataSet from a list of elements
DataSet<Integer> myInts = env.fromElements(1, 2, 3, 4, 5);

// Create a DataSet from any Java collection
List<Tuple2<String, Integer>> data = ...
DataSet<Tuple2<String, Integer>> myTuples = env.fromCollection(data);

// Create a DataSet from an Iterator
Iterator<Long> longIt = ...
DataSet<Long> myLongs = env.fromCollection(longIt, Long.class);

廣播變量

除了常規的算子操做輸入以外，廣播變量還容許您爲算子操做的全部並行實例提供數據集。這對於輔助數據集或與數據相關的參數化很是有用。而後，算子能夠將數據集做爲集合訪問。

// 1. The DataSet to be broadcast
DataSet<Integer> toBroadcast = env.fromElements(1, 2, 3);

DataSet<String> data = env.fromElements("a", "b");

data.map(new RichMapFunction<String, String>() {
    @Override
    public void open(Configuration parameters) throws Exception {
      // 3. Access the broadcast DataSet as a Collection
      Collection<Integer> broadcastSet = getRuntimeContext().getBroadcastVariable("broadcastSetName");
    }


    @Override
    public String map(String value) throws Exception {
        ...
    }
}).withBroadcastSet(toBroadcast, "broadcastSetName"); // 2. Broadcast the DataSet

分佈式緩存

Flink提供了一個分佈式緩存，相似於Apache Hadoop，能夠在本地訪問用戶函數的並行實例。此函數可用於共享包含靜態外部數據的文件，如字典或機器學習的迴歸模型。

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// register a file from HDFS
env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile")

// register a local executable file (script, executable, ...)
env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true)

// define your program and execute
...
DataSet<String> input = ...
DataSet<Integer> result = input.map(new MyMapper());
...
env.execute();

以上就是DataSet API 的使用，其實和spark很是的類似，咱們將數據接入後，能夠利用各類算子對數據進行處理。

Flink Demo代碼

Flink系列文章：

Flink入門（一）——Apache Flink介紹
 Flink入門（二）——Flink架構介紹

Flink入門（三）——環境與部署

Flink入門（四）——編程模型

更多實時計算,Flink,Kafka等相關技術博文，歡迎關注實時流式計算