Flink's AggregateFunction

Date: 2019-11-19

Tags: flink, aggregatefunction

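What is AggregateFunction

Flink's AggregateFunction is a function that computes incrementally on an intermediate result kept as state. Because each record is folded into that state as it arrives, a window does not have to buffer all of its data while it is being processed, so execution is comparatively efficient.

Definition of AggregateFunction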

/**
 * The {@code AggregateFunction} is a flexible aggregation function, characterized by the
 * following features:
 *
 * <ul>
 *     <li>The aggregates may use different types for input values, intermediate aggregates,
 *         and result type, to support a wide range of aggregation types.</li>
 *
 *     <li>Support for distributive aggregations: Different intermediate aggregates can be
 *         merged together, to allow for pre-aggregation/final-aggregation optimizations.</li>
 * </ul>
 *
 * <p>The {@code AggregateFunction}'s intermediate aggregate (in-progress aggregation state)
 * is called the <i>accumulator</i>. Values are added to the accumulator, and final aggregates are
 * obtained by finalizing the accumulator state. This supports aggregation functions where the
 * intermediate state needs to be different than the aggregated values and the final result type,
 * such as for example <i>average</i> (which typically keeps a count and sum).
 * Merging intermediate aggregates (partial aggregates) means merging the accumulators.
 *
 * <p>The AggregationFunction itself is stateless. To allow a single AggregationFunction
 * instance to maintain multiple aggregates (such as one aggregate per key), the
 * AggregationFunction creates a new accumulator whenever a new aggregation is started.
 *
 * <p>Aggregation functions must be {@link Serializable} because they are sent around
 * between distributed processes during distributed execution.
 *
 * <h1>Example: Average and Weighted Average</h1>
 *
 * <pre>{@code
 * // the accumulator, which holds the state of the in-flight aggregate
 * public class AverageAccumulator {
 *     long count;
 *     long sum;
 * }
 *
 * // implementation of an aggregation function for an 'average'
 * public class Average implements AggregateFunction<Integer, AverageAccumulator, Double> {
 *
 *     public AverageAccumulator createAccumulator() {
 *         return new AverageAccumulator();
 *     }
 *
 *     public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
 *         a.count += b.count;
 *         a.sum += b.sum;
 *         return a;
 *     }
 *
 *     public AverageAccumulator add(Integer value, AverageAccumulator acc) {
 *         acc.sum += value;
 *         acc.count++;
 *         return acc;
 *     }
 *
 *     public Double getResult(AverageAccumulator acc) {
 *         return acc.sum / (double) acc.count;
 *     }
 * }
 *
 * // implementation of a weighted average
 * // this reuses the same accumulator type as the aggregate function for 'average'
 * public class WeightedAverage implements AggregateFunction<Datum, AverageAccumulator, Double> {
 *
 *     public AverageAccumulator createAccumulator() {
 *         return new AverageAccumulator();
 *     }
 *
 *     public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
 *         a.count += b.count;
 *         a.sum += b.sum;
 *         return a;
 *     }
 *
 *     public AverageAccumulator add(Datum value, AverageAccumulator acc) {
 *         acc.count += value.getWeight();
 *         acc.sum += value.getValue();
 *         return acc;
 *     }
 *
 *     public Double getResult(AverageAccumulator acc) {
 *         return acc.sum / (double) acc.count;
 *     }
 * }
 * }</pre>
 *
 * @param <IN>  The type of the values that are aggregated (input values)
 * @param <ACC> The type of the accumulator (intermediate aggregate state).
 * @param <OUT> The type of the aggregated result
 */
@PublicEvolving
public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {

	/**
	 * Creates a new accumulator, starting a new aggregate.
	 *
	 * <p>The new accumulator is typically meaningless unless a value is added
	 * via {@link #add(Object, Object)}.
	 *
	 * <p>The accumulator is the state of a running aggregation. When a program has multiple
	 * aggregates in progress (such as per key and window), the state (per key and window)
	 * is the size of the accumulator.
	 *
	 * @return A new accumulator, corresponding to an empty aggregate.
	 */
	ACC createAccumulator();

	/**
	 * Adds the given input value to the given accumulator, returning the
	 * new accumulator value.
	 *
	 * <p>For efficiency, the input accumulator may be modified and returned.
	 *
	 * @param value The value to add
	 * @param accumulator The accumulator to add the value to
	 */
	ACC add(IN value, ACC accumulator);

	/**
	 * Gets the result of the aggregation from the accumulator.
	 *
	 * @param accumulator The accumulator of the aggregation
	 * @return The final aggregation result.
	 */
	OUT getResult(ACC accumulator);

	/**
	 * Merges two accumulators, returning an accumulator with the merged state.
	 *
	 * <p>This function may reuse any of the given accumulators as the target for the merge
	 * and return that. The assumption is that the given accumulators will not be used any
	 * more after having been passed to this function.
	 *
	 * @param a An accumulator to merge
	 * @param b Another accumulator to merge
	 *
	 * @return The accumulator with the merged state
	 */
	ACC merge(ACC a, ACC b);
}

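As the definition shows, four methods have to be implemented:

ACC createAccumulator(); the initial value of the accumulator state

ACC add(IN value, ACC accumulator); how each input record is folded into the accumulator

ACC merge(ACC a, ACC b); how partial accumulators (for example, from different partitions) are merged

OUT getResult(ACC accumulator); how the final accumulator is turned into the result that is returned

Below is a demo that computes an average: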
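
To illustrate how the four methods fit together, here is a minimal sketch in plain Java that does not depend on Flink. `SimpleAggregate` is a hypothetical stand-in that only mirrors the four method signatures above, and `AvgAcc`, `AverageSketch` and `LifecycleDemo` are illustrative names, not part of any library. Two accumulators are filled independently and then merged, which imitates partial aggregates being combined before the result is read.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in mirroring the four methods of Flink's AggregateFunction (not the real interface).
interface SimpleAggregate<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC accumulator);
    OUT getResult(ACC accumulator);
    ACC merge(ACC a, ACC b);
}

// Accumulator for an average: running sum and running count.
class AvgAcc {
    long sum;
    long count;
}

class AverageSketch implements SimpleAggregate<Integer, AvgAcc, Double> {
    public AvgAcc createAccumulator() { return new AvgAcc(); }

    public AvgAcc add(Integer value, AvgAcc acc) {
        acc.sum += value;
        acc.count++;
        return acc; // the modified input accumulator is returned
    }

    public Double getResult(AvgAcc acc) { return acc.sum / (double) acc.count; }

    public AvgAcc merge(AvgAcc a, AvgAcc b) {
        a.sum += b.sum;
        a.count += b.count;
        return a; // reuses 'a' as the merge target
    }
}

class LifecycleDemo {
    // Folds values into a fresh accumulator one at a time; only the accumulator is kept.
    static <IN, ACC, OUT> ACC fold(SimpleAggregate<IN, ACC, OUT> fn, List<IN> values) {
        ACC acc = fn.createAccumulator();
        for (IN v : values) {
            acc = fn.add(v, acc);
        }
        return acc;
    }

    // Aggregates two parts separately, merges the partial accumulators, then reads the result.
    static double averageOfTwoParts(List<Integer> part1, List<Integer> part2) {
        AverageSketch fn = new AverageSketch();
        AvgAcc merged = fn.merge(fold(fn, part1), fold(fn, part2));
        return fn.getResult(merged);
    }

    public static void main(String[] args) {
        System.out.println(averageOfTwoParts(Arrays.asList(1, 2, 3), Arrays.asList(4, 5)));
    }
}
```

Note that the accumulator holds only a sum and a count, so its size stays constant no matter how many values are added.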

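import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val input: DataStream[(String, Int)] = …………
val result: DataStream[Double] = input.keyBy(_._1)
      // Sliding event-time window: window size 1 hour, slide 10 seconds
      .window(SlidingEventTimeWindows.of(Time.hours(1), Time.seconds(10)))
      .aggregate(new AggregateFunction[(String, Int), (Int, Int), Double] {
        // Initial accumulator, laid out as (sum, count)
        override def createAccumulator(): (Int, Int) = (0, 0)

        // Fold one record into the accumulator: add its score to the sum and increment the count
        override def add(value: (String, Int), accumulator: (Int, Int)): (Int, Int) = (accumulator._1 + value._2, accumulator._2 + 1)

        // Combine two partial accumulators
        override def merge(a: (Int, Int), b: (Int, Int)): (Int, Int) = (a._1 + b._1, a._2 + b._2)

        // Produce the result; convert to Double first to avoid integer division
        override def getResult(accumulator: (Int, Int)): Double = accumulator._1.toDouble / accumulator._2
      })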

In the code above, the input records have type (String, Int). The String can be regarded as the key and the Int as the score.
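
To make the key/score reading concrete, here is a small plain-Java sketch of what keyed aggregation amounts to: one (sum, count) accumulator per key. It involves no Flink and no windowing; `KeyedAverageSketch` and the sample keys and scores are made up for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyedAverageSketch {
    // One accumulator per key: acc[0] = sum of scores, acc[1] = number of records.
    private final Map<String, long[]> accumulators = new LinkedHashMap<>();

    // Plays the role of add(): fold one (key, score) record into that key's accumulator,
    // creating a fresh accumulator the first time a key is seen.
    void add(String key, int score) {
        long[] acc = accumulators.computeIfAbsent(key, k -> new long[] {0L, 0L});
        acc[0] += score;
        acc[1] += 1;
    }

    // Plays the role of getResult(): the average score recorded so far for a key.
    double average(String key) {
        long[] acc = accumulators.get(key);
        if (acc == null) {
            throw new IllegalArgumentException("unknown key: " + key);
        }
        return acc[0] / (double) acc[1];
    }

    public static void main(String[] args) {
        KeyedAverageSketch sketch = new KeyedAverageSketch();
        sketch.add("alice", 90);
        sketch.add("bob", 70);
        sketch.add("alice", 80);
        System.out.println("alice=" + sketch.average("alice") + " bob=" + sketch.average("bob"));
    }
}
```

In the Flink job itself the per-key bookkeeping is handled by keyBy and the window operator; the aggregate function only ever sees one accumulator at a time.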