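Flink's AggregateFunction is a function that computes incrementally on top of an intermediate result state. Because each element is folded into that state as it arrives, the window does not need to buffer all of its elements while it is being processed, so execution is comparatively efficient.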
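To make the difference concrete, here is a minimal plain-Java sketch (no Flink dependency) that contrasts buffering a whole window with folding each element into a fixed-size state. The class name, method names, and sample values are hypothetical and only illustrate the idea; this is not how Flink implements windows internally.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Flink dependency); all names here are hypothetical.
final class IncrementalVsBuffered {

    // Buffered approach: keep every element until the window closes.
    static double bufferedAverage(int[] window) {
        List<Integer> buffer = new ArrayList<>();
        for (int v : window) {
            buffer.add(v); // state grows with the number of elements
        }
        long sum = 0;
        for (int v : buffer) {
            sum += v;
        }
        return sum / (double) buffer.size();
    }

    // Incremental approach: fold each element into a (sum, count) state.
    static double incrementalAverage(int[] window) {
        long sum = 0;
        long count = 0;
        for (int v : window) {
            sum += v; // state stays two longs regardless of window size
            count++;
        }
        return sum / (double) count;
    }

    public static void main(String[] args) {
        int[] window = {90, 80, 70}; // made-up sample values
        System.out.println(bufferedAverage(window));
        System.out.println(incrementalAverage(window));
    }
}
```

Both methods return the same average, but the second never holds more than a sum and a count, which is the property the paragraph above attributes to AggregateFunction.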
    /**
     * The {@code AggregateFunction} is a flexible aggregation function, characterized by the
     * following features:
     *
     * <ul>
     *     <li>The aggregates may use different types for input values, intermediate aggregates,
     *         and result type, to support a wide range of aggregation types.</li>
     *
     *     <li>Support for distributive aggregations: Different intermediate aggregates can be
     *         merged together, to allow for pre-aggregation/final-aggregation optimizations.</li>
     * </ul>
     *
     * <p>The {@code AggregateFunction}'s intermediate aggregate (in-progress aggregation state)
     * is called the <i>accumulator</i>. Values are added to the accumulator, and final aggregates are
     * obtained by finalizing the accumulator state. This supports aggregation functions where the
     * intermediate state needs to be different than the aggregated values and the final result type,
     * such as for example <i>average</i> (which typically keeps a count and sum).
     * Merging intermediate aggregates (partial aggregates) means merging the accumulators.
     *
     * <p>The AggregationFunction itself is stateless. To allow a single AggregationFunction
     * instance to maintain multiple aggregates (such as one aggregate per key), the
     * AggregationFunction creates a new accumulator whenever a new aggregation is started.
     *
     * <p>Aggregation functions must be {@link Serializable} because they are sent around
     * between distributed processes during distributed execution.
     *
     * <h1>Example: Average and Weighted Average</h1>
     *
     * <pre>{@code
     * // the accumulator, which holds the state of the in-flight aggregate
     * public class AverageAccumulator {
     *     long count;
     *     long sum;
     * }
     *
     * // implementation of an aggregation function for an 'average'
     * public class Average implements AggregateFunction<Integer, AverageAccumulator, Double> {
     *
     *     public AverageAccumulator createAccumulator() {
     *         return new AverageAccumulator();
     *     }
     *
     *     public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
     *         a.count += b.count;
     *         a.sum += b.sum;
     *         return a;
     *     }
     *
     *     public AverageAccumulator add(Integer value, AverageAccumulator acc) {
     *         acc.sum += value;
     *         acc.count++;
     *         return acc;
     *     }
     *
     *     public Double getResult(AverageAccumulator acc) {
     *         return acc.sum / (double) acc.count;
     *     }
     * }
     *
     * // implementation of a weighted average
     * // this reuses the same accumulator type as the aggregate function for 'average'
     * public class WeightedAverage implements AggregateFunction<Datum, AverageAccumulator, Double> {
     *
     *     public AverageAccumulator createAccumulator() {
     *         return new AverageAccumulator();
     *     }
     *
     *     public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
     *         a.count += b.count;
     *         a.sum += b.sum;
     *         return a;
     *     }
     *
     *     public AverageAccumulator add(Datum value, AverageAccumulator acc) {
     *         acc.count += value.getWeight();
     *         acc.sum += value.getValue();
     *         return acc;
     *     }
     *
     *     public Double getResult(AverageAccumulator acc) {
     *         return acc.sum / (double) acc.count;
     *     }
     * }
     * }</pre>
     *
     * @param <IN>  The type of the values that are aggregated (input values)
     * @param <ACC> The type of the accumulator (intermediate aggregate state).
     * @param <OUT> The type of the aggregated result
     */
    @PublicEvolving
    public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {

        /**
         * Creates a new accumulator, starting a new aggregate.
         *
         * <p>The new accumulator is typically meaningless unless a value is added
         * via {@link #add(Object, Object)}.
         *
         * <p>The accumulator is the state of a running aggregation. When a program has multiple
         * aggregates in progress (such as per key and window), the state (per key and window)
         * is the size of the accumulator.
         *
         * @return A new accumulator, corresponding to an empty aggregate.
         */
        ACC createAccumulator();

        /**
         * Adds the given input value to the given accumulator, returning the
         * new accumulator value.
         *
         * <p>For efficiency, the input accumulator may be modified and returned.
         *
         * @param value The value to add
         * @param accumulator The accumulator to add the value to
         */
        ACC add(IN value, ACC accumulator);

        /**
         * Gets the result of the aggregation from the accumulator.
         *
         * @param accumulator The accumulator of the aggregation
         * @return The final aggregation result.
         */
        OUT getResult(ACC accumulator);

        /**
         * Merges two accumulators, returning an accumulator with the merged state.
         *
         * <p>This function may reuse any of the given accumulators as the target for the merge
         * and return that. The assumption is that the given accumulators will not be used any
         * more after having been passed to this function.
         *
         * @param a An accumulator to merge
         * @param b Another accumulator to merge
         *
         * @return The accumulator with the merged state
         */
        ACC merge(ACC a, ACC b);
    }
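From the definition, four methods must be implemented:
- ACC createAccumulator(); the initial value of the accumulator state
- ACC add(IN value, ACC accumulator); how each input record is folded into the accumulator
- ACC merge(ACC a, ACC b); how partial accumulators (for example, from different partitions) are merged
- OUT getResult(ACC accumulator); how the final accumulator is turned into the returned result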
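The order in which these four methods cooperate can be sketched without Flink. The example below is an assumption-laden illustration: `SimpleAggregate` is a hypothetical local interface that only mirrors the four signatures listed above, and `averageOf` simulates two partial aggregations by hand. In a real job the runtime decides when each method is called.

```java
// Plain-Java sketch; SimpleAggregate is a hypothetical local stand-in that mirrors
// the four method signatures listed above, so the example runs without Flink.
final class AggregateLifecycleSketch {

    interface SimpleAggregate<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
        ACC merge(ACC a, ACC b);
    }

    // Accumulator layout: acc[0] = running sum, acc[1] = element count.
    static final class Average implements SimpleAggregate<Integer, long[], Double> {
        @Override
        public long[] createAccumulator() {
            return new long[] {0L, 0L};
        }

        @Override
        public long[] add(Integer value, long[] acc) {
            acc[0] += value;
            acc[1]++;
            return acc; // the modified input accumulator is returned
        }

        @Override
        public Double getResult(long[] acc) {
            return acc[0] / (double) acc[1];
        }

        @Override
        public long[] merge(long[] a, long[] b) {
            a[0] += b[0];
            a[1] += b[1];
            return a; // reuses the first accumulator as the merge target
        }
    }

    // Folds each partition into its own accumulator, then merges the partial results.
    static double averageOf(int[] partitionA, int[] partitionB) {
        Average fn = new Average();
        long[] accA = fn.createAccumulator();
        for (int v : partitionA) {
            accA = fn.add(v, accA);
        }
        long[] accB = fn.createAccumulator();
        for (int v : partitionB) {
            accB = fn.add(v, accB);
        }
        return fn.getResult(fn.merge(accA, accB));
    }

    public static void main(String[] args) {
        // Made-up sample data: the combined average of 1..5.
        System.out.println(averageOf(new int[] {1, 2, 3}, new int[] {4, 5}));
    }
}
```

The function object itself holds no state here; everything lives in the accumulators it creates, which matches the "stateless" remark in the Javadoc.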
Below is a demo that computes an average:
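    import org.apache.flink.api.common.functions.AggregateFunction
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    val input: DataStream[(String, Int)] = …………

    val result: DataStream[Double] = input
      .keyBy(_._1)
      // Sliding event-time window: size 1 hour, slide 10 seconds
      .window(SlidingEventTimeWindows.of(Time.hours(1), Time.seconds(10)))
      .aggregate(new AggregateFunction[(String, Int), (Int, Int), Double] {
        // Initial accumulator value: (sum, count)
        override def createAccumulator(): (Int, Int) = (0, 0)

        // Fold one input record into the accumulator: add its score, bump the count
        override def add(value: (String, Int), accumulator: (Int, Int)): (Int, Int) =
          (accumulator._1 + value._2, accumulator._2 + 1)

        // Combine two partial accumulators
        override def merge(a: (Int, Int), b: (Int, Int)): (Int, Int) =
          (a._1 + b._1, a._2 + b._2)

        // Final result; convert to Double first to avoid integer division
        override def getResult(accumulator: (Int, Int)): Double =
          accumulator._1.toDouble / accumulator._2
      })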
In the code above, the input records are (String, Int) pairs: the String can be treated as the key and the Int as a score.
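As a rough illustration of what "key" and "score" mean for the result, the following plain-Java sketch keeps one (sum, count) accumulator per key and reports an average for each. It deliberately ignores windowing and event time, and the names and sample records are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of per-key averaging; windowing is ignored and all names are hypothetical.
final class KeyedAverageSketch {

    // Returns key -> average score, keeping one (sum, count) accumulator per key.
    static Map<String, Double> averageByKey(String[] keys, int[] scores) {
        Map<String, long[]> accumulators = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            // A fresh accumulator is created the first time a key is seen.
            long[] acc = accumulators.computeIfAbsent(keys[i], k -> new long[] {0L, 0L});
            acc[0] += scores[i]; // running sum for this key
            acc[1]++;            // element count for this key
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> e : accumulators.entrySet()) {
            long[] acc = e.getValue();
            result.put(e.getKey(), acc[0] / (double) acc[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample records in (key, score) form.
        String[] keys = {"alice", "bob", "alice"};
        int[] scores = {90, 70, 80};
        System.out.println(averageByKey(keys, scores));
    }
}
```

In the Flink demo, `keyBy(_._1)` plays the role of the map lookup, and the window assigner additionally scopes each accumulator to one window.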
The walkthrough uses the demo above as its example.