Apache beam其餘學習記錄

時間 2019-11-11

原文原文鏈接

Combine與GroupByKey

GroupByKey是把相關key的元素聚合到一塊兒，一般是造成一個Iterable的value，如：安全

cat, [1,5,9]
dog, [5,2]
and, [1,2,6]

Combine是對聚合後的Iterable進行處理（如求和，求均值），返回一個結果。內置的Combine.perKey（）方法實際上是GroupByKey和Combine的結合，先聚合和處理。
Beam中還有許多內置的處理類，好比Sum.integersPerKey（），Count.perElement()等
在全局窗口下，對於空輸入，Combine操做後通常會返回默認值（好比Sum的默認返回值爲0），若是設置了.withoutDefault（），則返回空的PCollection。
在非全局窗口下，用戶必須指明空輸入時的返回類型，若是Combine的輸出結果要做爲下一級處理的輸入，通常設置爲.asSingletonView（），表示返回默認值，這樣即便空窗口也有默認值返回，保證了窗口的數量不變；若是設置了.withoutDefault(),則空的窗口返回空PCollection，通常做爲最後的輸出結果。app

Platten與Patition

用於PCollection與PCollectionList的轉換。
官方文檔給的Platten代碼很容易理解：ide

// Flatten takes a PCollectionList of PCollection objects of a given type.
// Returns a single PCollection that contains all of the elements in the PCollection objects in that list.
PCollection<String> pc1 = ...;
PCollection<String> pc2 = ...;
PCollection<String> pc3 = ...;
PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3);

PCollection<String> merged = collections.apply(Flatten.<String>pCollections());

將一個PCollectionList={ PCollection{String1}， PCollection{String2}， PCollection{String3} }轉換爲一個PCollection={String1， String2， String3}.
而Patition恰好反過來，要將PCollection轉換爲PCollectionList須要指明分紅的list長度以及如何劃分，所以須要傳遞劃分長度size和劃分方法Fn。編碼

// Split students up into 10 partitions, by percentile:
PCollectionList<Student> studentsByPercentile =
    students.apply(Partition.of(10, new PartitionFn<Student>() {
        public int partitionFor(Student student, int numPartitions) {
            return student.getPercentile()  // 0..99
                 * numPartitions / 100;
        }}));

其中partitionFor（）方法返回的是在PCollectionList中的位置下標。線程

Side Input

不能使用硬編碼數據，一般是轉換中間產生的數據。通常用於跟主輸入數據進行比較，所以要求Side Input數據的窗口要與主輸入數據的窗口儘可能一致，若是不一致，Beam會盡量地從Side Input中找到合適的位置的數據進行比較。對於設置了多個觸發器的Side Input，自動選擇最後一個觸發的結算結果。code

附屬輸出數據 Additional Outputs

這一部分官方的代碼已經寫得很清楚，看代碼便可。對象

數據編碼

在Pipeline的數據處理過程當中常常須要對數據元素進行字節轉換，所以須要制定字節轉換的編碼格式。對於絕大部分類型的數據，Beam都提供了默認的編碼類型，用戶也能夠經過SetCoder指定編碼類型。
1）從內存讀取的輸入數據通常要求用戶指定其編碼類型；
2）用戶自定義的類對象通常要求用戶指定其編碼類型，或者能夠在類定義上使用@DefaultCoder（AvroCoder.class)指定默認編碼類型。ip