Reduce-Side Join: Partitioning, Grouping, and Aggregation

Date: 2020-02-17


1.1.1 Reduce-side join: partitioning, grouping, and aggregation

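A reduce-side join uses the partitioning step of the shuffle to send all records with the same stationid to the same partition, then uses the grouping step on the reduce side to put the station record and the temperature records for the same stationid into one group. The reduce function reads the first record of each group (the station name) and combines it with each of the remaining records before writing the output, which performs the join. As an example, join the station dataset and the temperature record dataset below. Only a few rows are used here for illustration; a real dataset would be much larger.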

Station dataset: a table of station ids and names

StationId StationName

1~hangzhou

2~shanghai

3~beijing

Temperature record dataset

StationId TimeStamp Temperature

3~20200216~6

3~20200215~2

3~20200217~8

1~20200211~9

1~20200210~8

2~20200214~3

2~20200215~4

Goal: join the two datasets above by adding the station name to each temperature record according to its station id. The final output is:

1~hangzhou ~20200211~9

1~hangzhou ~20200210~8

2~shanghai ~20200214~3

2~shanghai ~20200215~4

3~beijing ~20200216~6

3~beijing ~20200215~2

3~beijing ~20200217~8
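The whole flow can be sketched in plain Java, without Hadoop, to make the idea concrete. This is only an illustration under assumptions: the class ReduceSideJoinSketch, the nested Tagged type, and the join method are hypothetical names, an in-memory stable sort stands in for the shuffle, and the output is written as id~name~timestamp~temperature with no extra spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical, Hadoop-free simulation of a reduce-side join.
class ReduceSideJoinSketch {

    // One tagged record: mark 0 = station name, mark 1 = temperature reading.
    static final class Tagged {
        final int stationId;
        final int mark;
        final String value;

        Tagged(int stationId, int mark, String value) {
            this.stationId = stationId;
            this.mark = mark;
            this.value = value;
        }
    }

    static List<String> join(List<String> stations, List<String> readings) {
        List<Tagged> tagged = new ArrayList<>();
        for (String line : stations) {
            String[] f = line.split("~", 2);   // id~name
            tagged.add(new Tagged(Integer.parseInt(f[0]), 0, f[1]));
        }
        for (String line : readings) {
            String[] f = line.split("~", 2);   // id~timestamp~temperature
            tagged.add(new Tagged(Integer.parseInt(f[0]), 1, f[1]));
        }

        // Stand-in for the shuffle sort: by station id, then by mark,
        // so each station's name precedes its readings.
        tagged.sort(Comparator.comparingInt((Tagged t) -> t.stationId)
                .thenComparingInt(t -> t.mark));

        // Stand-in for the reduce step: remember the station name,
        // then prepend it to every reading of the same station.
        List<String> out = new ArrayList<>();
        Integer currentId = null;
        String stationName = null;
        for (Tagged t : tagged) {
            if (t.mark == 0) {
                currentId = t.stationId;
                stationName = t.value;
            } else if (currentId != null && currentId == t.stationId) {
                out.add(t.stationId + "~" + stationName + "~" + t.value);
            }
            // Readings without a matching station are dropped (inner join).
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> stations = Arrays.asList("1~hangzhou", "2~shanghai", "3~beijing");
        List<String> readings = Arrays.asList(
                "3~20200216~6", "3~20200215~2", "3~20200217~8",
                "1~20200211~9", "1~20200210~8",
                "2~20200214~3", "2~20200215~4");
        join(stations, readings).forEach(System.out::println);
    }
}
```

In a real job the sort, partitioning, and grouping are done by the framework rather than by a single in-memory list; the sketch only shows why tagging station records with 0 and readings with 1 puts the name first in each group.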

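The detailed steps are as follows.

(1) Two mappers read the two datasets and write to the same output

Because the two datasets have different formats, two separate mappers are needed, one per dataset. Both must emit the same key and value types so that their output is merged into a single map output for the job. Use MultipleInputs to register the two input paths and assign a mapper to each.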
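The parsing done by the two mappers can be sketched in plain Java as follows. This is an assumption-laden illustration, not the article's code: TaggingMappersSketch, mapStation, and mapRecord are hypothetical names, and the composite key is written as the string "stationid,mark" instead of a Hadoop key type. In an actual job each method body would live in its own Mapper subclass, registered with MultipleInputs.addInputPath.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical stand-ins for the two mappers: each parses its own line
// format but emits the same (key, value) shape.
class TaggingMappersSketch {

    // Station line "1~hangzhou" -> key "1,0", value "hangzhou".
    static Map.Entry<String, String> mapStation(String line) {
        String[] f = line.split("~", 2);
        return new SimpleEntry<>(f[0] + ",0", f[1]);
    }

    // Record line "1~20200211~9" -> key "1,1", value "20200211~9".
    static Map.Entry<String, String> mapRecord(String line) {
        String[] f = line.split("~", 2);
        return new SimpleEntry<>(f[0] + ",1", f[1]);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> station = mapStation("1~hangzhou");
        Map.Entry<String, String> record = mapRecord("1~20200211~9");
        System.out.println("<" + station.getKey() + "> " + station.getValue());
        System.out.println("<" + record.getKey() + "> " + record.getValue());
    }
}
```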

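(2) Create a composite key <stationid,mark> to sort the map output.

The composite key sorts the map output by stationid in ascending order, and sorts records with the same stationid by the second field in ascending order. mark takes only two values: 0 for records read from the station dataset and 1 for records read from the temperature record dataset. This guarantees that, among records with the same stationid, the first one is the station name and the rest are temperature records. The composite key TextPair is defined as follows: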

package Temperature;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Composite key <stationid, mark> used to sort the map output.
public class TextPair implements WritableComparable<TextPair> {
    // Initialized eagerly so that readFields() never runs on a null field.
    private Text first = new Text();
    private Text second = new Text();

    // Hadoop instantiates keys reflectively when deserializing,
    // so a no-argument constructor is required.
    public TextPair() {
    }

    public TextPair(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public int compareTo(TextPair o) {
        int cmp = first.compareTo(o.getFirst());
        if (cmp != 0) { // first fields differ: ascending by first field
            return cmp;
        }
        // first fields are equal: ascending by second field
        return second.compareTo(o.getSecond());
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        first.write(dataOutput);
        second.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        first.readFields(dataInput);
        second.readFields(dataInput);
    }

    public Text getFirst() {
        return first;
    }

    public void setFirst(Text first) {
        this.first = first;
    }

    public Text getSecond() {
        return second;
    }

    public void setSecond(Text second) {
        this.second = second;
    }
}

The mapper output is as follows, with the composite key first and the value after it.

<1,0> hangzhou

<1,1> 20200211~9

<1,1> 20200210~8

<2,0> shanghai

<2,1> 20200214~3

<2,1> 20200215~4

<3,0> beijing

<3,1> 20200216~6

<3,1> 20200215~2

<3,1> 20200217~8
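The ordering that the composite key imposes can be checked with a small plain-Java sketch. It is an illustration only: PairKey is a hypothetical String-based stand-in for TextPair that mirrors its compareTo logic, and CompositeKeyOrderSketch and sortedKeys are made-up names. As with Text, the comparison is lexicographic, so an id such as "10" would sort before "2"; that does not affect the join, which only needs equal ids to be adjacent with mark 0 first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical String-based stand-in for TextPair, used only to show
// the sort order produced by its two-level comparison.
class CompositeKeyOrderSketch {

    static final class PairKey implements Comparable<PairKey> {
        final String first;   // stationid
        final String second;  // mark: "0" = station, "1" = temperature record

        PairKey(String first, String second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public int compareTo(PairKey o) {
            int cmp = first.compareTo(o.first);
            // Fall back to the second field only when the first fields tie.
            return cmp != 0 ? cmp : second.compareTo(o.second);
        }

        @Override
        public String toString() {
            return "<" + first + "," + second + ">";
        }
    }

    // Returns the keys in sorted order, rendered as "<first,second>".
    static List<String> sortedKeys(List<PairKey> keys) {
        List<PairKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        List<String> out = new ArrayList<>();
        for (PairKey k : copy) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<PairKey> keys = Arrays.asList(
                new PairKey("3", "1"), new PairKey("1", "1"), new PairKey("3", "0"),
                new PairKey("1", "0"), new PairKey("2", "1"), new PairKey("2", "0"));
        System.out.println(sortedKeys(keys));
    }
}
```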

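(3) The map output is passed to reduce, partitioned by stationid, then grouped and aggregated

The map output is sorted by the first field of the composite key, stationid, in ascending order, and records with the same stationid are sorted by the second field in ascending order, so station records and temperature records are mixed together. During the shuffle, when the map output is transferred to the reducers, it passes through the partitioner: records with the same stationid are sent to the same reducer, and within one reducer, records with the same stationid are placed in one group. Suppose two reduce tasks are used and the partition is chosen by stationid%2; the result after partitioning is:

Partition 1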