Hadoop Study Notes 2: Summary of MapReduce Source Code Analysis

Date: 2019-12-11


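1. The logical process of Map-Reduce

Suppose we need to process a batch of weather data in the following format:

Stored as ASCII, one record per line
Characters in each line are counted from 0; characters 15 through 18 are the year
Characters 25 through 29 are the temperature, where character 25 is the sign (+ or -)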

0067011990999991950051507+0000+

0043011990999991950051512+0022+

0043011990999991950051518-0011+

0043012650999991949032412+0111+

0043012650999991949032418+0078+

0067011990999991937051507+0001+

0043011990999991937051512-0002+

0043011990999991945051518+0001+

0043012650999991945032412+0002+

0043012650999991945032418+0078+
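The fixed-width layout described above can be checked with a few lines of plain Java. This is only a sketch of the field extraction, independent of Hadoop; the class name `WeatherRecordParser` and its method names are made up for illustration.

```java
class WeatherRecordParser {
    /** Returns the year stored at character positions 15..18 (0-based). */
    static String parseYear(String line) {
        return line.substring(15, 19);
    }

    /** Returns the signed temperature; position 25 is the sign, 26..29 the digits. */
    static int parseTemperature(String line) {
        int magnitude = Integer.parseInt(line.substring(26, 30));
        return line.charAt(25) == '-' ? -magnitude : magnitude;
    }

    public static void main(String[] args) {
        String[] samples = {
            "0067011990999991950051507+0000+",
            "0043011990999991950051518-0011+",
            "0043012650999991949032412+0111+"
        };
        for (String line : samples) {
            // Prints "1950 -> 0", "1950 -> -11", "1949 -> 111"
            System.out.println(parseYear(line) + " -> " + parseTemperature(line));
        }
    }
}
```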

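Now we need to find the highest temperature for each year.

Map-Reduce consists of two main steps: Map and Reduce.

Each step takes key-value pairs as input and produces key-value pairs as output:

The format of the key-value pairs in the map phase is determined by the input format. With the default TextInputFormat, each line is processed as one record: the key is the byte offset of the start of the line relative to the start of the file, and the value is the text of the line.
The format of the key-value pairs output by the map phase must match the format of the key-value pairs that the reduce phase takes as input.

For the example above, the key-value pairs fed to the map phase are as follows (the offsets advance by 33 per record, which corresponds to the 31 characters of a record plus a two-byte line ending):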

(0, 0067011990999991950051507+0000+)

(33, 0043011990999991950051512+0022+)

(66, 0043011990999991950051518-0011+)

(99, 0043012650999991949032412+0111+)

(132, 0043012650999991949032418+0078+)

(165, 0067011990999991937051507+0001+)

(198, 0043011990999991937051512-0002+)

(231, 0043011990999991945051518+0001+)

(264, 0043012650999991945032412+0002+)

(297, 0043012650999991945032418+0078+)
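To see where these keys come from, the sketch below computes the starting byte offset of each line for a given line terminator. It is plain Java rather than Hadoop code, and the class and method names are hypothetical. Each record is 31 ASCII characters long, so the step of 33 shown above matches a two-byte `\r\n` terminator; with a one-byte `\n` the step would be 32.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineOffsets {
    /** Returns the byte offset at which each line starts when lines are joined by the terminator. */
    static List<Long> startOffsets(List<String> lines, String terminator) {
        List<Long> offsets = new ArrayList<>();
        long position = 0;
        int terminatorLength = terminator.getBytes(StandardCharsets.US_ASCII).length;
        for (String line : lines) {
            offsets.add(position);
            position += line.getBytes(StandardCharsets.US_ASCII).length + terminatorLength;
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "0067011990999991950051507+0000+",
            "0043011990999991950051512+0022+",
            "0043011990999991950051518-0011+");
        System.out.println(startOffsets(lines, "\r\n")); // [0, 33, 66]
        System.out.println(startOffsets(lines, "\n"));   // [0, 32, 64]
    }
}
```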

In the map phase, each line is parsed to produce a (year, temperature) key-value pair as output:

(1950, 0)

(1950, 22)

(1950, -11)

(1949, 111)

(1949, 78)

(1937, 1)

(1937, -2)

(1945, 1)

(1945, 2)

(1945, 78)

Before the reduce phase, the map output is grouped by key: all values that share a key are placed in one list, which becomes the input to reduce:

(1950, [0, 22, -11])

(1949, [111, 78])

(1937, [1, -2])

(1945, [1, 2, 78])

In the reduce phase, the maximum temperature is selected from each list, and the (year, maximum temperature) key-value pair is emitted as output:

(1950, 22)

(1949, 111)

(1937, 1)

(1945, 78)
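The whole logical flow (map, group by key, reduce) can be imitated in memory with ordinary Java collections. This is a sketch for intuition only: it runs in a single process, does not use any Hadoop class, and the names `MaxTemperatureSimulation`, `mapAndGroup` and `reduce` are invented here. A `TreeMap` is used so that keys come out sorted, which is why the years print in ascending order rather than in the order listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MaxTemperatureSimulation {
    /** Map step plus grouping: parse each line and collect temperatures per year. */
    static Map<String, List<Integer>> mapAndGroup(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String year = line.substring(15, 19);
            int magnitude = Integer.parseInt(line.substring(26, 30));
            int temperature = line.charAt(25) == '-' ? -magnitude : magnitude;
            grouped.computeIfAbsent(year, k -> new ArrayList<>()).add(temperature);
        }
        return grouped;
    }

    /** Reduce step: keep the maximum temperature of each year's list. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int max = Integer.MIN_VALUE;
            for (int temperature : entry.getValue()) {
                max = Math.max(max, temperature);
            }
            result.put(entry.getKey(), max);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "0067011990999991950051507+0000+",
            "0043011990999991950051512+0022+",
            "0043011990999991950051518-0011+",
            "0043012650999991949032412+0111+",
            "0043012650999991949032418+0078+",
            "0067011990999991937051507+0001+",
            "0043011990999991937051512-0002+",
            "0043011990999991945051518+0001+",
            "0043012650999991945032412+0002+",
            "0043012650999991945032418+0078+");
        Map<String, List<Integer>> grouped = mapAndGroup(lines);
        System.out.println(grouped);         // {1937=[1, -2], 1945=[1, 2, 78], 1949=[111, 78], 1950=[0, 22, -11]}
        System.out.println(reduce(grouped)); // {1937=1, 1945=78, 1949=111, 1950=22}
    }
}
```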

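The logical flow can be illustrated by the following figure:

2. Writing a Map-Reduce program

Writing a Map-Reduce program generally means implementing two functions: the map function of the Mapper and the reduce function of the Reducer.

They generally follow this form: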

map: (K1, V1) -> list(K2, V2)

public interface Mapper<K1,V1,K2,V2> extends JobConfigurable, Closeable{
    void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException;
}

reduce: (K2, list(V2)) -> list(K3, V3)

public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
  void reduce(K2 key, Iterator<V2> values,
              OutputCollector<K3, V3> output, Reporter reporter)
    throws IOException;
}

For the example above, the mapper is implemented as follows:

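import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        // Characters 15..18 hold the year.
        String year = line.substring(15, 19);
        int airTemperature;
        // Character 25 is the sign; a leading '+' is skipped before parsing.
        if (line.charAt(25) == '+') {
            airTemperature = Integer.parseInt(line.substring(26, 30));
        } else {
            airTemperature = Integer.parseInt(line.substring(25, 30));
        }
        // Emit (year, temperature) as the map output.
        output.collect(new Text(year), new IntWritable(airTemperature));
    }
}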

The reducer is implemented as follows:

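import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Scan all temperatures recorded for this year and keep the largest.
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        // Emit (year, maximum temperature).
        output.collect(key, new IntWritable(maxValue));
    }
}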

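To run the Mapper and Reducer implemented above, we need to create a Map-Reduce job, which basically consists of three parts:

The input data, that is, the data to be processed
The Map-Reduce program, that is, the Mapper and Reducer implemented above
The configuration of the job, JobConf

To configure JobConf, it helps to have a rough understanding of how Hadoop runs a job:

Hadoop splits a job into tasks, of which there are two kinds: map tasks and reduce tasks.
Two kinds of nodes control the execution of a job: the JobTracker and the TaskTrackers.
The JobTracker coordinates the whole job and assigns tasks to different TaskTrackers.
A TaskTracker runs tasks and reports the results back to the JobTracker.
Hadoop divides the input data into fixed-size pieces called input splits.
Hadoop creates one map task for each input split, and that task processes the records in the split one by one.
Hadoop tries to run each task on the same node as the DataNode that holds its input data (every DataNode also runs a TaskTracker), which improves efficiency. This is why the size of an input split is usually the size of an HDFS block.
The input of a reduce task is generally the output of the map tasks, and the output of the reduce tasks is the output of the whole job.
In the reduce phase, all records with the same key are guaranteed to be processed on the same TaskTracker, while different keys may be processed on different TaskTrackers. This is called partitioning.
The partitioning rule is (K2, V2) -> Integer: a partition id is computed from K2, and all keys with the same id go into the same partition, where they are processed by the same Reducer on the same TaskTracker.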

public interface Partitioner<K2, V2> extends JobConfigurable {
  int getPartition(K2 key, V2 value, int numPartitions);
}
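As an illustration of the (K2, V2) -> Integer rule, the sketch below applies the arithmetic that Hadoop's HashPartitioner uses: clear the sign bit of the key's hash code, then take the remainder modulo the number of partitions. The class `HashPartitionSketch` is a stand-in written for this note and works on plain `Object` keys; in a real job the key would be a `Text`, whose `hashCode()` differs from that of `String`, so the concrete partition numbers printed here should not be read as what a cluster would produce.

```java
class HashPartitionSketch {
    /** Maps a key to a partition id in the range [0, numPartitions). */
    static int getPartition(Object key, int numPartitions) {
        // Masking with Integer.MAX_VALUE keeps the result non-negative
        // even when hashCode() returns a negative number.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] years = {"1950", "1949", "1937", "1945", "1950"};
        for (String year : years) {
            // Equal keys always land in the same partition.
            System.out.println(year + " -> partition " + getPartition(year, 3));
        }
    }
}
```

Because the partition id depends only on the key, both occurrences of 1950 are sent to the same partition and therefore to the same reduce task.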

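The figure below roughly depicts how a Map-Reduce job runs:

Next we discuss JobConf, which has many configurable options:

setInputFormat: sets the input format of the map phase. The default is TextInputFormat, with LongWritable keys and Text values.
setNumMapTasks: sets the number of map tasks. This setting usually has no effect, because the number of map tasks depends on how many input splits the input data can be divided into. Each map task reads the records of its input split one by one and calls the Mapper's map function on each in turn.
setMapOutputKeyClass and setMapOutputValueClass: set the types of the key-value pairs output by the Mapper.
setOutputKeyClass and setOutputValueClass: set the types of the key-value pairs output by the Reducer.
setPartitionerClass and setNumReduceTasks: set the Partitioner and the number of reduce tasks. The default Partitioner is HashPartitioner, which uses the hash value of the key to decide which partition a record goes into. Each partition is processed by one reduce task, so the number of partitions equals the number of reduce tasks.
setReducerClass: sets the Reducer. The default is IdentityReducer.
setOutputFormat: sets the output format of the job. The default is TextOutputFormat.
FileInputFormat.addInputPath: sets the input path, which can be a file, a directory, or a glob pattern. It can be called several times to add several paths.
FileOutputFormat.setOutputPath: sets the output path. This path must not exist before the job runs.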
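The remark that the number of map tasks follows from the number of input splits can be made concrete with a rough estimate. The sketch below is a deliberate simplification, not Hadoop's actual split computation: it assumes a single input file, assumes the split size equals the HDFS block size, and ignores the tunable minimum split size and other details of FileInputFormat. The class name `InputSplitEstimate` and the 64 MiB block size are assumptions chosen for the example.

```java
class InputSplitEstimate {
    /** Ceiling division: how many splits of splitSize bytes are needed to cover fileLength bytes. */
    static long estimateSplits(long fileLength, long splitSize) {
        if (fileLength == 0) {
            return 0;
        }
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long blockSize = 64 * mib; // assumed block size for this example

        // A 1 GiB file covers exactly 16 blocks.
        System.out.println(estimateSplits(1024 * mib, blockSize)); // 16
        // A 200 MiB file needs three full splits plus one partial split.
        System.out.println(estimateSplits(200 * mib, blockSize));  // 4
    }
}
```

Under these assumptions the job would start roughly one map task per split, regardless of the value passed to setNumMapTasks.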

Of course, not all of these need to be set. For the example above, the Map-Reduce program can be written as follows:

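import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");

        // Input to read and output directory to create (must not exist yet).
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        // These also apply to the map output, because no separate
        // map output classes are set.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish.
        JobClient.runJob(conf);
    }
}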