MapReduce 原理與 Python 實踐

時間 2019-12-07

原文原文鏈接

MapReduce 原理與 Python 實踐

1. MapReduce 原理

如下是我的在MongoDB和Redis實際應用中總結的Map-Reduce的理解html

Hadoop 的 MapReduce 是基於 Google - MapReduce: Simplified Data Processing on Large Clusters 的一種實現。對 MapReduce 的基本介紹以下：python

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.編程

MapReduce 是一種編程模型，用於處理大規模的數據。用戶主要經過指定一個 map 函數和一個 reduce 函數來處理一個基於key/value pair的數據集合，輸出中間的基於key/value pair的數據集合；而後再建立一個Reduce函數用來合併全部的具備相同中間key值的中間value值。看到 map/reduce 很容易就聯想到函數式編程，而實際上論文中也提到確實受到 Lisp 和其它函數式編程語言的啓發。以 Python 爲例，map/reduce 的用法以下：bash

from functools import reduce
    from operator import add
    ls = map(lambda x: len(x), ["ana", "bob", "catty", "dogge"])
    # print(list(ls))
    # => [3, 3, 5, 5]
    reduce(add, ls)
    # => 16

MapReduce 的優點在於對大規模數據進行切分（split），並在分佈式集羣上分別運行 map/reduce 並行加工，而用戶只須要針對數據處理邏輯編寫簡單的 map/reduce 函數，MapReduce 則負責保證分佈式運行和容錯機制。Hadoop 的 MapReduce 雖然由 Java 實現，但同時提供 Streaming API 能夠經過標準化輸入/輸出容許咱們使用任何編程語言來實現 map/reduce。markdown

以官方提供的 WordCount 爲例，輸入爲兩個文件：架構

hadoop fs -cat file0
    # Hello World Bye World

    hadoop fs -cat file1
    # Hello Hadoop Goodbye Hadoop

利用 MapReduce 來計算全部文件中單詞出現數量的統計。MapReduce 的運行過程以下圖所示：app

MapReduce

2.Python map/reduce

Hadoop 的 Streaming API 經過 STDIN/STDOUT 傳遞數據，所以 Python 版本的 map 能夠寫做：編程語言

#!/usr/bin/env python3
    import sys

    def read_inputs(file):
      for line in file:
        line = line.strip()
        yield line.split()
    def main():
      file = sys.stdin
      lines = read_inputs(file)
      for words in lines:
        for word in words:
          print("{}\t{}".format(word, 1))
    if __name__ == "__main__":
      main()

運行一下：分佈式

chmod +x map.py
    echo "Hello World Bye World" | ./map.py
    Hello   1
    #World   1
    #Bye     1
    #World   1

reduce 函數以此讀取通過排序以後的 map 函數的輸出，並統計單詞的次數：函數式編程

#!/usr/bin/env python3
    import sys

    def read_map_outputs(file):
      for line in file:
        yield line.strip().split("\t", 1)
    def main():
      current_word = None
      word_count   = 0
      lines = read_map_outputs(sys.stdin)
      for word, count in lines:
        try:
          count = int(count)
        except ValueError:
          continue
        if current_word == word:
          word_count += count
        else:
          if current_word:
            print("{}\t{}".format(current_word, word_count))
          current_word = word
          word_count = count
      if current_word:
        print("{}\t{}".format(current_word, word_count))
    if __name__ == "__main__":
      main()

reduce 的輸入是排序後的 map 輸出：

chmod +x reduce.py
    echo "Hello World Bye World" | ./map.py | sort | ./reduce.py

    # Bye     1
    # Hello   1
    # World   2

這其實與 MapReduce 的執行流程是一致的，下面咱們經過 MapReduce 來執行（已啓動 Hadoop），須要用到 hadoop-streaming-2.6.4.jar，不一樣的 Hadoop 版本位置可能不一樣：

cd $HADOOP_INSTALL && find ./ -name "hadoop-streaming*.jar"
    # ./share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar

    mkdir wordcount -p wordcount/input
    cd wordcount
    echo "Hello World Bye World" >> input/file0
    echo "Hello Hadoop Goodbye Hadoop" >> input/file1

    hadoop jar $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar \
    -input $(pwd)/input \
    -output output \
    -mapper $(pwd)/map.py \
    -reducer $(pwd)/reduce.py

執行完成以後會在 output 目錄產生結果：

hadoop fs -ls output
    # Found 2 items
    # -rw-r--r--   1 rainy rainy          0 2016-03-13 02:15 output/_SUCCESS
    # -rw-r--r--   1 rainy rainy         41 2016-03-13 02:15 output/part-00000
    hadoop fs -cat output/part-00000
    # Bye     1
    # Goodbye 1
    # Hadoop  2
    # Hello   2
    # World   2

3. 總結

Hadoop 的架構讓 MapReduce 的實際執行過程簡化了許多，但這裏省略了不少細節的內容，尤爲是針對徹底分佈式模式，而且要在輸入文件足夠大的狀況下才能體現出優點。這裏處理純文本文檔做爲示例，但我想要作的是經過鏈接 MongoDB 直接讀取數據到 HDFS 而後進行 MapReduce 處理，但考慮到數據量仍然不是很大（700,000條記錄）的狀況，不知道是否會比直接 Python + MongoDB 更快。

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。