用Python實現基於Hadoop Stream的mapreduce任務

時間 2019-11-06

標籤 python 實現基於 hadoop stream mapreduce 任務欄目 Python 简体版

原文原文鏈接

用Python實現基於Hadoop Stream的mapreduce任務

由於Hadoop Stream的存在，使得任何支持讀寫標準數據流的編程語言實現map和reduce操做成爲了可能。python

爲了方便測試map代碼和reduce代碼，下面給出一個Linux環境下的shell 命令：shell

cat inputFileName | python map.py | sort | python map.py > outputFileName編程

能夠輕鬆的在沒有hadoop 環境的機器上進行測試。app

下面介紹，在Hadoop環境中的，如何用Python完成Map和Reduce兩個任務的代碼編寫。編程語言

任務示例

這裏依然採用大部分講述MapReduce文章中所採用的WordCount任務做爲示例。改任務須要統計給的海量文檔中，各類單詞出現的次數，其實就是統計詞頻（tf)。oop

map.py

import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("{}\t{}".format(word, 1))

reduce.py

import sys

word, curWord, wordCount = None, None, 0

for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)
    if word == curWord: wordCount += count
    else:
        print("{}\t{}".format(word, wordCount))
        curWord, wordCount = curWord, count
        
if word and word == curWord:
    print("{}\t{}".format(word, wordCount))

能夠在單機上執行前面所述的命令沒有問題後，而後執行下面的shell命令測試

hadoop jar $HADOOP_STREAMING \ 
-D mapred.job.name="自定義的job名字" \ 
-D mapred.map.tasks=1024 \
-D mapred.reduce.tasks=1024
-files map.py \ 
-files reduce.py \
-mapper "python map.py" \
-reducer "python reduce.py" \
-input /user/rte/hdfs_in/* \
-output /user/rte/hdfs_out