[Repost] Writing an Hadoop MapReduce Program in Python

Date: 2019-11-15

Tags: writing hadoop mapreduce program python | Category: Hadoop


mapper.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            # print() with one argument runs under both Python 2 and Python 3
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()
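To see what the mapper emits without a Hadoop cluster, the sketch below feeds it a small in-memory sample instead of STDIN. `map_words` and the sample text are hypothetical, introduced only for illustration; `read_input` is the same generator as in mapper.py.

```python
import io

def read_input(file):
    # same generator as in mapper.py: yields one list of words per input line
    for line in file:
        yield line.split()

def map_words(file, separator='\t'):
    # hypothetical helper: collects the mapper's output lines instead of printing them
    return ['%s%s%d' % (word, separator, 1)
            for words in read_input(file)
            for word in words]

# assumed sample input standing in for sys.stdin
sample = io.StringIO(u"foo foo quux\nlabs foo bar quux\n")
mapper_lines = map_words(sample)
for line in mapper_lines:
    print(line)
```

Each word produces its own `word<TAB>1` line, even when it repeats; summing the counts is left entirely to the reducer.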

reducer.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            # "_" avoids shadowing current_word inside the generator expression
            total_count = sum(int(count) for _, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
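Because `groupby` only merges consecutive items with the same key, the reducer depends on its input arriving sorted by word, which Hadoop does between the map and reduce steps. The sketch below illustrates this with a hypothetical `reduce_pairs` helper and made-up sample lines; `sorted()` on whole lines is only a rough stand-in for Hadoop's sort by key.

```python
from itertools import groupby
from operator import itemgetter

def reduce_pairs(lines, separator='\t'):
    # hypothetical helper mirroring reducer.py, returning (word, total) tuples
    pairs = (line.rstrip().split(separator, 1) for line in lines)
    return [(word, sum(int(count) for _, count in group))
            for word, group in groupby(pairs, itemgetter(0))]

# assumed mapper output, deliberately not sorted by word
mapper_lines = ['foo\t1', 'bar\t1', 'foo\t1', 'quux\t1', 'foo\t1']

# without sorting, "foo" appears in three separate groups
unsorted_result = reduce_pairs(mapper_lines)

# sorting first makes equal words adjacent, so each word is summed once
sorted_result = reduce_pairs(sorted(mapper_lines))

print(unsorted_result)
print(sorted_result)
```

When testing the two scripts locally through a shell pipeline, placing a `sort` step between mapper.py and reducer.py plays the same role.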

Reposted from: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
