Using Python for Statistical Processing on Hadoop

Date: 2019-11-07


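Hadoop's native language is Java, but it also provides interfaces for other languages such as C and Python. How do other languages work under Hadoop? Hadoop Streaming uses standard input (STDIN) and standard output (STDOUT) to pass data between the Map and Reduce steps for us.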
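As a minimal sketch of the line format this relies on: by default Hadoop Streaming treats the text up to the first tab on each line as the key and the rest as the value, and a line with no tab is all key with no value. The helper names encode_record and decode_record are hypothetical, used only for illustration.

```python
def encode_record(key, value):
    """Format one key/value pair the way a streaming mapper prints it."""
    return "{}\t{}".format(key, value)


def decode_record(line):
    """Split a streaming line at the first tab into (key, value).

    A line without a tab is returned as a key with no value (None here),
    mirroring the default Hadoop Streaming behaviour.
    """
    key, sep, value = line.rstrip("\n").partition("\t")
    return key, (value if sep else None)


if __name__ == "__main__":
    line = encode_record("world", 1)
    print(repr(line))
    print(decode_record(line))
    print(decode_record("no-tab-here"))
```

Only the first tab separates key from value, so any later tabs stay inside the value.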

Python map/reduce

Write map.py

#!/usr/bin/env python
import sys

def read_inputs(file):  
  for line in file:
    line = line.strip()
    yield line.split()

def main():  
  file = sys.stdin
  lines = read_inputs(file)
  for words in lines:
    for word in words:
      print("{}\t{}".format(word, 1))

if __name__ == "__main__":  
  main()

Test

echo "Hello world Bye world" | ./map.py 
Hello   1
world   1
Bye     1
world   1

Write reduce.py

#!/usr/bin/env python
import sys

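def read_map_outputs(file):
  for line in file:
    # Each mapper line is "word<TAB>count"; skip blank or malformed lines
    # so that unpacking into (word, count) in main() cannot fail.
    parts = line.strip().split("\t", 1)
    if len(parts) == 2:
      yield parts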

def main():  
  current_word = None
  word_count   = 0
  lines = read_map_outputs(sys.stdin)
  for word, count in lines:
    try:
      count = int(count)
    except ValueError:
      continue
    if current_word == word:
      word_count += count
    else:
      if current_word:
        print("{}\t{}".format(current_word, word_count))
      current_word = word
      word_count = count
  if current_word:
    print("{}\t{}".format(current_word, word_count))

if __name__ == "__main__":  
  main()

Test

echo "Hello World Bye World Hello" | ./map.py | sort | ./reduce.py
Bye     1
Hello   2
World   2
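The sort step in that pipeline matters because the reducer only compares each word with the previous one, so equal words must arrive next to each other. As a sketch of the same idea written another way (not the article's reduce.py), the grouping can be done with itertools.groupby. The names parse and reduce_counts are hypothetical.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping bad ones."""
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        try:
            value = int(count)
        except ValueError:
            continue  # no tab, or a count that is not an integer
        yield word, value


def reduce_counts(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # In-memory stand-in for sorted mapper output.
    sample = ["Bye\t1\n", "Hello\t1\n", "Hello\t1\n", "World\t1\n", "World\t1\n"]
    for word, total in reduce_counts(sample):
        print("{}\t{}".format(word, total))
```

To use it as a streaming reducer, pass sys.stdin to reduce_counts instead of the sample list.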

Everything above runs the count with Python and shell pipes alone. Next, the same scripts are run through Hadoop.

Running the Python scripts with MapReduce

Locate the hadoop-streaming jar

find ./ -name "hadoop-streaming*.jar"  
./local/hadoop/share/hadoop/tools/sources/hadoop-streaming-2.7.3-test-sources.jar
./local/hadoop/share/hadoop/tools/sources/hadoop-streaming-2.7.3-sources.jar
./local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar
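Of the three matches, only the jar under tools/lib is the runnable one; the other two are source archives. A rough Python equivalent of the search that filters those out might look like this. The function name find_streaming_jars is hypothetical, and the ~/local/hadoop root is an assumption taken from the path used in the run command in this article.

```python
from pathlib import Path


def find_streaming_jars(root):
    """Return hadoop-streaming jars under root, excluding *sources.jar archives."""
    jars = Path(root).rglob("hadoop-streaming*.jar")
    return sorted(p for p in jars if not p.name.endswith("sources.jar"))


if __name__ == "__main__":
    # Assumed install location; adjust to wherever Hadoop is unpacked.
    for jar in find_streaming_jars(Path.home() / "local" / "hadoop"):
        print(jar)
```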

Create the input directory input on HDFS

hadoop fs -mkdir input

Put the files to be processed into HDFS

hadoop fs -put allfiles input

Run the job

hadoop jar ~/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -input input -output output -mapper ./map.py -reducer ./reduce.py
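When the job is launched from a script rather than typed by hand, the same command can be assembled as an argument list. The sketch below only builds and prints the command. The function name build_streaming_command is hypothetical, and it uses only the options that appear in the command above.

```python
import shlex


def build_streaming_command(jar, input_dir, output_dir, mapper, reducer):
    """Build the argument list for a Hadoop Streaming job (does not run it)."""
    return [
        "hadoop", "jar", jar,
        "-input", input_dir,
        "-output", output_dir,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        jar="hadoop-streaming-2.7.3.jar",  # replace with the full path found earlier
        input_dir="input",
        output_dir="output",
        mapper="./map.py",
        reducer="./reduce.py",
    )
    # Print the command for inspection instead of executing it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be passed to subprocess.run(cmd, check=True) to start the job. Note that subprocess does not expand ~ in paths, and that Hadoop refuses to start a job whose output directory already exists.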

Output after processing

Bye     1
Goodbye 1
Hadoop  2
Hello   2
World   2

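In the Python code, each print in map.py feeds a line into the Hadoop stream, and each print in reduce.py writes a line of reduced data from the Hadoop stream out to HDFS. Between the two, Hadoop sorts the mapper output by key, which is the role sort played in the local test.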
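That flow can be imitated in a single process to check the logic before submitting a job. The sketch below is a stand-in for the map, sort, reduce sequence, not how Hadoop runs it internally. The function names are hypothetical, and the two input lines are an assumed sample, since the article does not show the contents of its input files.

```python
def mapper(lines):
    """Emit "word<TAB>1" for every whitespace-separated word, like map.py."""
    for line in lines:
        for word in line.split():
            yield "{}\t{}".format(word, 1)


def reducer(sorted_lines):
    """Sum the counts of consecutive identical words, like reduce.py."""
    current, total = None, 0
    for line in sorted_lines:
        word, _, count = line.partition("\t")
        if word != current:
            if current is not None:
                yield "{}\t{}".format(current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "{}\t{}".format(current, total)


def run_locally(lines):
    """Map, sort the mapper output by key, then reduce."""
    mapped = sorted(mapper(lines), key=lambda rec: rec.split("\t", 1)[0])
    return list(reducer(mapped))


if __name__ == "__main__":
    # Assumed sample input, one string per input line.
    for out in run_locally(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]):
        print(out)
```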
