[hadoop] Hadoop Streaming Notes 1

How Streaming Works

Mostly my own understanding of the official documentation.

Main references:
Apache Hadoop documentation
Hadoop: The Definitive Guide (Chinese edition: Hadoop權威指南)

Mapper

When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized.

When the mapper is initialized, it launches a separate process for the executable.

As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper.

When the mapper task runs, the mapper's input is converted into lines
(for example, a text-file input becomes the lines of that file;
for a directory input, the files under it are read and split into lines).
These lines are then fed to the stdin of the executable's process; meanwhile, the mapper converts each line of the process's stdout into a key/value pair, which becomes the mapper's output.

【inputs】--converted into-->【lines】-->【stdin of process】-->【stdout of process】--(collected by mapper)-->【lines】--(converted by mapper)-->【key/value pairs】(= output of mapper)
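This contract is easiest to see in a concrete script. Below is a minimal sketch of a streaming mapper (a hypothetical wordcount_mapper.py, not taken from the docs): it reads lines from stdin and prints one tab-separated key/value pair per word to stdout, which the framework then collects as the mapper's output.

#!/usr/bin/env python3
# wordcount_mapper.py: hypothetical example, not part of the Hadoop docs.
# The framework feeds input lines to stdin; every line printed to stdout
# is parsed back into a key/value pair (split on the first tab).
import sys

for line in sys.stdin:
    for word in line.split():
        # key = the word, value = 1, separated by a tab
        print(f"{word}\t1")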

By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value. If there is no tab character in the line, then the entire line is considered as the key and the value is null. However, this can be customized, as discussed later.

By default, each line of the process's stdout is split at the first \t: the left part becomes the key and the rest becomes the value.
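The split rule can be mimicked in a few lines of Python (an illustration of the rule only, not the framework's actual code):

def parse_streaming_line(line):
    # Split a stdout line the way streaming does by default.
    line = line.rstrip("\n")
    if "\t" in line:
        key, value = line.split("\t", 1)  # split on the FIRST tab only
        return key, value
    return line, None  # no tab: the whole line is the key, the value is null

print(parse_streaming_line("apple\t3\tred"))  # ('apple', '3\tred')
print(parse_streaming_line("banana"))         # ('banana', None)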

Reducer

When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized.
Like the mapper, the reducer task launches a separate process for the executable, if one is specified.

As the reducer task runs, it converts its input key/values pairs into lines and feeds the lines to the stdin of the process.

In the meantime, the reducer collects the line-oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer.

【key/value pairs】--converted by reducer-->【lines】-->【stdin of process】-->【stdout of process】--(collected by reducer)-->【lines】--(converted by reducer)-->【key/value pairs】(= output of reducer)

By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.
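One point worth spelling out: unlike the Java API, a streaming reducer does not receive a key together with an iterator of its values. It receives one line per key/value pair, sorted by key, and has to detect key boundaries itself. A minimal sketch (a hypothetical wordcount_reducer.py pairing with the mapper sketch above):

#!/usr/bin/env python3
# wordcount_reducer.py: hypothetical example.
# Input lines arrive sorted by key, one "word\tcount" pair per line;
# we sum the counts over each run of identical keys.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

With both ends of the pipe in place, the docs' first example simply pipes the input through cat and counts it with wc: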

# With the default input format, the lines fed to the mapper are the
# lines of the files under myInputDirs.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc
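Here /bin/cat echoes every input line unchanged, and /bin/wc counts the lines, words, and bytes arriving on its stdin, so each reducer emits a single summary line for the pairs routed to it.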

Packaging Files With Job Submissions

You can specify any executable as the mapper and/or the reducer. The executables do not need to pre-exist on the machines in the cluster; however, if they don't, you will need to use the "-file" option to tell the framework to pack your executable files as a part of the job submission. For example:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py

The above example specifies a user-defined Python executable as the mapper. The option "-file myPythonScript.py" causes the Python executable to be shipped to the cluster machines as a part of the job submission.

The -file option ships the file to the machines in the cluster; it is cleaned up after the job finishes.
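The docs do not show myPythonScript.py itself; a minimal identity mapper under the same contract might look like the sketch below. Note the shebang line: the shipped file is executed directly on the task nodes, so it needs an interpreter line and execute permission (chmod +x), or else the mapper must be given as "python myPythonScript.py".

#!/usr/bin/env python3
# myPythonScript.py: hypothetical identity mapper.
# Shipped to each node by "-file myPythonScript.py" and executed directly,
# hence the shebang above (and execute permission set before submission).
import sys

for line in sys.stdin:
    sys.stdout.write(line)  # pass each input line through unchanged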

In addition to executable files, you can also package other auxiliary files (such as dictionaries, configuration files, etc) that may be used by the mapper and/or the reducer. For example:

# -file can also ship other files, such as a dictionary
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py \
    -file myDictionary.txt
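Files shipped this way land in the task's working directory, so a script can open them by bare name. A sketch of a mapper that uses the dictionary (hypothetical; it assumes myDictionary.txt holds one allowed word per line):

#!/usr/bin/env python3
# dict_filter_mapper.py: hypothetical example.
# myDictionary.txt was shipped with -file, so it sits in the task's
# current working directory and can be opened by its bare name.
import sys

with open("myDictionary.txt") as f:
    allowed = set(word.strip() for word in f)

for line in sys.stdin:
    for word in line.split():
        if word in allowed:  # emit only words found in the dictionary
            print(f"{word}\t1")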
