Writing Hadoop MapReduce Programs in PHP

Date: 2019-11-06


Hadoop Streaming

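Although Hadoop is written in Java, it provides Hadoop Streaming, an API that lets users write the map and reduce functions in any language.
The key to Hadoop Streaming is that it uses standard UNIX streams as the interface between your program and Hadoop. Any program that can read data from standard input and write data to standard output can therefore serve as the map or reduce function of a MapReduce job, whatever language it is written in.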
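For example: bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4
The jar used by Hadoop Streaming here is hadoop-streaming-0.20.203.0.jar. There is no hadoop-streaming.jar in the Hadoop root directory because streaming is a contrib module, so you have to look for it under contrib, as the command above does. The same applies to other releases such as hadoop-0.20.2.
The options used in the command are:
-input: path of the input files on HDFS
-output: path of the output directory on HDFS
-mapper: the program to use as the map function
-reducer: the program to use as the reduce function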
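
The data flow that Hadoop Streaming sets up can be approximated on a single machine with an ordinary Unix pipeline, which is a convenient way to try out a mapper and reducer before submitting a job. The following is only a minimal sketch, not what Hadoop runs internally: the `mapper`, `reducer` and `run_local` shell functions are hypothetical stand-ins for mapper.php and reducer.php, and `sort` stands in for the shuffle/sort phase that brings identical keys together between the two steps.

```shell
#!/bin/sh
# Hypothetical stand-in for a streaming mapper: reads lines from stdin
# and writes one "word<TAB>1" pair per word to stdout.
mapper() {
    tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | awk 'NF { printf "%s\t1\n", $0 }'
}

# Hypothetical stand-in for a streaming reducer: sums the counts per word.
# It relies on its input being sorted so that equal keys are adjacent.
reducer() {
    awk -F '\t' '
        $1 != prev { if (NR > 1) printf "%s\t%d\n", prev, sum; prev = $1; sum = 0 }
        { sum += $2 }
        END { if (NR > 0) printf "%s\t%d\n", prev, sum }
    '
}

# sort plays the role of the shuffle/sort phase between map and reduce.
run_local() {
    mapper | sort | reducer
}

# Prints three tab-separated lines: hadoop 1, hello 2, world 1
printf 'Hello world\nhello Hadoop\n' | run_local
```

To try real scripts the same way, the functions can be replaced by the programs themselves, for instance `cat input.txt | ./mapper.php | sort | ./reducer.php`, where input.txt is an assumed local sample file.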

The mapper function

Create the file mapper.php with the following code:

[php]

#!/usr/local/php/bin/php
<?php
$word2count = array();
// input comes from STDIN (standard input)
// You could also open it explicitly: $stdin = fopen("php://stdin", "r");
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words while removing any empty string
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters
    foreach ($words as $word) {
        // initialize on first sight so PHP does not print an
        // undefined-index notice into the mapper output
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}
// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}
?>

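Roughly speaking, this code finds the words in each line of the input text and prints them in the following form, one word and its count per line, separated by a tab character:

hello 1
world 1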

This is hardly different from the PHP you have written before, right? Two things may look slightly unfamiliar:

PHP as an executable program

The first line,

[php]

#!/usr/local/php/bin/php

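tells Linux to use the program /usr/local/php/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognise this convention: scripts usually begin with a line such as #!/bin/bash or #!/usr/bin/python.
With this line in place, and once the saved file has been given execute permission (chmod +x mapper.php), you can run mapper.php directly as a command, just like cat or grep: ./mapper.php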
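
The interpreter line only takes effect when the file is executable. The steps can be sketched as follows, using a hypothetical throwaway script demo.sh with a `#!/bin/sh` first line so that the example also runs on a machine without PHP. The same `chmod` and `./` invocation apply to mapper.php.

```shell
#!/bin/sh
# Work in a temporary directory so nothing is left behind.
dir=$(mktemp -d)

# demo.sh is a hypothetical stand-in for mapper.php: the first line names
# the interpreter, the rest is the script body.
cat > "$dir/demo.sh" <<'EOF'
#!/bin/sh
# Upper-case whatever arrives on standard input.
tr '[:lower:]' '[:upper:]'
EOF

# Without the execute bit the file cannot be run directly as a command.
chmod +x "$dir/demo.sh"

# Now the script can sit in a pipeline like cat or grep.
result=$(echo "hello world" | "$dir/demo.sh")
echo "$result"

rm -rf "$dir"
```

For the mapper in this article, the equivalent would be `chmod +x mapper.php` followed by `echo "hello world" | ./mapper.php`.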