Writing Hadoop MapReduce Programs in PHP

Date: 2019-11-06

Hadoop Streaming

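Although Hadoop itself is written in Java, it ships with Hadoop Streaming, an API that lets you write the map and reduce functions in any language.
The key point of Hadoop Streaming is that it uses UNIX standard streams as the interface between your program and Hadoop. Any program that can read data from standard input and write data to standard output can therefore serve as the map or reduce function of a MapReduce job.
For example: bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4
The jar used by Hadoop Streaming is hadoop-streaming-0.20.203.0.jar. There is no hadoop-streaming.jar in the Hadoop root directory: streaming is a contrib module, so the jar lives under contrib, as it also does in hadoop-0.20.2. The options are:
-input: path of the input files on HDFS
-output: path of the output directory on HDFS
-mapper: the program that implements the map function
-reducer: the program that implements the reduce function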
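
The stream contract described above can be illustrated with a small sketch. The function name run_line_filter is hypothetical, and in-memory streams stand in for standard input and output so the example runs on its own; it assumes PHP 7.1 or later.

```php
<?php
// Minimal sketch of a streaming-style filter: read lines from one stream,
// write transformed lines to another. The function name is hypothetical.
function run_line_filter($in, $out, callable $transform): int
{
    $lines = 0;
    while (($line = fgets($in)) !== false) {
        // Strip the line ending, transform the text, write it back out.
        fwrite($out, $transform(rtrim($line, "\r\n")) . PHP_EOL);
        $lines++;
    }
    return $lines;
}

// Demo on in-memory streams; a real mapper would pass STDIN and STDOUT.
$in = fopen('php://memory', 'w+');
fwrite($in, "hello world\nhello hadoop\n");
rewind($in);
$out = fopen('php://memory', 'w+');

$processed = run_line_filter($in, $out, 'strtoupper');
rewind($out);
$result = stream_get_contents($out);
echo $result;
```

Because the function only ever sees two stream handles, the same logic works whether the data comes from a terminal, a pipe, or a Hadoop task.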

The mapper

Create mapper.php with the following code:

#!/usr/local/php/bin/php
<?php
$word2count = array();
// input comes from STDIN (standard input)
// alternatively, open it yourself: $stdin = fopen('php://stdin', 'r');
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words while removing any empty string
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters, initializing a word the first time it is seen
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}
// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}
?>

This code extracts the words from every line of input, counts them, and prints each word with its count, separated by a tab, in the form:

hello 1
world 1
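
To see the tokenization step in isolation, here is a small sketch. The helper name tokenize_line and the sample sentence are made up for illustration; the splitting logic mirrors the preg_split call in mapper.php, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: the tokenization step of mapper.php in isolation.
function tokenize_line(string $line): array
{
    // Lowercase, trim, then split on any non-word character;
    // PREG_SPLIT_NO_EMPTY drops the empty strings between adjacent separators.
    return preg_split('/\W/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
}

// Count the words of one sample line, as the mapper does for every line.
$counts = [];
foreach (tokenize_line("Hello, world! HELLO again.") as $word) {
    $counts[$word] = ($counts[$word] ?? 0) + 1;
}

// Emit one "word<TAB>count" line per distinct word.
foreach ($counts as $word => $count) {
    echo $word, "\t", $count, PHP_EOL;
}
```

Note that \W treats every character outside [A-Za-z0-9_] as a separator, so punctuation disappears and words containing non-ASCII letters are split apart.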

This is not much different from the PHP you have written before, but two things may look slightly unfamiliar:

PHP as an executable program

The first line,

#!/usr/local/php/bin/php

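tells Linux to use /usr/local/php/bin/php as the interpreter for the code that follows. Anyone who has written Linux shell scripts will recognize this: scripts routinely begin with a line such as #!/bin/bash or #!/usr/bin/python.
With this line in place, and once the file has been saved and made executable (chmod +x mapper.php), mapper.php can be run directly like cat or grep: ./mapper.php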

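Receiving input via stdin

PHP supports several ways of passing in parameters. The most familiar is reading values sent over the web from the $_GET and $_POST superglobals; next comes reading command-line arguments from $_SERVER['argv']. Here we use standard input, stdin.

It behaves like this:

Type ./mapper.php at the Linux console.

mapper.php starts, and the console waits for keyboard input.

The user types some text.

The user presses Ctrl + D to end the input; mapper.php then finishes its work and prints the result.

So where is stdout? print (like echo) already writes to stdout, exactly as in the web programs and CLI scripts we have written before.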
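
As a small sketch of the stream handles involved: the helper name open_stdin is hypothetical. It uses the STDIN constant when PHP defines it and otherwise falls back to the php://stdin wrapper, the alternative mentioned in the mapper's comments. The message written to stderr is only an illustration of keeping diagnostics out of the data on stdout.

```php
<?php
// Hypothetical helper: return a readable handle for standard input.
function open_stdin()
{
    // The STDIN constant is provided by the CLI; fall back to the
    // php://stdin wrapper where the constant is not defined.
    return defined('STDIN') ? STDIN : fopen('php://stdin', 'r');
}

$stdin = open_stdin();

// Diagnostics go to stderr so they never mix with the
// tab-delimited data that the script prints on stdout.
file_put_contents('php://stderr', "reading from standard input" . PHP_EOL);
```

A script would then loop with fgets($stdin) exactly as mapper.php loops with fgets(STDIN).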

The reducer

Create reducer.php with the following code:

#!/usr/local/php/bin/php
<?php
$word2count = array();
// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // parse the input we got from mapper.php; skip lines without a tab
    $parts = explode(chr(9), $line, 2);
    if (count($parts) !== 2) {
        continue;
    }
    list($word, $count) = $parts;
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts, initializing a word the first time it is seen
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}
// sort the words lexicographically
//
// this step is NOT required, we just do it so that our
// final output will look more like the official Hadoop
// word count examples
ksort($word2count);
// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}
?>

This code sums up how many times each word occurs and prints the totals in the form:

hello 2
world 1
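
The reducer above keeps every word in memory until the input ends. Hadoop sorts the map output by key before handing it to the reducer, so an alternative is to sum one key at a time and emit each total as soon as the key changes. The sketch below is that alternative, not the article's reducer; the function name reduce_sorted is hypothetical and the code assumes PHP 7.1 or later.

```php
<?php
// Sketch of a reducer that relies on its input being sorted by key.
// The function name is hypothetical.
function reduce_sorted($in, $out): void
{
    $current = null; // key currently being summed
    $total = 0;

    while (($line = fgets($in)) !== false) {
        $parts = explode("\t", rtrim($line, "\r\n"), 2);
        if (count($parts) !== 2) {
            continue; // skip lines without a tab
        }
        [$word, $count] = $parts;

        if ($word !== $current) {
            // Key changed: emit the finished total and start a new one.
            if ($current !== null) {
                fwrite($out, $current . "\t" . $total . PHP_EOL);
            }
            $current = $word;
            $total = 0;
        }
        $total += (int) $count;
    }

    if ($current !== null) {
        fwrite($out, $current . "\t" . $total . PHP_EOL); // flush the last key
    }
}

// Demo on in-memory streams with already sorted input.
$in = fopen('php://memory', 'w+');
fwrite($in, "hello\t1\nhello\t1\nworld\t1\n");
rewind($in);
$out = fopen('php://memory', 'w+');
reduce_sorted($in, $out);
rewind($out);
$reduced = stream_get_contents($out);
echo $reduced;
```

This version uses constant memory regardless of how many distinct words there are, at the cost of producing wrong totals if the input is not sorted.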

Running it with Hadoop

Put the input file into Hadoop's DFS:
bin/hadoop dfs -put test.log test
Run the PHP programs over this text as a Streaming job:

bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out

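Notes:

1) The input and output directories are paths on HDFS.

2) The mapper and reducer are paths on the local machine. Always use absolute paths, not relative ones, so that Hadoop does not fail with an error saying it cannot find the MapReduce program.

3) mapper.php and reducer.php must be copied to the same path on every DataNode server, and PHP must be installed on every server at the same installation path.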
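
Before submitting a job it can help to check the logic locally. A common approach is a shell pipeline such as cat test.log | ./mapper.php | sort | ./reducer.php. The sketch below simulates the same three phases inside one PHP script; map_lines and reduce_lines are hypothetical stand-ins for the two scripts, the sample lines are made up, and the code assumes PHP 7.1 or later.

```php
<?php
// Hypothetical stand-in for mapper.php: emit "word<TAB>1" for every word.
function map_lines(array $lines): array
{
    $out = [];
    foreach ($lines as $line) {
        $words = preg_split('/\W/', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $out[] = $word . "\t1";
        }
    }
    return $out;
}

// Hypothetical stand-in for reducer.php: sum the counts per word.
function reduce_lines(array $lines): array
{
    $totals = [];
    foreach ($lines as $line) {
        [$word, $count] = explode("\t", $line, 2);
        $totals[$word] = ($totals[$word] ?? 0) + (int) $count;
    }
    $out = [];
    foreach ($totals as $word => $total) {
        $out[] = $word . "\t" . $total;
    }
    return $out;
}

$mapped = map_lines(["hello world", "hello hadoop"]);
sort($mapped, SORT_STRING); // stands in for the sort between the two phases
$final = reduce_lines($mapped);
echo implode(PHP_EOL, $final), PHP_EOL;
```

If the local result looks right, the remaining failure modes on the cluster are mostly the deployment issues listed in the notes above: paths, permissions, and the PHP installation on each node.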