Hadoop programs can be written in C++ and run in pipes mode, for example: hadoop pipes -conf job_config.xml -input input/myfile.txt -output output -program bin/wordcount
There is also a streaming mode; a sample invocation is sketched below.
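I have not used it myself, but roughly speaking Hadoop Streaming lets any executables that read stdin and write stdout act as mapper and reducer; a typical invocation looks like the following (the streaming jar path and name vary by Hadoop version, so treat them as placeholders):
hadoop jar contrib/streaming/hadoop-streaming.jar -input input/myfile.txt -output output -mapper /bin/cat -reducer /usr/bin/wc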
Java programs are packaged as a jar and run with the hadoop jar command, as in "hadoop jar program.jar mainclass arguments".
A more detailed write-up, quoted from the web:
HCE, short for Hadoop C++ Extension.
It is said to improve efficiency by more than 20% over stock Hadoop; I plan to test this with an inverted-index job in a few days, tentatively on 3 nodes, each with a 16-core CPU.
It took about a day and a half to learn how to deploy Hadoop and HCE; I got a pseudo-distributed HCE running on CentOS 5.4, submitted a wordcount MapReduce program I compiled myself, and obtained correct results.
The setup process and the problems encountered:
After downloading the HCE source, the build failed with the following errors:
1. Redundant name qualification: HCE::Compressor. Fix: remove the extra HCE:: qualifier in the code.
Location: src/c++/hce/impl/Compressor
2. Undefined symbol: htons. Fix: change the included headers; do not use the system-specific headers under linux/, i.e. comment out
#include <linux/in.h>
#include <linux/in6.h>
and add #include <netinet/in.h> instead.
At link time you may hit an error that -lncurses cannot be found.
This requires installing ncurses-devel; on CentOS it can be installed with yum, as shown below.
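On CentOS this is a one-liner (run as root):
yum install ncurses-devel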
A successful build produces the output files under the build directory.
Then comes the configuration and run stage:
Configure core-site.xml, mapred-site.xml and hdfs-site.xml under conf/.
This mainly means setting the IP addresses and ports of the individual services; each Hadoop daemon will listen on the address configured here. A minimal example is sketched below.
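As a minimal sketch for a pseudo-distributed setup (the localhost addresses and ports below are common defaults, not values from the original post), the essential properties are:
core-site.xml:   <property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
mapred-site.xml: <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
hdfs-site.xml:   <property><name>dfs.replication</name><value>1</value></property>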
At run time it is common for one daemon or another to fail to start, and the possible causes are numerous, so a tedious but reliable approach is recommended: start the services one by one, in order.
First format HDFS: bin/hadoop namenode -format
Then start the daemons in order. Hadoop has four main daemons: namenode, datanode, jobtracker and tasktracker.
Start them in this order:
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemon.sh start tasktracker
You can watch the logs under logs/ as each daemon starts to confirm it came up successfully; a quick process check is sketched below.
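A sanity check not mentioned in the original notes is the JDK's jps tool, which lists the running Java processes; after a successful start it should show all four daemons:
${JAVA_HOME}/bin/jps
# expected: NameNode, DataNode, JobTracker, TaskTracker (plus Jps itself)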
Once everything is up, use the bin/hadoop fs commands to create the input/output directories and upload the input file to HDFS, for example as shown below.
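For example (directory names chosen to match the job submission further below; note that the job's output directory must not already exist when the job is submitted):
bin/hadoop fs -mkdir /input
bin/hadoop fs -put test /input/test
bin/hadoop fs -ls /input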
Then it is time to write our C++ wordcount MapReduce program. The code is as follows:
#include "hadoop/Hce.hh"
class WordCountMap: public HCE::Mapper {
public:
HCE::TaskContext::Counter* inputWords;
int64_t setup() {
inputWords = getContext()->getCounter("WordCount",
"Input Words");
return 0;
}
int64_t map(HCE::MapInput &input) {
int64_t size = 0;
const void* value = input.value(size);
if ((size > 0) && (NULL != value)) {
char* text = (char*)value;
const int n = (int)size;
for (int i = 0; i < n;) {
// Skip past leading whitespace
while ((i < n) && isspace(text[i])) i++;
// Find word end
int start = i;
while ((i < n) && !isspace(text[i])) i++;
if (start < i) {
emit(text + start, i-start, "1", 1);
getContext()->incrementCounter(inputWords, 1);
}
}
}
return 0;
}
int64_t cleanup() {
return 0;
}
};
const int INT64_MAXLEN = 25;
int64_t toInt64(const char *val) {
int64_t result = 0;  // default to 0 if the value cannot be parsed
char trash;
sscanf(val, "%ld%c", &result, &trash);
return result;
}
class WordCountReduce: public HCE::Reducer {
public:
HCE::TaskContext::Counter* outputWords;
int64_t setup() {
outputWords = getContext()->getCounter("WordCount",
"Output Words");
return 0;
}
int64_t reduce(HCE::ReduceInput &input) {
int64_t keyLength;
const void* key = input.key(keyLength);
int64_t sum = 0;
while (input.nextValue()) {
int64_t valueLength;
const void* value = input.value(valueLength);
sum += toInt64((const char*)value);
}
char str[INT64_MAXLEN];
int str_len = snprintf(str, INT64_MAXLEN, "%ld", sum);
getContext()->incrementCounter(outputWords, 1);
emit(key, keyLength, str, str_len);
return 0;
}
int64_t cleanup() {
return 0;
}
};
int main(int argc, char *argv[]) {
return HCE::runTask(
//TemplateFactory sequence is Mapper, Reducer,
// Partitioner, Combiner, Committer,
// RecordReader, RecordWriter
HCE::TemplateFactory<WordCountMap, WordCountReduce,
void, void, void, void, void>()
);
}
The Makefile is as follows:
HADOOP_HOME = ../hadoop-0.20.3/build
JAVA_HOME = ../java6
CXX = g++
RM = rm -f
INCLUDEDIR = -I${HADOOP_HOME}/c++/Linux-amd64-64/include
LIBDIR = -L${HADOOP_HOME}/c++/Linux-amd64-64/lib \
-L${JAVA_HOME}/jre/lib/amd64/server
CXXFLAGS = ${INCLUDEDIR} -g -Wextra -Werror \
-Wno-unused-parameter -Wformat \
-Wconversion -Wdeprecated
LDLIBS = ${LIBDIR} -lhce -lhdfs -ljvm
all : wordcount-demo
wordcount-demo : wordcount-demo.o
$(CXX) -o $@ $^ $(LDLIBS) $(CXXFLAGS)
clean:
$(RM) *.o wordcount-demo
Once the build succeeds, the HCE job can be submitted:
bin/hadoop hce -input /input/test -output /output/out1 -program wordcount-demo -file wordcount-demo -numReduceTasks 1
The input file /input/test used here has the following content:
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
After submitting the job you may hit the error: job not successful.
The logs contain the following error messages:
stderr logs:
..........
HCE_FATAL 08-10 12:13:51 [/home/shengeng/hce/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/MapRed/Hce.cc][176][runTask] error when parsing UgiInfo at /home/shengeng/hce/hadoop_hce_v1/hadoop-0.20.3/src/c++/hce/impl/MapRed/HadoopCommitter.cc:247 in virtual bool HCE::HadoopCommitter::needsTaskCommit()
syslog logs:
.......................
2011-08-10 12:13:51,450 ERROR org.apache.hadoop.mapred.hce.BinaryProtocol: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at org.apache.hadoop.mapred.hce.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:112)
2011-08-10 12:13:51,450 ERROR org.apache.hadoop.mapred.hce.Application: Aborting because of java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at org.apache.hadoop.mapred.hce.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:112)
2011-08-10 12:13:51,450 INFO org.apache.hadoop.mapred.hce.BinaryProtocol: Sent abort command
2011-08-10 12:13:51,496 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: hce child exception
at org.apache.hadoop.mapred.hce.Application.abort(Application.java:325)
at org.apache.hadoop.mapred.hce.HceMapRunner.run(HceMapRunner.java:87)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:369)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at org.apache.hadoop.mapred.hce.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:112)
2011-08-10 12:13:51,500 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
The log points to the following code:
In HadoopCommitter.cc:
bool HadoopCommitter::needsTaskCommit()
string ugiInfo = taskContext->getJobConf()->get("hadoop.job.ugi"); // looks up the hadoop.job.ugi config key, which is missing from the default HCE configuration
words = HadoopUtils::splitString(ugiInfo, ",");
HADOOP_ASSERT(words.size() ==2, "error when parsing UgiInfo"); // so the assertion fails and the exception is thrown here
Add the following property to hdfs-site.xml:
<property>
  <name>hadoop.job.ugi</name>
  <value>hadoop,supergroup</value>
</property>
Looking at the code further, one can infer that this configuration item does not otherwise take effect in HCE: needsTaskCommit() merely reads it, but its value is not actually used beyond the parsing check.
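With that property added, the job ran to completion and produced the expected counts. To inspect the result (assuming the output directory from the submission above and the usual part-00000 file name that Hadoop's default output format writes for a single reduce task), something like this works:
bin/hadoop fs -cat /output/out1/part-00000
Given the whitespace tokenization in the mapper, each of the nine distinct tokens (The, quick, brown, fox, jumps, over, the, lazy, dog.) should show a count of 10. If you re-run the job, point -output at a directory that does not exist yet.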