Before installing Hadoop 2.1.0 on Ubuntu, you first need to install the following programs:
|- JDK 1.6 or later
|- SSH (Secure Shell)
Why these two programs are needed:
1. Hadoop is developed in Java, so compiling Hadoop and running MapReduce both require the JDK.
2. Hadoop uses SSH to start the daemons on the hosts in the slaves list, so SSH must also be installed, even for a pseudo-distributed installation (Hadoop makes no distinction between cluster and pseudo-distributed modes). In pseudo-distributed mode Hadoop handles things exactly as in a cluster, starting the processes on the hosts listed in conf/slaves one after another; the only difference is that the slave is localhost (the machine itself). So SSH is required for pseudo-distributed Hadoop as well.
|- Maven 3.0 or later
Install from the command line:
1. apt-get -y install maven build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev
On Linux these tools are needed to build the native libraries.
2. Confirm that the installation succeeded:
mvn -version
What Maven does and how it works:
As a project grows large and complex, it is best to use a build tool to automate the build. A typical Java project, for example, goes through the same steps on every build: compile the Java sources, package the class files into a .jar, generate the Javadocs, and so on. A build tool can carry out all of these steps for you. The best-known build tool is make, but make is tied to a specific operating system. The Java world chose Ant instead, a cross-platform tool that replaces the awkward Makefile syntax with XML.
Maven, a build tool from the Apache Software Foundation, is an even stronger candidate. Besides an out-of-the-box, unified way of handling build-related tasks, Maven also provides project reporting, which helps a development team track a project's progress.
As a build tool, Maven works like Ant: compilation, packaging, testing and other tasks are driven by a build configuration file. Maven's built-in functionality covers almost any task, provided the corresponding configuration is in place; adapting an existing template is, of course, a good way to start a new project. With Ant, unless you write project-specific tasks, you run into the problem of reusing targets.
Maven improves on this. You describe the project in an XML configuration file and then draw on the many functions Maven ships with; you can also invoke any Ant task from within a Maven project.
Maven's built-in "goals" can, among other things (a few of the corresponding commands are sketched just after this list):
compile the source code
generate Javadoc documentation
run the unit tests
analyze the source code
produce a detailed report of violations of the team's coding conventions
produce a report of the latest CVS commits
produce reports of the most frequently changed files and the most active committers in CVS
produce cross-referenced HTML versions of the source code, and so on.
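As a rough sketch (the exact behaviour depends on the plugins configured in the project's pom.xml), the Maven 3 commands corresponding to the goals above look like this:
mvn compile                  # compile the source code
mvn javadoc:javadoc          # generate the Javadoc
mvn test                     # run the unit tests
mvn checkstyle:checkstyle    # report violations of the coding conventions
mvn site                     # build the project reports, including cross-referenced HTML sources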
|- ProtocolBuffer 2.5.0: Protocol Buffers are a way of encoding structured data in an efficient yet extensible format. Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats.
The latest Hadoop code already uses Protocol Buffers (PB, http://code.google.com/p/protobuf/) as the default RPC implementation; the old WritableRpcEngine has been retired. As Aaron T. Myers from Cloudera put it on the mailing list, "since PB can provide support for evolving protocols in a compatible fashion."
First, what is PB? PB is a lightweight, efficient structured-data serialization format open-sourced by Google. It serializes and deserializes structured data, which makes it well suited to data storage and to RPC wire formats. It is a language-neutral, platform-neutral, extensible format for serializing structured data in communication protocols, data storage and similar areas, and it currently ships with C++, Java and Python APIs. Put simply, one process sends structured data to another process over the network (the typical case being RPC), or a process persists structured data to disk (somewhat like the BSON format used in MongoDB). For the storage case, the drawback of PB compared with XML or JSON is that the data on disk cannot be read by a person until it is deserialized with PB, a little like an IDL. The advantages are fast serialization/deserialization and less data to move over the network or through disk I/O, which matters a great deal in data-intensive scalable computing.
Another reason Hadoop uses PB for its RPC implementation is PB's language and platform neutrality. On the mailing list the community has raised the following point: today every MapReduce task runs inside a JVM (even in Streaming mode, the task's data flows through the JVM when doing RPC with the NameNode or DataNode), and the JVM's most serious problem is memory, for example OOM. There has been discussion that with a PB-based RPC implementation, every MR task could talk to the NN or DN over RPC directly, so each MR task could be implemented in C/C++. Baidu's HCE (https://issues.apache.org/jira/browse/MAPREDUCE-1270) follows a similar idea, but because Hadoop RPC at the time was still implemented with WritableRpcEngine, MR tasks could not escape going through a local JVM proxy to talk to the NN or DN: the Child JVM process still existed and still set up the runtime environment and handled the RPC exchange.
For PB's principles and implementation, see http://code.google.com/p/protobuf/ or http://www.ibm.com/developerworks/cn/linux/l-cn-gpb/?ca=drs-tp4608; they are not covered further here.
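To make the serialization idea concrete, here is a minimal, hypothetical sketch (the message definition and file name are invented for illustration): you describe the structured data in a .proto file and let protoc generate the serialization code for the language you need.
cat > person.proto <<'EOF'
// proto2 syntax, as used by Protocol Buffers 2.x
message Person {
  required string name = 1;
  optional int32  id   = 2;
}
EOF
protoc --java_out=. person.proto    # --cpp_out and --python_out work the same way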
Installing the JDK
sudo mkdir -p /usr/local/java
sudo mv /home/john/Downloads/jdk-6u26-linux-x64.bin /usr/local/java
cd /usr/local/java
sudo chmod 700 jdk-6u26-linux-x64.bin
sudo ./jdk-6u26-linux-x64.bin
sudo rm jdk-6u26-linux-x64.bin
sudo ln -s jdk1.6.0_26 /usr/local/java/latest
Enter the command:
sudo gedit /etc/environment
Enter your password when prompted; the environment file opens.
Append the following at the end of the file:
JAVA_HOME="/usr/local/java/latest"
JRE_HOME="/usr/local/java/latest/jre"
PATH="/usr/local/java/latest/bin:\
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
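/etc/environment is only read at login, so log out and back in (or reboot) for the change to take effect. As a quick workaround for the current shell, you can export the same values by hand:
export JAVA_HOME=/usr/local/java/latest
export JRE_HOME=/usr/local/java/latest/jre
export PATH=/usr/local/java/latest/bin:$PATH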
Verify that the JDK installed successfully:
java -version
You should see output similar to:
java version "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) Server VM (build 14.0-b16, mixed mode)
Installing SSH, again using Ubuntu as the example; assume the user name is u.
1) Make sure the machine is connected to the Internet, then enter:
sudo apt-get install ssh
First, a word about sudo and apt. sudo lets an ordinary user run some or all of the commands that require root privileges; it keeps a detailed log of what each user did with it, and it offers flexible management, letting you restrict which commands a user may run. sudo's configuration file is /etc/sudoers.
apt, the Advanced Packaging Tool, is part of the Debian project and is Ubuntu's package management software. When you install software through apt you do not need to work out its dependencies yourself: apt downloads the packages the software depends on and installs them in order. Ubuntu also ships Synaptic, a graphical front end to apt, which you can use instead if you prefer. (See the material on the Debian project if you want to learn more.)
2) Configure passwordless login to the local machine.
First check whether the .ssh folder exists under user u (note the leading "."; it is a hidden folder):
ls -a /home/u
通常來講,安裝SSH時會自動在當前用戶下建立這個隱藏文件夾,若是沒有,能夠手動建立一個。
接下來,輸入命令:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
To explain: ssh-keygen generates a key pair; -t (note that options are case-sensitive) specifies the type of key to generate, here dsa, i.e. DSA key authentication; -P supplies the passphrase; -f specifies the file the key is written to. (Keys and passphrases touch on SSH internals that are not covered in detail here; consult other references if you are interested.)
In Ubuntu, ~ stands for the current user's home directory, here /home/u.
The command creates two files in the .ssh folder, id_dsa and id_dsa.pub, the SSH private key and public key, which work like a key and a lock. Next, append id_dsa.pub (the public key) to the authorized keys.
Enter:
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
This appends the public key to the file of keys used for authentication; here authorized_keys is that file.
Passwordless login to the local machine is now configured.
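If ssh localhost still prompts for a password afterwards, a common cause is file permissions: sshd ignores an authorized_keys file that is group- or world-writable. Tightening the permissions usually fixes it:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys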
3) Verify that SSH is installed and that passwordless login to the local machine works.
Enter:
ssh -version
The output looks like:
OpenSSH_6.2p2 Ubuntu-6ubuntu0.1, OpenSSL 1.0.1e 11 Feb 2013
Bad escape character 'rsion'.
The version banner shows that SSH is installed. (The "Bad escape character" line appears because the canonical flag is ssh -V rather than -version.)
Now enter:
ssh localhost
You will see something like:
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 8b:c3:51:a5:2a:31:b7:74:06:9d:62:04:4f:84:f8:77.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux master 2.6.31-14-generic #48-Ubuntu SMP Fri Oct 16 14:04:26 UTC 2009 i686
To access official Ubuntu documentation, please visit:
http://help.ubuntu.com/
Last login: Mon Oct 18 17:12:40 2010 from master
admin@Hadoop:~$
This means the installation succeeded. On the first login you are asked whether to continue connecting; type yes to proceed.
Strictly speaking, passwordless login is not required for installing Hadoop itself, but without it every Hadoop start-up would require typing a password to log in to each DataNode machine. Since Hadoop clusters routinely run to hundreds or thousands of machines, passwordless SSH is configured as a matter of course.
Installing protobuf
Download: http://code.google.com/p/protobuf/downloads/detail?name=protobuf-2.4.1.tar.gz&can=2&q= (note that the prerequisites above call for version 2.5.0; if the Hadoop build complains about the protobuf version, use the matching 2.5.0 tarball instead). Installation steps:
tar zxvf protobuf-2.4.1.tar.gz
cd protobuf-2.4.1
./configure
make
make check
make install
Check that the installation succeeded: protoc --version
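Installing into /usr/local normally needs root privileges (sudo make install). If protoc then fails with an error about a missing libprotobuf shared library, refreshing the dynamic linker cache usually resolves it:
sudo ldconfig
protoc --version    # should now print the libprotoc version you just built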
Compiling the Hadoop source
Getting the source:
Check it out from the Subversion repository:
[zhouhh@Hadoop48 hsrc]$ svn co http://svn.apache.org/repos/asf/hadoop/common/trunk
[zhouhh@Hadoop48 hsrc]$ cd trunk/
[zhouhh@Hadoop48 trunk]$ ls
BUILDING.txt hadoop-assemblies hadoop-common-project hadoop-hdfs-project hadoop-minicluster hadoop-project-dist hadoop-yarn-project
dev-support hadoop-client hadoop-dist hadoop-mapreduce-project hadoop-project hadoop-tools pom.xml
hadoop (Main Hadoop project)
– hadoop-project (Parent POM for all Hadoop Maven modules. )
(All plugins & dependencies versions are defined here.)
– hadoop-project-dist (Parent POM for modules that generate distributions.)
– hadoop-annotations (Generates the Hadoop doclet used to generate the Javadocs)
– hadoop-assemblies (Maven assemblies used by the different modules)
– hadoop-common-project (Hadoop Common)
– hadoop-hdfs-project (Hadoop HDFS)
– hadoop-mapreduce-project (Hadoop MapReduce)
– hadoop-tools (Hadoop tools like Streaming, Distcp, etc.)
– hadoop-dist (Hadoop distribution assembler)
Building the source:
[zhouhh@Hadoop48 trunk]$ mvn install -DskipTests -Pdist
…
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main ………………………….. SUCCESS [0.605s]
[INFO] Apache Hadoop Project POM ……………………. SUCCESS [0.558s]
[INFO] Apache Hadoop Annotations ……………………. SUCCESS [0.288s]
[INFO] Apache Hadoop Project Dist POM ……………….. SUCCESS [0.094s]
[INFO] Apache Hadoop Assemblies …………………….. SUCCESS [0.088s]
[INFO] Apache Hadoop Auth ………………………….. SUCCESS [0.152s]
[INFO] Apache Hadoop Auth Examples ………………….. SUCCESS [0.093s]
[INFO] Apache Hadoop Common ………………………… SUCCESS [5.188s]
[INFO] Apache Hadoop Common Project …………………. SUCCESS [0.049s]
[INFO] Apache Hadoop HDFS ………………………….. SUCCESS [12.065s]
[INFO] Apache Hadoop HttpFS ………………………… SUCCESS [0.194s]
[INFO] Apache Hadoop HDFS BookKeeper Journal …………. SUCCESS [0.616s]
[INFO] Apache Hadoop HDFS Project …………………… SUCCESS [0.029s]
[INFO] hadoop-yarn ………………………………… SUCCESS [0.157s]
[INFO] hadoop-yarn-api …………………………….. SUCCESS [2.951s]
[INFO] hadoop-yarn-common ………………………….. SUCCESS [0.752s]
[INFO] hadoop-yarn-server ………………………….. SUCCESS [0.124s]
[INFO] hadoop-yarn-server-common ……………………. SUCCESS [0.736s]
[INFO] hadoop-yarn-server-nodemanager ……………….. SUCCESS [0.592s]
[INFO] hadoop-yarn-server-web-proxy …………………. SUCCESS [0.123s]
[INFO] hadoop-yarn-server-resourcemanager ……………. SUCCESS [0.200s]
[INFO] hadoop-yarn-server-tests …………………….. SUCCESS [0.149s]
[INFO] hadoop-yarn-client ………………………….. SUCCESS [0.119s]
[INFO] hadoop-yarn-applications …………………….. SUCCESS [0.090s]
[INFO] hadoop-yarn-applications-distributedshell ……… SUCCESS [0.167s]
[INFO] hadoop-mapreduce-client ……………………… SUCCESS [0.049s]
[INFO] hadoop-mapreduce-client-core …………………. SUCCESS [1.103s]
[INFO] hadoop-yarn-applications-unmanaged-am-launcher …. SUCCESS [0.142s]
[INFO] hadoop-yarn-site ……………………………. SUCCESS [0.082s]
[INFO] hadoop-yarn-project …………………………. SUCCESS [0.075s]
[INFO] hadoop-mapreduce-client-common ……………….. SUCCESS [1.202s]
[INFO] hadoop-mapreduce-client-shuffle ………………. SUCCESS [0.066s]
[INFO] hadoop-mapreduce-client-app ………………….. SUCCESS [0.109s]
[INFO] hadoop-mapreduce-client-hs …………………… SUCCESS [0.123s]
[INFO] hadoop-mapreduce-client-jobclient …………….. SUCCESS [0.114s]
[INFO] hadoop-mapreduce-client-hs-plugins ……………. SUCCESS [0.084s]
[INFO] Apache Hadoop MapReduce Examples ……………… SUCCESS [0.130s]
[INFO] hadoop-mapreduce ……………………………. SUCCESS [0.060s]
[INFO] Apache Hadoop MapReduce Streaming …………….. SUCCESS [0.071s]
[INFO] Apache Hadoop Distributed Copy ……………….. SUCCESS [0.069s]
[INFO] Apache Hadoop Archives ………………………. SUCCESS [0.061s]
[INFO] Apache Hadoop Rumen …………………………. SUCCESS [0.135s]
[INFO] Apache Hadoop Gridmix ……………………….. SUCCESS [0.082s]
[INFO] Apache Hadoop Data Join ……………………… SUCCESS [0.070s]
[INFO] Apache Hadoop Extras ………………………… SUCCESS [0.192s]
[INFO] Apache Hadoop Pipes …………………………. SUCCESS [0.019s]
[INFO] Apache Hadoop Tools Dist …………………….. SUCCESS [0.057s]
[INFO] Apache Hadoop Tools …………………………. SUCCESS [0.018s]
[INFO] Apache Hadoop Distribution …………………… SUCCESS [0.047s]
[INFO] Apache Hadoop Client ………………………… SUCCESS [0.047s]
[INFO] Apache Hadoop Mini-Cluster …………………… SUCCESS [0.053s]
[INFO] ————————————————————————
[INFO] BUILD SUCCESS
[INFO] ————————————————————————
[INFO] Total time: 32.093s
[INFO] Finished at: Wed Dec 26 11:00:10 CST 2012
[INFO] Final Memory: 60M/76
Two errors came up during the build; the patches that fix them:
diff --git hadoop-project/pom.xml hadoop-project/pom.xml
index 3938532..31ee469 100644
--- hadoop-project/pom.xml
+++ hadoop-project/pom.xml
@@ -600,7 +600,7 @@
       <dependency>
         <groupId>com.google.protobuf</groupId>
         <artifactId>protobuf-java</artifactId>
-        <version>2.4.0a</version>
+        <version>2.5.0</version>
       </dependency>
       <dependency>
         <groupId>commons-daemon</groupId>

Index: hadoop-common-project/hadoop-auth/pom.xml
===================================================================
--- hadoop-common-project/hadoop-auth/pom.xml (revision 1543124)
+++ hadoop-common-project/hadoop-auth/pom.xml (working copy)
@@ -54,6 +54,11 @@
     </dependency>
     <dependency>
       <groupId>org.mortbay.jetty</groupId>
+      <artifactId>jetty-util</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.mortbay.jetty</groupId>
       <artifactId>jetty</artifactId>
       <scope>test</scope>
     </dependency>
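To apply the fixes, save each diff to a file (the file names below are placeholders) and apply them from the top of the source tree with -p0, since the paths in the diffs carry no a/ or b/ prefix; then rebuild:
patch -p0 < protobuf-version.patch     # bumps protobuf-java to 2.5.0 in hadoop-project/pom.xml
patch -p0 < hadoop-auth-jetty.patch    # adds the jetty-util test dependency to hadoop-auth
mvn install -DskipTests -Pdist
If you want a full distribution tarball including the native libraries, BUILDING.txt in the source tree lists the relevant profiles; a commonly used invocation is mvn package -Pdist,native -DskipTests -Dtar.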
Hadoop configuration
Environment variables:
$ export HADOOP_HOME=$HOME/yarn/hadoop-2.0.1-alpha
$ export HADOOP_MAPRED_HOME=$HOME/yarn/hadoop-2.0.1-alpha
$ export HADOOP_COMMON_HOME=$HOME/yarn/hadoop-2.0.1-alpha
$ export HADOOP_HDFS_HOME=$HOME/yarn/hadoop-2.0.1-alpha
$ export YARN_HOME=$HOME/yarn/hadoop-2.0.1-alpha
$ export HADOOP_CONF_DIR=$HOME/yarn/hadoop-2.0.1-alpha/etc/hadoop
This is important: if any one of these variables is missing or set incorrectly, the resulting errors are hard to track down and jobs will fail.
Also, add these to your ~/.bashrc or other shell start-up script so that you don’t need to set them every time.
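A sketch of persisting the variables in ~/.bashrc (same paths as the exports above; adjust them to wherever you actually unpacked Hadoop):
cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=$HOME/yarn/hadoop-2.0.1-alpha
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
EOF
source ~/.bashrc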
Configuration files:
1.core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>io.native.lib.available</name>
<value>true</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/gao/yarn/yarn_data/hdfs/namenode/</value>
</property>
</configuration>
2.hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>0.0.0.0:50070</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>0.0.0.0:50090</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50010</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>0.0.0.0:50020</value>
</property>
</configuration>
3. Format the NameNode
This step is needed only the first time; re-running it wipes the contents of HDFS.
$ bin/hadoop namenode -format
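In Hadoop 2.x the hadoop namenode command is deprecated in favour of the hdfs script; either form works here, but the non-deprecated equivalent is:
$ bin/hdfs namenode -format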
4. Start HDFS processes
Name node:
$ sbin/hadoop-daemon.sh start namenode
starting namenode, logging to /home/hduser/yarn/hadoop-2.0.1-alpha/logs/hadoop-hduser-namenode-pc3-laptop.out
$ jps
18509 Jps
17107 NameNode
Data node:
$ sbin/hadoop-daemon.sh start datanode
starting datanode, logging to /home/hduser/yarn/hadoop-2.0.1-alpha/logs/hadoop-hduser-datanode-pc3-laptop.out
$ jps
18509 Jps
17107 NameNode
17170 DataNode
5. Start Hadoop Map-Reduce Processes
Resource Manager:
$ sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /home/hduser/yarn/hadoop-2.0.1-alpha/logs/yarn-hduser-resourcemanager-pc3-laptop.out
$ jps
18509 Jps
17107 NameNode
17170 DataNode
17252 ResourceManager
Node Manager:
$ sbin/yarn-daemon.sh start nodemanager
starting nodemanager, logging to /home/hduser/yarn/hadoop-2.0.1-alpha/logs/yarn-hduser-nodemanager-pc3-laptop.out
$ jps
18509 Jps
17107 NameNode
17170 DataNode
17252 ResourceManager
17309 NodeManager
Job History Server:
$ sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/hduser/yarn/hadoop-2.0.1-alpha/logs/yarn-hduser-historyserver-pc3-laptop.out
$ jps
18509 Jps
17107 NameNode
17170 DataNode
17252 ResourceManager
17309 NodeManager
17626 JobHistoryServer
6. Running the famous wordcount example to verify installation
$ mkdir in
$ cat > in/file
This is one line
This is another one
Add this directory to HDFS:
$ bin/hadoop dfs -copyFromLocal in /in
Run the provided wordcount example:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.*-alpha.jar wordcount /in /out
Check the output:
$ bin/hadoop dfs -cat /out/*
This 2
another 1
is 2
line 1
one 2
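The /out directory also contains a _SUCCESS marker and the reducer output files (part-r-00000 and so on); you can list them with the same dfs command (in 2.x, bin/hdfs dfs is the non-deprecated form of bin/hadoop dfs):
$ bin/hadoop dfs -ls /out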
7. Web interface
Browse HDFS and check its health at http://localhost:50070 in the browser.
You can check the status of running applications through the ResourceManager web UI, which listens on http://localhost:8088 by default.
8. Stop the processes
$ sbin/hadoop-daemon.sh stop namenode
$ sbin/hadoop-daemon.sh stop datanode
$ sbin/yarn-daemon.sh stop resourcemanager
$ sbin/yarn-daemon.sh stop nodemanager
$ sbin/mr-jobhistory-daemon.sh stop historyserver