Hadoop has distinct advantages for analyzing massive amounts of data. Today I took the time to set up pseudo-distributed mode on my own Linux machine; the process had quite a few twists and turns, so I am summarizing the experience here.
First, understand Hadoop's three installation modes:
1. Standalone mode. Standalone (local) mode is Hadoop's default mode. When the configuration files are left empty, Hadoop runs entirely on the local machine. Because it does not need to interact with any other node, standalone mode does not use HDFS and does not start any of the Hadoop daemons. This mode is mainly used for developing and debugging the application logic of MapReduce programs.
2. Pseudo-distributed mode. The Hadoop daemons all run on the local machine, simulating a small cluster. On top of standalone mode it adds the ability to debug your code against the real daemons: you can inspect memory usage, HDFS input and output, and the interaction with the other daemons.
3. Fully distributed mode. The Hadoop daemons run on a cluster of machines.
References:
1. Installing Hadoop 1.0.0 on Ubuntu 11.10 (standalone and pseudo-distributed)
2. Installing Hadoop on Ubuntu
3. Setting up a single-node Hadoop environment on Ubuntu 12.04
4. Installing and configuring single-node Hadoop on Ubuntu
5. Setting up a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)
6. Hadoop quick start: setting up a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode)
I strongly recommend references 5 and 6: they progress from simple to advanced, give detailed steps, and include runnable examples. Below I retrace my own installation; to save time, much of the text is taken from references 5 and 6, and I thank both authors again for sharing their experience. It is also worth reading an overview of Hadoop's architecture first, so that you understand why each of the steps below is done the way it is.
My setup: Ubuntu 12.04, username derek, hostname derekUbun, Hadoop release hadoop-1.1.2.tar.gz. Without further ado, the steps and the output at each stage are as follows:
Part 1. Create a hadoop user group and user on Ubuntu
1. Add a hadoop user to the system:
derek@derekUbun:~$ sudo addgroup hadoop
derek@derekUbun:~$ sudo adduser --ingroup hadoop hadoop
2. The hadoop user we just added does not have administrator privileges yet, so grant it sudo rights by opening the /etc/sudoers file:
derek@derekUbun:~$ sudo gedit /etc/sudoers
Below the line root ALL=(ALL:ALL) ALL, add hadoop ALL=(ALL:ALL) ALL.
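After the edit, the relevant part of /etc/sudoers should look roughly like the following (only the hadoop line is new; the rest is the stock Ubuntu content):
# User privilege specification
root    ALL=(ALL:ALL) ALL
hadoop  ALL=(ALL:ALL) ALL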
Part 2. Configure SSH
SSH is configured so that machines can run commands on each other without a login password. Avoiding the password prompt is essential; otherwise the master node would need the password typed in by hand every time it tries to reach another node.
How passwordless SSH works: the master (namenode/jobtracker) acts as the client. To authenticate to a slave server (datanode/tasktracker) with a public key instead of a password, a key pair consisting of a public key and a private key is generated on the master, and the public key is copied to every slave. When the master connects to a slave over SSH, the slave generates a random number, encrypts it with the master's public key, and sends it back. The master decrypts it with its private key and returns the decrypted value; once the slave confirms the value is correct, it lets the master connect. This is a standard public-key authentication exchange and requires no manually typed password. The key step is copying the master's public key onto the slaves.
1. Install ssh
1) Hadoop communicates over ssh, so install ssh first. Note that I switched from the derek user to the hadoop user here.
derek@derekUbun:~$ su - hadoop
Password:
hadoop@derekUbun:~$ sudo apt-get install openssh-server
[sudo] password for hadoop:
Reading package lists... Done
Building dependency tree
Reading state information... Done
openssh-server is already the newest version.
The following packages were automatically installed and are no longer required:
kde-l10n-de language-pack-kde-de language-pack-kde-en ssh-krb5
language-pack-de-base language-pack-kde-zh-hans language-pack-kde-en-base
kde-l10n-engb language-pack-kde-de-base kde-l10n-zhcn firefox-locale-de
language-pack-de language-pack-kde-zh-hans-base
Use 'apt-get autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 505 not upgraded.
Since my machine already had the latest version of ssh installed, this step did not actually change anything.
2) With ssh installed, start the service. Once it is started, you can verify that it is running with the commands below:
hadoop@derekUbun:~$ sudo /etc/init.d/ssh start
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service ssh start
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start ssh
hadoop@derekUbun:~$ ps -e |grep ssh
759 ? 00:00:00 sshd
1691 ? 00:00:00 ssh-agent
12447 ? 00:00:00 ssh
12448 ? 00:00:00 sshd
12587 ? 00:00:00 sshd
hadoop@derekUbun:~$
3) ssh is a secure communication protocol and its keys can be generated with either rsa or dsa (rsa is the default). A normal login asks for a password, so to log in without one we generate a private/public key pair:
hadoop@derekUbun:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
/home/hadoop/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
c7:36:c7:77:91:a2:32:28:35:a6:9f:36:dd:bd:dc:4f hadoop@derekUbun
The key's randomart image is:
+--[ RSA 2048]----+
| |
| .|
| + . o |
| + o. .. . .|
| o .So=.o . .|
| o oo+o.. . |
| = . . . E|
| . . . o. |
| o .o|
+-----------------+
hadoop@derekUbun:~$
(Note: after you press Enter, two files are generated under ~/.ssh/: id_rsa and id_rsa.pub. They come as a pair; the former is the private key, the latter the public key.)
Go into ~/.ssh/ and append the public key id_rsa.pub to the authorized_keys file. There is no authorized_keys file at first (authorized_keys holds the public keys of every user allowed to log in over ssh as the current user):
hadoop@derekUbun:~$ cat ~/.ssh/id_rsa.pub>> ~/.ssh/authorized_keys
Now ssh in to confirm that future logins no longer ask for a password:
hadoop@derekUbun:~$ ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-27-generic-pae i686)
* Documentation: https://help.ubuntu.com/
512 packages can be updated.
151 updates are security updates.
Last login: Mon Mar 11 15:56:15 2013 from localhost
hadoop@derekUbun:~$
(Note: once you ssh into a remote machine you are controlling that remote machine; you must run exit before you are back on the local host.)
Log out with exit.
From now on, logging in no longer requires a password.
hadoop@derekUbun:~$ exit
Connection to localhost closed.
hadoop@derekUbun:~$
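If ssh localhost still prompts for a password, one common cause (an assumption about your setup, not something I hit here) is that the permissions on ~/.ssh are too open, in which case sshd ignores authorized_keys. Tightening them usually fixes it:
hadoop@derekUbun:~$ chmod 700 ~/.ssh
hadoop@derekUbun:~$ chmod 600 ~/.ssh/authorized_keys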
Part 3. Install Java
Install java as the derek user. Java is already installed on my machine, under /usr/java/jdk1.7.0_17, so I can simply show the installed version.
hadoop@derekUbun:~$ su - derek
Password:
derek@derekUbun:~$ java -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) Server VM (build 23.7-b01, mixed mode)
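If your machine does not have a JDK yet, one straightforward option on Ubuntu 12.04 (a suggestion only; it is not the path I took, since I use the Oracle JDK under /usr/java) is to install OpenJDK from the repositories and check the version afterwards:
derek@derekUbun:~$ sudo apt-get install openjdk-7-jdk
derek@derekUbun:~$ java -version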
Part 4. Install hadoop-1.1.2
Download the Hadoop release from the official site; I downloaded the latest version at the time, hadoop-1.1.2.tar.gz. Unpack it into the directory of your choice; I put hadoop-1.1.2.tar.gz under /usr/local and renamed the unpacked folder to hadoop.
hadoop@derekUbun:/usr/local$ sudo tar xzf hadoop-1.1.2.tar.gz (note: I had already copied hadoop-1.1.2.tar.gz into /usr/local and switched to the hadoop user)
hadoop@derekUbun:/usr/local$ sudo mv hadoop-1.1.2 /usr/local/hadoop
To make sure every operation is done as the hadoop user, change the owner of the hadoop folder to hadoop:
hadoop@derekUbun:/usr/local$ sudo chown -R hadoop:hadoop hadoop
Part 5. Configure hadoop-env.sh (the Java installation path)
Log in as the hadoop user, go into /usr/local/hadoop, open conf/hadoop-env.sh, and add the following (find the line #export JAVA_HOME=..., remove the #, and set it to your machine's jdk path):
export JAVA_HOME=/usr/java/jdk1.7.0_17 (adjust to wherever Java is installed on your machine; mine is /usr/java/jdk1.7.0_17)
export HADOOP_INSTALL=/usr/local/hadoop (note that I use HADOOP_INSTALL rather than HADOOP_HOME, since the latter is deprecated in newer releases and triggers a warning if used)
export PATH=$PATH:/usr/local/hadoop/bin
hadoop@derekUbun:/usr/local/hadoop$ sudo vi conf/hadoop-env.sh
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/java/jdk1.7.0_17
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:/usr/local/hadoop/bin
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
# export HADOOP_HEAPSIZE=2000
# Extra Java runtime options. Empty by default.
# export HADOOP_OPTS=-server
"conf/hadoop-env.sh" 57L, 2356C
Then run source so that the environment variable settings take effect:
hadoop@derekUbun:/usr/local/hadoop$ source /usr/local/hadoop/conf/hadoop-env.sh
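Sourcing hadoop-env.sh only affects the current shell. If you want the hadoop command on your PATH in every new session (optional; nothing in the rest of this walkthrough depends on it), you can append the same exports to the hadoop user's ~/.bashrc:
hadoop@derekUbun:/usr/local/hadoop$ echo 'export HADOOP_INSTALL=/usr/local/hadoop' >> ~/.bashrc
hadoop@derekUbun:/usr/local/hadoop$ echo 'export PATH=$PATH:/usr/local/hadoop/bin' >> ~/.bashrc
hadoop@derekUbun:/usr/local/hadoop$ source ~/.bashrc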
At this point hadoop's standalone mode is installed. You can print the Hadoop version as follows:
hadoop@derekUbun:/usr/local/hadoop$ hadoop version
Hadoop 1.1.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782
Compiled by hortonfo on Thu Jan 31 02:03:24 UTC 2013
From source with checksum c720ddcf4b926991de7467d253a79b8b
hadoop@derekUbun:/usr/local/hadoop$
Now run the WordCount example that ships with hadoop to get a feel for the MapReduce process.
Create an input folder inside the hadoop directory:
hadoop@derekUbun:/usr/local/hadoop$ mkdir input
Copy all the files from conf into the input folder:
hadoop@derekUbun:/usr/local/hadoop$ cp conf/* input
Run the WordCount program and save the results to output:
hadoop@derekUbun:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.1.2.jar wordcount input output
View the results:
hadoop@derekUbun:/usr/local/hadoop$ cat output/*
You will see the words from all the conf files with their counts.
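The output is written as part files under output/; each line is a word, a tab, and its count. The words and numbers below are only an illustration of the format, not my actual results:
hadoop@derekUbun:/usr/local/hadoop$ cat output/*
hadoop  7
property        22
value   22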
Part 6. Configuration for pseudo-distributed mode
Three files need to be edited here: core-site.xml, hdfs-site.xml and mapred-site.xml, all under the /usr/local/hadoop/conf directory.
core-site.xml: configuration for Hadoop Core, such as I/O settings shared by HDFS and MapReduce.
hdfs-site.xml: configuration for the HDFS daemons: the namenode, the secondary namenode and the datanodes.
mapred-site.xml: configuration for the MapReduce daemons: the jobtracker and the tasktrackers.
1. Edit the three files:
1) core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
2) hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/datalog1,/usr/local/hadoop/datalog2</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop/data1,/usr/local/hadoop/data2</value>
</property>
</configuration>
3) mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
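The local paths referenced above (hadoop.tmp.dir, dfs.name.dir, dfs.data.dir) should exist and be writable by the hadoop user. Formatting the namenode may create some of them for you, so treat this as a precaution rather than a required step:
hadoop@derekUbun:/usr/local/hadoop$ mkdir -p tmp datalog1 datalog2 data1 data2
hadoop@derekUbun:/usr/local/hadoop$ sudo chown -R hadoop:hadoop /usr/local/hadoop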
2. Source the environment and format the HDFS namenode (only the namenode needs formatting):
hadoop@derekUbun:/usr/local/hadoop$ source /usr/local/hadoop/conf/hadoop-env.sh
hadoop@derekUbun:/usr/local/hadoop$ hadoop namenode -format
Output like the following means the hdfs file system was formatted successfully:
13/03/11 23:08:01 INFO common.Storage: Storage directory /usr/local/hadoop/datalog2 has been successfully formatted.
13/03/11 23:08:01 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at derekUbun/127.0.1.1
************************************************************/
3. Start Hadoop
Next run start-all.sh to launch all the services, including the namenode and the datanode; the start-all.sh script is what loads the daemons.
hadoop@derekUbun:/usr/local/hadoop$ cd bin
hadoop@derekUbun:/usr/local/hadoop/bin$ start-all.sh
starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-namenode-derekUbun.out
localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-datanode-derekUbun.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-derekUbun.out
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-derekUbun.out
localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-derekUbun.out
hadoop@derekUbun:/usr/local/hadoop/bin$
Use Java's jps command to list all the daemons and verify that the installation succeeded:
hadoop@derekUbun:/usr/local/hadoop$ jps
A list like the following means everything is up:
hadoop@derekUbun:/usr/local/hadoop$ jps
8431 JobTracker
8684 TaskTracker
7821 NameNode
8915 Jps
8341 SecondaryNameNode
hadoop@derekUbun:/usr/local/hadoop$
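If one of the daemons is missing from the jps list (a namenode or datanode that failed to start is the usual suspect), the log files under /usr/local/hadoop/logs normally say why; their names follow the same hadoop-hadoop-<daemon>-derekUbun pattern as the .out files in the start-all.sh output above:
hadoop@derekUbun:/usr/local/hadoop$ ls logs/
hadoop@derekUbun:/usr/local/hadoop$ tail -n 30 logs/hadoop-hadoop-namenode-derekUbun.log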
4. Check the running status
Everything is configured and Hadoop is running. You can now check that the services are healthy through the web interfaces Hadoop provides for monitoring cluster status:
http://localhost:50030/ - Hadoop administration page (JobTracker)
http://localhost:50060/ - Hadoop Task Tracker status
http://localhost:50070/ - Hadoop DFS status
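On a machine without a browser, a quick sanity check from the shell (just a convenience, and it assumes curl is installed) is to confirm that the ports answer with HTTP 200:
hadoop@derekUbun:/usr/local/hadoop$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/
hadoop@derekUbun:/usr/local/hadoop$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/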
At this point hadoop's pseudo-distributed mode is installed, so let's run the bundled WordCount example again, this time in pseudo-distributed mode, to see MapReduce running on HDFS.
Note that the program now runs against the HDFS file system, and the files it creates live in HDFS as well.
First create the input directory in HDFS:
hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -mkdir input
Copy the files from conf into input on HDFS:
hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -copyFromLocal conf/* input
(Note: hadoop dfs -ls and hadoop dfs -rmr can be used to view and delete files in HDFS; see the notes at the end.)
Run WordCount in pseudo-distributed mode:
hadoop jar hadoop-examples-1.1.2.jar wordcount input output
hadoop@derekUbun:/usr/local/hadoop$ hadoop jar hadoop-examples-1.1.2.jar wordcount input output
13/03/12 09:26:05 INFO input.FileInputFormat: Total input paths to process : 16
13/03/12 09:26:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/12 09:26:05 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/12 09:26:05 INFO mapred.JobClient: Running job: job_201303120920_0001
13/03/12 09:26:06 INFO mapred.JobClient: map 0% reduce 0%
13/03/12 09:26:10 INFO mapred.JobClient: map 12% reduce 0%
13/03/12 09:26:13 INFO mapred.JobClient: map 25% reduce 0%
13/03/12 09:26:15 INFO mapred.JobClient: map 37% reduce 0%
13/03/12 09:26:17 INFO mapred.JobClient: map 50% reduce 0%
13/03/12 09:26:18 INFO mapred.JobClient: map 62% reduce 0%
13/03/12 09:26:19 INFO mapred.JobClient: map 62% reduce 16%
13/03/12 09:26:20 INFO mapred.JobClient: map 75% reduce 16%
13/03/12 09:26:22 INFO mapred.JobClient: map 87% reduce 16%
13/03/12 09:26:24 INFO mapred.JobClient: map 100% reduce 16%
13/03/12 09:26:28 INFO mapred.JobClient: map 100% reduce 29%
13/03/12 09:26:30 INFO mapred.JobClient: map 100% reduce 100%
13/03/12 09:26:30 INFO mapred.JobClient: Job complete: job_201303120920_0001
13/03/12 09:26:30 INFO mapred.JobClient: Counters: 29
13/03/12 09:26:30 INFO mapred.JobClient: Job Counters
13/03/12 09:26:30 INFO mapred.JobClient: Launched reduce tasks=1
13/03/12 09:26:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=29912
13/03/12 09:26:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/12 09:26:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/03/12 09:26:30 INFO mapred.JobClient: Launched map tasks=16
13/03/12 09:26:30 INFO mapred.JobClient: Data-local map tasks=16
13/03/12 09:26:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=19608
13/03/12 09:26:30 INFO mapred.JobClient: File Output Format Counters
13/03/12 09:26:30 INFO mapred.JobClient: Bytes Written=15836
13/03/12 09:26:30 INFO mapred.JobClient: FileSystemCounters
13/03/12 09:26:30 INFO mapred.JobClient: FILE_BYTES_READ=23161
13/03/12 09:26:30 INFO mapred.JobClient: HDFS_BYTES_READ=29346
13/03/12 09:26:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=944157
13/03/12 09:26:30 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=15836
13/03/12 09:26:30 INFO mapred.JobClient: File Input Format Counters
13/03/12 09:26:30 INFO mapred.JobClient: Bytes Read=27400
13/03/12 09:26:30 INFO mapred.JobClient: Map-Reduce Framework
13/03/12 09:26:30 INFO mapred.JobClient: Map output materialized bytes=23251
13/03/12 09:26:30 INFO mapred.JobClient: Map input records=778
13/03/12 09:26:30 INFO mapred.JobClient: Reduce shuffle bytes=23251
13/03/12 09:26:30 INFO mapred.JobClient: Spilled Records=2220
13/03/12 09:26:30 INFO mapred.JobClient: Map output bytes=36314
13/03/12 09:26:30 INFO mapred.JobClient: Total committed heap usage (bytes)=2736914432
13/03/12 09:26:30 INFO mapred.JobClient: CPU time spent (ms)=6550
13/03/12 09:26:30 INFO mapred.JobClient: Combine input records=2615
13/03/12 09:26:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=1946
13/03/12 09:26:30 INFO mapred.JobClient: Reduce input records=1110
13/03/12 09:26:30 INFO mapred.JobClient: Reduce input groups=804
13/03/12 09:26:30 INFO mapred.JobClient: Combine output records=1110
13/03/12 09:26:30 INFO mapred.JobClient: Physical memory (bytes) snapshot=2738036736
13/03/12 09:26:30 INFO mapred.JobClient: Reduce output records=804
13/03/12 09:26:30 INFO mapred.JobClient: Virtual memory (bytes) snapshot=6773346304
13/03/12 09:26:30 INFO mapred.JobClient: Map output records=2615
hadoop@derekUbun:/usr/local/hadoop$
Display the output:
hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -cat output/*
When you are finished with Hadoop, you can shut its daemons down with the stop-all.sh script:
hadoop@derekUbun:/usr/local/hadoop$ bin/stop-all.sh
Now, start your Hadoop journey and implement some algorithms!
Notes:
1. In pseudo-distributed mode, you can list the contents of input with hadoop dfs -ls
2. In pseudo-distributed mode, you can delete input with hadoop dfs -rmr (a short example follows after these notes)
3. In pseudo-distributed mode, input and output both live in the HDFS file system (hadoop dfs), not on the local disk
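A minimal illustration with the paths used above (keep in mind that re-running the WordCount job fails if the output directory already exists in HDFS, so remove it first):
hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -ls input
hadoop@derekUbun:/usr/local/hadoop$ hadoop dfs -rmr output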