Hadoop (2): Hadoop's Hello World (Installation and Usage in Standalone Mode)

This article is also available on my blog: liaosi's blog - Hadoop (2): Hadoop's Hello World (Installation and Usage in Standalone Mode)

The examples in this article use a VMware virtual machine running 64-bit CentOS 7, with Hadoop 2.8.2 and JDK 1.8, logged in as the hadoop account created earlier (see Hadoop (1): Introduction to Hadoop and Pre-installation Preparation).
Before installing Hadoop, make sure the Java JDK is already installed on the system and the Java environment variables are configured.

A Hadoop cluster can be started in three modes:

  • Standalone mode: after Hadoop is downloaded onto the system, it is configured by default to run as a single Java process in non-distributed mode. This mode is suitable for debugging.
  • Pseudo-distributed mode: Hadoop can also run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon (HDFS, YARN, MapReduce, etc.) runs as a separate Java process. This mode is suitable for development.
  • Fully distributed mode: a truly distributed deployment, which requires a cluster of multiple independent servers.

This article walks through an example of standalone mode.

I. Download and Configure Hadoop

1. Download Hadoop from the official site http://hadoop.apache.org/rele... and extract it to a directory on the server (here I am logged in as the hadoop user and extract it into the ${HOME}/app directory).
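
For reference, a minimal sketch of this step using standard shell commands (the mirror URL below is an assumption; use the download link from the release page you actually chose):

wget http://archive.apache.org/dist/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz   # assumed mirror URL
mkdir -p ${HOME}/app                             # create the target directory if it does not exist
tar -zxvf hadoop-2.8.2.tar.gz -C ${HOME}/app     # extract into ${HOME}/app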

2. Configure the Java installation directory in Hadoop's environment configuration file.
Edit the ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh file and set JAVA_HOME to the root path of your Java installation.
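
A sketch of the change, assuming the JDK is installed under /usr/java (the exact path is an assumption; replace it with your actual JDK root):

# In ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_151   # assumed JDK path; use your own installation root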

3. Configure the Hadoop environment variables.
Add the following to /etc/profile:

export HADOOP_HOME=/home/hadoop/app/hadoop-2.8.2
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

For example, my /etc/profile ends with the two export lines shown above.
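
To make the new variables take effect in the current shell session, reload the profile:

source /etc/profile
echo $HADOOP_HOME    # should print the Hadoop installation directory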

4. Run the hadoop version command to verify that the environment variables are configured correctly. If everything is set up properly, you should see output similar to the following:

[hadoop@server01 hadoop]$ hadoop version
   Hadoop 2.8.2
   Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 66c47f2a01ad9637879e95f80c41f798373828fb
   Compiled by jdu on 2017-10-19T20:39Z
   Compiled with protoc 2.5.0
   From source with checksum dce55e5afe30c210816b39b631a53b1d
   This command was run using /home/hadoop/app/hadoop-2.8.2/share/hadoop/common/hadoop-common-2.8.2.jar
   [hadoop@server01 hadoop]$

II. Usage Examples

Hadoop ships with a MapReduce example program, $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar, which demonstrates basic MapReduce functionality and provides a number of sample jobs, including wordcount, terasort, join, and grep.

You can run the following command to list the MapReduce jobs this .jar file supports.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar
[hadoop@server01 mapreduce]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[hadoop@server01 mapreduce]$
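
For instance, the pi job from the list above estimates Pi with a quasi-Monte Carlo method. It takes the number of map tasks and the number of samples per map as arguments; the small values below are just for a quick local test:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar pi 2 10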

Next, let's demonstrate how to use this bundled MapReduce program to count the words in a set of files.

1. Create a directory to hold the data we want to process; it can be anywhere (here I create an input directory under /home/hadoop/hadoopdata) and put the files you want to analyze into it (here I copy Hadoop's configuration files into the input directory).

cd /home/hadoop/hadoopdata
mkdir input
cp /home/hadoop/app/hadoop-2.8.2/etc/hadoop/*.xml input
ls -l input
[hadoop@server01 hadoopdata]$ cp /home/hadoop/app/hadoop-2.8.2/etc/hadoop/*.xml input
[hadoop@server01 hadoopdata]$ ll input
total 52
-rw-r--r--. 1 hadoop hadoop 4942 Apr 30 11:43 capacity-scheduler.xml
-rw-r--r--. 1 hadoop hadoop 1144 Apr 30 11:43 core-site.xml
-rw-r--r--. 1 hadoop hadoop 9683 Apr 30 11:43 hadoop-policy.xml
-rw-r--r--. 1 hadoop hadoop  854 Apr 30 11:43 hdfs-site.xml
-rw-r--r--. 1 hadoop hadoop  620 Apr 30 11:43 httpfs-site.xml
-rw-r--r--. 1 hadoop hadoop 3518 Apr 30 11:43 kms-acls.xml
-rw-r--r--. 1 hadoop hadoop 5546 Apr 30 11:43 kms-site.xml
-rw-r--r--. 1 hadoop hadoop  871 Apr 30 11:43 mapred-site.xml
-rw-r--r--. 1 hadoop hadoop 1067 Apr 30 11:43 yarn-site.xml
[hadoop@server01 hadoopdata]$

2. In this example we take all the files in the input folder as input, pick out the words that match the regular expression dfs[a-z.]+, and count how many times they occur. Run the following command from the /home/hadoop/hadoopdata directory to start the Hadoop job.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dfs[a-z.]+'

If it succeeds, a series of progress messages is printed and the results are written to the output folder. View them with the command cat output/*; the word dfsadmin, which matches the regex, appears once:

Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=123
    File Output Format Counters 
        Bytes Written=23
[hadoop@server01 hadoopdata]$ cat output/*
1   dfsadmin
[hadoop@server01 hadoopdata]$ ll output/
total 4
-rw-r--r--. 1 hadoop hadoop 11 Apr 30 12:51 part-r-00000
-rw-r--r--. 1 hadoop hadoop  0 Apr 30 12:51 _SUCCESS
[hadoop@server04 hadoopdata]$

Note that Hadoop does not overwrite existing result files by default, so running another job that writes to the same output directory will fail with an error; you need to delete the output directory first.

3. After deleting the output directory, let's count the words again, this time with the wordcount job:

[hadoop@server04 hadoopdata]$ rm -rf output/
[hadoop@server04 hadoopdata]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount input output

The results look like this:

File Input Format Counters 
      Bytes Read=26548
  File Output Format Counters 
      Bytes Written=10400
[hadoop@server04 hadoopdata]$ cat output/*
"*" 18
"AS 8
"License"); 8
"alice,bob  18
"clumping"  1
"kerberos".   1
"simple"  1
'HTTP/' 1
'none'  1
'random'    1

With that, we have successfully used Hadoop's bundled MapReduce program to count the number of words in our files.
