Hadoop (2): Hadoop's Hello World (Installation and Usage in Standalone Mode)

This article is also available on my blog: liaosi's blog - Hadoop (2): Hadoop's Hello World (Installation and Usage in Standalone Mode)

The examples in this article use a VMware virtual machine running 64-bit CentOS 7, with Hadoop 2.8.2 and JDK 1.8, logged in as the hadoop account created earlier (see Hadoop (1): Introduction to Hadoop and Pre-installation Preparation).
Before installing Hadoop, make sure the Java JDK is already installed on the system and the Java environment variables are configured.

A Hadoop cluster can be started in three modes:

  • Standalone mode: after Hadoop is downloaded onto the system, it is configured by default to run as a single Java process in non-distributed mode. This mode is suitable for debugging.
  • Pseudo-distributed mode: Hadoop can also run on a single node in so-called pseudo-distributed mode, where each Hadoop daemon (HDFS, YARN, MapReduce, etc.) runs as a separate Java process. This mode is suitable for development.
  • Fully distributed mode: a truly distributed deployment, which requires a cluster of multiple independent servers.

This article walks through an example of standalone mode.

I. Download and Configure Hadoop

1. Download Hadoop from the official site http://hadoop.apache.org/rele... and extract it to a directory on the server (here I am logged in as the hadoop user and extract it into the ${HOME}/app directory).
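
For reference, a minimal sketch of this step using standard shell commands (the mirror URL below is an assumption; use the download link from the release page you actually chose):

wget http://archive.apache.org/dist/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz   # assumed mirror URL
mkdir -p ${HOME}/app                             # create the target directory if it does not exist
tar -zxvf hadoop-2.8.2.tar.gz -C ${HOME}/app     # extract into ${HOME}/app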

2. Configure the Java installation directory in Hadoop's environment configuration file.
Edit the ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh file and set JAVA_HOME to the root path of your Java installation.
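
A sketch of the change, assuming the JDK is installed under /usr/java (the exact path is an assumption; replace it with your actual JDK root):

# In ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_151   # assumed JDK path; use your own installation root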

3. Configure the Hadoop environment variables.
Add the following to /etc/profile:

export HADOOP_HOME=/home/hadoop/app/hadoop-2.8.2
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

For example, my /etc/profile ends with the two export lines shown above.
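
To make the new variables take effect in the current shell session, reload the profile:

source /etc/profile
echo $HADOOP_HOME    # should print the Hadoop installation directory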

4. Run the hadoop version command to verify that the environment variables are configured correctly. If everything is set up properly, you should see output similar to the following:

[hadoop@server01 hadoop]$ hadoop version
   Hadoop 2.8.2
   Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 66c47f2a01ad9637879e95f80c41f798373828fb
   Compiled by jdu on 2017-10-19T20:39Z
   Compiled with protoc 2.5.0
   From source with checksum dce55e5afe30c210816b39b631a53b1d
   This command was run using /home/hadoop/app/hadoop-2.8.2/share/hadoop/common/hadoop-common-2.8.2.jar
   [hadoop@server01 hadoop]$

II. Usage Examples

Hadoop ships with a MapReduce example program, $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar, which demonstrates basic MapReduce functionality and provides a number of sample jobs, including wordcount, terasort, join, and grep.

You can run the following command to list the MapReduce jobs this .jar file supports.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar
[hadoop@server01 mapreduce]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[hadoop@server01 mapreduce]$
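
For instance, the pi job from the list above estimates Pi with a quasi-Monte Carlo method. It takes the number of map tasks and the number of samples per map as arguments; the small values below are just for a quick local test:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar pi 2 10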

Next, let's demonstrate how to use this bundled MapReduce program to count the words in a set of files.

1. Create a directory to hold the data we want to process; it can be anywhere (here I create an input directory under /home/hadoop/hadoopdata) and put the files you want to analyze into it (here I copy Hadoop's configuration files into the input directory).

cd /home/hadoop/hadoopdata
mkdir input
cp /home/hadoop/app/hadoop-2.8.2/etc/hadoop/*.xml input
ls -l input
[hadoop@server01 hadoopdata]$ cp /home/hadoop/app/hadoop-2.8.2/etc/hadoop/*.xml input
[hadoop@server01 hadoopdata]$ ll input
total 52
-rw-r--r--. 1 hadoop hadoop 4942 Apr 30 11:43 capacity-scheduler.xml
-rw-r--r--. 1 hadoop hadoop 1144 Apr 30 11:43 core-site.xml
-rw-r--r--. 1 hadoop hadoop 9683 Apr 30 11:43 hadoop-policy.xml
-rw-r--r--. 1 hadoop hadoop  854 Apr 30 11:43 hdfs-site.xml
-rw-r--r--. 1 hadoop hadoop  620 Apr 30 11:43 httpfs-site.xml
-rw-r--r--. 1 hadoop hadoop 3518 Apr 30 11:43 kms-acls.xml
-rw-r--r--. 1 hadoop hadoop 5546 Apr 30 11:43 kms-site.xml
-rw-r--r--. 1 hadoop hadoop  871 Apr 30 11:43 mapred-site.xml
-rw-r--r--. 1 hadoop hadoop 1067 Apr 30 11:43 yarn-site.xml
[hadoop@server01 hadoopdata]$

2. In this example we take all the files in the input folder as input, pick out the words that match the regular expression dfs[a-z.]+, and count how many times they occur. Run the following command from the /home/hadoop/hadoopdata directory to start the Hadoop job.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dfs[a-z.]+'

If it succeeds, a series of progress messages is printed and the results are written to the output folder. View them with the command cat output/*; the word dfsadmin, which matches the regex, appears once:

Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=123
    File Output Format Counters 
        Bytes Written=23
[hadoop@server01 hadoopdata]$ cat output/*
1   dfsadmin
[hadoop@server01 hadoopdata]$ ll output/
total 4
-rw-r--r--. 1 hadoop hadoop 11 Apr 30 12:51 part-r-00000
-rw-r--r--. 1 hadoop hadoop  0 Apr 30 12:51 _SUCCESS
[hadoop@server04 hadoopdata]$

Note that Hadoop does not overwrite existing result files by default, so running another job that writes to the same output directory will fail with an error; you need to delete the output directory first.

3. After deleting the output directory, let's count the words again, this time with the wordcount job:

[hadoop@server04 hadoopdata]$ rm -rf output/
[hadoop@server04 hadoopdata]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount input output

The results look like this:

File Input Format Counters 
      Bytes Read=26548
  File Output Format Counters 
      Bytes Written=10400
[hadoop@server04 hadoopdata]$ cat output/*
"*" 18
"AS 8
"License"); 8
"alice,bob  18
"clumping"  1
"kerberos".   1
"simple"  1
'HTTP/' 1
'none'  1
'random'    1

With that, we have successfully used Hadoop's bundled MapReduce program to count the number of words in our files.
