Big Data Tutorial (8.2): WordCount Program Principles, Implementation, and Execution

        The previous post covered the MapReduce programming model; in this one I will walk you through how the wordcount program works and the details of implementing and running it. It should give you a solid overall picture of a MapReduce program. Keep in mind that the map and reduce classes are only two of Hadoop's pluggable components; the others (such as the input/output components) can be overridden as well, and when you do not override them, the framework falls back on its defaults.
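
        Just to make that concrete (a sketch of mine, not part of the tutorial's code; the class name DefaultComponentsDemo is made up): these are the standard Job setters for swapping those components, and TextInputFormat/TextOutputFormat are what the framework uses when nothing is specified.

package com.empire.hadoop.mr.wcdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DefaultComponentsDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        //these two calls only make the defaults explicit; pass your own
        //InputFormat/OutputFormat subclasses here to override how records
        //are read and written
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
    }
}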

        1. Implementing the wordcount program:

               WordcountMapper (map task business logic)

package com.empire.hadoop.mr.wcdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * KEYIN: by default, the starting byte offset of the line of text read by the MR framework; logically a Long,
 * but hadoop has its own leaner serialization interface, so LongWritable is used instead of Long
 * VALUEIN: by default, the content of the line of text read by the MR framework; a String, and for the same reason, Text
 * KEYOUT: the key of the data output after the user-defined logic finishes; here it is the word, a String, hence Text
 * VALUEOUT: the value of the data output after the user-defined logic finishes; here it is the word count, an Integer, hence IntWritable
 * 
 * @author
 */

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * The business logic of the map phase lives in this custom map() method;
     * the map task invokes it once for every line of input data.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        //first convert the text the map task hands us into a String
        String line = value.toString();
        //split the line into words on spaces
        String[] words = line.split(" ");

        //emit each word as <word, 1>
        for (String word : words) {
            //use the word as the key and the count 1 as the value, so the shuffle can
            //partition on the word and all occurrences of a word reach the same reduce task
            context.write(new Text(word), new IntWritable(1));
        }
    }

}
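
        An optional refinement of the mapper above (my sketch, not from the tutorial; the class name WordcountMapperReuse is made up): since map() runs once per input line, the output key/value objects can be reused instead of allocated per word, and splitting on the regex \s+ tolerates tabs and runs of spaces that split(" ") would turn into empty "words". The framework serializes the objects during write(), so mutating them on the next iteration is safe.

package com.empire.hadoop.mr.wcdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapperReuse extends Mapper<LongWritable, Text, Text, IntWritable> {

    //reused across map() calls; context.write() serializes the current
    //contents, so overwriting them afterwards is safe
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        for (String w : value.toString().split("\\s+")) {
            //splitting on \s+ can still leave one leading empty token; skip it
            if (!w.isEmpty()) {
                word.set(w);
                context.write(word, one);
            }
        }
    }
}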

           WordcountReducer (reduce business logic)

package com.empire.hadoop.mr.wcdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * KEYIN, VALUEIN correspond to the mapper's output KEYOUT, VALUEOUT types;
 * KEYOUT, VALUEOUT are the output types of the custom reduce logic:
 * KEYOUT is the word, VALUEOUT is the total count
 * 
 * @author
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * Example input groups:
     * <angelababy,1><angelababy,1><angelababy,1><angelababy,1><angelababy,1>
     * <hello,1><hello,1><hello,1><hello,1><hello,1><hello,1>
     * <banana,1><banana,1><banana,1><banana,1><banana,1><banana,1>
     * the key parameter is the key shared by one group of identical-word kv pairs
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int count = 0;
        /*
         * Iterator<IntWritable> iterator = values.iterator();
         * while(iterator.hasNext()){ count += iterator.next().get(); }
         */

        for (IntWritable value : values) {

            count += value.get();
        }

        context.write(key, new IntWritable(count));

    }

}
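
        A small optional addition (mine, not configured in the driver below, which is why the job counters later report Combine input records=0): because integer addition is commutative and associative, this same reducer class can double as a combiner and pre-sum counts on the map side, shrinking the shuffle. One extra line in the driver enables it:

//in WordcountDriver.main(), right after job.setReducerClass(...):
job.setCombinerClass(WordcountReducer.class);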

           WordcountDriver (the program that submits the job to YARN)

package com.empire.hadoop.mr.wcdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Acts as a client of the yarn cluster: it packages the runtime parameters
 * of our MR program, specifies the jar, and finally submits everything to yarn
 * 
 * @author
 */
public class WordcountDriver {

    public static void main(String[] args) throws Exception {

        if (args == null || args.length == 0) {
            args = new String[2];
            args[0] = "hdfs://master:9000/wordcount/input/wordcount.txt";
            args[1] = "hdfs://master:9000/wordcount/output8";
        }

        Configuration conf = new Configuration();

        //setting these here has no effect!  ??????
        //		conf.set("HADOOP_USER_NAME", "hadoop");
        //		conf.set("dfs.permissions.enabled", "false");

        /*
         * conf.set("mapreduce.framework.name", "yarn");
         * conf.set("yarn.resourcemanager.hostname", "mini1");
         */
        Job job = Job.getInstance(conf);

        /* job.setJar("/home/hadoop/wc.jar"); */
        //specify the local path of the jar containing this program
        job.setJarByClass(WordcountDriver.class);

        //specify the mapper/reducer business classes this job will use
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        //specify the kv types of the mapper's output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        //specify the kv types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        //specify the directory containing the job's raw input files
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        //specify the directory for the job's output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        //submit the parameters configured on this job, along with the jar containing the job's java classes, to yarn to run
        /* job.submit(); */
        boolean res = job.waitForCompletion(true);
        //in a linux shell script, an exit code of 0 from the previous command means success; anything else means failure
        System.exit(res ? 0 : 1);

    }

}
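
        On the two commented-out conf.set(...) lines marked "has no effect": as far as I can tell, the client user is resolved by UserGroupInformation from the JVM environment rather than from the job Configuration, which is why setting HADOOP_USER_NAME there is ignored. A common workaround, sketched under that assumption (the wrapper class name is made up), is to set it as a system property before touching any Hadoop API:

package com.empire.hadoop.mr.wcdemo;

public class WordcountDriverAsHadoopUser {
    public static void main(String[] args) throws Exception {
        //must run before the first FileSystem/Job call, since the login user
        //is cached once resolved; equivalent to exporting the
        //HADOOP_USER_NAME environment variable before launching the JVM
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        WordcountDriver.main(args);
    }
}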

        2. Running the MapReduce job

               (1) Package the jar

               

                (2) Upload it to the hadoop cluster and run it

#upload the jar

Alt+p
lcd d:/
put  wordcount_aaron.jar

#prepare the data files for hadoop to process

cd /home/hadoop/apps/hadoop-2.9.1
hadoop fs  -mkdir -p /wordcount/input
hadoop fs -put  LICENSE.txt NOTICE.txt /wordcount/input

#run the wordcount program

hadoop jar wordcount_aaron.jar  com.empire.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/outputs

The run log:

[hadoop@centos-aaron-h1 ~]$  hadoop jar wordcount_aaron.jar  com.empire.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/output
18/11/19 22:48:54 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/11/19 22:48:55 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/11/19 22:48:55 INFO input.FileInputFormat: Total input files to process : 2
18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/19 22:48:55 INFO mapreduce.JobSubmitter: number of splits:2
18/11/19 22:48:55 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/11/19 22:48:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1542637441480_0002
18/11/19 22:48:56 INFO impl.YarnClientImpl: Submitted application application_1542637441480_0002
18/11/19 22:48:56 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1542637441480_0002/
18/11/19 22:48:56 INFO mapreduce.Job: Running job: job_1542637441480_0002
18/11/19 22:49:03 INFO mapreduce.Job: Job job_1542637441480_0002 running in uber mode : false
18/11/19 22:49:03 INFO mapreduce.Job:  map 0% reduce 0%
18/11/19 22:49:09 INFO mapreduce.Job:  map 100% reduce 0%
18/11/19 22:49:14 INFO mapreduce.Job:  map 100% reduce 100%
18/11/19 22:49:15 INFO mapreduce.Job: Job job_1542637441480_0002 completed successfully
18/11/19 22:49:15 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=241219
                FILE: Number of bytes written=1074952
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=122364
                HDFS: Number of bytes written=35348
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=7588
                Total time spent by all reduces in occupied slots (ms)=3742
                Total time spent by all map tasks (ms)=7588
                Total time spent by all reduce tasks (ms)=3742
                Total vcore-milliseconds taken by all map tasks=7588
                Total vcore-milliseconds taken by all reduce tasks=3742
                Total megabyte-milliseconds taken by all map tasks=7770112
                Total megabyte-milliseconds taken by all reduce tasks=3831808
        Map-Reduce Framework
                Map input records=2430
                Map output records=19848
                Map output bytes=201516
                Map output materialized bytes=241225
                Input split bytes=239
                Combine input records=0
                Combine output records=0
                Reduce input groups=2794
                Reduce shuffle bytes=241225
                Reduce input records=19848
                Reduce output records=2794
                Spilled Records=39696
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=332
                CPU time spent (ms)=2830
                Physical memory (bytes) snapshot=557314048
                Virtual memory (bytes) snapshot=2538102784
                Total committed heap usage (bytes)=259411968
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=122125
        File Output Format Counters 
                Bytes Written=35348
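
        The "Hadoop command-line option parsing not performed" WARN near the top of this log is the framework suggesting the Tool/ToolRunner pattern, which parses generic options such as -D key=value before your code runs. A minimal sketch of the same driver rewritten that way (the class name WordcountTool is mine):

package com.empire.hadoop.mr.wcdemo;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordcountTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        //getConf() already contains any -D options ToolRunner parsed for us
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WordcountTool.class);
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        //ToolRunner strips the generic options before invoking run()
        System.exit(ToolRunner.run(new WordcountTool(), args));
    }
}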

        Viewing the results:

#view the result files
hadoop fs -ls /wordcount/output
hadoop fs -cat /wordcount/output/part-r-00000|more

        Troubleshooting:

18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)

The warning above appears because the hdfs directories were not created the way the official documentation prescribes. It is minor; on my cluster it did not affect normal operation.

        Solution:

#create the directories
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put NOTICE.txt  LICENSE.txt /user/hadoop

        Summary: the two ways of launching the job shown below behave the same; under the hood, hadoop jar simply invokes java with the Hadoop classpath (effectively the java -cp command).

hadoop jar wordcount_aaron.jar  com.empire.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/outputs
java -cp .:/home/hadoop/wordcount_aaron.jar:/home/hadoop/apps/hadoop-2.9.1....jar com.empire.hadoop.mr.wcdemo.WordcountDriver   /user/hadoop/    /wordcount/outputs

        A final word: that is everything for this post. If you found it worthwhile, please give it a like; and if you are interested in my other posts on servers and big data, or in the author himself, follow this blog and feel free to reach out and chat any time.
