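The previous post covered the MapReduce programming model. In this one I will walk you through how the wordcount program works, how it is coded, and the details of running it. By the end you should have a rough picture of what a MapReduce program looks like. In fact, map and reduce are only two of the components of a Hadoop job; the others (such as input/output) can also be overridden, and the default implementations are used unless you replace them.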
1. Implementing the wordcount program:
WordcountMapper (map task business logic)
package com.empire.hadoop.mr.wcdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * KEYIN: by default, the starting offset of the line of text read by the MR framework (a Long).
 * Hadoop has its own, more compact serialization interface, so LongWritable is used instead of Long.
 * VALUEIN: by default, the content of the line of text read by the MR framework (a String); as above, Text is used.
 * KEYOUT: the key of the data emitted by the user-defined logic; here it is the word (a String), so Text is used.
 * VALUEOUT: the value of the data emitted by the user-defined logic; here it is the word's count (an Integer), so IntWritable is used.
 *
 * @author
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * The map-phase business logic lives in this custom map() method.
     * The map task calls it once for every line of input.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Convert the text handed over by the map task into a String
        String line = value.toString();
        // Split the line into words on single spaces
        String[] words = line.split(" ");

        // Emit each word as <word, 1>
        for (String word : words) {
            // The word is the key and the count 1 is the value, so that the framework can
            // distribute the pairs by word and identical words reach the same reduce task
            context.write(new Text(word), new IntWritable(1));
        }
    }

}
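To make the mapper's behaviour concrete, here is a minimal plain-Java sketch that runs without Hadoop. The class MapSketch and its methods mapLine and partitionFor are hypothetical names introduced only for illustration. mapLine mimics the pairs that map() writes for one line, and it shows that splitting on a single space yields an empty token when two spaces appear in a row. partitionFor uses the same formula as Hadoop's default hash partitioner, but on String.hashCode() rather than Text.hashCode(), so the actual partition numbers on a cluster will differ; the point is only that equal keys always map to the same reduce task.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch, no Hadoop classes involved; every name here is hypothetical.
class MapSketch {

    // Mimics what WordcountMapper.map() writes for one input line: one <word, 1> pair per token.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Assumption: same formula as the default hash partitioner, applied to String.hashCode()
    // instead of Text.hashCode(). Equal keys always land in the same partition.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // prints [hello=1, banana=1, hello=1]
        System.out.println(mapLine("hello banana hello"));
        // Two consecutive spaces produce an empty token, so this prints 3
        System.out.println(mapLine("hello  banana").size());
        // The same key always goes to the same reduce task, so this prints true
        System.out.println(partitionFor("hello", 3) == partitionFor("hello", 3));
    }
}
```

If empty tokens are unwanted, one option is to split on the regular expression "\\s+" and skip empty strings; the mapper in this post does not do that, so its counters include them.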
WordcountReducer (reduce business logic)
package com.empire.hadoop.mr.wcdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * KEYIN, VALUEIN correspond to the KEYOUT, VALUEOUT types emitted by the mapper.
 * KEYOUT, VALUEOUT are the output types of the custom reduce logic:
 * KEYOUT is the word, VALUEOUT is the total count.
 *
 * @author
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * <angelababy,1><angelababy,1><angelababy,1><angelababy,1><angelababy,1>
     * <hello,1><hello,1><hello,1><hello,1><hello,1><hello,1>
     * <banana,1><banana,1><banana,1><banana,1><banana,1><banana,1>
     * The key argument is the key shared by one group of kv pairs for the same word.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int count = 0;
        /*
         * Iterator<IntWritable> iterator = values.iterator();
         * while(iterator.hasNext()){ count += iterator.next().get(); }
         */

        // Add up all the 1s emitted for this word
        for (IntWritable value : values) {
            count += value.get();
        }

        // Emit <word, total count>
        context.write(key, new IntWritable(count));

    }

}
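What happens between the mapper and the reducer can be sketched in plain Java, again without Hadoop. The class ReduceSketch and its methods groupByKey, reduce and wordCount are hypothetical names used only for illustration. groupByKey is a rough stand-in for the shuffle (a TreeMap is assumed here simply to collect values per word in sorted key order), and reduce performs the same summation as WordcountReducer.reduce().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch, no Hadoop classes involved; every name here is hypothetical.
class ReduceSketch {

    // Stand-in for the shuffle: collect the 1s emitted for each word under that word.
    static Map<String, List<Integer>> groupByKey(String[] mapOutputKeys) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String word : mapOutputKeys) {
            groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return groups;
    }

    // Same summation as WordcountReducer.reduce(): add up every value of one group.
    static int reduce(Iterable<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    // Calls reduce() once per group, the way the framework calls it once per key.
    static Map<String, Integer> wordCount(String[] mapOutputKeys) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groupByKey(mapOutputKeys).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] mapOutputKeys = {"hello", "banana", "hello", "angelababy", "hello"};
        // prints {angelababy=1, banana=1, hello=3}
        System.out.println(wordCount(mapOutputKeys));
    }
}
```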
WordcountDriver (the program that submits the job to YARN)
package com.empire.hadoop.mr.wcdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Acts as a client of the YARN cluster. It packages the run parameters of our
 * MR program, specifies the jar, and finally submits the job to YARN.
 *
 * @author
 */
public class WordcountDriver {

    public static void main(String[] args) throws Exception {

        // Fall back to default input/output paths when none are given
        if (args == null || args.length == 0) {
            args = new String[2];
            args[0] = "hdfs://master:9000/wordcount/input/wordcount.txt";
            args[1] = "hdfs://master:9000/wordcount/output8";
        }

        Configuration conf = new Configuration();

        // These settings had no effect when tried:
        // conf.set("HADOOP_USER_NAME", "hadoop");
        // conf.set("dfs.permissions.enabled", "false");

        /*
         * conf.set("mapreduce.framework.name", "yarn");
         * conf.set("yarn.resourcemanager.hostname", "mini1");
         */
        Job job = Job.getInstance(conf);

        /* job.setJar("/home/hadoop/wc.jar"); */
        // Specify the jar that contains this program, located through the driver class
        job.setJarByClass(WordcountDriver.class);

        // Specify the mapper/reducer classes this job uses
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // Specify the kv types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Specify the kv types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Specify the directory holding the job's raw input files
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Specify the directory for the job's output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the parameters configured on the job, together with the jar
        // containing the job's Java classes, to YARN for execution
        /* job.submit(); */
        boolean res = job.waitForCompletion(true);
        // In a Linux shell script, an exit status of 0 from the previous command
        // means success and anything else means failure
        System.exit(res ? 0 : 1);

    }

}
2. Running the MapReduce job
(1) Package the program as a jar
(2) Upload the jar to the Hadoop cluster and run it
# Upload the jar
Alt+p
lcd d:/
put wordcount_aaron.jar

# Prepare the data files for Hadoop to process
cd /home/hadoop/apps/hadoop-2.9.1
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put LICENSE.txt NOTICE.txt /wordcount/input

# Run the wordcount program
hadoop jar wordcount_aaron.jar com.empire.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/output
Output of the run:
[hadoop@centos-aaron-h1 ~]$ hadoop jar wordcount_aaron.jar com.empire.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/output
18/11/19 22:48:54 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/11/19 22:48:55 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/11/19 22:48:55 INFO input.FileInputFormat: Total input files to process : 2
18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/19 22:48:55 INFO mapreduce.JobSubmitter: number of splits:2
18/11/19 22:48:55 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/11/19 22:48:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1542637441480_0002
18/11/19 22:48:56 INFO impl.YarnClientImpl: Submitted application application_1542637441480_0002
18/11/19 22:48:56 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1542637441480_0002/
18/11/19 22:48:56 INFO mapreduce.Job: Running job: job_1542637441480_0002
18/11/19 22:49:03 INFO mapreduce.Job: Job job_1542637441480_0002 running in uber mode : false
18/11/19 22:49:03 INFO mapreduce.Job:  map 0% reduce 0%
18/11/19 22:49:09 INFO mapreduce.Job:  map 100% reduce 0%
18/11/19 22:49:14 INFO mapreduce.Job:  map 100% reduce 100%
18/11/19 22:49:15 INFO mapreduce.Job: Job job_1542637441480_0002 completed successfully
18/11/19 22:49:15 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=241219
                FILE: Number of bytes written=1074952
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=122364
                HDFS: Number of bytes written=35348
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=7588
                Total time spent by all reduces in occupied slots (ms)=3742
                Total time spent by all map tasks (ms)=7588
                Total time spent by all reduce tasks (ms)=3742
                Total vcore-milliseconds taken by all map tasks=7588
                Total vcore-milliseconds taken by all reduce tasks=3742
                Total megabyte-milliseconds taken by all map tasks=7770112
                Total megabyte-milliseconds taken by all reduce tasks=3831808
        Map-Reduce Framework
                Map input records=2430
                Map output records=19848
                Map output bytes=201516
                Map output materialized bytes=241225
                Input split bytes=239
                Combine input records=0
                Combine output records=0
                Reduce input groups=2794
                Reduce shuffle bytes=241225
                Reduce input records=19848
                Reduce output records=2794
                Spilled Records=39696
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=332
                CPU time spent (ms)=2830
                Physical memory (bytes) snapshot=557314048
                Virtual memory (bytes) snapshot=2538102784
                Total committed heap usage (bytes)=259411968
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=122125
        File Output Format Counters
                Bytes Written=35348
Result:
# View the result files
hadoop fs -ls /wordcount/output
hadoop fs -cat /wordcount/output/part-r-00000|more
Troubleshooting:
18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
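The message above appeared, as far as I can tell, because the HDFS directories were not created the way the official documentation describes. It is logged at WARN level and is not serious: the job still completed successfully and normal use was not affected on my side.
Fix: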
# Create the directory
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put NOTICE.txt LICENSE.txt /user/hadoop
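Summary: the two ways of running the program shown below make no practical difference, as long as the Hadoop jars are on the classpath. Under the hood, hadoop jar launches the program through a java command with that classpath set up for you.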
hadoop jar wordcount_aaron.jar com.empire.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/outputs
java -cp .:/home/hadoop/wordcount_aaron.jar:/home/hadoop/apps/hadoop-2.9.1....jar com.empire.hadoop.mr.wcdemo.WordcountDriver /user/hadoop/ /wordcount/outputs
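A final word: that is everything for this post. If you found it useful, please give it a like. If you are interested in my other posts on servers and big data, or in me, please follow my blog, and feel free to get in touch at any time.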