I need Hadoop for work, so I'm learning it and writing this article to record the process, mainly to share how I quickly set up a Hadoop environment and ran a demo of my own.
I've seen plenty of Hadoop setup guides online, but most of them are fairly complicated: install Java, install Hadoop, then make all kinds of settings, with lots of parameters and variables whose meaning isn't explained. My goal is simple: first get a working environment in the simplest way possible. All those variables and parameters don't matter much to me at the start; I just want to run one simple demo of my own. And in practice I won't be the one maintaining the environment anyway, so for me, simple is enough.
I also happen to have been studying Docker recently, so I decided to use Docker to build this environment and learn Hadoop and Docker at the same time.
First, install Docker. That part is easy, so I won't cover it here; the official site has a one-line install script.
There is a ready-made Hadoop example image from SequenceIQ on Docker Hub:
https://hub.docker.com/r/sequenceiq/hadoop-docker/
I modified the command slightly:
I mounted an extra directory, because I want to upload my own demo jar into Docker and run it with Hadoop.
I also named the container hadoop2. I run a lot of containers, so names make them easier to tell apart, and later I may use multiple Hadoop containers to build a cluster.
docker run -it -v /dockerVolumes/hadoop2:/dockerVolume --name hadoop2 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
Once this command has run, the container is up. We can try the official example:
cd $HADOOP_PREFIX
# run the mapreduce
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
# check the output
bin/hdfs dfs -cat output/*
Output (note that the grep example actually runs two jobs back to back, a search job followed by a sort job, which is why two job IDs appear in the log):
bash-4.1# clear
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
18/06/11 07:35:38 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:35:39 INFO input.FileInputFormat: Total input paths to process : 31
18/06/11 07:35:39 INFO mapreduce.JobSubmitter: number of splits:31
18/06/11 07:35:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0007
18/06/11 07:35:40 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0007
18/06/11 07:35:40 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0007/
18/06/11 07:35:40 INFO mapreduce.Job: Running job: job_1528635021541_0007
18/06/11 07:35:45 INFO mapreduce.Job: Job job_1528635021541_0007 running in uber mode : false
18/06/11 07:35:45 INFO mapreduce.Job:  map 0% reduce 0%
18/06/11 07:36:02 INFO mapreduce.Job:  map 10% reduce 0%
18/06/11 07:36:03 INFO mapreduce.Job:  map 19% reduce 0%
18/06/11 07:36:19 INFO mapreduce.Job:  map 35% reduce 0%
18/06/11 07:36:20 INFO mapreduce.Job:  map 39% reduce 0%
18/06/11 07:36:33 INFO mapreduce.Job:  map 42% reduce 0%
18/06/11 07:36:35 INFO mapreduce.Job:  map 55% reduce 0%
18/06/11 07:36:36 INFO mapreduce.Job:  map 55% reduce 15%
18/06/11 07:36:39 INFO mapreduce.Job:  map 55% reduce 18%
18/06/11 07:36:45 INFO mapreduce.Job:  map 58% reduce 18%
18/06/11 07:36:46 INFO mapreduce.Job:  map 61% reduce 18%
18/06/11 07:36:47 INFO mapreduce.Job:  map 65% reduce 18%
18/06/11 07:36:48 INFO mapreduce.Job:  map 65% reduce 22%
18/06/11 07:36:49 INFO mapreduce.Job:  map 71% reduce 22%
18/06/11 07:36:51 INFO mapreduce.Job:  map 71% reduce 24%
18/06/11 07:36:57 INFO mapreduce.Job:  map 74% reduce 24%
18/06/11 07:36:59 INFO mapreduce.Job:  map 77% reduce 24%
18/06/11 07:37:00 INFO mapreduce.Job:  map 77% reduce 26%
18/06/11 07:37:01 INFO mapreduce.Job:  map 84% reduce 26%
18/06/11 07:37:03 INFO mapreduce.Job:  map 87% reduce 28%
18/06/11 07:37:06 INFO mapreduce.Job:  map 87% reduce 29%
18/06/11 07:37:08 INFO mapreduce.Job:  map 90% reduce 29%
18/06/11 07:37:09 INFO mapreduce.Job:  map 94% reduce 29%
18/06/11 07:37:11 INFO mapreduce.Job:  map 100% reduce 29%
18/06/11 07:37:12 INFO mapreduce.Job:  map 100% reduce 100%
18/06/11 07:37:12 INFO mapreduce.Job: Job job_1528635021541_0007 completed successfully
18/06/11 07:37:12 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=345
                FILE: Number of bytes written=3697476
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=80529
                HDFS: Number of bytes written=437
                HDFS: Number of read operations=96
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=31
                Launched reduce tasks=1
                Data-local map tasks=31
                Total time spent by all maps in occupied slots (ms)=400881
                Total time spent by all reduces in occupied slots (ms)=52340
                Total time spent by all map tasks (ms)=400881
                Total time spent by all reduce tasks (ms)=52340
                Total vcore-seconds taken by all map tasks=400881
                Total vcore-seconds taken by all reduce tasks=52340
                Total megabyte-seconds taken by all map tasks=410502144
                Total megabyte-seconds taken by all reduce tasks=53596160
        Map-Reduce Framework
                Map input records=2060
                Map output records=24
                Map output bytes=590
                Map output materialized bytes=525
                Input split bytes=3812
                Combine input records=24
                Combine output records=13
                Reduce input groups=11
                Reduce shuffle bytes=525
                Reduce input records=13
                Reduce output records=11
                Spilled Records=26
                Shuffled Maps =31
                Failed Shuffles=0
                Merged Map outputs=31
                GC time elapsed (ms)=2299
                CPU time spent (ms)=11090
                Physical memory (bytes) snapshot=8178929664
                Virtual memory (bytes) snapshot=21830377472
                Total committed heap usage (bytes)=6461849600
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=76717
        File Output Format Counters
                Bytes Written=437
18/06/11 07:37:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:37:12 INFO input.FileInputFormat: Total input paths to process : 1
18/06/11 07:37:12 INFO mapreduce.JobSubmitter: number of splits:1
18/06/11 07:37:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0008
18/06/11 07:37:12 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0008
18/06/11 07:37:12 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0008/
18/06/11 07:37:12 INFO mapreduce.Job: Running job: job_1528635021541_0008
18/06/11 07:37:24 INFO mapreduce.Job: Job job_1528635021541_0008 running in uber mode : false
18/06/11 07:37:24 INFO mapreduce.Job:  map 0% reduce 0%
18/06/11 07:37:29 INFO mapreduce.Job:  map 100% reduce 0%
18/06/11 07:37:35 INFO mapreduce.Job:  map 100% reduce 100%
18/06/11 07:37:35 INFO mapreduce.Job: Job job_1528635021541_0008 completed successfully
18/06/11 07:37:35 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=291
                FILE: Number of bytes written=230541
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=569
                HDFS: Number of bytes written=197
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3210
                Total time spent by all reduces in occupied slots (ms)=3248
                Total time spent by all map tasks (ms)=3210
                Total time spent by all reduce tasks (ms)=3248
                Total vcore-seconds taken by all map tasks=3210
                Total vcore-seconds taken by all reduce tasks=3248
                Total megabyte-seconds taken by all map tasks=3287040
                Total megabyte-seconds taken by all reduce tasks=3325952
        Map-Reduce Framework
                Map input records=11
                Map output records=11
                Map output bytes=263
                Map output materialized bytes=291
                Input split bytes=132
                Combine input records=0
                Combine output records=0
                Reduce input groups=5
                Reduce shuffle bytes=291
                Reduce input records=11
                Reduce output records=11
                Spilled Records=22
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=55
                CPU time spent (ms)=1090
                Physical memory (bytes) snapshot=415494144
                Virtual memory (bytes) snapshot=1373601792
                Total committed heap usage (bytes)=354942976
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=437
        File Output Format Counters
                Bytes Written=197
As you can see, with Docker, installing Hadoop takes a single command, and the official example runs successfully. Extremely simple.
I tried writing my own demo: it reads the text in a txt file and counts the number of characters.
1. First, put a txt file into HDFS:
For HDFS commands, see https://blog.csdn.net/zhaojw_420/article/details/53161624
hdfs dfs -put in.txt /myinput/in.txt
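As an aside, the same upload can also be done from Java through the HDFS FileSystem API. Below is a minimal sketch, assuming the Hadoop config files on the classpath (core-site.xml etc.) point at the cluster; the class name HdfsPut is mine for illustration, not something in the repo.

package demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS etc. from the Hadoop config on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of: hdfs dfs -put in.txt /myinput/in.txt
        fs.copyFromLocalFile(new Path("in.txt"), new Path("/myinput/in.txt"));
        fs.close();
    }
}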
2. Write your own mapper and reducer.
The code is at https://gitee.com/abcwt112/hadoopDemo
See MyFirstMapper, MyFirstReducer and MyFirstStarter in that repo. The mapper emits each line's character count under a single constant key, and the reducer sums those counts into one total.
package demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyFirstReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum up the per-line character counts emitted by the mapper.
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(new IntWritable(1), new IntWritable(total));
    }
}
package demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyFirstMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Emit each line's character count under a constant key
        // so everything lands in a single reduce group.
        String line = value.toString();
        context.write(new IntWritable(0), new IntWritable(line.length()));
    }
}
package demo;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MyFirstStarter {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = new Job();
        job.setJarByClass(MyFirstStarter.class);
        job.setJobName("============ My First Job ==============");

        FileInputFormat.addInputPath(job, new Path("/myinput/in.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/myout"));

        job.setMapperClass(MyFirstMapper.class);
        job.setReducerClass(MyFirstReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run mvn package to build the jar, then drop it into /dockerVolumes/hadoop2 on the Linux host. Because that directory is mounted into Docker, the jar automatically shows up inside the hadoop2 container.
One more note: the MANIFEST.MF in the jar that mvn package produced didn't specify a main class, so the entry point couldn't be found. With a colleague's help I learned this can be fixed through the Maven configuration:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>${mainClass}</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

<properties>
    <mainClass>demo.MyFirstStarter</mainClass>
</properties>
Also, the JDK inside the Dockerized Hadoop image is 1.7 while my environment is 1.8, which is why the pom above additionally pins the compiler source/target to 1.7.
3. Run my own demo in the hadoop2 container.
In the $HADOOP_PREFIX directory, run bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar
bash-4.1# bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar
18/06/11 07:54:11 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:54:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/06/11 07:54:13 INFO input.FileInputFormat: Total input paths to process : 1
18/06/11 07:54:13 INFO mapreduce.JobSubmitter: number of splits:1
18/06/11 07:54:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0009
18/06/11 07:54:13 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0009
18/06/11 07:54:13 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0009/
18/06/11 07:54:13 INFO mapreduce.Job: Running job: job_1528635021541_0009
18/06/11 07:54:20 INFO mapreduce.Job: Job job_1528635021541_0009 running in uber mode : false
18/06/11 07:54:20 INFO mapreduce.Job:  map 0% reduce 0%
18/06/11 07:54:25 INFO mapreduce.Job:  map 100% reduce 0%
18/06/11 07:54:31 INFO mapreduce.Job:  map 100% reduce 100%
18/06/11 07:54:31 INFO mapreduce.Job: Job job_1528635021541_0009 completed successfully
18/06/11 07:54:31 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=1606
                FILE: Number of bytes written=232725
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=6940
                HDFS: Number of bytes written=7
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3059
                Total time spent by all reduces in occupied slots (ms)=3265
                Total time spent by all map tasks (ms)=3059
                Total time spent by all reduce tasks (ms)=3265
                Total vcore-seconds taken by all map tasks=3059
                Total vcore-seconds taken by all reduce tasks=3265
                Total megabyte-seconds taken by all map tasks=3132416
                Total megabyte-seconds taken by all reduce tasks=3343360
        Map-Reduce Framework
                Map input records=160
                Map output records=160
                Map output bytes=1280
                Map output materialized bytes=1606
                Input split bytes=104
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=1606
                Reduce input records=160
                Reduce output records=1
                Spilled Records=320
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=43
                CPU time spent (ms)=1140
                Physical memory (bytes) snapshot=434499584
                Virtual memory (bytes) snapshot=1367728128
                Total committed heap usage (bytes)=354942976
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=6836
        File Output Format Counters
                Bytes Written=7
It ran successfully!
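One thing worth pointing out in the log above: the WARN line says "Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this." The job still runs fine, but for the curious, here is a minimal sketch of how the starter could be reworked to use ToolRunner. MyFirstToolStarter is a hypothetical name of my own; the job wiring itself is unchanged.

package demo;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyFirstToolStarter extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() carries whatever generic options ToolRunner parsed.
        Job job = Job.getInstance(getConf(), "============ My First Job ==============");
        job.setJarByClass(MyFirstToolStarter.class);
        FileInputFormat.addInputPath(job, new Path("/myinput/in.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/myout"));
        job.setMapperClass(MyFirstMapper.class);
        job.setReducerClass(MyFirstReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic Hadoop options (-D key=value, etc.) before run().
        System.exit(ToolRunner.run(new MyFirstToolStarter(), args));
    }
}

Launched the same way with bin/hadoop jar, generic options such as -D key=value would then be parsed before run() is called, and the WARN goes away.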
Check the output:
bash-4.1# bin/hdfs dfs -ls /myout
Found 2 items
-rw-r--r--   1 root supergroup          0 2018-06-11 07:54 /myout/_SUCCESS
-rw-r--r--   1 root supergroup          7 2018-06-11 07:54 /myout/part-r-00000
bash-4.1# bin/hdfs dfs -cat /myout/part-r-00000
1	6676
bash-4.1#
6,676 characters in total. That checks out against the job counters: the input file is 6,836 bytes (Bytes Read=6836) with 160 lines (Map input records=160), and 6836 - 160 newline characters = 6676.
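Here's a quick local sanity check of that arithmetic, assuming in.txt is plain ASCII so one byte equals one character; CountCheck is a hypothetical helper of mine, not part of the repo.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CountCheck {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("in.txt"));        // total bytes, e.g. 6836
        List<String> lines = Files.readAllLines(Paths.get("in.txt"));  // line terminators stripped, e.g. 160 lines
        int chars = lines.stream().mapToInt(String::length).sum();     // characters excluding newlines
        System.out.println(bytes.length + " - " + lines.size() + " = " + chars); // e.g. 6836 - 160 = 6676
    }
}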
Successfully ran my own demo!