Hadoop Learning Notes 1: A First Look at Hadoop

 

Motivation

I need Hadoop for work, so I'm learning it, and this post records that process. It mainly shares how I quickly set up a Hadoop environment and ran a demo.

 

Setting Up the Environment

There are plenty of guides online for setting up a Hadoop environment, but the ones I've seen are all fairly involved: install Java, install Hadoop, then configure all sorts of things, with lots of parameters and variables whose meaning I didn't understand. My goal is simple: first, stand up an environment in the simplest way possible. The various variables and parameters don't matter much to me at this stage; I just need to be able to run one simple demo of my own. Besides, in practice I won't be the one maintaining the environment anyway, so simple is all I need.

As it happens, I've been studying Docker lately, so I decided to use Docker to build this environment and learn Hadoop and Docker at the same time.

First, install Docker. It's very simple, so I won't cover it here; the official site offers a one-line install script.
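For reference, the official convenience script amounts to something like this (check the Docker docs for the current form):

curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh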

There is an official Hadoop example image on Docker Hub:

https://hub.docker.com/r/sequenceiq/hadoop-docker/

I modified the command slightly:

I mounted an extra directory, because I'll be uploading my own demo jar into the container to run with Hadoop.

I also named the container hadoop2: I run a lot of containers, so names make them easier to tell apart, and later I may use several Hadoop containers to build a cluster.

docker run -it -v /dockerVolumes/hadoop2:/dockerVolume --name hadoop2  sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
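Because the container has a fixed name, opening additional shells into it later is easy with a standard Docker command:

# attach another shell to the running hadoop2 container
docker exec -it hadoop2 bash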

With that command, the container is up and running. We can try the official example:

 

cd $HADOOP_PREFIX
# run the mapreduce
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'

# check the output
bin/hdfs dfs -cat output/*

Output:

bash-4.1# clear
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
18/06/11 07:35:38 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:35:39 INFO input.FileInputFormat: Total input paths to process : 31
18/06/11 07:35:39 INFO mapreduce.JobSubmitter: number of splits:31
18/06/11 07:35:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0007
18/06/11 07:35:40 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0007
18/06/11 07:35:40 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0007/
18/06/11 07:35:40 INFO mapreduce.Job: Running job: job_1528635021541_0007
18/06/11 07:35:45 INFO mapreduce.Job: Job job_1528635021541_0007 running in uber mode : false
18/06/11 07:35:45 INFO mapreduce.Job:  map 0% reduce 0%
18/06/11 07:36:02 INFO mapreduce.Job:  map 10% reduce 0%
18/06/11 07:36:03 INFO mapreduce.Job:  map 19% reduce 0%
18/06/11 07:36:19 INFO mapreduce.Job:  map 35% reduce 0%
18/06/11 07:36:20 INFO mapreduce.Job:  map 39% reduce 0%
18/06/11 07:36:33 INFO mapreduce.Job:  map 42% reduce 0%
18/06/11 07:36:35 INFO mapreduce.Job:  map 55% reduce 0%
18/06/11 07:36:36 INFO mapreduce.Job:  map 55% reduce 15%
18/06/11 07:36:39 INFO mapreduce.Job:  map 55% reduce 18%
18/06/11 07:36:45 INFO mapreduce.Job:  map 58% reduce 18%
18/06/11 07:36:46 INFO mapreduce.Job:  map 61% reduce 18%
18/06/11 07:36:47 INFO mapreduce.Job:  map 65% reduce 18%
18/06/11 07:36:48 INFO mapreduce.Job:  map 65% reduce 22%
18/06/11 07:36:49 INFO mapreduce.Job:  map 71% reduce 22%
18/06/11 07:36:51 INFO mapreduce.Job:  map 71% reduce 24%
18/06/11 07:36:57 INFO mapreduce.Job:  map 74% reduce 24%
18/06/11 07:36:59 INFO mapreduce.Job:  map 77% reduce 24%
18/06/11 07:37:00 INFO mapreduce.Job:  map 77% reduce 26%
18/06/11 07:37:01 INFO mapreduce.Job:  map 84% reduce 26%
18/06/11 07:37:03 INFO mapreduce.Job:  map 87% reduce 28%
18/06/11 07:37:06 INFO mapreduce.Job:  map 87% reduce 29%
18/06/11 07:37:08 INFO mapreduce.Job:  map 90% reduce 29%
18/06/11 07:37:09 INFO mapreduce.Job:  map 94% reduce 29%
18/06/11 07:37:11 INFO mapreduce.Job:  map 100% reduce 29%
18/06/11 07:37:12 INFO mapreduce.Job:  map 100% reduce 100%
18/06/11 07:37:12 INFO mapreduce.Job: Job job_1528635021541_0007 completed successfully
18/06/11 07:37:12 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=345
		FILE: Number of bytes written=3697476
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=80529
		HDFS: Number of bytes written=437
		HDFS: Number of read operations=96
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=31
		Launched reduce tasks=1
		Data-local map tasks=31
		Total time spent by all maps in occupied slots (ms)=400881
		Total time spent by all reduces in occupied slots (ms)=52340
		Total time spent by all map tasks (ms)=400881
		Total time spent by all reduce tasks (ms)=52340
		Total vcore-seconds taken by all map tasks=400881
		Total vcore-seconds taken by all reduce tasks=52340
		Total megabyte-seconds taken by all map tasks=410502144
		Total megabyte-seconds taken by all reduce tasks=53596160
	Map-Reduce Framework
		Map input records=2060
		Map output records=24
		Map output bytes=590
		Map output materialized bytes=525
		Input split bytes=3812
		Combine input records=24
		Combine output records=13
		Reduce input groups=11
		Reduce shuffle bytes=525
		Reduce input records=13
		Reduce output records=11
		Spilled Records=26
		Shuffled Maps =31
		Failed Shuffles=0
		Merged Map outputs=31
		GC time elapsed (ms)=2299
		CPU time spent (ms)=11090
		Physical memory (bytes) snapshot=8178929664
		Virtual memory (bytes) snapshot=21830377472
		Total committed heap usage (bytes)=6461849600
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=76717
	File Output Format Counters
		Bytes Written=437
18/06/11 07:37:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:37:12 INFO input.FileInputFormat: Total input paths to process : 1
18/06/11 07:37:12 INFO mapreduce.JobSubmitter: number of splits:1
18/06/11 07:37:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0008
18/06/11 07:37:12 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0008
18/06/11 07:37:12 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0008/
18/06/11 07:37:12 INFO mapreduce.Job: Running job: job_1528635021541_0008
18/06/11 07:37:24 INFO mapreduce.Job: Job job_1528635021541_0008 running in uber mode : false
18/06/11 07:37:24 INFO mapreduce.Job:  map 0% reduce 0%
18/06/11 07:37:29 INFO mapreduce.Job:  map 100% reduce 0%
18/06/11 07:37:35 INFO mapreduce.Job:  map 100% reduce 100%
18/06/11 07:37:35 INFO mapreduce.Job: Job job_1528635021541_0008 completed successfully
18/06/11 07:37:35 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=291
		FILE: Number of bytes written=230541
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=569
		HDFS: Number of bytes written=197
		HDFS: Number of read operations=7
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3210
		Total time spent by all reduces in occupied slots (ms)=3248
		Total time spent by all map tasks (ms)=3210
		Total time spent by all reduce tasks (ms)=3248
		Total vcore-seconds taken by all map tasks=3210
		Total vcore-seconds taken by all reduce tasks=3248
		Total megabyte-seconds taken by all map tasks=3287040
		Total megabyte-seconds taken by all reduce tasks=3325952
	Map-Reduce Framework
		Map input records=11
		Map output records=11
		Map output bytes=263
		Map output materialized bytes=291
		Input split bytes=132
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=291
		Reduce input records=11
		Reduce output records=11
		Spilled Records=22
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=55
		CPU time spent (ms)=1090
		Physical memory (bytes) snapshot=415494144
		Virtual memory (bytes) snapshot=1373601792
		Total committed heap usage (bytes)=354942976
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=437
	File Output Format Counters
		Bytes Written=197

 

As you can see, thanks to Docker, installing Hadoop took a single command, and the official example ran successfully. Extremely simple.
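One caveat if you want to re-run the example: Hadoop refuses to start a job whose output directory already exists, so delete it first with a standard HDFS command:

bin/hdfs dfs -rm -r output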

 

Running My Own Demo

I wrote a small demo of my own: it reads the text in a txt file and counts its characters.

1. First, put a txt file into HDFS:

For HDFS commands, see https://blog.csdn.net/zhaojw_420/article/details/53161624

hdfs dfs -put in.txt /myinput/in.txt
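For completeness, the full sequence I'd expect here (assuming in.txt already exists locally) is roughly:

bin/hdfs dfs -mkdir -p /myinput
bin/hdfs dfs -put in.txt /myinput/in.txt
bin/hdfs dfs -ls /myinput   # verify the upload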

 

2. Write the mapper and reducer.

The code is at https://gitee.com/abcwt112/hadoopDemo

See MyFirstMapper, MyFirstReducer, and MyFirstStarter in that repo.

package demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

// Sums the per-line character counts emitted by MyFirstMapper.
public class MyFirstReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        // Every mapper emitted key 0, so this runs once and writes the grand total.
        context.write(new IntWritable(1), new IntWritable(total));
    }
}
package demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// Emits (0, line length) for every line of the input file.
public class MyFirstMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // The constant key funnels every per-line count into a single reduce group.
        context.write(new IntWritable(0), new IntWritable(line.length()));
    }
}
package demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MyFirstStarter {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Job.getInstance() replaces the deprecated new Job() constructor.
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MyFirstStarter.class);
        job.setJobName("============ My First Job ==============");

        // Input file and output directory in HDFS; the output directory must not exist yet.
        FileInputFormat.addInputPath(job, new Path("/myinput/in.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/myout"));

        job.setMapperClass(MyFirstMapper.class);
        job.setReducerClass(MyFirstReducer.class);

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
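To make the data flow concrete: every map call emits (0, line length), and the single reduce group sums those values. Here is the same computation on a toy input as a plain shell sketch (not part of the job itself):

printf 'ab\ncde\nf\n' > toy.txt
# mapper view: (0,2), (0,3), (0,1); the reducer sums them to 6
awk '{ total += length($0) } END { print total }' toy.txt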

After mvn package, just drop the resulting jar into /dockerVolumes/hadoop2 on the Linux host. Because that directory is mounted into the container, the jar automatically appears inside hadoop2.
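Concretely, the build-and-copy step looks something like this (the exact jar name depends on your Maven setup; the one below matches what is run later in this post):

mvn package
cp target/hadoopDemo-1.0-SNAPSHOT.jar /dockerVolumes/hadoop2/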

One more note: the jar that mvn package produced at first had no main class specified in its manifest (the MF file), so Hadoop couldn't find an entry point anywhere. With a colleague's help, I learned this can be solved through Maven configuration:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>${mainClass}</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>


    <properties>
        <mainClass>demo.MyFirstStarter</mainClass>
    </properties>

Also, the JDK inside the Dockerized Hadoop image is 1.7, while my environment is 1.8, so in the pom I additionally targeted 1.7 for compilation.

 

3. Run my demo inside the hadoop2 container.

From the $HADOOP_PREFIX directory, run bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar

bash-4.1# bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar
18/06/11 07:54:11 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:54:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/06/11 07:54:13 INFO input.FileInputFormat: Total input paths to process : 1
18/06/11 07:54:13 INFO mapreduce.JobSubmitter: number of splits:1
18/06/11 07:54:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0009
18/06/11 07:54:13 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0009
18/06/11 07:54:13 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0009/
18/06/11 07:54:13 INFO mapreduce.Job: Running job: job_1528635021541_0009
18/06/11 07:54:20 INFO mapreduce.Job: Job job_1528635021541_0009 running in uber mode : false
18/06/11 07:54:20 INFO mapreduce.Job:  map 0% reduce 0%
18/06/11 07:54:25 INFO mapreduce.Job:  map 100% reduce 0%
18/06/11 07:54:31 INFO mapreduce.Job:  map 100% reduce 100%
18/06/11 07:54:31 INFO mapreduce.Job: Job job_1528635021541_0009 completed successfully
18/06/11 07:54:31 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=1606
		FILE: Number of bytes written=232725
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=6940
		HDFS: Number of bytes written=7
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3059
		Total time spent by all reduces in occupied slots (ms)=3265
		Total time spent by all map tasks (ms)=3059
		Total time spent by all reduce tasks (ms)=3265
		Total vcore-seconds taken by all map tasks=3059
		Total vcore-seconds taken by all reduce tasks=3265
		Total megabyte-seconds taken by all map tasks=3132416
		Total megabyte-seconds taken by all reduce tasks=3343360
	Map-Reduce Framework
		Map input records=160
		Map output records=160
		Map output bytes=1280
		Map output materialized bytes=1606
		Input split bytes=104
		Combine input records=0
		Combine output records=0
		Reduce input groups=1
		Reduce shuffle bytes=1606
		Reduce input records=160
		Reduce output records=1
		Spilled Records=320
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=43
		CPU time spent (ms)=1140
		Physical memory (bytes) snapshot=434499584
		Virtual memory (bytes) snapshot=1367728128
		Total committed heap usage (bytes)=354942976
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=6836
	File Output Format Counters
		Bytes Written=7

It ran successfully!

Check the output:

bash-4.1# bin/hdfs dfs -ls  /myout
Found 2 items
-rw-r--r--   1 root supergroup          0 2018-06-11 07:54 /myout/_SUCCESS
-rw-r--r--   1 root supergroup          7 2018-06-11 07:54 /myout/part-r-00000
bash-4.1# bin/hdfs dfs -cat  /myout/part-r-00000
1	6676
bash-4.1#

6676 characters in total.

6836 bytes read minus 160 newline characters (one per input line) = 6676.
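A quick local sanity check of that arithmetic (assuming in.txt is plain ASCII, so bytes and characters coincide):

wc -lc in.txt   # expect 160 lines and 6836 bytes: 6836 - 160 = 6676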

My own demo runs successfully!
