Simple Data Statistics with MapReduce

Date: 2020-06-03

Tags: mapreduce, simple data statistics. Category: Hadoop


1. Prepare the data source

Excerpt a piece of prose and save it as a UTF-8 encoded text file.
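If the editor's encoding setting is in doubt, the file can also be written programmatically. The following is a minimal sketch, assuming the file is called bob.txt (the name the driver reads later) and using placeholder text; Utf8Source, save and load are hypothetical names that do not appear in the original project.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads and writes text as UTF-8 regardless of the platform default charset.
class Utf8Source {
    // Writes the text to the given file as UTF-8 bytes and returns the file's path.
    static Path save(String fileName, String text) {
        try {
            return Files.write(Paths.get(fileName), text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file back, decoding it as UTF-8.
    static String load(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder content; replace it with the actual prose excerpt.
        Path file = save("bob.txt", "placeholder prose 示例文本\n");
        System.out.print(load(file));
    }
}
```

Naming the charset explicitly avoids depending on the Windows default encoding, which is a common source of the garbled-text problem described in section 5.4.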

2. Prepare the environment

2.1 Set up a pseudo-distributed environment

http://www.javashuo.com/article/p-aanlxmsq-bp.html

Upload the data source file to the in directory created in HDFS.

2.2 Download the required resources

Download hadoop277

Link: https://pan.baidu.com/s/1xeZx4AVxcjU33hoMLvOojA
Extraction code: mxic

Download the Hadoop executable winutils.exe

Link: https://pan.baidu.com/s/1mPsKk3_TgynAKfJN-kkjSw
Extraction code: 3bfe

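2.3 Configure the environment

2.3.1 Add Hadoop's bin and sbin directories to the environment variables
2.3.2 Configure access permissions for the Administrator user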

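# Either of the two approaches works
# 2.3.2.1 Disable permission checking
# (the original places this in core-site.xml; HDFS settings conventionally go in hdfs-site.xml,
#  and in Hadoop 2.x dfs.permissions is the deprecated name of dfs.permissions.enabled)
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

# 2.3.2.2 Grant permissions
hadoop fs -chmod 777 <file path>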

2.4 Put the resources in place

1. Copy all files from hadoopBin.rar into Hadoop's bin folder.
2. Add the jar files from the common, hdfs, mapreduce and yarn folders under hadoop-2.7.7/share/hadoop to the project.

3. Prepare the code

3.1 Write the Map class (extends Mapper)

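import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// Read one line of the text
		String line = value.toString();
		// Turn the line into a character array
		char[] charArray = line.toCharArray();
		// Iterate over every character
		for (char a : charArray) {
			// Emit the pair (character, 1) to the intermediate output
			// Note: MapReduce has its own data types, so String and int are wrapped in Text and IntWritable
			context.write(new Text(String.valueOf(a)), new IntWritable(1));
		}
	}
}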
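To see what the map step emits without starting a cluster, its logic can be reproduced in plain Java. This is only a sketch of the per-line behaviour: MapStepSketch and mapLine are hypothetical names, and ordinary String and Integer values stand in for Hadoop's Text and IntWritable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the map step; it does not use any Hadoop classes.
class MapStepSketch {
    // Mirrors map(): produces one (character, 1) pair for every char in the line.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (char c : line.toCharArray()) {
            pairs.add(new SimpleEntry<>(String.valueOf(c), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints three tab-separated pairs, one per character of "aba".
        for (Map.Entry<String, Integer> pair : mapLine("aba")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Because every char is emitted, spaces and punctuation are counted as well; filtering them out would be a separate design decision in the mapper.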

3.2 Write the Reduce class (extends Reducer)

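import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		// Running total for this key
		int num = 0;
		// Iterate over the integer values collected for this key
		for (IntWritable v : values) {
			// get() returns the wrapped int, which is added to the total
			num += v.get();
		}
		// Write the pair (character, total) to the output directory
		context.write(key, new IntWritable(num));
	}
}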
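The reduce step can likewise be tried out in plain Java. In the sketch below a map of lists plays the role of the grouped input that the framework hands to reduce(); ReduceStepSketch, reduce and reduceAll are hypothetical names, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the reduce step; it does not use any Hadoop classes.
class ReduceStepSketch {
    // Same summation as reduce(): adds up all values that share one key.
    static int reduce(String key, Iterable<Integer> values) {
        int num = 0;
        for (int v : values) {
            num += v;
        }
        return num;
    }

    // Calls reduce() once per key; a TreeMap keeps the keys sorted, as the job output is.
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            totals.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Grouped pairs as they would arrive for the line "aba".
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("a", Arrays.asList(1, 1));
        grouped.put("b", Arrays.asList(1));
        System.out.println(reduceAll(grouped)); // prints {a=2, b=1}
    }
}
```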

3.3 Write the Driver class

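import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
	public static void main(String[] args) {
		// Point Hadoop at the local installation whose bin folder holds winutils.exe
		System.setProperty("hadoop.home.dir", "F:\\Linux\\hadoop-2.7.7");
		// Configure the HDFS address
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://192.168.3.8:9000");
		try {
			// Get the job object
			Job job = Job.getInstance(conf);
			// Set the driver class
			job.setJarByClass(WordCountDriver.class);
			// Set the Map class
			job.setMapperClass(WordCountMapper.class);
			// Key and value types emitted by the Map class; these two calls can be
			// omitted when they are the same as the final output types set below
			job.setMapOutputKeyClass(Text.class);
			job.setMapOutputValueClass(IntWritable.class);
			// Set the Reduce class
			job.setReducerClass(WordCountReduce.class);
			// Key and value types of the final (Reduce) output
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(IntWritable.class);
			// Path of the file to be counted
			FileInputFormat.setInputPaths(job, new Path("/in/bob.txt"));
			// Directory in which the result files are stored
			// Note: this directory must not exist yet; the job creates it
			FileOutputFormat.setOutputPath(job, new Path("/out/"));
			// true prints progress information while the job runs; false only waits for it to finish
			job.waitForCompletion(true);
		} catch (IOException | ClassNotFoundException | InterruptedException e) {
			e.printStackTrace();
		}
	}
}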

4. Statistics result
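The result itself is not reproduced here. With the default text output format, Hadoop writes one line per key in the form key, tab, value, so a small sketch for reading such lines and listing the most frequent characters first might look like the following. The sample lines are made up for illustration only, and ResultReader, Count and parse are hypothetical names that are not part of the original project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for inspecting job output; it does not use any Hadoop classes.
class ResultReader {
    // One output record: a character and how often it occurred.
    static final class Count {
        final String key;
        final int total;

        Count(String key, int total) {
            this.key = key;
            this.total = total;
        }
    }

    // Parses "key<TAB>count" lines. The last tab is taken as the separator,
    // so a tab character that was itself counted as a key is still handled.
    static List<Count> parse(List<String> lines) {
        List<Count> counts = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines that are not key/value records
            }
            int total = Integer.parseInt(line.substring(tab + 1).trim());
            counts.add(new Count(line.substring(0, tab), total));
        }
        // Most frequent first
        counts.sort(Comparator.comparingInt((Count c) -> c.total).reversed());
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only
        List<String> sample = Arrays.asList("a\t3", "b\t7", "c\t1");
        for (Count c : parse(sample)) {
            System.out.println(c.key + " -> " + c.total);
        }
    }
}
```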

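5. Related problems

5.1 Problem 1

Input path does not exist: file:/in/bob.txt

Fix: check the access address and the related configuration. The file: prefix in the message shows that the path was resolved against the local file system rather than HDFS, which usually means the fs.defaultFS setting did not take effect.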

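5.2 Problem 2

Fix: the environment variables are not configured correctly or have not taken effect yet (either of the following is enough)

Configure the Hadoop environment variables and restart Eclipse

Add the line System.setProperty("hadoop.home.dir", "F:\\Linux\\hadoop-2.7.7") to the code, as in the Driver class in section 3.3

5.3 Problem 3

Fix: see section 2.3.2 above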

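5.4 Problem 4

Chinese characters appear garbled

Fix:
1. Make sure the Eclipse text encoding is UTF-8
2. Save the data source file as UTF-8
3. Use a conversion stream that turns the byte stream into a character stream: new OutputStreamWriter(out, "UTF-8")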
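The third point can be sketched as follows. The example wraps an in-memory stream so that the effect is easy to check; in real code out would be whatever OutputStream the program writes to. Utf8WriterSketch and encode are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of a conversion stream with an explicit charset.
class Utf8WriterSketch {
    // Writes the text through an OutputStreamWriter and returns the bytes it produced.
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Passing the charset explicitly avoids the platform default encoding.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("中文");
        // Each of the two characters takes three bytes in UTF-8.
        System.out.println(bytes.length); // prints 6
    }
}
```

Using StandardCharsets.UTF_8 instead of the string "UTF-8" selects the same encoding and avoids the checked UnsupportedEncodingException of the string-based constructor.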

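6. Extension

6.1 Build a jar

Change the path in FileInputFormat.setInputPaths(job, new Path("/in/bob.txt")) to "/in/" so that all files in the in directory are counted

Package the project as a jar and upload it to the /opt/test directory on the Linux system

Run the jar with the command hadoop jar <jar file name> to get the statistics result (add the driver class name after the jar name if the jar's manifest does not declare a main class)

From then on, data source files can simply be placed in the in folder and the jar run directly (the out folder in HDFS must be deleted before each run)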