In the previous post I covered a big-data implementation of SQL-style joins; in this post I'll continue by showing how to build an inverted index.
1. Requirements
In many projects we need to index our documents (forum posts, for example): record how many times each word appears in each document, so that the counts can be queried at search time. This is the most basic capability of a search engine. There are open-source tokenizers (such as the CJK analyzer) and search frameworks (such as Lucene). But when the number of files to index grows very large, building the index with Lucene alone becomes inefficient; at that point we can build the index ourselves, and Hadoop is a good fit for the job.
Figure 1: the documents to be indexed
Figure 2: what the resulting index file looks like
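To make that concrete: if a.txt contains lines such as "hello tom", step one below first produces per-file counts like "hello--a.txt	2", and step two then folds those counts into one line per word, for example "hello	a.txt-->2	b.txt-->4	c.txt-->8" (the real output of both jobs is shown in section 5).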
2. Implementation
Step 1: the first MapReduce job
package com.empire.hadoop.mr.inverindex;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverIndexStepOne {

    static class InverIndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        Text k = new Text();
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            // The input split tells us which source file this line came from
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String fileName = inputSplit.getPath().getName();
            for (String word : words) {
                // Emit "word--fileName" as the key, so counts are grouped per (word, file) pair
                k.set(word + "--" + fileName);
                context.write(k, v);
            }
        }
    }

    static class InverIndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum up the occurrences of each (word, file) pair
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverIndexStepOne.class);

        job.setMapperClass(InverIndexStepOneMapper.class);
        job.setReducerClass(InverIndexStepOneReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
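A small optional tweak, not in the original driver above: because InverIndexStepOneReducer only sums IntWritable counts, the same class can also be registered as a combiner, so partial sums are computed on the map side and less data crosses the shuffle. A one-line sketch, which would go in main() before submission:

    // Optional (assumed tweak): reuse the reducer as a combiner to pre-aggregate map output
    job.setCombinerClass(InverIndexStepOneReducer.class);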
Step 2: the second MapReduce job
package com.empire.hadoop.mr.inverindex;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexStepTwo {

    public static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each step-one output line looks like "word--fileName\tcount";
            // split on "--" to separate the word from "fileName\tcount"
            String line = value.toString();
            String[] files = line.split("--");
            context.write(new Text(files[0]), new Text(files[1]));
        }
    }

    public static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate all "fileName-->count" postings for this word
            StringBuilder sb = new StringBuilder();
            for (Text text : values) {
                sb.append(text.toString().replace("\t", "-->")).append("\t");
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args == null || args.length < 2) {
            // Default paths for local testing
            args = new String[] { "D:/temp/out/part-r-00000", "D:/temp/out2" };
        }
        Configuration config = new Configuration();
        Job job = Job.getInstance(config);
        job.setJarByClass(IndexStepTwo.class);
        job.setMapperClass(IndexStepTwoMapper.class);
        job.setReducerClass(IndexStepTwoReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
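To see the string handling in step two concretely, here is a minimal, self-contained sketch (a hypothetical demo class, plain Java, no Hadoop needed) of what a single step-one record goes through:

public class StepTwoTransformDemo {
    public static void main(String[] args) {
        // A step-one output record has the form "word--fileName<TAB>count"
        String record = "hello--a.txt\t2";
        String[] parts = record.split("--");            // ["hello", "a.txt\t2"]
        String word = parts[0];
        String posting = parts[1].replace("\t", "-->"); // "a.txt-->2"
        // The reducer appends one such posting per file after the word
        System.out.println(word + "\t" + posting);      // prints: hello	a.txt-->2
    }
}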
3. Running the programs
# Upload the jars and sample data files (Alt+p to open an SFTP session)
lcd d:/
put IndexStepOne.jar IndexStepTwo.jar
put a.txt b.txt c.txt

# Stage the input files for Hadoop on HDFS
cd /home/hadoop
hadoop fs -mkdir -p /index/indexinput
hdfs dfs -put a.txt b.txt c.txt /index/indexinput

# Run the two jobs
hadoop jar IndexStepOne.jar com.empire.hadoop.mr.inverindex.InverIndexStepOne /index/indexinput /index/indexsteponeoutput
hadoop jar IndexStepTwo.jar com.empire.hadoop.mr.inverindex.IndexStepTwo /index/indexsteponeoutput /index/indexsteptwooutput
4. Execution logs
[hadoop@centos-aaron-h1 ~]$ hadoop jar IndexStepOne.jar com.empire.hadoop.mr.inverindex.InverIndexStepOne /index/indexinput /index/indexsteponeoutput
18/12/19 07:08:42 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/12/19 07:08:43 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/12/19 07:08:43 INFO input.FileInputFormat: Total input files to process : 3
18/12/19 07:08:43 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/12/19 07:08:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545173547743_0001
18/12/19 07:08:45 INFO impl.YarnClientImpl: Submitted application application_1545173547743_0001
18/12/19 07:08:45 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1545173547743_0001/
18/12/19 07:08:45 INFO mapreduce.Job: Running job: job_1545173547743_0001
18/12/19 07:08:56 INFO mapreduce.Job: Job job_1545173547743_0001 running in uber mode : false
18/12/19 07:08:56 INFO mapreduce.Job:  map 0% reduce 0%
18/12/19 07:09:05 INFO mapreduce.Job:  map 33% reduce 0%
18/12/19 07:09:20 INFO mapreduce.Job:  map 67% reduce 0%
18/12/19 07:09:21 INFO mapreduce.Job:  map 100% reduce 100%
18/12/19 07:09:23 INFO mapreduce.Job: Job job_1545173547743_0001 completed successfully
18/12/19 07:09:23 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=1252
		FILE: Number of bytes written=791325
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=689
		HDFS: Number of bytes written=297
		HDFS: Number of read operations=12
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Killed map tasks=1
		Launched map tasks=4
		Launched reduce tasks=1
		Data-local map tasks=4
		Total time spent by all maps in occupied slots (ms)=53828
		Total time spent by all reduces in occupied slots (ms)=13635
		Total time spent by all map tasks (ms)=53828
		Total time spent by all reduce tasks (ms)=13635
		Total vcore-milliseconds taken by all map tasks=53828
		Total vcore-milliseconds taken by all reduce tasks=13635
		Total megabyte-milliseconds taken by all map tasks=55119872
		Total megabyte-milliseconds taken by all reduce tasks=13962240
	Map-Reduce Framework
		Map input records=14
		Map output records=70
		Map output bytes=1106
		Map output materialized bytes=1264
		Input split bytes=345
		Combine input records=0
		Combine output records=0
		Reduce input groups=21
		Reduce shuffle bytes=1264
		Reduce input records=70
		Reduce output records=21
		Spilled Records=140
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=1589
		CPU time spent (ms)=5600
		Physical memory (bytes) snapshot=749715456
		Virtual memory (bytes) snapshot=3382075392
		Total committed heap usage (bytes)=380334080
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=344
	File Output Format Counters
		Bytes Written=297
[hadoop@centos-aaron-h1 ~]$
[hadoop@centos-aaron-h1 ~]$ hadoop jar IndexStepTwo.jar com.empire.hadoop.mr.inverindex.IndexStepTwo /index/indexsteponeoutput /index/indexsteptwooutput
18/12/19 07:11:27 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/12/19 07:11:27 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/12/19 07:11:27 INFO input.FileInputFormat: Total input files to process : 1
18/12/19 07:11:28 INFO mapreduce.JobSubmitter: number of splits:1
18/12/19 07:11:28 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/12/19 07:11:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545173547743_0002
18/12/19 07:11:28 INFO impl.YarnClientImpl: Submitted application application_1545173547743_0002
18/12/19 07:11:29 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1545173547743_0002/
18/12/19 07:11:29 INFO mapreduce.Job: Running job: job_1545173547743_0002
18/12/19 07:11:36 INFO mapreduce.Job: Job job_1545173547743_0002 running in uber mode : false
18/12/19 07:11:36 INFO mapreduce.Job:  map 0% reduce 0%
18/12/19 07:11:42 INFO mapreduce.Job:  map 100% reduce 0%
18/12/19 07:11:48 INFO mapreduce.Job:  map 100% reduce 100%
18/12/19 07:11:48 INFO mapreduce.Job: Job job_1545173547743_0002 completed successfully
18/12/19 07:11:48 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=324
		FILE: Number of bytes written=394987
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=427
		HDFS: Number of bytes written=253
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3234
		Total time spent by all reduces in occupied slots (ms)=3557
		Total time spent by all map tasks (ms)=3234
		Total time spent by all reduce tasks (ms)=3557
		Total vcore-milliseconds taken by all map tasks=3234
		Total vcore-milliseconds taken by all reduce tasks=3557
		Total megabyte-milliseconds taken by all map tasks=3311616
		Total megabyte-milliseconds taken by all reduce tasks=3642368
	Map-Reduce Framework
		Map input records=21
		Map output records=21
		Map output bytes=276
		Map output materialized bytes=324
		Input split bytes=130
		Combine input records=0
		Combine output records=0
		Reduce input groups=7
		Reduce shuffle bytes=324
		Reduce input records=21
		Reduce output records=7
		Spilled Records=42
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=210
		CPU time spent (ms)=990
		Physical memory (bytes) snapshot=339693568
		Virtual memory (bytes) snapshot=1694265344
		Total committed heap usage (bytes)=137760768
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=297
	File Output Format Counters
		Bytes Written=253
[hadoop@centos-aaron-h1 ~]$
5. Results
[hadoop@centos-aaron-h1 ~]$ hdfs dfs -cat /index/indexsteponeoutput/part-r-00000
boby--a.txt	1
boby--b.txt	2
boby--c.txt	4
fork--a.txt	2
fork--b.txt	4
fork--c.txt	8
hello--a.txt	2
hello--b.txt	4
hello--c.txt	8
integer--a.txt	1
integer--b.txt	2
integer--c.txt	4
source--a.txt	1
source--b.txt	2
source--c.txt	4
tom--a.txt	1
tom--b.txt	2
tom--c.txt	4
[hadoop@centos-aaron-h1 ~]$
[hadoop@centos-aaron-h1 ~]$ hdfs dfs -cat /index/indexsteptwooutput/part-r-00000
boby	a.txt-->1	b.txt-->2	c.txt-->4
fork	a.txt-->2	b.txt-->4	c.txt-->8
hello	b.txt-->4	c.txt-->8	a.txt-->2
integer	a.txt-->1	b.txt-->2	c.txt-->4
source	a.txt-->1	b.txt-->2	c.txt-->4
tom	a.txt-->1	b.txt-->2	c.txt-->4
[hadoop@centos-aaron-h1 ~]$
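One detail worth noticing in the hello line: the postings come out as b.txt, c.txt, a.txt rather than in file order. MapReduce sorts keys, not values, so the order in which a reducer iterates its values is not guaranteed; if sorted postings mattered, they would need to be sorted inside the reducer (or via a secondary sort).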
A final word: that's all for this post. If you found it useful, please give it a like; and if you're interested in my other server and big-data topics, please follow this blog. You're welcome to reach out and chat anytime.