Several Ways to Implement WordCount

Date: 2020-03-28


Concise Shell

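cat  /home/sev7e0/access.log | tr -s ' ' '\n' | sort | uniq -c | sort -rn | awk '{print $2, $1}'
# cat                   prints the whole file to standard output
# tr -s ' ' '\n'        replaces spaces with newlines and squeezes repeated newlines, so each word sits on its own line
# sort                  sorts the lines and writes the result to standard output, which makes identical words adjacent
# uniq -c               collapses adjacent identical lines; -c prefixes each line with the number of times it occurred
# sort | uniq -c        used together, these count the occurrences of each word
# sort -rn              sorts by count in descending numeric order (plain sort -r would compare the counts as text)
# awk '{print $2, $1}'  prints the result with the word first and the count second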

Scala

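import scala.io.Source

// Read all words eagerly so the file can be closed before counting
val source = Source.fromFile("/home/hadoopadmin/test.txt")
val words = try source.getLines().flatMap(_.split(" ")).toList finally source.close()
// Group identical words and take the size of each group as its count
val counts = words.groupBy(identity).map { case (word, group) => (word, group.size) }
counts.foreach(println)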

User-hostile MapReduce

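// MapReduce version (driver only; WordCountMapper and WordCountReduce are defined elsewhere)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
//        conf.set("fs.defaultFS", "hdfs://spark01:9000");
//        conf.set("yarn.resourcemanager.hostname", "spark01");

        Path out = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);

        // Check whether the output path exists; MapReduce fails if it already does
        if (fs.exists(out)) {
            fs.delete(out, true);
            System.out.println("output path exists and will be deleted");
        }

        // Create the job
        Job job = Job.getInstance(conf, "wordcountDemo");
        // Set the job's main class
        job.setJarByClass(WordCount.class);

        // Set the job's input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Configure the map phase
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Configure the reduce phase
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Set the job's output path
        FileOutputFormat.setOutputPath(job, out);
        job.setNumReduceTasks(2);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}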
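
The driver above references WordCountMapper and WordCountReduce without showing them. Assuming they follow the usual word-count pattern (the mapper emits a (word, 1) pair per token and the reducer sums the values for each key), the data flow can be sketched in plain Java with no Hadoop dependency. Every name in this sketch (WordCountSketch, map, shuffle, reduce, wordCount) is hypothetical and only illustrates the three phases; it is not the original classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map -> shuffle -> reduce flow.
// Hypothetical names; this is NOT the article's WordCountMapper / WordCountReduce.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token in a line.
    public static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1L));
            }
        }
        return pairs;
    }

    // Shuffle phase: group the emitted values by key, which the framework
    // does on its own between the map and reduce phases.
    public static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum all values collected for one key.
    public static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three phases over the input lines and return word -> count.
    public static Map<String, Long> wordCount(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Long> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(wordCount(List.of("a b a", "b c")));
    }
}
```

In a real job only the map and reduce steps are user code; the grouping in shuffle is done by the framework, which is why the driver only registers a mapper class and a reducer class.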

Handy Spark

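// Spark version of WordCount (sc is the SparkContext, e.g. the one spark-shell provides)
// collect() brings the results back to the driver so they are printed there rather than on the executors
sc.textFile("/home/sev7e0/access.log").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)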
