對文件進行詞頻統計,是一個大數據領域的hello word級別的應用,來看下實現有多簡單:分佈式
egrep -o "\b[[:alpha:]]+\b" test_word.log|sort|uniq -c|sort -rn|head -10oop
line.split(" ").map((_, 1)).groupBy(_._1).map(_._2.reduce((v1, v2) => (v1._1, v1._2 + v2._2))).toArray.sortWith(_._2 > _._2).foreach(println)
val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) sc.textFile("test_word.log").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false).take(10).foreach(println)
val env = ExecutionEnvironment.getExecutionEnvironment
env.readTextFile("test_word.log").flatMap(_.toLowerCase.split("\\s+").map((_, 1)).groupBy(0).sum(1).sortPartition(1, Order.DESCENDING).first(10).print
>db.table_name.mapReduce(function(){ emit(this.column,1);}, function(key, values){return Array.sum(values);}, {out:"post_total"})post
hadoop jar /path/hadoop-2.6.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.1.jar wordcount /tmp/wordcount/input /tmp/wordcount/output 測試
附:測試文件test_word.log內容以下:大數據
hello world
hello wwwthis
輸出以下:spa
2 hello
1 world
1 wwwcode