Spark WordCount in IDEA

  • First, check that the Scala SDK is set up in IDEA; if it is not, download it from the official site http://www.scala-lang.org/. Below we will write a WordCount demo in Scala.
  • Create a Scala project named HelloWord in IDEA, then create a package com.admin and a HelloWord.scala file in it; this is an Object file. Add spark-assembly-1.6.1-hadoop2.6.0.jar to the project's dependencies.
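As an alternative to adding the assembly jar by hand, the same dependency could be declared in an sbt build. This is a hypothetical build.sbt sketch, not from the original post; the project name and Scala version are assumptions chosen to match Spark 1.6.1:

```scala
// Hypothetical build.sbt; pulls in the same Spark core classes that
// spark-assembly-1.6.1-hadoop2.6.0.jar provides.
name := "HelloWord"

scalaVersion := "2.10.6"  // Spark 1.6.x is built for Scala 2.10/2.11

// "provided" because the cluster supplies Spark at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
```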
  • The Scala code is as follows:
    package com.admin
    
    import org.apache.spark.{SparkContext, SparkConf}
    
    /**
     * Created by test on 2016/5/24.
     */
    object HelloWord {
      def main(args: Array[String]) {
        // The job expects exactly one argument: the input path on HDFS.
        if (args.length < 1) {
          System.err.println("Usage: HelloWord <file>")
          System.exit(1)
        }

        val conf = new SparkConf().setAppName("HelloWord")
        val sc = new SparkContext(conf)
        val line = sc.textFile(args(0))
        line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)
        sc.stop()
      }
    
    }
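The transformation chain above (flatMap, map, reduceByKey) can be checked without a cluster, because plain Scala collections support the same style of pipeline. This is a minimal local sketch, not the Spark job itself; `LocalWordCount` and its sample input are illustrative names:

```scala
// Local sketch of the same word-count pipeline on Scala collections:
// split lines into words, pair each word with 1, then sum counts by key.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // same shape as RDD.flatMap(_.split(" "))
      .map((_, 1))             // same shape as RDD.map((_, 1))
      .groupBy(_._1)           // reduceByKey(_+_) becomes groupBy + sum locally
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Tiny in-memory stand-in for the HDFS input file
    LocalWordCount.wordCount(Seq("hello spark", "hello world")).foreach(println)
  }
}
```

Running `main` prints each word with its count, mirroring what `collect().foreach(println)` does in the Spark version.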
  • The Scala version is far more concise; Java really does take a lot more code. The Java code is as follows:
    package com.admin;
    
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;
    
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;
    
    public final class JavaWordCount {
        private static final Pattern SPACE = Pattern.compile(" ");
    
        public static void main(String[] args) throws Exception {
    
            if (args.length < 1) {
                System.err.println("Usage: JavaWordCount <file>");
                System.exit(1);
            }
    
            SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            JavaRDD<String> lines = ctx.textFile(args[0], 1);
    
            JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                @Override
                public Iterable<String> call(String s) {
                    return Arrays.asList(SPACE.split(s));
                }
            });
    
            JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
                @Override
                public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            });
    
            JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            });
    
            List<Tuple2<String, Integer>> output = counts.collect();
            for (Tuple2<?, ?> tuple : output) {
                System.out.println(tuple._1() + ": " + tuple._2());
            }
            ctx.stop();
        }
    }
  • I am also fairly new to IDEA; I used to develop in Eclipse. With the program written, the next step is to package it into a jar file.
  • You can find the jar file in the build output directory; then upload it to the cluster server.
  • Now run it on Hadoop YARN: --name is the name of the running job, --class is the main class inside the jar, and the final argument /user/test.txt is a file on HDFS. The file's content is: import org apache spark api java JavaPairRDD import org apache spark api java JavaRDD import org apache spark api java JavaSparkContext import org apache spark api java function FlatMapFunction import org apache spark api java function Function import org apache spark api java function Function2 import org apache spark api java function PairFunction import scala Tuple2
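The submit command itself is not shown in the post; a hedged sketch of what it might look like for Spark 1.6 on YARN follows. The jar path is an assumption for illustration, while the --name, --class, and input-file values come from the text above:

```shell
# Hypothetical spark-submit invocation; only --name, --class, and the
# HDFS input path are taken from the post, the jar path is assumed.
spark-submit \
  --master yarn \
  --name HelloWord \
  --class com.admin.HelloWord \
  HelloWord.jar \
  /user/test.txt
```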
  • When the run finishes, check the result.
  • It shows the number of times each word occurred.