Reading Data from S3 with Spark

This follows the Quick Start on the Spark website, with only the file source changed.
ref: http://spark.apache.org/docs/latest/quick-start.html

package myspark;

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class LogAnalyser {

    public static void main(String[] args) {
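        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Credentials for the s3n:// file system; replace the placeholders with your own.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET");

        // The only change from the Quick Start: the input is a glob over an S3 bucket.
        String logFile = "s3n://bucket/*.log";

        // Cache the RDD because it is filtered twice below.
        JavaRDD<String> logData = sc.textFile(logFile).cache();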

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("b");
            }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        sc.stop();
    }
}
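
The example above hard-codes the access key and secret in the source. One alternative is to read them from the environment. The following is a minimal sketch, assuming the credentials are exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Those variable names and the helper class S3nCredentials are assumptions made for illustration; they are not part of the original program or of Spark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps environment variables to the two s3n Hadoop
// properties that LogAnalyser sets. Not part of Spark or Hadoop.
class S3nCredentials {

    // Returns the fs.s3n.* properties, or throws if a variable is missing.
    static Map<String, String> fromEnv(Map<String, String> env) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.s3n.awsAccessKeyId", require(env, "AWS_ACCESS_KEY_ID"));
        props.put("fs.s3n.awsSecretAccessKey", require(env, "AWS_SECRET_ACCESS_KEY"));
        return props;
    }

    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        // Print only the property names so the secret never reaches the console.
        try {
            for (String key : fromEnv(System.getenv()).keySet()) {
                System.out.println("would set " + key);
            }
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

In LogAnalyser, the two hard-coded set calls could then be replaced by a loop over S3nCredentials.fromEnv(System.getenv()).entrySet() that passes each key and value to sc.hadoopConfiguration().set(...).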

Package the project as test-0.1.0.jar and submit it to Spark:

$SPARK_HOME/bin/spark-submit --class myspark.LogAnalyser \
--master local[4] build/libs/test-0.1.0.jar

The job fails with this error:

No FileSystem for scheme: s3n

Cause and fix:

This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternately, you can use --jars if you manually downloaded the dependencies already. These parameters also work on the spark-submit script.

$SPARK_HOME/bin/spark-submit --class myspark.LogAnalyser \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
--master local[4] build/libs/test-0.1.0.jar
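
The error names a scheme because the file system implementation is chosen from the scheme part of the path URI (s3n in s3n://bucket/*.log). The following is a simplified model of that lookup, not Hadoop's actual code. The class SchemeLookup and its registry map are hypothetical, the exception type is chosen for brevity, and the implementation class names appear only as labels.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of scheme-based file system lookup.
class SchemeLookup {
    private final Map<String, String> registry = new HashMap<>();

    void register(String scheme, String implementation) {
        registry.put(scheme, implementation);
    }

    // Resolves the implementation label for a path; fails with the same
    // message text as the error above when the scheme is unknown.
    String resolve(String path) {
        String scheme = URI.create(path).getScheme();
        String implementation = registry.get(scheme);
        if (implementation == null) {
            throw new IllegalStateException("No FileSystem for scheme: " + scheme);
        }
        return implementation;
    }

    public static void main(String[] args) {
        SchemeLookup lookup = new SchemeLookup();
        lookup.register("file", "org.apache.hadoop.fs.LocalFileSystem");
        try {
            lookup.resolve("s3n://bucket/*.log");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // Putting hadoop-aws on the classpath plays the role of this registration.
        lookup.register("s3n", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
        System.out.println(lookup.resolve("s3n://bucket/*.log"));
    }
}
```

In this model, the first resolve call fails because nothing is registered for s3n. The --packages option in the command above fixes the real job in an analogous way, by supplying the jar that contains the missing implementation.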

For other languages, see: https://sparkour.urizone.net/recipes/using-s3/