【RDD Lazy Evaluation】
What mechanisms does Spark use to improve computational efficiency?
1- RDDs compute on distributed, in-memory datasets.
2- Lazy evaluation: an RDD transformation is not executed when that line of code runs; only the transformation is recorded on the target object. The recorded transformation logic is actually computed only when an action is reached.
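A minimal PySpark sketch of this behavior (the session setup and data are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy_demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(10))             # build an RDD
squares = nums.map(lambda x: x * x)          # transformation: only recorded, nothing runs yet
total = squares.reduce(lambda a, b: a + b)   # action: the map is actually executed here
print(total)                                 # 285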
http://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds
【An RDD is a data structure and Spark's smallest unit of processing; RDD methods (RDD operators) perform the transformations and reductions on the data.】
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
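A small PySpark sketch of persist/cache (word.txt is a hypothetical local file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist_demo").getOrCreate()
sc = spark.sparkContext

words = sc.textFile("word.txt").flatMap(lambda line: line.split(" "))
words.cache()                        # keep the flatMap result in memory after the first action
print(words.count())                 # first action: reads the file and materializes the RDD
print(words.distinct().count())      # later actions reuse the cached elements instead of re-reading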
Kendall rank correlation coefficient:
http://wiki.mbalib.com/wiki/肯德爾等級相關係數
http://baike.baidu.com/item/kendall秩相關係數
Spearman rank correlation (Spearman Rank Correlation):
http://wiki.mbalib.com/wiki/斯皮爾曼等級相關
Computing the three correlation coefficients in Spark
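As far as I know, Spark's built-in Statistics.corr covers Pearson and Spearman but not Kendall; a small PySpark sketch (the sample vectors are made up):

from pyspark.sql import SparkSession
from pyspark.mllib.stat import Statistics

spark = SparkSession.builder.appName("corr_demo").getOrCreate()
sc = spark.sparkContext

x = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0])
y = sc.parallelize([2.0, 4.0, 5.0, 4.0, 5.0])

print(Statistics.corr(x, y, method="pearson"))   # Pearson correlation coefficient
print(Statistics.corr(x, y, method="spearman"))  # Spearman rank correlation
# Kendall's tau is not built into Spark MLlib; one option is to collect the data and use scipy.stats.kendalltau.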
【Installation】
C:\spark211binhadoop27\bin>echo %HADOOP_HOME%
C:\spark211binhadoop27

C:\spark211binhadoop27\bin>%HADOOP_HOME%\bin\winutils.exe ls \tmp\hive
d--------- 1 PC-20170702HMLH\Administrator PC-20170702HMLH\None 0 Jul 7 2017 \tmp\hive

C:\spark211binhadoop27\bin>%HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive

C:\spark211binhadoop27\bin>%HADOOP_HOME%\bin\winutils.exe ls \tmp\hive
drwxrwxrwx 1 PC-20170702HMLH\Administrator PC-20170702HMLH\None 0 Jul 7 2017 \tmp\hive

C:\spark211binhadoop27\bin>spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/07 09:43:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/07 09:43:14 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/07/07 09:43:15 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark211binhadoop27/bin/../jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark211binhadoop27/jars/datanucleus-core-3.2.10.jar."
17/07/07 09:43:15 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark211binhadoop27/bin/../jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark211binhadoop27/jars/datanucleus-rdbms-3.2.9.jar."
17/07/07 09:43:15 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark211binhadoop27/bin/../jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark211binhadoop27/jars/datanucleus-api-jdo-3.2.6.jar."
17/07/07 09:43:18 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/07/07 09:43:18 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/07/07 09:43:20 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.155:4041
Spark context available as 'sc' (master = local[*], app id = local-1499391794538).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
scala> var file = sc.textFile("E:\\LearnSpark\\word.txt")
file: org.apache.spark.rdd.RDD[String] = E:\LearnSpark\word.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> var counts = file.flatMap(line=>line.split(" ")).map(word=>(word,1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26

scala> counts.saveASTextFile("E:\\LearnSpark\\count.txt")
<console>:29: error: value saveASTextFile is not a member of org.apache.spark.rdd.RDD[(String, Int)]
       counts.saveASTextFile("E:\\LearnSpark\\count.txt")
              ^

scala> counts.saveAsTextFile("E:\\LearnSpark\\count.txt")
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/E:/LearnSpark/count.txt already exists
  at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1191)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1168)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1071)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1037)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1037)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1037)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:963)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:963)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:963)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:962)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1489)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1468)
  at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1468)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1468)
  ... 48 elided

scala> counts.saveAsTextFile("E:\\LearnSpark\\count.txt")

scala>
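saveAsTextFile fails when the output directory already exists, as the FileAlreadyExistsException above shows. A PySpark sketch of the same word count that removes a previous run's local output first (the paths and file names are illustrative):

import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

out_dir = "count_out"                        # hypothetical local output directory
shutil.rmtree(out_dir, ignore_errors=True)   # delete leftovers from an earlier run, if any

counts = (sc.textFile("word.txt")            # hypothetical input file
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile(out_dir)               # writes a directory of part-* files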
https://blogs.msdn.microsoft.com/arsen/2016/02/09/resolving-spark-1-6-0-java-lang-nullpointerexception-not-found-value-sqlcontext-error-when-running-spark-shell-on-windows-10-64-bit/
(The winutils chmod steps and spark-shell startup output are identical to the transcript above.)
Set the Windows system environment variable HADOOP_HOME:
HADOOP_HOME = E:\LearnSpark\spark-2.1.1-bin-hadoop2.7
from pyspark.sql import SparkSession
import re

f = 'part-m-00003'
spark = SparkSession.builder.appName("spark_from_log_chk_unique_num").getOrCreate()
unique_mac_num = spark.read.text(f).rdd.map(lambda r: r[0]).map(
    lambda s: sorted(re.findall('[0-9a-z]{2}[0-9a-z]{2}[0-9a-z]{2}[0-9a-z]{2}[0-9a-z]{2}', s)[3:])).map(
    lambda l: ' '.join(l)).flatMap(lambda x: x.split(' ')).distinct().count()
ssh://sci@192.168.16.128:22/usr/bin/python -u /home/sci/.pycharm_helpers/pydev/pydevd.py --multiproc --qt-support --client '0.0.0.0' --port 45131 --file /home/win_pymine_clean/filter_mac/do_action.py
warning: Debugger speedups using cython not found. Run '"/usr/bin/python" "/home/sci/.pycharm_helpers/pydev/setup_cython.py" build_ext --inplace' to build.
pydev debugger: process 10092 is connecting
Connected to pydev debugger (build 171.3780.115)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/filter_mac/4.py (will have no effect)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/feature_wifi/mechanical_proving.py (will have no effect)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/filter_mac/chk_single_mac.py (will have no effect)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/spider_businessarea/spider_58.py (will have no effect)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/filter_mac/4.py (will have no effect)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/filter_mac/4.py (will have no effect)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/spider_map/one_row_unique_num.py (will have no effect)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/spider_businessarea/spider_58.py (will have no effect)
pydev debugger: warning: trying to add breakpoint to file that does not exist: /home/win_pymine_clean/feature_wifi/matrix_singular_value_decomposition.py (will have no effect)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/10/24 00:45:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/24 00:45:54 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.16.128 instead (on interface ens33)
17/10/24 00:45:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/10/24 00:45:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/10/24 00:45:55 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
【Web UI job information】
【DAG】
sci 192.168.16.128
sudo apt install net-tools
h0 192.168.16.129
h1 192.168.16.130
h2 192.168.16.131
h3 192.168.16.132
h4 192.168.16.133
h5 192.168.16.134
test 192.168.16.135
sudo apt install ssh
【
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
There are two types of links:
- Symbolic links: refer to a symbolic path indicating the abstract location of another file.
- Hard links: refer to the specific location of the physical data.
How do I create a soft link (symbolic link)?
Soft links are created with the ln command. For example, the following creates a soft link named link1 to a file named file1, both in the current directory:
$ ln -s file1 link1
To verify the new soft link, run:
$ ls -l file1 link1
】
【
Hard link vs. Soft link in Linux or UNIX
Hard links cannot link directories and cannot cross file system boundaries.
Soft (symbolic) links, like hard links, let you associate multiple filenames with a single file. In addition, symbolic links:
- Can link directories.
- Can cross file system boundaries.
The two behave differently when the source of the link is moved or removed: a symbolic link is not updated and is left dangling, while a hard link still refers to the data even if the original name is moved or removed.
】
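The difference is easy to see from Python on Linux; a minimal sketch (run it in an empty scratch directory):

import os

with open("file1", "w") as fh:               # the original file
    fh.write("hello\n")

os.link("file1", "hard1")                    # hard link: another name for the same data
os.symlink("file1", "soft1")                 # symbolic link: a path pointing at "file1"

os.remove("file1")                           # remove the original name

print(os.path.exists("hard1"))               # True  -> the data is still reachable
print(os.path.exists("soft1"))               # False -> the symlink now dangles
print(os.path.islink("soft1"))               # True  -> the link file itself still exists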
【
lrwxrwxrwx 1 root root 9 Jan 24 2017 python -> python2.7*
lrwxrwxrwx 1 root root 9 Jan 24 2017 python2 -> python2.7*
-rwxr-xr-x 1 root root 3785256 Jan 19 2017 python2.7*
lrwxrwxrwx 1 root root 33 Jan 19 2017 python2.7-config -> x86_64-linux-gnu-python2.7-config*
lrwxrwxrwx 1 root root 16 Jan 24 2017 python2-config -> python2.7-config*
lrwxrwxrwx 1 root root 9 Sep 17 20:24 python3 -> python3.5*
】
Repoint the python symlink to Python 3.5:
sudo rm py*2
sudo rm python
sudo ln -s ./python3.5 python
【
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDD and broadcast variables on that cluster.
】
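A minimal PySpark sketch of using the SparkContext to create an RDD and a broadcast variable (app name and master are illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("sc_demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])           # create an RDD on the cluster (here: local threads)
factor = sc.broadcast(10)                    # ship a read-only variable to the executors
print(rdd.map(lambda x: x * factor.value).collect())   # [10, 20, 30, 40]

sc.stop()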
【PySpark map/reduce: counting the number of distinct values of a given pattern in a file】
from pyspark.sql import SparkSession
import re

f = 'part-m-00003'
spark = SparkSession.builder.appName("spark_from_log_chk_unique_num").getOrCreate()

# Read the log file as an RDD of lines, extract the MAC-like tokens from each line
# (dropping the first three matches), then count the distinct values across the whole file.
unique_mac_num = (spark.read.text(f).rdd
                  .map(lambda r: r[0])
                  .map(lambda s: sorted(re.findall('[0-9a-z]{2}[0-9a-z]{2}[0-9a-z]{2}[0-9a-z]{2}[0-9a-z]{2}', s)[3:]))
                  .map(lambda l: ' '.join(l))
                  .flatMap(lambda x: x.split(' '))
                  .distinct()
                  .count())
dd = 9
【Global temporary view】
http://spark.apache.org/docs/latest/sql-programming-guide.html#global-temporary-view
【Session-scoped】
Temporary views in Spark SQL are session-scoped and will disappear if the session that creates it terminates. If you want to have a temporary view that is shared among all sessions and keep alive until the Spark application terminates, you can create a global temporary view. Global temporary view is tied to a system preserved database global_temp, and we must use the qualified name to refer it, e.g. SELECT * FROM global_temp.view1.
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people");

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
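The same idea in PySpark; a sketch with made-up rows instead of the people.json sample data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global_view_demo").getOrCreate()

df = spark.createDataFrame([(30, "Andy"), (19, "Justin")], ["age", "name"])

df.createGlobalTempView("people")                        # registered under the global_temp database
spark.sql("SELECT * FROM global_temp.people").show()     # the qualified name is required
spark.newSession().sql("SELECT * FROM global_temp.people").show()   # visible from another session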
【Scala Spark word count on Windows】
1- Installation
- JDK: jdk-8u171-windows-x64.exe, added to the system Path. Running spark-shell needs D:\scalajre\jre\bin\java.exe; path problems have a clear answer in the code.
- IDE: Scala IDE for Eclipse, http://scala-ide.org/download/sdk.html
- Scala: https://scala-lang.org/download/ (https://downloads.lightbend.com/scala/2.12.6/scala-2.12.6.msi), added to the system Path so it can be launched from cmd.
  Note: install it to a directory whose path contains no spaces.
  To run scala from cmd, set the environment variable SCALA_HOME to D:\scalainstall\bin.
- Download Spark:
  http://220.112.193.199/files/9037000006180326/mirrors.shu.edu.cn/apache/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
  Extract it to D:\LearnSpark\win.
  Download http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe (works on Win7) and copy it to D:\LearnSpark\win\spark-2.3.0-bin-hadoop2.7\bin.
  For version guidance, see Running Spark Applications on Windows · Mastering Apache Spark: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-tips-and-tricks-running-spark-windows.html
  Set HADOOP_HOME to D:\LearnSpark\win\spark-2.3.0-bin-hadoop2.7.
  To use spark-shell directly from cmd.exe, add D:\LearnSpark\win\spark-2.3.0-bin-hadoop2.7\bin to Path.
C:\Users\Administrator>spark-shell
2018-05-20 20:10:42 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.102:4040
Spark context available as 'sc' (master = local[*], app id = local-1526818258887).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

After removing winutils.exe:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\Administrator>spark-shell
2018-05-20 20:15:37 ERROR Shell:397 - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable D:\LearnSpark\win\spark-2.3.0-bin-hadoop2.7\bin\winutils.exe in the Hadoop binaries.
        at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
        at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
        at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
        at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2464)
        at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2464)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2464)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:222)
        at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit.scala:393)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:393)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:401)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:401)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:400)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-05-20 20:15:37 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.102:4040
Spark context available as 'sc' (master = local[*], app id = local-1526818554657).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
The key points on Windows are the location of winutils.exe and making sure the output name does not already exist in the target folder.
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\Administrator>spark-shell
2018-05-20 20:29:41 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.102:4040
Spark context available as 'sc' (master = local[*], app id = local-1526819397675).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala> var file=sc.textFile("D:\\LearnSpark\\win\\mytest.txt")
file: org.apache.spark.rdd.RDD[String] = D:\LearnSpark\win\mytest.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> var counts100=file.flatMap(line=>line.split(" ")).map(word=>(word,100)).reduceByKey(_+_)
counts100: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:25

scala> counts100.saveAsTextFile("D:\\LearnSpark\\win\\my_unique_name.txt")

scala>
【Running the demo on Win10】
var textFile=spark.read.textFile("D://SparkProducing//userdata//testenv.txt")
If you hit the error "The root scratch dir: /tmp/hive on HDFS should be writable", see:
The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- (on Windows) - Stack Overflow https://stackoverflow.com/questions/34196302/the-root-scratch-dir-tmp-hive-on-hdfs-should-be-writable-current-permissions
Check drive C: the directory c:\tmp\hive exists; change its permissions (e.g. with the winutils.exe chmod step shown earlier).
textFile.count() // Number of items in this Dataset
textFile.first() // First item in this Dataset
var lineWithSpark=textFile.filter(line=>line.contains("Spark"))
lineWithSpark.collect()
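For comparison, a PySpark version of the same quick-start steps (same hypothetical file path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quickstart").getOrCreate()

text = spark.read.text("D:/SparkProducing/userdata/testenv.txt")   # DataFrame with a single `value` column

print(text.count())     # number of lines in the file
print(text.first())     # first line

lines_with_spark = text.filter(text.value.contains("Spark"))
print(lines_with_spark.collect())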