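Spark is a general-purpose parallel computing framework, similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. Spark has the advantages of Hadoop MapReduce. Unlike MapReduce, however, a job's intermediate output can be kept in memory, so it no longer has to be written to and read back from HDFS. As a result, Spark is better suited to algorithms that need iterative map-reduce passes, such as those used in data mining and machine learning.
Spark is memory-based and, after Hadoop, is the most popular next-generation open-source project for general-purpose parallel computing in the cloud computing field. It is especially strong at interactive queries, stream processing, and graph computation.
Spark has a strong advantage in machine learning and is particularly well suited to algorithms that require many iterations. It also has very good fault-tolerance and scheduling mechanisms that keep the system running stably. Spark's current development philosophy is to bring SQL, machine learning, graph computing, stream computing, and other capabilities together in one computing framework within a single project, which makes it very easy to use. Spark has already built its own complete big-data processing ecosystem, with its own technology for stream processing, graph processing, machine learning, and NoSQL queries, and it is an Apache top-level project. Explosive growth in community and commercial adoption was predicted for the second half of 2014. Spark's biggest advantage is speed: for iterative computation it is reported to run up to 100 times faster than Hadoop. Its other distinctive advantage is "One Stack to rule them all": Spark uses a single unified technology stack to address the core problems of cloud computing and big data, which underpins its leading position in this field.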
The figure below shows running times for the logistic regression algorithm:
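To make "iterative" concrete, here is a minimal sketch of logistic regression trained by gradient descent in plain Python. It does not use Spark, and the function names, toy dataset, and hyperparameters are assumptions made for illustration only. What it shows is that every iteration makes a full pass over the same dataset, which is why keeping that dataset in memory between iterations pays off.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(points, iterations=200, learning_rate=0.5):
    """points is a list of (features, label) pairs with label 0 or 1."""
    dim = len(points[0][0])
    weights = [0.0] * dim
    for _ in range(iterations):
        # Each iteration scans the whole dataset again. In MapReduce this
        # would mean re-reading the input from HDFS on every pass.
        gradient = [0.0] * dim
        for features, label in points:
            prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
            error = prediction - label
            for i, x in enumerate(features):
                gradient[i] += error * x
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, gradient)]
    return weights


def make_toy_data(n=200, seed=0):
    """Hypothetical 1-D dataset: label is 1 when x > 0, else 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append(([1.0, x], 1 if x > 0 else 0))  # 1.0 is the bias feature
    return data


data = make_toy_data()
weights = train_logistic_regression(data)
accuracy = sum(
    (sigmoid(sum(w * x for w, x in zip(weights, f))) > 0.5) == (y == 1)
    for f, y in data
) / len(data)
print("training accuracy:", accuracy)
```

In this sketch the list `data` plays the role of the in-memory dataset: it is built once and then reused by all 200 iterations.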
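Spark currently supports programming in Scala, Python, and Java.
As Spark's native language, Scala is the first choice for developing Spark applications. Its elegant, concise code makes programmers who have written MapReduce code feel as if they have gone to heaven.
Spark can be deployed on top of Hadoop and can read data from Hadoop and HBase.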
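1. Standalone mode. Spark ships with a complete set of its own services and can be deployed on a cluster by itself, without depending on any other resource management system.
2. Spark on Mesos mode. Many companies use this mode, and it is the officially recommended one (one reason, of course, being the shared lineage of the two projects).
3. Spark on YARN mode. This is the most promising deployment mode.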
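In practice, the three modes differ mainly in the master URL passed to Spark when an application is submitted. The sketch below collects the URL formats used by Spark 1.x. The helper `master_url` and its parameters are hypothetical and written only for illustration. The default ports (7077 for a standalone master, 5050 for Mesos) are the usual defaults and may differ on a given cluster.

```python
# Usual default ports; a real cluster may be configured differently.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(mode, host="localhost", port=None):
    """Hypothetical helper: build a Spark 1.x master URL for a deploy mode."""
    if mode == "standalone":
        return "spark://%s:%d" % (host, port or DEFAULT_PORTS["standalone"])
    if mode == "mesos":
        return "mesos://%s:%d" % (host, port or DEFAULT_PORTS["mesos"])
    if mode == "yarn":
        # On YARN the cluster location is read from the Hadoop configuration,
        # so the master string carries no host. Spark 1.x uses "yarn-client"
        # or "yarn-cluster"; this sketch returns the client variant.
        return "yarn-client"
    raise ValueError("unknown deploy mode: %s" % mode)


for mode in ("standalone", "mesos", "yarn"):
    print(mode, "->", master_url(mode, host="master-host"))
```

The resulting string is what would be given as the `--master` option or set on the application's configuration.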
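Procedure: log in to Linux -> install the JDK -> install Scala -> install Spark.
Installing and configuring the JDK (omitted).
To install Scala, download it from http://www.scala-lang.org/download/.
Extract the archive after downloading.
tar zxvf scala-2.11.6.tgz
# Rename the directory
mv scala-2.11.6 scala
# Set the environment variables (entries in PATH are separated by a colon, not a semicolon)
export SCALA_HOME=/home/hadoop/software/scala
export PATH=$SCALA_HOME/bin:$PATH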
source /etc/profile
scala -version
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
Scala is set up successfully.
Download Spark from http://spark.apache.org/downloads.html and install it.
Extract the archive after downloading.
Go to $SPARK_HOME/bin and run:
./run-example SparkPi
Output:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/03/14 23:41:40 INFO SparkContext: Running Spark version 1.3.0
15/03/14 23:41:40 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.126.147 instead (on interface eth0)
15/03/14 23:41:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/03/14 23:41:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/14 23:41:41 INFO SecurityManager: Changing view acls to: hadoop
15/03/14 23:41:41 INFO SecurityManager: Changing modify acls to: hadoop
15/03/14 23:41:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/03/14 23:41:42 INFO Slf4jLogger: Slf4jLogger started
15/03/14 23:41:42 INFO Remoting: Starting remoting
15/03/14 23:41:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.126.147:60926]
15/03/14 23:41:42 INFO Utils: Successfully started service 'sparkDriver' on port 60926.
15/03/14 23:41:42 INFO SparkEnv: Registering MapOutputTracker
15/03/14 23:41:43 INFO SparkEnv: Registering BlockManagerMaster
15/03/14 23:41:43 INFO DiskBlockManager: Created local directory at /tmp/spark-285a6144-217c-442c-bfde-4b282378ac1e/blockmgr-f6cb0d15-d68d-4079-a0fe-9ec0bf8297a4
15/03/14 23:41:43 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/03/14 23:41:43 INFO HttpFileServer: HTTP File server directory is /tmp/spark-96b3f754-9cad-4ef8-9da7-2a2c5029c42a/httpd-b28f3f6d-73f7-46d7-9078-7ba7ea84ca5b
15/03/14 23:41:43 INFO HttpServer: Starting HTTP Server
15/03/14 23:41:43 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/14 23:41:43 INFO AbstractConnector: Started SocketConnector@0.0.0.0:42548
15/03/14 23:41:43 INFO Utils: Successfully started service 'HTTP file server' on port 42548.
15/03/14 23:41:43 INFO SparkEnv: Registering OutputCommitCoordinator
15/03/14 23:41:43 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/14 23:41:43 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/03/14 23:41:43 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/03/14 23:41:43 INFO SparkUI: Started SparkUI at http://192.168.126.147:4040
15/03/14 23:41:44 INFO SparkContext: Added JAR file:/home/hadoop/software/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar at http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1426347704488
15/03/14 23:41:44 INFO Executor: Starting executor ID <driver> on host localhost
15/03/14 23:41:44 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.126.147:60926/user/HeartbeatReceiver
15/03/14 23:41:44 INFO NettyBlockTransferService: Server created on 39408
15/03/14 23:41:44 INFO BlockManagerMaster: Trying to register BlockManager
15/03/14 23:41:44 INFO BlockManagerMasterActor: Registering block manager localhost:39408 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 39408)
15/03/14 23:41:44 INFO BlockManagerMaster: Registered BlockManager
15/03/14 23:41:45 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
15/03/14 23:41:45 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 2 output partitions (allowLocal=false)
15/03/14 23:41:45 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:35)
15/03/14 23:41:45 INFO DAGScheduler: Parents of final stage: List()
15/03/14 23:41:45 INFO DAGScheduler: Missing parents: List()
15/03/14 23:41:45 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31), which has no missing parents
15/03/14 23:41:45 INFO MemoryStore: ensureFreeSpace(1848) called with curMem=0, maxMem=278019440
15/03/14 23:41:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1848.0 B, free 265.1 MB)
15/03/14 23:41:45 INFO MemoryStore: ensureFreeSpace(1296) called with curMem=1848, maxMem=278019440
15/03/14 23:41:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1296.0 B, free 265.1 MB)
15/03/14 23:41:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:39408 (size: 1296.0 B, free: 265.1 MB)
15/03/14 23:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/14 23:41:45 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:839
15/03/14 23:41:45 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31)
15/03/14 23:41:45 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/03/14 23:41:45 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1340 bytes)
15/03/14 23:41:45 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1340 bytes)
15/03/14 23:41:45 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/03/14 23:41:45 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/14 23:41:45 INFO Executor: Fetching http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1426347704488
15/03/14 23:41:45 INFO Utils: Fetching http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar to /tmp/spark-db1e742b-020f-4db1-9ee3-f3e2d90e1bc2/userFiles-96c6db61-e95e-4f9e-a6c4-0db892583854/fetchFileTemp5600234414438914634.tmp
15/03/14 23:41:46 INFO Executor: Adding file:/tmp/spark-db1e742b-020f-4db1-9ee3-f3e2d90e1bc2/userFiles-96c6db61-e95e-4f9e-a6c4-0db892583854/spark-examples-1.3.0-hadoop2.4.0.jar to class loader
15/03/14 23:41:47 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 736 bytes result sent to driver
15/03/14 23:41:47 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 736 bytes result sent to driver
15/03/14 23:41:47 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1560 ms on localhost (1/2)
15/03/14 23:41:47 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1540 ms on localhost (2/2)
15/03/14 23:41:47 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/14 23:41:47 INFO DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 1.578 s
15/03/14 23:41:47 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 2.099817 s
Pi is roughly 3.14438
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
15/03/14 23:41:47 INFO SparkUI: Stopped Spark web UI at http://192.168.126.147:4040
15/03/14 23:41:47 INFO DAGScheduler: Stopping DAGScheduler
15/03/14 23:41:47 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/03/14 23:41:47 INFO MemoryStore: MemoryStore cleared
15/03/14 23:41:47 INFO BlockManager: BlockManager stopped
15/03/14 23:41:47 INFO BlockManagerMaster: BlockManagerMaster stopped
15/03/14 23:41:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped!
15/03/14 23:41:47 INFO SparkContext: Successfully stopped SparkContext
15/03/14 23:41:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/03/14 23:41:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
The output shows that Pi was estimated as roughly 3.14438.
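The result is only approximate because SparkPi estimates Pi by random sampling: a map step tests whether random points fall inside the unit circle, and a reduce step adds up the hits (the "map at SparkPi.scala" and "reduce at SparkPi.scala" entries in the log). The following is a rough single-machine sketch of the same idea in plain Python. It is not the SparkPi source, and the function names, sample count, and seed are assumptions chosen for illustration.

```python
import random
from functools import reduce


def inside_unit_circle(rng):
    """Draw one random point in the square [-1, 1] x [-1, 1]; 1 if it lands in the unit circle."""
    x = rng.uniform(-1.0, 1.0)
    y = rng.uniform(-1.0, 1.0)
    return 1 if x * x + y * y < 1.0 else 0


def estimate_pi(num_samples=200000, seed=42):
    rng = random.Random(seed)
    # Map step: one 0/1 value per sample.
    hits = map(lambda _: inside_unit_circle(rng), range(num_samples))
    # Reduce step: add the hits together.
    count = reduce(lambda a, b: a + b, hits, 0)
    # The circle covers pi/4 of the square, so scale the hit ratio by 4.
    return 4.0 * count / num_samples


pi_estimate = estimate_pi()
print("Pi is roughly", pi_estimate)
```

Spark distributes the map step across partitions (two in the run above) and combines the partial sums in the reduce step. More samples give a closer estimate, and different runs give slightly different values.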