Data: synthetically generated 3D points, clustered around the 8 vertices of a cube:
{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},
{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}
| Point number | 189,918,082 (about 190 million 3D points) |
| Capacity | 10GB |
| HDFS Location | /user/LijieXu/Kmeans/Square-10GB.txt |
Program logic:
Read the file's blocks from HDFS into memory; each block is turned into an RDD containing vectors. Map over the RDD to find the class of each vector (point), emitting (K, V) = (class, (point, 1)) as a new RDD. Before the reduce, run a combine inside each RDD to compute the per-class partial center sums, so that each RDD emits at most K key-value pairs. Finally, reduce to a new RDD whose keys are classes and whose values are the summed centers, and map once more to obtain the final centers.
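The per-partition combine step described above can be sketched on plain Scala collections (no Spark). The two centers, the sample points, and the `closest` helper below are illustrative assumptions, not the original SparkKMeans code:

```scala
// Sketch: aggregate one "partition" of points into at most K
// (class, (sumVector, count)) pairs before the global reduce.
object CombineSketch {
  type Vec = Array[Double]

  // Element-wise vector addition.
  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }

  // Index of the nearest center by squared Euclidean distance.
  def closest(p: Vec, centers: Seq[Vec]): Int =
    centers.indices.minBy { k =>
      p.zip(centers(k)).map(t => (t._1 - t._2) * (t._1 - t._2)).sum
    }

  def main(args: Array[String]): Unit = {
    val centers = Seq(Array(0.0, 0.0, 0.0), Array(10.0, 10.0, 10.0))
    val partition = Seq(Array(1.0, 0.0, 0.0), Array(9.0, 10.0, 10.0), Array(0.0, 1.0, 1.0))
    // combine: at most K pairs leave the partition, however many points it holds
    val combined: Map[Int, (Vec, Int)] =
      partition.map(p => (closest(p, centers), (p, 1)))
        .groupBy(_._1)
        .map { case (k, kvs) => (k, (kvs.map(_._2._1).reduce(add), kvs.size)) }
    combined.toSeq.sortBy(_._1).foreach { case (k, (s, n)) =>
      println(s"class $k: sum=${s.mkString(",")} count=$n")
    }
  }
}
```

The global reduce then only merges these small per-partition maps, which is why at most K key-value pairs cross the network per RDD.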
Upload the data to HDFS first, then run on the master:
root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-10GB.txt 8 2.0
The KMeans algorithm runs iteratively.
160 tasks in total (160 × 64MB = 10GB).
32 CPU cores and 18.9GB of memory were used.
Memory consumption was 4.5GB per machine (about 40GB in total): the points data itself (10GB × 2) plus roughly 10GB of post-map intermediate data (K, V) => (int, (vector, 1)).
Final result:
0.505246194 s
Final centers: Map(5 -> (13.997101228817169, 9.208875044622895, -2.494072457488311), 8 -> (-2.33522333047955, 9.128892414676326, 1.7923150585737604), 7 -> (8.658031587043952, 2.162306996983008, 17.670646829079146), 3 -> (11.530154433698268, 0.17834347219956842, 9.224352885937776), 4 -> (12.722903153986868, 8.812883284216143, 0.6564509961064319), 1 -> (6.458644369071984, 11.345681702383024, 7.041924994173552), 6 -> (12.887793408866614, -1.5189406469928937, 9.526393664105957), 2 -> (2.3345459304412164, 2.0173098597285533, 1.4772489989976143))
Estimated time to read 10GB at different throughputs:
50MB/s: 10GB => 3.5min
10MB/s: 10GB => 15min
| Point number | 377,370,313 (about 377 million 3D points) |
| Capacity | 20GB |
| HDFS Location | /user/LijieXu/Kmeans/Square-20GB.txt |
Test command:
root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-20GB.txt 8 2.0 | tee mylogs/sqaure-20GB-kmeans.log
Clustering result:
Final centers: Map(5 -> (-0.47785701742763115, -1.5901830956323306, -0.18453046159033773), 8 -> (1.1073911553593858, 9.051671594514225, -0.44722211311446924), 7 -> (1.4960397239284795, 10.173412443492643, -1.7932911100570954), 3 -> (-1.4771114031182642, 9.046878176063172, -2.4747981387714444), 4 -> (-0.2796747780312184, 0.06910629855122015, 10.268115903887612), 1 -> (10.467618592186486, -1.168580362309453, -1.0462842137817263), 6 -> (0.7569895433952736, 0.8615441990490469, 9.552726007309518), 2 -> (10.807948500515304, -0.5368803187391366, 0.04258123037074164))
Essentially the 8 cluster centers.
Memory consumption was about 5.8GB per node, roughly 50GB in total.
Memory analysis: 20GB of raw data plus 20GB of map output.
| Iteration | Time |
| 1 | 108 s |
| 2 | 0.93 s |
The second iteration is fast because the partitions are already in the RDD cache, as the logs show:
12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition 2:302
12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!
root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-20GB.txt 8 0.8
Number of tasks: 320
Times:
| Iteration | Time |
| 1 | 100.9 s |
| 2 | 0.93 s |
| 3 | 4.6 s |
| 4 | 3.9 s |
| 5 | 3.9 s |
| 6 | 3.9 s |
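The last argument of the command above (0.8, versus 2.0 earlier) is the convergence threshold. A sketch of the stopping rule this implies: iterate until the total squared movement of the K centers between two rounds drops below the threshold. Plain Scala, no Spark; the exact rule in the SparkKMeans example may differ in detail:

```scala
// Sketch of the convergeDist stopping test: total squared movement of the
// centers between two consecutive rounds, compared against the threshold.
object ConvergenceSketch {
  def squaredDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum of squared movements over all K centers.
  def moved(oldC: Seq[Array[Double]], newC: Seq[Array[Double]]): Double =
    oldC.zip(newC).map { case (o, n) => squaredDist(o, n) }.sum

  def main(args: Array[String]): Unit = {
    // Illustrative centers for two consecutive rounds (made-up values).
    val before = Seq(Array(0.3, 0.1, 0.0), Array(9.8, 10.1, 9.9))
    val after  = Seq(Array(0.1, 0.0, 0.0), Array(9.9, 10.0, 10.0))
    val convergeDist = 0.8
    val m = moved(before, after)
    println(s"movement = $m, converged = ${m < convergeDist}")
  }
}
```

A tighter threshold means more iterations before the loop exits, which matches the jump from 2 iterations (threshold 2.0) to 6 iterations (threshold 0.8) above.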
Effect of the number of iterations on memory capacity: essentially none. The main memory consumers are the 20GB input-data RDD and 20GB of intermediate data.
Final centers: Map(5 -> (-4.728089224526789E-5, 3.17334874733142E-5, -2.0605806380414582E-4), 8 -> (1.1841686358289191E-4, 10.000062966002101, 9.999933240005394), 7 -> (9.999976672588097, 10.000199556926772, -2.0695123602840933E-4), 3 -> (-1.3506815993198176E-4, 9.999948270638338, 2.328148782609023E-5), 4 -> (3.2493629851483764E-4, -7.892413981250518E-5, 10.00002515017671), 1 -> (10.00004313126956, 7.431996896171192E-6, 7.590402882208648E-5), 6 -> (9.999982611661382, 10.000144597573051, 10.000037734639696), 2 -> (9.999958673426654, -1.1917651103354863E-4, 9.99990217533504))
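These converged centers sit on the 8 vertices of the {0,10}³ cube the data was generated around. A quick sanity-check sketch, using three of the logged centers; the rounding helper is hypothetical, not part of the original program:

```scala
// Round each coordinate of a center to the nearest of {0, 10} and report
// the residual error; for converged centers the error should be tiny.
object VertexCheck {
  def nearestVertexCoord(x: Double): Double = if (x < 5.0) 0.0 else 10.0

  def main(args: Array[String]): Unit = {
    // Three centers copied from the logged result above.
    val centers = Seq(
      Array(-4.728089224526789e-5, 3.17334874733142e-5, -2.0605806380414582e-4),
      Array(9.999976672588097, 10.000199556926772, -2.0695123602840933e-4),
      Array(10.00004313126956, 7.431996896171192e-6, 7.590402882208648e-5)
    )
    centers.foreach { c =>
      val v = c.map(nearestVertexCoord)
      val err = c.zip(v).map { case (x, y) => math.abs(x - y) }.max
      println(s"vertex (${v.mkString(", ")}) max error $err")
    }
  }
}
```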
Result visualization (figure):
Test logic:
package spark.examples

import spark._

object HdfsTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "HdfsTest")
    val file = sc.textFile(args(1))
    val mapped = file.map(s => s.length).cache()
    for (iter <- 1 to 10) {
      val start = System.currentTimeMillis()
      for (x <- mapped) { x + 2 }
      val end = System.currentTimeMillis()
      println("Iteration " + iter + " took " + (end - start) + " ms")
    }
  }
}
First read a text file from HDFS into `file`.
Then compute the character count of each line of `file`, cached in the in-memory RDD `mapped`.
Then read each count in `mapped` and add 2 to it, measuring the time of the read + add.
Map only, no reduce.
What is actually being measured is RDD read performance.
root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt
Test results:
Iteration 1 took 12900 ms (≈ 12.9 s)
Iteration 2 took 388 ms
Iteration 3 took 472 ms
Iteration 4 took 490 ms
Iteration 5 took 459 ms
Iteration 6 took 492 ms
Iteration 7 took 480 ms
Iteration 8 took 501 ms
Iteration 9 took 479 ms
Iteration 10 took 432 ms
Each node consumed 2.7GB of memory (9.4GB × 3 in total).
The same read test on a 90GB random-text file:
root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/LijieXu/RandomText90GB/RandomText90GB
Elapsed times:
| Iteration | Time |
| 1 | 111.905310882 s |
| 2 | 4.681715228 s |
| 3 | 4.469296148 s |
| 4 | 4.441203887 s |
| 5 | 1.999792125 s |
| 6 | 2.151376037 s |
| 7 | 1.889345699 s |
| 8 | 1.847487668 s |
| 9 | 1.827241743 s |
| 10 | 1.747547323 s |
Total memory consumption was around 30GB.
Per-node resource consumption:
The program:
import spark.SparkContext
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: wordcount <master> <jar>")
      System.exit(1)
    }
    val sp = new SparkContext(args(0), "wordcount", "/opt/spark", List(args(1)))
    val file = sp.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://master:9000/user/Output/WikiResult3")
  }
}
Package it as mySpark.jar and upload it to /opt/spark/newProgram on the master.
Run the program:
root@master:/opt/spark# ./run -cp newProgram/mySpark.jar WordCount master@master:5050 newProgram/mySpark.jar
Mesos automatically copies the jar to the executor nodes and runs it there.
Memory consumption: the 10GB input file + 10GB after flatMap + about 15GB of map intermediate (word, 1) pairs. Part of the memory use remains unaccounted for.
Elapsed time: 50 sec (output unsorted).
Hadoop WordCount on the same input: 120 to 140 sec, also unsorted.
A single node:
Running the KMeans job that ships with Mahout:
root@master:/opt/mahout-distribution-0.6# bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36 -i /user/LijieXu/Kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6
Resource consumption on one slave while the (320-map, 1-reduce) step "Canopy Driver running buildClusters over input: output/data" was running:
| Jobid | Name | Map Total | Reduce Total | Time |
| | Input Driver running over input: /user/LijieXu/Kmeans/Square-10GB.txt | 160 | 0 | 1 min 2 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 160 | 1 | 1 min 6 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 160 | 1 | 1 min 7 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 160 | 1 | 1 min 7 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 160 | 1 | 1 min 6 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-4 | 160 | 1 | 1 min 6 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-5 | 160 | 1 | 1 min 5 s |
| | KMeans Driver running clusterData over input: output/data | 160 | 0 | 55 s |
| | Input Driver running over input: /user/LijieXu/Kmeans/Square-20GB.txt | 320 | 0 | 1 min 31 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 320 | 36 | 1 min 46 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 320 | 36 | 1 min 46 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 320 | 36 | 1 min 46 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 320 | 36 | 1 min 47 s |
| | KMeans Driver running clusterData over input: output/data | 320 | 0 | 1 min 34 s |
Resource consumption when running KMeans on the 10GB and 20GB datasets several times:
Go to /opt/spark on the master and run:
MASTER=master@master:5050 ./spark-shell
to start the Mesos version of the Spark shell.
The framework can then be seen at master:8080:
| ID | User | Name | Running Tasks | CPUs | MEM | Max Share | Connected |
| 201206050924-0-0018 | root | | 0 | 0 | 0.0 MB | 0.00 | 2012-06-06 21:12:56 |
scala> val file = sc.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")
scala> file.first
scala> val words = file.map(_.split(' ')).filter(_.size < 100)  // yields RDD[Array[String]]
scala> words.cache
scala> words.filter(_.contains("Beijing")).count
12/06/06 22:12:33 INFO SparkContext: Job finished in 10.862765819 s
res1: Long = 855
scala> words.filter(_.contains("Beijing")).count
12/06/06 22:12:52 INFO SparkContext: Job finished in 0.71051464 s
res2: Long = 855
scala> words.filter(_.contains("Shanghai")).count
12/06/06 22:13:23 INFO SparkContext: Job finished in 0.667734427 s
res3: Long = 614
scala> words.filter(_.contains("Guangzhou")).count
12/06/06 22:13:42 INFO SparkContext: Job finished in 0.800617719 s
res4: Long = 134
Because of GC problems, very large datasets cannot be cached.