Data: synthetically generated 3D points, clustered around the 8 vertices of a cube:
{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},
{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}
| Point number | 189,918,082 (about 190 million 3D points) |
| Capacity | 10GB |
| HDFS Location | /user/LijieXu/Kmeans/Square-10GB.txt |
Program logic:
Read the file's blocks from HDFS into memory; each block is turned into an RDD containing vectors. Map over the RDD to find the class of each vector (point), emitting (K, V) = (class, (point, 1)) as a new RDD. Before the reduce, run a combine inside each RDD to compute the per-class partial center sums, so that each RDD emits at most K key-value pairs. Finally, reduce to a new RDD whose keys are classes and whose values are the summed centers, and map once more to obtain the final centers.
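The per-partition combine step described above can be sketched on plain Scala collections (no Spark). The two centers, the sample points, and the `closest` helper below are illustrative assumptions, not the original SparkKMeans code:

```scala
// Sketch: aggregate one "partition" of points into at most K
// (class, (sumVector, count)) pairs before the global reduce.
object CombineSketch {
  type Vec = Array[Double]

  // Element-wise vector addition.
  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }

  // Index of the nearest center by squared Euclidean distance.
  def closest(p: Vec, centers: Seq[Vec]): Int =
    centers.indices.minBy { k =>
      p.zip(centers(k)).map(t => (t._1 - t._2) * (t._1 - t._2)).sum
    }

  def main(args: Array[String]): Unit = {
    val centers = Seq(Array(0.0, 0.0, 0.0), Array(10.0, 10.0, 10.0))
    val partition = Seq(Array(1.0, 0.0, 0.0), Array(9.0, 10.0, 10.0), Array(0.0, 1.0, 1.0))
    // combine: at most K pairs leave the partition, however many points it holds
    val combined: Map[Int, (Vec, Int)] =
      partition.map(p => (closest(p, centers), (p, 1)))
        .groupBy(_._1)
        .map { case (k, kvs) => (k, (kvs.map(_._2._1).reduce(add), kvs.size)) }
    combined.toSeq.sortBy(_._1).foreach { case (k, (s, n)) =>
      println(s"class $k: sum=${s.mkString(",")} count=$n")
    }
  }
}
```

The global reduce then only merges these small per-partition maps, which is why at most K key-value pairs cross the network per RDD.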
Upload the data to HDFS first, then run on the master:
root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-10GB.txt 8 2.0
The KMeans algorithm runs iteratively.
160 tasks in total (160 × 64MB = 10GB).
32 CPU cores and 18.9GB of memory were used.
Memory consumption was 4.5GB per machine (about 40GB in total): the points data itself (10GB × 2) plus roughly 10GB of post-map intermediate data (K, V) => (int, (vector, 1)).
Final result:
0.505246194 s
Final centers: Map(5 -> (13.997101228817169, 9.208875044622895, -2.494072457488311), 8 -> (-2.33522333047955, 9.128892414676326, 1.7923150585737604), 7 -> (8.658031587043952, 2.162306996983008, 17.670646829079146), 3 -> (11.530154433698268, 0.17834347219956842, 9.224352885937776), 4 -> (12.722903153986868, 8.812883284216143, 0.6564509961064319), 1 -> (6.458644369071984, 11.345681702383024, 7.041924994173552), 6 -> (12.887793408866614, -1.5189406469928937, 9.526393664105957), 2 -> (2.3345459304412164, 2.0173098597285533, 1.4772489989976143))
Estimated time to read 10GB at different throughputs:
50MB/s: 10GB => 3.5min
10MB/s: 10GB => 15min
| Point number | 377,370,313 (about 377 million 3D points) |
| Capacity | 20GB |
| HDFS Location | /user/LijieXu/Kmeans/Square-20GB.txt |
Test command:
root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-20GB.txt 8 2.0 | tee mylogs/sqaure-20GB-kmeans.log
Clustering result:
Final centers: Map(5 -> (-0.47785701742763115, -1.5901830956323306, -0.18453046159033773), 8 -> (1.1073911553593858, 9.051671594514225, -0.44722211311446924), 7 -> (1.4960397239284795, 10.173412443492643, -1.7932911100570954), 3 -> (-1.4771114031182642, 9.046878176063172, -2.4747981387714444), 4 -> (-0.2796747780312184, 0.06910629855122015, 10.268115903887612), 1 -> (10.467618592186486, -1.168580362309453, -1.0462842137817263), 6 -> (0.7569895433952736, 0.8615441990490469, 9.552726007309518), 2 -> (10.807948500515304, -0.5368803187391366, 0.04258123037074164))
Essentially the 8 cluster centers.
Memory consumption was about 5.8GB per node, roughly 50GB in total.
Memory analysis: 20GB of raw data plus 20GB of map output.
| Iteration | Time |
| 1 | 108 s |
| 2 | 0.93 s |
The second iteration is fast because the partitions are already in the RDD cache, as the logs show:
12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition 2:302
12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!
root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-20GB.txt 8 0.8
Number of tasks: 320
Times:
| Iteration | Time |
| 1 | 100.9 s |
| 2 | 0.93 s |
| 3 | 4.6 s |
| 4 | 3.9 s |
| 5 | 3.9 s |
| 6 | 3.9 s |
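The last argument of the command above (0.8, versus 2.0 earlier) is the convergence threshold. A sketch of the stopping rule this implies: iterate until the total squared movement of the K centers between two rounds drops below the threshold. Plain Scala, no Spark; the exact rule in the SparkKMeans example may differ in detail:

```scala
// Sketch of the convergeDist stopping test: total squared movement of the
// centers between two consecutive rounds, compared against the threshold.
object ConvergenceSketch {
  def squaredDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum of squared movements over all K centers.
  def moved(oldC: Seq[Array[Double]], newC: Seq[Array[Double]]): Double =
    oldC.zip(newC).map { case (o, n) => squaredDist(o, n) }.sum

  def main(args: Array[String]): Unit = {
    // Illustrative centers for two consecutive rounds (made-up values).
    val before = Seq(Array(0.3, 0.1, 0.0), Array(9.8, 10.1, 9.9))
    val after  = Seq(Array(0.1, 0.0, 0.0), Array(9.9, 10.0, 10.0))
    val convergeDist = 0.8
    val m = moved(before, after)
    println(s"movement = $m, converged = ${m < convergeDist}")
  }
}
```

A tighter threshold means more iterations before the loop exits, which matches the jump from 2 iterations (threshold 2.0) to 6 iterations (threshold 0.8) above.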
Effect of the number of iterations on memory capacity: essentially none. The main memory consumers are the 20GB input-data RDD and 20GB of intermediate data.
Final centers: Map(5 -> (-4.728089224526789E-5, 3.17334874733142E-5, -2.0605806380414582E-4), 8 -> (1.1841686358289191E-4, 10.000062966002101, 9.999933240005394), 7 -> (9.999976672588097, 10.000199556926772, -2.0695123602840933E-4), 3 -> (-1.3506815993198176E-4, 9.999948270638338, 2.328148782609023E-5), 4 -> (3.2493629851483764E-4, -7.892413981250518E-5, 10.00002515017671), 1 -> (10.00004313126956, 7.431996896171192E-6, 7.590402882208648E-5), 6 -> (9.999982611661382, 10.000144597573051, 10.000037734639696), 2 -> (9.999958673426654, -1.1917651103354863E-4, 9.99990217533504))
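These converged centers sit on the 8 vertices of the {0,10}³ cube the data was generated around. A quick sanity-check sketch, using three of the logged centers; the rounding helper is hypothetical, not part of the original program:

```scala
// Round each coordinate of a center to the nearest of {0, 10} and report
// the residual error; for converged centers the error should be tiny.
object VertexCheck {
  def nearestVertexCoord(x: Double): Double = if (x < 5.0) 0.0 else 10.0

  def main(args: Array[String]): Unit = {
    // Three centers copied from the logged result above.
    val centers = Seq(
      Array(-4.728089224526789e-5, 3.17334874733142e-5, -2.0605806380414582e-4),
      Array(9.999976672588097, 10.000199556926772, -2.0695123602840933e-4),
      Array(10.00004313126956, 7.431996896171192e-6, 7.590402882208648e-5)
    )
    centers.foreach { c =>
      val v = c.map(nearestVertexCoord)
      val err = c.zip(v).map { case (x, y) => math.abs(x - y) }.max
      println(s"vertex (${v.mkString(", ")}) max error $err")
    }
  }
}
```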
Result visualization (figure):
Test logic:
package spark.examples

import spark._

object HdfsTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "HdfsTest")
    val file = sc.textFile(args(1))
    val mapped = file.map(s => s.length).cache()
    for (iter <- 1 to 10) {
      val start = System.currentTimeMillis()
      for (x <- mapped) { x + 2 }
      val end = System.currentTimeMillis()
      println("Iteration " + iter + " took " + (end - start) + " ms")
    }
  }
}
First read a text file from HDFS into `file`.
Then compute the character count of each line of `file`, cached in the in-memory RDD `mapped`.
Then read each count in `mapped` and add 2 to it, measuring the time of the read + add.
Map only, no reduce.
What is actually being measured is RDD read performance.
root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt
Test results:
Iteration 1 took 12900 ms (≈ 12.9 s)
Iteration 2 took 388 ms
Iteration 3 took 472 ms
Iteration 4 took 490 ms
Iteration 5 took 459 ms
Iteration 6 took 492 ms
Iteration 7 took 480 ms
Iteration 8 took 501 ms
Iteration 9 took 479 ms
Iteration 10 took 432 ms
Each node consumed 2.7GB of memory (9.4GB × 3 in total).
The same read test on a 90GB random-text file:
root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/LijieXu/RandomText90GB/RandomText90GB
Elapsed times:
| Iteration | Time |
| 1 | 111.905310882 s |
| 2 | 4.681715228 s |
| 3 | 4.469296148 s |
| 4 | 4.441203887 s |
| 5 | 1.999792125 s |
| 6 | 2.151376037 s |
| 7 | 1.889345699 s |
| 8 | 1.847487668 s |
| 9 | 1.827241743 s |
| 10 | 1.747547323 s |
Total memory consumption was around 30GB.
Per-node resource consumption:
The program:
import spark.SparkContext
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: wordcount <master> <jar>")
      System.exit(1)
    }
    val sp = new SparkContext(args(0), "wordcount", "/opt/spark", List(args(1)))
    val file = sp.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://master:9000/user/Output/WikiResult3")
  }
}
Package it as mySpark.jar and upload it to /opt/spark/newProgram on the master.
Run the program:
root@master:/opt/spark# ./run -cp newProgram/mySpark.jar WordCount master@master:5050 newProgram/mySpark.jar
Mesos automatically copies the jar to the executor nodes and runs it there.
Memory consumption: the 10GB input file + 10GB after flatMap + about 15GB of map intermediate (word, 1) pairs. Part of the memory use remains unaccounted for.
Elapsed time: 50 sec (output unsorted).
Hadoop WordCount on the same input: 120 to 140 sec, also unsorted.
A single node:
Running the KMeans job that ships with Mahout:
root@master:/opt/mahout-distribution-0.6# bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36 -i /user/LijieXu/Kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6
Resource consumption on one slave while the (320-map, 1-reduce) step "Canopy Driver running buildClusters over input: output/data" was running:
| Jobid | Name | Map Total | Reduce Total | Time |
| | Input Driver running over input: /user/LijieXu/Kmeans/Square-10GB.txt | 160 | 0 | 1 min 2 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 160 | 1 | 1 min 6 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 160 | 1 | 1 min 7 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 160 | 1 | 1 min 7 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 160 | 1 | 1 min 6 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-4 | 160 | 1 | 1 min 6 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-5 | 160 | 1 | 1 min 5 s |
| | KMeans Driver running clusterData over input: output/data | 160 | 0 | 55 s |
| | Input Driver running over input: /user/LijieXu/Kmeans/Square-20GB.txt | 320 | 0 | 1 min 31 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 320 | 36 | 1 min 46 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 320 | 36 | 1 min 46 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 320 | 36 | 1 min 46 s |
| | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 320 | 36 | 1 min 47 s |
| | KMeans Driver running clusterData over input: output/data | 320 | 0 | 1 min 34 s |
Resource consumption when running KMeans on the 10GB and 20GB datasets several times:
Go to /opt/spark on the master and run:
MASTER=master@master:5050 ./spark-shell
to start the Mesos version of the Spark shell.
The framework can then be seen at master:8080:
| ID | User | Name | Running Tasks | CPUs | MEM | Max Share | Connected |
| 201206050924-0-0018 | root | | 0 | 0 | 0.0 MB | 0.00 | 2012-06-06 21:12:56 |
scala> val file = sc.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")
scala> file.first
scala> val words = file.map(_.split(' ')).filter(_.size < 100)  // yields RDD[Array[String]]
scala> words.cache
scala> words.filter(_.contains("Beijing")).count
12/06/06 22:12:33 INFO SparkContext: Job finished in 10.862765819 s
res1: Long = 855
scala> words.filter(_.contains("Beijing")).count
12/06/06 22:12:52 INFO SparkContext: Job finished in 0.71051464 s
res2: Long = 855
scala> words.filter(_.contains("Shanghai")).count
12/06/06 22:13:23 INFO SparkContext: Job finished in 0.667734427 s
res3: Long = 614
scala> words.filter(_.contains("Guangzhou")).count
12/06/06 22:13:42 INFO SparkContext: Job finished in 0.800617719 s
res4: Long = 134
Because of GC problems, very large datasets cannot be cached.