http://www.talkwithtrend.com/Question/177983-1247453html
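The following gives the most detailed description of these tests, for your reference:
Testing is essential for verifying a system's correctness and analyzing its performance, yet it is often neglected. To understand the system more fully, locate its bottlenecks, and improve its performance, I plan to start with testing and learn the main ways of testing Hadoop. This article has two parts: the first records how to run the testing tools that ship with Hadoop; the second records how to install and use HiBench, the Hadoop benchmark suite open-sourced by Intel.
1. Hadoop Benchmarks
Hadoop ships with several benchmarks, packaged in a few jar files such as hadoop-test.jar and hadoop-examples.jar, which are easy to run in a Hadoop environment. The Hadoop version used in this article is Cloudera's hadoop-0.20.2-cdh3u3.
Before testing, set the environment variables:
$ export HADOOP_HOME=/home/hadoop/hadoop
$ export PATH=$PATH:$HADOOP_HOME/bin
The classes in a jar can then be invoked with the following command: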
$ hadoop jar $HADOOP_HOME/xxx.jar
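Because the jar file names carry the Hadoop version, it can be handy to look the jar up instead of typing its full name. The following is a minimal sketch, assuming the jars sit directly under $HADOOP_HOME as they do in this article; the function name find_hadoop_jar is hypothetical and not part of Hadoop.

```shell
# find_hadoop_jar DIR KIND
# Prints the first jar in DIR whose name matches hadoop-KIND-*.jar
# (KIND is e.g. "test" or "examples"). Returns 1 if none is found.
# Hypothetical helper, not part of Hadoop.
find_hadoop_jar() {
    for jar in "$1"/hadoop-"$2"-*.jar; do
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    echo "no hadoop-$2 jar found in $1" >&2
    return 1
}

# Example (needs a real Hadoop installation):
#   hadoop jar "$(find_hadoop_jar "$HADOOP_HOME" test)"
```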
(1). Hadoop Test
When hadoop-test-0.20.2-cdh3u3.jar is invoked without arguments, it lists all the test programs:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar
An example program must be given as the first argument.
Valid program names are:
  DFSCIOTest: Distributed i/o benchmark of libhdfs.
  DistributedFSCheck: Distributed checkup of the file system consistency.
  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
  TestDFSIO: Distributed i/o benchmark.
  dfsthroughput: measure hdfs throughput
  filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
  loadgen: Generic map/reduce load generator
  mapredtest: A map/reduce test check.
  minicluster: Single process HDFS and MR cluster.
  mrbench: A map/reduce benchmark that can create many small jobs
  nnbench: A benchmark that stresses the namenode.
  testarrayfile: A test for flat files of binary key/value pairs.
  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
  testfilesystem: A test for FileSystem read/write.
  testipc: A test for ipc.
  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
  testrpc: A test for rpc.
  testsequencefile: A test for flat files of binary key value pairs.
  testsequencefileinputformat: A test for sequence file input format.
  testsetfile: A test for flat files of binary key/value pairs.
  testtextinputformat: A test for text input format.
  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
These programs test Hadoop from several angles; TestDFSIO, mrbench, and nnbench are three widely used ones.
TestDFSIO
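TestDFSIO measures the I/O performance of HDFS. It uses a MapReduce job to perform reads or writes concurrently: each map task reads or writes one file, the map output collects statistics about the file it processed, and the reduce step accumulates the statistics and produces a summary. The usage of TestDFSIO is:
TestDFSIO.0.0.6
Usage: TestDFSIO [genericOptions] -read | -write | -append | -clean [-nrFiles N] [-fileSize Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]
The following example writes 10 files of 1000 MB each to HDFS: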
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
The results are written to a local file, TestDFSIO_results.log:
----- TestDFSIO ----- : write
           Date & time: Mon Dec 10 11:11:15 CST 2012
       Number of files: 10
Total MBytes processed: 10000.0
     Throughput mb/sec: 3.5158047729862436
Average IO rate mb/sec: 3.5290374755859375
 IO rate std deviation: 0.22884063705950305
    Test exec time sec: 316.615
The following example reads 10 files of 1000 MB each from HDFS:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
The results are written to a local file, TestDFSIO_results.log:
----- TestDFSIO ----- : read
           Date & time: Mon Dec 10 11:21:17 CST 2012
       Number of files: 10
Total MBytes processed: 10000.0
     Throughput mb/sec: 255.8002711482874
Average IO rate mb/sec: 257.1685791015625
 IO rate std deviation: 19.514058659935184
    Test exec time sec: 18.459
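Since both runs report to the same TestDFSIO_results.log, it helps to condense the log into one line per run when comparing reads with writes. The following is a minimal sketch, assuming the log keeps the "label: value" layout shown above; the function name dfsio_summary is hypothetical.

```shell
# dfsio_summary FILE
# Prints one line per run found in a TestDFSIO results log:
#   <operation> <throughput in MB/s> <exec time in seconds>
# Assumes each run starts with a "----- TestDFSIO ----- : <op>" line
# and ends with a "Test exec time sec" line.
dfsio_summary() {
    awk -F': *' '
        /TestDFSIO -----/     { op = $2 }   # remember the operation name
        /Throughput mb\/sec/  { tp = $2 }   # remember the throughput
        /Test exec time sec/  { printf "%s %.2f %.1f\n", op, tp, $2 }
    ' "$1"
}

# Example:
#   dfsio_summary TestDFSIO_results.log
```

For the two runs in this article the summary shows the write at roughly 3.5 MB/s against roughly 256 MB/s for the read.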
Use the following command to delete the test data:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar TestDFSIO -clean
nnbench
nnbench load-tests the NameNode: it generates a large number of HDFS-related requests to put the NameNode under heavy pressure. The test can simulate creating, reading, renaming, and deleting files on HDFS. The usage of nnbench is:
NameNode Benchmark 0.4
Usage: nnbench <options>
Options:
    -operation <Available operations are create_write open_read rename delete. This option is mandatory>
     * NOTE: The open_read, rename and delete operations assume that the files they operate on, are already available. The create_write operation must be run before running the other operations.
    -maps <number of maps. default is 1. This is not mandatory>
    -reduces <number of reduces. default is 1. This is not mandatory>
    -startTime <time to start, given in seconds from the epoch. Make sure this is far enough into the future, so all maps (operations) will start at the same time>. default is launch time + 2 mins. This is not mandatory
    -blockSize <Block size in bytes. default is 1. This is not mandatory>
    -bytesToWrite <Bytes to write. default is 0. This is not mandatory>
    -bytesPerChecksum <Bytes per checksum for the files. default is 1. This is not mandatory>
    -numberOfFiles <number of files to create. default is 1. This is not mandatory>
    -replicationFactorPerFile <Replication factor for the files. default is 1. This is not mandatory>
    -baseDir <base DFS path. default is /becnhmarks/NNBench. This is not mandatory>
    -readFileAfterOpen <true or false. if true, it reads the file and reports the average time to read. This is valid with the open_read operation. default is false. This is not mandatory>
    -help: Display the help statement
The following example uses 12 mappers and 6 reducers to create 1000 files:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`
mrbench
mrbench runs a small job many times over, to check whether small jobs run repeatably and efficiently on the cluster. The usage of mrbench is:
MRBenchmark.0.0.2
Usage: mrbench [-baseDir <base DFS path for output/input, default is /benchmarks/MRBench>]
          [-jar <local path to job jar file containing Mapper and Reducer implementations, default is current jar file>]
          [-numRuns <number of times to run the job, default is 1>]
          [-maps <number of maps for each run, default is 2>]
          [-reduces <number of reduces for each run, default is 1>]
          [-inputLines <number of input lines to generate, default is 1>]
          [-inputType <type of input to generate, one of ascending (default), descending, random>]
          [-verbose]
The following example runs a small job 50 times:
$ hadoop jar $HADOOP_HOME/hadoop-test-0.20.2-cdh3u3.jar mrbench -numRuns 50
The result looks like this:
DataLines  Maps  Reduces  AvgTime (milliseconds)
1          2     1        14237
This result means that the average job completion time was about 14 seconds (14237 ms).
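When mrbench is run repeatedly, the millisecond figure can be pulled out of the result table and converted to seconds automatically. The following is a minimal sketch, assuming the table has the four whitespace-separated columns shown above; the function name mrbench_avg_seconds is hypothetical.

```shell
# mrbench_avg_seconds
# Reads mrbench output on stdin and prints the AvgTime column of the
# data row, converted from milliseconds to seconds.
# A data row is taken to be a line of exactly four fields whose first
# and last fields are plain integers; the header and other lines are skipped.
mrbench_avg_seconds() {
    awk 'NF == 4 && $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
        printf "%.3f\n", $4 / 1000
    }'
}

# Example (needs a real cluster):
#   hadoop jar "$HADOOP_HOME"/hadoop-test-0.20.2-cdh3u3.jar mrbench -numRuns 50 | mrbench_avg_seconds
```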
(2). Hadoop Examples
Besides the tests above, Hadoop also ships with some examples, such as WordCount and TeraSort, in hadoop-examples-0.20.2-cdh3u3.jar. The following command lists all the example programs:
$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
WordCount was already covered in the article Running Hadoop On CentOS (Single-Node Cluster), so it is not repeated here.
TeraSort
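A complete TeraSort test runs in three steps:
Generate random data with TeraGen
Run TeraSort on the input data
Validate the sorted output data with TeraValidate
The input data does not need to be regenerated for every test: once it has been generated, later runs can skip the first step.
The usage of TeraGen is: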
$ hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>
The following command runs TeraGen to generate 1 GB of input data (10,000,000 rows of 100 bytes each) in the directory /examples/terasort-input:
$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar teragen 10000000 /examples/terasort-input
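TeraGen takes a row count rather than a data size, so the argument has to be derived from the target size. The following is a minimal sketch of that arithmetic, based on the 100-byte row size from the usage line; the function name teragen_rows is hypothetical.

```shell
# teragen_rows BYTES
# Prints how many 100-byte rows TeraGen has to write to produce
# BYTES of input data (rounded down).
teragen_rows() {
    echo $(( $1 / 100 ))
}

# Example: 1 GB, counted as 10^9 bytes, needs the row count used above.
#   teragen_rows 1000000000
```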
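Each row that TeraGen produces has the following format:
<10 bytes key><10 bytes rowid><78 bytes filler>\r\n
where:
key is a run of random characters, each with an ASCII code in the range [32, 126]
rowid is an integer, right-aligned
filler is made up of groups of characters: 7 groups of 10 characters followed by a final group of 8 (7 × 10 + 8 = 78 bytes), with the characters taking values from 'A' to 'Z' in turn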
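To make the row layout concrete, the following sketch assembles a single 100-byte record by hand. It is an illustration under stated assumptions, not TeraGen's actual code: the key is passed in rather than generated randomly, each filler group is assumed to repeat one letter, the filler always starts at 'A', and seven groups of ten plus a final group of eight are used so that the filler adds up to the 78 bytes given in the row format. The function name make_row is hypothetical.

```shell
# make_row KEY ROWID
# Prints one record laid out as: 10-byte key, 10-byte right-aligned
# row id, 78-byte filler, CR LF (100 bytes in total).
# Illustrative only; see the assumptions in the text above.
make_row() {
    filler=""
    for letter in A B C D E F G; do
        # one group: the same letter repeated ten times
        filler="$filler$(printf '%10s' '' | tr ' ' "$letter")"
    done
    filler="${filler}HHHHHHHH"    # final, shorter group of eight
    # %-10.10s pads or truncates the key to exactly 10 characters
    printf '%-10.10s%10d%s\r\n' "$1" "$2" "$filler"
}

# Example: check the record length in bytes.
#   make_row 'abcdefghij' 42 | wc -c
```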
The following command runs TeraSort to sort the data and writes the result to the directory /examples/terasort-output:
$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar terasort /examples/terasort-input /examples/terasort-output
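The following command runs TeraValidate to check that the TeraSort output is sorted; if it detects a problem, it writes the out-of-order keys to the directory /examples/terasort-validate: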
$ hadoop jar $HADOOP_HOME/hadoop-examples-0.20.2-cdh3u3.jar teravalidate /examples/terasort-output /examples/terasort-validate
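The three TeraSort steps can be collected into one place so that a run is easy to repeat. The following is a minimal sketch that only prints the commands, using the directory names from this article as defaults; the function name terasort_plan and its argument order are hypothetical.

```shell
# terasort_plan JAR ROWS [BASE_DIR]
# Prints the three commands of a full TeraSort run, in order:
# teragen, terasort, teravalidate. BASE_DIR defaults to /examples.
# Nothing is executed; pipe the output to `sh` to run it on a cluster.
terasort_plan() {
    base=${3:-/examples}
    echo "hadoop jar $1 teragen $2 $base/terasort-input"
    echo "hadoop jar $1 terasort $base/terasort-input $base/terasort-output"
    echo "hadoop jar $1 teravalidate $base/terasort-output $base/terasort-validate"
}

# Example: print the plan, then run only the sort and validate steps
# because the input data already exists.
#   terasort_plan "$HADOOP_HOME"/hadoop-examples-0.20.2-cdh3u3.jar 10000000
#   terasort_plan "$HADOOP_HOME"/hadoop-examples-0.20.2-cdh3u3.jar 10000000 | tail -n +2 | sh
```

On repeat runs the output directories of the previous run have to be removed first, since a MapReduce job refuses to write into an output directory that already exists.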
(3). Hadoop Gridmix2
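Gridmix is a benchmark program that ships with Hadoop. It is a further wrapper around several other benchmarks and includes modules for generating data, submitting jobs, and recording completion times. Gridmix comes with several types of jobs: streamSort, javaSort, combiner, monsterQuery, webdataScan, and webdataSort.
$ cd $HADOOP_HOME/src/benchmarks/gridmix2
$ ant
$ cp build/gridmix.jar .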
Set the environment variables
Edit the gridmix-env-2 file:
export HADOOP_INSTALL_HOME=/home/jeoygin
export HADOOP_VERSION=hadoop-0.20.2-cdh3u3
export HADOOP_HOME=${HADOOP_INSTALL_HOME}/${HADOOP_VERSION}
export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
export USE_REAL_DATASET=
export APP_JAR=${HADOOP_HOME}/hadoop-test-0.20.2-cdh3u3.jar
export EXAMPLE_JAR=${HADOOP_HOME}/hadoop-examples-0.20.2-cdh3u3.jar
export STREAMING_JAR=${HADOOP_HOME}/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar
If USE_REAL_DATASET is set to TRUE, 500 GB of compressed data (equivalent to 2 TB uncompressed) is used; if it is left empty, 500 MB of compressed data (equivalent to 2 GB uncompressed) is used.
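Edit the configuration
The configuration lives in the gridmix_config.xml file. In gridmix, each type of job comes in three sizes: small, medium, and large. A small job has only 3 input files (that is, 3 maps); a medium job takes as input the files matching the pattern {part-0000,part-0001,part-000*2}; a large job processes all of the data.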
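Generate the data
$ chmod +x generateGridmix2data.sh
$ ./generateGridmix2data.sh
The generateGridmix2data.sh script runs a job that generates the input data in the HDFS directory /gridmix/data.
Run
$ chmod +x rungridmix_2
$ ./rungridmix_2
When the run starts, a _start.out file is created to record the start time; when it finishes, an _end.out file is created to record the completion time.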
(4). Viewing job statistics
Hadoop provides a very convenient way to get the statistics of a job; the following command does it:
$ hadoop job -history all <job output directory>
This command analyzes the job's two history files (stored in the <job output directory>/_logs/history directory) and computes the job's statistics.
2. HiBench
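HiBench is a Hadoop benchmark suite open-sourced by Intel. It contains 9 typical Hadoop workloads in five categories (micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks). Its home page is https://github.com/intel-hadoop/hibench .
For most workloads HiBench provides an option to enable or disable compression; the default compression codec is zlib.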
Micro Benchmarks:
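Sort (sort): generates data with Hadoop RandomTextWriter and sorts it
WordCount (wordcount): counts the occurrences of each word in the input data, which is generated with Hadoop RandomTextWriter
TeraSort (terasort): the standard benchmark created by Jim Gray, the renowned Microsoft database researcher (missing since 2007); the input data is generated by Hadoop TeraGen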
HDFS Benchmarks:
Enhanced DFSIO (dfsioe): tests the HDFS throughput of a Hadoop cluster by generating a large number of tasks that perform reads and writes simultaneously
Web Search Benchmarks:
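Nutch indexing (nutchindexing): large-scale search engine indexing is an important application of MapReduce. This workload tests the indexing subsystem of Nutch (an open-source Apache search engine) using automatically generated web data in which the links and words follow a Zipfian distribution
PageRank (pagerank): this workload contains an implementation of the PageRank algorithm on Hadoop and uses automatically generated web data in which the links follow a Zipfian distribution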
Machine Learning Benchmarks:
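Mahout Bayesian classification (bayes): large-scale machine learning is another important application of MapReduce. This workload tests the Naive Bayesian trainer in Mahout 0.7 (an open-source Apache machine learning library); the input data is automatically generated documents whose words follow a Zipfian distribution
Mahout K-means clustering (kmeans): this workload tests the K-means clustering algorithm in Mahout 0.7; the input data set is produced by GenKMeansDataset, based on uniform and Gaussian distributions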
Data Analytics Benchmarks:
Hive Query Benchmarks (hivebench): this workload was developed from the SIGMOD 09 paper "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396. It contains Hive queries (Aggregation and Join) that perform typical OLAP queries, using automatically generated web data in which the links follow a Zipfian distribution
In the rest of this article, ${HIBENCH_HOME} denotes the directory into which HiBench was unpacked.
(1). Installation and configuration
Set up the environment:
HiBench-2.2: download from https://github.com/intel-hadoop/HiBench/zipball/HiBench-2.2
Hadoop: before running any workload, make sure the Hadoop environment works properly; all workloads have been tested on Cloudera Distribution of Hadoop 3 update 4 (cdh3u4) and Hadoop 1.0.3
Hive: to test hivebench, make sure the Hive environment has been set up correctly
Configure all workloads:
A few global environment variables need to be set in the ${HIBENCH_HOME}/bin/hibench-config.sh file.
$ unzip HiBench-2.2.zip
$ cd HiBench-2.2
$ vim bin/hibench-config.sh
HADOOP_HOME            <The Hadoop installation location>
HADOOP_CONF_DIR        <The hadoop configuration DIR, default is $HADOOP_HOME/conf>
COMPRESS_GLOBAL        <Whether to enable the in/out compression for all workloads, 0 is disable, 1 is enable>
COMPRESS_CODEC_GLOBAL  <The default codec used for in/out data compression>
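As an illustration of the four global variables, the settings might look like the sketch below. All values are placeholders to adapt to the local cluster: the Hadoop path is the one used in section 1, and the codec class name is an assumption (Hadoop's zlib-backed DefaultCodec, chosen to match the zlib default mentioned earlier); whether the real file uses export or plain assignments should be checked against the shipped hibench-config.sh.

```shell
# Sketch of the global settings in ${HIBENCH_HOME}/bin/hibench-config.sh.
# Every value here is a placeholder, not the shipped default.
export HADOOP_HOME=/home/hadoop/hadoop        # Hadoop installation location
export HADOOP_CONF_DIR=$HADOOP_HOME/conf      # the documented default location
export COMPRESS_GLOBAL=0                      # 0 = compression off, 1 = on
# Assumed class name for the zlib codec; verify before use.
export COMPRESS_CODEC_GLOBAL=org.apache.hadoop.io.compress.DefaultCodec
```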
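Configure a single workload:
In each workload's directory, the conf/configure.sh file can be edited to set the parameters the workload runs with.
Synchronize the clocks of all nodes
(2). Running
Run several workloads together:
Edit the ${HIBENCH_HOME}/conf/benchmarks.lst file, which defines the workloads to run, one per line; putting # at the start of a line skips that line
Run the ${HIBENCH_HOME}/bin/run-all.sh script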
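Before starting a long run it can be useful to see which workloads the list file actually enables. The following is a minimal sketch that applies the rule described above (one workload per line, lines starting with # are skipped); it is not how run-all.sh itself is implemented, and the function name active_workloads is hypothetical.

```shell
# active_workloads FILE
# Prints the workloads that would run: every line of FILE that is
# neither blank nor starts with '#'.
active_workloads() {
    grep -v -e '^#' -e '^[[:space:]]*$' "$1"
}

# Example:
#   active_workloads "${HIBENCH_HOME}/conf/benchmarks.lst"
```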
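Run each workload individually:
Each workload can also be run on its own. Typically there are three different files in each workload's directory:
conf/configure.sh   configuration file containing all parameters; sets the data size, test options, and so on
bin/prepare*.sh     generates the job input data or copies it to HDFS
bin/run*.sh         runs the benchmark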
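Given that layout, running one workload comes down to calling its prepare script and then its run script. The following is a minimal sketch under that assumption; the function name run_workload is hypothetical, and workloads with several prepare or run scripts are executed in glob (alphabetical) order, which may not be the intended order for every workload.

```shell
# run_workload DIR
# Runs every bin/prepare*.sh in the workload directory DIR, then every
# bin/run*.sh, and stops at the first script that fails.
run_workload() {
    for script in "$1"/bin/prepare*.sh "$1"/bin/run*.sh; do
        [ -f "$script" ] || continue    # the glob matched nothing
        echo "== $script"
        bash "$script" || return 1
    done
}

# Example (needs a configured HiBench and Hadoop installation):
#   run_workload "${HIBENCH_HOME}/sort"
```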
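(3). Summary
HiBench covers a number of widely used Hadoop benchmarks. A look at the project's source shows that it is lean, with little code: a handful of scripts standardize the configuration, preparation, and running of each benchmark, which makes it very convenient to use.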