Copy from: 一篇文章看懂TPCx-BB(大數據基準測試工具)源碼 ("Understanding the TPCx-BB Big Data Benchmark Source Code in One Article")
TPCx-BB is a big data benchmark. It simulates 30 application scenarios of a retailer and runs 30 queries to measure the performance, both hardware and software, of a Hadoop-based big data system. Some of the scenarios also use machine learning algorithms (clustering, linear regression, etc.). To interpret the results for a system under test properly, you need a thorough understanding of the whole TPCx-BB test flow. This article walks through the source code of the TPCx-BB test tool in detail, hoping to help you understand how TPCx-BB works.
The home directory ($BENCH_MARK_HOME) contains several subdirectories.
bin contains several modules, the scripts invoked during execution: bigBench, cleanLogs, logEnvInformation, runBenchmark, zipLogs, and so on.
conf contains two configuration files: bigBench.properties and userSettings.conf.
bigBench.properties mainly sets the workload (the benchmark phases to execute) and power_test_0 (the SQL queries to execute in the POWER_TEST phase).
The default workload:
workload=CLEAN_ALL,ENGINE_VALIDATION_DATA_GENERATION,ENGINE_VALIDATION_LOAD_TEST,ENGINE_VALIDATION_POWER_TEST,ENGINE_VALIDATION_RESULT_VALIDATION,CLEAN_DATA,DATA_GENERATION,BENCHMARK_START,LOAD_TEST,POWER_TEST,THROUGHPUT_TEST_1,BENCHMARK_STOP,VALIDATE_POWER_TEST,VALIDATE_THROUGHPUT_TEST_1
The default power_test_0: 1-30
userSettings.conf holds the basic settings, including the Java environment; the default benchmark settings (database, engine, map_tasks, scale_factor, ...); the Hadoop environment; HDFS config and paths; and the Hadoop data generation options (DFS_REPLICATION, HADOOP_JVM_ENV, ...).
data-generator contains the scripts and configuration files for data generation; the details are covered below.
engines contains the four engines supported by TPCx-BB: biginsights, hive, impala, and spark_sql. The default engine is hive. In practice, only the hive directory is populated; the other three are empty, presumably because they have not been completed yet.
tools contains two jars: HadoopClusterExec.jar and RunBigBench.jar. RunBigBench.jar is the centerpiece of a TPCx-BB run; most of the driver logic lives inside this jar.
All data generation code and configuration live under the data-generator directory, which contains pdgf.jar plus three subdirectories: config, dicts, and extlib.
pdgf.jar is the Java data generator. config contains two configuration files: bigbench-generation.xml and bigbench-schema.xml.
bigbench-generation.xml mainly defines the raw data to generate (not the database tables): which tables are included, each table's name and size, the output directory, the file suffix, the delimiter, the character encoding, and so on.
<schema name="default">
  <tables>
    <!-- not refreshed tables -->
    <!-- tables not used in benchmark, but some tables have references to them. not refreshed. Kept for legacy reasons -->
    <table name="income_band"></table>
    <table name="reason"></table>
    <table name="ship_mode"></table>
    <table name="web_site"></table>
    <!-- /tables not used in benchmark -->
    <!-- Static tables (fixed small size, generated only on node 1, skipped on others, not generated during refresh) -->
    <table name="date_dim" static="true"></table>
    <table name="time_dim" static="true"></table>
    <table name="customer_demographics" static="true"></table>
    <table name="household_demographics" static="true"></table>
    <!-- /static tables -->
    <!-- "normal" tables. split over all nodes. not generated during refresh -->
    <table name="store"></table>
    <table name="warehouse"></table>
    <table name="promotion"></table>
    <table name="web_page"></table>
    <!-- /"normal" tables.-->
    <!-- /not refreshed tables -->
    <!-- refreshed tables. Generated on all nodes. Refresh tables generate extra data during refresh (e.g. add new data to the existing tables)
         In "normal"-Phase generate table rows: [0,REFRESH_PERCENTAGE*Table.Size];
         In "refresh"-Phase generate table rows: [REFRESH_PERCENTAGE*Table.Size+1, Table.Size].
         Has effect only if ${REFRESH_SYSTEM_ENABLED}==1. -->
    <table name="customer">
      <scheduler name="DefaultScheduler">
        <partitioner name="pdgf.core.dataGenerator.scheduler.TemplatePartitioner">
          <prePartition><![CDATA[
            if(${REFRESH_SYSTEM_ENABLED}>0){
              int tableID = table.getTableID();
              int timeID = 0;
              long lastTableRow=table.getSize()-1;
              long rowStart;
              long rowStop;
              boolean exclude=false;
              long refreshRows=table.getSize()*(1.0-${REFRESH_PERCENTAGE});
              if(${REFRESH_PHASE}>0){
                //Refresh part
                rowStart = lastTableRow - refreshRows +1;
                rowStop = lastTableRow;
                if(refreshRows<=0){
                  exclude=true;
                }
              }else{
                //"normal" part
                rowStart = 0;
                rowStop = lastTableRow - refreshRows;
              }
              return new pdgf.core.dataGenerator.scheduler.Partition(tableID, timeID,rowStart,rowStop,exclude);
            }else{
              //DEFAULT
              return getParentPartitioner().getDefaultPrePartition(project, table);
            }
          ]]></prePartition>
        </partitioner>
      </scheduler>
    </table>
<output name="SplitFileOutputWrapper">
  <!-- DEFAULT output for all Tables, if no table specific output is specified-->
  <output name="CSVRowOutput">
    <fileTemplate><![CDATA[outputDir + table.getName() +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
    <outputDir>output/</outputDir>
    <fileEnding>.dat</fileEnding>
    <delimiter>|</delimiter>
    <charset>UTF-8</charset>
    <sortByRowID>true</sortByRowID>
  </output>
  <output name="StatisticsOutput" active="1">
    <size>${item_size}</size><!-- a counter per item .. initialize later-->
    <fileTemplate><![CDATA[outputDir + table.getName()+"_audit" +(nodeCount!=1?"_"+pdgf.util.StaticHelper.zeroPaddedNumber(nodeNumber,nodeCount):"")+ fileEnding]]></fileTemplate>
    <outputDir>output/</outputDir>
    <fileEnding>.csv</fileEnding>
    <delimiter>,</delimiter>
    <header><!--"" + pdgf.util.Constants.DEFAULT_LINESEPARATOR--></header>
    <footer></footer>
bigbench-schema.xml defines a large number of parameters. Some control table scale, such as each table's size (number of records); most relate to table columns, such as start and end times, the gender ratio, the married ratio, and upper and lower bounds on metrics. It also specifies exactly how each column is generated, along with its constraints. Examples:
The size of the generated data set is determined by SCALE_FACTOR (-f). With -f 1, the total generated data is about 1 GB; with -f 100, about 100 GB. So how does SCALE_FACTOR (-f) control the generated data size so precisely?
The reason is that SCALE_FACTOR (-f) determines the record count of every table. As shown below, the customer table has 100000.0d * ${SF_sqrt} records: with -f 1 it has 100000*sqrt(1) = 100,000 records, and with -f 100 it has 100000*sqrt(100) = 1,000,000 records.
<property name="${customer_size}" type="long">100000.0d * ${SF_sqrt}</property>

<property name="${DIMENSION_TABLES_START_DAY}" type="datetime">2000-01-03 00:00:00</property>
<property name="${DIMENSION_TABLES_END_DAY}" type="datetime">2004-01-05 00:00:00</property>

<property name="${gender_likelihood}" type="double">0.5</property>
<property name="${married_likelihood}" type="double">0.3</property>

<property name="${WP_LINK_MIN}" type="double">2</property>
<property name="${WP_LINK_MAX}" type="double">25</property>

<field name="d_date" size="13" type="CHAR" primary="false">
  <gen_DateTime>
    <disableRng>true</disableRng>
    <useFixedStepSize>true</useFixedStepSize>
    <startDate>${date_dim_begin_date}</startDate>
    <endDate>${date_dim_end_date}</endDate>
    <outputFormat>yyyy-MM-dd</outputFormat>
  </gen_DateTime>
</field>

<field name="t_time_id" size="16" type="CHAR" primary="false">
  <gen_ConvertNumberToString>
    <gen_Id/>
    <size>16.0</size>
    <characters>ABCDEFGHIJKLMNOPQRSTUVWXYZ</characters>
  </gen_ConvertNumberToString>
</field>

<field name="cd_dep_employed_count" size="10" type="INTEGER" primary="false">
  <gen_Null probability="${NULL_CHANCE}">
    <gen_WeightedListItem filename="dicts/bigbench/ds-genProbabilities.txt" list="dependent_count" valueColumn="0" weightColumn="0" />
  </gen_Null>
</field>
dicts contains dictionary files such as city.dict, country.dict, male.dict, female.dict, state.dict, and mail_provider.dict; the field values of each generated record are presumably drawn from these dictionaries.
extlib contains the external jars the generator depends on, such as lucene-core-4.9.0.jar, commons-net-3.3.jar, xml-apis.jar, and log4j-1.2.15.jar.
To summarize: pdgf.jar reads the configuration in bigbench-generation.xml and bigbench-schema.xml (table names, column names, record counts, and each column's generation rule), draws the value of every field of every record from the corresponding .dict files under the dicts directory, and writes out the raw data.
A sample record from the customer table:
0 AAAAAAAAAAAAAAAA 1824793 3203 2555 28776 14690 Ms. Marisa Harrington N 17 4 1988 UNITED ARAB EMIRATES RRCyuY3XfE3a Marisa.Harrington@lawyer.com gdMmGdU9
If the TPCx-BB test is run with -f 1 (SCALE_FACTOR = 1), the generated raw data totals about 1 GB (977M + 8.6M):
[root@node-20-100 ~]# hdfs dfs -du -h /user/root/benchmarks/bigbench/data
12.7 M    38.0 M    /user/root/benchmarks/bigbench/data/customer
5.1 M     15.4 M    /user/root/benchmarks/bigbench/data/customer_address
74.2 M    222.5 M   /user/root/benchmarks/bigbench/data/customer_demographics
14.7 M    44.0 M    /user/root/benchmarks/bigbench/data/date_dim
151.5 K   454.4 K   /user/root/benchmarks/bigbench/data/household_demographics
327       981       /user/root/benchmarks/bigbench/data/income_band
405.3 M   1.2 G     /user/root/benchmarks/bigbench/data/inventory
6.5 M     19.5 M    /user/root/benchmarks/bigbench/data/item
4.0 M     12.0 M    /user/root/benchmarks/bigbench/data/item_marketprices
53.7 M    161.2 M   /user/root/benchmarks/bigbench/data/product_reviews
45.3 K    135.9 K   /user/root/benchmarks/bigbench/data/promotion
3.0 K     9.1 K     /user/root/benchmarks/bigbench/data/reason
1.2 K     3.6 K     /user/root/benchmarks/bigbench/data/ship_mode
3.3 K     9.9 K     /user/root/benchmarks/bigbench/data/store
4.1 M     12.4 M    /user/root/benchmarks/bigbench/data/store_returns
88.5 M    265.4 M   /user/root/benchmarks/bigbench/data/store_sales
4.9 M     14.6 M    /user/root/benchmarks/bigbench/data/time_dim
584       1.7 K     /user/root/benchmarks/bigbench/data/warehouse
170.4 M   511.3 M   /user/root/benchmarks/bigbench/data/web_clickstreams
7.9 K     23.6 K    /user/root/benchmarks/bigbench/data/web_page
5.1 M     15.4 M    /user/root/benchmarks/bigbench/data/web_returns
127.6 M   382.8 M   /user/root/benchmarks/bigbench/data/web_sales
8.6 K     25.9 K    /user/root/benchmarks/bigbench/data/web_site
To run a TPCx-BB test, change to the TPCx-BB home directory, enter the bin directory, and execute:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
Here -f, -m, -s, and -j are parameters the user can set according to cluster capacity and requirements. If not specified, the defaults are used, as defined in the userSettings.conf file under the conf directory:
export BIG_BENCH_DEFAULT_DATABASE="bigbench"
export BIG_BENCH_DEFAULT_ENGINE="hive"
export BIG_BENCH_DEFAULT_MAP_TASKS="80"
export BIG_BENCH_DEFAULT_SCALE_FACTOR="1000"
export BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS="2"
export BIG_BENCH_DEFAULT_BENCHMARK_PHASE="run_query"
The default MAP_TASKS is 80 (-m 80), the default SCALE_FACTOR is 1000 (-f 1000), and the default NUMBER_OF_PARALLEL_STREAMS is 2 (-s 2).
All available options and their meanings:
General options:
-d  database to use (default: $BIG_BENCH_DEFAULT_DATABASE -> bigbench)
-e  engine to use (default: $BIG_BENCH_DEFAULT_ENGINE -> hive)
-f  scale factor of the data set (default: $BIG_BENCH_DEFAULT_SCALE_FACTOR -> 1000)
-h  show help
-m  number of map tasks for data generation (default: $BIG_BENCH_DEFAULT_MAP_TASKS)
-s  number of parallel streams (default: $BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS -> 2)

Driver specific options:
-a  run in pretend mode
-b  print the bash scripts invoked during execution to stdout
-i  specify the phases to run (see $BIG_BENCH_CONF_DIR/bigBench.properties for details)
-j  specify the queries to run (default: all 30 queries, 1-30)
-U  unlock expert mode
If -U is given, i.e. expert mode is unlocked, then:
echo "EXPERT MODE ACTIVE" echo "WARNING - INTERNAL USE ONLY:" echo "Only set manually if you know what you are doing!" echo "Ignoring them is probably the best solution" echo "Running individual modules:" echo "Usage: `basename $0` module [options]" -D 指定須要debug的查詢部分. 大部分查詢都只有一個單獨的部分 -p 須要執行的benchmark phase (默認: $BIG_BENCH_DEFAULT_BENCHMARK_PHASE -> run_query)" -q 指定須要執行哪一個查詢(只能指定一個) -t 指定執行該查詢時用第哪一個stream -v metastore population的sql腳本 (默認: ${USER_POPULATE_FILE:-"$BIG_BENCH_POPULATION_DIR/hiveCreateLoad.sql"})" -w metastore refresh的sql腳本 (默認: ${USER_REFRESH_FILE:-"$BIG_BENCH_REFRESH_DIR/hiveRefreshCreateLoad.sql"})" -y 含額外的用戶自定義查詢參數的文件 (global: $BIG_BENCH_ENGINE_CONF_DIR/queryParameters.sql)" -z 含額外的用戶自定義引擎設置的文件 (global: $BIG_BENCH_ENGINE_CONF_DIR/engineSettings.sql)" List of available modules: $BIG_BENCH_ENGINE_BIN_DIR
Back to the command that launches the TPCx-BB test:
./bigBench runBenchmark -f 1 -m 8 -s 2 -j 5
bigBench is the main script; runBenchmark is a module.
bigBench sets a large number of environment variables (paths, engine, stream count, and so on), because RunBigBench.jar later reads them from inside the Java program.
The bulk of bigBench is groundwork: setting environment variables, parsing the user's options, granting file permissions, setting paths, and so on. Only its last step calls runBenchmark's runModule() function. Step by step:
Set the basic paths:
export BIG_BENCH_VERSION="1.0"
export BIG_BENCH_BIN_DIR="$BIG_BENCH_HOME/bin"
export BIG_BENCH_CONF_DIR="$BIG_BENCH_HOME/conf"
export BIG_BENCH_DATA_GENERATOR_DIR="$BIG_BENCH_HOME/data-generator"
export BIG_BENCH_TOOLS_DIR="$BIG_BENCH_HOME/tools"
export BIG_BENCH_LOGS_DIR="$BIG_BENCH_HOME/logs"
Point to core-site.xml and hdfs-site.xml. Data generation uses the Hadoop cluster, and the data is generated onto HDFS:
export BIG_BENCH_DATAGEN_CORE_SITE="$BIG_BENCH_HADOOP_CONF/core-site.xml"
export BIG_BENCH_DATAGEN_HDFS_SITE="$BIG_BENCH_HADOOP_CONF/hdfs-site.xml"
Grant execute permissions to all executable files in the package (.sh/.jar/.py):
find "$BIG_BENCH_HOME" -name '*.sh' -exec chmod 755 {} + find "$BIG_BENCH_HOME" -name '*.jar' -exec chmod 755 {} + find "$BIG_BENCH_HOME" -name '*.py' -exec chmod 755 {} +
Set the path of userSettings.conf and source it:
USER_SETTINGS="$BIG_BENCH_CONF_DIR/userSettings.conf"
if [ ! -f "$USER_SETTINGS" ]
then
  echo "User settings file $USER_SETTINGS not found"
  exit 1
else
  source "$USER_SETTINGS"
fi
Parse the arguments and options and configure accordingly. The first argument must be the module name; if there are no arguments, or the first argument starts with "-", the user did not specify a module to run.
if [[ $# -eq 0 || "`echo "$1" | cut -c1`" = "-" ]]
then
  export MODULE_NAME=""
  SHOW_HELP="1"
else
  export MODULE_NAME="$1"
  shift
fi
export LIST_OF_USER_OPTIONS="$@"
Parse the user's options and set environment variables accordingly:
while getopts ":d:D:e:f:hm:p:q:s:t:Uv:w:y:z:abi:j:" OPT; do
  case "$OPT" in
    # script options
    d) #echo "-d was triggered, Parameter: $OPTARG" >&2
       USER_DATABASE="$OPTARG"
       ;;
    D) #echo "-D was triggered, Parameter: $OPTARG" >&2
       DEBUG_QUERY_PART="$OPTARG"
       ;;
    e) #echo "-e was triggered, Parameter: $OPTARG" >&2
       USER_ENGINE="$OPTARG"
       ;;
    f) #echo "-f was triggered, Parameter: $OPTARG" >&2
       USER_SCALE_FACTOR="$OPTARG"
       ;;
    h) #echo "-h was triggered, Parameter: $OPTARG" >&2
       SHOW_HELP="1"
       ;;
    m) #echo "-m was triggered, Parameter: $OPTARG" >&2
       USER_MAP_TASKS="$OPTARG"
       ;;
    p) #echo "-p was triggered, Parameter: $OPTARG" >&2
       USER_BENCHMARK_PHASE="$OPTARG"
       ;;
    q) #echo "-q was triggered, Parameter: $OPTARG" >&2
       QUERY_NUMBER="$OPTARG"
       ;;
    s) #echo "-t was triggered, Parameter: $OPTARG" >&2
       USER_NUMBER_OF_PARALLEL_STREAMS="$OPTARG"
       ;;
    t) #echo "-s was triggered, Parameter: $OPTARG" >&2
       USER_STREAM_NUMBER="$OPTARG"
       ;;
    U) #echo "-U was triggered, Parameter: $OPTARG" >&2
       USER_EXPERT_MODE="1"
       ;;
    v) #echo "-v was triggered, Parameter: $OPTARG" >&2
       USER_POPULATE_FILE="$OPTARG"
       ;;
    w) #echo "-w was triggered, Parameter: $OPTARG" >&2
       USER_REFRESH_FILE="$OPTARG"
       ;;
    y) #echo "-y was triggered, Parameter: $OPTARG" >&2
       USER_QUERY_PARAMS_FILE="$OPTARG"
       ;;
    z) #echo "-z was triggered, Parameter: $OPTARG" >&2
       USER_ENGINE_SETTINGS_FILE="$OPTARG"
       ;;
    # driver options
    a) #echo "-a was triggered, Parameter: $OPTARG" >&2
       export USER_PRETEND_MODE="1"
       ;;
    b) #echo "-b was triggered, Parameter: $OPTARG" >&2
       export USER_PRINT_STD_OUT="1"
       ;;
    i) #echo "-i was triggered, Parameter: $OPTARG" >&2
       export USER_DRIVER_WORKLOAD="$OPTARG"
       ;;
    j) #echo "-j was triggered, Parameter: $OPTARG" >&2
       export USER_DRIVER_QUERIES_TO_RUN="$OPTARG"
       ;;
    \?) echo "Invalid option: -$OPTARG" >&2
        exit 1
        ;;
    :) echo "Option -$OPTARG requires an argument." >&2
       exit 1
       ;;
  esac
done
Set the global variables. For each parameter, the user-supplied value is used if present; otherwise the default applies:
export BIG_BENCH_EXPERT_MODE="${USER_EXPERT_MODE:-"0"}"
export SHOW_HELP="${SHOW_HELP:-"0"}"
export BIG_BENCH_DATABASE="${USER_DATABASE:-"$BIG_BENCH_DEFAULT_DATABASE"}"
export BIG_BENCH_ENGINE="${USER_ENGINE:-"$BIG_BENCH_DEFAULT_ENGINE"}"
export BIG_BENCH_MAP_TASKS="${USER_MAP_TASKS:-"$BIG_BENCH_DEFAULT_MAP_TASKS"}"
export BIG_BENCH_SCALE_FACTOR="${USER_SCALE_FACTOR:-"$BIG_BENCH_DEFAULT_SCALE_FACTOR"}"
export BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS="${USER_NUMBER_OF_PARALLEL_STREAMS:-"$BIG_BENCH_DEFAULT_NUMBER_OF_PARALLEL_STREAMS"}"
export BIG_BENCH_BENCHMARK_PHASE="${USER_BENCHMARK_PHASE:-"$BIG_BENCH_DEFAULT_BENCHMARK_PHASE"}"
export BIG_BENCH_STREAM_NUMBER="${USER_STREAM_NUMBER:-"0"}"
export BIG_BENCH_ENGINE_DIR="$BIG_BENCH_HOME/engines/$BIG_BENCH_ENGINE"
export BIG_BENCH_ENGINE_CONF_DIR="$BIG_BENCH_ENGINE_DIR/conf"
Check that the -s, -m, -f, and -t options are numeric (the code below checks MAP_TASKS, SCALE_FACTOR, NUMBER_OF_PARALLEL_STREAMS, and STREAM_NUMBER):
if [ -n "`echo "$BIG_BENCH_MAP_TASKS" | sed -e 's/[0-9]*//g'`" ] then echo "$BIG_BENCH_MAP_TASKS is not a number" fi if [ -n "`echo "$BIG_BENCH_SCALE_FACTOR" | sed -e 's/[0-9]*//g'`" ] then echo "$BIG_BENCH_SCALE_FACTOR is not a number" fi if [ -n "`echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" | sed -e 's/[0-9]*//g'`" ] then echo "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS is not a number" fi if [ -n "`echo "$BIG_BENCH_STREAM_NUMBER" | sed -e 's/[0-9]*//g'`" ] then echo "$BIG_BENCH_STREAM_NUMBER is not a number" fi
Check that the engine exists:
if [ ! -d "$BIG_BENCH_ENGINE_DIR" ] then echo "Engine directory $BIG_BENCH_ENGINE_DIR not found. Aborting script..." exit 1 fi if [ ! -d "$BIG_BENCH_ENGINE_CONF_DIR" ] then echo "Engine configuration directory $BIG_BENCH_ENGINE_CONF_DIR not found. Aborting script..." exit 1 fi
Set the path of engineSettings.conf and source it:
ENGINE_SETTINGS="$BIG_BENCH_ENGINE_CONF_DIR/engineSettings.conf"
if [ ! -f "$ENGINE_SETTINGS" ]
then
  echo "Engine settings file $ENGINE_SETTINGS not found"
  exit 1
else
  source "$ENGINE_SETTINGS"
fi
Check that the module exists. Given a module name, the script first looks for it under $BIG_BENCH_ENGINE_BIN_DIR/; if found, it is sourced (source "$MODULE"). If not, the script looks under $BIG_BENCH_BIN_DIR/ and sources it from there if found; otherwise it prints Module $MODULE not found, aborting script.
export MODULE="$BIG_BENCH_ENGINE_BIN_DIR/$MODULE_NAME"
if [ -f "$MODULE" ]
then
  source "$MODULE"
else
  export MODULE="$BIG_BENCH_BIN_DIR/$MODULE_NAME"
  if [ -f "$MODULE" ]
  then
    source "$MODULE"
  else
    echo "Module $MODULE not found, aborting script."
    exit 1
  fi
fi
Check that the module defines the runModule(), helpModule(), and runEngineCmd() functions (the check for runModule() is shown below; the other two are analogous):
MODULE_RUN_METHOD="runModule"
if ! declare -F "$MODULE_RUN_METHOD" > /dev/null 2>&1
then
  echo "$MODULE_RUN_METHOD was not implemented, aborting script"
  exit 1
fi
Run the module. For runBenchmark this executes runCmdWithErrorCheck "$MODULE_RUN_METHOD", i.e. runCmdWithErrorCheck runModule.
In summary, the bigBench script handles the groundwork (environment variables, permissions, checking and parsing the input), and finally calls runBenchmark's runModule() function to carry on.
Next, the runBenchmark script.
runBenchmark defines two functions: helpModule() and runModule(). helpModule() simply prints help. runModule() is the function actually invoked when the runBenchmark module runs; it does four things: cleans the logs, runs RunBigBench.jar, logs environment information, and zips the logs. The source:
runModule () {
  #check input parameters
  if [ "$BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS" -le 0 ]
  then
    echo "The number of parallel streams -s must be greater than 0"
    return 1
  fi

  "${BIG_BENCH_BIN_DIR}/bigBench" cleanLogs -U $LIST_OF_USER_OPTIONS
  "$BIG_BENCH_JAVA" -jar "${BIG_BENCH_TOOLS_DIR}/RunBigBench.jar"
  "${BIG_BENCH_BIN_DIR}/bigBench" logEnvInformation -U $LIST_OF_USER_OPTIONS
  "${BIG_BENCH_BIN_DIR}/bigBench" zipLogs -U $LIST_OF_USER_OPTIONS

  return $?
}
So running the runBenchmark module in turn invokes the cleanLogs, logEnvInformation, and zipLogs modules, plus RunBigBench.jar. RunBigBench.jar is the core of TPCx-BB test execution and is written in Java; we analyze its source next.
The runModule() function in RunBigBench.jar executes a given module. As we have seen, running a module means switching to the bin directory under the home directory and executing:
./bigBench module_name arguments
In runModule(), cmdLine is used to build this command:
ArrayList cmdLine = new ArrayList();
cmdLine.add("bash");
cmdLine.add(this.runScript);
cmdLine.add(benchmarkPhase.getRunModule());
cmdLine.addAll(arguments);
Here, this.runScript is:
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
benchmarkPhase.getRunModule() returns the module to execute, and arguments holds the user-supplied options.
So cmdLine becomes:
bash $BIG_BENCH_BIN_DIR/bigBench module_name arguments
So how does the system execute this bash command? By calling the runCmd() method:
boolean successful = this.runCmd(this.homeDir, benchmarkPhase.isPrintStdOut(), (String[])cmdLine.toArray(new String[0]));
Next, the runCmd() method. runCmd() uses ProcessBuilder to create an operating system process and uses that process to execute the bash command above. ProcessBuilder can also set the working directory and the environment.
ProcessBuilder pb = new ProcessBuilder(command);
pb.directory(new File(workingDirectory));
Process p = null;
// ...
p = pb.start();
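For readers unfamiliar with the API, here is a minimal, self-contained sketch (not taken from RunBigBench.jar; the echo command and the GREETING variable are made up for illustration) showing how ProcessBuilder sets a working directory and environment variables before starting a process:

import java.io.File;
import java.io.IOException;

public class RunCmdSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical command; RunBigBench.jar runs "bash $BIG_BENCH_BIN_DIR/bigBench ..." instead.
        ProcessBuilder pb = new ProcessBuilder("bash", "-c", "echo $GREETING from $PWD");
        pb.directory(new File("/tmp"));              // working directory
        pb.environment().put("GREETING", "hello");   // add/override an environment variable
        pb.redirectErrorStream(true);                // merge stderr into stdout
        Process p = pb.start();
        String output = new String(p.getInputStream().readAllBytes()); // drain output to avoid blocking
        int exitCode = p.waitFor();                  // wait for completion, as a driver would
        System.out.print(output);
        System.out.println("exit code: " + exitCode);
    }
}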
getQueryList() returns the list of queries to run, read from the $BIG_BENCH_LOGS_DIR/bigBench.properties file, whose content matches $BIG_BENCH_HOME/conf/bigBench.properties.
In bigBench.properties, power_test_0=1-30 defines which queries run in the power_test_0 phase, and in what order.
Queries can be given as ranges such as 5-12 or single numbers such as 21, separated by commas.
For example, power_test_0=28-25,2-5,10,22,30 runs the power_test_0 queries in this order: 28, 27, 26, 25, 2, 3, 4, 5, 10, 22, 30.
To run all 30 queries in ascending order:
power_test_0=1-30
The source that builds the query list:
private List<Integer> getQueryList(BigBench.BenchmarkPhase benchmarkPhase, int streamNumber) {
    String SHUFFLED_NAME_PATTERN = "shuffledQueryList";
    BigBench.BenchmarkPhase queryOrderBasicPhase = BigBench.BenchmarkPhase.POWER_TEST;
    String propertyKey = benchmarkPhase.getQueryListProperty(streamNumber);
    boolean queryOrderCached = benchmarkPhase.isQueryOrderCached();
    if(queryOrderCached && this.queryListCache.containsKey(propertyKey)) {
        return new ArrayList((Collection)this.queryListCache.get(propertyKey));
    } else {
        Object queryList;
        String basicPhaseNamePattern;
        if(!this.properties.containsKey(propertyKey)) {
            if(benchmarkPhase.isQueryOrderRandom()) {
                if(!this.queryListCache.containsKey("shuffledQueryList")) {
                    basicPhaseNamePattern = queryOrderBasicPhase.getQueryListProperty(0);
                    if(!this.properties.containsKey(basicPhaseNamePattern)) {
                        throw new IllegalArgumentException("Property " + basicPhaseNamePattern + " is not defined, but is the basis for shuffling the query list.");
                    }
                    this.queryListCache.put("shuffledQueryList", this.getQueryList(queryOrderBasicPhase, 0));
                }
                queryList = (List)this.queryListCache.get("shuffledQueryList");
                this.shuffleList((List)queryList, this.rnd);
            } else {
                queryList = this.getQueryList(queryOrderBasicPhase, 0);
            }
        } else {
            queryList = new ArrayList();
            String[] var11;
            int var10 = (var11 = this.properties.getProperty(propertyKey).split(",")).length;

            label65:
            for(int var9 = 0; var9 < var10; ++var9) {
                basicPhaseNamePattern = var11[var9];
                String[] queryRange = basicPhaseNamePattern.trim().split("-");
                switch(queryRange.length) {
                case 1:
                    ((List)queryList).add(Integer.valueOf(Integer.parseInt(queryRange[0].trim())));
                    break;
                case 2:
                    int startQuery = Integer.parseInt(queryRange[0]);
                    int endQuery = Integer.parseInt(queryRange[1]);
                    int i;
                    if(startQuery > endQuery) {
                        i = startQuery;
                        while(true) {
                            if(i < endQuery) {
                                continue label65;
                            }
                            ((List)queryList).add(Integer.valueOf(i));
                            --i;
                        }
                    } else {
                        i = startQuery;
                        while(true) {
                            if(i > endQuery) {
                                continue label65;
                            }
                            ((List)queryList).add(Integer.valueOf(i));
                            ++i;
                        }
                    }
                default:
                    throw new IllegalArgumentException("Query numbers must be in the form X or X-Y, comma separated.");
                }
            }
        }
        if(queryOrderCached) {
            this.queryListCache.put(propertyKey, new ArrayList((Collection)queryList));
        }
        return new ArrayList((Collection)queryList);
    }
}
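The decompiled control flow above is hard to follow, so here is a simplified rewrite of the same range-parsing logic (my own sketch, not the original source); it handles descending ranges such as 28-25 in the same way:

import java.util.ArrayList;
import java.util.List;

public class QueryListParser {
    // Parses e.g. "28-25,2-5,10,22,30" into [28,27,26,25,2,3,4,5,10,22,30].
    static List<Integer> parse(String spec) {
        List<Integer> result = new ArrayList<>();
        for (String part : spec.split(",")) {
            String[] range = part.trim().split("-");
            if (range.length == 1) {
                result.add(Integer.parseInt(range[0].trim()));
            } else if (range.length == 2) {
                int start = Integer.parseInt(range[0].trim());
                int end = Integer.parseInt(range[1].trim());
                int step = start <= end ? 1 : -1; // descending ranges run in reverse
                for (int i = start; i != end + step; i += step) {
                    result.add(i);
                }
            } else {
                throw new IllegalArgumentException("Query numbers must be in the form X or X-Y, comma separated.");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("28-25,2-5,10,22,30"));
    }
}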
parseEnvironment() reads and parses the process environment:
Map env = System.getenv();
this.version = (String)env.get("BIG_BENCH_VERSION");
this.homeDir = (String)env.get("BIG_BENCH_HOME");
this.confDir = (String)env.get("BIG_BENCH_CONF_DIR");
this.runScript = (String)env.get("BIG_BENCH_BIN_DIR") + "/bigBench";
this.datagenDir = (String)env.get("BIG_BENCH_DATA_GENERATOR_DIR");
this.logDir = (String)env.get("BIG_BENCH_LOGS_DIR");
this.dataGenLogFile = (String)env.get("BIG_BENCH_DATAGEN_STAGE_LOG");
this.loadLogFile = (String)env.get("BIG_BENCH_LOADING_STAGE_LOG");
this.engine = (String)env.get("BIG_BENCH_ENGINE");
this.database = (String)env.get("BIG_BENCH_DATABASE");
this.mapTasks = (String)env.get("BIG_BENCH_MAP_TASKS");
this.numberOfParallelStreams = Integer.parseInt((String)env.get("BIG_BENCH_NUMBER_OF_PARALLEL_STREAMS"));
this.scaleFactor = Long.parseLong((String)env.get("BIG_BENCH_SCALE_FACTOR"));
this.stopAfterFailure = ((String)env.get("BIG_BENCH_STOP_AFTER_FAILURE")).equals("1");
It also automatically appends -U (unlock expert mode) to the user arguments:
this.userArguments.add("-U");
If the user specified PRETEND_MODE, PRINT_STD_OUT, WORKLOAD, or QUERIES_TO_RUN, those values take precedence; otherwise the defaults apply.
if(env.containsKey("USER_PRETEND_MODE")) { this.properties.setProperty("pretend_mode", (String)env.get("USER_PRETEND_MODE")); } if(env.containsKey("USER_PRINT_STD_OUT")) { this.properties.setProperty("show_command_stdout", (String)env.get("USER_PRINT_STD_OUT")); } if(env.containsKey("USER_DRIVER_WORKLOAD")) { this.properties.setProperty("workload", (String)env.get("USER_DRIVER_WORKLOAD")); } if(env.containsKey("USER_DRIVER_QUERIES_TO_RUN")) { this.properties.setProperty(BigBench.BenchmarkPhase.POWER_TEST.getQueryListProperty(0), (String)env.get("USER_DRIVER_QUERIES_TO_RUN")); }
Read the workload and assign it to benchmarkPhases. If the workload does not already contain BENCHMARK_START and BENCHMARK_STOP, they are inserted automatically at the head and tail of benchmarkPhases, respectively.
this.benchmarkPhases = new ArrayList();
Iterator var7 = Arrays.asList(this.properties.getProperty("workload").split(",")).iterator();

while(var7.hasNext()) {
    String benchmarkPhase = (String)var7.next();
    this.benchmarkPhases.add(BigBench.BenchmarkPhase.valueOf(benchmarkPhase.trim()));
}

if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_START)) {
    this.benchmarkPhases.add(0, BigBench.BenchmarkPhase.BENCHMARK_START);
}
if(!this.benchmarkPhases.contains(BigBench.BenchmarkPhase.BENCHMARK_STOP)) {
    this.benchmarkPhases.add(BigBench.BenchmarkPhase.BENCHMARK_STOP);
}
run() is the core method of RunBigBench.jar: all execution flows through it. It calls runQueries(), runModule(), generateData(), and so on, and those methods in turn call runCmd() to spawn an operating system process, execute a bash command, and thereby invoke the bash scripts.
Inside run(), a while loop walks through every benchmarkPhase of the workload; depending on the phase, it dispatches to runQueries(), runModule(), generateData(), etc.
try {
    long e = 0L;
    this.log.finer("Benchmark phases: " + this.benchmarkPhases);
    Iterator startCheckpoint = this.benchmarkPhases.iterator();

    long throughputStart;
    while(startCheckpoint.hasNext()) {
        BigBench.BenchmarkPhase children = (BigBench.BenchmarkPhase)startCheckpoint.next();
        if(children.isPhaseDone()) {
            this.log.info("The phase " + children.name() + " was already performed earlier. Skipping this phase");
        } else {
            try {
                switch($SWITCH_TABLE$io$bigdatabenchmark$v1$driver$BigBench$BenchmarkPhase()[children.ordinal()]) {
                case 1:
                case 20:
                    throw new IllegalArgumentException("The value " + children.name() + " is only used internally.");
                case 2:
                    this.log.info(children.getConsoleMessage());
                    e = System.currentTimeMillis();
                    break;
                case 3:
                    if(!BigBench.BenchmarkPhase.BENCHMARK_START.isPhaseDone()) {
                        throw new IllegalArgumentException("Error: Cannot stop the benchmark before starting it");
                    }

                    throughputStart = System.currentTimeMillis();
                    this.log.info(String.format("%-55s finished. Time: %25s", new Object[]{children.getConsoleMessage(), BigBench.Helper.formatTime(throughputStart - e)}));
                    this.logTreeRoot.setCheckpoint(new BigBench.Checkpoint(BigBench.BenchmarkPhase.BENCHMARK, -1L, -1L, e, throughputStart, this.logTreeRoot.isSuccessful()));
                    break;
                case 4:
                case 15:
                case 18:
                case 22:
                case 27:
                case 28:
                case 29:
                    this.runModule(children, this.userArguments);
                    break;
                case 5:
                case 10:
                case 11:
                    this.runQueries(children, 1, validationArguments);
                    break;
                case 6:
                case 9:
                    this.runModule(children, validationArguments);
                    break;
                case 7:
                    this.generateData(children, false, validationArguments);
                    break;
                case 8:
                    this.generateData(children, true, validationArguments);
                    break;
                case 12:
                case 19:
                case 24:
                    this.runQueries(children, 1, this.userArguments);
                    break;
                case 13:
                case 14:
                case 21:
                case 23:
                case 25:
                case 26:
                    this.runQueries(children, this.numberOfParallelStreams, this.userArguments);
                    break;
                case 16:
                    this.generateData(children, false, this.userArguments);
                    break;
                case 17:
                    this.generateData(children, true, this.userArguments);
                }

                children.setPhaseDone(true);
            } catch (IOException var21) {
                this.log.info("==============\nBenchmark run terminated\nReason: An error occured while running a command in phase " + children + "\n==============");
                var21.printStackTrace();
                if(this.stopAfterFailure || children.mustSucceed()) {
                    break;
                }
            }
        }
    }
The case 1 through case 29 labels here are not queries 1-29; they are the 29 benchmarkPhase constants of the enum below (the decompiler turned the enum switch into a numeric one):
private static enum BenchmarkPhase {
    BENCHMARK((String)null, "benchmark", false, false, false, false, "BigBench benchmark"),
    BENCHMARK_START((String)null, "benchmark_start", false, false, false, false, "BigBench benchmark: Start"),
    BENCHMARK_STOP((String)null, "benchmark_stop", false, false, false, false, "BigBench benchmark: Stop"),
    CLEAN_ALL("cleanAll", "clean_all", false, false, false, false, "BigBench clean all"),
    ENGINE_VALIDATION_CLEAN_POWER_TEST("cleanQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Clean power test queries"),
    ENGINE_VALIDATION_CLEAN_LOAD_TEST("cleanMetastore", "engine_validation_metastore", false, false, false, false, "BigBench engine validation: Clean metastore"),
    ENGINE_VALIDATION_CLEAN_DATA("cleanData", "engine_validation_data", false, false, false, false, "BigBench engine validation: Clean data"),
    ENGINE_VALIDATION_DATA_GENERATION("dataGen", "engine_validation_data", false, false, false, true, "BigBench engine validation: Data generation"),
    ENGINE_VALIDATION_LOAD_TEST("populateMetastore", "engine_validation_metastore", false, false, false, true, "BigBench engine validation: Populate metastore"),
    ENGINE_VALIDATION_POWER_TEST("runQuery", "engine_validation_power_test", false, false, false, false, "BigBench engine validation: Power test"),
    ENGINE_VALIDATION_RESULT_VALIDATION("validateQuery", "engine_validation_power_test", false, false, true, false, "BigBench engine validation: Check all query results"),
    CLEAN_POWER_TEST("cleanQuery", "power_test", false, false, false, false, "BigBench clean: Clean power test queries"),
    CLEAN_THROUGHPUT_TEST_1("cleanQuery", "throughput_test_1", false, false, false, false, "BigBench clean: Clean first throughput test queries"),
    CLEAN_THROUGHPUT_TEST_2("cleanQuery", "throughput_test_2", false, false, false, false, "BigBench clean: Clean second throughput test queries"),
    CLEAN_LOAD_TEST("cleanMetastore", "metastore", false, false, false, false, "BigBench clean: Load test"),
    CLEAN_DATA("cleanData", "data", false, false, false, false, "BigBench clean: Data"),
    DATA_GENERATION("dataGen", "data", false, false, false, true, "BigBench preparation: Data generation"),
    LOAD_TEST("populateMetastore", "metastore", false, false, false, true, "BigBench phase 1: Load test"),
    POWER_TEST("runQuery", "power_test", false, true, false, false, "BigBench phase 2: Power test"),
    THROUGHPUT_TEST((String)null, "throughput_test", false, false, false, false, "BigBench phase 3: Throughput test"),
    THROUGHPUT_TEST_1("runQuery", "throughput_test_1", true, true, false, false, "BigBench phase 3: First throughput test run"),
    THROUGHPUT_TEST_REFRESH("refreshMetastore", "throughput_test_refresh", false, false, false, false, "BigBench phase 3: Throughput test data refresh"),
    THROUGHPUT_TEST_2("runQuery", "throughput_test_2", true, true, false, false, "BigBench phase 3: Second throughput test run"),
    VALIDATE_POWER_TEST("validateQuery", "power_test", false, false, true, false, "BigBench validation: Power test results"),
    VALIDATE_THROUGHPUT_TEST_1("validateQuery", "throughput_test_1", false, false, true, false, "BigBench validation: First throughput test results"),
    VALIDATE_THROUGHPUT_TEST_2("validateQuery", "throughput_test_2", false, false, true, false, "BigBench validation: Second throughput test results"),
    SHOW_TIMES("showTimes", "show_times", false, false, true, false, "BigBench: show query times"),
    SHOW_ERRORS("showErrors", "show_errors", false, false, true, false, "BigBench: show query errors"),
    SHOW_VALIDATION("showValidation", "show_validation", false, false, true, false, "BigBench: show query validation results");

    private String runModule;
    private String namePattern;
    private boolean queryOrderRandom;
    private boolean queryOrderCached;
    private boolean printStdOut;
    private boolean mustSucceed;
    private String consoleMessage;
    private boolean phaseDone;

    private BenchmarkPhase(String runModule, String namePattern, boolean queryOrderRandom, boolean queryOrderCached, boolean printStdOut, boolean mustSucceed, String consoleMessage) {
        this.runModule = runModule;
        this.namePattern = namePattern;
        this.queryOrderRandom = queryOrderRandom;
        this.queryOrderCached = queryOrderCached;
        this.printStdOut = printStdOut;
        this.mustSucceed = mustSucceed;
        this.consoleMessage = consoleMessage;
        this.phaseDone = false;
    }
So 3 corresponds to BENCHMARK_STOP, 4 to CLEAN_ALL, 29 to SHOW_VALIDATION, and so on: each case number is the constant's position in declaration order, i.e. ordinal() + 1.
Looking at the dispatch:

CLEAN_ALL, CLEAN_LOAD_TEST, LOAD_TEST, THROUGHPUT_TEST_REFRESH, SHOW_TIMES, SHOW_ERRORS, SHOW_VALIDATION call

this.runModule(children, this.userArguments);

that is, the method is runModule and the arguments are this.userArguments.
ENGINE_VALIDATION_CLEAN_POWER_TEST, ENGINE_VALIDATION_POWER_TEST, ENGINE_VALIDATION_RESULT_VALIDATION call

this.runQueries(children, 1, validationArguments);

that is, runQueries with a stream count of 1 and validationArguments.
ENGINE_VALIDATION_CLEAN_LOAD_TEST and ENGINE_VALIDATION_LOAD_TEST call

this.runModule(children, validationArguments);
ENGINE_VALIDATION_CLEAN_DATA calls

this.generateData(children, false, validationArguments);
ENGINE_VALIDATION_DATA_GENERATION calls

this.generateData(children, true, validationArguments);
CLEAN_POWER_TEST, POWER_TEST, VALIDATE_POWER_TEST call

this.runQueries(children, 1, this.userArguments);
CLEAN_THROUGHPUT_TEST_1, CLEAN_THROUGHPUT_TEST_2, THROUGHPUT_TEST_1, THROUGHPUT_TEST_2, VALIDATE_THROUGHPUT_TEST_1, VALIDATE_THROUGHPUT_TEST_2 call

this.runQueries(children, this.numberOfParallelStreams, this.userArguments);
CLEAN_DATA calls

this.generateData(children, false, this.userArguments);
DATA_GENERATION calls

this.generateData(children, true, this.userArguments);
Summarizing these dispatches:

- The ENGINE_VALIDATION phases all use validationArguments; the others use userArguments (the only difference is that validationArguments pins SCALE_FACTOR to 1).
- The POWER_TEST and THROUGHPUT_TEST related phases call runQueries(), since they execute the SQL queries.
- The CLEAN_DATA and DATA_GENERATION related phases call generateData().
- The LOAD_TEST and SHOW related phases call runModule().

The exact mapping from each benchmarkPhase to its module (the script it executes) is:
CLEAN_ALL -> "cleanAll" ENGINE_VALIDATION_CLEAN_POWER_TEST -> "cleanQuery" ENGINE_VALIDATION_CLEAN_LOAD_TEST -> "cleanMetastore", ENGINE_VALIDATION_CLEAN_DATA -> "cleanData" ENGINE_VALIDATION_DATA_GENERATION -> "dataGen" ENGINE_VALIDATION_LOAD_TEST -> "populateMetastore" ENGINE_VALIDATION_POWER_TEST -> "runQuery" ENGINE_VALIDATION_RESULT_VALIDATION -> "validateQuery" CLEAN_POWER_TEST -> "cleanQuery" CLEAN_THROUGHPUT_TEST_1 -> "cleanQuery" CLEAN_THROUGHPUT_TEST_2 -> "cleanQuery" CLEAN_LOAD_TEST -> "cleanMetastore" CLEAN_DATA -> "cleanData" DATA_GENERATION -> "dataGen" LOAD_TEST -> "populateMetastore" POWER_TEST -> "runQuery" THROUGHPUT_TEST -> (String)null THROUGHPUT_TEST_1 -> "runQuery" THROUGHPUT_TEST_REFRESH -> "refreshMetastore" THROUGHPUT_TEST_2 -> "runQuery" VALIDATE_POWER_TEST -> "validateQuery" VALIDATE_THROUGHPUT_TEST_1 -> "validateQuery" VALIDATE_THROUGHPUT_TEST_2 -> "validateQuery" SHOW_TIMES -> "showTimes" SHOW_ERRORS -> "showErrors" SHOW_VALIDATION -> "showValidation"
When a benchmarkPhase executes, the driver invokes that phase's module as listed above (the scripts live under the $BENCH_MARK_HOME/engines/hive/bin directory):
cmdLine.add(benchmarkPhase.getRunModule());
The following sections describe what each module does.
cleanAll:
1. DROP the database
2. Delete the source data on HDFS
echo "dropping database (with all tables)" runCmdWithErrorCheck runEngineCmd -e "DROP DATABASE IF EXISTS $BIG_BENCH_DATABASE CASCADE;" echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_HOME}" hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_HOME}"
cleanQuery:
1. Drop the temporary tables created by the given query
2. Drop the result table created by the given query
runCmdWithErrorCheck runEngineCmd -e "DROP TABLE IF EXISTS $TEMP_TABLE1; DROP TABLE IF EXISTS $TEMP_TABLE2; DROP TABLE IF EXISTS $RESULT_TABLE;" return $?
cleanMetastore:
1. Call `dropTables.sql` to DROP the 23 tables one by one
echo "cleaning metastore tables" runCmdWithErrorCheck runEngineCmd -f "$BIG_BENCH_CLEAN_METASTORE_FILE"
export BIG_BENCH_CLEAN_METASTORE_FILE="$BIG_BENCH_CLEAN_DIR/dropTables.sql"
dropTables.sql drops the 23 tables in turn; its source:
DROP TABLE IF EXISTS ${hiveconf:customerTableName};
DROP TABLE IF EXISTS ${hiveconf:customerAddressTableName};
DROP TABLE IF EXISTS ${hiveconf:customerDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:dateTableName};
DROP TABLE IF EXISTS ${hiveconf:householdDemographicsTableName};
DROP TABLE IF EXISTS ${hiveconf:incomeTableName};
DROP TABLE IF EXISTS ${hiveconf:itemTableName};
DROP TABLE IF EXISTS ${hiveconf:promotionTableName};
DROP TABLE IF EXISTS ${hiveconf:reasonTableName};
DROP TABLE IF EXISTS ${hiveconf:shipModeTableName};
DROP TABLE IF EXISTS ${hiveconf:storeTableName};
DROP TABLE IF EXISTS ${hiveconf:timeTableName};
DROP TABLE IF EXISTS ${hiveconf:warehouseTableName};
DROP TABLE IF EXISTS ${hiveconf:webSiteTableName};
DROP TABLE IF EXISTS ${hiveconf:webPageTableName};
DROP TABLE IF EXISTS ${hiveconf:inventoryTableName};
DROP TABLE IF EXISTS ${hiveconf:storeSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:storeReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:webSalesTableName};
DROP TABLE IF EXISTS ${hiveconf:webReturnsTableName};
DROP TABLE IF EXISTS ${hiveconf:marketPricesTableName};
DROP TABLE IF EXISTS ${hiveconf:clickstreamsTableName};
DROP TABLE IF EXISTS ${hiveconf:reviewsTableName};
cleanData:
1. Delete the /user/root/benchmarks/bigbench/data directory on HDFS
2. Delete the /user/root/benchmarks/bigbench/data_refresh directory on HDFS
echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}" hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}" echo "cleaning ${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}" hadoop fs -rm -r -f -skipTrash "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
dataGen:
1. Create the /user/root/benchmarks/bigbench/data directory and set its permissions
2. Create the /user/root/benchmarks/bigbench/data_refresh directory and set its permissions
3. Call HadoopClusterExec.jar and pdgf.jar to generate the base data under /user/root/benchmarks/bigbench/data
4. Call HadoopClusterExec.jar and pdgf.jar to generate the refresh data under /user/root/benchmarks/bigbench/data_refresh
Create /user/root/benchmarks/bigbench/data and set its permissions:
runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}"
Create /user/root/benchmarks/bigbench/data_refresh and set its permissions:
runCmdWithErrorCheck hadoop fs -mkdir -p "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
runCmdWithErrorCheck hadoop fs -chmod 777 "${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}"
Call HadoopClusterExec.jar and pdgf.jar to generate the base data:
runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 0 -o "'${BIG_BENCH_HDFS_ABSOLUTE_INIT_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1
Call HadoopClusterExec.jar and pdgf.jar to generate the refresh data (note REFRESH_PHASE 1 and the different output directory):
runCmdWithErrorCheck hadoop jar "${BIG_BENCH_TOOLS_DIR}/HadoopClusterExec.jar" -archives "${PDGF_ARCHIVE_PATH}" ${BIG_BENCH_DATAGEN_HADOOP_EXEC_DEBUG} -taskFailOnNonZeroReturnValue -execCWD "${PDGF_DISTRIBUTED_NODE_DIR}" ${HadoopClusterExecOptions} -exec ${BIG_BENCH_DATAGEN_HADOOP_JVM_ENV} -cp "${HADOOP_CP}:pdgf.jar" ${PDGF_CLUSTER_CONF} pdgf.Controller -nc HadoopClusterExec.tasks -nn HadoopClusterExec.taskNumber -ns -c -sp REFRESH_PHASE 1 -o "'${BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR}/'+table.getName()+'/'" ${BIG_BENCH_DATAGEN_HADOOP_OPTIONS} -s ${BIG_BENCH_DATAGEN_TABLES} ${PDGF_OPTIONS} "$@" 2>&1 | tee -a "$BIG_BENCH_DATAGEN_STAGE_LOG" 2>&1
populateMetastore: this phase actually creates the database tables, by running hiveCreateLoad.sql from the $BENCH_MARK_HOME/engines/hive/population/ directory. The script reads the raw .dat data under /user/root/benchmarks/bigbench/data into TEXTFILE-format external temporary tables, then builds the final ORC-format tables with a select * from each temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
  ( c_customer_sk             bigint --not null
  , c_customer_id             string --not null
  , c_current_cdemo_sk        bigint
  , c_current_hdemo_sk        bigint
  , c_current_addr_sk         bigint
  , c_first_shipto_date_sk    bigint
  , c_first_sales_date_sk     bigint
  , c_salutation              string
  , c_first_name              string
  , c_last_name               string
  , c_preferred_cust_flag     string
  , c_birth_day               int
  , c_birth_month             int
  , c_birth_year              int
  , c_birth_country           string
  , c_login                   string
  , c_email_address           string
  , c_last_review_date        string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
  STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
Create the final ORC-format table via select * from the temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName};
CREATE TABLE ${hiveconf:customerTableName}
STORED AS ${hiveconf:tableFormat}
AS
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;
Drop the external temporary table:
DROP TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
runQuery:
1. runQuery calls the `query_run_main_method()` function in each query's run.sh
2. `query_run_main_method()` calls `runEngineCmd` to execute the query script (qxx.sql)
runQuery calls the query_run_main_method() function in each query's run.sh:
QUERY_MAIN_METHOD="query_run_main_method"
-----------------------------------------
"$QUERY_MAIN_METHOD" 2>&1 | tee -a "$LOG_FILE_NAME" 2>&1
query_run_main_method() calls runEngineCmd to execute the query script (qxx.sql):
query_run_main_method () {
  QUERY_SCRIPT="$QUERY_DIR/$QUERY_NAME.sql"
  if [ ! -r "$QUERY_SCRIPT" ]
  then
    echo "SQL file $QUERY_SCRIPT can not be read."
    exit 1
  fi
  runCmdWithErrorCheck runEngineCmd -f "$QUERY_SCRIPT"
  return $?
}
Usually query_run_main_method() just runs the corresponding query script, but queries such as q05 and q20 use machine learning algorithms, so after the query script finishes, the resulting table is fed as input to the jar implementing the ML algorithm (clustering, logistic regression, etc.) to produce the final result.
runEngineCmd () {
  if addInitScriptsToParams
  then
    "$BINARY" "${BINARY_PARAMS[@]}" "${INIT_PARAMS[@]}" "$@"
  else
    return 1
  fi
}
--------------------------
BINARY="/usr/bin/hive"
BINARY_PARAMS+=(--hiveconf BENCHMARK_PHASE=$BIG_BENCH_BENCHMARK_PHASE --hiveconf STREAM_NUMBER=$BIG_BENCH_STREAM_NUMBER --hiveconf QUERY_NAME=$QUERY_NAME --hiveconf QUERY_DIR=$QUERY_DIR --hiveconf RESULT_TABLE=$RESULT_TABLE --hiveconf RESULT_DIR=$RESULT_DIR --hiveconf TEMP_TABLE=$TEMP_TABLE --hiveconf TEMP_DIR=$TEMP_DIR --hiveconf TABLE_PREFIX=$TABLE_PREFIX)
INIT_PARAMS=(-i "$BIG_BENCH_QUERY_PARAMS_FILE" -i "$BIG_BENCH_ENGINE_SETTINGS_FILE")
INIT_PARAMS+=(-i "$LOCAL_QUERY_ENGINE_SETTINGS_FILE")
if [ -n "$USER_QUERY_PARAMS_FILE" ]
then
  if [ -r "$USER_QUERY_PARAMS_FILE" ]
  then
    echo "User defined query parameter file found. Adding $USER_QUERY_PARAMS_FILE to hive init."
    INIT_PARAMS+=(-i "$USER_QUERY_PARAMS_FILE")
  else
    echo "User query parameter file $USER_QUERY_PARAMS_FILE can not be read."
    return 1
  fi
fi
if [ -n "$USER_ENGINE_SETTINGS_FILE" ]
then
  if [ -r "$USER_ENGINE_SETTINGS_FILE" ]
  then
    echo "User defined engine settings file found. Adding $USER_ENGINE_SETTINGS_FILE to hive init."
    INIT_PARAMS+=(-i "$USER_ENGINE_SETTINGS_FILE")
  else
    echo "User hive settings file $USER_ENGINE_SETTINGS_FILE can not be read."
    return 1
  fi
fi
return 0
validateQuery:
1. Calls the `query_run_validate_method()` function in each query's run.sh
2. `query_run_validate_method()` compares `$BENCH_MARK_HOME/engines/hive/queries/qxx/results/qxx-result` with `/user/root/benchmarks/bigbench/queryResults/qxx_hive_${BIG_BENCH_BENCHMARK_PHASE}_${BIG_BENCH_STREAM_NUMBER}_result` on HDFS; if they are identical, validation passes, otherwise it fails.
if diff -q "$VALIDATION_RESULTS_FILENAME" <(hadoop fs -cat "$RESULT_DIR/*")
then
  echo "Validation of $VALIDATION_RESULTS_FILENAME passed: Query returned correct results"
else
  echo "Validation of $VALIDATION_RESULTS_FILENAME failed: Query returned incorrect results"
  VALIDATION_PASSED="0"
fi
The comparison above applies when SF is 1 (-f 1). When SF is greater than 1, validation passes as long as the result table on HDFS contains at least one row:
if [ `hadoop fs -cat "$RESULT_DIR/*" | head -n 10 | wc -l` -ge 1 ]
then
  echo "Validation passed: Query returned results"
else
  echo "Validation failed: Query did not return results"
  return 1
fi
refreshMetastore:
1. Runs the `hiveRefreshCreateLoad.sql` script from the `$BENCH_MARK_HOME/engines/hive/refresh/` directory
2. `hiveRefreshCreateLoad.sql` loads each table's data under `/user/root/benchmarks/bigbench/data_refresh/` on HDFS into external temporary tables
3. The external temporary tables then insert each table's data into the corresponding Hive tables
hiveRefreshCreateLoad.sql loads each table's data under /user/root/benchmarks/bigbench/data_refresh/ into an external temporary table:
DROP TABLE IF EXISTS ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix};
CREATE EXTERNAL TABLE ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
  ( c_customer_sk             bigint --not null
  , c_customer_id             string --not null
  , c_current_cdemo_sk        bigint
  , c_current_hdemo_sk        bigint
  , c_current_addr_sk         bigint
  , c_first_shipto_date_sk    bigint
  , c_first_sales_date_sk     bigint
  , c_salutation              string
  , c_first_name              string
  , c_last_name               string
  , c_preferred_cust_flag     string
  , c_birth_day               int
  , c_birth_month             int
  , c_birth_year              int
  , c_birth_country           string
  , c_login                   string
  , c_email_address           string
  , c_last_review_date        string
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '${hiveconf:fieldDelimiter}'
  STORED AS TEXTFILE LOCATION '${hiveconf:hdfsDataPath}/${hiveconf:customerTableName}'
;
-----------------
set hdfsDataPath=${env:BIG_BENCH_HDFS_ABSOLUTE_REFRESH_DATA_DIR};
The external temporary table's rows are then inserted into the corresponding Hive table:
INSERT INTO TABLE ${hiveconf:customerTableName}
SELECT * FROM ${hiveconf:customerTableName}${hiveconf:temporaryTableSuffix}
;