Preface
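The Hadoop and Hive environment is now set up, so it is time to import data and test. In my data, 1 GB corresponds to roughly 5 million rows. Counting those rows takes MySQL about 3.29 seconds, while the same query in Hive takes about 30 seconds. Let's see how Hive behaves when the data grows to 10 GB and then 100 GB.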
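Below are my tables. A new table is produced every day and named by date. Today is July 19, 2013; the corresponding table is cb_hft, which holds about 6.46 million records.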
mysql> show tables;
+-----------------+
| Tables_in_CB    |
+-----------------+
| NSpremium       |
| cb_hft          |
| cb_hft_20130710 |
| cb_hft_20130712 |
| cb_hft_20130715 |
| cb_hft_20130716 |
+-----------------+
6 rows in set (0.00 sec)

mysql> select count(1) from cb_hft;
+----------+
| count(1) |
+----------+
|  6461338 |
+----------+
1 row in set (3.29 sec)
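Quickly copying the table:
Because this table belongs to an offline system and no online application uses it, I rename cb_hft to cb_hft_20130719 and then recreate an empty cb_hft with the same structure.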
mysql> RENAME TABLE cb_hft TO cb_hft_20130719;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE cb_hft like cb_hft_20130719;
Query OK, 0 rows affected (0.02 sec)

mysql> show tables;
+-----------------+
| Tables_in_CB    |
+-----------------+
| NSpremium       |
| cb_hft          |
| cb_hft_20130710 |
| cb_hft_20130712 |
| cb_hft_20130715 |
| cb_hft_20130716 |
| cb_hft_20130719 |
+-----------------+
7 rows in set (0.00 sec)
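The daily rotation above is easy to script. The following is a minimal Python sketch, assuming the live table is always called cb_hft and archives follow the cb_hft_YYYYMMDD pattern seen in the listing. The function name `rotation_statements` is hypothetical; the sketch only builds the SQL strings and does not connect to MySQL.

```python
from datetime import date
from typing import List

LIVE_TABLE = "cb_hft"  # live table name, taken from the article


def rotation_statements(day: date, live_table: str = LIVE_TABLE) -> List[str]:
    """Build the two statements that archive the live table under a dated name."""
    archive = f"{live_table}_{day:%Y%m%d}"
    return [
        # Move today's data out of the way under a dated name.
        f"RENAME TABLE {live_table} TO {archive};",
        # Recreate an empty live table with the same structure.
        f"CREATE TABLE {live_table} LIKE {archive};",
    ]


for stmt in rotation_statements(date(2013, 7, 19)):
    print(stmt)
```

Keeping the statement generation separate from execution makes it possible to review the SQL before handing it to a MySQL client.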
Exporting a table to CSV
Using table cb_hft_20130712 as an example:
mysql> SELECT SecurityID,TradeTime,PreClosePx,OpenPx,HighPx,LowPx,LastPx,
BidSize1,BidPx1,BidSize2,BidPx2,BidSize3,BidPx3,BidSize4,BidPx4,BidSize5,BidPx5,
OfferSize1,OfferPx1,OfferSize2,OfferPx2,OfferSize3,OfferPx3,OfferSize4,OfferPx4,OfferSize5,OfferPx5,
NumTrades,TotalVolumeTrade,TotalValueTrade,PE,PE1,PriceChange1,PriceChange2,Positions
FROM cb_hft_20130712
INTO OUTFILE '/tmp/export_cb_hft_20130712.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Query OK, 6127080 rows affected (2 min 55.04 sec)
Check the data file
~ ls -l /tmp
-rw-rw-rw- 1 mysql mysql 1068707117 Jul 19 15:59 export_cb_hft_20130712.csv
Log in to the c1.wtmart.com machine and download the data file
~ ssh cos@c1.wtmart.com
~ cd /home/cos/hadoop/sqldb
~ scp -P 10003 cos@d2.wtmart.com:/tmp/export_cb_hft_20130712.csv .
export_cb_hft_20130712.csv 100% 1019MB 39.2MB/s 00:26
Create the table in Hive
~ bin/hive shell

# Drop the table if it already exists
hive> DROP TABLE IF EXISTS t_hft_tmp;
Time taken: 4.898 seconds

# Create the t_hft_tmp table
hive> CREATE TABLE t_hft_tmp(
SecurityID STRING,TradeTime STRING,
PreClosePx DOUBLE,OpenPx DOUBLE,HighPx DOUBLE,LowPx DOUBLE,LastPx DOUBLE,
BidSize1 DOUBLE,BidPx1 DOUBLE,BidSize2 DOUBLE,BidPx2 DOUBLE,BidSize3 DOUBLE,BidPx3 DOUBLE,BidSize4 DOUBLE,BidPx4 DOUBLE,BidSize5 DOUBLE,BidPx5 DOUBLE,
OfferSize1 DOUBLE,OfferPx1 DOUBLE,OfferSize2 DOUBLE,OfferPx2 DOUBLE,OfferSize3 DOUBLE,OfferPx3 DOUBLE,OfferSize4 DOUBLE,OfferPx4 DOUBLE,OfferSize5 DOUBLE,OfferPx5 DOUBLE,
NumTrades INT,TotalVolumeTrade DOUBLE,TotalValueTrade DOUBLE,PE DOUBLE,PE1 DOUBLE,PriceChange1 DOUBLE,PriceChange2 DOUBLE,Positions DOUBLE
)
PARTITIONED BY (tradeDate INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Time taken: 0.189 seconds

# Load the data
hive> LOAD DATA LOCAL INPATH '/home/cos/hadoop/sqldb/export_cb_hft_20130712.csv' OVERWRITE INTO TABLE t_hft_tmp PARTITION (tradedate=20130712);
Copying data from file:/home/cos/hadoop/sqldb/export_cb_hft_20130712.csv
Copying file: file:/home/cos/hadoop/sqldb/export_cb_hft_20130712.csv
Loading data to table default.t_hft_tmp partition (tradedate=20130712)
Time taken: 16.535 seconds
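When data is loaded into a table, no transformation is applied to it. The LOAD operation simply copies the data to the location that backs the Hive table. This partition therefore contains a single file, which has not been split into multiple pieces.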
hive> dfs -ls /user/hive/warehouse/t_hft_tmp/tradedate=20130712;
Found 1 items
-rw-r--r-- 1 cos supergroup 1068707117 2013-07-19 16:07 /user/hive/warehouse/t_hft_tmp/tradedate=20130712/export_cb_hft_20130712.csv
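In the second import step, we split that one large file into several smaller files, each sized roughly to one 64 MB block. For a file of about 1 GB that works out to 16 buckets.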
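The bucket count follows directly from the file size and the block size. Here is a minimal Python sketch of that arithmetic, assuming the 64 MB block size mentioned above and the CSV size reported by `ls -l`; the function name `bucket_count` is hypothetical.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB block, as stated in the article


def bucket_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Smallest bucket count whose average bucket size fits in one block."""
    return math.ceil(file_size_bytes / block_size)


# 1068707117 is the size in bytes of export_cb_hft_20130712.csv
print(bucket_count(1068707117))
```

This is only an average: rows are distributed by hashing the clustering column, so individual bucket files can end up larger or smaller than one block.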
Create the new table t_hft_day, defining CLUSTERED BY, SORTED BY, and 16 BUCKETS
hive> CREATE TABLE t_hft_day(
SecurityID STRING,TradeTime STRING,
PreClosePx DOUBLE,OpenPx DOUBLE,HighPx DOUBLE,LowPx DOUBLE,LastPx DOUBLE,
BidSize1 DOUBLE,BidPx1 DOUBLE,BidSize2 DOUBLE,BidPx2 DOUBLE,BidSize3 DOUBLE,BidPx3 DOUBLE,BidSize4 DOUBLE,BidPx4 DOUBLE,BidSize5 DOUBLE,BidPx5 DOUBLE,
OfferSize1 DOUBLE,OfferPx1 DOUBLE,OfferSize2 DOUBLE,OfferPx2 DOUBLE,OfferSize3 DOUBLE,OfferPx3 DOUBLE,OfferSize4 DOUBLE,OfferPx4 DOUBLE,OfferSize5 DOUBLE,OfferPx5 DOUBLE,
NumTrades INT,TotalVolumeTrade DOUBLE,TotalValueTrade DOUBLE,PE DOUBLE,PE1 DOUBLE,PriceChange1 DOUBLE,PriceChange2 DOUBLE,Positions DOUBLE
)
PARTITIONED BY (tradeDate INT)
CLUSTERED BY(SecurityID) SORTED BY(TradeTime) INTO 16 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Import from the temporary table t_hft_tmp into the t_hft_day table
# Enforce bucketing on insert
hive> set hive.enforce.bucketing = true;

# Import the data
hive> FROM t_hft_tmp
INSERT OVERWRITE TABLE t_hft_day
PARTITION (tradedate=20130712)
SELECT SecurityID , TradeTime , PreClosePx ,OpenPx ,HighPx ,LowPx ,LastPx ,
BidSize1 ,BidPx1 ,BidSize2 ,BidPx2 ,BidSize3 ,BidPx3 ,BidSize4 ,BidPx4 ,BidSize5 ,BidPx5 ,
OfferSize1 ,OfferPx1 ,OfferSize2 ,OfferPx2 ,OfferSize3 ,OfferPx3 ,OfferSize4 ,OfferPx4 ,OfferSize5 ,OfferPx5 ,
NumTrades,TotalVolumeTrade ,TotalValueTrade ,PE ,PE1 ,PriceChange1 ,PriceChange2 ,Positions
WHERE tradedate=20130712;

MapReduce Total cumulative CPU time: 8 minutes 5 seconds 810 msec
Ended Job = job_201307191356_0016
Loading data to table default.t_hft_day partition (tradedate=20130712)
Partition default.t_hft_day{tradedate=20130712} stats: [num_files: 16, num_rows: 0, total_size: 1291728298, raw_data_size: 0]
Table default.t_hft_day stats: [num_partitions: 11, num_files: 176, num_rows: 0, total_size: 10425980914, raw_data_size: 0]
6127080 Rows loaded to t_hft_day
MapReduce Jobs Launched:
Job 0: Map: 4 Reduce: 16 Cumulative CPU: 485.81 sec HDFS Read: 1068771008 HDFS Write: 1291728298 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 5 seconds 810 msec
OK
Time taken: 172.617 seconds
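The import's cumulative CPU time is 8 minutes 5 seconds, that is, 8*60+5 = 485 seconds. Because the work is spread over 4 map tasks and 16 reduce tasks that run in parallel, the actual elapsed time is only about 172 seconds.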
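The relationship between the two timings can be expressed as a simple ratio. The sketch below, in Python, divides the cumulative CPU time by the elapsed time reported in the job log; the function name `effective_parallelism` is hypothetical, and the result is only a rough indicator because it ignores scheduling overhead and I/O wait.

```python
def effective_parallelism(cpu_seconds: float, wall_seconds: float) -> float:
    """Ratio of cumulative CPU time to elapsed (wall-clock) time."""
    return cpu_seconds / wall_seconds


# Figures from the job log: Cumulative CPU 485.81 sec, Time taken 172.617 sec
ratio = effective_parallelism(485.81, 172.617)
print(f"{ratio:.2f}")
```

A ratio well above 1 confirms that tasks overlapped in time; a ratio far below the number of tasks suggests that not all of them ran at once or that much of the elapsed time was spent outside the CPU.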
Now let's check whether the new table's data has been split into multiple files:
hive> dfs -ls /user/hive/warehouse/t_hft_day/tradedate=20130712;
Found 16 items
-rw-r--r-- 1 cos supergroup 95292536 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000000_0
-rw-r--r-- 1 cos supergroup 97136495 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000001_0
-rw-r--r-- 1 cos supergroup 90695623 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000002_0
-rw-r--r-- 1 cos supergroup 84132171 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000003_0
-rw-r--r-- 1 cos supergroup 81552397 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000004_0
-rw-r--r-- 1 cos supergroup 80580028 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000005_0
-rw-r--r-- 1 cos supergroup 73195335 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000006_0
-rw-r--r-- 1 cos supergroup 68648786 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000007_0
-rw-r--r-- 1 cos supergroup 72210159 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000008_0
-rw-r--r-- 1 cos supergroup 66851502 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000009_0
-rw-r--r-- 1 cos supergroup 69292538 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000010_0
-rw-r--r-- 1 cos supergroup 75282272 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000011_0
-rw-r--r-- 1 cos supergroup 79572724 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000012_0
-rw-r--r-- 1 cos supergroup 78151866 2013-07-19 16:19 /user/hive/warehouse/t_hft_day/tradedate=20130712/000013_0
-rw-r--r-- 1 cos supergroup 86850954 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000014_0
-rw-r--r-- 1 cos supergroup 92282912 2013-07-19 16:19 /user/hive/warehouse/t_hft_day/tradedate=20130712/000015_0
There are 16 files in total, one per bucket.
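To make the idea behind CLUSTERED BY(SecurityID) INTO 16 BUCKETS concrete: each row is routed to a bucket by hashing the clustering column and taking the result modulo the bucket count, so all rows for one security land in the same file. The Python sketch below illustrates this with a Java-style 31-based string hash. The hash function is an assumption made for illustration (the one Hive actually applies depends on its version), and the names `java_style_hash` and `bucket_for` are hypothetical.

```python
def java_style_hash(text: str) -> int:
    """31-based polynomial string hash, wrapped to a signed 32-bit integer."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * 31 + byte) & 0xFFFFFFFF  # keep the low 32 bits
    # Reinterpret the 32-bit value as signed, as a Java int would be.
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(key: str, num_buckets: int = 16) -> int:
    """Map a key to a bucket index in the range [0, num_buckets)."""
    # Clear the sign bit so the modulo result is never negative.
    return (java_style_hash(key) & 0x7FFFFFFF) % num_buckets


for security_id in ["000001", "000002", "000003"]:
    print(security_id, bucket_for(security_id))
```

Because the assignment depends only on the key, bucket sizes follow the distribution of the data, which is consistent with the uneven file sizes in the listing above.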
With the current 1 GB of data, a simple Hive query takes 34.974 seconds:
hive> select count(1) from t_hft_day where tradedate=20130712;
MapReduce Total cumulative CPU time: 34 seconds 670 msec
Ended Job = job_201307191356_0017
MapReduce Jobs Launched:
Job 0: Map: 7 Reduce: 1 Cumulative CPU: 34.67 sec HDFS Read: 1291793812 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 670 msec
6127080
Time taken: 34.974 seconds
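The equivalent count in MySQL took 3.29 seconds, as measured at the start.
That is a 10x difference, but with only 1 GB of data Hadoop cannot show its strengths.
Next, following the method above, we import a dozen or so days of data into Hive and compare again.
List the data sets already imported into Hive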
hive> SHOW PARTITIONS t_hft_day;
tradedate=20130627
tradedate=20130628
tradedate=20130701
tradedate=20130702
tradedate=20130703
tradedate=20130704
tradedate=20130705
tradedate=20130708
tradedate=20130709
tradedate=20130710
tradedate=20130712
tradedate=20130715
tradedate=20130716
tradedate=20130719
Time taken: 0.099 seconds
In MySQL, query 5 tables (about 5 GB of data).
# Single table: PreClosePx is not an indexed column; first run of the query
mysql> select SecurityID,20130719 as tradedate,count(1) as count from cb_hft_20130716 where PreClosePx>8.17 group by SecurityID limit 10;
+------------+-----------+-------+
| SecurityID | tradedate | count |
+------------+-----------+-------+
| 000001     |  20130719 |  5200 |
| 000002     |  20130719 |  5193 |
| 000003     |  20130719 |  1978 |
| 000004     |  20130719 |  3201 |
| 000005     |  20130719 |  1975 |
| 000006     |  20130719 |  1910 |
| 000007     |  20130719 |  3519 |
| 000008     |  20130719 |  4229 |
| 000009     |  20130719 |  5147 |
| 000010     |  20130719 |  2176 |
+------------+-----------+-------+
10 rows in set (24.60 sec)

# Multi-table query
select t.SecurityID,t.tradedate,t.count
from (
select SecurityID,20130710 as tradedate,count(1) as count from cb_hft_20130710 where PreClosePx>8.17 group by SecurityID
union
select SecurityID,20130712 as tradedate,count(1) as count from cb_hft_20130712 group by SecurityID
union
select SecurityID,20130715 as tradedate,count(1) as count from cb_hft_20130715 where PreClosePx>8.17 group by SecurityID
union
select SecurityID,20130716 as tradedate,count(1) as count from cb_hft_20130716 where PreClosePx>8.17 group by SecurityID
union
select SecurityID,20130719 as tradedate,count(1) as count from cb_hft_20130719 where PreClosePx>8.17 group by SecurityID
) as t limit 10

# More than 3 minutes elapsed with no result returned.
....
In Hive, query the same 5 tables (about 5 GB of data).
select SecurityID,tradedate,count(1) from t_hft_day where tradedate in (20130710,20130712,20130715,20130716,20130719) and PreClosePx>8.17 group by SecurityID,tradedate limit 10;
MapReduce Total cumulative CPU time: 3 minutes 56 seconds 540 msec
Ended Job = job_201307191356_0023
MapReduce Jobs Launched:
Job 0: Map: 25 Reduce: 7 Cumulative CPU: 236.54 sec HDFS Read: 6577084486 HDFS Write: 1470 SUCCESS
Total MapReduce CPU Time Spent: 3 minutes 56 seconds 540 msec
OK
000001 20130710 5813
000004 20130715 3546
000005 20130712 1820
000005 20130719 2364
000006 20130716 1910
000008 20130710 2426
000011 20130715 2113
000012 20130712 3554
000012 20130719 3756
000013 20130716 1646
Time taken: 66.32 seconds

# Query over all 14 tables (partitions) listed above
MapReduce Total cumulative CPU time: 8 minutes 40 seconds 380 msec
Ended Job = job_201307191356_0022
MapReduce Jobs Launched:
Job 0: Map: 53 Reduce: 15 Cumulative CPU: 520.38 sec HDFS Read: 14413501282 HDFS Write: 3146 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 40 seconds 380 msec
OK
000001 20130716 5200
000002 20130715 5535
000003 20130705 1634
000004 20130704 2173
000005 20130703 996
000005 20130712 1820
000006 20130702 1176
000007 20130701 2973
000007 20130710 4084
000010 20130716 2176
Time taken: 119.161 seconds
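We can see that Hadoop is not sensitive to data growth at the gigabyte scale: with nearly three times as many daily partitions (14 instead of 5, about 15 GB versus 5 GB), the query took a little under twice as long (119 seconds versus 66 seconds). In MySQL, by contrast, query time was already close to intolerable at 5 GB.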
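The scaling claim can be checked against the job logs. The Python sketch below compares the two Hive runs by bytes read from HDFS and by elapsed time, using only the figures printed above; the helper name `growth_ratio` is hypothetical.

```python
def growth_ratio(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


# 5-partition run versus 14-partition run, figures taken from the Hive logs
data_ratio = growth_ratio(6577084486, 14413501282)  # HDFS bytes read
time_ratio = growth_ratio(66.32, 119.161)           # elapsed seconds

print(f"data x{data_ratio:.2f}, time x{time_ratio:.2f}")
```

Measured by bytes actually read rather than by partition count, the data grew somewhat less than threefold, while elapsed time still grew more slowly than the data.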
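Data under 1 GB can be handled on a single machine, and MySQL completes such queries very well. Hadoop only shows its advantage on large data volumes: at around 10 GB, MySQL single-table queries start to fall short, and at 100 GB MySQL can no longer cope without various optimizations.
The test procedure has been described clearly; our next job is to automate it.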
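As a starting point for that automation, here is a minimal Python sketch that generates the per-day statements used in this walkthrough: the MySQL export, the Hive load into t_hft_tmp, and the bucketed insert into t_hft_day. Table names, paths, and columns are copied from the steps above; the function name `daily_statements` is hypothetical, and copying the CSV between hosts and executing the statements are deliberately left out.

```python
from datetime import date
from typing import List

# Column list copied from the export query in the article.
COLUMNS = (
    "SecurityID,TradeTime,PreClosePx,OpenPx,HighPx,LowPx,LastPx,"
    "BidSize1,BidPx1,BidSize2,BidPx2,BidSize3,BidPx3,BidSize4,BidPx4,"
    "BidSize5,BidPx5,"
    "OfferSize1,OfferPx1,OfferSize2,OfferPx2,OfferSize3,OfferPx3,"
    "OfferSize4,OfferPx4,OfferSize5,OfferPx5,"
    "NumTrades,TotalVolumeTrade,TotalValueTrade,PE,PE1,"
    "PriceChange1,PriceChange2,Positions"
)


def daily_statements(day: date,
                     local_dir: str = "/home/cos/hadoop/sqldb") -> List[str]:
    """Return the MySQL export and Hive load statements for one trading day."""
    tag = f"{day:%Y%m%d}"
    csv_name = f"export_cb_hft_{tag}.csv"
    return [
        # 1. MySQL: dump the dated table to CSV on the database host.
        f"SELECT {COLUMNS} FROM cb_hft_{tag} "
        f"INTO OUTFILE '/tmp/{csv_name}' "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n';",
        # 2. Hive: load the downloaded CSV into the staging table.
        f"LOAD DATA LOCAL INPATH '{local_dir}/{csv_name}' "
        f"OVERWRITE INTO TABLE t_hft_tmp PARTITION (tradedate={tag});",
        # 3. Hive: enforce bucketing, then rewrite into the bucketed table.
        "set hive.enforce.bucketing = true;",
        f"FROM t_hft_tmp INSERT OVERWRITE TABLE t_hft_day "
        f"PARTITION (tradedate={tag}) "
        f"SELECT {COLUMNS} WHERE tradedate={tag};",
    ]


for stmt in daily_statements(date(2013, 7, 12)):
    print(stmt)
```

Generating the statements as plain strings keeps the sketch independent of any particular MySQL or Hive client; a real script would pass them to whichever client is available on each host.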