Big Data: Importing 10G of Data with Hive

Preface

With the Hadoop and Hive environments up and running, it is time to load some data and run tests. In my data set, 1G corresponds to roughly 5 million rows; a query over 5 million rows takes about 3.29 seconds in MySQL and about 30 seconds for the same query in Hive. Let's grow the data to 10G, then 100G, and see how Hive holds up.

Contents

  1. Export the data from MySQL
  2. Import into Hive
  3. Optimize the import with Hive buckets
  4. Run queries

1. Export the data from MySQL

Below are my tables. A new table is created each day and named after the date. Today is July 19, 2013, and the corresponding table is cb_hft, holding about 6.46 million records.

mysql> show tables;
+-----------------+
| Tables_in_CB    |
+-----------------+
| NSpremium       |
| cb_hft          |
| cb_hft_20130710 |
| cb_hft_20130712 |
| cb_hft_20130715 |
| cb_hft_20130716 |
+-----------------+
6 rows in set (0.00 sec)

mysql> select count(1) from cb_hft;
+----------+
| count(1) |
+----------+
|  6461338 |
+----------+
1 row in set (3.29 sec)

Quickly cloning the table:
Since this table belongs to an offline system with no online application, I rename cb_hft to cb_hft_20130719 and then recreate the table structure.

mysql> RENAME TABLE cb_hft TO cb_hft_20130719;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE cb_hft like cb_hft_20130719;
Query OK, 0 rows affected (0.02 sec)

mysql> show tables;
+-----------------+
| Tables_in_CB    |
+-----------------+
| NSpremium       |
| cb_hft          |
| cb_hft_20130710 |
| cb_hft_20130712 |
| cb_hft_20130715 |
| cb_hft_20130716 |
| cb_hft_20130719 |
+-----------------+
7 rows in set (0.00 sec)

Export the table to CSV
Taking the cb_hft_20130712 table as an example:

mysql> SELECT
  SecurityID,TradeTime,PreClosePx,OpenPx,HighPx,LowPx,LastPx, 
  BidSize1,BidPx1,BidSize2,BidPx2,BidSize3,BidPx3,BidSize4,BidPx4,BidSize5,BidPx5,
  OfferSize1,OfferPx1,OfferSize2,OfferPx2,OfferSize3,OfferPx3,OfferSize4,OfferPx4,OfferSize5,OfferPx5,
  NumTrades,TotalVolumeTrade,TotalValueTrade,PE,PE1,PriceChange1,PriceChange2,Positions
FROM cb_hft_20130712
INTO OUTFILE '/tmp/export_cb_hft_20130712.csv'  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'; 

Query OK, 6127080 rows affected (2 min 55.04 sec)

Check the exported file:

~ ls -l /tmp
-rw-rw-rw- 1 mysql mysql 1068707117 Jul 19 15:59 export_cb_hft_20130712.csv

2. Import into Hive

Log in to the c1.wtmart.com machine and download the data file:

~ ssh cos@c1.wtmart.com
~ cd /home/cos/hadoop/sqldb
~ scp -P 10003 cos@d2.wtmart.com:/tmp/export_cb_hft_20130712.csv .
export_cb_hft_20130712.csv                                                                 100% 1019MB  39.2MB/s   00:26

Create the table in Hive:

~ bin/hive shell

# Drop the table if it already exists
hive> DROP TABLE IF EXISTS t_hft_tmp;
Time taken: 4.898 seconds

# Create the t_hft_tmp table
hive> CREATE TABLE t_hft_tmp(
  SecurityID STRING,TradeTime STRING,
  PreClosePx DOUBLE,OpenPx DOUBLE,HighPx DOUBLE,LowPx DOUBLE,LastPx DOUBLE,
  BidSize1 DOUBLE,BidPx1 DOUBLE,BidSize2 DOUBLE,BidPx2 DOUBLE,BidSize3 DOUBLE,BidPx3 DOUBLE,BidSize4 DOUBLE,BidPx4 DOUBLE,BidSize5 DOUBLE,BidPx5 DOUBLE,
  OfferSize1 DOUBLE,OfferPx1 DOUBLE,OfferSize2 DOUBLE,OfferPx2 DOUBLE,OfferSize3 DOUBLE,OfferPx3 DOUBLE,OfferSize4 DOUBLE,OfferPx4 DOUBLE,OfferSize5 DOUBLE,OfferPx5 DOUBLE,
  NumTrades INT,TotalVolumeTrade DOUBLE,TotalValueTrade DOUBLE,PE DOUBLE,PE1 DOUBLE,PriceChange1 DOUBLE,PriceChange2 DOUBLE,Positions DOUBLE
) PARTITIONED BY (tradeDate INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Time taken: 0.189 seconds

# Load the data
hive> LOAD DATA LOCAL INPATH '/home/cos/hadoop/sqldb/export_cb_hft_20130712.csv' OVERWRITE INTO TABLE t_hft_tmp PARTITION (tradedate=20130712);
Copying data from file:/home/cos/hadoop/sqldb/export_cb_hft_20130712.csv
Copying file: file:/home/cos/hadoop/sqldb/export_cb_hft_20130712.csv
Loading data to table default.t_hft_tmp partition (tradedate=20130712)
Time taken: 16.535 seconds

No transformation is applied to the data when it is loaded into the table. The LOAD operation simply copies the file to the table's location in HDFS, so this table contains a single file that has not been split into multiple parts.

hive> dfs -ls /user/hive/warehouse/t_hft_tmp/tradedate=20130712;
Found 1 items
-rw-r--r--   1 cos supergroup 1068707117 2013-07-19 16:07 /user/hive/warehouse/t_hft_tmp/tradedate=20130712/export_cb_hft_20130712.csv

3. Optimize the import with Hive buckets

For this second import pass, we split the single large file from the previous step into multiple smaller files, aiming for roughly one 64 MB HDFS block per file, so we set up 16 buckets.
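As a quick sanity check on that bucket count (a rough sketch; the 1,068,707,117-byte file size comes from the earlier ls output, and 64 MB is the assumed HDFS block size):

~ echo "scale=1; 1068707117 / (64*1024*1024)" | bc
15.9

Just under 16 blocks, so 16 buckets works out to roughly one 64 MB block per bucket file.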

Create a new table t_hft_day, defined with CLUSTERED BY, SORTED BY, and 16 BUCKETS:

hive> CREATE TABLE t_hft_day(
  SecurityID STRING,TradeTime STRING,
  PreClosePx DOUBLE,OpenPx DOUBLE,HighPx DOUBLE,LowPx DOUBLE,LastPx DOUBLE,
  BidSize1 DOUBLE,BidPx1 DOUBLE,BidSize2 DOUBLE,BidPx2 DOUBLE,BidSize3 DOUBLE,BidPx3 DOUBLE,BidSize4 DOUBLE,BidPx4 DOUBLE,BidSize5 DOUBLE,BidPx5 DOUBLE,
  OfferSize1 DOUBLE,OfferPx1 DOUBLE,OfferSize2 DOUBLE,OfferPx2 DOUBLE,OfferSize3 DOUBLE,OfferPx3 DOUBLE,OfferSize4 DOUBLE,OfferPx4 DOUBLE,OfferSize5 DOUBLE,OfferPx5 DOUBLE,
  NumTrades INT,TotalVolumeTrade DOUBLE,TotalValueTrade DOUBLE,PE DOUBLE,PE1 DOUBLE,PriceChange1 DOUBLE,PriceChange2 DOUBLE,Positions DOUBLE
) PARTITIONED BY (tradeDate INT)
CLUSTERED BY(SecurityID) SORTED BY(TradeTime) INTO 16 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Load the data from the temporary table t_hft_tmp into t_hft_day:

# Enforce bucketing during the insert
hive> set hive.enforce.bucketing = true;

# Insert the data
hive> FROM t_hft_tmp
INSERT OVERWRITE TABLE t_hft_day
PARTITION (tradedate=20130712)
SELECT SecurityID , TradeTime ,
  PreClosePx ,OpenPx ,HighPx ,LowPx ,LastPx ,
  BidSize1 ,BidPx1 ,BidSize2 ,BidPx2 ,BidSize3 ,BidPx3 ,BidSize4 ,BidPx4 ,BidSize5 ,BidPx5 ,
  OfferSize1 ,OfferPx1 ,OfferSize2 ,OfferPx2 ,OfferSize3 ,OfferPx3 ,OfferSize4 ,OfferPx4 ,OfferSize5 ,OfferPx5 ,
  NumTrades,TotalVolumeTrade ,TotalValueTrade ,PE ,PE1 ,PriceChange1 ,PriceChange2 ,Positions 
WHERE tradedate=20130712;

MapReduce Total cumulative CPU time: 8 minutes 5 seconds 810 msec
Ended Job = job_201307191356_0016
Loading data to table default.t_hft_day partition (tradedate=20130712)
Partition default.t_hft_day{tradedate=20130712} stats: [num_files: 16, num_rows: 0, total_size: 1291728298, raw_data_size: 0]
Table default.t_hft_day stats: [num_partitions: 11, num_files: 176, num_rows: 0, total_size: 10425980914, raw_data_size: 0]
6127080 Rows loaded to t_hft_day
MapReduce Jobs Launched:
Job 0: Map: 4  Reduce: 16   Cumulative CPU: 485.81 sec   HDFS Read: 1068771008 HDFS Write: 1291728298 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 5 seconds 810 msec
OK
Time taken: 172.617 seconds

The cumulative CPU time of the import is 8 minutes 5 seconds, i.e. 8*60+5 = 485 seconds. Because 4 map tasks and 16 reduce tasks ran in parallel, the actual wall-clock time was 172 seconds.

Now let's check whether the new table's data has been split into multiple files:

hive> dfs -ls /user/hive/warehouse/t_hft_day/tradedate=20130712;
Found 16 items
-rw-r--r--   1 cos supergroup   95292536 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000000_0
-rw-r--r--   1 cos supergroup   97136495 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000001_0
-rw-r--r--   1 cos supergroup   90695623 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000002_0
-rw-r--r--   1 cos supergroup   84132171 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000003_0
-rw-r--r--   1 cos supergroup   81552397 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000004_0
-rw-r--r--   1 cos supergroup   80580028 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000005_0
-rw-r--r--   1 cos supergroup   73195335 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000006_0
-rw-r--r--   1 cos supergroup   68648786 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000007_0
-rw-r--r--   1 cos supergroup   72210159 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000008_0
-rw-r--r--   1 cos supergroup   66851502 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000009_0
-rw-r--r--   1 cos supergroup   69292538 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000010_0
-rw-r--r--   1 cos supergroup   75282272 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000011_0
-rw-r--r--   1 cos supergroup   79572724 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000012_0
-rw-r--r--   1 cos supergroup   78151866 2013-07-19 16:19 /user/hive/warehouse/t_hft_day/tradedate=20130712/000013_0
-rw-r--r--   1 cos supergroup   86850954 2013-07-19 16:18 /user/hive/warehouse/t_hft_day/tradedate=20130712/000014_0
-rw-r--r--   1 cos supergroup   92282912 2013-07-19 16:19 /user/hive/warehouse/t_hft_day/tradedate=20130712/000015_0

There are 16 bucket files in total.

4. Run queries

With the current 1G file, a simple query in Hive takes 34.974 seconds:

hive> select count(1) from t_hft_day where tradedate=20130712;

MapReduce Total cumulative CPU time: 34 seconds 670 msec
Ended Job = job_201307191356_0017
MapReduce Jobs Launched:
Job 0: Map: 7  Reduce: 1   Cumulative CPU: 34.67 sec   HDFS Read: 1291793812 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 670 msec
6127080
Time taken: 34.974 seconds

MySQL runs the same query in 3.29 seconds, as measured at the beginning.
That is a gap of roughly 10x, but with only 1G of data Hadoop has no chance to show its advantages.

Next, following the method above, we load a dozen-plus days of data into Hive and run the comparison again (a sketch of the per-day load is shown below).
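
A rough sketch of how the remaining days can be loaded (not the exact commands I ran): it assumes each day's CSV has already been exported to /home/cos/hadoop/sqldb with the same naming scheme as above, that it runs from the same directory as the earlier bin/hive shell command, and the dates in the loop are only examples.

for d in 20130715 20130716 20130719; do
  # load the raw CSV into the staging partition, then re-insert into the bucketed table
  bin/hive -e "
    LOAD DATA LOCAL INPATH '/home/cos/hadoop/sqldb/export_cb_hft_${d}.csv'
    OVERWRITE INTO TABLE t_hft_tmp PARTITION (tradedate=${d});
    SET hive.enforce.bucketing=true;
    INSERT OVERWRITE TABLE t_hft_day PARTITION (tradedate=${d})
    SELECT SecurityID,TradeTime,PreClosePx,OpenPx,HighPx,LowPx,LastPx,
      BidSize1,BidPx1,BidSize2,BidPx2,BidSize3,BidPx3,BidSize4,BidPx4,BidSize5,BidPx5,
      OfferSize1,OfferPx1,OfferSize2,OfferPx2,OfferSize3,OfferPx3,OfferSize4,OfferPx4,OfferSize5,OfferPx5,
      NumTrades,TotalVolumeTrade,TotalValueTrade,PE,PE1,PriceChange1,PriceChange2,Positions
    FROM t_hft_tmp WHERE tradedate=${d};"
done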

Check the partitions already loaded into Hive:

hive> SHOW PARTITIONS t_hft_day;
tradedate=20130627
tradedate=20130628
tradedate=20130701
tradedate=20130702
tradedate=20130703
tradedate=20130704
tradedate=20130705
tradedate=20130708
tradedate=20130709
tradedate=20130710
tradedate=20130712
tradedate=20130715
tradedate=20130716
tradedate=20130719
Time taken: 0.099 seconds

In MySQL, query 5 of the daily tables (about 5G of data).

# Single table: PreClosePx is not an indexed column, and this is the first run of the query
mysql> select SecurityID,20130719 as tradedate,count(1) as count from cb_hft_20130716 where PreClosePx>8.17 group by SecurityID limit 10;
+------------+-----------+-------+
| SecurityID | tradedate | count |
+------------+-----------+-------+
| 000001     |  20130719 |  5200 |
| 000002     |  20130719 |  5193 |
| 000003     |  20130719 |  1978 |
| 000004     |  20130719 |  3201 |
| 000005     |  20130719 |  1975 |
| 000006     |  20130719 |  1910 |
| 000007     |  20130719 |  3519 |
| 000008     |  20130719 |  4229 |
| 000009     |  20130719 |  5147 |
| 000010     |  20130719 |  2176 |
+------------+-----------+-------+
10 rows in set (24.60 sec)

# Multi-table query
select t.SecurityID,t.tradedate,t.count 
from (
select SecurityID,20130710 as tradedate,count(1) as count from cb_hft_20130710 where PreClosePx>8.17 group by SecurityID
union
select SecurityID,20130712 as tradedate,count(1) as count from cb_hft_20130712 group by SecurityID
union
select SecurityID,20130715 as tradedate,count(1) as count from cb_hft_20130715 where PreClosePx>8.17 group by SecurityID
union
select SecurityID,20130716 as tradedate,count(1) as count from cb_hft_20130716 where PreClosePx>8.17 group by SecurityID
union
select SecurityID,20130719 as tradedate,count(1) as count from cb_hft_20130719 where PreClosePx>8.17 group by SecurityID ) as t 
limit 10

# No result returned after more than 3 minutes.
....

In Hive, run the same query against the same 5 days of data (about 5G).

select SecurityID,tradedate,count(1) from t_hft_day where tradedate in (20130710,20130712,20130715,20130716,20130719) and PreClosePx>8.17 group by SecurityID,tradedate limit 10;

MapReduce Total cumulative CPU time: 3 minutes 56 seconds 540 msec
Ended Job = job_201307191356_0023
MapReduce Jobs Launched:
Job 0: Map: 25  Reduce: 7   Cumulative CPU: 236.54 sec   HDFS Read: 6577084486 HDFS Write: 1470 SUCCESS
Total MapReduce CPU Time Spent: 3 minutes 56 seconds 540 msec
OK
000001  20130710        5813
000004  20130715        3546
000005  20130712        1820
000005  20130719        2364
000006  20130716        1910
000008  20130710        2426
000011  20130715        2113
000012  20130712        3554
000012  20130719        3756
000013  20130716        1646
Time taken: 66.32 seconds

# The same query, over all 14 partitions listed above
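# (The exact statement for this run is not shown; presumably it was the same
#  aggregation with the PreClosePx filter but without the tradedate restriction,
#  along the lines of:
#  select SecurityID,tradedate,count(1) from t_hft_day
#  where PreClosePx>8.17 group by SecurityID,tradedate limit 10;)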
MapReduce Total cumulative CPU time: 8 minutes 40 seconds 380 msec
Ended Job = job_201307191356_0022
MapReduce Jobs Launched:
Job 0: Map: 53  Reduce: 15   Cumulative CPU: 520.38 sec   HDFS Read: 14413501282 HDFS Write: 3146 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 40 seconds 380 msec
OK
000001  20130716        5200
000002  20130715        5535
000003  20130705        1634
000004  20130704        2173
000005  20130703        996
000005  20130712        1820
000006  20130702        1176
000007  20130701        2973
000007  20130710        4084
000010  20130716        2176
Time taken: 119.161 seconds

We can see that Hadoop is largely insensitive to data growth at the gigabyte scale: with 3x the data (15G instead of 5G), the query took only about twice as long. With MySQL, by the time the data reaches 5G the query time is already close to unbearable.

Data sets under 1G can be handled on a single machine, and MySQL completes such queries very well. Hadoop only shows its advantage when the data volume is large: at around 10G, MySQL single-table queries already start to run out of steam, and at 100G MySQL can no longer cope without all kinds of specially optimized programs to get the queries done.

The test procedure has been described in full; the next task is to automate the whole process. A rough sketch of the export side is given below.
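
As a starting point, here is a rough sketch of the export-and-transfer side, run from c1.wtmart.com. It assumes the cos account can invoke the mysql client on d2.wtmart.com non-interactively (for example with credentials in ~/.my.cnf); the dates in the loop are only examples.

for d in 20130710 20130712 20130715; do
  # export that day's table to CSV on the MySQL host (d2.wtmart.com)
  ssh -p 10003 cos@d2.wtmart.com mysql CB <<SQL
SELECT SecurityID,TradeTime,PreClosePx,OpenPx,HighPx,LowPx,LastPx,
  BidSize1,BidPx1,BidSize2,BidPx2,BidSize3,BidPx3,BidSize4,BidPx4,BidSize5,BidPx5,
  OfferSize1,OfferPx1,OfferSize2,OfferPx2,OfferSize3,OfferPx3,OfferSize4,OfferPx4,OfferSize5,OfferPx5,
  NumTrades,TotalVolumeTrade,TotalValueTrade,PE,PE1,PriceChange1,PriceChange2,Positions
FROM cb_hft_${d}
INTO OUTFILE '/tmp/export_cb_hft_${d}.csv' FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
SQL
  # pull the CSV over to the Hive node
  scp -P 10003 cos@d2.wtmart.com:/tmp/export_cb_hft_${d}.csv /home/cos/hadoop/sqldb/
done

Chained with the hive -e loop sketched earlier, this covers the full export-and-load cycle for each new trading day.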
