MySQL Time-Based Queries on Millions of Rows

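While at school I never worked with million-row data sets, so I had no feel for how queries behave at that scale.
For my graduation project I need a fairly large data set for chart analysis and order recommendation.
I kept wondering how to design a table so that ordinary SQL runs more efficiently, so I ran the experiments below and recorded them here.
This article focuses on how table design affects query SQL.
I am no expert and this is only a record; if you have a better idea or spot a mistake, please leave a comment and I will respond.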

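Optimization ideas

  1. Store the time as an INT Unix timestamp and index it
  2. Split the time into year, month, and day columns, with indexes on year+month and year+month+day (trading space for time)
  3. Store the time as a timestamp and partition the table by month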

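Test environment

- Storage engine: InnoDB
- Character set: utf8mb4
- Database version: 5.7.18
The storage engine is the default, InnoDB; the real workload needs transactions, so I kept the default.
To avoid interference from other factors, the tests ran on a `Tencent Cloud MySQL Basic Edition instance, 1 core / 1000 MB / 50 GB`.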

Creating the table

Create the order table

-- Create the order table
CREATE TABLE tb_order (
    `id` BIGINT ( 20 ) NOT NULL AUTO_INCREMENT,
    `item_name` VARCHAR ( 255 ) NOT NULL,
    `item_price` INT ( 11 ) UNSIGNED NOT NULL,
    `item_state` TINYINT ( 1 ) NOT NULL,
    `create_time` INT ( 11 ) UNSIGNED NOT NULL,
    `time_year` CHAR ( 4 ) NOT NULL,
    `time_month` CHAR ( 2 ) NOT NULL,
    `time_day` CHAR ( 2 ) NOT NULL,
    PRIMARY KEY ( id ) 
) ENGINE = INNODB DEFAULT CHARSET = utf8mb4;
CREATE INDEX idx_order_ctime ON tb_order ( create_time );
CREATE INDEX idx_order_state ON tb_order ( `item_state` );
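-- (year, month) index for monthly queries; it is a leftmost prefix of the index below
CREATE INDEX idx_order_month ON tb_order ( `time_year`, `time_month` );
-- (year, month, day) index for daily queries
CREATE INDEX idx_order_day ON tb_order ( `time_year`, `time_month`, `time_day`);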

Create the mock stored procedures

-- Random string function
CREATE DEFINER=`root`@`%` FUNCTION `rand_string`(n INT) RETURNS varchar(255) CHARSET utf8
BEGIN 
    DECLARE chars_str varchar(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; 
    DECLARE return_str varchar(255) DEFAULT '' ;
    DECLARE i INT DEFAULT 0; 
WHILE i < n DO 
    SET return_str = concat(return_str,substring(chars_str , FLOOR(1 + RAND()*62 ),1)); 
    SET i = i +1; 
END WHILE; 
    RETURN return_str; 
END

-- Random datetime within a range
CREATE DEFINER=`root`@`%` FUNCTION `rand_date`(`startDate` date,`endDate` date) RETURNS datetime
BEGIN
    -- Add a random offset of 1 to sec-1 seconds to startDate
    DECLARE sec INT DEFAULT 0;
    DECLARE ret DATETIME;
    SET sec = ABS(UNIX_TIMESTAMP(endDate) - UNIX_TIMESTAMP(startDate));
    SET ret = DATE_ADD(startDate, INTERVAL FLOOR( 1+RAND ( ) * ( sec-1))SECOND);
    RETURN ret;
END

-- Stored procedure that generates mock orders
CREATE DEFINER=`root`@`%` PROCEDURE `mock_order`(IN `size` int UNSIGNED,IN `sd` date,IN `ed` date)
BEGIN
    -- Insert `size` rows with a random name, price, state, and order time between sd and ed
    DECLARE i int UNSIGNED DEFAULT 1;
    DECLARE randOrderName VARCHAR(10);
    DECLARE randOrderTime DATETIME;
    WHILE i<= size DO
        SELECT rand_string(10) INTO randOrderName;
        SELECT rand_date(sd,ed) INTO randOrderTime;
        INSERT INTO tb_order(`item_name`,`item_price`,`item_state`,`create_time`,`time_year`,`time_month`,`time_day`)
        VALUES(randOrderName,RAND()*100,ROUND(RAND()),UNIX_TIMESTAMP(randOrderTime),DATE_FORMAT(randOrderTime,'%Y'),DATE_FORMAT(randOrderTime,'%m'),DATE_FORMAT(randOrderTime,'%d'));
        SET i = i+1;

    END WHILE;
END

-- Run the stored procedure
CALL mock_order(1200000,'2020-03-01','2020-12-31');

-- CALL mock_order(FLOOR(RAND() * 9999999),'2020-03-01','2020-12-31')
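
With autocommit on, each of the 1.2 million INSERT statements inside mock_order commits on its own, which tends to make the load slow on InnoDB. A minimal sketch of one way to reduce that, assuming the procedure contains no statement that forces an implicit commit (true for the version above), is to wrap the call in an explicit transaction:

```sql
-- Sketch: run the whole mock load in one transaction so InnoDB
-- does not commit after every inserted row.
START TRANSACTION;
CALL mock_order(1200000, '2020-03-01', '2020-12-31');
COMMIT;
```

A single transaction of this size produces a large undo log, so several smaller calls, each in its own transaction, may be a reasonable alternative.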

Results

Total rows:

SELECT count(*) FROM tb_order
> OK
> Time: 0.607s

Data distribution:

SELECT 
    count(*) as count,
    DATE_FORMAT(FROM_UNIXTIME(create_time),'%Y-%m') as date
FROM
    tb_order
GROUP BY 
    date
ORDER BY
    date DESC
> OK
> Time: 5.007s

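Test procedure

  1. Count orders in a month
  2. Sum order totals for a month
  3. Count orders by state for a month
  4. Query one day's orders
  5. Sum order totals for a day
  6. Page through a month's orders

Timestamp

1. Count orders in a month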

SELECT
    count( 1 ) 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
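> Time: 6.348s

Aside: I was not sure whether calling UNIX_TIMESTAMP() affects the query time, so I ran the query several times with UNIX_TIMESTAMP() and with literal timestamp values and compared the timings.
Run      With UNIX_TIMESTAMP()  Literal timestamp
1        6.105s                 5.768s
2        6.239s                 6.199s
3        5.681s                 6.161s
4        5.687s                 5.605s
5        6.118s                 5.621s
Average  5.966s                 5.870s
The difference is small, about 0.09s on average, and a bare select UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) by itself takes 0.045s, so all tests below use UNIX_TIMESTAMP().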
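
If the per-query conversion is still a concern, the bounds can be computed once into user variables and reused. The sketch below also uses a half-open range (at or after the start, before the next month), so the last second of the month does not have to be spelled out. The variable names @month_start and @month_end are made up for illustration.

```sql
-- Compute the bounds once; the values depend on the session time zone.
SET @month_start = UNIX_TIMESTAMP('2020-04-01 00:00:00');
SET @month_end   = UNIX_TIMESTAMP('2020-05-01 00:00:00');

-- Half-open range: includes 2020-04-01 00:00:00, excludes 2020-05-01 00:00:00.
SELECT
    count( 1 )
FROM
    tb_order
WHERE
    item_state = 1
    AND create_time >= @month_start
    AND create_time < @month_end;
```

Because create_time holds whole seconds, this selects the same rows as the BETWEEN form ending at 23:59:59.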

2. Sum order totals for a month

SELECT
    sum(item_price) as total
FROM
    tb_order 
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> Time: 5.901s

3. Count orders by state for a month

SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
GROUP BY
    item_state
> OK
> Time: 12.126s

This is too slow to be acceptable.
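
One design change that might help with the 12-second query above, though it was not part of the original test and is unmeasured on this data set, is a composite index holding both the range column and the grouped column, so that the query can be answered from the index without visiting the rows. The index name idx_order_ctime_state is hypothetical.

```sql
-- Hypothetical composite index: range column first, then the grouped column,
-- so item_state can be read from the index itself.
CREATE INDEX idx_order_ctime_state ON tb_order ( create_time, item_state );

-- Check which index the optimizer picks; "Using index" in the Extra column
-- would indicate a covering-index read.
EXPLAIN
SELECT
    count( * ) AS count,
    item_state AS state
FROM
    tb_order
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
GROUP BY
    item_state;
```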

4. Query one day's orders

SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-01 23:59:59' )
> OK
> Time: 0.54s

5. Sum order totals for a day

SELECT
    SUM(item_price) as total
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-01 23:59:59' )
> OK
> Time: 0.154s

6. Page through a month's orders

SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
> Time: 0.107s

Split time columns

1. Count orders in a month

SELECT
    count(1) 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND time_year = '2020' and time_month = '04'
> OK
> Time: 1.329s

2. Sum order totals for a month

SELECT
    sum( item_price ) AS total 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND time_year = '2020' 
    AND time_month = '04'
> OK
> Time: 1.23s

3. Count orders by state for a month

SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order 
WHERE
    time_year = '2020' 
    AND time_month = '04'
GROUP BY
    item_state
> OK
> Time: 1.429s

4. Query one day's orders

SELECT
    * 
FROM
    tb_order 
WHERE
    time_year = '2020' and time_month = '04' and time_day = '01'
> OK
> Time: 0.663s

5. Sum order totals for a day

SELECT
    SUM(item_price) as total
FROM
    tb_order 
WHERE
    time_year = '2020' AND time_month = '04' AND time_day = '01'
> OK
> Time: 0.091s

6. Page through a month's orders

SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
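> Time: 0.107s

The query above still filters on create_time, and its rows come back in ascending create_time order because it scans the timestamp index.
A query on the split columns alone has no such order,
so an ORDER BY is added to make the results comparable: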
SELECT
    * 
FROM
    tb_order 
WHERE
    time_year = '2020' 
    AND time_month = '04' 
ORDER BY
    create_time  
    LIMIT 10000, 20
> OK
> Time: 1.599s
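
The extra time in the ORDER BY version most likely comes from sorting the month's rows before the LIMIT is applied. A possible remedy, not measured in this test, is an index that ends with the sort column: with equality conditions on time_year and time_month, the entries in such an index are already ordered by create_time. The index name below is hypothetical.

```sql
-- Hypothetical index: equality columns first, sort column last.
CREATE INDEX idx_order_month_ctime ON tb_order ( `time_year`, `time_month`, create_time );

-- If the index serves both the filter and the ordering, EXPLAIN should
-- no longer show "Using filesort" in the Extra column.
EXPLAIN
SELECT
    *
FROM
    tb_order
WHERE
    time_year = '2020'
    AND time_month = '04'
ORDER BY
    create_time
    LIMIT 10000, 20;
```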

Partitioning by time

Adding partitions changes the table's on-disk file layout, so I work on a copy of the table.
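
The copy step is not shown in the original. A minimal sketch of one way to do it, assuming the copy is named tb_order_2 as in the statements that follow:

```sql
-- Create an empty table with the same columns and indexes as tb_order.
CREATE TABLE tb_order_2 LIKE tb_order;

-- Copy all rows; with 1.2 million rows this may take a while.
INSERT INTO tb_order_2 SELECT * FROM tb_order;
```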

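1. Rebuild the primary key

Every column used in the partitioning expression must be part of every unique key on the table, including the primary key, so the original primary key has to be dropped and recreated with create_time included.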
ALTER TABLE tb_order_2 CHANGE COLUMN id id BIGINT(20) UNSIGNED NOT NULL;
alter table tb_order_2 DROP PRIMARY key;
alter table tb_order_2 add PRIMARY key(id,create_time);
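
Note that the first statement above also removes AUTO_INCREMENT from id, because an AUTO_INCREMENT column must be part of a key and the primary key is about to be dropped. As an alternative sketch, dropping and re-adding the primary key in a single ALTER keeps id as the leading column of a key, so AUTO_INCREMENT can stay:

```sql
-- Alternative: swap the primary key in one statement so that id
-- keeps its AUTO_INCREMENT attribute.
ALTER TABLE tb_order_2
    DROP PRIMARY KEY,
    ADD PRIMARY KEY ( id, create_time );
```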

2. Create the partitions

-- Each boundary is 00:00:00 (UTC+8) on the first day of the following month
ALTER TABLE tb_order_2 PARTITION BY RANGE(create_time)(
PARTITION  p_2020_01  VALUES LESS THAN (1580486400),
PARTITION  p_2020_02  VALUES LESS THAN (1582992000),
PARTITION  p_2020_03  VALUES LESS THAN (1585670400),
PARTITION  p_2020_04  VALUES LESS THAN (1588262400),
PARTITION  p_2020_05  VALUES LESS THAN (1590940800),
PARTITION  p_2020_06  VALUES LESS THAN (1593532800),
PARTITION  p_2020_07  VALUES LESS THAN (1596211200),
PARTITION  p_2020_08  VALUES LESS THAN (1598889600),
PARTITION  p_2020_09  VALUES LESS THAN (1601481600),
PARTITION  p_2020_10  VALUES LESS THAN (1604160000),
PARTITION  p_2020_11  VALUES LESS THAN (1606752000),
PARTITION  p_2020_12  VALUES LESS THAN MAXVALUE
);
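
The boundary values are Unix timestamps for the first second of the following month, and they match a session time zone of UTC+8 (for example, 1580486400 is 2020-02-01 00:00:00 at UTC+8). Because the last partition is bounded by MAXVALUE, a new month cannot simply be appended with ADD PARTITION. A sketch of one way to extend the table into 2021 is to split the catch-all partition; the partition name p_max is an assumption for illustration.

```sql
-- Confirm a boundary value (returns 1609430400 with the time zone set below).
SET time_zone = '+08:00';
SELECT UNIX_TIMESTAMP( '2021-01-01 00:00:00' );

-- Split the MAXVALUE partition: December 2020 gets an upper bound,
-- and a new catch-all partition takes everything from 2021 onward.
ALTER TABLE tb_order_2 REORGANIZE PARTITION p_2020_12 INTO (
    PARTITION p_2020_12 VALUES LESS THAN (1609430400),
    PARTITION p_max VALUES LESS THAN MAXVALUE
);
```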

3. Inspect the partition layout

SELECT
    partition_name part,
    partition_expression expr,
    partition_description descr,
    table_rows 
FROM
    information_schema.PARTITIONS 
WHERE
    table_schema = SCHEMA () 
    AND table_name = 'tb_order_2';

4. Count orders in a month

SELECT
    count( 1 ) 
FROM
    tb_order_2
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
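> Time: 0.133s
This is much faster than the same query against the unpartitioned timestamp table. Running EXPLAIN on it
shows that the query is served from the partition.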
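
The partition check can be reproduced with a plain EXPLAIN, which in MySQL 5.7 includes a partitions column. Assuming the session time zone matches the one used for the partition boundaries (UTC+8 here), that column would be expected to list only p_2020_04:

```sql
-- The partitions column of the output shows which partitions are scanned.
EXPLAIN
SELECT
    count( 1 )
FROM
    tb_order_2
WHERE
    item_state = 1
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' );
```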

5. Sum order totals for a month

SELECT
    sum(item_price) as total
FROM
    tb_order_2
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> Time: 0.526s

6. Count orders by state for a month

SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
GROUP BY
    item_state
> OK
> Time: 0.273s

7. Query one day's orders

SELECT
    * 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-01 23:59:59' )
> OK
> Time: 0.587s
When every column of every matching row is fetched, the difference is small.

8. Sum order totals for a day

SELECT
    SUM(item_price) as total
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-01 23:59:59' )
> OK
> Time: 0.068s

9. Page through a month's orders

SELECT
    * 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
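> Time: 0.073s

Summary of timings

Test type            Monthly count  Monthly total  States by month  Daily orders  Daily total  Monthly paging
Timestamp            6.348s         5.901s         12.126s          0.54s         0.154s       0.107s
Split time columns   1.329s         1.23s          1.429s           0.663s        0.091s       1.599s
Partitioned          0.133s         0.526s         0.273s           0.587s        0.068s       0.073s

Note: the paging figure for split time columns is the ORDER BY create_time run; the 0.107s run listed in that section filtered on create_time, not on the split columns.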

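Conclusion

Overall, for time-based queries the ranking is partitioned table > split time columns > timestamp. The gap is largest for the month-level aggregates; for day-level queries the three designs are close. The timestamp column is stored as INT because I had previously read articles reporting the efficiency order int > datetime > timestamp, so I chose INT when designing the table.

The choice should follow the requirements. While the table is small and the row count has not grown yet, the designs differ little, so start with the timestamp form; once the table grows too large, partition it; if it is still too large, split it into multiple tables.

Of course, these are just my own (beginner's) thoughts, and with so few runs the numbers cannot fully represent real-world behavior. Suggestions are welcome.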
