While at school I never worked with tables in the millions of rows, so I had no feel for how queries behave at that scale. For my graduation project I need a fairly large data set for chart analysis and order recommendation, and I have long wondered how table design affects the efficiency of ordinary SQL, so I ran the experiments below and recorded them here. This article focuses on how table design affects query SQL. I am no expert; this is just a record. If you have better ideas or spot a mistake, please leave a comment and I will respond.
- Storage engine: InnoDB
- Character set: utf8mb4
- MySQL version: 5.7.18

InnoDB is the default engine, and the real business needs transactions, so I kept it. To rule out interference from other factors, all tests ran on a `Tencent Cloud MySQL basic instance (1 core, 1000 MB RAM, 50 GB disk)`.
```sql
-- Create the order table
CREATE TABLE tb_order (
  `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
  `item_name` VARCHAR(255) NOT NULL,
  `item_price` INT(11) UNSIGNED NOT NULL,
  `item_state` TINYINT(1) NOT NULL,
  `create_time` INT(11) UNSIGNED NOT NULL,
  `time_year` CHAR(4) NOT NULL,
  `time_month` CHAR(2) NOT NULL,
  `time_day` CHAR(2) NOT NULL,
  PRIMARY KEY (id)
) ENGINE = INNODB DEFAULT CHARSET = utf8mb4;

CREATE INDEX idx_order_ctime ON tb_order (create_time);
CREATE INDEX idx_order_state ON tb_order (`item_state`);
CREATE INDEX idx_order_day ON tb_order (`time_year`, `time_month`);
CREATE INDEX idx_order_dmonth ON tb_order (`time_year`, `time_month`, `time_day`);
```
```sql
-- Random-string function
DELIMITER $$
CREATE DEFINER=`root`@`%` FUNCTION `rand_string`(n INT) RETURNS varchar(255) CHARSET utf8
BEGIN
  DECLARE chars_str varchar(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
  DECLARE return_str varchar(255) DEFAULT '';
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    SET return_str = CONCAT(return_str, SUBSTRING(chars_str, FLOOR(1 + RAND() * 62), 1));
    SET i = i + 1;
  END WHILE;
  RETURN return_str;
END$$

-- Random datetime within a range
CREATE DEFINER=`root`@`%` FUNCTION `rand_date`(`startDate` date, `endDate` date) RETURNS datetime
BEGIN
  DECLARE sec INT DEFAULT 0;
  DECLARE ret DATETIME;
  SET sec = ABS(UNIX_TIMESTAMP(endDate) - UNIX_TIMESTAMP(startDate));
  SET ret = DATE_ADD(startDate, INTERVAL FLOOR(1 + RAND() * (sec - 1)) SECOND);
  RETURN ret;
END$$

-- Stored procedure that generates mock orders
CREATE DEFINER=`root`@`%` PROCEDURE `mock_order`(IN `size` int UNSIGNED, IN `sd` date, IN `ed` date)
BEGIN
  DECLARE i int UNSIGNED DEFAULT 1;
  DECLARE randOrderName VARCHAR(10);
  DECLARE randOrderTime DATETIME;
  WHILE i <= size DO
    SELECT rand_string(10) INTO randOrderName;
    SELECT rand_date(sd, ed) INTO randOrderTime;
    INSERT INTO tb_order (`item_name`, `item_price`, `item_state`, `create_time`, `time_year`, `time_month`, `time_day`)
    VALUES (randOrderName, RAND() * 100, ROUND(RAND()), UNIX_TIMESTAMP(randOrderTime),
            DATE_FORMAT(randOrderTime, '%Y'), DATE_FORMAT(randOrderTime, '%m'), DATE_FORMAT(randOrderTime, '%d'));
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;

-- Run the procedure
CALL mock_order(1200000, '2020-03-01', '2020-12-31');
-- CALL mock_order(FLOOR(RAND() * 9999999), '2020-03-01', '2020-12-31');
```
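For reference, here is a rough Python equivalent of the `rand_string` / `rand_date` helpers and the `mock_order` procedure above. It is only a sketch of the same generation logic for use with a client-side bulk loader; the column order matches `tb_order`, everything else is illustrative.

```python
import random
import string
from datetime import datetime, timedelta

def rand_string(n: int) -> str:
    """Random alphanumeric string of length n (mirrors the SQL rand_string)."""
    return ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(n))

def rand_date(start: datetime, end: datetime) -> datetime:
    """Random datetime strictly between start and end (mirrors rand_date)."""
    sec = int(abs((end - start).total_seconds()))
    return start + timedelta(seconds=random.randint(1, max(sec - 1, 1)))

def mock_order_rows(size: int, sd: datetime, ed: datetime):
    """Yield tuples in tb_order's insert column order (mirrors mock_order)."""
    for _ in range(size):
        t = rand_date(sd, ed)
        yield (
            rand_string(10),              # item_name
            int(random.random() * 100),   # item_price
            round(random.random()),       # item_state: 0 or 1
            int(t.timestamp()),           # create_time (unix seconds, local tz)
            t.strftime('%Y'),             # time_year
            t.strftime('%m'),             # time_month
            t.strftime('%d'),             # time_day
        )
```

Rows from this generator could be fed to `executemany` on any MySQL client library instead of running the stored procedure server-side.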
Total row count:
```sql
SELECT count(*) FROM tb_order;
```

> OK, time: 0.607s
Data distribution:
```sql
SELECT count(*) AS count, DATE_FORMAT(FROM_UNIXTIME(create_time), '%Y-%m') AS date
FROM tb_order
GROUP BY date DESC;
```

> OK, time: 5.007s
Result distribution:
First, the statistics filtered by raw timestamp. Monthly order count:

```sql
SELECT count(1) FROM tb_order
WHERE item_state = 1
  AND create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59');
```

> OK, time: 6.348s
Aside: I was not sure whether calling `UNIX_TIMESTAMP()` affects the measured query time, so I ran the query repeatedly, comparing the conversion function against raw timestamp literals.
Run | With UNIX_TIMESTAMP() | Raw timestamp literal
---|---|---
1 | 6.105s | 5.768s
2 | 6.239s | 6.199s
3 | 5.681s | 6.161s
4 | 5.687s | 5.605s
5 | 6.118s | 5.621s
Average | 5.966s | 5.870s
The difference is small, about 0.09s on average, and a bare `SELECT UNIX_TIMESTAMP('2020-04-01 00:00:00')` by itself takes about 0.045s, so all tests below use `UNIX_TIMESTAMP()`.
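An application can also precompute the epoch-second literals once and send plain integers in the query, avoiding the function call entirely. A minimal sketch, assuming the MySQL session time zone is UTC+8 (adjust `tz` to match your server):

```python
from datetime import datetime, timezone, timedelta

# Assumed server time zone: UTC+8
tz = timezone(timedelta(hours=8))

def unix_ts(s: str) -> int:
    """Epoch seconds of a 'YYYY-MM-DD HH:MM:SS' string in the assumed tz."""
    return int(datetime.strptime(s, '%Y-%m-%d %H:%M:%S').replace(tzinfo=tz).timestamp())

start = unix_ts('2020-04-01 00:00:00')
end = unix_ts('2020-04-30 23:59:59')
# The same monthly count query, with integer literals instead of UNIX_TIMESTAMP()
query = (f"SELECT count(1) FROM tb_order "
         f"WHERE item_state = 1 AND create_time BETWEEN {start} AND {end}")
```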
Monthly order total:

```sql
SELECT sum(item_price) AS total FROM tb_order
WHERE item_state = 1
  AND create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59');
```

> OK, time: 5.901s
Order counts grouped by state for the month:

```sql
SELECT count(*) AS count, item_state AS state FROM tb_order
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59')
GROUP BY item_state;
```

> OK, time: 12.126s
That is unacceptably slow.
Daily orders:

```sql
SELECT * FROM tb_order
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-01 23:59:59');
```

> OK, time: 0.54s
Daily order total:

```sql
SELECT SUM(item_price) AS total FROM tb_order
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-01 23:59:59');
```

> OK, time: 0.154s
Paginated monthly orders:

```sql
SELECT * FROM tb_order
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59')
LIMIT 10000, 20;
```

> OK, time: 0.107s
Now the same statistics using the split date columns (`time_year` / `time_month` / `time_day`). Monthly order count:

```sql
SELECT count(1) FROM tb_order
WHERE item_state = 1 AND time_year = '2020' AND time_month = '04';
```

> OK, time: 1.329s
Monthly order total:

```sql
SELECT sum(item_price) AS total FROM tb_order
WHERE item_state = 1 AND time_year = '2020' AND time_month = '04';
```

> OK, time: 1.23s
Order counts grouped by state for the month:

```sql
SELECT count(*) AS count, item_state AS state FROM tb_order
WHERE time_year = '2020' AND time_month = '04'
GROUP BY item_state;
```

> OK, time: 1.429s
Daily orders:

```sql
SELECT * FROM tb_order
WHERE time_year = '2020' AND time_month = '04' AND time_day = '01';
```

> OK, time: 0.663s
Daily order total:

```sql
SELECT SUM(item_price) AS total FROM tb_order
WHERE time_year = '2020' AND time_month = '04' AND time_day = '01';
```

> OK, time: 0.091s
Pagination by timestamp again, for comparison with the split-column version:

```sql
SELECT * FROM tb_order
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59')
LIMIT 10000, 20;
```

> OK, time: 0.107s
Results filtered on `create_time` are read through the timestamp index, so they come back in ascending time order, while the split-column query returns rows in no particular order. An `ORDER BY` is therefore added to the split-column pagination:
```sql
SELECT * FROM tb_order
WHERE time_year = '2020' AND time_month = '04'
ORDER BY create_time
LIMIT 10000, 20;
```

> OK, time: 1.599s
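A side note not benchmarked in this article: `LIMIT 10000, 20` makes MySQL read and discard the first 10000 rows, so deep pages get progressively slower. A common alternative is keyset (seek) pagination, remembering the last `create_time` returned by the previous page; the literal `1586000000` below is a hypothetical placeholder for that remembered value.

```sql
-- Keyset pagination sketch: seek past the previous page's last create_time
-- instead of using a large OFFSET. Ties on create_time would need an extra
-- tie-breaker column (e.g. id) to be fully correct.
SELECT * FROM tb_order
WHERE create_time > 1586000000
  AND create_time <= UNIX_TIMESTAMP('2020-04-30 23:59:59')
ORDER BY create_time
LIMIT 20;
```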
Because adding partitions changes the table's on-disk structure, I copied the table to `tb_order_2` first. Since every unique key of a partitioned table (including the primary key) must contain the partitioning column, the original primary key has to be dropped and recreated to include `create_time`.
```sql
-- drop AUTO_INCREMENT first so the primary key can be dropped
ALTER TABLE tb_order_2 CHANGE COLUMN id id BIGINT(20) UNSIGNED NOT NULL;
ALTER TABLE tb_order_2 DROP PRIMARY KEY;
ALTER TABLE tb_order_2 ADD PRIMARY KEY (id, create_time);
```
```sql
ALTER TABLE tb_order_2 PARTITION BY RANGE (create_time) (
  PARTITION p_2020_01 VALUES LESS THAN (1580486400),
  PARTITION p_2020_02 VALUES LESS THAN (1582992000),
  PARTITION p_2020_03 VALUES LESS THAN (1585670400),
  PARTITION p_2020_04 VALUES LESS THAN (1588262400),
  PARTITION p_2020_05 VALUES LESS THAN (1590940800),
  PARTITION p_2020_06 VALUES LESS THAN (1593532800),
  PARTITION p_2020_07 VALUES LESS THAN (1596211200),
  PARTITION p_2020_08 VALUES LESS THAN (1598889600),
  PARTITION p_2020_09 VALUES LESS THAN (1601481600),
  PARTITION p_2020_10 VALUES LESS THAN (1604160000),
  PARTITION p_2020_11 VALUES LESS THAN (1606752000),
  PARTITION p_2020_12 VALUES LESS THAN MAXVALUE
);
```
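The `VALUES LESS THAN` constants above are the epoch seconds of each month's first midnight. A small script can generate and double-check them; the constants in the DDL correspond to a UTC+8 server time zone, which is assumed here:

```python
from datetime import datetime, timezone, timedelta

# Assumed server time zone (UTC+8); the DDL constants match this offset
TZ = timezone(timedelta(hours=8))

def month_boundary(year: int, month: int) -> int:
    """Epoch seconds of YYYY-MM-01 00:00:00 in TZ."""
    return int(datetime(year, month, 1, tzinfo=TZ).timestamp())

# Boundaries for partitions p_2020_01 .. p_2020_11 (Feb through Dec 1st)
boundaries = [month_boundary(2020, m) for m in range(2, 13)]
print(boundaries)
```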
```sql
SELECT partition_name part, partition_expression expr,
       partition_description descr, table_rows
FROM information_schema.PARTITIONS
WHERE table_schema = SCHEMA() AND table_name = 'tb_order_2';
```
Now the same statistics on the partitioned table. Monthly order count:

```sql
SELECT count(1) FROM tb_order_2
WHERE item_state = 1
  AND create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59');
```

> OK, time: 0.133s
This is much faster than the same query against the raw timestamps on the unpartitioned table. An EXPLAIN confirms the query only touches the relevant partition.
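In MySQL 5.7, plain `EXPLAIN` prints a `partitions` column, so pruning can be checked directly; for the April query it should list only `p_2020_04`:

```sql
EXPLAIN SELECT count(1) FROM tb_order_2
WHERE item_state = 1
  AND create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59');
-- the `partitions` column should show only p_2020_04
```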
Monthly order total:

```sql
SELECT sum(item_price) AS total FROM tb_order_2
WHERE item_state = 1
  AND create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59');
```

> OK, time: 0.526s
Order counts grouped by state for the month:

```sql
SELECT count(*) AS count, item_state AS state FROM tb_order_2
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59')
GROUP BY item_state;
```

> OK, time: 0.273s
Daily orders:

```sql
SELECT * FROM tb_order_2
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-01 23:59:59');
```

> OK, time: 0.587s
For a plain `SELECT *` over a single day, the difference between the approaches is small.
Daily order total:

```sql
SELECT SUM(item_price) AS total FROM tb_order_2
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-01 23:59:59');
```

> OK, time: 0.068s
Paginated monthly orders:

```sql
SELECT * FROM tb_order_2
WHERE create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59')
LIMIT 10000, 20;
```

> OK, time: 0.073s
Test | Monthly order count | Monthly order total | State breakdown by month | Daily orders | Daily order total | Paginated monthly orders
---|---|---|---|---|---|---
Timestamp | 6.348s | 5.901s | 12.126s | 0.54s | 0.154s | 0.107s
Date split | 1.329s | 1.23s | 1.429s | 0.663s | 0.091s | 0.107s
Partition | 0.133s | 0.526s | 0.273s | 0.587s | 0.068s | 0.073s
Overall, for time-based queries the efficiency ranking is: table partitioning > split date columns > raw timestamp. I stored the timestamp as an INT because articles I had read compared the query efficiency of the types as INT > DATETIME > TIMESTAMP, so I designed the table that way. In practice the choice should follow the requirements: while the table is small the layouts barely differ, so start with a plain timestamp; once the table grows large, partition it; and if it is still too large, split it into multiple tables.
Of course, all of this is just my own (novice) take, and the number of test runs is small, so it cannot fully represent real-world behavior. I welcome more ideas and further tests.