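Some time ago we upgraded our production MySQL databases to 5.7. Because of the incompatibilities this might bring, I was quite nervous before the upgrade, even though the test and development environments had already been upgraded six months earlier.
Based on earlier research and feedback from friends, two changes matter most to developers: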
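sql_mode
In MySQL 5.6 its default value is "NO_ENGINE_SUBSTITUTION", which can be regarded as non-strict mode. For example, inserting an empty string '' into an auto-increment primary key raises a warning but does not prevent the auto-increment value from being generated.
In MySQL 5.7 the default was changed to strict mode: the same insert no longer raises a warning but fails with an error.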
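Group-wise maximum
Some ways of writing a group-wise maximum query do not return the expected result in MySQL 5.7, and this problem is comparatively hard to spot.
The first point is controllable, since the parameter can simply be adjusted. The second is not: no parameter is tied to it, so developers have to review their code.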
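To make the first point concrete, the statements below sketch how the mode can be inspected and switched for a single session. The table name t_ai is hypothetical and exists only for this illustration; the behaviour noted in the comments follows the description above and is worth confirming on your own server.

```sql
-- Show the modes currently in effect.
SELECT @@GLOBAL.sql_mode, @@SESSION.sql_mode;

-- Hypothetical demo table with an auto-increment primary key.
CREATE TABLE t_ai (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    v  INT
);

-- Non-strict mode (the 5.6 default described above):
-- the empty string is accepted with a warning and an id is still generated.
SET SESSION sql_mode = 'NO_ENGINE_SUBSTITUTION';
INSERT INTO t_ai (id, v) VALUES ('', 1);
SHOW WARNINGS;

-- Strict mode: the same kind of insert is rejected with an error.
SET SESSION sql_mode = 'STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION';
INSERT INTO t_ai (id, v) VALUES ('', 2);
```

SET SESSION only affects the current connection. To keep the 5.6 behaviour for every connection, the value would have to be set globally or in the server configuration file.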
Test data
mysql> select * from emp;
+-------+----------+--------+--------+
| empno | ename    | sal    | deptno |
+-------+----------+--------+--------+
|  1001 | emp_1001 | 100.00 |     10 |
|  1002 | emp_1002 | 200.00 |     10 |
|  1003 | emp_1003 | 300.00 |     20 |
|  1004 | emp_1004 | 400.00 |     20 |
|  1005 | emp_1005 | 500.00 |     30 |
|  1006 | emp_1006 | 600.00 |     30 |
+-------+----------+--------+--------+
6 rows in set (0.00 sec)
Here empno is the employee number, ename the employee name, sal the salary, and deptno the number of the employee's department.
The business requirement is to return the details of the highest-paid employee in each department.
In MySQL 5.6 we can do this with the following SQL:
SELECT deptno, ename, sal
FROM (SELECT * FROM emp ORDER BY sal DESC) t
GROUP BY deptno;
The result is shown below, and it does match what we wanted.
+--------+----------+--------+
| deptno | ename    | sal    |
+--------+----------+--------+
|     10 | emp_1002 | 200.00 |
|     20 | emp_1004 | 400.00 |
|     30 | emp_1006 | 600.00 |
+--------+----------+--------+
Now look at the result in MySQL 5.7. It is different.
+--------+----------+--------+
| deptno | ename    | sal    |
+--------+----------+--------+
|     10 | emp_1001 | 100.00 |
|     20 | emp_1003 | 300.00 |
|     30 | emp_1005 | 500.00 |
+--------+----------+--------+
In fact, MySQL 5.7 rewrites this SQL. The rewritten statement can be viewed with EXPLAIN (EXTENDED) followed by SHOW WARNINGS.
mysql> explain select deptno,ename,sal from (select * from emp order by sal desc) t group by deptno;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra           |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
|  1 | SIMPLE      | emp   | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    6 |   100.00 | Using temporary |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
1 row in set, 1 warning (0.00 sec)

mysql> show warnings\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select `slowtech`.`emp`.`deptno` AS `deptno`,`slowtech`.`emp`.`ename` AS `ename`,`slowtech`.`emp`.`sal` AS `sal` from `slowtech`.`emp` group by `slowtech`.`emp`.`deptno`
1 row in set (0.00 sec)
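The rewritten SQL shows that the subquery, and with it the ORDER BY, has been eliminated, which is why the result is no longer what we expected. The MySQL team has confirmed this as well: https://bugs.mysql.com/bug.php?id=80131
Many people may shrug this off and assume nobody writes queries this way, but on the well-known Stack Overflow this approach has 116 upvotes, which shows how widely it is used. It is second only to "Method 2" described below (206 upvotes).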
https://stackoverflow.com/questions/12102200/get-records-with-max-value-for-each-group-of-grouped-sql-results
Note that under the default settings this SQL cannot be run directly in 5.7. It fails with the following error:
ERROR 1055 (42000): Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 't.ename' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
This is related to sql_mode. In MySQL 5.7 the default sql_mode was changed to
ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
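Of these, ONLY_FULL_GROUP_BY concerns GROUP BY statements. It requires the select list to contain only grouping columns (the columns after GROUP BY) and aggregate functions (sum, avg, max and so on), which is also what the SQL92 standard demands. As the error message above indicates, MySQL 5.7 additionally accepts columns that are functionally dependent on the grouping columns.
At work, however, I often see developers write SQL like this: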
mysql> select deptno,ename,max(sal) from emp group by deptno;
+--------+----------+----------+
| deptno | ename    | max(sal) |
+--------+----------+----------+
|     10 | emp_1001 |   200.00 |
|     20 | emp_1003 |   400.00 |
|     30 | emp_1005 |   600.00 |
+--------+----------+----------+
3 rows in set (0.01 sec)
I really do not see what business meaning ename has here. After all, the employee it names is not the one with the highest salary.
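One way to make this pitfall visible in code is ANY_VALUE(), which MySQL 5.7 provides for this situation: the query runs under ONLY_FULL_GROUP_BY while stating openly that the returned name is arbitrary. The sketch below recreates the emp test data; the column types and the primary key are assumptions (the original table definition is not shown), and the aliases some_ename and maxsal are made up for illustration.

```sql
-- Assumed definition of the test table; skip this part if emp already exists.
CREATE TABLE emp (
    empno  INT PRIMARY KEY,
    ename  VARCHAR(20),
    sal    DECIMAL(10,2),
    deptno INT
);
INSERT INTO emp (empno, ename, sal, deptno) VALUES
    (1001, 'emp_1001', 100.00, 10),
    (1002, 'emp_1002', 200.00, 10),
    (1003, 'emp_1003', 300.00, 20),
    (1004, 'emp_1004', 400.00, 20),
    (1005, 'emp_1005', 500.00, 30),
    (1006, 'emp_1006', 600.00, 30);

-- Rejected under ONLY_FULL_GROUP_BY (ERROR 1055), because ename is
-- neither a grouping column nor aggregated:
--   SELECT deptno, ename, MAX(sal) FROM emp GROUP BY deptno;

-- Accepted, but the name is explicitly "some employee of the department",
-- not necessarily the one who earns MAX(sal).
SELECT deptno,
       ANY_VALUE(ename) AS some_ename,
       MAX(sal)         AS maxsal
FROM emp
GROUP BY deptno;
```

If the name of the top earner is what the business actually needs, ANY_VALUE() is not a fix. One of the methods described in this article is required instead.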
Group-wise maximum is in fact a very common requirement, and developers often ask me about it. Let's look at the ways it can be implemented in MySQL.
Method 1
SELECT e.deptno, ename, sal
FROM emp e,
     (SELECT deptno, max(sal) maxsal FROM emp GROUP BY deptno) t
WHERE e.deptno = t.deptno
  AND e.sal = t.maxsal;
Method 2
SELECT a.deptno, a.ename, a.sal
FROM emp a
LEFT JOIN emp b ON a.deptno = b.deptno AND a.sal < b.sal
WHERE b.sal IS NULL;
These two implementations are generic: they work not only in MySQL but also in other mainstream relational databases.
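One behaviour of both methods that the test data does not exercise: when several employees share the top salary in a department, both queries return all of them. The sketch below adds a hypothetical tied row (emp_1007) inside a transaction so that the emp test table is left untouched, and shows a variant of Method 2 that keeps only the tied row with the lowest empno. That tie-breaking rule is an assumption for illustration; the original requirement does not specify one.

```sql
-- Assumes the emp test table above exists and uses a transactional engine
-- such as InnoDB, so that ROLLBACK undoes the insert.
START TRANSACTION;

-- Hypothetical row that ties with emp_1006 in department 30.
INSERT INTO emp (empno, ename, sal, deptno)
VALUES (1007, 'emp_1007', 600.00, 30);

-- Method 1: department 30 now yields two rows (emp_1006 and emp_1007).
SELECT e.deptno, e.ename, e.sal
FROM emp e
JOIN (SELECT deptno, MAX(sal) AS maxsal FROM emp GROUP BY deptno) t
  ON e.deptno = t.deptno AND e.sal = t.maxsal;

-- Method 2 with a tie-breaker: a row survives only if no row in the same
-- department has a higher salary, or the same salary and a lower empno.
SELECT a.deptno, a.ename, a.sal
FROM emp a
LEFT JOIN emp b
  ON a.deptno = b.deptno
 AND (a.sal < b.sal OR (a.sal = b.sal AND a.empno > b.empno))
WHERE b.empno IS NULL;

ROLLBACK;
```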
Method 3
MySQL 8.0 introduced analytic (window) functions, which can achieve the same thing.
SELECT deptno, ename, sal
FROM (
    SELECT deptno, ename, sal,
           LAST_VALUE(sal) OVER (
               PARTITION BY deptno ORDER BY sal
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) maxsal
    FROM emp
) a
WHERE sal = maxsal;
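Because the frame in Method 3 spans the whole partition, LAST_VALUE(sal) ordered by sal is simply the partition maximum, so the same query can also be written with MAX() as a window function. RANK() is another common form. Both are sketches against the emp test table and, like Method 3, need MySQL 8.0; the aliases maxsal and rk are illustrative.

```sql
-- Variant A: a window MAX() needs neither ORDER BY nor an explicit frame.
SELECT deptno, ename, sal
FROM (
    SELECT deptno, ename, sal,
           MAX(sal) OVER (PARTITION BY deptno) AS maxsal
    FROM emp
) a
WHERE sal = maxsal;

-- Variant B: rank the rows of each department by salary, highest first,
-- and keep the rows ranked first.
SELECT deptno, ename, sal
FROM (
    SELECT deptno, ename, sal,
           RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS rk
    FROM emp
) a
WHERE rk = 1;
```

RANK() keeps every row that ties for the top salary. Replacing it with ROW_NUMBER() would return exactly one row per department, picked arbitrarily among ties unless the ORDER BY is extended.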
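Because the test data above is so small, all three implementations return instantly, and it is hard to judge them from the execution plans alone.
So let's switch to a larger data set: the dept_emp table from the official sample database employees, https://github.com/datacharmer/test_db
The table looks like this, where emp_no is the employee number, dept_no the department number, and from_date the hire date.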
mysql> show create table dept_emp\G
*************************** 1. row ***************************
       Table: dept_emp
Create Table: CREATE TABLE `dept_emp` (
  `emp_no` int(11) NOT NULL,
  `dept_no` char(4) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  KEY `dept_no` (`dept_no`,`from_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

mysql> select count(*) from dept_emp;
+----------+
| count(*) |
+----------+
|   331603 |
+----------+
1 row in set (0.09 sec)

mysql> select * from dept_emp limit 1;
+--------+---------+------------+------------+
| emp_no | dept_no | from_date  | to_date    |
+--------+---------+------------+------------+
|  10001 | d005    | 1986-06-26 | 9999-01-01 |
+--------+---------+------------+------------+
1 row in set (0.00 sec)
Method 1
mysql> select d.dept_no,d.emp_no,d.from_date
    -> from dept_emp d,
    ->      (select dept_no,max(from_date) max_hiredate from dept_emp group by dept_no) t
    -> where d.dept_no=t.dept_no and d.from_date=t.max_hiredate;
…
12 rows in set (0.00 sec)

mysql> explain select d.dept_no,d.emp_no,d.from_date
    -> from dept_emp d,
    ->      (select dept_no,max(from_date) max_hiredate from dept_emp group by dept_no) t
    -> where d.dept_no=t.dept_no and d.from_date=t.max_hiredate;
+----+-------------+------------+------------+-------+---------------+---------+---------+--------------------------+------+----------+----------------------
| id | select_type | table      | partitions | type  | possible_keys | key     | key_len | ref                      | rows | filtered | Extra
+----+-------------+------------+------------+-------+---------------+---------+---------+--------------------------+------+----------+----------------------
|  1 | PRIMARY     | <derived2> | NULL       | ALL   | NULL          | NULL    | NULL    | NULL                     |    9 |   100.00 | Using where
|  1 | PRIMARY     | d          | NULL       | ref   | dept_no       | dept_no | 19      | t.dept_no,t.max_hiredate |    5 |   100.00 | NULL
|  2 | DERIVED     | dept_emp   | NULL       | range | dept_no       | dept_no | 16      | NULL                     |    9 |   100.00 | Using index for group-by
+----+-------------+------------+------------+-------+---------------+---------+---------+--------------------------+------+----------+----------------------
Method 2
mysql> explain select a.dept_no,a.emp_no,a.from_date
    -> from dept_emp a
    -> left join dept_emp b on a.dept_no=b.dept_no and a.from_date < b.from_date
    -> where b.from_date is null;
+----+-------------+-------+------------+------+---------------+---------+---------+--------------------+--------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref                | rows   | filtered | Extra                    |
+----+-------------+-------+------------+------+---------------+---------+---------+--------------------+--------+----------+--------------------------+
|  1 | SIMPLE      | a     | NULL       | ALL  | NULL          | NULL    | NULL    | NULL               | 331008 |   100.00 | NULL                     |
|  1 | SIMPLE      | b     | NULL       | ref  | dept_no       | dept_no | 16      | slowtech.a.dept_no |  41376 |    19.00 | Using where; Using index |
+----+-------------+-------+------------+------+---------------+---------+---------+--------------------+--------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)
Method 3
mysql> select dept_no,emp_no,from_date
    -> from (
    ->   select dept_no,emp_no,from_date,
    ->          last_value(from_date) over(partition by dept_no order by from_date
    ->            rows between unbounded preceding and unbounded following) max_hiredate
    ->   from dept_emp) a
    -> where from_date=max_hiredate;
…
12 rows in set (1.57 sec)

mysql> desc select dept_no,emp_no,from_date
    -> from (
    ->   select dept_no,emp_no,from_date,
    ->          last_value(from_date) over(partition by dept_no order by from_date
    ->            rows between unbounded preceding and unbounded following) max_hiredate
    ->   from dept_emp) a
    -> where from_date=max_hiredate;
+----+-------------+------------+------------+------+---------------+------+---------+------+--------+----------+----------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra          |
+----+-------------+------------+------------+------+---------------+------+---------+------+--------+----------+----------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 331008 |   100.00 | Using where    |
|  2 | DERIVED     | dept_emp   | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 331008 |   100.00 | Using filesort |
+----+-------------+------------+------------+------+---------------+------+---------+------+--------+----------+----------------+
2 rows in set, 2 warnings (0.00 sec)
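In terms of execution time:
Method 1 is the fastest. With the composite index (dept_no, from_date) the result comes back instantly, and even without an index it takes only 0.75s.
Method 2 is the slowest: after 3 hours it still had not returned. The same data and the same SQL on Oracle took 87 minutes 49 seconds.
Method 3 is fairly constant. With or without an index it stays at around 1.5s, which is slower than Method 1.
I also tested the approach mentioned earlier that is no longer compatible with MySQL 5.7. Without any index it held steady at 0.7s on MySQL 5.6 (not bad at all, no wonder people use it), while Method 1 under the same conditions held steady at 0.5s (ha, so MySQL 5.6 is actually faster here than 8.0 with its 0.75s). Unlike Method 1, however, that approach cannot be sped up with an index.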
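In terms of execution plan:
Method 1 first puts the GROUP BY result into a temporary table, then uses that temporary table as the driving table and joins it to dept_emp. The driving table is small (only 9 rows) and the join columns are indexed, so it is no wonder the result comes back instantly.
Method 2 joins the table to itself and commits two cardinal sins of SQL optimization.
1. The driving table is too large: it has 331603 rows.
2. The driven table does have an index, but the execution plan shows that only dept_no of the composite index (dept_no, from_date) is used, and the selectivity of dept_no is very low, since there are only 9 departments.
Method 3 first puts the window-function result into a temporary table and then processes that table. It performs two full scans, one of dept_emp and one of the temporary table.
So for group-wise maximum requirements I recommend Method 1. It conforms to the SQL standard and also has the best query performance, especially when a composite index is available.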
PS:
As an expert pointed out to me, the approach that is no longer compatible with MySQL 5.7 can in fact be rescued by adjusting optimizer_switch:
set optimizer_switch='derived_merge=off';
derived_merge was introduced in MySQL 5.7. With it, the optimizer tries to merge derived tables (subqueries in the FROM clause), view references and common table expressions into the outer query. For example,
SELECT * FROM t1 JOIN (SELECT t2.f1 FROM t2) AS derived_t2 ON t1.f2=derived_t2.f1 WHERE t1.f1 > 0;
is rewritten as
SELECT t1.*, t2.f1 FROM t1 JOIN t2 ON t1.f2=t2.f1 WHERE t1.f1 > 0;
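If the switch is used as a stopgap for the group-wise maximum query, it may be safer to limit its scope than to change it globally. The sketch below shows two options: setting it for the current session only, and, on MySQL 8.0, the NO_MERGE optimizer hint for a single derived table. Both assume ONLY_FULL_GROUP_BY has been removed from sql_mode, otherwise the query still fails with ERROR 1055. Note also that the MySQL manual describes the values chosen for nonaggregated columns as nondeterministic, so the 5.6-style result is a side effect of the implementation rather than a guarantee and should be verified on the target version.

```sql
-- Option 1: disable derived-table merging for this connection only.
SET SESSION optimizer_switch = 'derived_merge=off';

SELECT deptno, ename, sal
FROM (SELECT * FROM emp ORDER BY sal DESC) t
GROUP BY deptno;

-- Option 2 (MySQL 8.0): keep the default switch and ask the optimizer
-- not to merge just this one derived table.
SET SESSION optimizer_switch = 'derived_merge=on';

SELECT /*+ NO_MERGE(t) */ deptno, ename, sal
FROM (SELECT * FROM emp ORDER BY sal DESC) t
GROUP BY deptno;
```

Either way, rewriting the query as Method 1 remains the more robust fix, since it does not depend on how the optimizer treats the derived table.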