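For a while now I have been using Kettle for data extraction and reporting, so writing SQL has become routine, and statements of 200+ lines are common. The longest one I have written ran over 500 lines, close to 600. Most people probably would not even want to look at SQL that long, and it is painful to read. I usually split such a statement into several sub-queries, test each one, and then assemble them. The more SQL you write, the more likely you are to hit performance problems. SQL can be optimized through many details, such as adding indexes, but that is no cure-all: once a statement reaches a certain size, optimizing its structure is the fundamental fix. This assumes, of course, that the indexes that should exist have been added and that most of the local optimizations have already been taken care of.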
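The split, test, and assemble habit described above can be sketched as follows. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module instead of Kettle and MySQL, and the tables `customers` and `orders` and their columns are hypothetical, not taken from the real job.

```python
import sqlite3

# Hypothetical toy schema; the real tables are not shown in this post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, '');
    INSERT INTO orders VALUES (1, 1, 10), (2, 1, 5), (3, 2, 7), (4, 3, 99);
""")

# Each building block is a complete SELECT that can be run and checked alone.
ACTIVE_CUSTOMERS = "SELECT id, name FROM customers WHERE name <> ''"
ORDER_TOTALS = (
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"
)

# Step 1: test every sub-query on its own.
active = conn.execute(ACTIVE_CUSTOMERS + " ORDER BY id").fetchall()
totals = conn.execute(ORDER_TOTALS + " ORDER BY customer_id").fetchall()

# Step 2: assemble the tested pieces as derived tables.
# The pieces are fixed constants, so plain string formatting is safe here.
REPORT = f"""
    SELECT c.name, t.total
    FROM ({ACTIVE_CUSTOMERS}) c
    LEFT JOIN ({ORDER_TOTALS}) t ON t.customer_id = c.id
    ORDER BY c.name
"""
report = conn.execute(REPORT).fetchall()
print(report)
```

Keeping each piece as a standalone SELECT means a wrong row count can be traced to one sub-query before the 200-line statement is put together.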
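As usual, a new requirement called for extracting data from roughly 10 tables. Most of them hold around 400,000 rows, and the largest has about 1,000,000. Honestly, that is not much data, but after this many joins a poorly structured query can still be very slow. I first wrote the SQL the way I naturally thought about the problem, considering only the requirement and getting it done in the shortest possible time.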
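Part of the SQL is shown in the figure below; it already exceeds 200 lines:
The execution result is shown in the figure below:
The query returned only 38 rows, yet it took nearly 10 s, which already felt slow.
Simplified, the rough structure of my SQL at this point was: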
SELECT *
FROM (
    SELECT *
    FROM A m
    INNER JOIN B pm ON pm.id_sour = m.pk_id
    LEFT JOIN (
        SELECT * FROM C
        WHERE is_bring IS NULL OR is_bring = 0
        GROUP BY id_m
    ) pd ON m.pk_id = pd.id_m
    LEFT JOIN (
        SELECT *
        FROM D sd
        INNER JOIN E si ON sd.id_ser = si.pk_id
        GROUP BY sd.id_m
    ) sd ON m.pk_id = sd.id_m
    WHERE pm.status = ''
      AND pm.is_del = 0
      AND pm.m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
      AND m.type IN ('')
    UNION ALL
    SELECT *
    FROM F m
    INNER JOIN G pm ON pm.id_sour = m.pk_id
    LEFT JOIN (
        SELECT * FROM H
        WHERE is_bring IS NULL OR is_bring = 0
        GROUP BY id_m
    ) pd ON m.pk_id = pd.id_m
    LEFT JOIN (
        SELECT *
        FROM I sd
        INNER JOIN E si ON sd.id_ser = si.pk_id
        GROUP BY sd.id_m
    ) sd ON m.pk_id = sd.id_m
    WHERE pm.status = ''
      AND pm.is_del = 0
      AND pm.time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
      AND m.type IN ('')
    UNION ALL
    SELECT *
    FROM F m
    INNER JOIN G pm ON pm.id_sour = m.pk_id
    LEFT JOIN (
        SELECT * FROM H
        WHERE is_bring IS NULL OR is_bring = 0
        GROUP BY id_m
    ) pd ON m.pk_id = pd.id_m
    LEFT JOIN (
        SELECT *
        FROM I sd
        INNER JOIN E si ON sd.id_ser = si.pk_id
        GROUP BY sd.id_m
    ) sd ON m.pk_id = sd.id_m
    WHERE pm.status = ''
      AND pm.is_del = 0
      AND pm.time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
      AND m.type IN ('')
) t1
LEFT JOIN (
    SELECT *
    FROM J sb
    INNER JOIN (
        SELECT m.pk_id AS pk_id, pm.m_time AS m_time
        FROM A m
        INNER JOIN B pm ON pm.id_sour = m.pk_id
        WHERE pm.m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
          AND pm.status = ''
        UNION ALL
        SELECT m.from_mid_sn AS pk_id, pm.m_time AS m_time
        FROM F m
        INNER JOIN G pm ON pm.id_sour = m.pk_id
        WHERE pm.time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
          AND pm.status = ''
    ) mp ON mp.pk_id = sb.id_sour
    WHERE sb.c_time <= mp.m_time
    GROUP BY sb.id_sour , mp.m_time
) t2 ON t1.id_m = CAST(t2.id_sour AS CHAR)
    AND t1.m_time_cost = t2.m_time
Simplifying the structure further gives:
SELECT *
FROM (
    SELECT * FROM A
    UNION ALL
    SELECT * FROM B
    UNION ALL
    SELECT * FROM C
) t1
LEFT JOIN (
    SELECT *
    FROM D d
    INNER JOIN (
        SELECT * FROM E
        UNION ALL
        SELECT * FROM F
    ) t2 ON d.id = t2.id
) t3 ON t1.tid = t3.id
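Here A, B, C, D, E and F each stand for the result of a join over several of the 10 tables. The SQL above had in fact already received one simple optimization when I wrote it: t3 could have been placed inside t1 and joined to A, B and C separately. But A, B and C each already join many tables, so joining t3 to each of them separately would produce more intermediate data and be even less efficient.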
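The choice between joining t3 once outside the union and joining it inside every branch can be checked on a toy case. The sketch below is a minimal illustration, assuming Python's built-in sqlite3 module and hypothetical tables `a`, `b` and `t`; it only shows that the two shapes return the same rows and says nothing about how MySQL executes either one.

```python
import sqlite3

# Hypothetical miniature tables standing in for two branches and for t3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, tid INTEGER);
    CREATE TABLE b (id INTEGER, tid INTEGER);
    CREATE TABLE t (id INTEGER, cost INTEGER);
    INSERT INTO a VALUES (1, 10), (2, 20);
    INSERT INTO b VALUES (3, 10), (4, 30);
    INSERT INTO t VALUES (10, 100), (20, 200);
""")

# Shape 1: union the branches first, then join t a single time.
joined_once = conn.execute("""
    SELECT u.id, t.cost
    FROM (SELECT id, tid FROM a UNION ALL SELECT id, tid FROM b) u
    LEFT JOIN t ON u.tid = t.id
    ORDER BY u.id
""").fetchall()

# Shape 2: join t inside every branch, then union the results.
joined_per_branch = conn.execute("""
    SELECT id, cost
    FROM (
        SELECT a.id AS id, t.cost AS cost FROM a LEFT JOIN t ON a.tid = t.id
        UNION ALL
        SELECT b.id AS id, t.cost AS cost FROM b LEFT JOIN t ON b.tid = t.id
    ) u
    ORDER BY id
""").fetchall()

print(joined_once == joined_per_branch)
```

Because both shapes give the same rows, which one to write is purely a question of cost: shape 1 touches `t` once, while shape 2 repeats the join for every branch.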
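Because this is data extraction, the data is simply stored into a designated fact table, so the efficiency requirements are modest: anything under a minute is acceptable. I had meant to leave it at that, since I had a pile of other things to do. But I happened to have a piece of SQL with similar, though not identical, logic at hand, so I ran it. It turned out to be an order of magnitude faster than mine. Astonished, I decided to find out why.
The simplified, optimized SQL is as follows: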
SELECT *
FROM (
    SELECT *
    FROM A m
    INNER JOIN (
        SELECT * FROM B
        WHERE is_del = 0
          AND m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
    ) pm ON pm.id_sour = m.pk_id
    WHERE pm.status = ''
      AND pm.is_del = 0
      AND pm.m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
      AND m.type IN ('')
    UNION ALL
    SELECT *
    FROM F m
    INNER JOIN (
        SELECT * FROM G
        WHERE is_del = 0
          AND m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
    ) pm ON pm.id_sour = m.pk_id
    WHERE pm.status = ''
      AND pm.is_del = 0
      AND pm.s_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
      AND m.type IN ('')
    UNION ALL
    SELECT *
    FROM F m
    INNER JOIN (
        SELECT * FROM G
        WHERE is_del = 0
          AND m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
    ) pm ON pm.id_sour = m.pk_id
    WHERE pm.status = ''
      AND pm.is_del = 0
      AND pm.s_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
      AND m.type IN ('')
) mm
LEFT JOIN (
    SELECT *
    FROM J sb
    INNER JOIN (
        SELECT m.pk_id AS pk_id, pm.m_time AS m_time
        FROM A m
        INNER JOIN B pm ON pm.id_sour = m.pk_id
        WHERE pm.status = ''
          AND pm.is_del = 0
          AND m.type IN ('')
          AND m.is_del = 0
          AND m.is_mig = 0
          AND pm.m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
        UNION ALL
        SELECT m.from_mid_sn AS pk_id, pm.m_time AS m_time
        FROM F m
        INNER JOIN (
            SELECT * FROM G
            WHERE is_del = 0
              AND m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
        ) pm ON pm.id_sour = m.pk_id
        WHERE pm.status = ''
          AND pm.is_del = 0
          AND m.type IN ('')
          AND m.is_del = 0
          AND m.is_mig = 0
          AND pm.s_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
    ) mp ON mp.pk_id = sb.id_sour
    WHERE sb.c_time <= mp.m_time
    GROUP BY sb.id_sour , mp.m_time
) cost ON cost.id_sour = mm.id_m
    AND cost.m_time = mm.m_time_cost
LEFT JOIN (
    SELECT *
    FROM D sd
    INNER JOIN E si ON sd.id_ser = si.pk_id
    INNER JOIN (
        SELECT DISTINCT *
        FROM A m
        INNER JOIN B pm ON pm.id_sour = m.pk_id
        WHERE pm.status = ''
          AND pm.is_del = 0
          AND m.type IN ('')
          AND m.is_del = 0
          AND m.is_mig = 0
          AND pm.m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
    ) ms ON sd.id_m = ms.pk_id
    GROUP BY sd.id_m
    UNION ALL
    SELECT *
    FROM I sd
    INNER JOIN E si ON sd.id_ser = si.pk_id
    INNER JOIN (
        SELECT DISTINCT m.pk_id, from_mid_sn, pm.m_time
        FROM F m
        INNER JOIN G pm ON pm.id_sour = m.pk_id
        WHERE pm.status = ''
          AND pm.is_del = 0
          AND m.type IN ('')
          AND m.is_del = 0
          AND m.is_mig = 0
          AND pm.s_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
    ) ms ON sd.id_m = ms.pk_id
    GROUP BY sd.id_m
) ser ON ser.id_m = mm.id_m
    AND ser.m_time = mm.m_time_cost
LEFT JOIN (
    SELECT *
    FROM C pd
    INNER JOIN (
        SELECT DISTINCT m.pk_id, pm.m_time
        FROM A m
        INNER JOIN B pm ON pm.id_sour = m.pk_id
        WHERE pm.status = ''
          AND pm.is_del = 0
          AND m.type IN ('')
          AND m.is_del = 0
          AND m.is_mig = 0
          AND pm.m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
    ) ms ON ms.pk_id = pd.id_m
    WHERE is_bring IS NULL OR is_bring = 0
    GROUP BY pd.id_m , ms.m_time
    UNION ALL
    SELECT *
    FROM H pd
    INNER JOIN (
        SELECT DISTINCT m.pk_id, pm.m_time, from_mid_sn
        FROM F m
        INNER JOIN G pm ON pm.id_sour = m.pk_id
        WHERE pm.status = ''
          AND pm.is_del = 0
          AND m.type IN ('')
          AND m.is_del = 0
          AND m.is_mig = 0
          AND pm.s_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
    ) ms ON ms.pk_id = pd.id_m
    WHERE is_bring IS NULL OR is_bring = 0
    GROUP BY pd.id_m
) part ON part.id_m = mm.id_m
    AND part.m_time = mm.m_time_cost
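Running this code gives the following result:
Same result, and the efficiency improved by a full order of magnitude. Wow. The efficient SQL I had been given as a reference was written by a female colleague who is known at my company as the "SQL goddess", a reputation that is clearly well deserved. Beyond admiring it, I wanted to learn from it.
A careful look at the optimized SQL shows that it cleverly exploits a pattern, which I call the SQL distributive law and associative law.
The leftmost sub-query (or temporary table, mm) is: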
SELECT *
FROM A m
INNER JOIN (
    SELECT * FROM B
    WHERE is_del = 0
      AND m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
) pm ON pm.id_sour = m.pk_id
WHERE pm.status = ''
  AND pm.is_del = 0
  AND pm.m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
  AND m.type IN ('')
UNION ALL
SELECT *
FROM F m
INNER JOIN (
    SELECT * FROM G
    WHERE is_del = 0
      AND m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
) pm ON pm.id_sour = m.pk_id
WHERE pm.status = ''
  AND pm.is_del = 0
  AND pm.s_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
  AND m.type IN ('')
UNION ALL
SELECT *
FROM F m
INNER JOIN (
    SELECT * FROM G
    WHERE is_del = 0
      AND m_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
) pm ON pm.id_sour = m.pk_id
WHERE pm.status = ''
  AND pm.is_del = 0
  AND pm.s_time BETWEEN '2017-05-25 00:00:00' AND '2017-05-25 23:59:59'
  AND m.type IN ('')
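The 38-row result is in fact already determined by this sub-query, so the LEFT JOINs and INNER JOINs that follow operate on very little data and are naturally fast. In the version before optimization, each branch of this sub-query additionally joined the same set of tables. Now that shared set of tables is lifted to the outermost level and joined only once. This is the SQL associative law at work.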
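The idea of fixing the small driving row set first and restricting every aggregated detail sub-query to those keys can be sketched on a toy schema. This is an illustration under assumptions, not the production query: it uses Python's built-in sqlite3 module, and `main_t`, `detail_t` and `amount` are hypothetical names. It demonstrates that both shapes return the same rows; the speed-up reported in this post was observed on the real data, and how much you gain depends on the optimizer and the available indexes.

```python
import sqlite3

# Hypothetical miniature schema: one driving table, one detail table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_t (pk_id INTEGER, m_time TEXT);
    CREATE TABLE detail_t (id_m INTEGER, amount INTEGER);
    INSERT INTO main_t VALUES
        (1, '2017-05-25 08:00:00'),
        (2, '2017-05-24 09:00:00'),
        (3, '2017-05-25 21:30:00');
    INSERT INTO detail_t VALUES (1, 5), (1, 7), (2, 100), (3, 1), (4, 50);
""")

DAY = ("2017-05-25 00:00:00", "2017-05-25 23:59:59")

# Original shape: aggregate the whole detail table, then join and filter.
aggregate_all = """
    SELECT m.pk_id, d.total
    FROM main_t m
    LEFT JOIN (
        SELECT id_m, SUM(amount) AS total FROM detail_t GROUP BY id_m
    ) d ON d.id_m = m.pk_id
    WHERE m.m_time BETWEEN ? AND ?
    ORDER BY m.pk_id
"""

# Optimized shape: the driving rows are fixed first, and the detail
# sub-query is restricted to their keys before it is grouped.
aggregate_filtered = """
    SELECT mm.pk_id, d.total
    FROM (SELECT pk_id FROM main_t WHERE m_time BETWEEN ? AND ?) mm
    LEFT JOIN (
        SELECT dt.id_m, SUM(dt.amount) AS total
        FROM detail_t dt
        INNER JOIN (
            SELECT DISTINCT pk_id FROM main_t WHERE m_time BETWEEN ? AND ?
        ) ms ON ms.pk_id = dt.id_m
        GROUP BY dt.id_m
    ) d ON d.id_m = mm.pk_id
    ORDER BY mm.pk_id
"""

rows_all = conn.execute(aggregate_all, DAY).fetchall()
rows_filtered = conn.execute(aggregate_filtered, DAY + DAY).fetchall()
print(rows_all == rows_filtered)
```

In the first shape the GROUP BY runs over every row of `detail_t`; in the second it only sees detail rows that belong to the day's driving keys, which is the same role the `ms` sub-queries play in the optimized SQL above.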
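Summary: when SQL grows large, a good structure can greatly improve execution efficiency. SQL optimization is also not achieved in one step; it is a gradual process of repeated trial. The SQL above is not necessarily optimal, and this post does not cover the finer points of SQL syntax best practices. For those, see the links below.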