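Analyzing and tuning database performance is hard. The author, Wang Hao, is an engineering manager at Pivotal Greenplum, and his Greenplum R&D team recently worked on a real customer's cluster-wide performance problem. This article records how the problem was analyzed and the reasoning behind the solution.
Part 1 of "[Field Notes] Tuning Greenplum with GPCC History Data for the First Time" covered the overall performance characteristics of the GPDB cluster. This second part analyzes the overall query workload.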
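Part 2: Analyzing the overall query workload
First, an introduction to the GPCC query history table and how it compares
Compared with GPPerfmon, the query history provides the following information: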
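The first step is to quantify the number of queries before and after the upgrade. GPPerfmon on GP4 only captured queries that ran for 20 seconds or longer, which makes the comparison harder.
The SQL below takes one week of history from GPPerfmon and one week from GPCC 4.8 and counts the queries by elapsed time: 0-20 seconds, 20-40 seconds, 40-60 seconds, 60 seconds to 2 minutes, 2-5 minutes, 5-10 minutes, and over 10 minutes. The GPCC query additionally splits the 0-20 second range at 1 second.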
-- GPPERFMON
SELECT
    sum(CASE WHEN tfinish - tsubmit <  INTERVAL '20s' THEN 1 ELSE 0 END) AS dur0_20
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s'  AND tfinish - tsubmit < INTERVAL '40s'  THEN 1 ELSE 0 END) AS dur20_40
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '40s'  AND tfinish - tsubmit < INTERVAL '60s'  THEN 1 ELSE 0 END) AS dur40_60
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '60s'  AND tfinish - tsubmit < INTERVAL '120s' THEN 1 ELSE 0 END) AS dur60_120
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '120s' AND tfinish - tsubmit < INTERVAL '300s' THEN 1 ELSE 0 END) AS dur120_300
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '300s' AND tfinish - tsubmit < INTERVAL '600s' THEN 1 ELSE 0 END) AS dur300_600
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '600s' THEN 1 ELSE 0 END) AS dur600plus
FROM public.queries_history
WHERE ctime >= '2019-09-01' AND ctime < '2019-09-08';

-- Result
dur0_20    | 0       -- GPPerfmon does not record queries shorter than 20 seconds
dur20_40   | 79649
dur40_60   | 22204
dur60_120  | 20452
dur120_300 | 11122
dur300_600 | 68062
dur600plus | 18
-- GPCC 4.8
SELECT
    sum(CASE WHEN tfinish - tsubmit <  INTERVAL '1s' THEN 1 ELSE 0 END) AS dur0_1
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '1s'   AND tfinish - tsubmit < INTERVAL '20s'  THEN 1 ELSE 0 END) AS dur1_20
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s'  AND tfinish - tsubmit < INTERVAL '40s'  THEN 1 ELSE 0 END) AS dur20_40
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '40s'  AND tfinish - tsubmit < INTERVAL '60s'  THEN 1 ELSE 0 END) AS dur40_60
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '60s'  AND tfinish - tsubmit < INTERVAL '120s' THEN 1 ELSE 0 END) AS dur60_120
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '120s' AND tfinish - tsubmit < INTERVAL '300s' THEN 1 ELSE 0 END) AS dur120_300
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '300s' AND tfinish - tsubmit < INTERVAL '600s' THEN 1 ELSE 0 END) AS dur300_600
  , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '600s' THEN 1 ELSE 0 END) AS dur600plus
FROM gpmetrics.gpcc_queries_history
WHERE ctime >= '2019-10-09' AND ctime < '2019-10-16';

-- Result
dur0_1     | 33370333   -- the GPCC 4.8 history shows a very large number of short queries
dur1_20    | 1072167
dur20_40   | 77928
dur40_60   | 23796
dur60_120  | 20230
dur120_300 | 21130
dur300_600 | 59711
dur600plus | 21
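As a quick check on the two result sets, the counts for queries of 20 seconds or longer can be compared bucket by bucket. The following is a minimal Python sketch, not part of the original analysis. The dictionary names and the `pct_change` helper are illustrative, and the numbers are copied from the results above.

```python
# Counts per duration bucket (queries of 20 seconds or longer), copied from the
# GPPerfmon (GP4) and GPCC 4.8 (GP5) result sets above.
gp4_counts = {
    "20_40": 79649, "40_60": 22204, "60_120": 20452,
    "120_300": 11122, "300_600": 68062, "600plus": 18,
}
gp5_counts = {
    "20_40": 77928, "40_60": 23796, "60_120": 20230,
    "120_300": 21130, "300_600": 59711, "600plus": 21,
}


def pct_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Per-bucket comparison.
for bucket, before in gp4_counts.items():
    after = gp5_counts[bucket]
    print(f"{bucket:>8}: {before:>6} -> {after:>6} ({pct_change(before, after):+.1f}%)")

# Overall comparison for queries of 20 seconds or longer.
gp4_total = sum(gp4_counts.values())
gp5_total = sum(gp5_counts.values())
print(f"   total: {gp4_total} -> {gp5_total} ({pct_change(gp4_total, gp5_total):+.2f}%)")
```

The totals come to 201507 queries on GP4 and 202816 on GP5, a difference of under 1%. Within that, the 120-300 second bucket nearly doubles while the 300-600 second bucket shrinks by about 12%.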
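The problem the user reported was an "overall performance degradation", so beyond query counts it is worth looking at average query time, in the hope that the averages corroborate the user's feedback. For a single query, tfinish - tsubmit gives its elapsed time. Plugging that into the previous query yields the average duration. The query below computes the average duration for each duration bucket.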
-- GPPERFMON
SELECT
    elp20_40   , elp20_40   / cnt20_40   avg20_40
  , elp40_60   , elp40_60   / cnt40_60   avg40_60
  , elp60_120  , elp60_120  / cnt60_120  avg60_120
  , elp120_300 , elp120_300 / cnt120_300 avg120_300
  , elp300_600 , elp300_600 / cnt300_600 avg300_600
  , elp600plus , elp600plus / cnt600plus avg600plus
FROM (
  SELECT
      -- number of queries in each bucket
      sum(CASE WHEN tfinish - tsubmit <  INTERVAL '20s' THEN 1 ELSE 0 END) AS cnt0_20
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s'  AND tfinish - tsubmit < INTERVAL '40s'  THEN 1 ELSE 0 END) AS cnt20_40
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '40s'  AND tfinish - tsubmit < INTERVAL '60s'  THEN 1 ELSE 0 END) AS cnt40_60
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '60s'  AND tfinish - tsubmit < INTERVAL '120s' THEN 1 ELSE 0 END) AS cnt60_120
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '120s' AND tfinish - tsubmit < INTERVAL '300s' THEN 1 ELSE 0 END) AS cnt120_300
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '300s' AND tfinish - tsubmit < INTERVAL '600s' THEN 1 ELSE 0 END) AS cnt300_600
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '600s' THEN 1 ELSE 0 END) AS cnt600plus
      -- total elapsed time in each bucket
    , sum(CASE WHEN tfinish - tsubmit <  INTERVAL '20s' THEN tfinish - tsubmit ELSE interval '0s' END) AS elp0_20
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s'  AND tfinish - tsubmit < INTERVAL '40s'  THEN tfinish - tsubmit ELSE interval '0s' END) AS elp20_40
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '40s'  AND tfinish - tsubmit < INTERVAL '60s'  THEN tfinish - tsubmit ELSE interval '0s' END) AS elp40_60
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '60s'  AND tfinish - tsubmit < INTERVAL '120s' THEN tfinish - tsubmit ELSE interval '0s' END) AS elp60_120
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '120s' AND tfinish - tsubmit < INTERVAL '300s' THEN tfinish - tsubmit ELSE interval '0s' END) AS elp120_300
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '300s' AND tfinish - tsubmit < INTERVAL '600s' THEN tfinish - tsubmit ELSE interval '0s' END) AS elp300_600
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '600s' THEN tfinish - tsubmit ELSE interval '0s' END) AS elp600plus
  FROM public.queries_history
  WHERE ctime >= '2019-09-01' AND ctime < '2019-09-08'
) dt;

-- Result
elp20_40   | 588:50:52         -- total elapsed time
avg20_40   | 00:00:26.614923   -- total elapsed time / number of queries = average duration
elp40_60   | 297:40:04
avg40_60   | 00:00:48.261755
elp60_120  | 463:34:22
avg60_120  | 00:01:21.598963
elp120_300 | 589:21:26
avg120_300 | 00:03:10.764791
elp300_600 | 6398:58:00
avg300_600 | 00:05:38.460227
elp600plus | 05:11:19
avg600plus | 00:17:17.722222
-- GPCC 4.8
-- Query omitted to save space: run the query above against gpmetrics.gpcc_queries_history,
-- using the GPCC week ('2019-10-09' to '2019-10-16') for the ctime range.

-- Result
elp20_40   | 592:01:47.648825    -- total elapsed time
avg20_40   | 00:00:27.349703     -- total elapsed time / number of queries = average duration
elp40_60   | 323:27:40.247104
avg40_60   | 00:00:48.935126
elp60_120  | 462:06:35.617476
avg60_120  | 00:01:22.234089
elp120_300 | 1322:00:31.29859
avg120_300 | 00:03:45.235745
elp300_600 | 5535:28:23.424853
avg300_600 | 00:05:33.735885
elp600plus | 05:42:28.81901
avg600plus | 00:16:18.515191
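The analysis above shows that, among all queries longer than 20 seconds, the average time in each bucket changed only slightly across the upgrade, and the overall average dropped from 2 minutes 29 seconds to 2 minutes 26 seconds. This result does not corroborate what the user reported.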
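The overall averages can be reproduced from the per-bucket totals and counts already shown. The following is a minimal Python sketch, not the author's own calculation. The helper names `to_seconds`, `overall_average`, and `fmt` are illustrative, and the figures are copied from the result sets above.

```python
def to_seconds(hms: str) -> float:
    """Convert an 'H:MM:SS[.ffffff]' interval string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# bucket -> (total elapsed time, number of queries), copied from the results above
gp4 = {
    "20_40": ("588:50:52", 79649),
    "40_60": ("297:40:04", 22204),
    "60_120": ("463:34:22", 20452),
    "120_300": ("589:21:26", 11122),
    "300_600": ("6398:58:00", 68062),
    "600plus": ("05:11:19", 18),
}
gp5 = {
    "20_40": ("592:01:47.648825", 77928),
    "40_60": ("323:27:40.247104", 23796),
    "60_120": ("462:06:35.617476", 20230),
    "120_300": ("1322:00:31.29859", 21130),
    "300_600": ("5535:28:23.424853", 59711),
    "600plus": ("05:42:28.81901", 21),
}


def overall_average(buckets: dict) -> float:
    """Total elapsed seconds across all buckets divided by total query count."""
    total_seconds = sum(to_seconds(elapsed) for elapsed, _ in buckets.values())
    total_queries = sum(count for _, count in buckets.values())
    return total_seconds / total_queries


def fmt(seconds: float) -> str:
    """Render seconds as 'XmYYs', truncating fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m{secs:02d}s"


print("GP4:", fmt(overall_average(gp4)))  # 2m29s
print("GP5:", fmt(overall_average(gp5)))  # 2m26s
```

The same `to_seconds` helper also confirms the per-bucket averages: 588:50:52 spread over 79649 queries is about 26.6 seconds, matching avg20_40 in the GPPerfmon result.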
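The statistics on query count and average duration do not seem to give a direct conclusion. Remaining skeptical, we went on to analyze the queries by database role, computing the number of queries and the average duration for each user.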
SELECT
    substring(md5(username) FROM 1 FOR 7) AS username   -- user names are masked with an md5 prefix
  , cnt_queries
  , total_time
  , EXTRACT(EPOCH FROM total_time/cnt_queries) avg_seconds
FROM (
  SELECT
      username
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s' THEN 1 ELSE 0 END) AS cnt_queries
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s' THEN tfinish - tsubmit ELSE interval '0s' END) AS total_time
  FROM public.queries_history
  WHERE ctime >= '2019-09-01' AND ctime < '2019-09-08'
  GROUP BY username
) dt
ORDER BY cnt_queries DESC;

-- GP4
 username | cnt_queries | total_time | avg_seconds
----------+-------------+------------+-------------
 a4fde70  |      182250 | 8062:16:23 |  159.254776
 550b111  |        7884 | 115:21:15  |   52.673135
 f8da676  |        5033 | 40:15:05   |    28.79098
 b6c1345  |        3210 | 50:40:41   |   56.835202
 83a41d3  |         905 | 09:48:52   |   39.040884
 ba9ae16  |         880 | 42:09:31   |  172.467045
 4287401  |         744 | 09:35:23   |   46.401882
 0636d5d  |         318 | 09:46:21   |  110.632075
 239c70a  |         237 | 01:46:05   |    26.85654
 bd0fcb7  |          28 | 00:16:40   |   35.714286
 04ba18a  |           9 | 00:04:33   |   30.333333
 3a96750  |           5 | 00:02:32   |        30.4
 8f1681a  |           2 | 01:31:40   |        2750
 807a26e  |           1 | 00:00:22   |          22
 a96f2c8  |           1 | 00:00:40   |          40

-- GP5
 username | cnt_queries | total_time | avg_seconds
----------+-------------+------------+-------------
 a4fde70  |      178959 | 7925:55:34 |  159.440618
 550b111  |        8808 | 158:16:12  |   64.687987
 83a41d3  |        8013 | 71:28:50   |   32.114037
 f8da676  |        2863 | 25:13:51   |   31.725721
 b6c1345  |        2841 | 36:52:19   |   46.722593
 4287401  |         438 | 08:05:10   |   66.460471
 ba9ae16  |         370 | 03:38:07   |   35.370272
 0636d5d  |         328 | 09:28:34   |  104.006288
 239c70a  |         105 | 00:46:04   |   26.324939
 27b686a  |          49 | 00:36:27   |   44.634513
 bd0fcb7  |          32 | 00:18:48   |   35.242934
 3a96750  |           6 | 00:05:00   |   49.959448
 15340c2  |           2 | 00:01:31   |    45.54025
 807a26e  |           1 | 00:00:40   |   39.741755
 a96f2c8  |           1 | 00:00:22   |   21.997138
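The comparison shows that, among users with a meaningful number of queries, only three (550b111, f8da676, and 4287401) had a clearly higher average query time after the upgrade.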
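The per-user comparison can be reproduced from the two result tables. The sketch below is a hedged illustration rather than the author's procedure. The thresholds `MIN_QUERIES` and `MIN_INCREASE_PCT` are assumptions introduced here to filter out users with too few queries or negligible changes, and the variable names are illustrative.

```python
# user -> (cnt_queries, avg_seconds), copied from the GP4 and GP5 tables above
gp4 = {
    "a4fde70": (182250, 159.254776), "550b111": (7884, 52.673135),
    "f8da676": (5033, 28.79098),     "b6c1345": (3210, 56.835202),
    "83a41d3": (905, 39.040884),     "ba9ae16": (880, 172.467045),
    "4287401": (744, 46.401882),     "0636d5d": (318, 110.632075),
    "239c70a": (237, 26.85654),      "bd0fcb7": (28, 35.714286),
    "04ba18a": (9, 30.333333),       "3a96750": (5, 30.4),
    "8f1681a": (2, 2750.0),          "807a26e": (1, 22.0),
    "a96f2c8": (1, 40.0),
}
gp5 = {
    "a4fde70": (178959, 159.440618), "550b111": (8808, 64.687987),
    "83a41d3": (8013, 32.114037),    "f8da676": (2863, 31.725721),
    "b6c1345": (2841, 46.722593),    "4287401": (438, 66.460471),
    "ba9ae16": (370, 35.370272),     "0636d5d": (328, 104.006288),
    "239c70a": (105, 26.324939),     "27b686a": (49, 44.634513),
    "bd0fcb7": (32, 35.242934),      "3a96750": (6, 49.959448),
    "15340c2": (2, 45.54025),        "807a26e": (1, 39.741755),
    "a96f2c8": (1, 21.997138),
}

# Assumed thresholds (not from the article): ignore users with fewer than 100
# queries in either week, and ignore increases of 1% or less.
MIN_QUERIES = 100
MIN_INCREASE_PCT = 1.0

slower_users = []
for user, (cnt4, avg4) in gp4.items():
    if user not in gp5:
        continue  # user has no queries of 20s or longer on GP5
    cnt5, avg5 = gp5[user]
    if cnt4 < MIN_QUERIES or cnt5 < MIN_QUERIES:
        continue  # too few queries for a stable average
    increase_pct = (avg5 - avg4) / avg4 * 100.0
    if increase_pct > MIN_INCREASE_PCT:
        slower_users.append(user)
        print(f"{user}: {avg4:.1f}s -> {avg5:.1f}s ({increase_pct:+.1f}%)")
```

With these assumed thresholds the script reports 550b111, f8da676, and 4287401. Without them, a4fde70 (about +0.1%) and the low-volume users 3a96750 and 807a26e would also register as increases, which is why some filtering is needed to read the tables.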
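Unfortunately, the GPPerfmon data on GP4 has no records of short queries, and it records too few performance metrics (there are no disk I/O metrics, for example), so an in-depth comparison with the GP5 history is not possible. Based on the analysis so far, we went back to the customer and confirmed the following facts: query volume is essentially unchanged, the average duration of slow queries (20 seconds or longer) has not increased, and only a small number of users' queries did become slightly slower. The query data that GPCC 4.8 collects on GP5 is not subject to the 20-second limit, so the GP5 history can be analyzed on its own to characterize each user's overall query profile. Here we split at 1 second and count, for each user, the queries that finish within 1 second and those that take longer than 1 second: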
(To be continued)
For more technical content on Greenplum, visit the Greenplum Chinese community website.