[Field Notes] Tuning Greenplum with GPCC History Data for the First Time, Part 2

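Database performance analysis and tuning is a hard problem. The author, Wang Hao, is an engineering manager for Pivotal Greenplum, and his Greenplum R&D team recently worked on a system-wide performance problem reported by a real user. This article records the analysis process and the reasoning behind the solution.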

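Part 1 of "[Field Notes] Tuning Greenplum with GPCC History Data for the First Time" covered the overall performance characteristics of the GPDB cluster. Part 2 moves on to the practical material: analyzing the overall query workload.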

Part 2: Analyzing the Overall Query Workload

First, an introduction to the GPCC query history table and how it compares

Compared with GPPerfmon, the query history provides the following information:

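The first step is a quantitative analysis of the number of queries before and after the upgrade. GPPerfmon on GP4 only captures queries that run for 20 seconds or longer, which makes the comparison somewhat harder.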

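The SQL below takes one week of history from GPPerfmon and one week from GPCC 4.8, and counts queries by execution time in the following buckets: 0-20 seconds, 20-40 seconds, 40-60 seconds, 60 seconds to 2 minutes, 2-5 minutes, 5-10 minutes, and over 10 minutes. The GPCC query additionally splits the first bucket at 1 second.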

-- GPPERFMON
SELECT sum(CASE WHEN tfinish - tsubmit < INTERVAL '20s' THEN 1 ELSE 0 END)  AS dur0_20
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s'
               AND tfinish - tsubmit < INTERVAL '40s' THEN 1 ELSE 0 END)    AS dur20_40
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '40s'
               AND tfinish - tsubmit < INTERVAL '60s' THEN 1 ELSE 0 END)    AS dur40_60
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '60s'
               AND tfinish - tsubmit < INTERVAL '120s' THEN 1 ELSE 0 END)   AS dur60_120
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '120s'
               AND tfinish - tsubmit < INTERVAL '300s' THEN 1 ELSE 0 END)   AS dur120_300
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '300s'
               AND tfinish - tsubmit < INTERVAL '600s' THEN 1 ELSE 0 END)   AS dur300_600
    , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '600s' THEN 1 ELSE 0 END) AS dur600plus
FROM public.queries_history
WHERE ctime >= '2019-09-01' AND ctime < '2019-09-08';

-- Results
dur0_20    | 0          -- GPPerfmon does not record queries under 20 seconds
dur20_40   | 79649
dur40_60   | 22204
dur60_120  | 20452
dur120_300 | 11122
dur300_600 | 68062
dur600plus | 18
-- GPCC 4.8
SELECT sum(CASE WHEN tfinish - tsubmit < INTERVAL '1s' THEN 1 ELSE 0 END)  AS dur0_1
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '1s'
              AND tfinish - tsubmit < INTERVAL '20s' THEN 1 ELSE 0 END)  AS dur1_20
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s'
              AND tfinish - tsubmit < INTERVAL '40s' THEN 1 ELSE 0 END)    AS dur20_40
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '40s'
              AND tfinish - tsubmit < INTERVAL '60s' THEN 1 ELSE 0 END)    AS dur40_60
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '60s'
              AND tfinish - tsubmit < INTERVAL '120s' THEN 1 ELSE 0 END)   AS dur60_120
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '120s'
              AND tfinish - tsubmit < INTERVAL '300s' THEN 1 ELSE 0 END)   AS dur120_300
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '300s'
              AND tfinish - tsubmit < INTERVAL '600s' THEN 1 ELSE 0 END)   AS dur300_600
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '600s' THEN 1 ELSE 0 END) AS dur600plus
FROM gpmetrics.gpcc_queries_history
WHERE ctime >= '2019-10-09' AND ctime < '2019-10-16';

-- Results
dur0_1     | 33370333  -- the GPCC 4.8 history shows a very large number of short queries
dur1_20    | 1072167
dur20_40   | 77928
dur40_60   | 23796
dur60_120  | 20230
dur120_300 | 21130
dur300_600 | 59711
dur600plus | 21

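Analysis

  • The GP5 history shows more than 30 million short queries under 1 second in one week, mixed with analytical queries of 5-10 minutes. This is a fairly typical HTAP mixed workload, and system resources have been running under consistently high load.
  • The user reported turning off ORCA for performance reasons, which is also consistent with a workload dominated by short queries.
  • Only queries of 20 seconds or longer can be compared. For these, the results above show that the counts before and after the upgrade are essentially level: 201507 queries on GP4 versus 202816 on GP5, a difference of less than 1%.
  • In the 2-5 minute bucket the GP5 count nearly doubled (11122 to 21130), while the 20-40 second and 5-10 minute buckets both dropped on GP5, so the overall difference is not significant.
  • Overall, the user's workload did not change qualitatively across the upgrade, which largely rules out an increased workload as the cause of the slower system response.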
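The totals and the per-bucket changes quoted in the bullets can be rechecked with a few lines of Python. This is only a sketch for verifying the arithmetic: the counts are copied from the two result sets above, and the helper name `pct_change` is introduced here for illustration.

```python
# Query counts per duration bucket, copied from the two result sets above.
gp4 = {"20_40": 79649, "40_60": 22204, "60_120": 20452,
       "120_300": 11122, "300_600": 68062, "600plus": 18}
gp5 = {"20_40": 77928, "40_60": 23796, "60_120": 20230,
       "120_300": 21130, "300_600": 59711, "600plus": 21}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


total_gp4 = sum(gp4.values())
total_gp5 = sum(gp5.values())
print(f"total: {total_gp4} -> {total_gp5} "
      f"({pct_change(total_gp4, total_gp5):+.2f}%)")

# Per-bucket comparison of queries running 20 seconds or longer.
for bucket in gp4:
    print(f"{bucket:>8}: {gp4[bucket]:>6} -> {gp5[bucket]:>6} "
          f"({pct_change(gp4[bucket], gp5[bucket]):+.1f}%)")
```

The sums come to 201507 and 202816, which matches the figures in the text, and the 2-5 minute bucket is the only one that moves by a large factor.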

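Because the problem reported by the user is "overall performance degradation", it is worth going beyond query counts and looking at average query time, in the hope that it will corroborate the user's feedback. For a single query, tfinish - tsubmit gives the execution time; plugging this into the previous query yields the average duration. The following query computes the average duration for each duration bucket.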

-- GPPERFMON
SELECT
     elp20_40
   , elp20_40 / cnt20_40     avg20_40
   , elp40_60
   , elp40_60 / cnt40_60     avg40_60
   , elp60_120
   , elp60_120 / cnt60_120   avg60_120
   , elp120_300
   , elp120_300 / cnt120_300 avg120_300
   , elp300_600
   , elp300_600 / cnt300_600 avg300_600
   , elp600plus
   , elp600plus / cnt600plus avg600plus
FROM (
SELECT
     sum(CASE WHEN tfinish - tsubmit < INTERVAL '20s' THEN 1 ELSE 0 END)   AS cnt0_20
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s'
              AND tfinish - tsubmit < INTERVAL '40s' THEN 1 ELSE 0 END)    AS cnt20_40
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '40s'
              AND tfinish - tsubmit < INTERVAL '60s' THEN 1 ELSE 0 END)    AS cnt40_60
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '60s'
              AND tfinish - tsubmit < INTERVAL '120s' THEN 1 ELSE 0 END)   AS cnt60_120
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '120s'
              AND tfinish - tsubmit < INTERVAL '300s' THEN 1 ELSE 0 END)   AS cnt120_300
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '300s'
              AND tfinish - tsubmit < INTERVAL '600s' THEN 1 ELSE 0 END)   AS cnt300_600
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '600s' THEN 1 ELSE 0 END) AS cnt600plus
   , sum(CASE WHEN tfinish - tsubmit < INTERVAL '20s' THEN tfinish - tsubmit ELSE interval '0s' END)   AS elp0_20
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s'
              AND tfinish - tsubmit < INTERVAL '40s' THEN tfinish - tsubmit ELSE interval '0s' END)    AS elp20_40
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '40s'
              AND tfinish - tsubmit < INTERVAL '60s' THEN tfinish - tsubmit ELSE interval '0s' END)    AS elp40_60
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '60s'
              AND tfinish - tsubmit < INTERVAL '120s' THEN tfinish - tsubmit ELSE interval '0s' END)   AS elp60_120
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '120s'
              AND tfinish - tsubmit < INTERVAL '300s' THEN tfinish - tsubmit ELSE interval '0s' END)   AS elp120_300
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '300s'
              AND tfinish - tsubmit < INTERVAL '600s' THEN tfinish - tsubmit ELSE interval '0s' END)   AS elp300_600
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '600s' THEN tfinish - tsubmit ELSE interval '0s' END) AS elp600plus
FROM public.queries_history
WHERE ctime >= '2019-09-01' AND ctime < '2019-09-08'
) dt;

-- Results
elp20_40   | 588:50:52         -- total elapsed time
avg20_40   | 00:00:26.614923   -- total elapsed time / query count = average duration
elp40_60   | 297:40:04
avg40_60   | 00:00:48.261755
elp60_120  | 463:34:22
avg60_120  | 00:01:21.598963
elp120_300 | 589:21:26
avg120_300 | 00:03:10.764791
elp300_600 | 6398:58:00
avg300_600 | 00:05:38.460227
elp600plus | 05:11:19
avg600plus | 00:17:17.722222
-- GPCC 4.8
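-- Omitted to save space: the same query as above, run against gpmetrics.gpcc_queries_history
-- (with the date range set to the GPCC week)

-- Results
elp20_40   | 592:01:47.648825  -- total elapsed time
avg20_40   | 00:00:27.349703   -- total elapsed time / query count = average duration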
elp40_60   | 323:27:40.247104
avg40_60   | 00:00:48.935126
elp60_120  | 462:06:35.617476
avg60_120  | 00:01:22.234089
elp120_300 | 1322:00:31.29859
avg120_300 | 00:03:45.235745
elp300_600 | 5535:28:23.424853
avg300_600 | 00:05:33.735885
elp600plus | 05:42:28.81901
avg600plus | 00:16:18.515191

Analysis

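The analysis above shows that, across all queries of 20 seconds or longer, the average time in each bucket changed only slightly with the upgrade, and the overall average query duration dropped from 2 minutes 29 seconds to 2 minutes 26 seconds. This result does not corroborate what the user reported.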
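The overall averages are not printed by the queries themselves. They can be derived by summing the per-bucket elapsed times and dividing by the total query count. The sketch below does this in Python. The elapsed times and counts are copied from the results above, while the helper names `hms` and `overall_average_seconds` are introduced here for illustration.

```python
from datetime import timedelta


def hms(text):
    """Parse an 'HHH:MM:SS[.ffffff]' interval string into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=float(seconds))


# (total elapsed time, query count) per bucket, from the results above.
gp4 = [("588:50:52", 79649), ("297:40:04", 22204), ("463:34:22", 20452),
       ("589:21:26", 11122), ("6398:58:00", 68062), ("05:11:19", 18)]
gp5 = [("592:01:47.648825", 77928), ("323:27:40.247104", 23796),
       ("462:06:35.617476", 20230), ("1322:00:31.29859", 21130),
       ("5535:28:23.424853", 59711), ("05:42:28.81901", 21)]


def overall_average_seconds(buckets):
    """Weighted average: total elapsed seconds divided by total query count."""
    total_seconds = sum(hms(elapsed).total_seconds() for elapsed, _ in buckets)
    total_queries = sum(count for _, count in buckets)
    return total_seconds / total_queries


avg_gp4 = overall_average_seconds(gp4)
avg_gp5 = overall_average_seconds(gp5)
print(f"GP4: {avg_gp4:.1f} s, GP5: {avg_gp5:.1f} s")
```

This gives roughly 149 seconds for GP4 and 146 seconds for GP5, consistent with the 2 minutes 29 seconds and 2 minutes 26 seconds quoted above.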

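The statistics on query counts and average durations above give no direct conclusion. Still skeptical, we went on to analyze the queries of each database role, computing the number of queries and the average duration per user.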

SELECT
     substring(md5(username) FROM 1 FOR 7) AS username
   , cnt_queries
   , total_time
   , EXTRACT(EPOCH FROM total_time/cnt_queries) avg_seconds
FROM (
SELECT
     username
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s' THEN 1 ELSE 0 END) AS cnt_queries
   , sum(CASE WHEN tfinish - tsubmit >= INTERVAL '20s' THEN tfinish - tsubmit ELSE interval '0s' END) AS total_time
FROM public.queries_history
WHERE ctime >= '2019-09-01' AND ctime < '2019-09-08'
GROUP BY username
) dt
ORDER BY cnt_queries DESC;

-- GP4
 username | cnt_queries | total_time | avg_seconds
----------+-------------+------------+-------------
 a4fde70  |      182250 | 8062:16:23 |  159.254776
 550b111  |        7884 |  115:21:15 |   52.673135
 f8da676  |        5033 |   40:15:05 |    28.79098
 b6c1345  |        3210 |   50:40:41 |   56.835202
 83a41d3  |         905 |   09:48:52 |   39.040884
 ba9ae16  |         880 |   42:09:31 |  172.467045
 4287401  |         744 |   09:35:23 |   46.401882
 0636d5d  |         318 |   09:46:21 |  110.632075
 239c70a  |         237 |   01:46:05 |    26.85654
 bd0fcb7  |          28 |   00:16:40 |   35.714286
 04ba18a  |           9 |   00:04:33 |   30.333333
 3a96750  |           5 |   00:02:32 |        30.4
 8f1681a  |           2 |   01:31:40 |        2750
 807a26e  |           1 |   00:00:22 |          22
 a96f2c8  |           1 |   00:00:40 |          40

-- GP5
 username | cnt_queries | total_time | avg_seconds
----------+-------------+------------+-------------
 a4fde70  |      178959 | 7925:55:34 |  159.440618
 550b111  |        8808 |  158:16:12 |   64.687987
 83a41d3  |        8013 |   71:28:50 |   32.114037
 f8da676  |        2863 |   25:13:51 |   31.725721
 b6c1345  |        2841 |   36:52:19 |   46.722593
 4287401  |         438 |   08:05:10 |   66.460471
 ba9ae16  |         370 |   03:38:07 |   35.370272
 0636d5d  |         328 |   09:28:34 |  104.006288
 239c70a  |         105 |   00:46:04 |   26.324939
 27b686a  |          49 |   00:36:27 |   44.634513
 bd0fcb7  |          32 |   00:18:48 |   35.242934
 3a96750  |           6 |   00:05:00 |   49.959448
 15340c2  |           2 |   00:01:31 |    45.54025
 807a26e  |           1 |   00:00:40 |   39.741755
 a96f2c8  |           1 |   00:00:22 |   21.997138

Analysis

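The comparison shows that, among users with a meaningful number of queries, only three (550b111, f8da676, and 4287401) had a clearly higher average duration after the upgrade. The average for a4fde70 is essentially unchanged, and the other users whose averages rose submitted only a handful of queries.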
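The per-user comparison can be made explicit with a short Python sketch. The query counts and averages are copied from the GP4 and GP5 tables above for users that appear in both. The two thresholds (at least 100 queries in each week, and an increase of more than 5%) are assumptions chosen here for illustration and are not part of the original analysis.

```python
# user -> (cnt_queries, avg_seconds), copied from the two tables above.
gp4 = {"a4fde70": (182250, 159.254776), "550b111": (7884, 52.673135),
       "f8da676": (5033, 28.79098), "b6c1345": (3210, 56.835202),
       "83a41d3": (905, 39.040884), "ba9ae16": (880, 172.467045),
       "4287401": (744, 46.401882), "0636d5d": (318, 110.632075),
       "239c70a": (237, 26.85654), "bd0fcb7": (28, 35.714286),
       "3a96750": (5, 30.4), "807a26e": (1, 22.0), "a96f2c8": (1, 40.0)}
gp5 = {"a4fde70": (178959, 159.440618), "550b111": (8808, 64.687987),
       "83a41d3": (8013, 32.114037), "f8da676": (2863, 31.725721),
       "b6c1345": (2841, 46.722593), "4287401": (438, 66.460471),
       "ba9ae16": (370, 35.370272), "0636d5d": (328, 104.006288),
       "239c70a": (105, 26.324939), "bd0fcb7": (32, 35.242934),
       "3a96750": (6, 49.959448), "807a26e": (1, 39.741755),
       "a96f2c8": (1, 21.997138)}

MIN_QUERIES = 100    # assumed: ignore users with too few queries to judge
MIN_INCREASE = 0.05  # assumed: ignore changes of 5% or less

slower = []
for user, (cnt4, avg4) in gp4.items():
    if user not in gp5:
        continue
    cnt5, avg5 = gp5[user]
    if min(cnt4, cnt5) < MIN_QUERIES:
        continue
    if (avg5 - avg4) / avg4 > MIN_INCREASE:
        slower.append(user)

print(slower)
```

With these thresholds the list contains exactly the three users named above. Changing either threshold changes which users are flagged, so the cut-offs should be treated as a judgment call.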

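Unfortunately, the GPPerfmon data on GP4 has no record of short queries, and the metrics it records are limited (for example, there are no disk I/O metrics), so a deeper comparison with the GP5 history is not possible. Based on the results so far, we went back to the customer and confirmed the following facts: query volume is essentially unchanged, the average duration of slow queries of 20 seconds or longer has not increased, and only a few users' queries did get slightly slower. The query data collected on GP5 with GPCC 4.8 is not subject to the 20-second limit, so the GP5 history can be used to analyze each user's overall query profile. Here we take 1 second as the boundary and compute statistics separately for queries under 1 second and queries over 1 second: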
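A split of this kind can be sketched in a few lines of Python. This is a minimal illustration and not the query used in the original analysis: the function name `split_by_user`, the input shape (pairs of user name and duration in seconds), and the sample rows are all hypothetical.

```python
from collections import defaultdict

SHORT_LIMIT_SECONDS = 1.0  # the 1-second boundary used in the text


def split_by_user(rows, limit=SHORT_LIMIT_SECONDS):
    """Group (username, duration_seconds) rows per user into short and long.

    Returns {username: {"short": [count, total_seconds],
                        "long":  [count, total_seconds]}}.
    """
    stats = defaultdict(lambda: {"short": [0, 0.0], "long": [0, 0.0]})
    for username, seconds in rows:
        kind = "short" if seconds < limit else "long"
        stats[username][kind][0] += 1
        stats[username][kind][1] += seconds
    return dict(stats)


# Hypothetical sample rows, not taken from the customer's system.
sample = [("user_a", 0.2), ("user_a", 0.5), ("user_a", 330.0),
          ("user_b", 0.1), ("user_b", 4.0)]
result = split_by_user(sample)
for user, groups in result.items():
    print(user, groups)
```

Keeping both the count and the total time per group makes it possible to see at a glance whether a user is dominated by many short queries or by a few long ones.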

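Analysis

  • The total query time (shown in magenta) makes clear that the top user, a4fde70, contributes the vast majority of the system's workload. Its share of database run time is dominant, its average query duration is also relatively long, and its workload is mainly analytical.
  • The query counts (the blue numbers) show that the top three users by query count contribute a large number of short queries. The second of them, f8da676, has the shortest average duration together with a large query count, so it can be inferred to be the main short-query database user.
  • Overall, the system leans toward short queries. Systems of this kind are sensitive to response time, so it is worth digging further into the user's feedback.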

(To be continued)

For more in-depth technical content on Greenplum, visit the Greenplum Chinese community website.

