Presto Performance Optimization Tips

Date: 2019-11-06

1. Select only the columns you need

[GOOD]: SELECT time,user,host FROM tbl
[BAD]: SELECT * FROM tbl

2. Choose partition columns sensibly

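When a filter condition applies to a partition column, Presto can skip the partitions that cannot match, which shrinks the amount of data scanned and improves query performance.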

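The choice depends on the OLAP workload: make the columns you filter on most often the partition columns. Examples are order time (for statistics over time ranges) and tenant id (for per-tenant statistics on a multi-tenant platform).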
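The effect of partition pruning can be sketched in a few lines of Python. This is only an illustration of the idea, not how Presto or its connectors store data; the `ORDERS` rows and the helper names `partition_by`, `total_amount`, and `total_amount_full_scan` are hypothetical.

```python
from collections import defaultdict

# Hypothetical order rows; in a real table each partition would be a set of files.
ORDERS = [
    {"order_date": "2019-11-04", "tenant_id": 1, "amount": 10},
    {"order_date": "2019-11-04", "tenant_id": 2, "amount": 25},
    {"order_date": "2019-11-05", "tenant_id": 1, "amount": 40},
    {"order_date": "2019-11-06", "tenant_id": 2, "amount": 15},
]


def partition_by(rows, column):
    """Group rows by the value of the partition column."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return dict(parts)


def total_amount(partitions, order_date):
    """Sum amounts for one date, reading only the matching partition.

    Returns (total, rows_scanned).
    """
    rows = partitions.get(order_date, [])
    return sum(r["amount"] for r in rows), len(rows)


def total_amount_full_scan(rows, order_date):
    """Same result without partitioning: every row has to be examined."""
    total = sum(r["amount"] for r in rows if r["order_date"] == order_date)
    return total, len(rows)


PARTITIONS = partition_by(ORDERS, "order_date")
```

Both functions return the same total for a given date, but the partitioned version touches only the rows stored under that date, while the full scan reads the whole table.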

3. Consider column cardinality when using GROUP BY

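Cardinality is the number of distinct values a column holds. For example, a gender column typically has a cardinality of 2, and a month column has a cardinality of 12.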

In a GROUP BY clause, list the columns with higher cardinality first.

[GOOD]: SELECT uid, gender, count(*) FROM tbl GROUP BY uid, gender
[BAD]: SELECT uid, gender, count(*) FROM tbl GROUP BY gender, uid

If the GROUP BY columns are numeric, they use less memory than string columns.
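If the cardinalities are not known in advance, they can be measured on a sample of the data before writing the query. The following is a minimal sketch under that assumption; the helper name `group_by_order` and the sample rows are hypothetical.

```python
def group_by_order(rows, columns):
    """Return the columns sorted by descending cardinality, plus the counts."""
    cardinality = {c: len({row[c] for row in rows}) for c in columns}
    ordered = sorted(columns, key=lambda c: cardinality[c], reverse=True)
    return ordered, cardinality


# Hypothetical sample: four users, two genders.
SAMPLE = [
    {"uid": 1, "gender": "F"},
    {"uid": 2, "gender": "M"},
    {"uid": 3, "gender": "F"},
    {"uid": 4, "gender": "M"},
]

ORDERED, CARDINALITY = group_by_order(SAMPLE, ["gender", "uid"])
```

Here `ORDERED` lists `uid` before `gender`, which matches the [GOOD] ordering above.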

4. Use ORDER BY together with LIMIT (top-N)

[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100
[BAD]: SELECT * FROM tbl ORDER BY time

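ORDER BY has to gather all rows on a single worker to sort them, which consumes a large amount of memory. Adding LIMIT reduces that memory use considerably and improves query performance.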

A top-N query needs only a priority queue of size N, which takes very little memory.
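The priority-queue idea can be sketched in Python with the standard `heapq` module. This shows the general technique rather than Presto's actual operator; the function name `top_n_smallest` is hypothetical, and the sketch assumes a numeric sort key so that it can be negated.

```python
import heapq


def top_n_smallest(rows, n, key):
    """Return the n rows with the smallest key, in ascending order.

    At most n rows are held at any time, mirroring ORDER BY ... LIMIT n.
    """
    heap = []  # max-heap on the key, emulated by storing the negated key
    for seq, row in enumerate(rows):
        k = key(row)
        if len(heap) < n:
            heapq.heappush(heap, (-k, seq, row))
        elif -heap[0][0] > k:
            # The new row beats the largest row currently kept: swap it in.
            heapq.heapreplace(heap, (-k, seq, row))
    # Sort the survivors by ascending key, then by arrival order.
    return [row for _, _, row in sorted(heap, key=lambda e: (-e[0], e[1]))]


ROWS = [{"time": t} for t in (5, 1, 9, 3, 7, 2)]
FIRST_THREE = top_n_smallest(ROWS, 3, key=lambda r: r["time"])
```

The input can be streamed row by row, so memory grows with N and not with the size of the table. In everyday Python code `heapq.nsmallest` does the same job.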

5. Use approximate aggregate functions

Presto provides approximate aggregate functions that significantly speed up such statistics, at the cost of some accuracy.

For example, approx_distinct returns an approximate distinct count with a standard error of 2.3%.

SELECT
  approx_distinct(user_id)
FROM
  access
WHERE
  TD_TIME_RANGE(time,
    TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d', 'PDT'),
    TD_SCHEDULED_TIME())

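The example above counts the distinct users who visited during the previous day (UV). The TD_* functions are Treasure Data UDFs, not built-in Presto functions.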
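Approximate distinct counts are commonly built on the HyperLogLog algorithm, which keeps a small fixed array of registers and never stores the values themselves. The sketch below is a simplified HyperLogLog in Python, not Presto's implementation. With m = 2048 registers the theoretical standard error is 1.04 / sqrt(2048), or about 2.3%, which matches the figure quoted above; treating that as the configuration Presto uses is an assumption. The function name and the choice of SHA-1 as the hash are also assumptions made for illustration.

```python
import hashlib
import math


def approx_distinct(values, p=11):
    """HyperLogLog estimate of the number of distinct values (m = 2**p registers)."""
    m = 1 << p
    registers = [0] * m
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        index = h >> (64 - p)                         # top p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)              # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1       # position of the leftmost 1-bit
        registers[index] = max(registers[index], rank)
    alpha = 0.7213 / (1 + 1.079 / m)                  # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                      # small-range correction
        return m * math.log(m / zeros)
    return raw


# 50,000 distinct ids, each seen four times.
user_ids = [i % 50_000 for i in range(200_000)]
estimate = approx_distinct(user_ids)
relative_error = abs(estimate - 50_000) / 50_000
```

The memory needed is the 2048 registers regardless of how many rows are scanned, which is where the speed and memory advantage over an exact COUNT(DISTINCT ...) comes from.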

6. Use regexp_like

A filter that chains several LIKE conditions with OR, such as:

SELECT
  ...
FROM
  access
WHERE
  method LIKE '%GET%' OR
  method LIKE '%POST%' OR
  method LIKE '%PUT%' OR
  method LIKE '%DELETE%'

can be rewritten with a single regexp_like call:

SELECT
  ...
FROM
  access
WHERE
  regexp_like(method, 'GET|POST|PUT|DELETE')
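The two filters are equivalent because the regular expression only has to match somewhere inside the string, just like a LIKE pattern with a leading and trailing %. The Python sketch below checks that equivalence on a few sample values; the function names are hypothetical, and Python's `re` module stands in for Presto's regex engine.

```python
import re

# One compiled alternation replaces four separate substring tests.
METHOD_PATTERN = re.compile(r"GET|POST|PUT|DELETE")


def matches_with_like(method):
    """Equivalent of four LIKE '%...%' conditions joined with OR."""
    return any(word in method for word in ("GET", "POST", "PUT", "DELETE"))


def matches_with_regex(method):
    """Equivalent of regexp_like(method, 'GET|POST|PUT|DELETE')."""
    return METHOD_PATTERN.search(method) is not None


SAMPLES = ["GET", "POST /login", "X-PUT", "DELETE", "HEAD", "OPTIONS", ""]
AGREEMENT = [matches_with_like(s) == matches_with_regex(s) for s in SAMPLES]
```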

7. Put the larger table on the left side of a join

When Presto runs a broadcast join, the entire right-hand table is sent to every worker and joined there against the left-hand table, of which each worker holds only a portion.

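For example, take an order table and a customer table, where order metrics are analyzed per customer. Orders usually far outnumber customers, so the join is written as order LEFT JOIN customer.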

With 10 workers, each worker holds 1/10 of the order data (assuming an even distribution), and the full customer table is then sent to all 10 workers for the join.
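The data movement in a broadcast join can be sketched as follows. This is a simplified, single-process model of the idea described above, not Presto code; the function name `broadcast_join`, the column names, and the sample rows are all hypothetical.

```python
def broadcast_join(orders, customers, num_workers):
    """Simulate order LEFT JOIN customer with the right table broadcast.

    Returns (joined_rows, customer_rows_held_per_worker).
    """
    # Each worker holds one slice of the left (large) table.
    order_slices = [orders[i::num_workers] for i in range(num_workers)]
    joined = []
    rows_held = []
    for order_slice in order_slices:
        # Every worker receives a full copy of the right (small) table.
        lookup = {c["customer_id"]: c for c in customers}
        rows_held.append(len(lookup))
        for order in order_slice:
            customer = lookup.get(order["customer_id"])
            name = customer["name"] if customer else None
            joined.append({**order, "name": name})
    return joined, rows_held


CUSTOMERS = [{"customer_id": i, "name": f"user{i}"} for i in range(3)]
ORDERS = [{"order_id": i, "customer_id": i % 3, "amount": 10 * i} for i in range(6)]

JOINED, ROWS_HELD = broadcast_join(ORDERS, CUSTOMERS, num_workers=2)
```

`ROWS_HELD` shows that every worker keeps the complete customer table in memory, which is why the smaller table belongs on the right.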

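If the right-hand table really is large, you may hit "ERROR: Exceeded max memory xxGB", where xxGB is the maximum memory a worker may use per query, as set in the configuration file. Exceeding that threshold raises this error. In that case, either adjust the setting or use a "distributed hash join".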

The Presto properties reference describes the join-distribution-type property as follows:

The type of distributed join to use.

When set to PARTITIONED, presto will use hash distributed joins. When set to BROADCAST, it will broadcast the right table to all nodes in the cluster that have data from the left table.

Partitioned joins require redistributing both tables using a hash of the join key. This can be slower (sometimes substantially) than broadcast joins, but allows much larger joins. In particular broadcast joins will be faster if the right table is much smaller than the left. However, broadcast joins require that the tables on the right side of the join after filtering fit in memory on each node, whereas distributed joins only need to fit in distributed memory across all nodes.

When set to AUTOMATIC, Presto will make a cost based decision as to which distribution type is optimal. It will also consider switching the left and right inputs to the join. In AUTOMATIC mode, Presto will default to hash distributed joins if no cost could be computed, such as if the tables do not have statistics. This can also be specified on a per-query basis using the join_distribution_type session property.
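The memory trade-off of a partitioned (hash distributed) join can be sketched in Python. Both tables are redistributed by a hash of the join key, so each worker holds only its share of the right table. This is a simplified model, not Presto code; the function name `partitioned_join` and the sample data are hypothetical, and the modulo on an integer key stands in for a real hash function.

```python
def partitioned_join(orders, customers, num_workers):
    """Simulate order LEFT JOIN customer with both sides hash-partitioned.

    Returns (joined_rows, customer_rows_held_per_worker).
    """
    def worker_for(key):
        # Assumption: integer join keys; modulo plays the role of the hash.
        return key % num_workers

    order_parts = [[] for _ in range(num_workers)]
    customer_parts = [{} for _ in range(num_workers)]
    # Redistribute both tables so rows with equal keys meet on one worker.
    for order in orders:
        order_parts[worker_for(order["customer_id"])].append(order)
    for customer in customers:
        customer_parts[worker_for(customer["customer_id"])][customer["customer_id"]] = customer

    joined = []
    for worker in range(num_workers):
        lookup = customer_parts[worker]  # only this worker's share
        for order in order_parts[worker]:
            customer = lookup.get(order["customer_id"])
            name = customer["name"] if customer else None
            joined.append({**order, "name": name})
    return joined, [len(part) for part in customer_parts]


CUSTOMERS = [{"customer_id": i, "name": f"user{i}"} for i in range(4)]
ORDERS = [{"order_id": i, "customer_id": i % 4, "amount": 10 * i} for i in range(8)]

JOINED, ROWS_HELD = partitioned_join(ORDERS, CUSTOMERS, num_workers=2)
```

Each worker here holds half of the customer rows instead of all of them, at the price of moving every order row to the worker that owns its key.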

References:

1) Presto Performance Tuning Tips

2) Presto - Properties Reference