Analyzing SQLAlchemy in_() Queries with an Empty List

Date: 2020-01-18

Problem scenario

Given a model Account, the SQLAlchemy query is as follows:

query = Account.query.filter(Account.id.in_(account_ids)).order_by(Account.date_created.desc())

If account_ids is empty here, running the query emits the following warning:

/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:35: SAWarning: The IN-predicate on "account.id" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate.  Consider alternative strategies for improved performance.
  return o[0](self, self.expr, op, *(other + o[1:]), **kwargs)

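The warning says that an IN predicate over an empty sequence is a contradiction (it can never match any row), yet it can still be expensive for the database to evaluate, so the query should be restructured for better performance.

Why does this warning appear, and how can an empty list hurt performance?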

First, printing query gives the following SQL statement:

SELECT *   -- column list abbreviated as "*"
FROM account
WHERE account.id != account.id ORDER BY account.date_created DESC

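The generated filter condition is WHERE account.id != account.id. Use PostgreSQL's EXPLAIN ANALYZE command to analyze it:

EXPLAIN: shows the execution plan that the PostgreSQL planner generates for the supplied statement.
ANALYZE: as an option of EXPLAIN, actually executes the statement and reports the real run time and row count of each plan node next to the estimates. (The standalone ANALYZE command, which collects statistics about the contents of tables, is a different command.)

The query cost analysis is as follows: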

postgres=> EXPLAIN ANALYZE SELECT *
FROM account
WHERE account.id != account.id ORDER BY account.date_created DESC;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Sort  (cost=797159.14..808338.40 rows=4471702 width=29) (actual time=574.002..574.002 rows=0 loops=1)
   Sort Key: date_created DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on account  (cost=0.00..89223.16 rows=4471702 width=29) (actual time=573.991..573.991 rows=0 loops=1)
         Filter: (id <> id)
         Rows Removed by Filter: 4494173
 Planning time: 0.162 ms
 Execution time: 574.052 ms
(8 rows)

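Look at the execution plan first. Although the query returns no rows, it is still very expensive. Almost all of the execution time is spent in the sequential scan: the Seq Scan node finishes at 573.991 ms out of a total of 574.052 ms, while the Sort above it adds almost nothing because it receives zero rows. Planning time (0.162 ms) is negligible compared with execution time. In other words, the database reads the whole table, evaluates Filter: (id <> id) on every row, and then discards all 4494173 of them.

With that in mind, there are two ways to handle the query: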

1. If account_ids is empty, return an empty list directly and skip the query altogether:

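accounts = []  # empty input: return an empty list without touching the database
if account_ids:
    query = Account.query.filter(Account.id.in_(account_ids)).order_by(Account.date_created.desc())
    accounts = query.all()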

2. If account_ids is empty, replace the IN filter with a constant false condition:

query = Account.query
if account_ids:
    query = query.filter(Account.id.in_(account_ids))
else:
    query = query.filter(False)
    
query = query.order_by(Account.date_created.desc())

When account_ids is empty, the generated SQL statement is:

SELECT *
FROM account
WHERE 0 = 1 ORDER BY account.date_created DESC

The analysis result is:

postgres=> EXPLAIN ANALYZE SELECT *
FROM account
WHERE 0 = 1 ORDER BY account.date_created DESC;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Sort  (cost=77987.74..77987.75 rows=1 width=29) (actual time=0.011..0.011 rows=0 loops=1)
   Sort Key: date_created DESC
   Sort Method: quicksort  Memory: 25kB
   ->  Result  (cost=0.00..77987.73 rows=1 width=29) (actual time=0.001..0.001 rows=0 loops=1)
         One-Time Filter: false
         ->  Seq Scan on account  (cost=0.00..77987.73 rows=1 width=29) (never executed)
 Planning time: 0.197 ms
 Execution time: 0.061 ms
(8 rows)

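Both the estimated cost and the execution time drop sharply: the top-level cost falls from 808338.40 to 77987.75, and the execution time from 574.052 ms to 0.061 ms. The One-Time Filter: false lets PostgreSQL skip the Seq Scan entirely (never executed).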
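The same idea can be sketched without SQLAlchemy. The example below is only an illustration using Python's built-in sqlite3 module; the helper names in_condition and fetch_account_ids and the three sample rows are hypothetical and do not come from the original code. An empty list is turned into the constant condition 0 = 1, mirroring query.filter(False) in approach 2.

```python
import sqlite3


def in_condition(column, values):
    """Return (sql_fragment, params) for `column IN values`.

    An empty sequence becomes the constant-false condition `0 = 1`.
    `column` is interpolated into the SQL text, so it must be a trusted
    identifier, never user input.
    """
    values = list(values)
    if not values:
        return "0 = 1", []
    placeholders = ", ".join("?" for _ in values)
    return f"{column} IN ({placeholders})", values


def fetch_account_ids(conn, account_ids):
    """Return matching account ids, newest first."""
    condition, params = in_condition("id", account_ids)
    sql = f"SELECT id FROM account WHERE {condition} ORDER BY date_created DESC"
    return [row[0] for row in conn.execute(sql, params)]


# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, date_created TEXT)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1, "2020-01-01"), (2, "2020-01-02"), (3, "2020-01-03")],
)

print(fetch_account_ids(conn, [1, 3]))
print(fetch_account_ids(conn, []))
```

Keeping the condition builder separate from the query means every caller gets the empty-list handling for free, instead of repeating the if/else at each call site.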

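A test

Now drop only the ORDER BY from the original query (the one with WHERE account.id != account.id) and look at the analysis again.

EXPLAIN ANALYZE gives the following result: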

postgres=> EXPLAIN ANALYZE SELECT *
FROM account
WHERE account.id != account.id;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Seq Scan on account  (cost=0.00..89223.16 rows=4471702 width=29) (actual time=550.999..550.999 rows=0 loops=1)
   Filter: (id <> id)
   Rows Removed by Filter: 4494173
 Planning time: 0.134 ms
 Execution time: 551.041 ms

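The time is about the same as with the sort (551.041 ms versus 574.052 ms), which confirms that the cost comes from the sequential scan and not from the ORDER BY.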
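As a rough check of where the time goes, the relevant lines of the two plans above can be parsed with a few lines of Python. This is a minimal sketch: the regular expressions only cover the output format shown in this article, and seq_scan_share is a hypothetical helper name.

```python
import re

# Lines copied from the two EXPLAIN ANALYZE outputs above.
PLAN_WITH_SORT = """\
 Sort  (cost=797159.14..808338.40 rows=4471702 width=29) (actual time=574.002..574.002 rows=0 loops=1)
   ->  Seq Scan on account  (cost=0.00..89223.16 rows=4471702 width=29) (actual time=573.991..573.991 rows=0 loops=1)
 Execution time: 574.052 ms
"""
PLAN_WITHOUT_SORT = """\
 Seq Scan on account  (cost=0.00..89223.16 rows=4471702 width=29) (actual time=550.999..550.999 rows=0 loops=1)
 Execution time: 551.041 ms
"""

# Captures the time (ms) at which the Seq Scan node returned its last row.
SCAN_RE = re.compile(r"Seq Scan on \w+ .*actual time=\d+\.\d+\.\.(\d+\.\d+)")
TOTAL_RE = re.compile(r"Execution time: (\d+\.\d+) ms")


def seq_scan_share(plan_text):
    """Return the fraction of execution time elapsed when the Seq Scan finished."""
    scan_end_ms = float(SCAN_RE.search(plan_text).group(1))
    total_ms = float(TOTAL_RE.search(plan_text).group(1))
    return scan_end_ms / total_ms


for name, plan in [("with sort", PLAN_WITH_SORT), ("without sort", PLAN_WITHOUT_SORT)]:
    print(f"{name}: {seq_scan_share(plan):.4%} of the time is spent before the scan finishes")
```

In both plans the scan accounts for more than 99.9% of the execution time, with or without the sort.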

How the query cost is calculated

Run an EXPLAIN, with the following result:

postgres=> explain select * from account where date_created ='2016-04-07 18:51:30.371495+08';
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Seq Scan on account  (cost=0.00..127716.33 rows=1 width=211)
   Filter: (date_created = '2016-04-07 18:51:30.371495+08'::timestamp with time zone)
(2 rows)

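The numbers reported by EXPLAIN are:

0.00: the estimated startup cost (the cost spent before the node can start producing output, for example the time to do the sorting in a sort node).
127716.33: the estimated total cost.
1: the estimated number of rows output by this plan node.
211: the estimated average width of the rows output by this plan node (in bytes).

The unit of cost is the number of disk page fetches; for example, 1.0 represents one sequential disk page read. The cost of an upper-level node includes the cost of all its child nodes. The output row count (rows) is not the number of rows the plan node processes or scans; it is usually smaller, because it counts only the rows the node emits after filtering. In general, the estimated row count of the top-level node is the one closest to the number of rows the query actually returns.
Here, assuming a single CPU core, the estimated cost is 127716.33.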
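To make the four fields concrete, the parenthesised estimate can be pulled apart with a small parser. This is a minimal sketch for the text format shown above; PlanEstimate and parse_estimate are hypothetical names, not part of PostgreSQL or any library.

```python
import re
from typing import NamedTuple


class PlanEstimate(NamedTuple):
    startup_cost: float  # cost before the first row can be returned
    total_cost: float    # cost to return all rows
    rows: int            # estimated number of rows the node outputs
    width: int           # estimated average row width in bytes


ESTIMATE_RE = re.compile(r"\(cost=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) width=(\d+)\)")


def parse_estimate(plan_line):
    """Extract the planner estimate from one line of EXPLAIN text output."""
    match = ESTIMATE_RE.search(plan_line)
    if match is None:
        raise ValueError(f"no cost estimate found in: {plan_line!r}")
    startup, total, rows, width = match.groups()
    return PlanEstimate(float(startup), float(total), int(rows), int(width))


estimate = parse_estimate(" Seq Scan on account  (cost=0.00..127716.33 rows=1 width=211)")
print(estimate)
```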

To calculate the cost, PostgreSQL first looks at the size of the table in bytes

The size of the account table is:

postgres=> select pg_relation_size('account');

pg_relation_size
------------------
        737673216
(1 row)

Check the block size

PostgreSQL adds cost points for every block it has to read. Use show block_size to check the block size:

postgres=> show block_size;

block_size
------------
 8192
(1 row)

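Count the blocks

Each block is 8 kB, so the number of blocks that must be read sequentially from the table is:

blocks = pg_relation_size/block_size = 90048

90048 is the number of blocks occupied by the account table.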

Check the cost of each block

postgres=> show seq_page_cost;
 seq_page_cost
---------------
 1
(1 row)

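This means PostgreSQL assigns one cost point to each block, so the query above needs 90048 cost points just to read the pages.

CPU cost of processing each row

cpu_tuple_cost: the CPU cost of processing each row (a tuple is one row of a relation).
cpu_operator_cost: the CPU cost of each operator or function call.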

postgres=> show cpu_operator_cost;
 cpu_operator_cost
-------------------
 0.0025
(1 row)

postgres=> show cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)

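Calculation

The cost formula is:

cost = number of disk blocks * seq_page_cost (1) + number of rows * cpu_tuple_cost (system parameter) + number of rows * cpu_operator_cost

Now plug in all the values to reproduce the number from the EXPLAIN output: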

number_of_records = 3013466  # row count of the account table

block_size = 8192  # block size in bytes

pg_relation_size = 737673216  # table size in bytes

blocks = pg_relation_size // block_size  # 90048

seq_page_cost = 1
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025

cost = blocks * seq_page_cost + number_of_records * cpu_tuple_cost + number_of_records * cpu_operator_cost
# cost = 90048 + 30134.66 + 7533.665 = 127716.325, which EXPLAIN reports as 127716.33
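The formula can also be wrapped in a small function that separates the I/O part from the CPU part. This is only a sketch of the simplified formula used in this article (one filter operator per row); seq_scan_cost is a hypothetical helper name, and its default values are the settings shown earlier.

```python
def seq_scan_cost(relation_bytes, row_count, block_size=8192,
                  seq_page_cost=1.0, cpu_tuple_cost=0.01, cpu_operator_cost=0.0025):
    """Return (io_cost, cpu_cost) of a sequential scan with one filter operator per row."""
    blocks = relation_bytes // block_size
    io_cost = blocks * seq_page_cost
    cpu_cost = row_count * (cpu_tuple_cost + cpu_operator_cost)
    return io_cost, cpu_cost


# Values taken from the account table above.
io_cost, cpu_cost = seq_scan_cost(relation_bytes=737673216, row_count=3013466)
total_cost = io_cost + cpu_cost

print(f"I/O cost:   {io_cost:.2f}")
print(f"CPU cost:   {cpu_cost:.2f}")
print(f"I/O share:  {io_cost / total_cost:.1%}")
```

The sum of the two parts agrees with the 127716.33 reported by EXPLAIN to within rounding, and page reads make up roughly 70% of the estimate, so the size of the table on disk is the dominant term.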

How to reduce the query cost?

The short answer: use an index.

postgres=> explain select * from account where id=20039;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Index Scan using account_pkey on account  (cost=0.43..8.45 rows=1 width=211)
   Index Cond: (id = 20039)
(2 rows)

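This query shows that filtering on an indexed column reduces the query cost dramatically (8.45 here, versus 127716.33 for the sequential scan above).

Calculating the cost of an index scan is somewhat more complex than for a sequential scan. It consists of two phases.
PostgreSQL takes the random_page_cost and cpu_index_tuple_cost settings into account and returns a value based on the height of the index tree.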
