Querying the Top 2 Rows of Each Group After Grouping

Date: 2019-12-25

Tags: grouping, query, per-group


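Recently at work I ran into a requirement: group the rows of a table, then take the first N rows of each group as ordered by some rule. At first glance this looked like a routine query with nothing difficult about it. Yet when I sat down to write it, I was stuck for quite a while. After much thought and a good deal of searching, I found to my surprise that this is a classic topic in the SQL world. Here I list the methods I gathered, in the hope of starting a discussion.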

Before the main text, a note on the structure of the example table.

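Table SectionTransactionLog, a log table recording the activities of each department:
    SectionId: department Id
    SectionTransactionType: activity type
    TotalTransactionValue: activity cost
    TransactionDate: activity time
The queries below also reference an Id column that identifies each activity.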

Our scenario: select the two most recent activities of each department (SectionId).

The SectionTransactionLog table used for testing holds more than 3,000,000 rows.

1. Nested subquery approach


SELECT * FROM SectionTransactionLog mLog
where
(select COUNT(*) from SectionTransactionLog subLog
 where subLog.SectionId = mLog.SectionId and subLog.TransactionDate >= mLog.TransactionDate) <= 2
order by SectionId, TransactionDate desc

Run time: 34 seconds

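The principle is simple: for each row, the subquery counts the rows in the same Section whose TransactionDate is the same or later, and the row is kept only if that count is at most 2, that is, if it is one of the 2 most recent in its Section.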
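As a small check of the counting idea, here is a minimal sketch that runs the same correlated subquery on a tiny in-memory table. It assumes Python's standard sqlite3 module rather than SQL Server, and the six sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical sample data: (Id, SectionId, TransactionDate). Values are invented.
rows = [
    (1, 9022, "2019-01-01"),
    (2, 9022, "2019-02-01"),
    (3, 9022, "2019-03-01"),
    (4, 9023, "2019-01-15"),
    (5, 9023, "2019-02-15"),
    (6, 9023, "2019-03-15"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SectionTransactionLog ("
    "Id INTEGER PRIMARY KEY, SectionId INTEGER, TransactionDate TEXT)"
)
conn.executemany("INSERT INTO SectionTransactionLog VALUES (?, ?, ?)", rows)

# Keep a row only if at most 2 rows in its section are as recent or more recent.
query = """
SELECT mLog.Id, mLog.SectionId, mLog.TransactionDate
FROM SectionTransactionLog mLog
WHERE (SELECT COUNT(*)
       FROM SectionTransactionLog subLog
       WHERE subLog.SectionId = mLog.SectionId
         AND subLog.TransactionDate >= mLog.TransactionDate) <= 2
ORDER BY mLog.SectionId, mLog.TransactionDate DESC
"""
top2_by_count = conn.execute(query).fetchall()
for row in top2_by_count:
    print(row)
```

One property worth keeping in mind: when two rows in a section share the same TransactionDate, a count-based filter like this can return fewer than two rows for that section, because both tied rows may be counted past the limit.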

SELECT * FROM SectionTransactionLog mLog
where mLog.Id in
(select top 2 Id
 from SectionTransactionLog subLog
 where subLog.SectionId = mLog.SectionId
 order by TransactionDate desc)
order by SectionId, TransactionDate desc

Run time: 1 minute 25 seconds

Here the subquery orders the rows of the Section by TransactionDate and takes the top 2, and the in keyword tests whether the current row is among them.

2. Self-join approach

select mLog.* from SectionTransactionLog mLog
inner join
(SELECT rankLeft.Id, COUNT(*) as rankNum FROM SectionTransactionLog rankLeft
 inner join SectionTransactionLog rankRight
 on rankLeft.SectionId = rankRight.SectionId and rankLeft.TransactionDate <= rankRight.TransactionDate
 group by rankLeft.Id
 having COUNT(*) <= 2) subLog on mLog.Id = subLog.Id
order by mLog.SectionId, mLog.TransactionDate desc

Run time: 56 seconds

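This implementation is rather clever, but also somewhat more complex than the previous ones. The subLog part, built on a self-join of the SectionTransactionLog table, computes for each activity (identified by Id) its rank rankNum within its Section, ordered by TransactionDate.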

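Under the self-join condition rankLeft.SectionId = rankRight.SectionId and rankLeft.TransactionDate <= rankRight.TransactionDate, each activity (identified by Id) is joined only to the activities in the same Section that happened later than or at the same time as it (including, of course, itself). The figure below illustrates the self-join for the activity with Id=1: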

As the figure above makes clear, the count over this result is exactly the rank rankNum of the activity with Id=1 within Section 9022.

Then having COUNT(*) <= 2 keeps the activities ranked within the top 2, and one more join selects the required columns.
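To make the intermediate rankNum values visible, the sketch below runs only the inner self-join on a tiny in-memory table and then applies the <= 2 filter in Python. It assumes Python's standard sqlite3 module rather than SQL Server; the sample rows, including the dates given to Section 9022 and Id=1, are invented and are not the data from the original figure.

```python
import sqlite3

# Hypothetical sample data: (Id, SectionId, TransactionDate). Values are invented.
rows = [
    (1, 9022, "2019-01-01"),
    (2, 9022, "2019-02-01"),
    (3, 9022, "2019-03-01"),
    (4, 9023, "2019-01-15"),
    (5, 9023, "2019-02-15"),
    (6, 9023, "2019-03-15"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SectionTransactionLog ("
    "Id INTEGER PRIMARY KEY, SectionId INTEGER, TransactionDate TEXT)"
)
conn.executemany("INSERT INTO SectionTransactionLog VALUES (?, ?, ?)", rows)

# Each row is joined to every row of the same section that is as recent or
# more recent (itself included), so the join count per Id is its rank.
rank_query = """
SELECT rankLeft.Id, COUNT(*) AS rankNum
FROM SectionTransactionLog rankLeft
INNER JOIN SectionTransactionLog rankRight
   ON rankLeft.SectionId = rankRight.SectionId
  AND rankLeft.TransactionDate <= rankRight.TransactionDate
GROUP BY rankLeft.Id
ORDER BY rankLeft.Id
"""
ranks = dict(conn.execute(rank_query).fetchall())
print(ranks)

# Equivalent of "having COUNT(*) <= 2": keep the Ids ranked first or second.
top2_ids = sorted(row_id for row_id, rank in ranks.items() if rank <= 2)
print(top2_ids)
```

In this invented data set the oldest row of each section is joined to all three rows of its section, so it gets rank 3 and is filtered out.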

3. Using ROW_NUMBER() (SQL Server 2005 and later)

select * from
(
select *, ROW_NUMBER() over(partition by SectionId order by TransactionDate desc) as rowNum
from SectionTransactionLog
) ranked
where ranked.rowNum <= 2
order by ranked.SectionId, ranked.TransactionDate desc

Run time: 20 seconds

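This is the most efficient implementation so far. ROW_NUMBER() over(partition by SectionId order by TransactionDate desc) handles the whole process of grouping, ordering, and assigning row numbers in a single step.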
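The row numbers assigned by the window function can be inspected with a minimal sketch like the one below. It assumes Python's standard sqlite3 module backed by SQLite 3.25 or later (the first SQLite release with window functions) instead of SQL Server, and the sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical sample data: (Id, SectionId, TransactionDate). Values are invented.
rows = [
    (1, 9022, "2019-01-01"),
    (2, 9022, "2019-02-01"),
    (3, 9022, "2019-03-01"),
    (4, 9023, "2019-01-15"),
    (5, 9023, "2019-02-15"),
    (6, 9023, "2019-03-15"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SectionTransactionLog ("
    "Id INTEGER PRIMARY KEY, SectionId INTEGER, TransactionDate TEXT)"
)
conn.executemany("INSERT INTO SectionTransactionLog VALUES (?, ?, ?)", rows)

# Number the rows of each section from newest to oldest, then keep numbers 1 and 2.
query = """
SELECT Id, SectionId, TransactionDate, rowNum
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SectionId
                              ORDER BY TransactionDate DESC) AS rowNum
    FROM SectionTransactionLog
) ranked
WHERE ranked.rowNum <= 2
ORDER BY ranked.SectionId, ranked.TransactionDate DESC
"""
top2_by_row_number = conn.execute(query).fetchall()
for row in top2_by_row_number:
    print(row)
```

Because ROW_NUMBER() gives every row of a partition a distinct number, this form returns exactly two rows for any section with at least two rows, even when dates are tied; which of the tied rows is kept is then not determined unless a tie-breaking column is added to the order by.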

Efficiency considerations

The table below summarizes the run times of the 4 queries above.

Method	Time (seconds)	Rank
ROW_NUMBER()	20	1
Nested subquery 1	34	2
Self-join	56	3
Nested subquery 2	85	4

Of the 4 queries, nested subquery 2 takes the longest. Where does its time go? Is the in keyword really to blame? The figure below shows its execution plan:

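The plan shows that the optimizer resolved in into a Left Semi Join, whose cost is very low. Most of the query's cost is instead spent on the order by inside the subquery (Top N Sort). Sure enough, removing the order by TransactionDate desc clause from the subquery (which of course makes the result incorrect) brings the run time down to just 8 seconds.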

Adding a suitable index can improve the performance of this query.
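The post does not say which index it has in mind. As one plausible candidate, the sketch below assumes a composite index on (SectionId, TransactionDate DESC), with the hypothetical name IX_Section_Date, and compares the query plan of the per-section "top 2" lookup before and after creating it. It uses Python's standard sqlite3 module, with LIMIT 2 standing in for top 2, so it only illustrates the idea; SQL Server's optimizer and plan output are different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SectionTransactionLog ("
    "Id INTEGER PRIMARY KEY, SectionId INTEGER, TransactionDate TEXT)"
)

# The per-section lookup performed by the subquery of nested subquery 2,
# written for SQLite (LIMIT 2 instead of top 2).
inner_query = (
    "EXPLAIN QUERY PLAN "
    "SELECT Id FROM SectionTransactionLog "
    "WHERE SectionId = ? ORDER BY TransactionDate DESC LIMIT 2"
)


def plan_text(connection):
    # The last column of each EXPLAIN QUERY PLAN row is a readable detail string.
    return " | ".join(row[-1] for row in connection.execute(inner_query, (9022,)))


plan_before = plan_text(conn)

# Assumed index: equality column first, then the sort column in query order.
conn.execute(
    "CREATE INDEX IX_Section_Date "
    "ON SectionTransactionLog (SectionId, TransactionDate DESC)"
)
plan_after = plan_text(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

The design idea is that an index leading with the equality column and followed by the sort column lets the engine read each section's rows already in date order, which targets the sort that the execution plan identified as the main cost.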

The most efficient of these, the ROW_NUMBER() approach, can be verified and applied as follows:
select * from
(
select *, ROW_NUMBER() over(partition by product_id order by fee desc) as rowNum
from (select [product_id], [account], sum([debit_share]) fee
      from [products].[dbo].[T_COUNTER_PRODUCT_HOLDER]
      where [debit_share] > 0
      group by [product_id], [account]) t
) ranked
where ranked.rowNum <= 2
order by ranked.product_id, ranked.fee desc
