Efficiently Removing Duplicate Data from Greenplum Tables

1. There are basically three common ways to remove duplicate rows from a PostgreSQL table; these were found online and are listed in Appendix 1. However, none of them works for Greenplum.

 

2. In Greenplum a table's rows are distributed across different segments. Within each segment the ctid is unique, but the same ctid can occur on different segments, so deduplication in Greenplum has to combine ctid with gp_segment_id.
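
Looking at both system columns side by side shows why the pair is needed (a minimal sketch; it assumes the test table and de-duplication column x used in the DELETE statement further below):

-- The same ctid can appear on several segments, but the pair
-- (gp_segment_id, ctid) identifies a row uniquely across the cluster.
SELECT gp_segment_id, ctid, x
FROM test
ORDER BY x, gp_segment_id, ctid;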

 

3. A relatively cumbersome method found online is given in Appendix 2.

 

4. The final method is:

delete from test
where (gp_segment_id, ctid) not in (
    select gp_segment_id, min(ctid)
    from test
    group by x, gp_segment_id
);

 

Verified to work.
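
A minimal end-to-end check of the statement above (a sketch only: the table definition and sample data are invented for illustration, and the table is distributed by the de-duplication column x so that all duplicates of a value land on the same segment):

-- Hypothetical test table; x is the column the data should be unique on.
CREATE TABLE test (x int, payload text) DISTRIBUTED BY (x);
INSERT INTO test VALUES (1, 'a'), (1, 'a'), (2, 'b'), (2, 'b'), (3, 'c');

-- Keep the row with the lowest ctid per (x, segment); because the table is
-- distributed by x, this leaves exactly one row per distinct x.
DELETE FROM test
WHERE (gp_segment_id, ctid) NOT IN (
    SELECT gp_segment_id, min(ctid)
    FROM test
    GROUP BY x, gp_segment_id
);

-- Expect three rows, one per distinct x.
SELECT x, count(*) FROM test GROUP BY x ORDER BY x;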

 

Appendix 1: Three methods for removing duplicates from a PostgreSQL table

Quoted from: http://my.oschina.net/swuly302/blog/144933

 

Using the example from the PostgreSQL 9.2 official documentation:

CREATE TABLE weather (
city      varchar(80),
temp_lo   int,          -- low temperature
temp_hi   int,          -- high temperature
prcp      real,         -- precipitation
date      date
);

INSERT INTO weather VALUES
('San Francisco', 46, 50, 0.25, '1994-11-27'),
('San Francisco', 43, 57, 0, '1994-11-29'),
('Hayward', 37, 54, NULL, '1994-11-29'),
('Hayward', 37, 54, NULL, '1994-11-29');   --- duplicated row

 

There are three methods here:

Method 1: Replace the table

-- Copy the de-duplicated rows into a new table weather_temp
SELECT DISTINCT city, temp_lo, temp_hi, prcp, date
INTO weather_temp
FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;
Or:

-- Create a table weather_temp with the same structure as weather
CREATE TABLE weather_temp (LIKE weather INCLUDING CONSTRAINTS);
-- Fill weather_temp with the de-duplicated rows
INSERT INTO weather_temp SELECT DISTINCT * FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;
Straightforward and easy to understand, but it involves destructive operations such as DROP, and when the table is large it costs both time and disk space. Not recommended.

Method 2: Add a column
-- Add a new column of type serial
ALTER TABLE weather ADD COLUMN id SERIAL;
-- Delete the duplicate rows
DELETE FROM weather WHERE id
NOT IN (
SELECT max(id)
FROM weather
GROUP BY city, temp_lo, temp_hi, prcp, date
);
-- Drop the added column
ALTER TABLE weather DROP COLUMN id;
This requires adding a column. "For now I don't know how Postgres handles adding a column: does it append to the original table in place, or does it copy the original table into a new one?" If it appends in place, the new column may push rows across page boundaries (blocks are typically 8 KB); if it copies the whole table, that is even worse. Not good.

Method 3: System column [see System Columns]

DELETE FROM weather 
WHERE ctid 
NOT IN (
SELECT max(ctid) 
FROM weather 
GROUP BY city, temp_lo, temp_hi, prcp, date
);
Very targeted [specific to Postgres], but simple.

 

 

 

---------------- But for a Greenplum table, the data is split across the segments, so ctid alone cannot be used for deduplication.

 

Appendix 2:

https://discuss.pivotal.io/hc/zh-cn/community/posts/206428018-What-is-the-most-efficient-way-of-deleting-duplicate-records-from-a-table-

What is the most efficient way of deleting duplicate records from a table?

Currently we use Primary Keys to avoid loading duplicate data into our tables, but PK brings many restrictions. Since we can’t easily identify or prevent duplicates arriving from the variety of 3rd party upstream systems, we wanted to investigate the ‘load everything, remove duplicates afterwards’ approach.

In Postgres, you can use an efficient method such as:

DELETE FROM test
WHERE ctid NOT IN (
SELECT min(ctid)
FROM test
GROUP BY x); 
(where 'x' is the unique column list)

 

However in Greenplum ‘ctid’ is only unique per segment.

One approach would be:

DELETE FROM test USING 
(select gp_segment_id, ctid from 
(select gp_segment_id, ctid, rank() over (partition by x order by gp_segment_id, ctid) as rk from test ) foo 
WHERE rk <> 1) rows_to_delete 
WHERE test.gp_segment_id=rows_to_delete.gp_segment_id 
AND test.ctid=rows_to_delete.ctid;

 

But the use of window functions, subqueries etc. feels pretty inefficient.

Is there a better form?

Note that in our use case our unique column list varies up to ~10 columns so we don’t have a single unique key field – hence the RANK in the example. I suppose adding a sequence column could be used, but how much overhead does this add when doing bulk data loading?
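
For reference, the sequence-column idea mentioned in the question could look roughly like the sketch below. This is not from the original thread; it assumes the Greenplum version in use allows adding a BIGSERIAL column to an existing table, and x stands in for the full unique column list.

-- Add a surrogate id, keep the lowest id per unique column list, then drop it.
ALTER TABLE test ADD COLUMN row_id BIGSERIAL;

DELETE FROM test
WHERE row_id NOT IN (
    SELECT min(row_id)
    FROM test
    GROUP BY x          -- replace x with the full unique column list
);

ALTER TABLE test DROP COLUMN row_id;

Whether this beats the window-function approach depends on how costly it is to populate the extra column, which is exactly the overhead concern raised above.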
