1. There are basically three ways to deduplicate rows in a PostgreSQL table; the methods I found online are given in Appendix 1. However, none of them works for Greenplum.
2. The table data is distributed across different segments. Within a single segment ctid is unique, but the same ctid can recur on different segments, so deduplication in Greenplum has to rely on gp_segment_id as well.
3. I found a relatively convoluted approach online, given in Appendix 2.
4. The final method is:
delete from test where (gp_segment_id, ctid) not in (select gp_segment_id, min(ctid) from test group by x, gp_segment_id);
Verified.
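For reference, a minimal sanity check after the DELETE (a sketch, assuming x is the dedup key of table test and that duplicate values of x land on the same segment, e.g. because x belongs to the distribution key): no value of x should appear more than once.

-- a minimal sketch, assuming x is the dedup key of table test;
-- after the DELETE above this query should return zero rows
SELECT x, count(*) AS cnt
FROM test
GROUP BY x
HAVING count(*) > 1;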
Appendix 1: three ways to deduplicate a PostgreSQL table
Quoted from: http://my.oschina.net/swuly302/blog/144933
Using the example from the PostgreSQL 9.2 official documentation:
CREATE TABLE weather (
    city     varchar(80),
    temp_lo  int,    -- low temperature
    temp_hi  int,    -- high temperature
    prcp     real,   -- precipitation
    date     date
);

INSERT INTO weather VALUES
    ('San Francisco', 46, 50, 0.25, '1994-11-27'),
    ('San Francisco', 43, 57, 0, '1994-11-29'),
    ('Hayward', 37, 54, NULL, '1994-11-29'),
    ('Hayward', 37, 54, NULL, '1994-11-29');  --- duplicated row
There are three methods:
Method 1: replace the table

-- Copy the de-duplicated rows into a new table weather_temp
SELECT DISTINCT city, temp_lo, temp_hi, prcp, date
INTO weather_temp
FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;

or

-- Create a table weather_temp with the same structure as weather
CREATE TABLE weather_temp (LIKE weather INCLUDING CONSTRAINTS);
-- Fill weather_temp with the de-duplicated rows
INSERT INTO weather_temp SELECT DISTINCT * FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;

Easy to understand, but it involves destructive operations such as DROP, and it costs time and space when the data volume is large. Not recommended.

Method 2: add a column

-- Add a new column of type serial
ALTER TABLE weather ADD COLUMN id SERIAL;
-- Delete the duplicate rows
DELETE FROM weather
WHERE id NOT IN (
    SELECT max(id)
    FROM weather
    GROUP BY city, temp_lo, temp_hi, prcp, date
);
-- Drop the added column
ALTER TABLE weather DROP COLUMN id;

This requires adding a column. "I don't know yet how Postgres handles adding a column: does it append to the original table in place, or does it copy the original table into a new one?" If it appends in place, the new column may push rows onto new pages (typical block size: 8 KB); if it copies the table, that is even worse. Not great.

Method 3: system columns [see System Columns]

DELETE FROM weather
WHERE ctid NOT IN (
    SELECT max(ctid)
    FROM weather
    GROUP BY city, temp_lo, temp_hi, prcp, date
);

Postgres-specific, but simple.
---------------- But for a Greenplum table the data is split across segments, so ctid alone cannot be used to deduplicate.
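To make that concrete, a small sketch (assuming a Greenplum table named test): if the same ctid value appears on two different segments, it refers to two physically distinct rows, so only the pair (gp_segment_id, ctid) identifies a row.

-- a minimal sketch, assuming a Greenplum table named test:
-- list pairs of rows on different segments that share the same ctid value
SELECT t1.gp_segment_id AS seg_a,
       t2.gp_segment_id AS seg_b,
       t1.ctid
FROM test t1
JOIN test t2
  ON  t1.ctid = t2.ctid
  AND t1.gp_segment_id < t2.gp_segment_id;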
Appendix 2:
https://discuss.pivotal.io/hc/zh-cn/community/posts/206428018-What-is-the-most-efficient-way-of-deleting-duplicate-records-from-a-table-
Currently we use Primary Keys to avoid loading duplicate data into our tables, but PK brings many restrictions. Since we can’t easily identify or prevent duplicates arriving from the variety of 3rd party upstream systems, we wanted to investigate the ‘load everything, remove duplicates afterwards’ approach.
In Postgres, you can use an efficient method such as:
DELETE FROM test WHERE ctid NOT IN ( SELECT min(ctid) FROM test GROUP BY x); (where 'x' is the unique column list)
However in Greenplum ‘ctid’ is only unique per segment.
One approach would be:
DELETE FROM test
USING (
    select gp_segment_id, ctid
    from (
        select gp_segment_id, ctid,
               rank() over (partition by x order by gp_segment_id, ctid) as rk
        from test
    ) foo
    WHERE rk <> 1
) rows_to_delete
WHERE test.gp_segment_id = rows_to_delete.gp_segment_id
  AND test.ctid = rows_to_delete.ctid;
But the use of window functions, subqueries etc. feels pretty inefficient.
Is there a better form?
Note that in our use case our unique column list varies up to ~10 columns so we don’t have a single unique key field – hence the RANK in the example. I suppose adding a sequence column could be used, but how much overhead does this add when doing bulk data loading?