Postgresql Tip: Getting rid of redundant rows using DELETE USING

Steven NgDaily Debug BlogLeave a Comment

The Problem

I had a Pentaho ETL job that was supposed to be doing an upsert type of operation using an Insert/Update node. I discovered that it was doing inserts only, because one of my comparison fields happened to contain NULL values, which broke the task. In case you were wondering, I was using the “=” comparator, which apparently doesn’t handle null value comparisons. It turned out that I needed the “= ~NULL” comparator instead. That’s my punishment for not reading the docs, I guess (grin).

Because the data was inserting instead of updating, I needed to cleanse all the redundant data from my fact table. My data was coming from Google Analytics. The thing with Google Analytics is that you always want the latest record for a given date to be considered authoritative. All of my table’s rows have a serial identifier acting a surrogate key, even though I don’t actually use that key for any actual joins at this time. At the moment, I only use the key for troubleshooting the order of the inserts. Because I don’t have any joins dependent on that column, I can delete duplicates without any concerns of referential integrity.

The Solution

To delete all of the duplicate records having a serial ID lower than the most recent serial ID, I used Postgresql’s DELETE command with the USING parameter. This is obviously not rocket science, but if you’re not constantly writing DELETE statements in Postgresql, it might save you from having to muddle through some documentation.

My DELETE USING statement looked something like this:

DELETE FROM
fact.ga_page_visits a
USING fact.ga_page_visits b -- comparing against the same table
WHERE
a.id < b.id -- delete records with the smaller primary key values (i.e., the older redundant records)
-- below are all the conditions to match redundant records to be deleted
AND a.date_id = b.date_id 
AND a.user_type = b.user_type
AND a.page_path = b.page_path;

By using the same table in my USING parameter,  a self join is created that allows its own columns to be specified in the WHERE condition.

The first and most important condition, a.id < b.id – this deletes the oldest inserts. If you want to delete the newest inserts, flip the operator from < to > (this is assuming you have a serial numeric id like I do).

All the conditions below my ID check are the conditions that determine redundancy. The number of conditions you have will obviously vary from the number I have, but this gives you an idea of how to do it.

Other articles by this author:

Please follow and like us:
error

Leave a Reply

Your email address will not be published. Required fields are marked *