I had a Pentaho ETL job that was supposed to be doing an upsert type of operation using an Insert/Update node. I discovered that it was doing inserts only, because one of my comparison fields happened to contain NULL values, which broke the task. In case you were wondering, I was using the “=” comparator, which apparently doesn’t handle null value comparisons. It turned out that I needed the “= ~NULL” comparator instead. That’s my punishment for not reading the docs, I guess (grin).
Because the data was inserting instead of updating, I needed to cleanse all the redundant data from my fact table. My data was coming from Google Analytics. The thing with Google Analytics is that you always want the latest record for a given date to be considered authoritative. All of my table’s rows have a serial identifier acting a surrogate key, even though I don’t actually use that key for any actual joins at this time. At the moment, I only use the key for troubleshooting the order of the inserts. Because I don’t have any joins dependent on that column, I can delete duplicates without any concerns of referential integrity.
To delete all of the duplicate records having a serial ID lower than the most recent serial ID, I used Postgresql’s DELETE command with the USING parameter. This is obviously not rocket science, but if you’re not constantly writing DELETE statements in Postgresql, it might save you from having to muddle through some documentation.
My DELETE USING statement looked something like this:
DELETE FROM fact.ga_page_visits a USING fact.ga_page_visits b -- comparing against the same table WHERE a.id < b.id -- delete records with the smaller primary key values (i.e., the older redundant records) -- below are all the conditions to match redundant records to be deleted AND a.date_id = b.date_id AND a.user_type = b.user_type AND a.page_path = b.page_path;
By using the same table in my USING parameter, a self join is created that allows its own columns to be specified in the WHERE condition.
The first and most important condition, a.id < b.id – this deletes the oldest inserts. If you want to delete the newest inserts, flip the operator from < to > (this is assuming you have a serial numeric id like I do).
All the conditions below my ID check are the conditions that determine redundancy. The number of conditions you have will obviously vary from the number I have, but this gives you an idea of how to do it.
Other articles by this author:
- Get To Know Extrata, Episode 2: Database Writeback
- Use a Time Series to Quickly Generate A Date Dimension in Postgresql
- Quick Hit: Don’t Do This In Postgresql
- Postgresql Tip: A ranked list that includes some values irrespective of their actual rank
- Generating a Non-Expiring API Token for Facebook Insights