Postgresql Tip: Getting rid of redundant rows using DELETE USING

Steven NgDaily Debug BlogLeave a Comment

The Problem

I had a Pentaho ETL job that was supposed to be doing an upsert type of operation using an Insert/Update node. I discovered that it was doing inserts only, because one of my comparison fields happened to contain NULL values, which broke the task. In case you were wondering, I was using the “=” comparator, which apparently doesn’t handle null value comparisons. It turned out that I needed the “= ~NULL” comparator instead. That’s my punishment for not reading the docs, I guess (grin).

Because the data was inserting instead of updating, I needed to cleanse all the redundant data from my fact table. My data was coming from Google Analytics. The thing with Google Analytics is that you always want the latest record for a given date to be considered authoritative. All of my table’s rows have a serial identifier acting a surrogate key, even though I don’t actually use that key for any actual joins at this time. At the moment, I only use the key for troubleshooting the order of the inserts. Because I don’t have any joins dependent on that column, I can delete duplicates without any concerns of referential integrity.

The Solution

To delete all of the duplicate records having a serial ID lower than the most recent serial ID, I used Postgresql’s DELETE command with the USING parameter. This is obviously not rocket science, but if you’re not constantly writing DELETE statements in Postgresql, it might save you from having to muddle through some documentation.

My DELETE USING statement looked something like this:

DELETE FROM
fact.ga_page_visits a
USING fact.ga_page_visits b -- comparing against the same table
WHERE
a.id < b.id -- delete records with the smaller primary key values (i.e., the older redundant records)
-- below are all the conditions to match redundant records to be deleted
AND a.date_id = b.date_id 
AND a.user_type = b.user_type
AND a.page_path = b.page_path;

By using the same table in my USING parameter,  a self join is created that allows its own columns to be specified in the WHERE condition.

The first and most important condition, a.id < b.id – this deletes the oldest inserts. If you want to delete the newest inserts, flip the operator from < to > (this is assuming you have a serial numeric id like I do).

All the conditions below my ID check are the conditions that determine redundancy. The number of conditions you have will obviously vary from the number I have, but this gives you an idea of how to do it.

Quick Hit: Don’t Do This In Postgresql

Steven NgDaily Debug BlogLeave a Comment

The Postgresql wiki has an interesting page called “Don’t Do This“.

On this page, there’s a laundry list of things (some more curious than others) of things that you shouldn’t do.

For example, Postgresql has a neat feature (on paper) called rules. But the official stance is that you should probably use a trigger instead.

Also, you shouldn’t use uppercase characters in table names. If you do, you’re going to have to wrap all entity names in quotes. This is an annoying one if you’re coming from SQL Server, which is lenient about these things. The first time I got burned by this was moving data from SQL Server to Postgresql using Pentaho. I lazily used Pentaho’s target table creation functionality, and because the source had mixed case entity names, it caused all sorts of problems.

It seems like you’ll run into a lot of these no-no’s if you’ve got muscle memory from other database engines, but if you use Postgresql in any significant manner, you should read this page.

Quick Hit: Flutter to Support Desktop Applications

Steven NgDaily Debug BlogLeave a Comment

Google’s Flutter is an application development platform based on the Dart language that was originally designed to make it easier to build cross platform mobile applications.

A post on their Github wiki today announced the beginning of support for desktop applications (Mac/Windows/Linux) as an output target.

It will be interesting to see how this impacts Electron, given how the Dart language is so similar to Javascript. The implications are more long-term than short-term, given how the desktop library ecosystem for Flutter today is in its nascent stage of development.

I’ll definitely be keeping an eye on its progress.

Quick Hit: Metabase – An Open Source BI Solution

Steven NgDaily Debug BlogLeave a Comment

If you’re looking for a free, open source BI solution, you might want to check out Metabase.

The business model for the product is similar to other “free” solutions. You can use it for free, and if you want support or more features, you can pay for it.

I haven’t had the chance to try it yet, but it looks like an interesting product. If you’re a Docker fan, you’ll love that there’s a container available.

The only thing I’m not wild about so far is that it’s AGPL licensed, but if you’re just using it internally, it shouldn’t be a deal breaker.

Getting Started With Windows Subsystem for Linux

Steven NgDaily Debug BlogLeave a Comment

The Problem

My Windows text editor was choking while prettifying a complicated XML file, and I wanted to be able to use XMLint without having to set up a Linux VM.

One of the cooler features of Windows 10 is the Windows Subsystem for Linux, or WSL for short.

WSL basically lets you have a full, real Linux installation running on Windows, and you can access it from a command prompt. It’s has its own filesystem, but you can interact with your Windows filesystem and vice versa.

The WSL is especially handy for people switching from Mac to Windows, as it gives them a worthy substitute to OSX’s Terminal app (although I would argue CMDer/ConEMU do too).

So how it works is this. You need to enable the Windows Subsystem for Linux on your computer, and then you can pick the flavor(s) of Linux to install, many of which are available for free from the Windows store.

Once you’ve installed the WSL and a Linux distro for WSL, all you need to do is fire it up from the command line. Neat.

 

Enable WSL

In Windows, go to the Control Panel, and select Programs. In the Programs and Features section, click on “Turn on Windows features on or off”.

In the resulting dialog, make sure Windows Subsystem for Linux is ticked, and click OK. You’ll probably get another dialog telling you to reboot, so save any open files and reboot.

That’s all it takes to get setup, so now you need to add one or more Linux distributions.

 

Install a Linux Distribution

Open up the Windows Store app on your computer and search for “wsl” (without the quotes). You should get a list of available distributions for WSL. The big ones are free, but some people have put up other distributions for a fee.

Just pick a distro and click  on the “Get” button that appears on the page. For this example, I’m going to assume that you picked “Ubuntu”, which is probably the most popular Linux Distribution.

Note that I did not pick an Ubuntu LTS version, but if you choose one of those, the launch command may be different. The launch command is usually in the Windows Store description for the distro.

You will be prompted for a location to install the Linux distro. The distro needs to be installed on a system drive, so C:\ is usually a good destination.

It’s that simple.

 

Launching the Linux Shell

Once it’s done, fire up your favorite terminal app. I use CMDer, but the built in command prompt is fine.

Enter this command:

ubuntu

The command prompt will change to look like

{yourusername}@{machinename}: $

From there, you’re able to the command line as though you are SSH’ed into a Linux box. You can install packages and services etc. just as though it was running in a virtual machine. If you are planning on running services, make sure that you use port numbers that don’t conflict with any services you are running on Windows.

 

Common Activities

 

To access Windows files from WSL

All of your Windows drives are accessible from the /mnt path. For example, your C drive is at /mnt/c

 

To access your Linux Files from Windows

If you’re a version of Windows older than Windows 10 Version 1903*, you will have to set up Samba or something like SFTP to access your Linux files. It’s inconvenient, but not for long.

Microsoft has received a lot of feedback and fixed this in Windows 10 Version 1903 by allowing you to use Windows Explorer to access your files natively from the WSL command prompt. To access your Linux files from 1903, you open up an Explorer instance from your WSL command prompt using:

explorer.exe .

There is a gotcha with respect to accessing Linux files from Windows, irrespective of your chosen method. Your Linux instance must be running for Windows to be able to see the files.

* At the time of posting, Windows 10 Version 1903 is due to arrive by the end of May 2019.

To upgrade your Distro

In many cases, you can update your Linux distro from the WSL command line. For the Ubuntu distro, you can use this command:

do-release-upgrade

If you’re using another distro, the command may differ.