Postgresql Tip: A ranked list that includes some values irrespective of their actual rank

Steven NgDaily Debug BlogLeave a Comment

The Problem

You want to create a ranked list (like a top 10 list) that always includes one or more values.

What do I mean by this? The best example is during the Olympic Games. You’ll often see the top 10 countries with a medal count.

Let’s say you’re reading that list from the Toronto Star, and on that particular day, Canada is not in the top 10 for medal counts. Because the Toronto Star’s readership is mostly Canadian, they always want to include Canada in that list. Which is why you’ll generally still see a line for Canada as the 11th row, showing whatever ranking value Canada has with respect to medal counts.

For my particular use case, I’m looking at geographic origins for sessions on our web site.  For my ranking list, I always want Canada and the United States to show up on top 10 lists, irrespective of their actual rankings.

The Solution

The solution is pretty straightforward. My Google Analytics geographic data is stored in a table, and I want to rank hits by country for my entire data set. Before aggregation, it looks like this:

First, I need to create a query to rank all of the sessions by country.

select
  country,
  sum(sessions) as sessions,
  rank() over (order by sum(sessions) desc)
from
  fact.ga_geographic_visitors
group by country

That narrowed my output to something like this:

On a normal list, all I need to do is filter down the rank to be <=10 (you don’t necesarily want to use limit 10 because the rank() function in Postgresql spots ties).

Because my sample data includes the US and Canada in the top 10, I’m going to have to change my rules to better illustrate my example. Let’s force China and Mexico into my top ten list, since they fall at 13 and 17, respectively.

All I need to do is make my previous query a subquery and then add a couple of filters – one to limit the rank to 10 or below, and another to ensure that Mexico and China are always included (using an or boolean operator).

select * from
  (
    select
      country,
      sum(sessions) as sessions,
      rank() over (order by sum(sessions) desc)
    from
      fact.ga_geographic_visitors
    group by country
  ) ranked_query
  where
  ranked_query.rank <= 10
  or
  ranked_query.country in ('Mexico','China')

And that yields the following result:

And there you have it! If China and Mexico happen to be in the top 10, then only the values with a rank of 10 or lower will appear in the list (the number of actual items in the list will vary depending on whether there are any ties) . If China and Mexico are below the threshold, they’ll appear no matter what.

 

Other articles by this author:

Please follow and like us:
error

Leave a Reply

Your email address will not be published. Required fields are marked *