You want to create a ranked list (like a top 10 list) that always includes one or more values.
What do I mean by this? The best example is during the Olympic Games. You’ll often see the top 10 countries with a medal count.
Let’s say you’re reading that list from the Toronto Star, and on that particular day, Canada is not in the top 10 for medal counts. Because the Toronto Star’s readership is mostly Canadian, they always want to include Canada in that list. Which is why you’ll generally still see a line for Canada as the 11th row, showing whatever ranking value Canada has with respect to medal counts.
For my particular use case, I’m looking at geographic origins for sessions on our web site. For my ranking list, I always want Canada and the United States to show up on top 10 lists, irrespective of their actual rankings.
The solution is pretty straightforward. My Google Analytics geographic data is stored in a table, and I want to rank hits by country for my entire data set. Before aggregation, it looks like this:
First, I need to create a query to rank all of the sessions by country.
select country, sum(sessions) as sessions, rank() over (order by sum(sessions) desc) from fact.ga_geographic_visitors group by country
That narrowed my output to something like this:
On a normal list, all I need to do is filter down the rank to be <=10 (you don’t necesarily want to use limit 10 because the rank() function in Postgresql spots ties).
Because my sample data includes the US and Canada in the top 10, I’m going to have to change my rules to better illustrate my example. Let’s force China and Mexico into my top ten list, since they fall at 13 and 17, respectively.
All I need to do is make my previous query a subquery and then add a couple of filters – one to limit the rank to 10 or below, and another to ensure that Mexico and China are always included (using an or boolean operator).
select * from ( select country, sum(sessions) as sessions, rank() over (order by sum(sessions) desc) from fact.ga_geographic_visitors group by country ) ranked_query where ranked_query.rank <= 10 or ranked_query.country in ('Mexico','China')
And that yields the following result:
And there you have it! If China and Mexico happen to be in the top 10, then only the values with a rank of 10 or lower will appear in the list (the number of actual items in the list will vary depending on whether there are any ties) . If China and Mexico are below the threshold, they’ll appear no matter what.
Other articles by this author:
- Postgresql Tip: Sometimes the translate function is easier than replace or regexp_replace
- Handy Subreddits for the Business Intelligence Professional
- Get To Know Extrata, Episode 2: Database Writeback
- Use a Time Series to Quickly Generate A Date Dimension in Postgresql
- Get To Know Extrata, Episode 1: The Command Line Interface