Use a Time Series to Quickly Generate A Date Dimension in Postgresql

Steven NgDaily Debug BlogLeave a Comment

generate_series

A neat feature in Postgresql is the generate_series function. generate_series is classified as a “Set Returning Function”, which in plain English means that it returns a bunch of rows. How it works is very similar to a for..next loop. You basically set up a start and stop point, and optionally add a step interval.

The syntax for generate_series is generate_series(start, stop, [step_interval])

Let’s say you wanted three rows, with values 1 to 3. To do this, you would use this:

select * from generate_series(1,3)

Which would return:

generate_series
1
2
3

Let’s try an example with a step interval. Let’s say you wanted alternating numbers between 1 and 10.

select * from generate_series(1,10,2)
generate_series
1
3
4
7
9

By using a step interval of 2, it means that from the starting number (1), it will jump every other number (in this case, the even numbers) until it reaches the end number. The start value always appears, but depending on the step interval, the end value may not.

While the last two examples are using numbers, the generate_series function also works with dates.

Let’s try generating rows for every month in 2019:

select * from generate_series( ‘2019-01-01’::timestamp, ‘2019-12-31’::timestamp, ‘1 month’::interval)

generate_series
2019-01-01 00:00:00
2019-02-01 00:00:00
2019-03-01 00:00:00
2019-04-01 00:00:00
2019-05-01 00:00:00
2019-06-01 00:00:00
2019-07-01 00:00:00
2019-08-01 00:00:00
2019-09-01 00:00:00
2019-10-01 00:00:00
2019-11-01 00:00:00
2019-12-01 00:00:00

Now that is pretty cool. But you might be asking, how can this be useful?

Well, if you’re building a data warehouse, you can quickly create a date dimension for any date range. This is similar to Vince’s post on how to generate a Date Dimension Table with Pentaho, except that it is written in native Postgresql SQL.

Creating a Date Dimension with generate_series

To create a date dimension, all you need to do is use generate_series for a given start and end date in a subquery. For example:

SELECT
  to_char(dimension_date, 'YYYYMMDD')::integer as date_id,
  dimension_date,
  to_char(dimension_date, 'YYYY')::integer as year,
  to_char(dimension_date, 'Q')::integer as quarter,
  to_char(dimension_date, 'MM')::integer as month,
  to_char(dimension_date, 'DD')::integer as day,
  to_char(dimension_date, 'Q')::integer * 10000 + to_char(dimension_date, 'MMDD')::integer as quarter_month_day,
  to_char(dimension_date, 'MMDD')::integer as month_day,
  to_char(dimension_date, 'Month') as month_name,
  to_char(dimension_date, 'Mon') as month_abbreviation,
  to_char(dimension_date, 'Day') as weekday_name,
  to_char(dimension_date, 'Dy') as weekday_abbreviation,
  to_char(dimension_date, 'W')::integer as week_in_month,
  to_char(dimension_date, 'WW')::integer as week_in_year,
  to_char(dimension_date, 'D')::integer as day_in_week,
  (dimension_date + 1) - to_date((date_part('year', dimension_date)::integer * 10000 + 101)::char(8),'YYYYMMDD') as day_in_year,
  case when to_char(dimension_date - INTERVAL '1 day','D')::integer < 6 then 1 else 0 end as is_working_day,
  to_char(dimension_date, 'YYYY')::integer - to_char(dimension_date - (day_in_dimension - 1), 'YYYY')::integer + 1 as year_in_dimension,
  (to_char(dimension_date, 'YYYY')::integer - to_char(dimension_date - (day_in_dimension - 1), 'YYYY')::integer) * 12 + to_char(dimension_date, 'MM')::integer as month_in_dimension,
  day_in_dimension,
  to_char(dimension_date, 'IYYY')::integer as iso_year,
  to_char(dimension_date, 'IW')::integer as iso_week_in_year,
  (to_char(dimension_date, 'IW')::integer - 1) * 7 + to_char(dimension_date - INTERVAL '1 day','D')::integer as iso_day_in_year,
  to_char(dimension_date - INTERVAL '1 day','D')::integer iso_day_in_week
FROM
  (
    SELECT
      date_trunc('day', generated_date):: date as dimension_date,
      (row_number() OVER ())::integer as day_in_dimension
    FROM
      generate_series( '2010-01-01'::timestamp, '2030-12-31'::timestamp, '1 day'::interval) generated_date
  ) time_series

The SQL above creates a query for a sample date dimension. Note in the subquery, I’ve also added a (row_number() OVER ())::integer column. This is simply a counter, which gives you a serial integer for every row in the subquery that I use for the “in dimension” column. I’ll explain that in my list of all the columns produced by the date dimension.

In my main query, I basically perform simple calculations on the date to derive all of my columns. Feel free to add to or change the calculations to suit your own use case.

List of Date Dimension Columns

ColumnUsage
date_idThe date expressed as an integer in YYYYMMDD format. This can be used as the primary key for the table.
dimension_dateThe date as a date type.
yearThe year
quarterThe quarter
monthThe month
dayThe day of the month
quarter_month_dayThe date expressed as an integer in QMMDD format. This is here to make multi-year quarter-to-date comparisons easier in your where clauses.
month_dayThe date expressed as in integer in MDD format. This is here to make your multi-year year-to-date or month-to-date comparisons easier in your where clauses.
month_nameThe full name of the month (e.g., “February”)
month_abbreviationThe three letter abbreviation of the month (e.g., “Feb”)
weekday_nameThe full name of the weekday (e.g., “Tuesday”)
weekday_abbreviationThe three letter abbreviation of the weekday (e.g., “Tue”)
week_in_monthThe week number in the month
week_in_yearThe week number in the year
day_in_weekThe day in the current week. Sunday = 1
day_in_yearThe day in the year
is_working_day1 if the day is a weekday, 0 if the day is on the weekend
year_in_dimensionThe ordinal year in the dimension based on the start date. This is here for reference but can be handy for quick calculations between dates.
month_in_dimensionThe ordinal calendar month in the dimension based on the start date. This is here for reference but can be handy for quick calculations between dates.
day_in_dimensionThe ordinal day in the dimension based on the start date. This is here for reference but can be handy for quick calculations between dates.
iso_yearISO 8601 year. This may not be in sync with traditional calendar years.
iso_week_in_yearThe week number in the ISO 8601 year.
iso_day_in_yearThe day number in the ISO 8601 year.
iso_day_in_weekThe ISO day in the current week. Monday = 1

You’ll notice that I have a bunch of seemingly pointless columns, like the quarter_month_day, month_day and “in dimension” columns. I put columns like that in to make my SQL where statements cleaner, or to do integer math instead of date math.

Let me give you an example. Let’s say today is April 10, 2019 and I want to do a year to date for this year and last year.

My where statement condition would be:

select 
  * from some_table left join 
  date_dimension 
  on 
  some_table.date_id = date_dimension.date_id 
where
  date_dimension.year in (2018, 2019)
  and 
  date_dimension.month_day <= 410

That’s a lot simpler than:

select 
  * 
from 
  some_table left join 
  date_dimension 
  on some_table.date_id = date_dimension.date_id 
where 
  (date_dimension.date_id between 20180101 and 20180410)
or
  (date_dimension.date_id between 20190101 and 20190410)

Depending on the types of reporting you do, you may find those columns useless, so it might make sense to exclude them from your implementation of this date dimension.

Because you’re probably going to modify the SQL to your situation, I didn’t include a CREATE TABLE statement to accompany the date dimension SQL. It’s also because you may actually choose to implement it without a table.

You can also use the code to create a view (materialized or not) or even use it as a subquery. Generally speaking, I wouldn’t go with those two options, but I don’t know your data situation, so go with the solution that works best for your particular situation.

 

 

Other articles by this author:

Please follow and like us:
error

Leave a Reply

Your email address will not be published. Required fields are marked *