Generating a Date Dimension Table with Pentaho Data Integration

Vince IarusciBusiness Intelligence, Data Integration, Pentaho1 Comment

Overview

A good date dimension is essential for any data warehouse or reporting database. If you don’t have the right tools, you’ll often find yourself generating this date dimension in Excel, which can be tedious.

This reusable Pentaho Data Integration transformation lets you create a date dimension with a user-defined start date and the number of days.

Feel free to tweak to your heart’s content. The download link is at the bottom of the blog entry.

I’m aware that Pentaho does have a built-in date dimension function, but for this exercise, I wanted to show you the formulas in the event that you wanted to port it to another ETL tool.

Caveats

The first date of this date dimension starts in January 2005. You can change this date as indicated in the transformation instructions.
This transformation is for educational purposes only. It’s up to you to make sure it’s debugged if you’re thinking of putting it into a production system.
This transformation is based on a standard calendar year. It doesn’t factor anything like lunar dates, 4-5-4 calendars or fiscal years that don’t start on January 1.

How it Works

A quick view of the transformation is here:

There are 5 basic steps used in this transformation.

  1. Using a “Generate Rows” step, set the first day of the dimension to iterate an arbitrary number of times. In the sample, 5475 lines are created, which means there are 5475 days in the dimension or 15 years.

2. Next, use an “Add Sequence” step to create a counter for each row.

3. Use a “Formula” step to create each date in the dimension, using the counter from the previous step.

4. Create the rest of the date dimension elements using another “Formula” step.

5. Finally, output the dimension table to an Excel file. You can change this to any other output type that you would like.

Date Dimension

In the date dimension, “In Dimension Counters” are created. These are basically running counts of the Year, Quarter, Month and Day in the dimension. I also like to include Month and Weekday names and abbreviations.

These added elements are useful because some database/reporting engines do not have fully fleshed date functions, so these added elements allow you to use plain old SQL to do conversions and calculations.

Elements
  • DimensionDate – The date, as a date type.
  • DateSid – The date, as an integer representation of YYYYMMDD.
  • Year – The year, as an integer.
  • Quarter – The quarter, as an integer.
  • Month – The month in the year, as an integer.
  • Day – The day in the month, as an integer.
  • DayInYear – The day in the year, as an integer between 1 and 366.
  • Weekday – The day of the week, starting on Sunday indexed at 1.
  • MonthName – The name of the month, in English.
  • MonthAbbreviation – Three letter abbreviation of the month, in English.
  • DayName – The weekday name, in English.
  • DayAbbreviation – Three letter abbreviation of the weekday, in English.
  • YearInDimension – The year counter for the entire dimension, starting at 1.
  • QuartersInDimension – The quarter counter for the entire dimension, starting at 1.
  • MonthInDimension – The month counter for the entire dimension, starting at 1.
  • DayInDimension – The day counter for the entire dimension, starting at 1.

Notable Formulas and Gotchas

Calculating Month/Weekday Names and Abbreviations:
Since the Pentaho Formula step doesn’t have month or weekday name date functions, a variant of the tip was used to generate these.

DateDifs and In Dimension Counters:
Pentaho’s DateDif function in the “Formula” step appeared to be a little buggy, so many of the “In Dimension” element calculations are a little more complicated than they needed to be.

Because the first year start date is not January 1 (which appears to be the cause of the bug), I had to create an exception for the first year counters. For example, the month in dimension calculation is as follows.

if
(
  // calculations for the first year
  year([FirstDate])=year([DimensionDate]);
  
  datedif([FirstDate];[DimensionDate];"m")+1;
  datedif([FirstDate];date(year([FirstDate]);12;31);"m")+1) 
  +
  // calculated from second year forward
  if(year([FirstDate])=year([DimensionDate]);0;
  datedif(date(year([FirstDate])+1;1;1) ;
  [DimensionDate];
  
  "m")+1
)

If the DateDif worked properly (I was getting something like 21 months between 2016/10 and 2017/1 – what???) then I’d be able to drop the first part of the formula above.

In any case, the formula now only suffers from a little redundancy, but it can accommodate any date dimension that doesn’t start on New Year’s day.

Transformation File

DateDimensionGenerator.ktr

 

Luna – WYSIWYG Data Processing

Steven NgAnalytics, CodingLeave a Comment

The Luna 1.0 Beta was just released and is available for download, and it sounds like a very interesting project.

The Luna team describes their product as WYSIWYG Data Processing, but you can look at it as a visual data flow language. I haven’t really had any time to really dig into this, but I think data nerds might really like this. It’s interoperable with other languages, including Javascript.

It’s still an early work, so manage your expectations accordingly. From what I’ve seen and heard, it doesn’t look like there’s any PostgreSQL support yet.

Tufte CSS

Steven NgRandom, ReportingLeave a Comment

Are you a fan of Edward Tufte‘s work? I’m a huge fan of his work. I’ve pretty much bought all of his books, and I consider them required reading for anyone who makes reports and visualizations.

Dave Liepmann has created a CSS that mirrors the styling and layouts of Tufte’s books. Edward Tufte himself is even contributing to this open source Github repo. Pretty cool stuff, if you ask me.