A good date dimension is essential for any data warehouse or reporting database. If you don’t have the right tools, you’ll often find yourself generating this date dimension in Excel, which can be tedious.
This reusable Pentaho Data Integration transformation lets you create a date dimension with a user-defined start date and the number of days.
Feel free to tweak to your heart’s content. The download link is at the bottom of the blog entry.
I’m aware that Pentaho does have a built-in date dimension function, but for this exercise, I wanted to show you the formulas in the event that you wanted to port it to another ETL tool.
The first date of this date dimension starts in January 2005. You can change this date as indicated in the transformation instructions.
This transformation is for educational purposes only. It’s up to you to make sure it’s debugged if you’re thinking of putting it into a production system.
This transformation is based on a standard calendar year. It doesn’t factor anything like lunar dates, 4-5-4 calendars or fiscal years that don’t start on January 1.
How it Works
A quick view of the transformation is here:
There are 5 basic steps used in this transformation.
- Using a “Generate Rows” step, set the first day of the dimension to iterate an arbitrary number of times. In the sample, 5475 lines are created, which means there are 5475 days in the dimension or 15 years.
2. Next, use an “Add Sequence” step to create a counter for each row.
3. Use a “Formula” step to create each date in the dimension, using the counter from the previous step.
4. Create the rest of the date dimension elements using another “Formula” step.
5. Finally, output the dimension table to an Excel file. You can change this to any other output type that you would like.
In the date dimension, “In Dimension Counters” are created. These are basically running counts of the Year, Quarter, Month and Day in the dimension. I also like to include Month and Weekday names and abbreviations.
These added elements are useful because some database/reporting engines do not have fully fleshed date functions, so these added elements allow you to use plain old SQL to do conversions and calculations.
- DimensionDate – The date, as a date type.
- DateSid – The date, as an integer representation of YYYYMMDD.
- Year – The year, as an integer.
- Quarter – The quarter, as an integer.
- Month – The month in the year, as an integer.
- Day – The day in the month, as an integer.
- DayInYear – The day in the year, as an integer between 1 and 366.
- Weekday – The day of the week, starting on Sunday indexed at 1.
- MonthName – The name of the month, in English.
- MonthAbbreviation – Three letter abbreviation of the month, in English.
- DayName – The weekday name, in English.
- DayAbbreviation – Three letter abbreviation of the weekday, in English.
- YearInDimension – The year counter for the entire dimension, starting at 1.
- QuartersInDimension – The quarter counter for the entire dimension, starting at 1.
- MonthInDimension – The month counter for the entire dimension, starting at 1.
- DayInDimension – The day counter for the entire dimension, starting at 1.
Notable Formulas and Gotchas
Calculating Month/Weekday Names and Abbreviations:
Since the Pentaho Formula step doesn’t have month or weekday name date functions, a variant of the tip was used to generate these.
DateDifs and In Dimension Counters:
Pentaho’s DateDif function in the “Formula” step appeared to be a little buggy, so many of the “In Dimension” element calculations are a little more complicated than they needed to be.
Because the first year start date is not January 1 (which appears to be the cause of the bug), I had to create an exception for the first year counters. For example, the month in dimension calculation is as follows.
if ( // calculations for the first year year([FirstDate])=year([DimensionDate]); datedif([FirstDate];[DimensionDate];"m")+1; datedif([FirstDate];date(year([FirstDate]);12;31);"m")+1) + // calculated from second year forward if(year([FirstDate])=year([DimensionDate]);0; datedif(date(year([FirstDate])+1;1;1) ; [DimensionDate]; "m")+1 )
If the DateDif worked properly (I was getting something like 21 months between 2016/10 and 2017/1 – what???) then I’d be able to drop the first part of the formula above.
In any case, the formula now only suffers from a little redundancy, but it can accommodate any date dimension that doesn’t start on New Year’s day.
Other articles by this author:
- Track Automated ETL Jobs in Pentaho
- How to Send Reports to a Slack Channel
- Random Thoughts: Vince’s Best and Worst of 2017
- Not Now, Later.
- Extracting Stripe Transactions for Accounting Entries