Set up your own Business Intelligence/Data Warehousing/Reporting solution on the cheap

Steven NgDaily Debug BlogLeave a Comment

Overview

When I first started working in Business Intelligence around 15 years ago, solutions were incredibly expensive, and that was before you even spent a dime on the effort to build it out.

Software licenses were ridiculously expensive, and you often had to buy licenses for multiple environments.

Here’s a taste of what software you’d need to buy:

  • Server operating system (usually Windows Server)
  • Relational Database license (MS SQL, Oracle or DB2)
  • Source Control Software
  • Database Design Software (ERStudio, ERWin)
  • ETL Software (Cognos Data Manager, Informatica, etc.)
  • Reporting Server Software (Cognos, Microstrategy, etc.)
  • End user client licenses

Most environments we dealt with had a development, test/stage and production environment. So you’d have to buy hardware for all three. And depending on the licensing model used by the software vendor (which changed regularly), you’d have to buy licenses for all three environments. Also, you were looking at incremental costs for end user licenses too.

For a large installation, you could easily spend $100K on software licenses alone.

Well, the good news is that 2005 was a long time ago, and technology has changed. You can still spend a fortune if you want to, but if you’re a small to medium enterprise (SME), you’ve got options. Most of what I’m covering in this post is framed around SMEs, so if you’re in the Fortune 500, this article is probably not for you.

Servers

Physical servers aren’t your only options these days. You can rent computing time or virtual servers in the cloud, or you can virtualize on-premises. You can also use containers like Docker. I’m not going to dig very deep on any of those topics, since there are already plenty of articles on the Interwebs on those things.

The main thing of importance here is that Linux has much more acceptance today than it did a decade ago, even in traditional Windows shops. There are free and paid flavours of Linux, but if you’re on a tight budget, you can easily get away with using Ubuntu Server, or CentOS server. The communities using them are quite large, and you can get answers to your questions online from these communities pretty fast.

Provisioning Linux servers is also pretty easy these days. Chef, Puppet and Ansible are available in free, open-source versions, and can make life easy and predictable when managing many multi-tier environments.

 

Relational Databases

Enterprise relational databases are pretty expensive when you stick to one of the major vendors. I remember having clients that paid six figures for enterprise database licenses during the dot-com boom. Back in those days, few companies trusted open source relational databases.

Today, Postgresql is not only free, it’s reputable too. Unless you need very niche features, I would argue that Postgresql is more than enough for most SMEs wanting to build a data warehouse or reporting database.

 

Source Control Software

There are tons of options with respect to source control software. If you don’t have any existing software in place, I would recommend that you use git. It’s widely used in the open-source community, and there are hosted git services like Github and Bitbucket that let you have private repos for free or at low cost. In-house, we use git for everything.

 

Database Design and Schema Management

If you like visual database design applications like ERStudio, there are some free and inexpensive tools available like PgModeler, System Architect and SQL Power Architect.

I live mostly in text editors these days, so I don’t really use visual database design tools very often. Personally, I prefer to use database schema migration tools (not to be confused with data migration). My tool of choice for schema migrations is Flyway. Another excellent option is Liquibase, but it’s a little more involved. Both are free, but have paid support for those who need it.

For large databases, I would use a simple drawing tool or Treesheets to visually map out the tables, but for actual table and column management, I still prefer raw SQL. This is a very personal preference, and I know a lot of people would rather not do things this way.

Let me explain my own reasons for choosing this path (for what it’s worth, you should choose the path that works best for your organization/projects).

When it comes to source control, text files like SQL files work much better. It’s easier to catch changes and diffs in a commit. I like to use tags in my source control tool (git) to identify releases.

When you have databases across multiple environments (development, staging, production) – there’s a good chance that you have different releases of the database in each environment. For example, Development could be running release 10, which is in active development, staging could be running release 9, and production could be running release 6.

Your source control is going to have all of the latest code, but you will want to ensure that you can promote the correct release in each environment. This is easy to accomplish using git tags (you can use a different process, this is just a process I picked up after doing Ruby on Rails development). I usually have a local file that tracks the version in staging and production that would look something like this:

{
  staging: 9,
  production: 6
}

Let’s say that staging passed all tests, and we want to promote 9 to prod. I would change the value for production in above from 6 to 9, and run my promotion script.

My promotion script would basically ssh into my production database server, do a git pull for the hash of my release tagged as “9” and then execute Flyway to update the database schema to release 9.

Because I use a schema management tool like Flyway, each database has its current state maintained in a table, which is updated every time a migration is run.

As far as schema migrations go, I follow the forward-migrations-only philosophy, which means no rollbacks.

I don’t want to get too deep in the weeds on database schema migrations, as it merits its own post. Having said that, I consider schema migration tools to be essential to the development process.

 

ETL Software

There are some pretty good free ETL (Extract, Transform, Load) solutions out there. The two I like are Pentaho Data Integration (Community Edition) and Scriptella. Pentaho’s Community Edition is free to use, but you have a path to paid support if you choose to upgrade.

Pentaho and Scriptella are on opposite sides of the spectrum. Pentaho is a GUI based tool, and it’s very powerful. Scriptella is a text file based tool (using XML), and has fewer bells and whistles. Scriptella is, however, no slouch in the capabilities department.

Because of my bias towards text-file based source code, I prefer Scriptella. Since I generally prefer an ELT (Extract, Load, Transform) approach versus an ETL approach, Scriptella is more than good enough for most situations. Why I have that preference is a separate post, but in short, I find ELT to be easier to maintain.

 

Reporting Server Software

Of all topics, the Reporting Server is the trickiest to talk about. There are some very strong opinions in this area. Just a gentle reminder, this entire post is focused on SMEs, so bear that in mind. Some of my opinions here are based on how I’ve seen BI deployed in organizations.

First and foremost, I am of the opinion that the data warehouse/reporting database absolutely is the most important part of any solution. If you’ve done a good job with the database, reporting is simple.

Also, there is absolutely a place for enterprise reporting software, even in SME’s. It really comes down to separating imaginary requirements from real world requirements.

If you have no idea how your users will use reporting, and you want to give them ad-hoc reporting, an enterprise solution like Cognos might actually be the best solution. Enterprise solutions are designed to be everything to everyone, and as long as your reporting database is in order, you’ll be off and running in no time.

Okay, now that I’ve got that out of the way, if you:

  • actually know what you want from a reporting perspective
  • have finite and relatively unchanging reporting requirements
  • can decide how users get the reports (as opposed to management or the users deciding for you)

… then you might be able to get away with rolling your own reporting solution. In fact, a reporting server may not even be part of the solution.

If your users live in Excel (and this is more commonplace than you might think), why on earth do you need to force your users to log into a web portal to get them? There are easier ways to get Excel reports to your users. You can use automation to dump queries into Excel files and put them in secure network file shares, or you can e-mail them to your users as attachments. We use Extrata (shameless plug) to e-mail our in-house Excel reports from our data warehouse. Extrata’s not free, but the Enterprise Edition required to do this is only $99.

If you have users who already create reports from ad-hoc queries, why not give them a license for Microsoft Access (or R, or PowerBI, or whatever other tools they are already using) and let them build them there? Many of the analysts on client projects I’ve worked on were already using Access in addition to any reporting tools we deployed.

If you’ve got canned reports that are generated at designated times, why not e-mail them instead of using a reporting portal? Alternatively, you can send them to secure folders in Office 365 or Dropbox. Again, we use Extrata to build and send Excel and other Office formatted reports for our in-house reporting.

Noticing a trend here? With the advent of the smartphone and tablet, a web portal is actually a barrier that can slow users down from getting the information that they need.

Are canned web reports good enough for your users? Let’s say you do need a portal. Modern web application frameworks like Ruby on Rails are already report portals. The way they are designed is that they simply run SQL queries and dump out the results as HTML. Contrary to popular perception, it is not reinventing the wheel. Most of the coding is simple. If there is any complexity, it’s usually related to the implementation of some weird chart or graph, but you’ll probably have the same issues producing that chart on a commercial enterprise solution too. The big bonus of using a web application framework is that those skills are cheaper and much more widely available than for a person who knows a proprietary enterprise tool.

Now that I’ve gotten my opinionated take on reporting servers out of the way, there are some free/affordable solutions you can consider outside of rolling your own solution:

Depending on your requirements and technical expertise, your mileage will definitely vary.

 

Gotchas

Some of the solutions I’ve recommended, like Flyway, Liquibase, Pentaho and Scriptella require Java in order to run. Because of some recent changes to Java licensing, you may have to pay to use Java in a commercial setting. I’m no Java licensing expert, so you might want to investigate that before using those tools. Alternatively, you can use AdoptOpenJDK instead, but there may be a risk of compatibility issues.

 

Wrap Up

My opinionated take on doing BI/DW on the cheap may not be for everyone, but hopefully it does give you some ideas and insight into how you can get a reporting solution into your SME without the burden of high software costs. Generally speaking, the effort to build it out is about the same, so be aware of that.

 

 

 

 

 

 

 

 

 

 

Other articles by this author:

Leave a Reply

Your email address will not be published. Required fields are marked *