Get To Know Extrata, Episode 2: Database Writeback

Overview

In our continuing series acquainting you with the less obvious features in Extrata, I bring you our episode on database writeback. The big selling feature of Extrata is that it can dump data in Excel format. There’s a reason for this. Most people in business and academia have some familiarity with Excel. Also, advanced users of Excel find wonderful ways of fitting square pegs into round holes. It’s quite amazing what people can do with Excel.

How The Feature Came About

Not too long into Extrata’s life cycle, I added the ability to use some databases (MySQL, ODBC, Postgresql and SQLite) as a data source. While the cloud API sources probably have wider appeal, I figured there were instances where people would want to dump the results of SQL queries into an Excel file without having to constantly harass the IT department.

Once the capability to read from a database was wired into Extrata, the next logical step seemed to be adding database writeback. After some discussion, Vince and I figured this was more of a niche feature, so we restricted writeback to the Enterprise Edition of Extrata.

I made a judgment call and still require users to save data to a file format before Extrata writes to a database. My reasoning is that a lot of data sources in Extrata are API calls, and some services are metered. I would rather users have an extra copy of the API data in a file than none at all in the event of a database write error.

Extrata offers two ways to write to a database. The first way is to dump each record into a JSON field (for the databases that support it). The second way is to dump each record into a traditional table, where each table column name has to match the corresponding data attribute name from the source query.

Since the data that Extrata gets from data sources can be unstructured, the first method is my preferred way to write back to a database, especially if your data source is a cloud API. Cloud APIs can change without notice, and this will ultimately cause problems in the event that an attribute has been renamed, added or deleted.

This is why I prefer to stage my data loads into a table that has a single JSON attribute and use a query or view to format the data in the way I want.

How To Use Database Writeback

As I mentioned earlier, there are two types of database writeback. The first and easiest way is to write to a JSON column. The second way is to write to a predefined table.

One important thing to note: the current version of Extrata will only insert records into a database table. There is no option to clear existing records prior to a load (this will be remedied in a future version).

Create A Database

If you don’t already have one, create a new database on a platform supported by Extrata: MySQL, Postgresql or SQLite. If your database supports ODBC, that will work too; just create the database and a corresponding ODBC DSN for it.
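
If you just need something to experiment with, a single statement is enough on Postgresql (the database name here is made up for illustration):

create database extrata_staging;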

Create A Database Source

To write to a database, Extrata needs to know what that database is. So in your Sources view, create a new source that points to the database you created in the last step. Note that any SQL sample code I provide here is for Postgresql. You may need to make adjustments for the database platform that you are using.

Create Your Database Table

Option 1: JSON Column

If you are using a JSON-friendly database like Postgresql, you just need to create a simple table using code similar to the following:

create table my_staging_table (
  id bigserial, -- this column is optional but handy 
  json_value json
)

The id column is not required, but I like having an auto-incrementing ID to help me differentiate multiple inserts.
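
To give you an idea of what the staging-plus-view approach mentioned earlier looks like, here is a minimal Postgresql sketch. The attribute names are made up for illustration and would need to match whatever your data source actually returns:

create view my_staging_view as
select
  id,
  json_value ->> 'name' as name,
  json_value ->> 'country_code' as country_code,
  (json_value ->> 'year')::integer as year
from my_staging_table;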

Option 2: Predefined Table

If you want to create a predefined table, you’re going to have to dry run a normal task against your source to generate an Excel or CSV file. The header names in the file are the names you should use as your table’s column names. Here’s an example of a predefined table for Calendarific exports:

create table my_staging_table (   
  id bigserial, -- this column is optional but handy 
  name varchar(255), -- when setting varchar sizes, you should be familiar with the data first
  description text,
  type varchar(1024),
  country_code varchar(2),
  year integer,
  month integer,
  day integer,
  date_id integer
)

Predefined tables are a little more “fragile” than using a JSON column, but depending on what database engine you’re using, a JSON column might not be an option.

The reason why predefined tables are more fragile is that if your data source output headers change, you’re probably going to encounter errors. Not only do the column names need to match, the data types do as well.

Enable Writeback On Your Task

In the Output Options of your task, go to the DB Output tab.

In the Database Destination drop down, select the source that you created earlier.

In the Table Name field, enter the name of your destination table.

In the Table Type drop down, select either “Write to a predefined table” or “Write to a JSON column” depending on the type of table you created earlier.

Save your task and then Run it.

If all goes well, you should find that the table you created earlier will be populated with data!
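
If you want a quick sanity check from your SQL client, queries like these (using the JSON staging table from above) will confirm that the rows actually landed:

select count(*) from my_staging_table;
select json_value from my_staging_table limit 5;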

But Wait…

Doesn’t the database writeback feature make Extrata an ETL tool? Sorta, I guess…

Let’s set the record straight: Extrata is not designed to replace an industrial-strength ETL tool. Having said that, it can do some lightweight ETL. If you run a task with database writeback from the command line interface, it can integrate nicely with many ETL tools. In fact, we’ve got a client who runs Extrata on a server to stage data for more complicated loads with Pentaho Data Integration.

If you care to be creative, you can do multiple passes on the same data with the database writeback feature. You can use one task to do the initial extraction into a database table. You can then create another task to extract the extracted data into yet another database table, ad infinitum. But just because you can, doesn’t mean you should.

It’s not for me to tell you how to use or not to use Extrata.  Just use it in the way that maximizes value for you.

Set up your own Business Intelligence/Data Warehousing/Reporting solution on the cheap

Overview

When I first started working in Business Intelligence around 15 years ago, solutions were incredibly expensive, and that was before you even spent a dime on the effort to build them out.

Software licenses were ridiculously expensive, and you often had to buy licenses for multiple environments.

Here’s a taste of what software you’d need to buy:

  • Server operating system (usually Windows Server)
  • Relational Database license (MS SQL, Oracle or DB2)
  • Source Control Software
  • Database Design Software (ERStudio, ERWin)
  • ETL Software (Cognos Data Manager, Informatica, etc.)
  • Reporting Server Software (Cognos, Microstrategy, etc.)
  • End user client licenses

Most environments we dealt with had a development, test/stage and production environment. So you’d have to buy hardware for all three. And depending on the licensing model used by the software vendor (which changed regularly), you’d have to buy licenses for all three environments. Also, you were looking at incremental costs for end user licenses too.

For a large installation, you could easily spend $100K on software licenses alone.

Well, the good news is that 2005 was a long time ago, and technology has changed. You can still spend a fortune if you want to, but if you’re a small to medium enterprise (SME), you’ve got options. Most of what I’m covering in this post is framed around SMEs, so if you’re in the Fortune 500, this article is probably not for you.

Servers

Physical servers aren’t your only options these days. You can rent computing time or virtual servers in the cloud, or you can virtualize on-premises. You can also use containers like Docker. I’m not going to dig very deep on any of those topics, since there are already plenty of articles on the Interwebs on those things.

The main point here is that Linux has much more acceptance today than it did a decade ago, even in traditional Windows shops. There are free and paid flavours of Linux, but if you’re on a tight budget, you can easily get away with using Ubuntu Server or CentOS. The communities around both are quite large, and you can get answers to your questions from those communities pretty fast.

Provisioning Linux servers is also pretty easy these days. Chef, Puppet and Ansible are available in free, open-source versions, and can make life easy and predictable when managing many multi-tier environments.

 

Relational Databases

Enterprise relational databases are pretty expensive when you stick to one of the major vendors. I remember having clients that paid six figures for enterprise database licenses during the dot-com boom. Back in those days, few companies trusted open source relational databases.

Today, Postgresql is not only free, it’s reputable too. Unless you need very niche features, I would argue that Postgresql is more than enough for most SMEs wanting to build a data warehouse or reporting database.

 

Source Control Software

There are tons of options with respect to source control software. If you don’t have any existing software in place, I would recommend that you use git. It’s widely used in the open-source community, and there are hosted git services like Github and Bitbucket that let you have private repos for free or at low cost. In-house, we use git for everything.

 

Database Design and Schema Management

If you like visual database design applications like ERStudio, there are some free and inexpensive tools available like PgModeler, System Architect and SQL Power Architect.

I live mostly in text editors these days, so I don’t really use visual database design tools very often. Personally, I prefer to use database schema migration tools (not to be confused with data migration). My tool of choice for schema migrations is Flyway. Another excellent option is Liquibase, but it’s a little more involved. Both are free, but have paid support for those who need it.

For large databases, I would use a simple drawing tool or Treesheets to visually map out the tables, but for actual table and column management, I still prefer raw SQL. This is a very personal preference, and I know a lot of people would rather not do things this way.

Let me explain my own reasons for choosing this path (for what it’s worth, you should choose the path that works best for your organization/projects).

When it comes to source control, text files like SQL files work much better. It’s easier to catch changes and diffs in a commit. I like to use tags in my source control tool (git) to identify releases.
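
Tagging and publishing a release is just a couple of git commands. For example, using the release numbering I describe below:

git tag -a 9 -m "Release 9"
git push origin 9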

When you have databases across multiple environments (development, staging, production), there’s a good chance that you have a different release of the database in each environment. For example, development could be running release 10, which is in active development, staging could be running release 9, and production could be running release 6.

Your source control is going to have all of the latest code, but you will want to ensure that you can promote the correct release in each environment. This is easy to accomplish using git tags (you can use a different process; this is just one I picked up doing Ruby on Rails development). I usually have a local file that tracks the versions in staging and production, which would look something like this:

{
  "staging": 9,
  "production": 6
}

Let’s say that staging passed all tests, and we want to promote release 9 to production. I would change the value for production in the file above from 6 to 9 and run my promotion script.

My promotion script would basically ssh into my production database server, do a git pull for the hash of my release tagged as “9” and then execute Flyway to update the database schema to release 9.
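
For what it’s worth, a stripped-down sketch of that kind of promotion script might look something like the following. The host name, repo path and config file are placeholders, not my actual setup:

#!/usr/bin/env bash
# Promote the tagged schema release to production (illustrative only)
RELEASE_TAG="9"
ssh deploy@prod-db-server "
  cd /opt/warehouse/schema &&
  git fetch --tags &&
  git checkout tags/$RELEASE_TAG &&
  flyway -configFiles=flyway.conf migrate
"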

Because I use a schema management tool like Flyway, each database has its current state maintained in a table, which is updated every time a migration is run.

As far as schema migrations go, I follow the forward-migrations-only philosophy, which means no rollbacks.

I don’t want to get too deep in the weeds on database schema migrations, as it merits its own post. Having said that, I consider schema migration tools to be essential to the development process.

 

ETL Software

There are some pretty good free ETL (Extract, Transform, Load) solutions out there. The two I like are Pentaho Data Integration (Community Edition) and Scriptella. Pentaho’s Community Edition is free to use, but you have a path to paid support if you choose to upgrade.

Pentaho and Scriptella are on opposite sides of the spectrum. Pentaho is a GUI based tool, and it’s very powerful. Scriptella is a text file based tool (using XML), and has fewer bells and whistles. Scriptella is, however, no slouch in the capabilities department.

Because of my bias towards text-file based source code, I prefer Scriptella. Since I generally prefer an ELT (Extract, Load, Transform) approach versus an ETL approach, Scriptella is more than good enough for most situations. Why I have that preference is a separate post, but in short, I find ELT to be easier to maintain.

 

Reporting Server Software

Of all topics, the Reporting Server is the trickiest to talk about. There are some very strong opinions in this area. Just a gentle reminder, this entire post is focused on SMEs, so bear that in mind. Some of my opinions here are based on how I’ve seen BI deployed in organizations.

First and foremost, I am of the opinion that the data warehouse/reporting database is absolutely the most important part of any solution. If you’ve done a good job with the database, reporting is simple.

Also, there is absolutely a place for enterprise reporting software, even in SMEs. It really comes down to separating imaginary requirements from real-world requirements.

If you have no idea how your users will use reporting, and you want to give them ad-hoc reporting, an enterprise solution like Cognos might actually be the best solution. Enterprise solutions are designed to be everything to everyone, and as long as your reporting database is in order, you’ll be off and running in no time.

Okay, now that I’ve got that out of the way, if you:

  • actually know what you want from a reporting perspective
  • have finite and relatively unchanging reporting requirements
  • can decide how users get the reports (as opposed to management or the users deciding for you)

… then you might be able to get away with rolling your own reporting solution. In fact, a reporting server may not even be part of the solution.

If your users live in Excel (and this is more commonplace than you might think), why on earth do you need to force them to log into a web portal to get their reports? There are easier ways to get Excel reports to your users. You can use automation to dump queries into Excel files and put them in secure network file shares, or you can e-mail them to your users as attachments. We use Extrata (shameless plug) to e-mail our in-house Excel reports from our data warehouse. Extrata’s not free, but the Enterprise Edition required to do this is only $99.

If you have users who already create reports from ad-hoc queries, why not give them a license for Microsoft Access (or R, or Power BI, or whatever other tools they are already using) and let them build the reports there? Many of the analysts on client projects I’ve worked on were already using Access in addition to any reporting tools we deployed.

If you’ve got canned reports that are generated at designated times, why not e-mail them instead of using a reporting portal? Alternatively, you can send them to secure folders in Office 365 or Dropbox. Again, we use Extrata to build and send Excel and other Office formatted reports for our in-house reporting.

Noticing a trend here? With the advent of the smartphone and tablet, a web portal is actually a barrier that can slow users down from getting the information that they need.

Are canned web reports good enough for your users? Let’s say you do need a portal. Modern web application frameworks like Ruby on Rails can already act as report portals. At their core, they simply run SQL queries and dump out the results as HTML. Contrary to popular perception, this is not reinventing the wheel. Most of the coding is simple. If there is any complexity, it’s usually related to the implementation of some weird chart or graph, but you’ll probably have the same issues producing that chart with a commercial enterprise solution too. The big bonus of using a web application framework is that those skills are cheaper and much more widely available than a person who knows a proprietary enterprise tool.
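
To make that concrete, here’s a bare-bones sketch of a canned report in a Rails controller. The table and column names are invented for illustration; the matching view template would just loop over @rows and render an HTML table:

class SalesReportsController < ApplicationController
  def monthly
    # Run a plain SQL query against the reporting database
    @rows = ActiveRecord::Base.connection.select_all(<<~SQL)
      select year, month, sum(amount) as total_sales
      from fact_sales
      group by year, month
      order by year, month
    SQL
  end
end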

Now that I’ve gotten my opinionated take on reporting servers out of the way, there are some free/affordable solutions you can consider outside of rolling your own solution:

Depending on your requirements and technical expertise, your mileage will definitely vary.

 

Gotchas

Some of the solutions I’ve recommended, like Flyway, Liquibase, Pentaho and Scriptella require Java in order to run. Because of some recent changes to Java licensing, you may have to pay to use Java in a commercial setting. I’m no Java licensing expert, so you might want to investigate that before using those tools. Alternatively, you can use AdoptOpenJDK instead, but there may be a risk of compatibility issues.

 

Wrap Up

My opinionated take on doing BI/DW on the cheap may not be for everyone, but hopefully it does give you some ideas and insight into how you can get a reporting solution into your SME without the burden of high software costs. Generally speaking, the effort to build it out is about the same, so be aware of that.
Get To Know Extrata, Episode 1: The Command Line Interface

Overview

We’ve been a little remiss in keeping up our marketing efforts for Extrata, and we’re going to try to change that. This post is going to be the first of hopefully many articles to come.

Extrata is a pretty nifty product. If I’m being honest, it’s not an easy product to explain, because it can do so much. There’s no single, pithy sentence that can be used to describe it. Why? Because it can be different things to different people. Either way, a lot of features have been added to Extrata in the past months that haven’t even been posted to our web site. So this “series” of posts is an attempt to remedy that.

All right. Let’s go. I don’t have any particular order for these posts, so I’ll pick whatever’s at the top of my head. This week, it’s the Extrata Command Line Interface.

The Superhero Origin Story

The Extrata Command Line Interface (CLI) is available to Enterprise users of Extrata. It’s a Windows executable that lets you run Extrata tasks headless on a server and/or integrate it with other tools that have shell execution capability.

The CLI does have its own origin story. In the early days of Extrata, Vince and I would have regular back-and-forths about what the app should be able to do. Our initial release was a simple proof of concept: pluck some API data from Stripe and dump it into Excel. Ideas germinated. Ideas grew. As with any new product, if you give users an ounce of functionality, they’ll be asking for a pound in short order. Vince had ideas in abundance. Could we schedule it? (Yes, but… that’s an explanation for another episode of Get To Know Extrata.) Could you run it on a server? With the CLI, yes!

As we started to build out functionality, I added the ability to back up your Extrata content into an archive. Once I wrote this code, I took the further step of letting you export an Extrata project and its tasks into a package. And with this feature, I was finally able to create the command line interface.

An Extrata package file is basically a portable, password-protected archive. My original intention was to be able to create a sample project and share it with other Extrata users. Because the package file is self-contained, it also allowed me to compile a command line version of Extrata so that tasks could be run headless. Cool!

How To Use The Command Line Interface

From any Extrata project information page, click on the Export Project Package icon (it looks like a box with a down arrow).

A pop-up will appear.

In the Save To field, browse to a location on your computer and give the file a name with the .zip extension.

Click on the Include Secrets checkbox. This will put the passwords and/or API keys you need to run the tasks in your project into the package file.

In the Password and Password Confirmation boxes, enter a long secure password. You will need to provide this package file to the command line interface to run the task.

If you go back to the project information page, you should see a section called “Example Command Line Invocations”. In that section is a list of sample commands that you can use to execute the task from the command line.

You can use these as a template to fire off commands. For example:

extrata_cli.exe -z E:/Extrata/warehouse_sample.zip -p {package password} -t 6814bdab-89fb-413c-9ca3-f2c4a43015c3 &&:: Calendarific Canada

In the example above, you would replace {package password} with the password you used to save the package file. Also, you may have to provide the full path to the extrata_cli.exe file. Note that the &&:: Calendarific Canada text above is a comment to help you identify which task the command applies to. You can leave it out at run time.

If, for example, your password for the package was test_password, then the command would be:

extrata_cli.exe -z E:/Extrata/warehouse_sample.zip -p test_password -t 6814bdab-89fb-413c-9ca3-f2c4a43015c3

If you’re an advanced user and you’re wondering what the command line switches are, they are as follows:

-z  The path to the Project package zip file
-p  The password for the Project package zip file
-t  The ID for the task to be run. You can find this on the Task Info page or in the Project Info page’s Example Command Line Invocations list

Real World Use Cases

We’ve used the command line interface in different ways. We’ve used it with Windows Scheduler. We’ve also used it in batch files and in Pentaho Data Integration as part of a larger ETL solution. You’re basically limited only by your imagination. We have implemented Extrata-based solutions on client servers using the command line interface.
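
As a concrete example, a scheduled batch file that fires off a couple of tasks can be as simple as the following. The install path and the second task ID are placeholders:

rem nightly_extracts.bat -- run two Extrata tasks back to back
"C:\Program Files\Extrata\extrata_cli.exe" -z E:/Extrata/warehouse_sample.zip -p test_password -t 6814bdab-89fb-413c-9ca3-f2c4a43015c3
"C:\Program Files\Extrata\extrata_cli.exe" -z E:/Extrata/warehouse_sample.zip -p test_password -t 00000000-0000-0000-0000-000000000000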

No matter how you choose to use it, the Command Line Interface does let you take Extrata to the next level. It transforms Extrata from an end-user tool into an enterprise tool, which is why it’s available only with the Enterprise Edition of Extrata.

 

Handy Subreddits for the Business Intelligence Professional

Reddit has a bit of a bad reputation for being a hotbed of trolls and troublemakers, but Reddit is a lot more than its reputation.

While there are indeed large numbers of horrible people on Reddit (as there are on Twitter and Facebook), many of the specialized communities are helpful, informative, and civil.

If you’re a business intelligence pro, I’ve compiled a list of useful subreddits where you can find information for your areas of interest. The list is by no means exhaustive, and it’s in no particular order.

Here’s the list:
Easy Database Version Control (a.k.a. migrations) with Flyway

Managing database schemas across multiple environments can be a chore if you don’t have a schema management tool (sometimes referred to as database migrations, but not to be confused with data migrations) in place.

I first became familiar with the notion of database migrations with Ruby on Rails. Rails’ Active Record uses its own domain-specific language (DSL) so you don’t have to write SQL by hand. A typical piece of migration code looks like this:

class CreateProducts < ActiveRecord::Migration[5.0]
  def change
    create_table :products do |t|
      t.string :name
      t.text :description
      t.timestamps
    end
  end
end

The code above would create a products table with an autonumbered ID column, a name column, a description column, and autopopulated timestamps for creation and update times. The equivalent SQL in Postgresql for the above would be:

create table products (
  id bigserial,
  name varchar(255),
  description text,
  created_at timestamp,
  updated_at timestamp
)

So why not just use SQL instead of a DSL? That’s a good question. In the particular case of Rails, ActiveRecord is database platform agnostic. It will create the table using the specific datatypes of whatever database you’re running. Raw SQL doesn’t offer you this advantage, because there are always some datatypes that are specific to a particular vendor.

Having said that, for any given project you’re usually on one particular database platform anyway. Unless you’re building a cross-platform product, the lack of a database-agnostic migration tool probably isn’t a deal breaker.

Since Rails database migrations are really for Rails applications, it obviously doesn’t make sense to use them for non-Rails applications.

 

Why Use A Database Migration Tool?

Version Control

Database migration tools let you version your database easily. Ideally, your changes are coded into text files like SQL files, so that they work well with commonly used version control systems like Git. Every change you apply to the database is essentially a “version”.

With most migration tools, every time you apply a change to a database by running the migration, a special table in that database is updated so that version state is tracked. This way, the next time the migration tool is run, it will skip over any changes that were applied to the database already.
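
With Flyway, for example, you can look at that bookkeeping table yourself. In recent versions it is called flyway_schema_history (older releases used schema_version):

select installed_rank, version, description, installed_on, success
from flyway_schema_history
order by installed_rank;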

 

Promotion Across Environments

Because each database has its own version tracking, deployment across multiple environments (i.e., development, testing, and production) is a snap.

For example, let’s say your development database is on version 5, test database is on version 3 and production is on version 1.

Now that your test environment has been validated by QA, you want to promote production to version 3.

Assuming you tagged version 3 in your source control system, all you need to do is a pull of version 3 onto your production system and then run the migration.

When you run the migration for version 3 on production, the changes for version 2 and version 3 are applied to production to bring it from version 1 to 3. Once the promotion is complete, the local database version is updated to note what changes were made to the database.

 

A Note on Forward Only Migrations

Most migration tools have a mechanism to let you roll back database changes to a previous version, but rolling back can have negative side effects.

Stack Overflow’s Nick Craver has an excellent explanation for the rationale for forward-only database migrations in a blog post he wrote explaining how Stack Overflow handled deployments.

Generally speaking, it’s easier to fix a problem with a new migration than it is to roll-back and reapply a fixed migration.

 

Why Flyway?

While there are database migration libraries for every major coding platform, the first big standalone migrations tool I used was Liquibase. Liquibase is a freemium tool built on Java. It’s fairly straightforward if you’re technically minded, and it’s powerful. It also gives you the flexibility of defining your database changes in XML, YAML, JSON or SQL, plus the added benefit of being able to reverse engineer an existing database. It’s an excellent tool, but there’s a little bit of a learning curve with it.

Flyway is also a Java based freemium tool. The thing I like about Flyway is that it’s easy to install, and easy to pick up. There’s less lingo to learn compared to Liquibase, and if you were writing SQL scripts for table creation, there’s not a lot you have to do to make it work with Flyway.

To get started with Flyway, you basically just make sure you have Java 6+ installed, install the Flyway command line app, make sure your table creation SQL files are named properly, and run the migration tool. It’s really that easy. For most common databases, you don’t even need to install drivers or know where they are located.
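
The naming convention is the only real learning curve, and it’s a tiny one. A versioned migration folder might look something like this (the descriptions are made up; the V<version>__<description>.sql pattern is what Flyway looks for), and a single command applies whatever hasn’t been run yet:

sql/V1__create_products.sql
sql/V2__add_description_to_products.sql
sql/V3__create_orders.sql

flyway migrate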

I have been building a demo data warehouse for Extrata and using SQL files to create and manage the schema. To get those SQL files to work with Flyway, all I had to do was rename and order my files using Flyway’s naming convention, and I was done in a matter of minutes.

I highly recommend that you give Flyway a shot, especially if you are not using a tool to handle the versioning of your database schemas. It’s really easy to integrate into your toolchain, and it’s a piece of cake to use.

Be sure to subscribe to our RSS feed, as I plan to do more in-depth writeups on how to use Flyway.