Facebook Open Sources Detectron, An Object Detection Library

Steven Ng | Analytics, Business Intelligence, Coding

So this one is interesting, especially for those who want to get data out of images. Facebook’s research team has open sourced Detectron under the Apache license, and it looks like it does some very cool things. According to their blog post, they’ve also got 70+ pre-trained models you can play with from their model zoo.

Don’t get me wrong, while I still think of Facebook as being incredibly creepy and sometimes socially irresponsible, there’s no doubt that they also do a lot of good and that they give back to the community at large in a meaningful way.

Are good people really that hard to find?

Steven Ng | Business Intelligence, Daily Debug Blog

Spoiler Alert – They’re Not That Hard to Find

A common complaint I’ve heard in the business intelligence industry is how hard it is to find good people. I personally don’t think that’s true, but I understand where it comes from: from a consulting company’s perspective, customers want to know how many years of experience your people have with a particular tool, and racking up those years with a piece of software that isn’t easily available to the public means it can be a seller’s market.

These days, when someone in the industry tells me how hard it is to find good people, it usually means one of two things: “I’m too cheap to spend the money on someone with the resume I need” or “I’m more focused on finding people with the name-brand skills (e.g., Cognos, MicroStrategy) than on people who have the equivalent skills”.

It’s Not Rocket Science, People

I don’t want to sound like I’m dismissing the number of years of experience a person has with a particular tool, but honestly, business intelligence is not rocket science. It’s mostly about moving data, querying data, and presenting data. In most cases, the data I speak of lives in relational databases. And here’s the thing, while there are a ton of flavors of relational databases, they’re fundamentally the same. They all pretty much speak the same language (SQL), and do the same core things (create, read, update, delete — aka CRUD). Sure, there are nuances and details a good business intelligence consultant needs to know, but if you are pretty good at SQL and know how a relational database works, you’ve got pretty strong fundamentals to start with.
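
To make “CRUD” concrete, here’s a minimal sketch in plain SQL. The customers table and its columns are made up for illustration, but the four statements look essentially the same on any major relational database:

-- Hypothetical table, used only to illustrate the four CRUD operations.
CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    region        VARCHAR(50)
);

-- Create: add a row.
INSERT INTO customers (customer_id, customer_name, region)
VALUES (1, 'Acme Widgets', 'East');

-- Read: query rows.
SELECT customer_id, customer_name FROM customers WHERE region = 'East';

-- Update: change a row.
UPDATE customers SET region = 'West' WHERE customer_id = 1;

-- Delete: remove a row.
DELETE FROM customers WHERE customer_id = 1;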

A lot (but not all) of the business intelligence people I’ve come across didn’t start off as business intelligence people. In most cases (like Vince’s), they fell into it when an employer bought the software and needed an internal resource to learn it. In rarer cases (like mine), an employer hires a person with the right background and fast-tracks them into the position. When Vince first hired me over a decade ago, he saw that while I had only a little direct business intelligence experience, I had a lot of enterprise software and web development experience to fill in the gaps.

When it comes to “home-grown” business intelligence people, I find there are two types: people who go through the motions, and people who really get into it (I’ll let you guess where Vince and I fall into that grouping). Too many people, in my opinion, just go through the motions. They know that access to the software is generally scarce, and don’t really bother expanding or deepening their skill sets too much. Vince is one of the most knowledgeable Cognos/Data Warehouse guys I know. That’s because he loves the problems he’s solving and he’s a pit bull when it comes to attacking those problems. He also loves learning new methods and techniques. My love for business intelligence is a little different from Vince’s. I love systems and visualizations. I was a huge fan of Edward Tufte’s information design principles before I ever did anything called “business intelligence”. I love coding and doing root cause analysis. Because we come from different backgrounds, when Vince and I work together on business intelligence problems, we end up being greater than the sum of our parts.

Before we started Assign It To Us, I worked at a Cognos shop that Vince co-owned. To maintain our “partner” status with our software vendor, we had to have a certain number of “certified” consultants. Certification meant that you memorized parts of the training guide and were able to pick them out in a multiple choice exam. Neither Vince nor I bothered writing those exams (and speaking for myself only, I’m pretty sure I would have trouble passing them, as I’m not a great test-taker), but we were always the go-to people to solve the problems our certified people could not figure out themselves. How could this be? Vince generally knows how to use the tools better than I do, and because of my background in enterprise software development, I generally have a better understanding of how the tools work under the hood. Because of that difference, we attack problems from two different directions, and more often than not, we figured out problems faster than we could get Cognos to resolve an open ticket for us.

While we still do Cognos today, we’ve been able to take our past experience and offer our customers the option of considering free or low-cost alternatives for business intelligence. Software costs used to be a barrier to deployment and skill acquisition, but they don’t have to be today. Pick the right approach, and you can have a business intelligence solution at a fraction of what it would have cost a decade ago. After all, it’s not rocket science, people.

So You Think You Can Be A Business Intelligence Expert

In my opinion, there are plenty of people out there who have what it takes to be great business intelligence people; they just don’t look like it on paper. It’s really hard to be a Cognos expert if you don’t work at a Cognos shop or have tens of thousands of dollars to buy a license.

On the other hand, there are plenty of open source relational (and non-relational) databases, ETL tools and libraries, and reporting tools. It’s easy to get your feet wet with those, provided you have a core set of skills. I tend to think that a good full-stack web developer (someone familiar with LAMP – Linux, Apache, MySQL, PHP or similar) usually has a good set of fundamentals to make a fantastic business intelligence consultant.

To me, the most critical skill in business intelligence is a solid knowledge of SQL. Yes, I get that most major enterprise business intelligence software is designed so you don’t need to know SQL, but I have found that skill to be invaluable during my time working with Cognos. To me, SQL represents the one true way to know if your data source is right. Debugging the data in an abstraction layer like Cognos Framework Manager first is wasted time since you need to know if it’s a garbage-in-garbage-out situation before investigating any of the software between the database and the end user. I can’t tell you how many times my knowledge of SQL has saved significant time during the root cause analysis process.
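
For example, before touching an abstraction layer I’ll usually run a quick query or two directly against the source to confirm the numbers are sane. This is only a sketch; the sales_fact table and its columns are hypothetical:

-- Quick sanity checks against the source (hypothetical table and columns).
-- 1. Is the expected volume of data even there?
SELECT COUNT(*) AS row_count
FROM sales_fact
WHERE order_date >= '2018-01-01';

-- 2. Do the raw totals match what the report is showing?
SELECT region, SUM(sale_amount) AS total_sales
FROM sales_fact
WHERE order_date >= '2018-01-01'
GROUP BY region
ORDER BY region;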

Because I had full stack development experience, I was pretty familiar with the ins and outs of installing software like web servers and databases, and a side benefit of that was knowing how to troubleshoot networking issues as well as communication issues between web servers, application servers and database servers. Knowing HTML and JavaScript taught me how to identify issues on the client side. When I worked with Cognos people who didn’t have that background, the most common problem description I had to deal with was “It’s not working”. My background in all the aforementioned areas helped me identify what was not working so that I could have a clear definition of the problem and where it was happening. If it was under my control, then I’d fix it myself. If it was a problem with Cognos’ code, I’d have to either figure out a workaround myself or open a ticket with Cognos and hope that a fix would appear in a future patch. But the key thing my development background gave me was that I never felt helpless when a problem occurred.

Of course, not everything was smooth sailing at the beginning. The technical stuff was always fairly straightforward, but the hard part was learning how to map business intelligence terminology to the terminology I already knew. Every domain has its own language, so I had to quickly learn about facts and dimensions and map them to the types of data structures I was already familiar with in my life before business intelligence. Once that was done, things were pretty easy. The main frustration after that was the feeling that things weren’t in your control. When you’re a full stack developer, you generally have very low-level access to all sorts of things. Something broken? I’ll just fix my code. This was no longer the case. A lot of enterprise business intelligence tools are designed to be turn-key systems. Instead of developing your own reporting system with access control, etc., they give you a portal out of the box. Instead of designing your own reports in raw HTML, they give you a WYSIWYG report designer with drag and drop. In other words, you’re given a black box GUI to do a lot of the work. That sense of control you have when you’re a coder is gone at this point. Even without that control, however, you still get the satisfaction of solving customer problems and building stuff. To me, it’s just as rewarding, only different.

How To Find People

I’ve always had the (somewhat controversial) opinion that being a business intelligence expert isn’t about how good you are at [insert enterprise brand name tool here], but about knowing what to do when the software misbehaves. A quick web search turns up hundreds of articles on why “enterprise software sucks”, so it’s no secret that enterprise software will not always behave the way it’s supposed to. Knowing all the pitfalls of a particular brand of software comes with years of experience. The absence of that knowledge, however, doesn’t prevent a relatively skilled person from doing the day-to-day development work.

Because Vince and I are a two-man shop right now, we haven’t had to hire any new people, but we do generally agree on the approach.

It’s more important to hire a person with the right attitude and core technical skills than it is to have certain brands of software on their resume. I will grant that sometimes that’s unavoidable on a project-by-project basis, but if you’re building a team for the long term, I think the better approach is to focus on general competency rather than niche competency. I am personally biased, but I tend to think that full-stack web developers are a great resource to tap into when it comes to hiring newbies for business intelligence. They have the raw skills, and you can fill in the knowledge gaps quite easily. Because full-stack web developers are plentiful, you can find more affordable team members, while bearing in mind that you do get what you pay for.

Of course, when you’re specifically hiring for senior-level people, that strategy may not work. The difficulty in hiring senior-level resources is where the complaint of not being able to find good people comes from. But as I said earlier, you get what you pay for. The best experienced people are usually already working somewhere. Getting them to work for you isn’t as simple as saying “hey, join us, we’re cool!” You need to give them a reason to leave an otherwise highly paid, comfortable position. You have to have something to offer that the person’s current employer does not. And if you’re being truly honest with yourself, you’ll probably realize that all other things being equal, the only thing you have to offer is more money. If you’re not willing to pony up the cash it takes to get a top prospect, your problem may not be so much that “good people are hard to find” as it is that “my company doesn’t have what it takes to attract good people”.


Track Automated ETL Jobs in Pentaho

Vince Iarusci | Data Integration, Pentaho

Overview

You’ve set up your Pentaho jobs and scheduled them with the task scheduler or cron. The next day, and each day after that, you get a flood of success and failure emails from the jobs that run overnight or every hour. We found that our developers spent just as much time wrangling these emails as they did troubleshooting the actual run issues.

So we developed an easier way to track automated ETL jobs in Pentaho.  This article explains how the pieces come together.

What this ETL tracker is:  It’s a high-level tool that monitors all the jobs run by your task scheduling tool and notifies the administrator when jobs don’t run on their expected schedule.

What this ETL tracker is not: This tool is not a complete logging and error trapping solution.  If you need overall traceability of failure points, step errors or system bottlenecks, you would store the detailed logging for all jobs into one central log file/database table.  You could then drill down from the ETL schedule tracking reports to the detailed logging and monitoring info.  A more detailed look at PDI logging and monitoring will be discussed in a future post.

You can download all the components needed to automate your Pentaho jobs by visiting the Assign It To Us GitHub Pentaho Job Automation repo.  A download link is also included at the end of this blog post.

How it Works

There are 4 components used to track the jobs:

1. Schedule the Pentaho job in the Microsoft Task Scheduler, or as a cron job if you’re using a Unix-based OS.

The scheduled job will call a batch script that runs a Pentaho job.


The script below runs the Pentaho job; basic logging is written to the Master_Job.log file:

cd "/Program Files (x86)/Pentaho/data-integration"
kitchen.bat /rep:dw_repo_f /file:C:\Cognos\ETL\Pentaho_Repo\Backup_File_Mgmt\Master_Job.kjb /level:Basic > C:\Cognos\ETL\logs\Master_Job.log
2. Add the ETL Job Log step to the end of each job flow. 

This step will add a timestamp for each run to the target ETL_Job_Log table.

ETL_Job_Log steps:

  1. Get System data – Sets the system date (current date/time) for the StartTime field.
  2. Add Constants – Manual values are set for the JobID and JobName.
  3. Set Values
  4. Table Output – JobCheck_Log – Creates and populates the ETL_Job_Log table.  Each job run timestamp is appended to the table (a possible layout is sketched in SQL below).  We can see that there are jobs that run daily, monthly (Marketing_List_Data_Load) and hourly (Master_Conflicts_Check_Load).
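
For reference, here’s roughly what that log table could look like in SQL. This is a hand-written sketch of an assumed layout, not the exact DDL the Table Output step generates:

-- Assumed layout of the ETL_Job_Log table; Table Output appends one row per job run.
CREATE TABLE ETL_Job_Log (
    JobID     INTEGER,
    JobName   VARCHAR(100),
    StartTime TIMESTAMP
);

-- Example of the kind of row each run appends.
INSERT INTO ETL_Job_Log (JobID, JobName, StartTime)
VALUES (2, 'Master_Conflicts_Check_Load', CURRENT_TIMESTAMP);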


3. Run the Schedule_Job_Check to record the daily schedule info for each job.

This job is scheduled to run every 15 minutes and creates one record for each job showing the last time it was run.  The output targets an Excel file or a data warehouse table that we can report on.  In our case, we do both.

Schedule_Job_Check steps:

  1. Add Constant Rows – Schedule – Defines the constants for each job.  The JobID and JobName will match each respective record from the ETL_Job_Log table.  The frequency, runtime and day are set to match the settings in the task scheduler.  The RuntimeMinutes field is set to match the runtime in minutes, and the Tolerance is set in minutes to match the time between scheduled runs.  These fields are needed for calculations further down the transformation flow.
  2. Table Input – ETL Job Log – This is the input from the timestamp step in the previous section.  SQL is run against the table to return the latest record, max("StartTime"), for each JobID (a sketch of this query follows the list).
  3. Merge Join – Steps 1 & 2 are joined on JobID
  4. Select values
  5. Get System Info – Here we set the current run DateTime, today, first day of week and first day of month. For the last 3 fields, we use the calc that sets the time to 00:00:00.  Later downstream, we’ll add the day and minutes to define the scheduled run date to match what was set in our task scheduler.
  6. Formula – Sets the period start for each job record.
  7. Calculator – Add Runtime Minutes – There are 2 calcs defined.  The first calc adds the RuntimeMinutes set in the first step to the period start defined in the previous step; this gives the expected scheduled run datetime for the job.  The second calc shows the time difference in minutes between the current date and the last run time.  We’ll use this as one of the checks to see if the schedule is running late.
  8. Calculator – Rundiff – This shows the difference in minutes between the LastRunTime and the ScheduleRunDateTime that was calculated in the previous step. The RunDiff is our second check to see if the scheduled run has failed.
  9. Select Values – Rename and set values for fields
  10. Formula – Email Config – This step defines all the settings for our email notifications.  It also has a calculation that looks at our 2 check fields and sets the job with a status of pass or fail.  We use an OR for the logical expression.  Here is the formula – IF(OR([RunDiff]<0;[TimeDiff]>[Tolerance]);"FAIL";"PASS")
  11. Formula – Email Message – Creates the custom text message for the notification.
  12. Select Values – Update metadata.  The flow splits into 2 outputs and 1 notification flow.  Data is written to the ETL_Schedule_Log table in our data warehouse and to an Excel file for reporting purposes.  A screenshot of the table is shown at the end of this section.
  13. The notification steps first check whether the job has a status of Pass or Fail.  If it passed, nothing is done for that record.  If it failed, it moves to the next step, “Case Email?”, which checks the email flag.  If the flag is set to “N”, nothing is done; otherwise it moves to the last step, “Mail”, and sends out a notification with all the parameter settings from the “Email Config” step.  For our purposes, we had mission-critical jobs that ran every half hour, and we needed a way to track those individual jobs whenever they failed.
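
The SQL behind step 2 can be as simple as a grouped aggregate. The sketch below assumes the ETL_Job_Log layout shown earlier, and the trailing comment shows how step 10’s pass/fail rule would read in SQL terms:

-- Step 2: latest run per job, mirroring the max("StartTime") query described above.
SELECT JobID, MAX(StartTime) AS LastRunTime
FROM ETL_Job_Log
GROUP BY JobID;

-- Step 10's pass/fail rule expressed as SQL for comparison (RunDiff, TimeDiff and
-- Tolerance are fields calculated earlier in the transformation, not real columns):
-- CASE WHEN RunDiff < 0 OR TimeDiff > Tolerance THEN 'FAIL' ELSE 'PASS' END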

4. Email/View the ETL Job Run Status report.

A Cognos report was created with exception highlighting on the status field so that any jobs that failed would stick out.  In our case, we scheduled this report to run at 8:00 am each morning, after all the morning scheduled tasks have run.  You could also create the same report using the Excel file or the OXI MS XML injector from our open source toolbox.
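
If you report off the database table rather than the Excel file, the report query is just a filter on the status field. Here is a sketch, assuming the ETL_Schedule_Log table carries JobName, LastRunTime and Status columns (your actual column names may differ):

-- Jobs that missed their expected schedule, for the morning status report.
SELECT JobName, LastRunTime, Status
FROM ETL_Schedule_Log
WHERE Status = 'FAIL'
ORDER BY JobName;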

Download the code: Assign It To Us GitHub Pentaho Job Automation

Generating a Date Dimension Table with Pentaho Data Integration

Vince Iarusci | Business Intelligence, Data Integration, Pentaho

Overview

A good date dimension is essential for any data warehouse or reporting database. If you don’t have the right tools, you’ll often find yourself generating this date dimension in Excel, which can be tedious.

This reusable Pentaho Data Integration transformation lets you create a date dimension with a user-defined start date and the number of days.

Feel free to tweak to your heart’s content. The download link is at the bottom of the blog entry.

I’m aware that Pentaho does have a built-in date dimension function, but for this exercise, I wanted to show you the formulas in the event that you wanted to port it to another ETL tool.

Caveats

  • The first date of this date dimension starts in January 2005. You can change this date as indicated in the transformation instructions.
  • This transformation is for educational purposes only. It’s up to you to make sure it’s debugged if you’re thinking of putting it into a production system.
  • This transformation is based on a standard calendar year. It doesn’t factor in anything like lunar calendars, 4-5-4 calendars or fiscal years that don’t start on January 1.

How it Works

A quick view of the transformation is here:

There are 5 basic steps used in this transformation (a rough SQL equivalent is sketched after the list):

  1. Using a “Generate Rows” step, set the first day of the dimension and the number of rows to generate. In the sample, 5475 rows are created, which means there are 5475 days in the dimension, or 15 years.
  2. Next, use an “Add Sequence” step to create a counter for each row.
  3. Use a “Formula” step to create each date in the dimension, using the counter from the previous step.
  4. Create the rest of the date dimension elements using another “Formula” step.
  5. Finally, output the dimension table to an Excel file. You can change this to any other output type that you would like.
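
If you’d rather generate the same kind of table in SQL than port the formulas, the first four steps map onto a row generator plus derived columns. Here’s a rough PostgreSQL sketch (generate_series is PostgreSQL-specific, other databases have their own row-generation idioms, and 2005-01-01 simply stands in for whatever first date you pick); it covers only a handful of the elements listed in the next section:

-- Rough sketch of steps 1 to 4: generate 5475 days from an arbitrary first date,
-- then derive a few of the dimension elements from each date.
WITH days AS (
    SELECT d::date AS d
    FROM generate_series(DATE '2005-01-01',
                         DATE '2005-01-01' + 5474,
                         INTERVAL '1 day') AS g(d)
)
SELECT d                                       AS DimensionDate,
       CAST(TO_CHAR(d, 'YYYYMMDD') AS INTEGER) AS DateSid,
       EXTRACT(YEAR FROM d)                    AS Year,
       EXTRACT(QUARTER FROM d)                 AS Quarter,
       EXTRACT(MONTH FROM d)                   AS Month,
       EXTRACT(DAY FROM d)                     AS Day,
       TO_CHAR(d, 'FMMonth')                   AS MonthName,
       TO_CHAR(d, 'Dy')                        AS DayAbbreviation
FROM days;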

Date Dimension

In the date dimension, “In Dimension Counters” are created. These are basically running counts of the Year, Quarter, Month and Day in the dimension. I also like to include Month and Weekday names and abbreviations.

These added elements are useful because some database/reporting engines don’t have fully fleshed-out date functions, so the pre-computed elements let you use plain old SQL to do conversions and calculations.
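
For example, a monthly sales summary can lean entirely on the dimension columns instead of vendor date functions. The sales_fact table, and the idea of loading the transformation’s output into a date_dimension table, are assumptions for illustration:

-- Monthly totals with no vendor-specific date functions; sales_fact and
-- date_dimension are hypothetical tables loaded from the transformation's output.
SELECT dd.Year,
       dd.MonthName,
       SUM(f.sale_amount) AS total_sales
FROM sales_fact AS f
JOIN date_dimension AS dd
  ON dd.DateSid = f.order_date_sid
GROUP BY dd.Year, dd.MonthName, dd.MonthInDimension
ORDER BY dd.MonthInDimension;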

Elements
  • DimensionDate – The date, as a date type.
  • DateSid – The date, as an integer representation of YYYYMMDD.
  • Year – The year, as an integer.
  • Quarter – The quarter, as an integer.
  • Month – The month in the year, as an integer.
  • Day – The day in the month, as an integer.
  • DayInYear – The day in the year, as an integer between 1 and 366.
  • Weekday – The day of the week, starting on Sunday indexed at 1.
  • MonthName – The name of the month, in English.
  • MonthAbbreviation – Three letter abbreviation of the month, in English.
  • DayName – The weekday name, in English.
  • DayAbbreviation – Three letter abbreviation of the weekday, in English.
  • YearInDimension – The year counter for the entire dimension, starting at 1.
  • QuartersInDimension – The quarter counter for the entire dimension, starting at 1.
  • MonthInDimension – The month counter for the entire dimension, starting at 1.
  • DayInDimension – The day counter for the entire dimension, starting at 1.

Notable Formulas and Gotchas

Calculating Month/Weekday Names and Abbreviations:
Since the Pentaho Formula step doesn’t have month or weekday name date functions, a variant of the tip was used to generate these.

DateDifs and In Dimension Counters:
Pentaho’s DateDif function in the “Formula” step appeared to be a little buggy, so many of the “In Dimension” element calculations are a little more complicated than they needed to be.

Because the first year start date is not January 1 (which appears to be the cause of the bug), I had to create an exception for the first year counters. For example, the month in dimension calculation is as follows.

// calculations for the first year
if(
  year([FirstDate]) = year([DimensionDate]);
  datedif([FirstDate]; [DimensionDate]; "m") + 1;
  datedif([FirstDate]; date(year([FirstDate]); 12; 31); "m") + 1
)
+
// calculated from the second year forward
if(
  year([FirstDate]) = year([DimensionDate]);
  0;
  datedif(date(year([FirstDate]) + 1; 1; 1); [DimensionDate]; "m") + 1
)

If the DateDif worked properly (I was getting something like 21 months between 2016/10 and 2017/1 – what???) then I’d be able to drop the first part of the formula above.

In any case, the formula now only suffers from a little redundancy, but it can accommodate any date dimension that doesn’t start on New Year’s day.
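
If you end up porting this to a database, you can sidestep DateDif-style functions entirely with plain month arithmetic. A hedged sketch follows, reusing the hypothetical date_dimension table from earlier; note that it counts calendar months (the counter rolls over on the first of each month), which may differ slightly from DateDif’s completed-months behaviour:

-- MonthInDimension without a DATEDIF-style function: calendar months between the
-- first date in the dimension and the current row's date, plus one.
SELECT d.DimensionDate,
       (EXTRACT(YEAR FROM d.DimensionDate) - EXTRACT(YEAR FROM f.FirstDate)) * 12
       + (EXTRACT(MONTH FROM d.DimensionDate) - EXTRACT(MONTH FROM f.FirstDate)) + 1
           AS MonthInDimension
FROM date_dimension AS d
CROSS JOIN (SELECT MIN(DimensionDate) AS FirstDate FROM date_dimension) AS f;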

Transformation File

DateDimensionGenerator.ktr


Luna – WYSIWYG Data Processing

Steven Ng | Analytics, Coding

The Luna 1.0 Beta was just released and is available for download, and it sounds like a very interesting project.

The Luna team describes their product as WYSIWYG data processing, but you can also look at it as a visual data flow language. I haven’t had time to really dig into it yet, but I think data nerds might like it. It’s interoperable with other languages, including JavaScript.

It’s still an early-stage project, so manage your expectations accordingly. From what I’ve seen and heard, it doesn’t look like there’s any PostgreSQL support yet.