August 18, 2015

Transformational data analysis

Single step analytics
The vast majority of BI or Data Discovery tools follow the same approach to data visualization and analysis. I call it 1-step analytics, or single-pass analytics. At the foundation of this approach is a single logical data set consisting of one table or several linked tables. Sometimes it can span several data sources, but from a user's perspective it's still a single data set, or cube. To perform analysis, a user interacts with different parts of this data set. Visual objects of a report or dashboard are usually built on top of a subset of this data model, or microcube. In other words, under every chart or table in a report or dashboard there is a microcube. In the standard case, every microcube is obtained by filtering or aggregating data from the main cube, usually on the fly. Some BI tools allow users to freely change measures and dimensions of microcubes on the fly (query & analysis tools like BusinessObjects or Tableau). Sometimes microcube metadata is fixed at design time and users can only filter the data in it (Qlik). But in any case these microcubes are derived from the main cube. So there is only one step (pass) -- derive a microcube from the main cube.

Single-pass analytics

To derive a microcube, 4 operations are typically used:
  • Select columns -- take only certain columns from the main cube
  • Filter -- keep only rows that satisfy some rule or condition
  • Aggregate -- calculate aggregates like sum or count, grouped by some dimensions
  • Calculate new columns -- calculate new measures or dimensions using a formula
When users perform analysis, they actually make their BI tool recalculate microcubes on the fly with different parameters. For instance, when you drill down from year to month, you tell your BI tool to select different columns (add month), apply a different filter (fix the year), and aggregate by a different dimension (month instead of year). The power of BI is in the fact that all these operations are performed automatically, with a single click. Whatever you, as a user, do -- your BI tool automatically performs these 4 operations under the hood.
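
As an aside, here is a minimal sketch of what those 4 operations look like when written out by hand. It's in Python with pandas (not something from the original post; the sales table and its column names are made up purely for illustration):

import pandas as pd

# Hypothetical sales table; in a BI tool this would be the main cube.
sales = pd.DataFrame({
    "year":    [2014, 2014, 2015, 2015, 2015],
    "month":   [11, 12, 1, 1, 2],
    "product": ["A", "B", "A", "B", "A"],
    "amount":  [100.0, 150.0, 120.0, 80.0, 90.0],
    "cost":    [60.0, 90.0, 70.0, 50.0, 55.0],
})

# 1. Select columns -- take only the columns the chart needs
cube = sales[["year", "month", "amount", "cost"]]

# 2. Filter -- after drilling down from year to month, fix the year
cube = cube[cube["year"] == 2015]

# 3. Aggregate -- group by month instead of year
microcube = cube.groupby("month", as_index=False)[["amount", "cost"]].sum()

# 4. Calculate new columns -- derive a new measure with a formula
microcube["margin"] = (microcube["amount"] - microcube["cost"]) / microcube["amount"]

print(microcube)

A BI tool does exactly this kind of recalculation for you on every click -- that's precisely its value.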

So far, so good. Is there anything wrong with this?

Multi-step analytics

For many users and cases, single-step analytics is totally sufficient. Remember the times when all users had was static reports? We could probably have called that zero-step analytics :) So the current state of affairs is a great improvement over those times.

But for many cases (and users) it's not enough. In real life there are many business rules and types of analysis that cannot be covered with a single pass. They need two or more passes, in which microcubes are derived from other microcubes and not only from the main cube. For instance, various kinds of customer behavior analysis require analyzing customer performance over a period of time (one pass), then grouping customers based on this performance (another pass), and finally obtaining the performance of the groups over the same or a different time period (a third pass). Example: for a retail chain it's important to see the growth of its customer base and not just the growth of sales. For this, the organization might analyze how many new customers each month became permanent buyers. But in order to understand whether a customer who joined in, say, March is permanent, it's necessary to analyze his/her purchases from April to, say, October. Only then can we conclude that, for example, 3% of the new customers who joined in March became permanent. Then calculate the same for all months. And then let the user experiment -- for instance, modify the rule by which we consider a customer permanent, change it from "1 purchase over 3 months" to "2 purchases over 6 months", and see what happens.
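
To make the three passes concrete, here is a minimal sketch in Python with pandas (again, not from the original post; the purchases table and its column names are assumptions made purely for illustration):

import pandas as pd

# Hypothetical purchases table: one row per purchase.
purchases = pd.DataFrame({
    "customer": ["c1", "c1", "c1", "c2", "c3", "c3"],
    "date": pd.to_datetime(["2015-03-01", "2015-04-15", "2015-07-02",
                            "2015-03-20", "2015-04-05", "2015-08-19"]),
})

# The "permanence" rule is a parameter, so it can be changed from
# "1 purchase over 3 months" to "2 purchases over 6 months" and re-run.
MIN_PURCHASES = 1
WINDOW_MONTHS = 3

# Pass 1: per-customer performance over time -- find the joining month
# and count purchases made within the window after joining.
purchases["month"] = purchases["date"].dt.year * 12 + purchases["date"].dt.month
joined = purchases.groupby("customer")["month"].min().rename("joined")
cohort = purchases.groupby("customer")["date"].min().dt.to_period("M").rename("cohort")
p = purchases.join(joined, on="customer")
follow_up = p[(p["month"] > p["joined"]) & (p["month"] <= p["joined"] + WINDOW_MONTHS)]
repeat_counts = follow_up.groupby("customer").size()

# Pass 2: group customers -- is each one "permanent" under the current rule?
customers = pd.concat([joined, cohort], axis=1).reset_index()
customers["permanent"] = customers["customer"].map(repeat_counts).fillna(0) >= MIN_PURCHASES

# Pass 3: performance of the groups -- share of permanent customers per joining month
result = customers.groupby("cohort")["permanent"].mean().mul(100).reset_index(name="pct_permanent")
print(result)

Note how each pass consumes the output of the previous one -- something a single-pass tool simply has no place for.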

Multi-step analytics

The number of cases where users need ad hoc multi-pass analytics will only grow: as organizations learn more about their customers, they slowly move from basic metrics like the number of orders or total revenue to more sophisticated performance indicators that monitor the key drivers of their business.

You might argue that the type of task I described above is not unheard of and is typically solved by means of ETL. That's true. The common wisdom among BI developers is to pre-calculate everything that cannot be calculated on the fly and move it into ETL. But this doesn't really work, because ETL is static. Well, it's not static for a developer, but for a user it is static. The reason it's static for the user is that designing ETL requires technical skills, and users don't have them (and shouldn't have to). Do you know how users get around this problem? They do ad hoc ETL in Excel. Yes, it's horribly inconvenient, very error-prone, and requires a lot of tedious manual work, but what else can they do? They simply don't have a choice. At least Excel is something they are comfortable with.

ETL today is what static reporting was before BI started to prosper. I believe this will change with transformational data analysis.

Transformational data analysis

We're used to the fact that ETL and BI are different beasts. It's like a dogma. But what if the borderline between BI and ETL is somewhat artificial and shouldn't be that distinct? I'm not talking about low-level ETL that, for instance, integrates legacy systems. I'm talking about those parts of the business logic that are moved into ETL simply because current BI tools can't handle them.

So, I'm proposing a new term -- "transformational data analysis". It's a way to analyze data through ad hoc data manipulation and multi-step analytics built as a series of transformations. It effectively eliminates the difference between BI and ETL (in its business-logic part).
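
To give a rough idea of what that means in practice, here is a toy sketch in Python (this is not EasyMorph's API or anyone else's -- just an illustration of the concept): the analysis is a named series of transformations that a user can re-run with different parameters instead of asking a developer to change an ETL job.

from typing import Callable, Dict, List
import pandas as pd

# A transformation takes a table and user-editable parameters, and returns a table.
Transformation = Callable[[pd.DataFrame, Dict], pd.DataFrame]

def run_pipeline(df: pd.DataFrame, steps: List[Transformation], params: Dict) -> pd.DataFrame:
    """Apply each transformation in order."""
    for step in steps:
        df = step(df, params)
    return df

# The business logic lives in the pipeline itself, not in a separate ETL job.
steps: List[Transformation] = [
    lambda df, p: df[df["year"] == p["year"]],                              # filter
    lambda df, p: df.groupby("customer", as_index=False)["amount"].sum(),   # aggregate
    lambda df, p: df.assign(vip=df["amount"] >= p["vip_threshold"]),        # derive
]

orders = pd.DataFrame({
    "year": [2015, 2015, 2014, 2015],
    "customer": ["c1", "c2", "c1", "c2"],
    "amount": [120.0, 40.0, 300.0, 80.0],
})

# The same series of transformations, re-run with different parameters -- no redesign needed.
print(run_pipeline(orders, steps, {"year": 2015, "vip_threshold": 100}))
print(run_pipeline(orders, steps, {"year": 2015, "vip_threshold": 150}))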


EasyMorph (http://easymorph.com), the tool I've been working on since 2013, follows the concept of transformational data analysis. It allows non-technical users to build multi-step analytics using more than 20 different transformations (not just 4), and then re-run it with different parameters and modify it on the fly. At this point EasyMorph doesn't look like a traditional BI tool, in the sense that it doesn't have interactive charts or pivot tables (yet!), but we're getting there.

PS.

Can Qlik do multi-pass analytics?
QlikView (and Qlik Sense) is somewhat different from most BI tools (including Tableau) in the sense that it allows going slightly beyond the 1-step limit. The set analysis feature allows building rather complex filtering conditions that may include the element functions ( P() and E() ), which can be considered 2-step logic. Also, with the help of the aggr() function you can add an extra aggregation step. But there are plenty of cases where they can't help. And even when they can, designing expressions with set analysis and aggr() is a huge hassle (I did it for years) and can hardly be considered user-friendly or suitable for an ad hoc workflow. To be fair, we can say that Qlik allows 1.5-step analytics.

UPDATE 16/3/2016
Starting from version 2.8, EasyMorph offers dynamic drag-and-drop charts for interactive data visualization and more than 50 transformations.

UPDATE 9/8/2015
Interesting comment from Christophe Vogne:

Hi Dmitry,

Reading your article again, I understand better what you mean.
From my perspective, you are highlighting the fact that BI is moving from multidimensional databases to graph databases.
Multidimensional databases need to be strongly structured, and this design breaks the hidden links between pieces of information if they haven't been considered.
With the Kimball/Inmon BI method, there's always somebody who decides for you what you are supposed to discover: the data warehouse team and their cascading cubes, data marts...
QlikView is 1.5 because it's a tabular database (the field values storage) on top of a graph database (the pointers layer that makes the links between values). So some lost associations can be rebuilt by users.

So one aspect of the BI future is graph databases, because users need to link any information in any context at any time.
But aggregating values on a graph database is a tough challenge (the Set Analysis), and managing multiple edges between vertices (the synthetic keys) is even tougher.

Interesting article indeed