January 30, 2016

The long tail of the information explosion

You have probably heard a lot about Big Data and everything related to it. However, Big Data is only one side of the explosive growth of digital information that we have been observing. The other side is often overlooked, yet it can have no less disruptive an influence on the traditional BI/DWH landscape than Big Data.

The information explosion (a lame term, but I'll stick with it in this post for simplicity) is usually associated with the rapidly growing size of data sets (transactional and semi-structured), up to the point where organizing and querying them with traditional technologies becomes very inefficient or too costly.

However, the other side of the story is the number of data sets. Let me illustrate it:

[chart]

Not only are data sets growing in volume -- they are also growing in number. New data sets appear at an exponential rate. For every new system with a large data volume there are tens of small data sets in spreadsheets, text files, web pages and whatnot. That's why I call it The Long Tail -- myriads of small data sources, many of which are human-generated rather than machine-generated. What used to be a single number somewhere in an email is becoming a list of numbers. What used to be a list is becoming a table. What used to be a table is becoming a data mart. These new data sets are relatively small, but there are lots of them, and this creates a number of challenges:

First, building a single data warehouse is becoming less and less relevant, because by the time you have designed a data model and ETL processes to load a new data set into a data warehouse, two more data sets have appeared, and your data warehouse is obsolete and incomplete again. By the time you load these two, there will be another four. Therefore, responsibility for data transformation should increasingly be handed to business users (hint: EasyMorph can help with it).

Second, the complexity of traditional ETL tools is becoming an obstacle, because they were designed for a relatively small number of transformation steps on large volumes of data, on the assumption that most business logic would be handled by the source systems. However, the growing number of new data sources requires designing exponentially more transformations in less time, with an increasing share of embedded business logic.

Third, most BI tools by design require a single data model, usually a snowflake-type one. Users are not allowed to change the data model; they can only work with what was essentially hardcoded by developers. While some tools allow users to merge in new data sources, this capability is usually very limited, as it doesn't allow any data transformation before or after the merge -- only basic manual data preparation. So no business logic can be applied on the fly. I wrote about this problem in more detail in "Transformational Data Analysis".

Fourth, traditional BI/ETL tools are poorly suited to dealing with Excel spreadsheets. Things like multi-line table headers, inconsistent sheet names, and a variable number of sheets often render these tools useless. For those who have fought "spreadsheet hell" I have bad news -- the "spreadsheet era" is not over, despite all predictions. Instead, there will be more spreadsheets than ever, simply because there is nothing else that lets non-technical users easily create and maintain small data sets. It's the most convenient and popular way so far. Microsoft may rejoice.
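To make the spreadsheet problem concrete, here is a minimal sketch in Python with pandas of the kind of handling these files need: a variable number of sheets with inconsistent names, plus a multi-line header that has to be flattened before the sheets can be combined. In real use the sheets would come from pd.read_excel(path, sheet_name=None, header=[0, 1]), which returns a dict of DataFrames keyed by sheet name; here the dict is built in memory so the sketch is self-contained. The sheet names and column labels are invented for illustration.

```python
import pandas as pd

# Simulate what pd.read_excel("report.xlsx", sheet_name=None, header=[0, 1])
# returns: a dict of DataFrames, one per sheet, whatever the sheets are
# called and however many there are. Each sheet has a two-row header,
# represented by pandas as a MultiIndex over columns.
sheets = {
    "2015 Q4": pd.DataFrame(
        [[100, 120], [90, 130]],
        columns=pd.MultiIndex.from_tuples(
            [("Sales", "Plan"), ("Sales", "Actual")]
        ),
    ),
    "Final (v2)": pd.DataFrame(  # inconsistent sheet naming is typical
        [[110, 140]],
        columns=pd.MultiIndex.from_tuples(
            [("Sales", "Plan"), ("Sales", "Actual")]
        ),
    ),
}

frames = []
for name, df in sheets.items():
    flat = df.copy()
    # Collapse the multi-line header into single column names,
    # e.g. ("Sales", "Plan") -> "Sales Plan".
    flat.columns = [" ".join(col).strip() for col in flat.columns]
    flat["sheet"] = name  # keep the sheet name as ordinary data
    frames.append(flat)

combined = pd.concat(frames, ignore_index=True)
print(combined.shape)  # (3, 3)
```

This is exactly the sort of boilerplate that a business user cannot be expected to write, which is why tools that absorb it (rather than assuming clean tabular input) matter for the long tail.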


The same chart using a linear scale instead of a logarithmic one. This explains the name "Long Tail".
[chart]