September 20, 2015

The Data Warehouse Myth

We all know the concept of data warehousing: take data from various sources, cleanse it, transform it, load it into a relational database, and make it uniform and accessible to SQL-based BI tools in order to create a single, holistic view of an organization's performance. Thousands of project teams try to build this every year. Thousands of consulting companies offer this service. I've been doing this for more than 12 years.

Now I realize it's a myth.

It's never possible to bring everything needed into a data warehouse. If your business is not big, you can probably bring together 90% of your data; some 10% will always remain somewhere outside. There are many reasons why important data stays outside: it may simply not have existed when you started building your data warehouse, or business requirements changed while you were building it, or one of your source systems was replaced with another, or a new business line opened, and so on. This makes data warehouses expensive, never-ending projects that never reach their goal of becoming a single, holistic data source. If your organization is large, your data warehouse might not cover even 40% of your organization's data analysis needs.

Having expensive long-term projects that never reach their goals isn't very reasonable, is it? But why is this happening?

First, the assumption that "if it works for some of our data then it should work for all our data" turns out to be wrong. Indeed, once you've successfully built a small analytical database that covers part of the company's business (which is totally doable), it's very tempting to start thinking "now let's buy a bigger server and put everything into this thing". But what works locally doesn't always work globally. A data warehouse is a tightly coupled system. The more you try to extend tight connections, the more effort and money it requires, and the cost grows exponentially. At some point it's no longer reasonable to keep everything in a tightly coupled system. This happens earlier than many expect.

Second, the reliance of existing BI tools on SQL doesn't leave much choice to information system architects. Popular BI systems like BusinessObjects, Cognos or Tableau can do only one type of querying well -- generate an SQL SELECT query with aggregation and filters (WHERE/HAVING conditions) and send it to a relational database. I call this single-step analytics (see "Transformational data analysis"). That's the only type of querying users can do with these tools. There are no good, user-friendly querying tools that step outside this concept. The entire BI industry was tailored to the myth of data warehousing. Now it has become a force that compels architects to follow the myth.
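To make "single-step analytics" concrete, here is a minimal sketch of the one and only query shape such tools produce: a single SELECT with filters and aggregation, sent to a relational database. The sales table and its columns are invented for illustration.

```python
import sqlite3

# Hypothetical sales table, standing in for a data warehouse fact table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("East", "A", 120.0), ("East", "B", 80.0), ("West", "A", 200.0)])

# The single query shape a SQL-based BI tool generates:
# SELECT with WHERE filters, GROUP BY aggregation, and a HAVING condition.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE product = 'A'
    GROUP BY region
    HAVING SUM(amount) > 100
"""
for row in con.execute(query):
    print(row)  # ('East', 120.0), ('West', 200.0)
```

Anything beyond this shape -- for instance, reshaping the result and feeding it into a second query against a different data source -- falls outside what these tools can express.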

OK, how do we escape it? Here is my hypothesis:

If something big can't work as a tightly coupled system -- make it loose. Instead of forcing everything into strict global order, create order locally. Build small databases for subject areas (e.g. general ledger or customer loyalty programs) where needed, create collections of uniform files (without loading them into a database), pre-aggregate data when necessary, etc. Make sure data is kept in order locally. Yes, this is very much like those "silos" that data warehousing doctrine tells us to avoid. Except that these silos are ordered, explorable, and ready for querying.
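As a minimal sketch of what "order locally" can mean in practice (the file names, column layout, and use of pandas are my assumptions, not a prescription), here is a script that turns raw exports of a hypothetical loyalty program into a uniform, pre-aggregated collection of files without loading them into any database:

```python
import glob
import pandas as pd

# Hypothetical raw exports: one CSV per month with known but messy columns.
COLUMNS = ["date", "customer_id", "points"]  # the uniform local schema

frames = []
for path in glob.glob("loyalty_raw/*.csv"):
    df = pd.read_csv(path)
    df.columns = [c.strip().lower() for c in df.columns]  # normalize headers
    frames.append(df[COLUMNS])

# One ordered, explorable collection of files -- a "silo" kept in local order.
uniform = pd.concat(frames, ignore_index=True)
uniform.to_csv("loyalty_silo/transactions.csv", index=False)

# Pre-aggregate where it helps: points per customer, ready for querying.
summary = uniform.groupby("customer_id", as_index=False)["points"].sum()
summary.to_csv("loyalty_silo/points_by_customer.csv", index=False)
```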

The second part of the hypothesis is more difficult, as it requires something that doesn't exist yet -- user-friendly, visual, non-SQL tools that can query, combine in multiple steps, and visualize data from various types of "silos". At small scale this can be done with EasyMorph, but for large amounts of data such tools would need to work in a distributed environment.
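No such distributed tool exists to point at, but the multi-step idea itself is easy to sketch at small scale. The following is a scripted analogue (again pandas, with hypothetical silos: the loyalty files from above plus a small general-ledger SQLite database), not anyone's product:

```python
import sqlite3
import pandas as pd

# Step 1: query one silo -- a small subject-area general-ledger database.
con = sqlite3.connect("ledger/gl.sqlite")  # hypothetical local database
revenue = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS revenue "
    "FROM gl_entries GROUP BY customer_id", con)

# Step 2: read another silo -- a pre-aggregated file, no database involved.
points = pd.read_csv("loyalty_silo/points_by_customer.csv")

# Step 3: combine the two silos and derive a new measure.
combined = revenue.merge(points, on="customer_id", how="left")
combined["revenue_per_point"] = combined["revenue"] / combined["points"]

# Step 4: analyze the combined result -- a second analytical pass that a
# single SELECT sent to one database could never express.
print(combined.sort_values("revenue_per_point", ascending=False).head(10))
```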

Finally, embrace the concept of the "data kitchen". It's not possible to cover all the data analysis needs of a big user group with a single BI system. The idea of BI standardization, which was popular 5-7 years ago, has failed. Having a single tool will never be enough. Instead, users should be offered a choice of tools on demand for various kinds of tasks. Think of the "spoon", "fork" and "knife" of data analysis.

PS. Some readers may mention Apache Hadoop. While it looks like a step in the right direction (a loosely coupled system), there is still a big issue with tools. For instance, the Pig scripting language is powerful, but the whole idea of writing scripts is anything but user-friendly. And no, putting Tableau on top of Hadoop won't do the trick, since Tableau (or any other SQL-based tool) is a tool from the Data Warehouse Myth.