November 9, 2014

Meet EasyMorph

A few days ago EasyMorph, my new data transformation tool, went out from stealth mode and now Technical Preview is publicly available for downloading. It's time to tell a bit more about what this is.

EasyMorph is the result of an attempt to rethink data transformation from scratch which gained momentum after my post in May 2013. There is a stereotypical understanding of what an ETL should be and it's inherited from traditional heavy-weight ultra-complex monsters like Informatica or Ab Initio. The thing is that this stereotypical approach poorly works for people without technical background -- those who we used to call business users. They don't know anything about join types and honestly they don't have to -- their heads are busy with business problems. At the same time the need to manipulate data becomes even more urgent as data discovery tools have become more popular and analytic systems become increasingly more often owned by business departments rather than IT staff.

Excel, which is typically used for data transformations by business folks, has its own drawbacks that you're perfectly aware of:

1) Excel is poorly suited for dealing with tables. Let me put it straight -- there are no tables in Excel. Tables are actually emulated with groups of cells. From Excel's standpoint there is no difference between column header, a column value or any random cell. Even worse -- Excel has no idea where a column starts and where it ends, which cells belong to a table and which don't. No wonder such emulation is prone to errors caused for instance by inaccurate stretching a formula over a range of cells.

2) Debugging and maintaining table operations in Excel is a torture. Since table as an abstraction is missing in Excel it's hard to address table columns simply by names. You won't see something like Price = Amount * Qty in Excel when it comes to table operations. When you deal with an Excel formula you have to constantly decode what every reference actually means (e.g. column AT is Qty and column BC is Price). While formulas is a rather questionable way of defining a calculation (but that's another story) it would only be a half of the problem. Things get really bad when formulas are applied from within a Visual Basic macro (which is far not a rare situatuion) -- in this case you might not even be able to see the actual column, since it could be created or modified somewhere else earlier in the macro. So you have to decode the formula while keeping in mind program logic of the macro! Something tells me that this is not something that people enjoy doing.

3) Excel is barely capable of multi-step transformations. Again, since the concept of table is missing in Excel, there are no table operations except maybe sorting and deduplication. At the same time sequentially applied transformations is the most natural way of transforming tabular data. This is what business users are trying to emulate in Excel. Just listen how they talk about it -- "Okay, Bob, what do we do with this data? Bob: First, get rid of all transactions with missing product names -- they're synthetic and we don't need them. Then for all the transactions with missing prices use the prices from the last year file. Once you get all prices in place -- calculate amounts. Don't forget to convert non-USD amounts to USD." and so on.

Inconvenience of traditional ETL tools and the drawbacks of Excel led to several cornerstone concepts behind EasyMorph:

1) Visual design. No SQL queries, no scripting or programming. All transformations are set up by only pointing-n-clicking and drag-n-dropping.

2) Everything should be easily debuggable. Result of any transformation should be just one click away. Result of any expression should be easily tracked down to actual values (not done yet).

3) Readability and  and maintainability. Business users should clearly see how this or that value is obtained. It shouldn't require decoding. It should be explained visually and automatically documented in plain English (not done yet).

4) Fit technology to business users and not vice verse. If users don't understand SQL -- get rid of SQL. If users don't understand various types of joins -- don't stick to traditional joins, find a simple way how to represent the same operation.

The screenshot below shows sample description of transformations in EasyMorph.
Another interesting observation about multi-step transformations I made when I wrote loading scripts in QlikView. As a QlikView developer I wrote possibly tens of thousands lines of QlikView script. The more I wrote the more I noticed that all my scripts tend to have the same structure (a good topic for another blog post, btw). In particular, I noticed that it's natural to split loading script into rather isolated logical blocks (usually one block is placed in one tab) where each block generates a table with meaningful result. Temporary tables and fields used for producing this table are created and dropped withing the block. These resulting tables are not necessarily final. What's important about it that they encapsulate some logic, making it more or less isolated from other parts of script. Putting it differently, for other parts of the script it's not important how exactly I produced this resulting table, they just consume the result.


This observation led to the idea of tables as transformation groups in EasyMorph which would represent such logical block. Effectively, every table in EasyMorph is a result of a group of transformations. In the picture above you can see that table "Census data" is result of a group consisting of 8 transformations. Clicking on any transformation would display intermediate output produced by this transformation (output of the 7th transformation is shown above).

Another powerful abstraction in EasyMorph is derived table. Derived table allows re-using already calculated tables. Producing a derived table is the same as loading everything from a resident table in QlikView. Apparently you can produce as many derived tables as necessary. A derived table is just a regular table so it can be transformed as well.

In the screenshot above you can see two derived tables produced from "Census data". Each of derived tables is further transformed in its own way.

With the help of derived tables you can program rather complex logic. For instance you can split rows from the originating table into three groups based on some criteria, transform every group in its own way and then append all three groups back into one table. Another example - you can perform self-joins using derived tables.

By the way, if you're familiar with functional programming, it could strike you that transformations in EasyMorph resemble function pipelining, and derived tables is the same as variable binding. To make it even more interesting, I can tell you that it is possible to introduce closures and iterations in EasyMorph which would effectively make it a functional machine.

A few technical facts about first version of EasyMorph:
  • It's a 64-bit Windows desktop application with its own in-memory data processing engine.
  • It doesn't require external database.
  • Its first version comfortably deals with datasets up to 3-5mln rows.
  • No data compression yet.
  • 15 transformations that can run in parallel utilizing several CPU cores.
  • Data sources include delimited text and QVD files. Support for Excel files is expected in 2-3 weeks.
  • Expression syntax with 40+ functions.
  • It can run from command line
Technical Preview is available at http://easymorph.com.

UPDATE
EasyMorph version 1.0 was released in February 2015 and is now actively developing. See http://easymorph.com.