June 13, 2016

EasyMorph 3.0: A combine for your data kitchen

EasyMorph v3.0 is about to be released, and its beta version is already available for download. As the tool matures, its product concept solidifies. Version 3.0 is a major milestone in this regard, because the long process of product repositioning that started last year is now complete, and a long-term vision has been formed. In this post I would like to explain a bit more about what EasyMorph has morphed into (pun intended).

To put it simply, EasyMorph has become a "combine for the data kitchen" (if you've never heard about the data kitchen concept, check out this post). The analogy with a kitchen combine is not a coincidence -- just like real-life kitchen combines, EasyMorph has several distinct functions that all use the same engine. In our case it's a reactive transformation-based calculation engine with four major functions built on top of it:
  • Data transformation
  • Data profiling and analysis
  • File conversion and mass operations with files
  • Reporting

Data transformation
This is probably the most obvious function, as people usually know EasyMorph as an ETL tool. In this role everything is more or less typical -- tabular data from databases and files is transformed using a set of configurable transformations. What's less typical is support for numbers and text in one column, non-relational transformations (e.g. creating column names from the first N rows, or filling down empty cells), and the concept of iterations inspired by functional programming.
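For readers who think in code, the two non-relational transformations mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions (a table is a list of rows, an empty cell is None or an empty string); it is not how EasyMorph implements them, and the function names fill_down and promote_header_rows are hypothetical.

```python
def fill_down(column):
    """Replace empty cells (None or "") with the closest non-empty value above."""
    filled, last = [], None
    for cell in column:
        if cell not in (None, ""):
            last = cell
        filled.append(last)
    return filled


def promote_header_rows(rows, n):
    """Build column names by joining the first n rows; return (names, data)."""
    header_rows, data = rows[:n], rows[n:]
    names = [
        " ".join(str(part) for part in parts if part not in (None, ""))
        for parts in zip(*header_rows)
    ]
    return names, data


# A small table with a two-row header and numbers and text mixed freely.
rows = [
    ["Sales", "Sales", "Region"],
    ["2015", "2016", ""],
    [100, 120, "North"],
    [90, 95, ""],
]
names, data = promote_header_rows(rows, 2)
region = fill_down([row[2] for row in data])
```

Neither operation has a natural counterpart in relational algebra, because both depend on the physical order of rows.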

Data profiling and analysis
This function is usually less obvious, because typically data profiling tools don't do transformations, but rather show some hard-coded statistics on data -- counts, uniqueness, distribution histograms, etc. QViewer is a typical example of such a tool.

Data profiling with EasyMorph is different, because instead of using a fixed set of pre-defined, hard-coded metrics you can calculate such metrics on the fly and visualize them using drag-and-drop charts. While this approach sacrifices some simplicity (you might need, say, 3 clicks instead of 1 to calculate a metric), it enables much broader analysis and more precise selection of data subsets, which gives far more flexibility than typical data profiling tools.
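To make the difference concrete, here is a rough Python sketch of what "calculating a metric on the fly" means: the metric is an ordinary computation applied to whatever subset you have just selected, rather than a fixed statistic over a whole column. The profile function and the sample orders data are made up for illustration and say nothing about EasyMorph's internals.

```python
from collections import Counter


def profile(values):
    """Return basic profiling metrics for one column of values."""
    non_empty = [v for v in values if v not in (None, "")]
    counts = Counter(non_empty)
    return {
        "rows": len(values),
        "empty": len(values) - len(non_empty),
        "distinct": len(counts),
        # Values that occur exactly once in the column.
        "unique": sum(1 for c in counts.values() if c == 1),
        "top": counts.most_common(3),
    }


orders = [
    {"country": "US", "amount": 120},
    {"country": "DE", "amount": 80},
    {"country": "US", "amount": 15},
    {"country": "", "amount": 300},
]

# Ad hoc metric on a precisely selected subset: countries of orders over 50.
large = [o["country"] for o in orders if o["amount"] > 50]
stats = profile(large)
```

Changing the filter condition or the profiled field is a one-line change, which is the kind of flexibility a fixed set of metrics cannot offer.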

I can say that in my work I use EasyMorph for data analysis and profiling much more often than for ETL, simply because new data transformations need to be designed only once in a while (then they're just scheduled), but I do data analysis every day.

File conversion and mass operations with files
While file conversion is a rather obvious function (read a file in one format, write it in another), the ability to conveniently perform mass manipulations with files (copying, renaming, archiving, uploading, etc.) is a surprising and underestimated function of EasyMorph. Really, who would expect that what is supposedly an ETL tool can be used for things like that? But since EasyMorph is a "data kitchen combine" rather than a typical ETL tool, this is exactly what it can be used for.

Since creating a list of files in a folder is as simple as dragging the folder into EasyMorph (recursive subfolder scanning is supported from version 3.0), you can get a list of, say, 40,000 files in 1,000 folders in literally 5 seconds, sorted by size, creation time, folder and whatnot. Finding the biggest file? Just two clicks. Filter only spreadsheets? One more click. Exclude read-only? Another click.
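For comparison, this is roughly what the same file listing looks like when written by hand in Python. It is a minimal sketch, not a description of what EasyMorph does internally: the names list_files, biggest and writable_spreadsheets are hypothetical, modification time is used because creation time is not reported consistently across operating systems, and "read-only" is approximated by the owner-write permission bit.

```python
import stat
from pathlib import Path


def list_files(root):
    """Recursively list files under root with a few attributes per file."""
    records = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            info = path.stat()
            records.append({
                "path": path,
                "folder": path.parent,
                "extension": path.suffix.lower(),
                "size": info.st_size,
                "modified": info.st_mtime,
                # Approximation: no write permission for the owner.
                "read_only": not (info.st_mode & stat.S_IWUSR),
            })
    return records


def biggest(records):
    """Return the record of the largest file."""
    return max(records, key=lambda r: r["size"])


def writable_spreadsheets(records):
    """Keep only spreadsheets that are not read-only."""
    return [
        r for r in records
        if r["extension"] in {".xls", ".xlsx"} and not r["read_only"]
    ]
```

Each helper corresponds to one of the clicks described above; sorting by any attribute is a call to sorted(records, key=...).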

Now add the capability of running an external application (or a Windows shell command) for each file (using iterations) with the command line composed using a formula, and you get a perfect replacement for batch scripts to do mass renaming, copying, archiving, sending e-mails or anything else that can be done from the command line. And, just like batch scripts, EasyMorph projects themselves can be executed from the command line, so they can be triggered by a 3rd party application (e.g. a scheduler).
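The pattern described here (compose a command line per file, then run it) can be sketched in Python as follows. This is an assumption-laden illustration rather than EasyMorph's mechanism: build_command and run_for_each are hypothetical names, and the external program is simply the Python interpreter making a backup copy, chosen so that the sketch runs on any operating system.

```python
import subprocess
import sys
from pathlib import Path


def build_command(path):
    """Compose the command line for one file (the "formula" step)."""
    source = Path(path)
    target = source.with_name(source.stem + "_backup" + source.suffix)
    # The external program here is the Python interpreter copying a file.
    script = "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])"
    return [sys.executable, "-c", script, str(source), str(target)]


def run_for_each(paths):
    """Run the composed command once per file and collect the exit codes."""
    return [
        subprocess.run(build_command(p), check=False).returncode
        for p in paths
    ]
```

Passing the command as a list of arguments rather than a single string avoids quoting problems with file names that contain spaces.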

EasyMorph has a number of workflow transformations that don't actually transform anything but perform various actions, like launching an external program or even pausing for N seconds. Therefore, it's basically a visual workflow design tool with the capability of designing parallelized (e.g. parallel mass file conversion) and asynchronous processes.

Reporting
PDF reporting is the headline feature of version 3.0 and a new function of EasyMorph. The idea behind it was simple: sometimes a result of data analysis has to be shared, and PDF is the most universally used format for sharing documents. At this point, PDF reporting in EasyMorph is not meant to be pixel-perfect, and its customization capabilities are rather limited. Instead, the emphasis is on creating reports quickly. We're testing the waters with this release, and the direction of future development will depend on feedback received from users.



Summary
While it might not count as a distinct function, sometimes it's convenient to keep EasyMorph open just for ad hoc calculations, e.g. to paste a list and find duplicates in it, or even to dynamically generate parts of some script -- e.g. comma-separated lists of fields or values. As a "data kitchen combine", EasyMorph is not an application for a broad audience, but rather a professional tool for data analysts who work with data every day. And as with real-life kitchen combines, some people use one function more often, some another. Pick yours.