June 23, 2015

12 similarities between EasyMorph and QlikView loading script

Here are a few analogies that can help QlikView users become more familiar with data transformation concepts in EasyMorph:

In-memory data transformation

Both QlikView and EasyMorph don't require an external database to transform data. Instead, they rely on their own in-memory engines. In QlikView the data transformation part of the engine is single threaded. EasyMorph's engine is multi-threaded but less memory efficient. Both EasyMorph and QlikView use data compression.

Mixed value types in one column

Like QlikView, EasyMorph can mix text values and numbers in the same column, which makes it an excellent visual tool for parsing Excel spreadsheets. Although, type system in EasyMorph is a bit different.

Transformations in EM are similar to preceding loads in QV

In QlikView's preceding loads one load statement is based directly on the result of another load statement, which sometimes is convenient. Transformations in EasyMorph are somewhat similar to preceding loads in QlikView in that sense that a transformation directly uses the output of its previous transformation as input, without any intermediate (resident) tables. While in QlikView you can't do much in preceding loads -- only calculate new columns, and de-duplicate data (using DISTINCT), in EasyMorph you can use any of 25+ transformations (including filters, sorts, and even joins) in any order. Basically, all transformation process in EasyMorph is designed like super-powerful preceding loads.



Little things like the ability to replace existing column with an expression (instead of calculating new column, dropping the old one, and renaming) make life easier. And no annoying script errors because of a missed comma that cause aborting and reloading entire application.

Variables

Variables are ubiquitous in QlikView loading scripts and sometimes used up to the point where entire parts of script are replaced with a variable. The closest (although not exact!) analogue to variables in QlikView are project parameters in EasyMorph. Parameters are global. They don't change from transformation to transformation. Parameters can be used in transformations instead of file names and other properties, and also in expressions as constants.

Loops and iterations

Loading several files or processing a list of dates can be done using loops in QlikView, or iterations in EasyMorph. Although, the loops are imperative (as is the loading script syntax), while the iterations follow the functional programming style.

QVDs

No wonder QlikView can read/write QVD files. But so does EasyMorph. QlikSense QVDs are not supported yet (as of version 1.8) -- we're expecting Qlik to publish a respective API.

Calendars

Almost every QlikView / QlikSense application requires a calendar. If you're lucky to use Rob Wunderlich's QlikView Components then calendars might be easy for you. But for EasyMorph users it's even easier as it has built-in Calendar transformation that creates a calendar in a few clicks. See below an example of Calendar transformation, and its result.

Click to zoom

Mapping tables

Mapping table is a special temporary two-column table in QlikView. In EasyMorph any table can be a mapping table when it's used in Lookup transformation (equivalent to MAP USING... in QlikView). At this point (ver.1.8) in EasyMorph there is no analogy to ApplyMap function in QlikView.

Resident tables

In QlikView you can create new tables based on previously loaded, so called resident tables. In EasyMorph for the same purpose you can use derived tables. A derived table is like a dynamic copy (or view) of another table. It updates automatically if the original table changes. You can derive multiple tables from one table.

Derived tables in EasyMorph. Click to zoom.

Keep / Exists

Sometimes it's necessary to filter one table based on values in another table. In QlikView it can be done using either KEEP statement or WHERE clause with EXISTS() function. In EasyMorph the same can be achieved using Keep Matching or Keep Mismatching transformations.

Crosstable Load

Unpivot transformation in EasyMorph is the equivalent of CROSSTABLE LOAD in QlikView and can be used for transforming matrix tables into straight tables.

Command-line mode

Both QlikView and EasyMorph can be run in command line mode. Both applications allow defining variables (parameters) from command line. Here is EasyMorph's command-line syntax.

As you can see there are quite a few similarities. While in general QlikView loading script is more flexible (as you would expect from a programming language), designing transformations in EasyMorph is simpler, faster, and can be done by people without programming skills at all.

To learn more how EasyMorph works take a look at our illustrated PDF tutorials.

PS. One more similarity will appear in future releases -- the analogue of subroutines.

PPS. Totally forgot about resident/derived tables. Now it's 12 similarities :)

June 17, 2015

The world needs an open source columnar data file format

It's now common wisdom that storing data in columns is more convenient for analytical systems, than dealing with row-based data. Columnar databases like Vertica or VectorWise are getting widely adopted. Analytical in-memory engines are almost always columnar. QlikView stores data in QVD files in a columnar format, and I suspect that Tableau Data Extracts are columnar as well. EasyMorph is also running a columnar in-memory engine.

Usually, we see ETL processes that move data from row-based transactional systems into either row-based databases, or into columnar databases or analytical engines. In other words, the transfers are row-based to row-based, or row-based to columnar. But now, when columnar data stores become more and more widespread, the need to move data from one columnar system to another columnar system (i.e. columnar-columnar) will arise more and more often. And here is a problem -- if we do it in the usual way it's highly inefficient.

The reason it's inefficient is that columnar data is usually compressed. Therefore, passing a columnar dataset from one system to another system usually requires decompressing it entirely, than exporting it into some row-based format like CSV and then compressing again in the target system. Doesn't sound very efficient, does it?

To my knowledge (correct me if I'm wrong) at this point there is no any publicly available columnar data interchange format. For some reasons (which I don't really understand), all major vendors keep their columnar data formats proprietary, as if it was rocket science. I believe that if, for instance, Qlik made their QVD format open-source, then it could become a de-facto standard, simply because there is no any other common columnar data format. But as the vendors keep ignoring this opportunity, I believe there is a need in an open source format. Here is what it could be:

1. Columnar. Values are grouped into columns. A dataset can consist of one or more columns.

2. Extendable. Every column can be of one of predefined types, and these types encapsulate all logic required to read/write a column. Types can be:
  • Uncompressed
  • Vocabulary compression
  • RLE compression
  • Zipped
  • etc.
New versions of the format can include new column types, and preserve existing ones thus enabling backward compatibility.

3. Mixed/Strict typing. Some columns can have values of mixed types (e.g. text and numbers), some can be restricted to one type only.

4. Single-file/multi-file. Metadata and data can be in the same file (which is good for portability). Or metadata and columns (or parts of columns such as vocabularies) can be separate files (which can allow, for instance, sharing vocabularies between different columns and even tables).

5. Null-friendly. CSVs can't store nulls. This format should be able to support nulls natively.

I believe such format would make columnar-to-columnar data transfers more efficient. It won't eliminate decompressing/compressing entirely, but:
  • It would be possible to transfer only needed columns
  • In case where both source and target systems support the same column types (e.g. vocabulary compression) encoding/decoding can be either greatly optimized or avoided entirely
  • Metadata would be pre-calculated and available at low cost (e.g. number of rows, existence of nulls, etc.).
  • Columnar format makes parallel processing much simpler and more efficient

As the result -- data transfers can be 10x-100x times faster. One common format means we need not N x N data converters, but only N. Standardization would also foster a tool ecosystem around it -- viewers, profilers, converters, etc. Single documentation. Single API.

At this point it's just an idea. But if we at EasyMorph ever decide to make a persistent storage for our in-memory engine, I don't see any reasons why not make it open-source.

It's not rocket science after all.

UPDATE 17 Feb 2016
Two open-source projects aimed to standardize columnar data exchange - Apache Parquet (persistent store) and recently announced Apache Arrow (in-memory store).

June 14, 2015

EasyMorph as a visual functional programming language

If you're familiar with functional programming you might have noticed that EasyMorph is actually a functional programming language. Unlike general-purpose functional languages like Lisp or OCaml/F# it doesn't have writable code because it's visual. But it has many of the core features of functional languages:

Type system

EasyMorph's type system is rather primitive:
  • Dataset -- a set of columns that represents a table state before or after a transformation
  • Transformation properties -- a group of various transformation properties like file name, expression, boolean flag, etc.

    Immutable identifiers

    As in any other functional language, EasyMorph doesn't have variables. Instead it binds values to immutable identifiers. In our case identifiers can only be bound to datesets and project parameters which substitute some transformation properties.

    Functions

    Every transformation in EasyMorph is basically a function. It takes one or two datasets and/or a set of transformation properties as arguments, and returns one dataset. For instance:

    let A = Deduplicate B

    Since an EasyMorph project can have parameters, that makes entire project a function as well, where the parameters are the function arguments. In this case, the returned value is a dataset. While in programming languages a returned value is usually the last expression, in EasyMorph project the resulting table is simply marked with a flag, and its final dataset is returned to the calling project.

    Function pipelining

    Functions (transformations) can be pipelined (chained):

    let A = ImportText fileName
            |> Filter (fun x -> x.Year = 2015)
            |> Deduplicate
            |> Trim 10 topRows

    Iterate, map and fold

    Iterations in EasyMorph actually perform iterations, mapping and folding:

    When Iterate transformation is used in Iterate mode it's basically similar to Seq.iter or Seq.map in OCaml/F#

    When Iterate transformation is run in Iterate & Append mode it's somewhat similar to Seq.map |> Seq.fold as it creates multiple datasets, and then concatenates them into one dataset.

    Function values

    In functional languages functions can be passed as parameters. In EasyMorph one project can run another project and name of that other project can be defined by a parameter, which basically allows passing functions (projects) as parameters to other functions (projects).

    Resume

    Of course, this analogy is not very strict and it has many limitations. For instance there is no any equivalent of recursion which is a corner stone feature of functional languages. Nevertheless, it gives some interesting ideas on how a transformation process can be organized in EasyMorph. For instance, it should be convenient to compose transformation processes from smaller reusable projects (should we rather call them modules?). Also such functional approach makes developing and debugging easier, as it's possible to split the logic into small relatively isolated functions that don't have side effects.

    If you find this analogy interesting and would like to discuss it -- feel free to shoot me an email.

    UPDATE (8/30/2015)
    One more similarity appeared recently:

    Assert 

    Halt on condition transformation is basically the same as assert statement in some languages. It stops project execution if some condition is unfulfilled and throws a customized error message.