June 17, 2015

The world needs an open source columnar data file format

It's now common wisdom that storing data in columns is more convenient for analytical systems than dealing with row-based data. Columnar databases like Vertica or VectorWise are being widely adopted. Analytical in-memory engines are almost always columnar. QlikView stores data in QVD files in a columnar format, and I suspect that Tableau Data Extracts are columnar as well. EasyMorph also runs a columnar in-memory engine.

Usually, we see ETL processes that move data from row-based transactional systems into either row-based databases, or into columnar databases and analytical engines. In other words, the transfers are row-based to row-based, or row-based to columnar. But now, as columnar data stores become more and more widespread, the need to move data from one columnar system to another (i.e. columnar-to-columnar) will arise more and more often. And here is the problem -- done in the usual way, it's highly inefficient.

The reason it's inefficient is that columnar data is usually compressed. Therefore, passing a columnar dataset from one system to another usually requires decompressing it entirely, then exporting it into some row-based format like CSV, and then compressing it again in the target system. Doesn't sound very efficient, does it?

To my knowledge (correct me if I'm wrong), at this point there is no publicly available columnar data interchange format. For some reason (which I don't really understand), all major vendors keep their columnar data formats proprietary, as if they were rocket science. I believe that if, for instance, Qlik made their QVD format open source, it could become a de facto standard, simply because there is no other common columnar data format. But as the vendors keep ignoring this opportunity, I believe there is a need for an open-source format. Here is what it could be:

1. Columnar. Values are grouped into columns. A dataset can consist of one or more columns.

2. Extensible. Every column is of one of a set of predefined types, and each type encapsulates all the logic required to read/write that column. Types can be:
  • Uncompressed
  • Vocabulary compression
  • RLE compression
  • Zipped
  • etc.
New versions of the format can introduce new column types while preserving existing ones, thus ensuring backward compatibility.
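To make the idea of types that "encapsulate all logic required to read/write a column" concrete, here is a minimal sketch in Python. All names (the registry, the type ids "plain" and "rle") are illustrative assumptions, not part of any real spec; the point is only that a reader looks up a codec by a type id stored in the file, so new types can be added without breaking old ones.

```python
# Hypothetical extensible column-type registry. Each type owns its
# encode/decode logic; readers dispatch on a type id from metadata.

class ColumnType:
    registry = {}

    @classmethod
    def register(cls, type_id):
        def wrap(subclass):
            cls.registry[type_id] = subclass
            return subclass
        return wrap

    def encode(self, values):
        raise NotImplementedError

    def decode(self, payload):
        raise NotImplementedError

@ColumnType.register("plain")
class Uncompressed(ColumnType):
    def encode(self, values):
        return list(values)

    def decode(self, payload):
        return list(payload)

@ColumnType.register("rle")
class RunLength(ColumnType):
    def encode(self, values):
        runs = []                      # list of [value, run_length]
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    def decode(self, payload):
        return [v for v, n in payload for _ in range(n)]

# A reader that encounters an unknown type id can fail gracefully,
# while known types round-trip:
codec = ColumnType.registry["rle"]()
assert codec.decode(codec.encode(["a", "a", "b"])) == ["a", "a", "b"]
```

A reader built this way keeps backward compatibility for free: files written with old type ids still decode, and a new format version only adds entries to the registry.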

3. Mixed/Strict typing. Some columns can have values of mixed types (e.g. text and numbers), while others can be restricted to a single type.

4. Single-file/multi-file. Metadata and data can be in the same file (which is good for portability). Or metadata and columns (or parts of columns, such as vocabularies) can be separate files, which allows, for instance, sharing vocabularies between different columns and even tables.
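A multi-file layout with shared vocabularies might look something like the sketch below. Every field name and file name here is made up for illustration; nothing of this resembles an actual specification.

```python
# Hypothetical metadata for a multi-file layout: each column's data
# lives in its own file, and two dictionary-encoded columns point at
# the same vocabulary file instead of each storing its own copy.
metadata = {
    "table": "sales",
    "rows": 1_000_000,
    "columns": [
        {"name": "region",  "type": "dict", "data": "region.col",
         "vocabulary": "geo.vocab"},   # shared vocabulary file
        {"name": "country", "type": "dict", "data": "country.col",
         "vocabulary": "geo.vocab"},   # same file, reused
        {"name": "amount",  "type": "rle",  "data": "amount.col"},
    ],
}

# Sharing is just two columns referencing one vocabulary:
shared = metadata["columns"][0]["vocabulary"] == metadata["columns"][1]["vocabulary"]
assert shared
```

In a single-file variant the same metadata would simply sit in a header, with byte offsets instead of file names.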

5. Null-friendly. CSVs can't store nulls. This format should support nulls natively.
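One common way to support nulls natively (an assumption on my part, not a prescribed design) is a validity bitmap stored alongside the values, so that null never has to be smuggled in as an empty string or a magic number the way CSV forces:

```python
# Sketch: represent a nullable column as (validity flags, dense data).
# A real format would pack validity into actual bits; a list of bools
# keeps the idea readable here.

def encode_nulls(values):
    validity = [v is not None for v in values]
    data = [v for v in values if v is not None]  # nulls take no space
    return validity, data

def decode_nulls(validity, data):
    it = iter(data)
    return [next(it) if ok else None for ok in validity]

col = ["x", None, "y", None]
assert decode_nulls(*encode_nulls(col)) == col
```

A side benefit: "does this column contain nulls?" becomes a cheap metadata question instead of a full scan.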

I believe such a format would make columnar-to-columnar data transfers more efficient. It wouldn't eliminate decompressing/compressing entirely, but:
  • It would be possible to transfer only needed columns
  • In cases where both the source and target systems support the same column types (e.g. vocabulary compression), encoding/decoding can be either greatly optimized or avoided entirely
  • Metadata would be pre-calculated and available at low cost (e.g. number of rows, existence of nulls, etc.).
  • A columnar format makes parallel processing much simpler and more efficient
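The second point in the list above is the big one, so here is a toy illustration (with invented data, assuming both sides speak the same vocabulary encoding): the column is handed over as a dictionary plus indices, with no decode/re-encode round trip, whereas a row-based interchange format would force full materialization first.

```python
# A vocabulary-compressed column: unique values + per-row indices.
source_column = {"dict": ["red", "green", "blue"], "indices": [0, 0, 2, 1]}

# Going through CSV would require materializing every row first...
materialized = [source_column["dict"][i] for i in source_column["indices"]]

# ...while a shared columnar format just moves the two arrays as-is:
target_column = dict(source_column)  # hand-off, no decoding involved

# The target system sees exactly the same logical column:
assert [target_column["dict"][i] for i in target_column["indices"]] == materialized
```

Four index integers and a three-entry dictionary travel instead of four full strings; on real data with millions of repetitive rows, that gap is where the speedup lives.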

As a result, data transfers could be 10x-100x faster. One common format means we need not N x N data converters, but only N. Standardization would also foster a tool ecosystem around it -- viewers, profilers, converters, etc. Single documentation. Single API.

At this point it's just an idea. But if we at EasyMorph ever decide to build persistent storage for our in-memory engine, I don't see any reason not to make it open source.

It's not rocket science after all.

UPDATE 17 Feb 2016
Two open-source projects now aim to standardize columnar data exchange -- Apache Parquet (persistent storage) and the recently announced Apache Arrow (in-memory format).