Usually, we see ETL processes that move data from row-based transactional systems into either row-based databases, or into columnar databases or analytical engines. In other words, the transfers are row-based to row-based, or row-based to columnar. But now that columnar data stores are becoming more and more widespread, the need to move data from one columnar system to another (i.e. columnar-to-columnar) will arise more and more often. And here is the problem -- done in the usual way, such transfers are highly inefficient.
The reason is that columnar data is usually compressed. Therefore, passing a columnar dataset from one system to another usually means decompressing it entirely, then exporting it into some row-based format like CSV, and then compressing it again in the target system. Doesn't sound very efficient, does it?
To my knowledge (correct me if I'm wrong), at this point there is no publicly available columnar data interchange format. For reasons I don't really understand, all major vendors keep their columnar data formats proprietary, as if it were rocket science. I believe that if, for instance, Qlik made their QVD format open-source, it could become a de-facto standard, simply because there is no other common columnar data format. But as the vendors keep ignoring this opportunity, I believe there is a need for an open-source format. Here is what it could be:
1. Columnar. Values are grouped into columns. A dataset can consist of one or more columns.
2. Extendable. Every column is of one of several predefined types, and these types encapsulate all the logic required to read/write the column. Types can be:
- Uncompressed
- Vocabulary compression
- RLE compression
- Zipped
- etc.
3. Mixed/strict typing. Some columns can have values of mixed types (e.g. text and numbers), while others can be restricted to a single type.
4. Single-file/multi-file. Metadata and data can be in the same file (which is good for portability), or metadata and columns (or parts of columns, such as vocabularies) can be separate files, which allows, for instance, sharing vocabularies between different columns and even tables.
5. Null-friendly. CSVs can't store nulls; this format should support nulls natively. (A rough sketch of how these properties could fit together follows below.)
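To make the five points above more concrete, here is a minimal, purely hypothetical sketch of how such metadata could be modeled. None of these names (ColumnEncoding, Column, Dataset, the file paths) come from any existing format -- they are illustrative assumptions only.

```python
# Hypothetical model of the format's metadata, assuming the five properties above.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ColumnEncoding(Enum):
    # Each encoding encapsulates its own read/write logic (point 2).
    UNCOMPRESSED = "uncompressed"
    VOCABULARY = "vocabulary"   # dictionary of distinct values + index array
    RLE = "rle"                 # run-length encoding
    ZIPPED = "zipped"


@dataclass
class Column:
    name: str
    encoding: ColumnEncoding
    strict_type: Optional[str] = None      # e.g. "number"; None means mixed typing (point 3)
    allows_nulls: bool = True               # nulls are first-class (point 5)
    data_file: Optional[str] = None         # None = stored inline in the dataset file (point 4)
    vocabulary_file: Optional[str] = None   # shared vocabularies can live in separate files
    # Pre-calculated metadata, readable without decoding the column:
    row_count: int = 0
    has_nulls: bool = False


@dataclass
class Dataset:
    name: str
    columns: List[Column] = field(default_factory=list)


# Example: a strictly-typed numeric column and a mixed-type column
# whose vocabulary is shared with other columns/tables via a separate file.
orders = Dataset(
    name="orders",
    columns=[
        Column("amount", ColumnEncoding.RLE, strict_type="number",
               row_count=1_000_000, has_nulls=False),
        Column("country", ColumnEncoding.VOCABULARY,
               vocabulary_file="shared/countries.voc",
               row_count=1_000_000, has_nulls=True),
    ],
)
```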
I believe such a format would make columnar-to-columnar data transfers more efficient. It wouldn't eliminate decompressing/compressing entirely, but:
- It would be possible to transfer only the needed columns
- In cases where both the source and target systems support the same column types (e.g. vocabulary compression), encoding/decoding can be either greatly optimized or avoided entirely (see the sketch after this list)
- Metadata would be pre-calculated and available at low cost (e.g. the number of rows, the presence of nulls, etc.)
- A columnar format makes parallel processing much simpler and more efficient
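A toy illustration of the second point (not any real system's API): if both sides understand vocabulary compression, a column can be handed over as a dictionary plus an index array, skipping the decode/re-encode round trip that a CSV export forces. The function names and data here are made up for the example.

```python
# Toy sketch: transferring a vocabulary-compressed column with or without decoding.
def decode_vocabulary_column(vocabulary, indexes):
    """Fallback path: fully materialize the values (what a row-based export forces)."""
    return [vocabulary[i] if i is not None else None for i in indexes]


def transfer_column(vocabulary, indexes, target_supports_vocabulary):
    if target_supports_vocabulary:
        # Fast path: pass the compressed representation through unchanged.
        return {"vocabulary": vocabulary, "indexes": indexes}
    # Slow path: decode to plain values so the target can re-encode them itself.
    return {"values": decode_vocabulary_column(vocabulary, indexes)}


# Example: a column of countries with repeated values and a null (None index).
vocabulary = ["Canada", "France", "Japan"]
indexes = [0, 0, 2, None, 1, 0]

print(transfer_column(vocabulary, indexes, target_supports_vocabulary=True))
print(transfer_column(vocabulary, indexes, target_supports_vocabulary=False))
```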
As a result, data transfers could be 10x-100x faster. One common format means we don't need N x N data converters, only N -- each system converts to and from the common format instead of to every other system. Standardization would also foster a tool ecosystem around it -- viewers, profilers, converters, etc. Single documentation. Single API.
At this point it's just an idea. But if we at EasyMorph ever decide to make a persistent storage format for our in-memory engine, I don't see any reason not to make it open-source.
It's not rocket science after all.
UPDATE 17 Feb 2016
Two open-source projects now aim to standardize columnar data exchange: Apache Parquet (persistent store) and the recently announced Apache Arrow (in-memory store).