April 9, 2017

Websites should offer downloading micromodels instead of CSV dumps

Many websites such as Google Analytics or PayPal allow users to download records (transactions) as CSV files or, sometimes, Excel spreadsheets. While it's more convenient than copy-pasting-parsing HTML from web-pages it's still not optimal because such dumps usually lack many details. For instance, when I download PayPal transactions I would also want to see product items, so that I could make a break down of sales by products. Or see customer addresses, so that I could analyze them from a geospatial perspective. Since the CSV dumps are denormalized, adding all the details and attributes would dramatically increase file sizes and also clutter them.

What I'm suggesting is that instead of CSV dumps websites should allow downloading micromodels -- user-related subsets of a bigger data model used by the web-service itself. Such micromodels would contain several linked normalized tables with only data relevant to the user who requested it. From a technical standpoint it can be one SQLite file generated on the fly. The SQLite file format allows packing multiple tables into one file which can hold millions of records.

Having a relational micromodel would allow more meaningful and interesting data analysis. It would play well with popular data analysis tools (except Excel which is poorly suited to work with relational data by design). Support for SQL queries would immediately make it compatible with vast amount of systems.

For information providers, an SQLite file with micromodel would be of size similar to current CSV dumps so it won't increase workload and traffic. Also generating micromodels can be even faster than generating CSV dumps since it won't require joining multiple tables in order obtain a denormalized view.

Below is an example of a possible micromodel schema for PayPal.