April 17, 2017

QViewer ver.3.3 is out

QViewer version 3.3 is now available for downloading. Here is what's new and exciting in the new version:

Support for QVD files larger than 2GB
Now you can open QVD files basically of any size. It was tested on QVD files up to 20GBs. Technically, at this point the only limitation is 2 billion rows per file. QViewer aggressively parallelize calculations so you may want to give it some time to perform necessary counts after opening a large file, because during that time CPU utilization can be near 100% which would make the application less responsive.

New Partial Load workflow
In the new version the workflow for partial load was significantly changed. The Partial Load button has been removed. Instead, when a file larger than certain threshold is opened QViewer suggests  performing a partial load (see the screenshot below). The threshold can be set at 512MB, 1GB, 2GB, 4GB or 8GB.

Indication of metrics affected by partial load
When a partial load is performed some metrics calculated by QViewer can be distorted because the loaded dataset is incomplete. To avoid confusion QViewer now shows which metrics are affected by partial load (shown red) and which are not.

Comments in generated script
When generating a LOAD statement, QViewer can now insert field comments obtained from the XML header. These comments are typically created using COMMENT FIELD or COMMENT TABLE statements in QlikView / Qlik Sense before exporting a table into QVD file.

Download QViewer

UPDATE 4/23/2017
I'm considering further improvements to QViewer. The ideas are floating around a few topics: table inspection for Qlik Sense Server, [shareable] selection bookmarks, support for other file formats (e.g. QVX, CSV, or XLSX), aggregation with built-in pivot tables (with selections applied). However, I'm not sure what would be the most useful for the Qlik dev community. I would appreciate hearing it from you -- what would you like to see in future versions of QViewer? Feel free to send me your suggestions (see my email in the upper right corner of this blog), or just leave a comment below.

April 9, 2017

Websites should offer downloading micromodels instead of CSV dumps

Many websites such as Google Analytics or PayPal allow users to download records (transactions) as CSV files or, sometimes, Excel spreadsheets. While it's more convenient than copy-pasting-parsing HTML from web-pages it's still not optimal because such dumps usually lack many details. For instance, when I download PayPal transactions I would also want to see product items, so that I could make a break down of sales by products. Or see customer addresses, so that I could analyze them from a geospatial perspective. Since the CSV dumps are denormalized, adding all the details and attributes would dramatically increase file sizes and also clutter them.

What I'm suggesting is that instead of CSV dumps websites should allow downloading micromodels -- user-related subsets of a bigger data model used by the web-service itself. Such micromodels would contain several linked normalized tables with only data relevant to the user who requested it. From a technical standpoint it can be one SQLite file generated on the fly. The SQLite file format allows packing multiple tables into one file which can hold millions of records.

Having a relational micromodel would allow more meaningful and interesting data analysis. It would play well with popular data analysis tools (except Excel which is poorly suited to work with relational data by design). Support for SQL queries would immediately make it compatible with vast amount of systems.

For information providers, an SQLite file with micromodel would be of size similar to current CSV dumps so it won't increase workload and traffic. Also generating micromodels can be even faster than generating CSV dumps since it won't require joining multiple tables in order obtain a denormalized view.

Below is an example of a possible micromodel schema for PayPal.

January 15, 2017

IT execs should not tell business users to choose between Tableau and Qlik

IT executives should not force business users to choose between Tableau and Qlik. After all, business users don't tell IT to choose between Windows and Linux.

Such "standardization", while still very popular in many organizations, brings more harm than good. It's originates from a few myths:

Myth #1: All BI tools are basically the same

This is no more true than saying "All airplanes are basically the same". Such altitude is coming from poor understanding of the purpose of Business Intelligence tools and capabilities of products available on the market. If two applications show charts and allow analyzing data it doesn't make them functionally equivalent because there is huge variety in the ways how data can be viewed and analyzed. Tell a good chef that all knives are basically the same therefore s/he should pick and use only one knife. Because, you know, "standardization".

BI tools are not the same. The task of data analysis and visualization is so complex and broad that no vendor can create a universal comprehensive solution, just like neither Boeing nor Airbus can create one universal airplane suitable for all cases -- from long passenger flights to air warfare to rescue operations.

For instance Qlik has amazing associative engine that allows easy discovery of logical relationships in data. Tableau has absolutely wonderful data visualization concept that unveils hidden patterns and provides meaningful perspectives that could be easily overlooked. Spotfire offers comprehensive means for scientific analysis and predictive modelling. The core concepts of these applications don't overlap. Some features surely do, but that doesn't make them interchangeable. Other BI tools also have their strong features. Any analytical application that has deeply thought-out, fundamental concept behind it will be significantly different from others.

Myth #2: Standardization is always a good thing

This myth is logically connected to Myth #1. Standardization, when it's applicable, has obvious benefits most of which boil down to one -- cost reduction. A company can get a deeper discount if it purchases more software licenses. You can save on training if you train employees only for 1 tool, instead of many. More people with similar skills are interchangeable, therefore less risk of losing critical expertise, also reduced staff count. And so on.

However, any cost reduction is only good when it doesn't degrade key functional capabilities. What would happen if you force a chef to "standardize" on knives? His/her functional capability would degrade. What would happen if you tell a military air force to use the same type of plane for cargo transportation and air dogfighting? Its functional capability would degrade. That's why nobody does it.

Myth #3: All business users have similar data analysis needs

There is a stereotypical understanding in the BI world that there are three main types of BI users: regular users (receive specialized reports), advanced users (interactively analyze and research data) and management users (look at dashboards, monitor high-level KPIs). It's an easy, simple and misleading classification. Misleading because BI adoption is never a one-way street. Unlike transactional systems, BI is still optional. It's hard to force people to use some tool if they don't like it or don't understand how to use it. In the case of BI adoption, they can always retreat to good old Excel (and they frequently do).

People are different and they have different preferences. When it comes to data analysis, two persons doing the same task might have different views on how to do it best. A good sign of a person who knows what s/he is doing is whether s/he has strong opinion on tools needed for the job. Therefore, business users themselves should pick what they need. If business users are okay with any analytical application given to them and have no own opinion on it then they don't really need it and purchasing it would be a waste of money.

Myth #4: Business users can do a good evaluation in a 2 month period

Data analysis is a rapidly developing discipline. It's developing in many directions: methods and techniques, visualizations, processing algorithms, metadata governance, unstructured data processing, etc. The times when a BI system was simply a visual SQL query generator with some interactive charting are long gone. BI is complex nowadays, and its complexity will only increase. Even best analytical applications on the market have rather steep learning curve, despite claims about the opposite in PR/marketing campaigns. Modern BI applications can be relatively easy to start with, but as soon as something slightly non-trivial is needed the learning curve skyrockets. Look at online forums like Tableau Community, or Qlik Community -- they are full of people asking tons of how-to questions which sometimes require rather lengthy and detailed answers.

I believe that a good understanding of capabilities of a data analysis application can be developed after at least 1 year of using the application regularly on complex real-life projects. That's in a case when there was no any previous experience with analytical applications. Or at least 6 months, if there was some (which means that you should be already familiar with some concepts). Asking business users without any previous experience with BI applications to provide a feedback on an application based on 2-3 month evaluation of some demo/PoC dashboard (i.e. which are not in production use) -- is a sure (albeit very common) way to get wrong conclusions. Don't do that.

OK, what's the alternative?

Tool as a Service (a.k.a. The Data Kitchen)

At this point you probably started suspecting that modern BI applications are not just more powerful than ever but also more complex than ever, and are more different from each other than ever. Comparing Tableau with Qlik makes as much sense as comparing apples with oranges. They are all round, after all, aren't they?

I believe that the most efficient way to adopt Business Intelligence is the one where adoption grows organically. IT departments should create an environment that fosters such organic growth, instead of limiting and restricting it for the purpose of hypothetical cost reduction. They should embrace the data kitchen concept, where multiple tools are available for the users who are looking for different ways to work with data. We can call it "Tool as a Service" if you will. Don't standardize on one BI system -- it's not going to work well. Ask business users what they like, and help them make it work from a technical perspective. It's the business users who should decide what to use and when. It's them who will accumulate the expertise of using the applications, not the IT people.

Practically, it means that teams as small as 5-10 people, or as big as 100 (or maybe even more) evaluate and test analytical applications themselves. The IT management should be ready that different teams may choose different applications. It's the users who should decide what works best for them. If they need two tools with somewhat overlapping features -- give them access to both. If more -- let them use as many as they need.

It doesn't mean that you will have to purchase every possible software for every user. Start with small packs of licenses for a few applications chosen by business users. Add more licenses when popularity of one of the applications increases. In this case license usage will correlate with growth of expertise. It's more efficient (also from a cost perspective) than spending millions for enterprise agreements then forcing everyone to use only the "standard" BI tool because "we spent so much money on it".

November 23, 2016

Unplugged: What I learned about people and technology while writing QViewer

This post is not about BI technology or vendors or data analysis. "Unplugged" is a new kind of articles in this blog. It's my personal observations with a bit of philosophical contemplation, if you will. Today I'm writing about what I've learned while developing and selling QViewer -- a side project started as a "quick and dirty" tool for personal use which then became an interesting business experiment that taught me a few new things about people and technology:

You can make and sell things as your side project. It's doable. I remember that somewhat awkward feeling when I received first ever payment for software that I made myself. It was very unusual. I had experience of selling enterprise software with a 6-digit price tag, but that was someone else's business. Getting your own first sale in a $50 range was no less exciting if not more.

People in general are good. Once you start selling software you interact with people all over the world. And it turns out that people are generally good around the globe. I was surprised how many very grateful and positive people there is. Probably it's the most unexpected and gratifying outcome of the whole project.

Some people cheat with licenses. Despite the fact that the cheapest QViewer license costs less than a dinner for two, and, unlike the dinner, is acquired forever -- they still cheat. I understand it's a part of the human nature -- feeling frustration and pity when someone steals from you and at the same time enjoying the benefits of stealing from someone else even if it's just pocket money. People are complicated animals. So I'm not saying anything about the people that cheat. I'm deeply content that the majority is honest. The humanity definitely has a chance to survive :)

Some people are strange. No need to deal with them. After all, doing business is a two-way street. I remember one person demanded a sales presentation, a webex demonstration and sending him a detailed commercial proposal for QViewer because he and "his guys" wanted to evaluate whether it's worth spending couple hundred dollars. I replied that I'm ready to answer any specific questions, and offered to try the free QViewer to get an idea about the product. I've never heard from him again.

95% of technical support time is spent on 5% of customers. Some people are just like that -- they don't read instructions, forget things, don't check spam folders before complaining that their key didn't arrive, can't figure out what instance of QViewer they're launching, etc. It's OK, they just need more help. After all, adults are just grown up kids.

User recommendations is the best advertisement. So far I've spent exactly $0 for advertising QViewer. Yet, it's quite popular, mostly because of user recommendations. For me it was a good example of what it looks like when you made something useful. If people recommend it to each other -- you're on the right path.

1 out 10 orders never paid. Spontaneous decisions, no problem.

Payment reminders work. Sometimes, your invoice sent to a customer may be buried in his/her email box under a pile of unread messages. Sending a friendly reminder once might help. Just once, that's enough for those who are really looking to buy.

Even small side projects can be extremely good for career opportunities. Needless to say, mentioning QViewer in my CV helped me tremendously in finding new employers (when I looked for them). I would argue that the salary increase it enabled has earned me more than selling QViewer licenses alone.

Developer tools are amazing nowadays. I wrote my first program in 1986. It was in BASIC on a military-grade DEC-type computer. In 90s I wrote programs in C++, Pascal and Assembly. Between 1998 and 2011 I didn't write a single line of code (except some Excel macros). Boy, how things have changed since then. When I started writing QViewer in 2012 I was totally fascinated with the capabilities of Visual Studio and C#. Later I fell in love with F# but that's a different story. And thanks God we have StackOverflow. Writing software has never been easier.

Obfuscate and protect your software. Sooner or later someone will try to disassemble your software for a purpose that might be disappointing for you. There is no absolute protection, but raising the barrier can significantly complicate the task. Once I interviewed a developer for EasyMorph. Trying to impress me, the guy told me that he also wrote a QVD viewer. However, after not answering a few questions about the QVD format he quickly admitted that he just disassembled and repacked some components of QViewer. I learned a lesson that day.

Writing and selling software changed my perception of the software industry. I understood what it takes to create it. I stopped using any pirated programs. Now I use only licensed software, even if it's rather expensive (I'm looking at you, Microsoft), and I always donate when a program is available for free but donations are accepted.

November 21, 2016

A simple join test that many fail

From time to time I happen to interview BI developers and I noticed that many of them don't understand how joins work. Probably, because most of the time they used to work with normalized data in transactional systems where primary keys always exist and defined by database design. In order to do figure out if the candidate has basic understanding of joins I ask him/her to answer the question below (without executing any actual query):

Hint: the correct answer is not 4. If you're unsure whether your answer is correct see this short video where both tables are joined using EasyMorph: https://www.youtube.com/watch?v=RYCtoRTEk84, or check our this SQLFiddle: http://sqlfiddle.com/#!9/60011/11/0

Not understanding joins sooner or later leads to uncontrolled data duplication in cases where joined tables are denormalized, which is a frequent cause of miscalculations in analytical applications.

UPDATE: Added a link to SQLFiddle (kudos to Devon Guerro).

November 15, 2016

Now we know where Tableau is heading. Where is Qlik going?

During the recent conference Tableau has unveiled its three-year roadmap. Briefly, it includes:
  • High-performance in-memory engine based on Hyper (in the timeframe that I predicted earlier)
  • Enhanced data preparation capabilities (Project Maestro)
  • Built-in data governance
  • Pro-active automatically designed visualizations
  • Tableau Server for Linux
The most interesting are the first two. Once implemented, they will significantly reduce the gap with Qlik in terms of performance and versatility. I wouldn't expect the first version of Tabeau's in-memory engine to be as performant and scalable as Qlik's QIX (let's not dismiss almost 20 years of tuning and optimizations), however I would predict that for small and medium deployments performance will not be an issue. Even if we assume that QIX would still be 2-3 times faster than Tableau Hyper -- performance won't be a decision-critical factor anymore.

Project Maestro is another inevitable move from Tableau people who now realize that self-service data analysis requires self-service data transformation. Tableau is still reluctant building a fully-featured ETL for business users like EasyMorph, however once Project Maestro is implemented the advantage of having built-in ETL capabilities in Qlik would be diminished (but not dismissed).

Now, when Tableau has clear advantage on the data visualization side and stops being a fancy add-on to databases but becomes more and more a self-contained analytical platform, the question is -- where is Qlik going?

QlikView is not actively developed anymore. All the recent developments on the Qlik Sense side in 90% cases are focused on expanding API capabilities, while its data visualization capabilities remain frugal. Honestly, I don't understand this development logic. I would understand it, if Qlik's product strategy assumed heavy reliance on 3rd party tools for decent data visualization and analysis. However so far I struggle to see any high-quality 3rd party tools built on top of Qlik Sense API that can amend the built-in visualizations. Qlik Market might have a few interesting extensions, but they're typically very specialized. Qlik Branch lacks high-quality extensions and is full of no longer supported experimental projects. Qlik itself doesn't promote any 3rd party tools and its product roadmap is yet to be seen.

So where is Qlik going?

September 4, 2016

How to use the memstat metrics in QViewer

Starting from version 3.1 QViewer shows two memstat metrics in Table Metadata window:
  • Size, bytes -- total size of the column in bytes
  • Avg.bytes per symbol -- average size of column values in byte 

click to zoom

These metrics are calculated similarly to memstat data available in QlikView (not available in Qlik Sense so far). Since the structure of QVD files is very close to the internal in-memory data format in Qlik these metrics can be used to optimize (reduce) memory footprint of resident tables which can be desirable for particularly large applications. The most convenient way to inspect resident tables in QViewer is setup a simple generic subroutine as described here. Alternatively, you can insert temporary STORE statements to save resident tables into QVDs and then open them in QViewer manually.

When looking at the memstat metrics in QViewer you would typically want to identify columns that take most of space (hint: click column headers in Table Metadata to sort the grid). A few things that you can do reduce table size:
  • Remove unnecessary columns that take a lot of space
  • Force Qlik to convert duals to numbers by multiplying them by 1
  • Trim text values to remove trailing and leading spaces
  • Use integers instead of floats where possible
  • Round up floats to fewer decimal digits to have fewer distinct values in the column
  • Use autonumbers instead of long composite text keys
Read also "A few tips for dealing with large QlikView applications".

Qlik uses special format for columns with incrementally increasing integers (autonumbers) -- they basically don't  take up any space in memory, although QVDs store them as regular columns.

The memstat metrics are calculated correctly even if QVDs are partially loaded in QViewer. Therefore you can use them in the free version of QViewer, or when partial loading was used.

QVDs generated in Qlik Sense v.2 and above are compatible with QVDs generated in QlikView and therefore can be opened in QViewer as well.

The total size is NOT simply the avg. symbol size multiplied by # of rows -- it's calculated using a more complicated logic that accounts data compression.