November 11, 2013

FastStart: tiny package to speed up creating new QlikView apps

FastStart is a tiny bundle consisting of an empty QlikView application with a couple of useful subroutines, an empty configuration .xlsx file, and a portable QViewer for inspecting resident tables. I made it for myself for the cases when I need to quickly create a new app from scratch, but you're welcome to download and use it. It is made with rapid debugging and large datasets in mind.

I'm sure there are other similar bundles; anyway, here is my approach:

Changes to default Document Settings:
  • Compression -- None
  • Generate Logfile -- Yes
  • Default font -- Calibri, 9pt
Changes to default Load Script:
  • Added Subroutines tab that contains two subroutines -- LoadVars (loads variables from an external configuration file) and Inspect (calls QViewer for inspecting the contents of a resident table);
  • Added Partial Load tab -- calls LoadVars so that variables can be updated without a full reload (by simply pressing Ctrl + Shift + R); it can also hold additional script commands for execution during Partial Reload, which is convenient for modifying the data model;
  • Default tab Main renamed to Variables -- intended for setting up variables and runtime parameters;
  • Added Main tab -- a blank tab where data manipulation starts.
Two additional folders:
  • Config -- contains config.xlsx
  • Tools -- contains QViewer.exe
I intentionally tried to keep it minimalist and not overload it with pre-built stuff like a UI layout framework, calendar scripts, predefined color variables, etc. However, the external configuration file for storing variables and the subroutine for inspecting resident tables are two features I use in almost every application.
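For reference, the two subroutines can be sketched roughly like this (a minimal sketch, not the actual FastStart code; the config file layout, folder paths and QViewer's command-line behavior are my assumptions here):

SUB LoadVars (vConfigFile)
    // Assumes a sheet with Name and Value columns in config.xlsx
    Config:
    LOAD Name, Value FROM [$(vConfigFile)] (ooxml, embedded labels);
    FOR i = 0 TO NoOfRows('Config') - 1
        LET vName = Peek('Name', $(i), 'Config');
        LET [$(vName)] = Peek('Value', $(i), 'Config');
    NEXT
    DROP TABLE Config;
END SUB

SUB Inspect (vTableName)
    // Dumps a resident table into a temporary QVD and opens it in QViewer
    STORE [$(vTableName)] INTO [$(vTableName).qvd] (qvd);
    EXECUTE Tools\QViewer.exe "$(vTableName).qvd";
END SUB

A typical call would be CALL Inspect('Orders'); right after the table is loaded (note that Execute requires "Can Execute External Programs" to be enabled in the script settings).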

August 19, 2013

A few tips for dealing with large QlikView applications

Recently, I had to develop an application with a significant amount of data -- starting with about 100 million records and expected to grow to 400 million. It was an interesting experience, and below are my tips and findings that might be of use for those who have to deal with large amounts of data in QlikView.

Tip #0 -- go to Document Properties and turn off compression. It will make saving and loading the application much faster. Yes, it takes more disk space, but you're not going to send such a large application by mail anyway.

Mitigate long reload times
Once you've got tens of millions of records you probably don't want to perform a full reload after every small change in the script. Some people use the built-in debugger for limited runs, but you can get better control over limited reloads using a variable and the FIRST prefix for LOAD statements that operate on large tables. Example:

LET vFirst = 1000;//00000000;

FIRST $(vFirst) LOAD ...

In this case you can do quick short runs with a limited set of data, either to check the correctness of the script syntax after small changes or to verify transformation logic. Remove the semicolon and the double slash for a full load. Make sure that you have enough zeroes.

You would probably also want to reduce even full reload times as much as possible. In some cases, if the data model allows logical partitioning of the data, it's better to switch to partial reload as the main reload mode for updating the application with new information. In this case the data model isn't dropped entirely, so you can append new chunks of data to it instead of building everything from scratch. Full reload can be left for emergency cases, when the entire data model has to be rebuilt.

This technique is especially useful when you work with monthly (weekly, daily) snapshots and you need to join them (you might want to do this for denormalization as a performance improvement, discussed below). In this case, instead of joining two full tables you can join the monthly snapshots first, and then concatenate the results month by month using partial reloads (see picture below).


To make it more manageable, it's better to store monthly snapshots in individual QVD files -- one QVD per month.
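As an illustration, the month-by-month scheme could look like this in the script (a sketch only; table and file names are made up):

Facts:
LOAD * FROM Facts_*.qvd (qvd); // full reload: rebuild from all monthly QVDs

// Partial reload (Ctrl + Shift + R): append just the latest month.
// ADD ONLY statements are skipped during a full reload, and plain LOAD
// statements are ignored during a partial reload.
Facts:
ADD ONLY LOAD * FROM [Facts_$(vCurrentMonth).qvd] (qvd);

With identical field sets, the partially loaded rows are auto-concatenated to the existing Facts table instead of the table being rebuilt from scratch.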

May 12, 2013

Teammates wanted: ETL tool for non-technical people to be prototyped

I'm working on a concept of a new kind of ETL tool and I'm looking for one or two teammates (maybe future co-founders) to join our efforts and make it happen.

Background
The concept is a new data transformation tool intended for use by non-technical people. We're observing the rise of Business Discovery tools like QlikView, Tableau, etc. that simplify data analysis and visualization for people without a technical background, and I think there is an area where a similar change is still waiting to happen -- business data manipulation and transformation. This task is traditionally "outsourced" to technical engineers who build complex ETL jobs by means of SQL scripting or ETL tools. However, in my opinion, this way of doing things doesn't match reality anymore:

First, there is a growing need for non-technical (a.k.a. "business") users to manipulate data themselves. Nowadays they are increasingly inclined to browse data and perform ad hoc analysis rather than deal with static reports. However, not being able to prepare or modify a data set for analysis brings them a lot of inconvenience and limitations. In the end, we have a situation where those who prepare data don't analyze it, and those who analyze it can't prepare it. I believe this is a major factor that inhibits the effectiveness of Business Intelligence in general.

Second, the uber-popular Excel, which partially solves the problem of self-service data manipulation, is becoming an obstacle rather than a solution, for two reasons:

a) Typical data volumes have gone beyond Excel's capabilities and continue growing exponentially. Nowadays an average analyzed data set is about 10-30 million records and 200-500GB in size. That's too much for desktop Excel.

b) Spreadsheets as a technology don't play well with relational data sources (e.g. adding a new value to a dimension can break all references in a spreadsheet; writing back from Excel to a database is a hassle). At the same time, relational databases are the main source of data for BI tools and will remain so for decades.

Concept
In my opinion, any attempts to make traditional ETL/ELT more user friendly or stuff Excel with even more features will be ineffective. I believe data transformation needs to be re-thought and re-designed from scratch in order to make it appropriate for use by non-technical people.

Here is how I envision it:
  1. It's visual. Who said you can't join two tables by dragging and dropping one onto another? Left/right join? Not a problem. This can be done without knowing SQL at all. Another example -- grab a field header in a table and shake it with the mouse to obtain a list of the unique values of that field (which is actually a new table). Table concatenation, filtering, grouping -- all of this can be done in a visual way.
  2. It's instant. What you see is what you get, and you get it instantly. The result of manipulations and transformations is displayed immediately.
  3. It's table-based. A cell is just a particular case of a table, and so is a vector. Why not treat them as tables? We don't really need a spreadsheet full of empty cells. It should be a free-form layout consisting of connected tables (connection = transformation). Click on empty space to create a new cell and start editing it.
  4. It's functional/declarative. New columns and tables are the result of functions applied to tables, fields or specific cell ranges. A transformation is actually a sequence of functions where the resulting table of one function serves as the input table for another. Business users understand the concept of Excel-like functions, but don't understand SQL or imperative programming (and never will).
  5. It's semantic. Name fields meaningfully. Fields with the same name are expected to contain the same kind of information (e.g. Product Description). Refer to fields by name in functions and expressions (e.g. =Profit/Revenue). Make your own re-usable functions.
  6. It's smart. Join by fields with the same names. Omit the table name when referencing a field with a unique name, etc.
Since business users very rarely operate with more than 50 million records, the tool is not expected to have ultra-high performance or perform sophisticated transformations. Instead, the main emphasis is on polished usability and simplicity of the UI, which would allow business users to perform data manipulations themselves, without knowing SQL or asking data engineers for help. I believe this is achievable.

From a technical standpoint, this is an in-memory data processing engine built either as a WinForms .NET/C# application or as a web application on any other stack (I have experience with .NET/C#, RoR, JavaScript, and am now playing with Go). At some point JIT compilation of expressions will be required, so the stack should allow for this.

Project
At this point the project is pure R&D, whose goal is to find the optimal product and UI concept and build a simple working prototype. There is no ready specification -- we will have to take pen and paper and think out, discuss and try various approaches until we find something we are strongly convinced about. There is a non-zero chance that this will never happen.

The result is a working prototype that allows filtering, joins, concatenations and basic arithmetic operations on fields. From there, it can evolve in one of two ways -- either build a commercial product and bootstrap it as a side project, or look for funding, turn it into a company, and then start building a commercial product.

Who I'm looking for
I'm looking for software development veterans who love Business Intelligence, understand its purpose and challenges, and are capable of resolving complex technical tasks which might involve extensive parallel calculations, concurrent execution and JIT compilation. Since my programming style is rather pedantic, DRY and, to some extent, perfectionist, I'm looking for people with the same attitude and a similar coding style. It should be mentioned that my main programming experience was 15 years ago, and I'm now reinstating my coding skills with QViewer and some web projects. I expect you to be a more experienced programmer than me; however, that doesn't mean I expect someone to do all the coding -- we'll do the programming together, with equal effort.

If this project looks interesting to you and you think you have the appropriate software development experience -- give me a shout and tell me a few words about yourself. My email is in the upper right block of this blog. If you know someone who might be willing to take part in this project -- please don't hesitate to share the link to this blog post with him/her.

April 8, 2013

First look at QlikView Direct Discovery

Direct Discovery is one of the recent and long-awaited features of QlikView, available starting from version 11.2. Direct Discovery allows mixing data from external data sources and from QlikView's in-memory data engine using the same QlikView scripting language and expressions. The aim of Direct Discovery is to go beyond RAM limitations and enable real-time querying against large volumes of data outside of QlikView, while presenting the same associative experience as for regular in-memory data. Here are two good documents that describe it in more detail: QlikView Direct Discovery Datasheet and QlikView Direct Discovery FAQ.

From a development standpoint, Direct Discovery can be employed in two ways: using the DIRECT SELECT statement in the load script, and using the SQL() function in expressions.
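Here is a minimal sketch of the first approach (the DSN, table and field names are made up; see the two documents above for the authoritative syntax):

ODBC CONNECT TO MyDataWarehouse;

// A small dimension table is loaded into memory as usual
Products:
SQL SELECT ProductID, ProductName FROM dbo.Products;

// The large fact table stays in the database. QlikView reads only the
// field values needed for associations; aggregations over Amount in
// chart expressions are translated into SQL queries at runtime.
DIRECT SELECT ProductID, OrderDate, Amount FROM dbo.SalesFact;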

March 17, 2013

Better Lineage in QViewer and some other updates

Here is a brief overview of features recently added to EasyQlik QViewer -- my favorite QVD viewer :) If you're subscribed to the QViewer newsletter then you've already heard about some of them, except the most recent one.

Better Lineage

Since QlikView 11, all QVD files generated by version 11 apps contain the SQL statements from the load script. This is quite handy when you need to recall which database fields were used as the source for a QVD. The header doesn't contain full script snippets, so it's not possible to see the entire transformation logic; however, it is handy when you only need to remind yourself what it was. QVDs also contain connection strings, so you can double-check that the source database and schema are correct. Previously, the Metadata window in QViewer showed only one line per Discriminator/Statement, which wasn't convenient for viewing SQL statements. Therefore, starting from version 1.4.4 there is a new multi-line window for viewing Discriminator/Statement records from QVD headers. See the screenshot below (clickable).

January 19, 2013

Using web-services with QlikView

There are a number of ways to establish bi-directional integration between QlikView and external systems. Some of them employ web services. In this case a web service is a helper application which sits between QlikView and an external application (or is part of that external application) and communicates using web protocols. Web services can be written in any popular programming language (PHP, Java, C#, Python, Ruby, Go, etc.) and hosted on popular web servers like Apache, IIS or nginx, or any other of your choice.


Let's consider several architectures:

December 26, 2012

Using Dynamic Updates in QlikView 11

Dynamic Updates is another feature that makes the difference between QlikView and other popular BI tools even bigger (see also "Really, is QlikView a BI tool?"). If you haven't heard about it -- it allows adding new data or modifying existing data right in the in-memory data model of a QlikView application, without re-running the load script. While this feature can be used (to some extent) for real-time updates, in my opinion the main benefit lies in a different area -- it enables creating a closed-loop workflow with operational applications. This means that not only can you take data from operational applications (e.g. Salesforce or SAP), but you can also return some user-generated output from QlikView into these applications as part of a single workflow. For example, in one of my applications a user can get a list of pending orders from a logistics system, create batches from these orders based on some filters and selections in QlikView, and then return these batches for delivery to the logistics application.

Dynamic Updates can be performed in two ways -- using VB macros (starting from version 10) or using Actions (starting from version 11). Using Actions is preferable (as VB macros are losing more and more favor with QlikTech). Unfortunately, documentation about Dynamic Update Actions is very scarce, and there are lots of limitations and side effects, so I've tried to collect here some practical hints and tips on how they can be used:

Actions use an SQL-like syntax (however, a very limited one). E.g.:

INSERT INTO * ('Batch ID', 'Order ID') VALUES (1, 'ABC123');

UPDATE Batches SET WrongBatch = 1 WHERE "Order ID" = null();

DELETE FROM Batches WHERE -1;

These commands look like SQL, but here the similarities end. Some of the limitations:
  • You can use the equal sign (=) in a WHERE clause. You cannot use the greater-than (>), less-than (<) or not-equal (<>) signs
  • You can combine conditions with AND or OR in a WHERE clause. You cannot use NOT
  • The check for nulls is done differently than in ANSI SQL (see the example above)
  • Notice the different syntax for field names with spaces
I had problems with dynamically inserting several values with one command, even though it's possible with a hardcoded command. So I ended up dynamically generating several commands separated by semicolons, using dollar-sign expansion. E.g.:

$(=concat(DISTINCT 'INSERT INTO * ([Batch ID], [Order ID]) VALUES ($(vNewBatch),'&[Order ID]&')', ';'))

On QlikCommunity there is an excellent demo application made by Matthias Dix that shows basic Dynamic Update commands and their equivalents using VB macros (full thread here).

You should also keep in mind that there are some significant side effects of using Dynamic Updates, which often lead to unpredictable behavior. This article explains these side effects very well, along with how to deal with some of them. I struggled with failed variable assignments (Set Variable action) in the same action batch as a Dynamic Update, unexpected freezes, and the fact that any Dynamic Update command for some reason triggers all OnSelect actions. Finally I came to this workaround:

  • Put all variable assignments (Set Variable action) before the Dynamic Update command, otherwise the new values of these variables won't be visible in expressions.
  • Use some dummy command right after the Dynamic Update to fight freezes, e.g. Set Variable for a dummy (i.e. unused) variable, or even better ...
  • ... use a Selection -> Back action right after the Dynamic Update command to compensate for the triggering of all OnSelect events (it won't have an effect on VB macros tied to OnSelect).


UPDATE

Keep in mind that for server-hosted applications, all changes made by Dynamic Update apply immediately for all users working with the application. This means you need to make sure that the actions of different users working simultaneously won't interfere. This can be achieved, for instance, by adding the user name to each record created by Dynamic Update and then using set analysis expressions to limit records to only those required for that user. In any case, extra care should be taken to correctly handle all possible concurrency cases.