May 12, 2013

Teammates wanted: ETL tool for non-technical people to be prototyped

I'm working on a concept of a new kind of ETL tool and I'm looking for one or two teammates (maybe future co-founders) to join our efforts and make it happen.

Background
The concept is about new data transformation tool intended for use by non-technical people. We observe the rise of Business Discovery tools like QlikView, Tableau, etc. that further simplify data analysis and visualization for people without technical background and I think there is an area where similar change is still waiting to happen -- business data manipulation and transformation. This task is traditionally "outsourced" to technical engineers who build complex ETL jobs by the means of SQL scripting or ETL-tools. However, in my opinion, this way of doing things doesn't match reality anymore:

First, there is a growing need for non-technical (a.k.a. "business-") users to manipulate data themselves. Nowadays they are increasingly more inclined to browse data and perform ad hoc analysis rather than deal with static reports. However, not being able to prepare or modify data set for analysis brings a lot of inconvenience and limitation for them. Finally, we have situation when those who prepare data -- don't analyze it, and those who analyze it -- can't prepare it. I believe this is a major factor that inhibits effectiveness of Business Intelligence in general.

Second, uber-popular Excel, which partially solves the problem of self-service data manipulation, is becoming more and more an obstacle rather than a solution because of two reasons:

a) Typical data volumes went beyond Excel capabilities and continue growing exponentially. Nowadays average analyzed data set is about 10-30mln of records and of 200-500GB size. That's too much for desktop Excel.

b) Spreadsheets as technology doesn't play well with relational data sources (e.g. adding a new value to a dimension can break all references in a spreadsheet; writing back from Excel to a database is a hassle). At the same time relational databases is the main source of data for BI tools and will remain it for decades.

Concept
In my opinion, any attempts to make traditional ETL/ELT more user friendly or stuff Excel with even more features will be ineffective. I believe data transformation needs to be re-thought and re-designed from scratch in order to make it appropriate for use by non-technical people.

Here is how I envision it:
  1. It's visual. Who said you can't join two tables by dragging and dropping one on another? Left/right join? Not a problem. This can be done without knowing SQL at all. Another example -- grab a field header in a table, shake it with mouse to obtain a list of unique values of this field (which is actually a new table). Table concatenation, filtering, grouping -- all this can be done in a visual way.
  2. It's instant. What you see is what you get and get instantly. Result of manipulations and transformations is displayed immediately.
  3. It's table-based. Cell is just a particular case of a table. Vector either. Why don't treat them as tables? We don't really need a spreadsheet full of empty cells. It should be free-form layout consisting of connected tables (connection = transformation). Click on empty space to create a new cell and start editing it.
  4. It's functional/declarative. New columns and tables is result of functions applied to tables, fields or specific cell ranges. Transformation is actually a sequence of functions where resulting table of one function serves as input table for another. Business users understand concept of Excel-like functions, but don't understand SQL or imperative programming (and will never do).
  5. It's semantic. Name fields meaningfully. Fields with the same name expected to contain the same kind of information (e.g. Product Description). Refer by field names in functions and expressions (e.g. =Profit/Revenue). Make own re-usable functions.
  6. It's smart. Join by fields with the same names. Omit table name when referencing a field with unique name, etc.
Since business users very rarely operate with more than 50mlns of records the tool is not expected to have ultra-high performance or perform sophisticated transformations. Instead, main accent to be made on polished usability and simplicity of UI which would allow business users perform data manipulations themselves, without knowing SQL or asking data engineers for help. I believe this is achievable.

From technical standpoint -- this is in-memory data processing engine which is built either as WinForms .NET/C# application or web-application on any other stack (I've experience with .NET/C#, RoR, Javascript and now playing with Go). At some point JIT compilation of expressions will be required, so the stack should allow doing this.

Project
At this point the project is pure R&D which goal is to find out optimal product and UI concept, and build simple working prototype. There is no ready specification -- we will have to take a pen and a paper and think out, discuss and try various approaches until we find something we are strongly convinced about. There is a non-zero chance that this will never happen.

The result is working prototype that allows doing filtering, joins and concatenations and basic arithmetical operations on fields. Then there can be two options how it can evolve further -- either to build a commercial product and bootstrap as a side-project, or look for funding and turn into a company and then start building a commercial product.

Who I'm looking for
I'm looking for software development veterans who love Business Intelligence, understand its purpose and challenges, and are capable of resolving complex technical tasks which might involve extensive parallel calculations, concurrent execution and JIT compiling. Since my programming style is rather pedantic, DRY and, to some extent, perfectionist I'm looking for people with the same altitude and similar coding style. It should be mentioned that my main programming experience was 15 years ago and now I'm reinstating my coding skills with QViewer and some web-projects. I expect you to be more experienced programmer than me, however that doesn't mean I'm expecting someone to do all the coding -- we do programming together and with equal efforts.

If this project looks interesting to you and you think you have appropriate experience of software development -- give me a shout and tell a few words about yourself. My email is in the upper right block of this blog. If you know someone who might be willing to take part in this project -- please don't hesitate to share with him/her the link to this blog post.

April 8, 2013

First look at QlikView Direct Discovery

Direct Discovery is one of the recent and long awaited features of QlikView which became available starting from version 11.2. Direct Discovery allows mixing data from external data sources and from QlikView's in-memory data engine using the same QlikView scripting language and expressions. Aim of Direct Discovery is to go beyond RAM limitations and enable real-time querying against large volumes of data outside of QlikView, while presenting the same associative experience as for regular in-memory data. Here are two good documents that describe it in more details: QlikView Direct Discovery Datasheet and QlikView Direct Discovery FAQ.

From development standpoint, Direct Discovery can be employed by two ways: using DIRECT SELECT statement in loading script and SQL( ) function in expressions.

March 17, 2013

Better Lineage in QViewer and some other updates

Here is a brief overview of features recently added to EasyQlik QViewer -- my favorite QVD viewer :) If you're subscribed to QViewer newsletter then you've heard about some of them except the most recent.

Better Lineage

Since QlikView 11 all QVD files generated by version 11 apps contain SQL statements from loading script. This is quite handy when you need to recall what fields from database were used as source for a QVD.  It doesn't contain full snippets of script so it's not possible to see entire transformation logic, however this is handy when you only need to remind yourself what it was. Also QVDs contain connection strings, so that you can double-check if source database and schema are correct. Previously, Metadata window in QViewer showed only 1 line per Discriminator/Statement and it wasn't convenient for viewing SQL statements. Therefore starting from version 1.4.4 there is new additional multi-line window for viewing Discriminator/Statement records from QVD headers. See screenshot below (clickable).

January 19, 2013

Using web-services with QlikView

There is a number of ways to establish bi-directional integration between QlikView and external systems. Some of them employ web-services. In this case a web-service is a helper application which sits between QlikView and external application (or is part of that external application) and communicates using web-protocols. Web-services can be written in any popular programming language (PHP, Java, C#, Python, Ruby, Go etc.) and hosted using popular web servers like Apache, IIS or nginx or any other of your choice.


Let's consider several architectures:

December 26, 2012

Using Dynamic Updates in QlikView 11

Dynamic Updates is another one feature that makes difference between QlikView and other popular BI tools even bigger (see also "Really, is QlikView a BI tool?"). If you haven't heard about it -- it allows adding new or modifying existing data right in the in-memory data model of a QlikView application without re-running loading script. While this feature can be used (to some extent) for real-time updates, in my opinion the main benefit from it lies in a different area -- it enables creating closed-loop workflow with operational applications. It means that not only can you take data from operational applications (e.g. Salesforce or SAP), but return some user-generated output from QlikView into these applications as part of a single workflow. For example, in one of my applications a user can get a list of pending orders from logistic system, create batches from these orders based on some filters and selections in QlikView and then return these batches for delivery into the logistic application.

Dynamic Updates can be performed in two ways -- using VB macros (starting from version 10) and using Actions (starting from version 11). Using actions is a more preferable way (as VB macros are losing more and more favor with QlikTech). Unfortunately, documentation about Dynamic Update Actions is very scarce and there are lots of limitations and side effects, so I tried to collect here some practical hints and tips how it can be used:

Actions utilize SQL-like syntax (however very very limited). E.g.:

INSERT INTO * ('Batch ID', 'Order ID') VALUES (1, 'ABC123');

UPDATE Batches SET WrongBatch = 1 WHERE "Order ID" = null();

DELETE FROM Batches WHERE -1;

These commands look like SQL but here similarities end. Some of limitations:
  • You can use equal sign (=) in WHERE clause. You cannot use greater (>) or less (<) signs or not equal (<>)
  • You can combine conditions using AND or OR in WHERE clause. You cannot use NOT
  • Check for nulls is done differently than in ANSI SQL (see example above)
  • Notice different syntax for field names with spaces
I had problems with inserting dynamically several values by one command despite it's possible with hardcoded command. So I ended up dynamically generating several commands separated by semicolon using dollar-sign extensions. E.g.:

$(=concat(DISTINCT 'INSERT INTO * (OrderID) VALUES ($(vNewBatch),'&[Order ID]&')', ';'))

On QlikCommunity there is excellent demo application made by Matthias Dix that shows basic Dynamic Update commands and their equivalent using VB macros (full thread here).

You should also keep in mind that there are some significant side effects from using Dynamic Updates which often lead to unpredictable behavior. This article explains very well these side effects and how to deal with some of them. I struggled with failed variable assignments (Set Variable action) in the same action batch with Dynamic Update, unexpected freezes, and that any Dynamic Update command for some reason triggers all OnSelect actions. Finally I came to this workaround:

  • Put all variable assignments (Set Variable action) before Dynamic Update command otherwise new values of these variable won't be visible in expressions.
  • Use some dummy command right after Dynamic Update to fight freezes, e.g. Set Variable for dummy (i.e. not used) variable, or even better ...
  • ... use Selection -> Back action right after Dynamic Update command to compensate triggering all OnSelect events (won't have effect on VB Macros tied to OnSelect).


UPDATE

Keep in mind that for server-hosted applications all changes made by Dynamic Update apply immediately for all users that are working with the applications. It means that you need to make sure that actions of different users working simultaneously won't interfere. It can be achieved, for instance, by adding user name for each record created by Dynamic Update and then using set analysis expressions to limit records to only those which are required for the user. But in any case, extra care should be taken to manage correctly all possible concurrency cases.

December 15, 2012

QViewer rebranded, got paid version


A few significant changes happened to QViewer -- viewer for QVD files which I wrote a few months ago and still continue developing.

First of all, QViewer has been re-branded as EasyQlik QViewer and got its own web-site http://easyqlik.com (which however doesn't differ much from the old one).

Second major change is that QViewer has now two versions -- free and full. Free version is limited to 10'000 rows which is still quite enough to do some data profiling. Full version is paid (license key can be purchased for $45) and unlimited. License is personal (i.e. limited to a person, not organization) and has no limitation by number of installations -- you can use QViewer on as much computers as you need. In 8 months the demo version was download almost 1'000 times so it gives me the hope that someone found it as much as useful as I do :) However, further development of QViewer in terms of features, usability and functionality will directly depend on number of licenses purchased as this will show if QViewer is really useful tool or not.

Third change is that finally I got rid of troublesome ClickOnce installer which caused lots of problems with installation and even more with uninstallation. Now QViewer comes with standard neat NSIS-based installer and actually is a portable application. It means that you don't necessarily need to run installer if you want to move QViewer to a different machine. Once it has been installed, you can simply copy qviewer.exe to other machine or add it to your portable apps collection on USB-stick. However, running installer has advantage of automated association with .qvd file extension, which can be problematic otherwise. And as previously, QViewer requires installed .NET v4.0 or above -- keep this in mind when using QViewer in portable mode.

Future development plans include adding sort functionality for listboxes and metadata tables, and design of custom data grid. The latter should significantly decrease load times as the current implementation uses standard DataGridView which is so slow that only initializing it actually takes major part of total load time.

And of course, if you have feature suggestions -- don't hesitate to give me a shout by email or in comments here.

How to look inside resident tables at any point of loading script

What do you usually do when you need to look inside a resident table somewhere in the middle of loading script? I guess you store it as QVD, then load this QVD in a separate temporary QlikView application, then create table box or something else to browse the table.

Here is small trick how you can make your life easier using QViewer:

Insert in the beginning of the loading script this short snippet:

SUB INSPECT (T)
LET vPathToQviewer = 'C:\User\Dmitry\AppData\Local\EasyQlik\QViewer\';
STORE $(T) into [$(vPathToQviewer)temp_qvd.qvd] (qvd);
EXECUTE "$(vPathToQviewer)qviewer.exe" "$(vPathToQviewer)temp_qvd.qvd";
ENDSUB

And then simply insert calls to this function in any place of the loading script where you need to inspect table in its current state. For instance:

CALL Inspect('Orders');

When execution of the loading script will come to this line it will open QViewer showing the table. Script will continue running as soon as you close QViewer.

This is convenient technique that allows inspecting resident tables after every transformation like joins, concatenate loads, etc. You can have as many inspection points as you need during a single run.

A few more tips:

You may receive Security Alert from QlikView because of EXECUTE statement as depicted below. Just press Override Security button. Or you can change security settings in the bottom of the Script Editor window.


You can put this snipped into some file (e.g. debug.qvs) and then include this file when you need to do debugging of some application. Also you might want to save inspected tables under their own names in some temporary folder -- just modify the snippet accordingly or add a new similar function, e.g. InspectSave.

Any more ideas?

UPDATE 1

+Donald Hutchins made a good point -- he offered to store temporary QVD file not in QViewer's folder but in the working folder of the appliction which loading script is being executed. I think it makes a lot of sense because of less problems with access rights and better fit for parallel use. Here is his variant with my minor changes:

SUB INSPECT (T)
    STORE $(T) into [$(QvWorkPath)\~$(T).qvd] (qvd);
    EXECUTE "C:\.......\QViewer.exe" "$(QvWorkPath)\~$(T).qvd";
    EXECUTE cmd /c del "$(QvWorkPath)\~$(T).qvd";
ENDSUB


UPDATE 2

+Matthew Fryer proposes one more variant of the Inspect sub in his blog QlikView Addict.