Monday, February 13, 2012

Sometimes Fat is Better



Your data model makes a big difference in the performance of your QlikView application.  Of course, if you have been working with QlikView for any length of time, that is pretty obvious.  Here is a situation you might come across that I thought I would share.

The client was trying to monitor costs for production items on a per-unit basis as they travelled through the various cost centers.  So there are several hundred products and a few dozen cost centers, with each combination having a per-unit cost.  From there, it should be a fairly easy exercise to find the extended costs as units are sold, “slicing and dicing” by things like customers, dates, cost centers and other groupings.

Being fairly rational, I figured there were a couple of ways to do this:

Product | Cost Center | Cost per Unit
--------|-------------|--------------
A       | Assembly    | 0.0233
A       | Packing     | 0.0324
A       | Q&A         | 0.0133
A       | Storage     | 0.0544
B       | Assembly    | 0.0766
B       | Packing     | 0.0222
B       | Q&A         | 0.0652
B       | Storage     | 0.0143


This is the “thin” table approach.  Load all the costs into one column and then use set analysis in your expressions to display each cost center, like:  SUM({<[Cost Center]={'Assembly'}>} [Cost per Unit] * [Units Sold]).
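In script terms, the thin model might look something like the sketch below. The table and file names are just placeholders, and I am assuming the sales table carries Product and [Units Sold] and associates with the cost table by Product:

ThinCosts:
LOAD Product,
     [Cost Center],
     [Cost per Unit]
FROM Costs.qvd (qvd);    // associates with the sales table on Product

// One chart expression per cost center, each filtered with set analysis, e.g.:
// SUM({<[Cost Center]={'Assembly'}>} [Cost per Unit] * [Units Sold])
// SUM({<[Cost Center]={'Packing'}>}  [Cost per Unit] * [Units Sold])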

Product | Assembly | Packing | Q&A    | Storage
--------|----------|---------|--------|--------
A       | 0.0233   | 0.0324  | 0.0133 | 0.0544
B       | 0.0766   | 0.0222  | 0.0652 | 0.0143


Here is the “fat” table approach.  Create a cross table, with a column for each cost center.  This results in a fairly wide fact table in your data model, but there is no need for set analysis in your expressions because you reference each cost center's column directly:  SUM([Assembly] * [Units Sold]).
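If the source data arrives in the thin layout, one way (among several) to pivot it into the fat layout in the load script is a simple loop that joins each cost center on as its own column. This is only a sketch with assumed table names:

FatCosts:
LOAD DISTINCT Product
RESIDENT ThinCosts;

FOR EACH vCC IN 'Assembly', 'Packing', 'Q&A', 'Storage'
    LEFT JOIN (FatCosts)
    LOAD Product,
         [Cost per Unit] AS [$(vCC)]     // each cost center becomes its own column
    RESIDENT ThinCosts
    WHERE [Cost Center] = '$(vCC)';
NEXT vCC

// Chart expressions then reference the columns directly, no set analysis needed:
// SUM([Assembly] * [Units Sold])
// SUM([Packing] * [Units Sold])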

Now I have always been told, as a rule of thumb, “Create narrow, long fact tables” and “when you have two like facts, concatenate them together rather than having separate tables with a link”.

Hence, I originally thought that the “thin” table approach would be best.  This was a disaster.  The model worked, but it performed mind-numbingly slowly.  How could that be, when I used as few columns as possible?

I finally realized that the problem here is that these costs are not facts.  They are really dimensions.  Sure, they are numbers that we will calculate against, but they are still dimensions that either do not change or change only slowly (with the addition of date fields).  I was so used to thinking that any column full of numbers we calculate with must be a fact.  Welcome to database design 101.

I developed a small example application to illustrate both models and the performance difference between them.  There are about 150,000 sales records, 10 cost centers and 80 products, so there are 800 costs that need to be housed either in one long column of 800 rows or in 10 columns of 80 rows each.  I ended up with this data model (so that I could create two tables, each using one of the cost sets):
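If you want to reproduce something on a similar scale, the test data can be generated inline. The field names and random distributions below are just assumptions for illustration:

Sales:
LOAD RecNo()                                    AS SaleID,
     'P' & Ceil(Rand() * 80)                    AS Product,        // 80 products
     Ceil(Rand() * 500)                         AS [Units Sold],
     Date(MakeDate(2011) + Floor(Rand() * 365)) AS SaleDate
AUTOGENERATE 150000;                            // ~150,000 sales records

ThinCosts:
LOAD 'P' & Ceil(RecNo() / 10)                   AS Product,        // 80 products x
     'CC' & (Mod(RecNo() - 1, 10) + 1)          AS [Cost Center],  // 10 cost centers
     Round(Rand() * 0.1, 0.0001)                AS [Cost per Unit] // = 800 cost rows
AUTOGENERATE 800;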



I then created a pivot table from the “thin” table using set analysis, and a similar pivot table from the “fat” table using the same expressions without the set analysis, referring to the respective cost columns directly.

I also created a Memory statistics file to highlight the difference in calculation time between the two objects:
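As an aside, the exported memory statistics file is just tab-delimited text, so it can be loaded into its own QlikView document to compare the Bytes and calc-time figures side by side.  The file name below is hypothetical, and the character set may differ depending on your version:

MemStats:
LOAD *
FROM [FatVsThin.mem]
(txt, utf8, embedded labels, delimiter is '\t');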




You can see that the “fat” set is almost 5 times as efficient in this example.  And look at the memory advantage in Bytes!

So maybe next time I will take a more thoughtful look at the data before deciding on a model, because what appears to be a fact can really be a dimension.  And I should always remember that, regardless of the type of data, sometimes more columns really are better.