The QIX internal database

The QIX internal database uses symbol tables and data tables. These two types of tables get created after the script engine executes the script. The script execution is sequential; this means that it is executed from left to right and from top to bottom. 

Allow me first to describe how symbol tables are created in memory and what these tables contain.

The script engine is responsible for executing our Qlik Sense script. Whenever the script engine executes a LOAD or a SQL Select statement, it will bring into memory all the rows and columns in the data source, unless otherwise specified in the syntax. The data source can be anything from an Excel file to a table in a database.

The QIX internal database builds a separate symbol table for each field in the dataset. The main characteristic of the symbol table is that it contains a row for each distinct value of the source fields, as well as two columns: a value column and a bit-stuffed pointer.

Let's take a look at the following sample line of code:

SQL SELECT Customer, Customer_Country, Active_Flag FROM CUSTOMERS;

The script engine first connects to the database through the standard Qlik Sense database connection syntax, and then it sends whatever is between the keyword SQL and the semicolon to the database engine. The database engine, which is external to our Qlik Sense application, executes the SQL command, and finally, returns the resulting dataset to the Qlik engine.

Qlik Sense can connect to a database using one of the following syntaxes:
ODBC CONNECT TO connect-string | OLEDB CONNECT TO connect-string | CUSTOM CONNECT TO connect-string | LIB CONNECT TO connection

Based on the SQL query we just saw, the dataset returned by the database engine to Qlik Sense will have three columns: Customer, Customer_Country, and Active_Flag

I created fictional data for explanatory purposes.

The sample output will look as follows:

[CUSTOMER] Data source:

Customer

Customer_Country

Active_Flag

Sensei Ltd

United Kingdom

Y

Sensei Ltd

Australia

N

Magic Quadrant Inc.

United States

Y

Banda Bulevard

Argentina

Y

 

As previously mentioned, when the script engine executes the script, it builds a separate symbol table for each field in the dataset. This symbol table contains a row for each distinct value in the field, and two columns: the field value, and a bit-stuffed, binary index value, more generally known as the pointer.

In this example, the dataset contains three columns or fields. The internal database will create three different symbol tables, one for Customer, one for Customer_Country, and another one for Active_Flag.

Always load fields that you will use in your application to avoid using up memory for data you do not need.

The first symbol table will be the Customer Symbol table, which is the first field in the dataset. This symbol table will have two columns, the first of which contains the distinct values of the dataset field Customer and will look like this:

The field Customer in the dataset has four rows, three of which are distinct or unique values. We can see that the resulting Customer Symbol table contains three rows.

The second column of the symbol table is a bit-stuff binary index that univocally references each row in the symbol table.

Remember: The index is bit-stuffed. This means the binary index only uses as many bits as necessary to account for all the distinct values, making it as efficient as possible.

The resulting Customer Symbol table with the distinct values and the bit-stuffed index will look as follows:

Following the same logic, there will be a symbol table for Customer_Country, and one for Active_Flag

Let's take a look at the resulting symbol table for the field Customer_Country:

And the Active_Flag symbol table looks like this:

Before we explain the data table used in the QIX internal database, we would like to briefly walk you through how the bit-stuffed index is populated.

A few paragraphs back, we mentioned that the bit-stuffed index is binary, and it will thus use the minimum amount of binary bits necessary to represent each of the distinct values in the symbol table.

When we have a few distinct values, it is easy to understand how this is populated. In the case of the Active_Flag symbol table, we have two distinct values, and consequently, it will use one binary bit to represent each row, 0 for the first value (Y) and 1 for the second value (N).

If we have three distinct values, which is the case in the Customer symbol table, but use only one binary bit per row, the engine will run out of bits by the time it populates the third and last value. To overcome this, the engine uses two binary bits per row: 00 for the first value, 01 for the second value, and 10 for the last and final value.

I bet many of you are asking why the third value cannot be represented as 11, or why the first value is 00 and not 01. 

The following is a formula that will let you calculate the number of necessary bits per row that the engine needs to identify each of the distinct values.

A binary system, or base two system, uses two symbols: 0 and 1. To represent each of the rows in a symbol table, the engine uses the least amount of binary bits possible. The formula to calculate this is:

n >= [number of distinct values]

2 is the base system identifier, and n is the minimum amount of necessary bits to represent each of the distinct values.

The formula 2n needs to meet the following criteria:

  • must be a whole integer, greater than or equal to 1
  • 2must be greater than or equal to the total number of distinct values to represent
  • Start with n = 1 and increment its value by one until it meets the condition
    2>= [number of distinct values]

To summarize, the exponential number represents the number of binary bits needed to represent the distinct values, and the resulting number of 2is the maximum amount of values that can be represented with n bits, where n is a whole number or a natural number.

Let's practice this with an example. The Customer_Country symbol table contains four distinct values: United Kingdom, Australia, United States, and Argentina. So, if we start putting the formula together, it will look like this:

2n >= 4

We now have 2, which is the base system, and 4, which is the total number of distinct values we want to represent. So, how do we find out n? According to the definition, n has to be a positive whole number starting from 1, and this number must be greater than or equal to the number of total values to be represented.

If n = 1, we have 2= 2.  Two isn't greater than or equal to 4, so we need to try the next whole positive number.

We increase the value of n by 1, now equaling 2. For n = 2, we have 22 = 4, which is greater than or equal to 4, and it is also the minimum amount of bits we can have.

Understanding how to calculate the minimum number of bits needed to represent each value in the symbol table is as important as knowing the position of each of the bits. 

The position of the bits goes from right to left. In a bit-stuffed index of 10, the first bit is 0 and the second bit is 1

To understand the positioning of these numbers, we created the following matrix: 

Bit position

2 1 0
2[bit position] 22 = 4 21 = 2 20 = 1
Symbol table Value 1 0 0 0
Symbol table Value 2 0 0 1
Symbol table Value 3 0 1 0
Symbol table Value 4 0 1 1
Symbol table Value 5 1 0 0
Symbol table Value 6 1 0 1
Symbol table Value 7 1 1 0
Symbol table Value 8 1 1 1

If we take the example of the Customer_Country symbol table, there are four distinct values with three bits needed to represent those four values. With two binary bits, we can represent, at the most, eight values (23 = 8). 

The first bit (far right) has the position 0. So, 2= 1. For the first digit, we alternate one 0 and one 1 for as many values as we need (four, in my example).

The second bit is in position 1. So, 2= 2. For the second bit, we alternate two zeros and two ones.

Finally, the third bit is in the second position, and thus, we have 2= 4. We then alternate four zeros and four ones for as many values as we have.

The same logic applies to the digits in the other positions.

Now that we have described how the symbol tables are populated with their values and bit-stuffed indexes, we will talk about the second table that is created by the QIX internal database: the data table.

The data table contains the same number of rows and fields as the original data, but uses the binary index values instead of the clear-text values, and is therefore highly compressed.

As we have already mentioned, our example dataset has three columns (Customer, Customer_Country, and Active_Flag) and four rows of data.

In the data table, this same data will be represented with the bit-stuff indexes, and it will look as follows:

So, it is clear that the biggest impact on the compression is the number of distinct values in a field. Not only will more distinct values increase the number of rows in a field’s symbol table, but the binary pointer values will be longer.

Here are a few things you can do to reduce the number of distinct values in your dataset, which will help with data compression and therefore improve your application performance:

  • Split fields, if possible. For example, if you have a free text field Company Name with the values Hedge Fund Limited, Hedge Fund Ltd, and Hedge Fund Ltd., this will account for three different, distinct values. However, if you split this field into Company Name and Company Suffix, you can have the Company Name table with one distinct value and Company Suffix with three distinct values, but it will be a lot easier to cleanse the data and convert everything to either 'Limited' or 'Ltd,' which will in return create one symbol table with one distinct value.
  • Remove blank spaces. The field value 'Juan' isn't the same as 'Juan '. This will count as two different values. Use the trim() function to get rid of unwanted blank spaces.
  • Shorten your dates. If you have a timestamp in your source but only display dates in your application, get rid of the time by using the function floor(). This will convert a timestamp value from '04-03-2019 10:54:22' to '04-03-2019'. If you need to show your hour, split your field into two, one for the date and one for the hour. This will drastically decrease RAM consumption. 
  • Replace business keys with surrogate keys: business keys are natural identifiers of values that are understandable to the business. For example, a business key for a Red Bike product could be 'BKS-RD-101928'. For the business, it is easy to understand that this product is a red bike with an identifier number of 101928. However, you could leave the product description to be used in your application front-end and convert all business keys into surrogate keys with the AutoNumber() function. This will create an incremental integer for each distinct value. Replacing long strings with numbers will improve RAM consumption significantly. 
Use AutoNumber(expression, AutoId) when using it in different fields. AutoId allows you to name each of the counters. This keeps consistency in the counter among those with the same name.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.138.117.75