8

Moving beyond basic pivots

Normal pivot tables will meet many of your needs; however, when it comes to big data, you will need some more powerful tools. This chapter explains in lay terms the limitations of using normal pivot tables to interrogate relational data. The chapter starts by explaining what a relational database is, using library examples, and then explains how such data can be interrogated using PowerPivot. The basic functions of PowerPivot are explained, including step-by-step instructions on how to link data to the PowerPivot model, how to create relationships within the model, what calculated columns and measures are, and how to create them. The chapter is written for readers who have never used PowerPivot.

Keywords

PowerPivot; big data; relational data; data warehousing; library analytics tools; DAX; calculated columns; measures

Sooner or later you will be faced with serious limitations in what you can do with normal pivot tables. This is where PowerPivot comes to the rescue! PowerPivot performs calculations in a fraction of the time of normal Excel, and it can handle a lot more data. The current version of PowerPivot can handle 1,999,999,997 distinct values in a column, and 1,999,999,997 rows of data. But this is only the tip of what PowerPivot can do.

Before PowerPivot, I had written some complex VBA code to produce findings from a large dataset on library usage. I was interested in seeing whether changes in students’ usage of library resources over their years of study correlated with changes in their grades. I found it did, but it took me weeks to get that data, and when I ran the code I had to start it on a Friday, then hope when I came to work on the Monday that the computer had not crashed and the code had finished running. I recently had to solve a problem of similar complexity, albeit for a totally unrelated dataset, and when I nutted out the formula I needed, PowerPivot completed the calculation almost instantly. Speed is essential for working on large datasets; life is too short to have to wait a weekend every time you want to run a new calculation. However, speed is no good if it is not backed up by function.

Broadly speaking, PowerPivot provides two additional functions that allow you to perform calculations that would otherwise be impossible using normal pivot tables:

• Formulas that can count distinct items

• Formulas that iterate through every row of data.

While this may not seem very impressive now, the benefit will soon become apparent when you try to use an ordinary pivot table to work with a denormalized table.

A denormalized table is one where you have data that previously existed in a relational database, and it has been exported to a single table. It will include a lot of repetition in the table, i.e., “duplicate” rows. Librarians fully understand resource databases and metadata. You may, however, be a bit rusty on what the structure of a relational database actually encompasses, so to avoid the risk of confusion I will quickly cover this now. Please forgive me if you already know this.

Relational databases

Imagine you were asked to create a system to record two bits of information: the title of a book, and the author. Sounds simple enough. So you create a spreadsheet with one column called “Author,” and another called “Title.” Your boss then asks you to enter all your books and authors into your new system. The system works fine until you get to the third book. This book has two authors. What do you do? You might decide to rename the second column “Author1,” and create a third column called “Author2.” Great, you press on. Then you get to the sixth book, which has ten authors! What do you do now? You could continue to press on and create more author columns, but that is not a workable option, for two reasons. Firstly, imagine you were asked to produce a list of all the authors. What are you going to do, cut and paste all the author columns into a single column? This would be very time-consuming, and error prone. Secondly, imagine that instead of entering books, you were entering data on sales, with one column being the name of the salesperson, and another being the name of the customer. A single salesperson might have hundreds, or even thousands, of customers. A thousand “customer” columns is unworkable.

The problem is that one book might have many authors. This is called a one-to-many relationship. But it does not stop there. One author might write many books. So when it comes to books, many books can be written by many authors. This is what is called a many-to-many relationship. It is possible to enter such relational data into a single table. However, it comes with many problems. For example, you might decide that you are going to stick with the two columns, one for “Author,” and another for “Title,” and if a book has two authors, you enter two rows of data. If a book has three authors, you enter three rows of data. And so on…

image

There is a lot of duplicate data entry, but it’s not too bad. But what happens if you want to include more information, such as publisher, year of publication, ISBN, etc.? The amount of duplicated data entry would soon escalate out of control. What happens if someone enters one of those duplicate entries incorrectly? For a start, any report you run off such a system will be inaccurate. If you used this system to count the number of titles by counting the unique ISBNs, then how accurate is your data going to be if many of these have been entered incorrectly? Of course, it is always possible to enter something incorrectly, but the point is that such a system is designed in a way that makes mistakes almost certain. You want the chances of mistakes to be unlikely, not almost certain. Finally, say you realized that Jane Smith had been entered incorrectly, and it should have been Janet Smith. How are you going to fix this in such a system? Whatever you do, it is going to be risky, labor intensive, or both.

For all these reasons, a flat file (i.e., a single table), is a pretty poor tool for entering anything but simple data. Relational databases were created to manage data that consists of many-to-many relationships. Relational databases solve this problem of the many-to-many relationships by splitting the data into separate tables, and creating a joining table.

image

All the authors are entered into the Author table, and they are only entered once. That way, if you needed to change Jane Smith to Janet Smith, you would only need to do it once, in the Author table. In exactly the same vein, all the book titles are entered into the Title table, and they are only entered once. You will notice that both the Author and the Title tables have a unique identification field, called AuthorID and TitleID respectively, with a little key symbol next to them. Given that the author should only be in the database once, you could in this example use the AuthorName as the primary key. However, this is not good practice. What would happen, for example, if we had two authors with the same name? The purpose of the primary key is to uniquely identify each row of data in a table. A lot of databases use sequential numbers that are automatically generated when a user adds a new record. So, in this example, if the first author I entered was “Joe Bloggs,” Joe would get an AuthorID of 1A. If I then entered Jane Smith, she would get an AuthorID of 2A. The order in which authors are entered is completely irrelevant. The point is that once an author is assigned an ID, that ID becomes their ID for the life of the database. It is a bit like when a new client joins your library: they are given a number, that number is their library number, and no one else has that number. The same logic applies to the Title table.

The AUTHOR_TITLE table is where the magic happens. This table can then be used to assign a title to each author, and an author to each title. So, for example if the TitleID for the work “Libraries are great” was 1T, and the AuthorID for Joe Bloggs was 1A, and for Jane Smith her AuthorID was 2A, then your AUTHOR_TITLE table would look like this:

image

Now, relational databases are fantastic when it comes to entering data. They dramatically reduce the amount of data entry required, the risk of error, and data storage size requirements. However, while relational databases are essential for managing data entry, they can be awful when it comes to data reporting.

Imagine you wanted to produce a report that showed all the titles in your library, and their authors. You could not use the AUTHOR_TITLE table on its own, as it would just return nonsense. Don’t let the fact that in this very simplistic example we could have used actual author and title names in this table fool you. Real-world relational databases quickly become very complex, and using numbers as primary keys is essential. So, to get a report containing data that makes sense to humans, we would need to tell the computer: for each record in the AUTHOR_TITLE table, go off and fetch the AuthorName for the AuthorID from the AUTHOR table – and while you are at it, do the same sort of thing for the titles, then give this data to me in a new table. So when you want to retrieve data from a relational database, there is no avoiding giving the computer complicated instructions on what it needs to do. These instructions are commonly encoded in a language called Structured Query Language, or SQL for short.
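In SQL, the instruction just described might look something like the following sketch. The table names follow our small library example, but the column names (such as TitleName) are illustrative; a real database would have its own names.

```sql
-- For each record in AUTHOR_TITLE, fetch the matching author
-- and title names, and return them together as a new table.
SELECT t.TitleName, a.AuthorName
FROM AUTHOR_TITLE jt
JOIN AUTHOR a ON a.AuthorID = jt.AuthorID
JOIN TITLE t ON t.TitleID = jt.TitleID
ORDER BY t.TitleName;
```

Notice that the query has to spell out exactly which keys join to which tables. Every extra table you want in the report means another JOIN to write.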

Now, you could get someone to write an SQL query to give you the report you need, and many organizations have databases where this is someone’s job. However, writing SQL can be quite labor intensive. If I wanted another report, but this time including publisher details, then the SQL query would need to be amended. In this example it would be relatively straightforward to extend the SQL query, but in many cases it will not be, because it will require data to be drawn from across new many-to-many relationships that the previous SQL did not have to worry about.

This is why data warehousing became popular. SQL still plays a role, but a data warehouse can be created to deliver a system where users can build the reports they need, without knowing how to do anything more than drag and drop dimensions and measures onto a crosstab.

Normal pivot tables are built to create reports on non-relational data. They are very good at this. They are flexible, fast, and reasonably intuitive – once you get used to them. They start to show their limitations very quickly, however, when it comes to handling relational data that has been flattened into a single table.

For example, say we tried to run a pivot over the SQL export from our small library database. The SQL would produce a file that looks exactly the same as the original file format we tried to use for data entry.

image

This file format is useless for data entry, but useful for data reporting. But it is not very useful when it comes to normal pivot tables. If we wanted to count the number of unique Titles from this table, there is no simple way of doing it. You could do something dodgy, like create a new column, and use the COUNTIF formula to return a value that you can then use to calculate the number of unique titles, such as the below:

image

This is fine, albeit clumsy, if you want to calculate the total number of unique titles. However, what if you wanted to count how many titles Jane Smith had written? She would get the nonsense number of 0.5. You could of course adjust for this by making a combined key of author and title, and running a COUNTIF on that key. However, if you take this dodgy approach you will soon find that you need an outrageous number of columns to cater for all the reports you require. Furthermore, none of your dodgy formulas will have scope over the other dodgy formulas. So this road is way too dodgy to drive down.
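For the record, the “dodgy” helper column described above might contain a formula like this (a sketch, assuming the titles sit in column B and the data runs from rows 2 to 7):

```
=1/COUNTIF($B$2:$B$7, B2)
```

Summing this column gives the number of unique titles, because each title’s fractional values add up to exactly one. But sum it over just Jane Smith’s rows and her one co-authored title contributes only half of its two rows – which is how she ends up with the nonsense value of 0.5.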

PowerPivot

The better solution is PowerPivot. It is a free add-on for Excel – and it has enormous functionality. I continue to be surprised by the number of professional staff who are unaware of the existence of PowerPivot – professionals that would benefit immensely from using it. PowerPivot has been built specifically for the purpose of reporting on denormalized relational data. In other words, it has been designed to report on data such as the above example, where relational data has been flattened into a single table. PowerPivot can also be used to build relationships between data, something I discuss further in the next chapter.

If we were to build a PowerPivot on our Author & Title dataset, all we would need to do is add a measure, say called DistinctTitles, and the formula would be =DISTINCTCOUNT([Title]). That’s it, nothing else required. And if we filtered our author to Jane Smith, we would see one distinct title. In other words, PowerPivot dynamically calculates values based on your current data view and the filters used.
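Written out in full, with the table name qualified (assuming the linked table is named LibraryTable, as in the examples later in this chapter), the measure would be:

```
DistinctTitles := DISTINCTCOUNT(LibraryTable[Title])
```

In the Measure Settings dialogue you enter the name and the formula in separate boxes; the “name := formula” layout above is just a common convention for writing a measure out on one line.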

The second fantastic feature of PowerPivot, is that it has a suite of formulas that are designed to iterate over every row of data. The closest thing in normal Excel are array formulas, but that is about as close as a Porsche 911 is to a Trabant.

The iterating PowerPivot formulas don’t just look at the current row; they loop over the entire table for each and every row. This gives you the power to write formulas that check through all the existing data, then do something with the current row based on what was found in all the other rows. This is one of those things where chances are you will have no idea of its potential value until you are actually faced with a problem that requires such functionality. If you are working with denormalized relational data, then you will very quickly find you need formulas that can iterate over an entire table.

I will attempt to explain this with an example. Say we returned to our small library database, and we wanted to do something simple like calculate the average number of authors per title.

image

Because our data has been denormalized, you will find this difficult using normal Excel, and even if you did come up with something, it would be static and would not produce sensible data when you slice and dice your data. The best way to understand how row iteration works is to imagine you are manually calculating the average number of authors per title. If you were to do this manually, you would start at the first row, then scan down the table to count the number of rows with the same title as the current row. So, for “Why Libraries are great” we find two rows, and therefore two authors. For the “Life of Turtles” we find one author, and for the “Libraries of the 22nd century” we find two authors. So that is two authors, one author, and two authors, giving a total of five author contributions across three titles. So the average number of authors per title is 1.67. If you followed the logic carefully, you will see that to arrive at this total, we had to loop through the entire table for each row of data in the table. PowerPivot formulas can do this, where normal Excel formulas cannot. The PowerPivot measure that you could use to do the above is:

=AVERAGEX(
    DISTINCT(LibraryTable[Title]),
    CALCULATE(
        COUNTROWS(LibraryTable),
        ALLEXCEPT(LibraryTable, LibraryTable[Title])
    )
)

This is going to mean nothing to you right now. But I challenge you to enter the simple library database table, and after you have finished reading the next chapter, see if you can write a measure that calculates the average authors per title. Even if you don’t get it right, I suspect that you will come very close to understanding the principle behind it, and if you come back to this formula after trying to do it yourself, the act of trying to solve it yourself will help you to learn. You might even find a much simpler and therefore better way of writing this formula.

How to use PowerPivot

You can download PowerPivot for free from the Microsoft website. The current (June 2015) URL is:

https://support.office.com/en-gb/article/Power-Pivot-Add-in-a9c2c6e2-cc49-4976-a7d7-40896795d045

PLEASE be aware that Microsoft can change their website at any time, so this URL might not work by the time you read this book.

Chances are you will not have administrator rights to your computer, so you will have to get your IT department to install PowerPivot. If they install it correctly, then the first time you start Excel you will see a message indicating that the PowerPivot add-in is loading, and a new menu ribbon for PowerPivot will appear.

When you click on this menu you will see a few unfamiliar options. The most unfamiliar will be the PowerPivot window. To access this window, click on the green PowerPivot icon on the far left of the ribbon. You have not yet linked PowerPivot to any data source, so you will just see a blank gray screen in the PowerPivot window.

You can link PowerPivot to many sources. For example, you can link it to text files, to Excel tables, to database tables, to database queries, and to data feeds. To link to any of these sources usually only takes a couple of clicks, and is very simple. Once data has been linked to the PowerPivot model there are four main things you can do:

• add calculated columns to the linked table

• join linked tables

• create pivot tables

• create measures for the pivot table

I will use a very simplistic Author and Title dataset to show how you can create a PowerPivot. The simplest way to link data in an Excel spreadsheet is to convert your data into a table, then create a “Linked Table.”

To convert your data to a table, select the data you wish to convert, click on the Insert menu, then click on the “Table” icon. Once you have converted a range to a table, it will look like the following:

image

It is very good practice to name your tables something sensible, and to do this before you attempt to link PowerPivot to your table. PowerPivot can be a bit like a temperamental high end sports car at times – a very high performer, but prone to failure if not handled carefully! Notice I have named this table “LibraryTable” (see top left of above image).

Once you have converted your range to a table, and you are confident that you will not need to change the table name, the next step is to link the table to the PowerPivot model. To do this, just click on the PowerPivot ribbon, click on your table, then click on the icon “Create Linked Table.” As soon as you do this the PowerPivot window will open, and you will see your table.

image

Adding calculated columns

In the PowerPivot window you will see next to your data a blank column, with the header “Add Column.” To add a column you can double click on the “Add Column” header to change its title, then in any of the cells below, enter a formula.

image

The formulas that you use in PowerPivot are very different to the ones you use in normal Excel. Where there are similarities, these can be deceptive, as the similarities can confuse you when you encounter something that does not work as you would expect. So the very first step to using PowerPivot is an emotional one, you need to accept that it is different, stop looking for the similarities with normal Excel, and expect that it will take you a little bit of time to learn.

One of the first, and perhaps most fundamental, differences is in how you enter formulas. In normal Excel your formulas can refer to anything from tables, to ranges, to individual cells. PowerPivot formulas share more in common with SQL than with normal Excel, and they only refer to tables and columns. There is no cell A1, and you cannot type =A1+B3. The more proficient you are at normal Excel, the more time it is going to take you to adjust to this!
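For example, a calculated column that labels each row with its author and title refers only to whole columns, and PowerPivot evaluates it once for every row. A sketch, assuming the LibraryTable and the column names used in our small library example:

```
=LibraryTable[Author] & ", " & LibraryTable[Title]
```

There is no cell reference anywhere in the formula; the “current row” is implied by the fact that it is a calculated column.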

Creating a PowerPivot PivotTable

That’s a bit of a mouthful! You can create a PivotTable either via the PowerPivot window, on the Home menu in the center of the ribbon – or alternatively you can create a pivot via the main Excel screen. To insert a pivot via the main Excel screen, click on the PowerPivot menu, then click on insert PivotTable. After you have inserted a PivotTable, you will see an empty pivot, and a field list, much like a regular PivotTable.

image

The top window includes all the tables and their fields and measures. You have only linked one table to the PowerPivot model, the “LibraryTable.” Obviously, if you have not linked a table to PowerPivot, then it will not appear in the model. The measures, which will be discussed below, have a calculator icon next to them. In this example, AverageAuthorsPerTitle is a measure. You will not see this measure yet, as you have not added it. Just like normal pivots, you can drag and drop fields into the Slicers, Report Filters, Column Labels, Row Labels, or Values boxes. Measures are more restrictive: you cannot put them anywhere except in the Values box. Depending upon how you set up your PivotTable options, you may also drag and drop directly onto the PivotTable. The Slicers function in PowerPivot is much easier to manage compared to normal Excel.

The difference between a measure and a calculated column

There will be some occasions where you could add a calculated column to the values section of the PivotTable, which raises the question: why have measures? The answer is that a calculated column is static, whereas a measure can be dynamic. Sometimes it is OK for data to be static, sometimes it is not. For example, if you wanted to count the number of books an author published in a specific period, and you want to be able to change that period at will, from say months to years to decades, then you will probably need a measure. A measure can calculate a value based on the current filters you have applied to the data, and the currently active row and column fields. A calculated column, however, is not nearly as flexible. Say, for example, taking the simple library database we used earlier, we wanted to count the number of titles produced by each author. You could include a calculated column that counts the number of times an author appears in our table, and a calculated column to create a publication frequency distribution. But what happens if, for example, you wanted to slice and dice the data, and only look at publications within a certain period, or by the subject matter of the book (obviously our simple table has to grow in our imagination to accommodate this example)? Sure, you could filter the data down to the dataset you want to see. However, if you do this you will only see rubbish data. This is because when we include a calculated column, the value it returns does not change for that row – it does not take into account any filters that the user may have applied. If Joe Bloggs has authored two books, then the calculated column will return a value of two for each row of data relating to Joe.
So, whenever we use this calculated column as a row label, and we count the number of distinct (unique) authors in each frequency, Joe will either be added to the frequency of two titles, or he will be excluded altogether, because we had filtered him out of the PivotTable. But what if he produced one book in 2013, and another in 2014, and we wanted to count the number of books Joe published in 2014? There is no way to get the value of one from the existing data. We could add another calculated column that counts frequency of publication by year. However, this calculated column would also be static, which means if we wanted to slice and dice the data by another field we would run into similar problems. So, measures are dynamic; calculated columns are static.
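The contrast can be sketched in DAX. Again this assumes our LibraryTable, with a hypothetical [Year] column added to accommodate the publication-date example:

```
// Calculated column: the value is stored per row when the
// column is calculated, so it ignores any filters later
// applied in the PivotTable. Joe's rows always show 2.
=CALCULATE(
    COUNTROWS(LibraryTable),
    ALLEXCEPT(LibraryTable, LibraryTable[Author])
)

// Measure: re-evaluated against the current filter context,
// so with the pivot filtered to LibraryTable[Year] = 2014,
// Joe Bloggs would show a count of one.
TitlesByAuthor := DISTINCTCOUNT(LibraryTable[Title])
```

Same underlying question, two very different behaviors once the user starts slicing and dicing.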

Now you might be asking: why bother with calculated columns then? Why not just use measures? The answer is that you can only use measures as measures, i.e., as values. If you don’t believe me, try dragging a measure into a row or column on a PivotTable.

Adding a measure

There are a couple of ways to add a measure. You can do it via the PowerPivot ribbon on the main Excel screen, where there is an icon called “New Measure.” Alternatively, you can right click over the relevant table in the PowerPivot Field List, and select “Add New Measure.” When you do this, the Measure Settings dialogue box as shown in the next figure will open:

image

I would like to share a couple of tips that I have found useful. You can zoom in on the formula by clicking on the formula window and using the middle mouse wheel. This is very helpful for people like me with a few decades of screen fatigue under our belts. While you are in the formula window, also try adding a new line. You can do this by pressing the “Alt” and “Enter” keys simultaneously. The formula bar shows the formula syntax, and shows a list of fields as you start typing. Remember that you can set your number format in this dialogue box too.

Rather than outline generic formulas, I will take you through to the next chapter, which will demonstrate how to create a desktop library cube using PowerPivot.
