Chapter 2
In This Chapter
Understanding the best data modeling practices
Leveraging Excel functions to deliver data
Creating smart tables that expand with data
One of Excel’s most attractive features is its flexibility. You can create an intricate system of interlocking calculations, linked cells, and formatted summaries that work together to create a final analysis. However, years of experience have brought me face to face with an ugly truth: Excel is like the cool gym teacher who lets you do anything you want — the freedom can be fun, but a lack of structure in your data models can lead to some serious headaches in the long run.
What’s a data model? A data model provides the foundation upon which your reporting mechanism is built. When you build a spreadsheet that imports, aggregates, and shapes data, you’re essentially building a data model that feeds your dashboards and reports.
Creating a poorly designed data model can mean hours of manual labor maintaining and refreshing your reporting mechanisms. On the other hand, creating an effective model allows you to easily repeat monthly reporting processes without damaging your reports or your sanity.
The goal of this chapter is to show you the concepts and techniques that help you build effective data models. In this chapter, you discover that creating a successful reporting mechanism requires more than slapping data onto a spreadsheet. Although you see how to build cool dashboard components in later chapters, those components won’t do you any good if you can’t effectively manage your data models. On that note, let’s get started.
Building an effective model isn’t as complicated as you may think. It’s primarily a matter of thinking about your reporting processes differently. Most people spend very little time thinking about the supporting data model behind a reporting process. If they think about it at all, they usually start by imagining a mock-up of the finished dashboard and work backward from there.
Rather than see only the finished dashboard in your head, try to think of the end-to-end process. Where will you get the data? How should the data be structured? What analysis will need to be performed? How will the data be fed to the dashboard? How will the dashboard be refreshed?
Obviously, the answers to these questions are highly situation-specific. However, some data modeling best practices will guide you to a new way of thinking about your reporting process. These are discussed in the next few sections.
One of the most important concepts in a data model is the separation of data, analysis, and presentation. The fundamental idea is that you don’t want your data to become too tied into any one particular way of presenting that data.
To wrap your mind around this concept, think about an invoice. When you receive an invoice, you don’t assume that the financial data on the invoice is the true source of your data. It’s merely a presentation of data that’s actually stored in a database. That data can be analyzed and presented to you in many other manners: in charts, in tables, or even on websites. This sounds obvious, but Excel users often fuse data, analysis, and presentation.
For instance, I’ve seen Excel workbooks that contain 12 tabs, each representing a month. On each tab, data for that month is listed along with formulas, pivot tables, and summaries. Now what happens when you’re asked to provide a summary by quarter? Do you add more formulas and tabs to consolidate the data on each of the month tabs? The fundamental problem in this scenario is that the tabs actually represent data values that are fused into the presentation of your analysis.
For an example more in line with reporting, take a look at Figure 2-1. Hard-coded tables like this one are common. This table is an amalgamation of data, analysis, and presentation. Not only does this table tie you to a specific analysis, but there’s little to no transparency into what the analysis exactly consists of. Also, what happens when you need to report by quarter or when another dimension of analysis is needed? Do you import a table that consists of more columns and rows? How does that affect your model?
The alternative is to create three layers in your data model: a data layer, an analysis layer, and a presentation layer. You can think of these layers as three different spreadsheets in an Excel workbook: one sheet to hold the raw data that feeds your report, one sheet to serve as a staging area where the data is analyzed and shaped, and one sheet to serve as the presentation layer. Figure 2-2 illustrates the three layers of an effective data model.
As you can see in Figure 2-2, the raw dataset is located on its own sheet. Although the dataset has some level of aggregation applied to keep it manageably small, no further analysis is done on the Data sheet.
The analysis layer consists primarily of formulas that analyze and pull data from the data layer into formatted tables commonly referred to as staging tables. These staging tables ultimately feed the reporting components in your presentation layer. In short, the sheet that contains the analysis layer becomes the staging area where data is summarized and shaped to feed the reporting components. Notice on the Analysis tab in Figure 2-2, the formula bar illustrates that the table consists of formulas that reference the Data tab.
There are a couple of benefits to this setup. First, the entire reporting model can be refreshed easily by simply replacing the raw data with an updated dataset. The formulas on the Analysis tab continue to work with the latest data. Second, any additional analysis can easily be created by using different combinations of formulas on the Analysis tab. If you need data that doesn’t exist in the Data sheet, you can easily append a column to the end of the raw dataset without disturbing the Analysis or Presentation sheets.
Along those same lines, remember that you’re not limited to three spreadsheets, either. That is to say, you can have several sheets that provide the raw data, several sheets that analyze, and several that serve as the presentation layer.
Wherever you choose to place the different layers, keep in mind that the idea remains the same. The analysis layer should primarily consist of formulas that pull data from the Data sheets into staging tables used to feed your presentation. Later in this chapter, you explore some of the formulas that can be used in your analysis sheets
Not all datasets are created equal. Although some datasets work in a standard Excel environment, they may not work for data modeling purposes. Before building your data model, ensure that your source data is appropriately structured for dashboarding purposes.
At the risk of oversimplification, I assert that datasets typically used in Excel come in three fundamental forms:
The punch line is that only flat data files and tabular datasets make for effective data models. I review and discuss each of these different forms in the next few sections.
Spreadsheet reports display highly formatted, summarized data and are often designed as presentation tools for management or executive users. A typical spreadsheet report makes judicious use of empty space for formatting, repeats data for aesthetic purposes, and presents only high-level analysis. Figure 2-3 illustrates a spreadsheet report.
Although a spreadsheet report may look nice, it doesn’t make for an effective data model. Why? The primary reason is that these reports offer you no separation of data, analysis, and presentation. You’re essentially locked into one analysis.
Although you could make charts from the report shown in Figure 2-3, it’d be impractical to apply any analysis outside what’s already there. For instance, how would you calculate and present the average of all bike sales using this particular report? How would you calculate a list of the top ten best-performing markets?
With this setup, you’re forced into very manual processes that are difficult to maintain month after month. Any analysis outside the high-level ones already in the report is basic at best — even with fancy formulas. Furthermore, what happens when you’re required to show bike sales by month? When your data model requires analysis with data that isn’t in the spreadsheet report, you’re forced to search for another dataset.
Another type of file format is a flat file. Flat files are data repositories organized by row and column. Each row corresponds to a set of data elements, or a record. Each column is a field. A field corresponds to a unique data element in a record. Figure 2-4 contains the same data as the report in Figure 2-3 but expressed in a flat data file format.
Notice that every data field has a column, and every column corresponds to one data element. Furthermore, there’s no extra spacing, and each row (or record) corresponds to a unique set of information. But the key attribute that makes this a flat file is that no single field uniquely identifies a record. In fact, you’d have to specify four separate fields (Region, Market, Business Segment, and a month’s sales amount) before you could uniquely identify the record.
Flat files lend themselves nicely to data modeling in Excel because they can be detailed enough to hold the data you need and still be conducive to a wide array of analysis with simple formulas — SUM, AVERAGE, VLOOKUP, and SUMIF, just to name a few. Later in this chapter, you explore formulas that come in handy in a reporting data model.
Many effective data models are driven primarily by pivot tables. Pivot tables (which I cover in Chapter 6) are Excel’s premier analysis tools. For those of you who have used pivot tables, you know they offer an excellent way to summarize and shape data for use by reporting components, such as charts and tables.
Tabular datasets are ideal for pivot table–driven data models. Figure 2-5 illustrates a tabular dataset. Note that the primary difference between a tabular dataset, as shown in Figure 2-5, and a flat data file is that in tabular datasets the column labels don’t double as actual data. For instance, in Figure 2-4, the month identifiers are integrated into the column labels. In Figure 2-5, the Sales Period column contains the month identifier. This subtle difference in structure is what makes tabular datasets optimal data sources for pivot tables. This structure ensures that key pivot table functions, such as sorting and grouping, work the way they should.
The attributes of a tabular dataset are as follows:
In Chapter 1, you might have read that measures used on a dashboard should absolutely support the initial purpose of that dashboard. The same concept applies to the back-end data model. You should only import data that’s necessary to fulfill the purpose of your dashboard or report.
In an effort to have as much data as possible at their fingertips, many Excel users bring into their spreadsheets every piece of data they can get their hands on. You can spot these people by the 40-megabyte files they send through email. You’ve seen these spreadsheets — two tabs that contain some reporting or dashboard interface and then six hidden tabs that contain thousands of lines of data (most of which isn’t used). They essentially build a database in their spreadsheet.
What’s wrong with utilizing as much data as possible? Well, here are a few issues:
These are all issues that can be avoided by importing only aggregated and summarized data that’s useful to the core purpose of your reporting needs.
Wanting to keep your data model limited to one worksheet tab is natural. In my mind, keeping track of one tab is much simpler than using different tabs. However, limiting your data model to one tab has its drawbacks, including the following:
You can include any information you think appropriate in your model map. The idea is to give yourself a handy reference that guides you through the elements in your data model.
This best practice is simple. Make sure your data model does what it’s supposed to do before building dashboard components on top of it. In that vein, here are a few things to watch for:
The obvious goal here is to eliminate easily avoidable errors that may cause complications later.
As you discover in this chapter, the optimal data model for any reporting mechanism is one in which data, analysis, and presentation are separated into three layers. Although all three layers are important, the analysis layer is where the real art comes into play. The fundamental task of the analysis layer is to pull information from the data layer and then create staging tables that feed your charts, tables, and other reporting components. To do this effectively, you need to employ formulas that serve as data delivery mechanisms — formulas that deliver data to a destination range.
You see, the information you need lives in the data layer (typically, a table containing aggregated data). Data delivery formulas are designed to get that data and deliver it to the analysis layer so it can be analyzed and shaped. The cool thing is that after you’ve set up the data delivery formulas, the analysis layer automatically updates each time the data layer is refreshed.
Confused? Don’t worry — in this section, I show you a few Excel functions that work particularly well in data delivery formulas. As you complete the examples here, you’ll start to see how these concepts come together.
The VLOOKUP function is the king of all lookup functions in Excel. I’d be willing to bet you’ve at least heard of VLOOKUP, if not used it a few times yourself. The purpose of VLOOKUP is to find a specific value from a column of data where the leftmost row value matches a given criterion.
Take a look at Figure 2-7 to get the general idea. The table on the left shows sales by month and product number. The bottom table translates those product numbers to actual product names. The VLOOKUP function can help in associating the appropriate name to each respective product number.
To understand how VLOOKUP formulas work, take a moment to review the basic syntax. A VLOOKUP formula requires four arguments:
VLOOKUP(Lookup_value, Table_array, Col_index_num, Range_lookup)
As you can imagine, there are countless ways to apply a VLOOKUP in all kinds of analyses. Let’s take a moment to walk through a scenario where using a VLOOKUP can help enhance your dashboard model.
With a few VLOOKUP formulas and a simple drop-down list, you can create a data model that not only delivers data to the appropriate staging table, but also allows you to dynamically change data views based on a selection you make. Figure 2-8 illustrates the setup.
The data layer in the model shown in Figure 2-8 resides in the range A9:F209. The analysis layer is held in range E2:F6. The data layer consists of all formulas that extract and shape the data as needed. As you can see, the VLOOKUP formulas use the Customer Name value in cell C3 to look up the appropriate data from the data layer. So if you entered Chevron in cell C3, the VLOOKUP formulas would extract the data for Chevron.
Well, remember that the idea behind separating the data layer and the analysis layer is so that the analysis layer can be automatically updated when the data is refreshed. When you get new data next month, you should be able to simply replace the data layer in the model without having to rework the analysis layer. Allowing for more rows than necessary in your VLOOKUP formulas ensures that if the data layer grows, records won’t fall outside the lookup range of the formulas.
Later in this chapter, I show you how to automatically keep up with growing data tables by using smart tables.
In the example illustrated in Figure 2-8, the data model allows you to select customer names from a drop-down list when you click cell C3. The customer name serves as the lookup value for the VLOOKUP formulas. Changing the customer name extracts a new set of data from the data layer. This allows you to quickly switch from one customer to another without having to remember and type the customer name.
Now, as cool as this seems, the reasons for this setup aren’t all cosmetic. There are practical reasons for adding drop-down lists to your data models.
Many of your models consist of multiple analytical layers in which each shows a different set of analyses. Although each analysis layer is different, they often need to revolve around a shared dimension, such as the same customer name, the same market, or the same region. For instance, when you have a data model that reports on Financials, Labor Statistics, and Operational Volumes, you want to make certain that when the model is reporting financials for the South region, the Labor Statistics are for the South region as well.
An effective way to ensure this happens is to force your formulas to use the same dimension references. If cell C3 is where you switch customers, every analysis that is customer-dependent should reference cell C3. Drop-down lists allow you to have a predefined list of valid variables located in a single cell. With a drop-down list, you can easily switch dimensions while building and testing multiple analysis layers.
Adding a drop-down list is relatively easy with Excel’s Data Validation functionality. To add a drop-down list, follow these steps:
In the Source input box, reference the range of cells that contain your predefined selection list.
In our example, this would be the list of customers you want exposed through the dashboard.
The HLOOKUP function is the less popular cousin of the VLOOKUP function. The H in HLOOKUP stands for horizontal. Because Excel data is typically vertically oriented, most situations require a vertical lookup, or VLOOKUP. However, some data structures are horizontally oriented, requiring a horizontal lookup; thus, the HLOOKUP function comes in handy. The HLOOKUP searches a lookup table to find a single value from a row of data where the column label matches a given criterion.
Figure 2-10 demonstrates a typical scenario in which HLOOKUP formulas are used. The table in C5 requires quarter-end numbers (March and June) for 2011. The HLOOKUP formulas use the column labels to find the correct month columns and then locate the 2011 data by moving down the appropriate number of rows. In this case, 2011 data is in row 4, so the number 4 is used in the formulas.
To get your mind around how this works, take a look at the basic syntax of the HLOOKUP function.
HLOOKUP(Lookup_value, Table_array, Row_index_num, Range_lookup)
HLOOKUPs are especially handy for shaping data into structures appropriate for charting or other types of reporting. A simple example is demonstrated in Figure 2-11. With HLOOKUPs, the data shown in the raw data table at the bottom of the figure is reoriented in a staging table at the top. When the raw data is changed or refreshed, the staging table captures the changes.
The SUMPRODUCT function is actually listed under the math and trigonometry category of Excel functions. Because the primary purpose of SUMPRODUCT is to calculate the sum product, most people don’t know you can actually use it to look up values. In fact, you can use this versatile function quite effectively in most data models.
The SUMPRODUCT function is designed to multiply values from two or more ranges of data and then add the results together to return the sum of the products. Take a look at Figure 2-12 to see a typical scenario in which the SUMPRODUCT is useful.
In Figure 2-12, you see a common analysis in which you need the total sales for the years 2011 and 2012. As you can see, to get the total sales for each year, you first have to multiply Price by the number of Units to get the total for each Region. Then you have to sum those results to get the total sales for each year.
With the SUMPRODUCT function, you can perform the two-step analysis with just one formula. Figure 2-13 shows the same analysis with SUMPRODUCT formulas. Rather than use 11 formulas, you can accomplish the same analysis with just 3!
The syntax of the SUMPRODUCT function is fairly simple:
SUMPRODUCT(Array1, Array2, …)
Array: Array represents a range of data. You can use anywhere from 2 to 255 arrays in a SUMPRODUCT formula. The arrays are multiplied together and then added. The only hard-and-fast rule you have to remember is that all arrays must have the same number of values. That is to say, you can’t use the SUMPRODUCT if range X has 10 values and range Y has 11 values. Otherwise, you get the #VALUE! error.
The interesting thing about the SUMPRODUCT function is that it can be used to filter out values. Take a look at Figure 2-14 to see what I mean.
The formula in cell E12 is pulling the sum of total units for just the North region. Meanwhile, cell E13 is pulling the units logged for the North region in the year 2011.
To understand how this works, take a look at the formula in cell E12, shown in Figure 2-14. That formula reads SUMPRODUCT((C3:C10="North")*(E3:E10)).
In Excel, TRUE evaluates to 1 and FALSE evaluates to 0. Every value in column C that equals North evaluates to TRUE or 1. Where the value is not North, it evaluates to FALSE or 0. The part of the formula that reads (C3:C10="North") enumerates through each value in the range C3:C10, assigning a 1 or 0 to each value. Then internally, the SUMPRODUCT formula translates to
(1*E3)+(0*E4)+(0*E5)+(0*E6)+(1*E7)+(0*E8)+(0*E9)+(0*E10).
This gives you the answer of 1628 because
(1*751)+(0*483)+(0*789)+(0*932)+(1*877)+(0*162)+(0*258)+(0*517)
equals 1628.
As always in Excel, you don’t have to hard-code the criteria in your formulas. Rather than explicitly use "
North"
in the SUMPRODUCT formula, you could reference a cell that contains the filter value. You can imagine that cell A3 contains the word North, in which case you can use (C3:C10=A3) instead of (C3:C10="North"). This way, you can dynamically change your filter criteria, and your formula keeps up.
Figure 2-15 demonstrates how you can use this concept to pull data into a staging table based on multiple criteria. Note that each of the SUMPRODUCT formulas shown here references cells B3 and C3 to filter on Account and Product Line. Again, you can add data validation drop-down lists to cells B3 and C3, allowing you to easily change criteria.
The CHOOSE function returns a value from a specified list of values based on a specified position number. For instance, if you enter the formulas CHOOSE(3, “Red”, “Yellow”, “Green”, “Blue”) into a cell, Excel returns Green because Green is the third item in the list of values. The formula CHOOSE(1, “Red”, “Yellow”, “Green”, “Blue”) would return Red. Although this may not look useful on the surface, the CHOOSE function can dramatically enhance your data models.
Figure 2-16 illustrates how CHOOSE formulas can help pinpoint and extract numbers from a range of cells. Note that instead of using hard-coded values, like Red, Green, and so on, you can use cell references to list the choices.
Take a moment to review the basic syntax of the CHOOSE function:
CHOOSE(Index_num, Value1, Value2, …)
Index_num
: The Index_num argument specifies the position number of the chosen value in the list of values. If the third value in the list is needed, the Index_num is 3. The Index_num argument must be an integer between one and the maximum number of values in the defined list of values. That is to say, if there are ten choices defined in the CHOOSE formula, the Index_num argument can’t be more than ten.V
alue argument represents a choice in the defined list of choices for that CHOOSE formula. The Value> arguments can be hard-coded values, cell references, defined names, formulas, or functions. You can have up to 255 choices listed in your CHOOSE formulas.The CHOOSE function is especially valuable in data models in which multiple layers of data need to be brought together. Figure 2-17 illustrates an example in which CHOOSE formulas help pull data together.
In this example, you have two data tables: one for Revenues and one for Net Income. Each contains numbers for separate regions. The idea is to create a staging table that pulls data from both tables so that the data corresponds to a selected region.
To understand what’s going on, focus on the formula in cell F3, shown in Figure 2-17. The formula is CHOOSE($C$2,F7,F8,F9,F10). The Index_num argument is actually a cell reference that looks at the value in cell C2, which happens to be the number 2. As you can see, cell C2 is actually a VLOOKUP formula that pulls the appropriate index number for the selected region. The list of defined choices in the CHOOSE formula is essentially the cell references that make up the revenue values for each region: F7, F8, F9, and F10. So the formula in cell F3 translates to CHOOSE(2, 27474, 41767, 18911, 10590). The answer is 41,767.
One of the challenges you can encounter when building data models is a data table that expands over time. That is to say, the table grows in the number of records it holds due to new data being added. To get a basic understanding of this challenge, take a look at Figure 2-18. In this figure, you see a simple table that serves as the source for the chart. Notice that the table lists data for January through June.
Imagine that next month, this table expands to include July data. You’ll have to manually update your chart to include July data. Now imagine you had this same issue across your data model, with multiple data tables that link to multiple staging tables and dashboard components. You can imagine it’d be an extremely painful task to keep up with changes each month.
To solve this issue, you can use Excel’s Table feature (you can tell they spent all night coming up with that name). The Table feature allows you to convert a range of data into a defined table that’s treated independently of other rows and columns on the worksheet. After a range is converted to a table, Excel views the individual cells in the table as a single object with functionality that a typical data range doesn’t have.
For instance, Excel tables offer the following features:
To convert a range of data to an Excel table, follow these steps:
On the Insert tab of the Ribbon, click the Table button.
This step opens the Create Table dialog box, as shown in Figure 2-19.
After the conversion takes place, notice a few small changes. Excel has put autofilter drop-downs on the header rows, the rows in the table now have alternate shading, and any header that didn’t have a value has been named by Excel.
You can use Excel tables as the source for charts, pivot tables, list boxes, or anything else for which you’d typically use a data range. In Figure 2-20, a chart has been linked to the Excel table.
Here’s the impressive bit. When data is added to the table, Excel automatically expands the range of the table and incorporates the new range into any linked object. That’s just a fancy way of saying that any chart or pivot table tied to an Excel table automatically captures new data without manual intervention.
For example, if I add July and August data to the end of the Excel table, the chart automatically updates to capture the new data. In Figure 2-21, I added July with no data and August with data to show you that the chart captures any new records and automatically plots the data given.
Take a moment to think about what Excel tables mean to a data model. They mean pivot tables that never have to be reconfigured, charts that automatically capture new data, and ranges that automatically keep up with changes.
If you want to convert an Excel table back to a range, you can follow these steps:
3.22.66.140