This chapter discusses several Excel functions and functionalities that can assist in the analysis of data sets, including:
The chapter does not deal with the linking of Excel to external data sets, such as an Access database or SQL servers, or other options that are contained on the Data tab; such techniques are outside the scope of this text.
In practice one will often have to manipulate or clean initial data before it can be used for analysis and reporting purposes. There are many techniques that may need to be applied in such cases, including:
In general when analysing data sets, one has two generic approaches to creating summary reports:
When performing complex database queries (based on multiple underlying criteria or on unusual criteria), one has a choice as to whether to apply such criteria at the reporting level only (whilst leaving the data set unchanged), or whether to add a new calculated field to the data set, with such a field being used to identify whether the more complex set of criteria has been met (e.g. by producing a flag field that is 0 or 1, accordingly). This latter approach is often simpler and more transparent than creating a complex criteria combination at the query level. Of course, it increases the size of the data set, so that where many such combinations are required, it can be more efficient (or only possible) to build them within the reporting structure instead.
In some cases, a combined approach may be necessary. For example, if one of the criteria for a query were the length of the country name, or the third letter of the name, then it would likely be necessary to create a new field explicitly in the database (e.g. using the LEN or MID functions).
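As a sketch of such a flag field (the cell references and the specific tests are purely illustrative, not taken from the example files):

```excel
=IF(LEN(B5)=5, 1, 0)          flags records whose country name has five letters
=IF(MID(B5,3,1)="a", 1, 0)    flags records whose country name has "a" as its third letter
```

The resulting 0/1 flag can then be used directly as a simple criterion at the reporting level (e.g. within a filter or a SUMIFS function).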
An Excel database is a contiguous range of cells in which each row contains data relating to properties of an item (e.g. date of birth, address, telephone number of people), with a set of column headers that identify the properties. The column structure of a database differentiates it from general Excel data sets, in which formulae may be applied to either rows or columns.
Note also that it is generally important for the field names (headers) to be in a single cell, and for a structured approach to be used to create them (especially for larger databases). The frequently used approach in which categories and sub-categories are used to define headers over two or more rows is generally not the most appropriate; rather, field headers should be in a single row. For example, instead of having one “header” row and a “sub-header” row beneath it, the header and sub-header fields can be combined in a structured way. This is illustrated in Figure 26.1, in which the range B2:I3 contains the (inappropriate) two-row approach, and in Figure 26.2, where the range B3:I3 contains the adapted single-row approach.
An Excel Table is a database that has been explicitly defined as a Table, using any of:
During the process of creating a Table, one has the option to define a name for it explicitly; otherwise Excel will automatically assign one (such as Table1).
The key properties (and benefits) of Tables are:
(Once a Table has been defined, it can be converted back to a range by right-clicking on it and using Table Tools/Convert to Range, or by using the Convert-to-Range button on the Design tab that appears when selecting any point in the table.)
In many real-life applications, the techniques in this chapter (and general techniques in data manipulation and analysis) become much more effective when automated with VBA macros. For example:
Examples of these techniques are discussed in Chapter 32.
Filters can be applied by simply selecting any cell in the database range and clicking on the Data/Filter icon. Note that when doing so it is important to ensure that the field headers are included (or the top row will by default be treated as containing the headers).
Further, one should define the data set within a single isolated range with (apart from the headers) no adjoining labels, and have no completely blank rows or columns within the intended data set. In this case, the default data set will be the Current Region of the selected point, i.e. the maximum-sized rectangular region defined by following all paths of contiguity from the original point. In practical terms, all rows above and below are included up to the point at which a first blank row is encountered (and similarly for columns), so that it is important to ensure that there are no blank rows in the data set, and that there are no additional labels immediately below the data set or above the field headings (and similarly in the column direction). Note that the Current Region of the selected point can rapidly be seen by using the short-cut Ctrl+Shift+*.
The file Ch26.1.DaysLateAnalysis.Initial.xlsx shows an example of a filter applied to a data set. Note that drop-down menus are automatically added to the headers, with these drop-downs providing sorting options as well as field-sensitive filtering options. The list of filtering options for a given field (e.g. the Country field) provides the ability to quickly inspect for errors (e.g. spelling errors, or negative numbers in contexts where all numbers should be positive), and to see whether blank items are present. Figure 26.3 shows an example of the initial data set, and Figure 26.4 shows the results of applying, within the Country field, filters that select only the blanks or the incorrect spellings of the country Italy.
Note that potential disadvantages of this standard filter approach are that:
An easy way to remove all filters is with the short-cut Alt-D-F-S.
Note that one needs to take extra care when applying most Excel functions to a filtered data set, as the results may be unexpected or not intuitive. For example, for most regular Excel functions (e.g. SUM, COUNT), the result is the same whether the data is visible or not and whether it is filtered or not. On the other hand, the SUBTOTAL function has options to include or ignore hidden data (see Chapter 17), but these options may have unexpected effects when the data is filtered. Once a filter is applied, a row that is hidden (either before or after the application of the filter) is treated as if it were filtered out (rather than being not filtered out but simply hidden).
Figure 26.5 shows the functions applied to a filtered data set, and Figure 26.6 shows the same, except that one of the rows has been hidden. In each case, the SUBTOTAL function returns the same result, which is that based on the visible data, so that the function options regarding hidden data seem to have no effect in Figure 26.6. The reader can verify in the Excel file that, with the filters removed, the hiding of a row will lead to the two SUBTOTAL functions giving different results.
Thus, it is generally advisable not to apply Excel functions to filtered data. Rather, the use of Database functions, or of other approaches in which criteria are made explicit within the functions (such as using SUMIFS), is to be preferred.
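For example (with hypothetical ranges), the two forms of the SUBTOTAL function differ only in their treatment of manually hidden rows, whereas SUMIFS makes the criteria explicit and is unaffected by filtering or hiding:

```excel
=SUBTOTAL(9, D5:D104)       sums the range, including manually hidden rows
=SUBTOTAL(109, D5:D104)     sums the range, ignoring manually hidden rows
=SUMIFS(D5:D104, B5:B104, "Germany")    explicit criterion, independent of any filter
```

Once a filter is applied, both SUBTOTAL variants ignore the filtered-out rows, which is why the hidden-data options appear to have no effect when filters are present.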
In order to analyse a data set, one may also need to identify the unique list of items within a category (such as country names and customers), or a unique list of combined items (such as unique country-customer combinations). The use of Excel's Remove Duplicates (on the Data tab) is an important functionality in this respect, enabling one to generate the actual list of unique items.
The file Ch26.2.DaysLateAnalysis.UniqueItems.xlsx shows an example. In the first step, one copies all the data in the relevant column (such as the country names) to another area not contiguous with the data set (perhaps using the short-cut Ctrl+Shift+↓ to select the data). The Remove Duplicates menu is then used to create a list of unique items (taking care as to whether the header field has been copied or not). Figure 26.7 shows the initial steps of this process, applying Remove Duplicates to the copied set of Country data, with the result shown in Figure 26.8.
The identification of unique combinations can be done in one of two ways:
The file Ch26.3.DaysLateAnalysis.UniqueCombinations.xlsx shows the results of using the second approach. Figure 26.9 shows the initial steps in its application (i.e. the copying of the relevant data sets and the application of the Remove Duplicates menu). Note that, if working with a larger data set (for example, if the whole data set had been copied, so that the columns for the date or amount would not be relevant for the definition of unique items), the check-boxes that show within the Remove Duplicates dialog for these items would simply be left unticked. Figure 26.10 shows the result of the process.
When cleaning or tidying data sets, one may wish to simply delete some rows, including:
As noted above, the use of the short-cut Ctrl+Shift+* to select the data would not select anything below the first blank row, so that if the presence of blank rows within the data set is possible, this shortcut should not be used; rather, the data set should be selected explicitly in its entirety before applying filters.
The file Ch26.4.DaysLateAnalysis.DeleteUsingFilter.xlsx shows an example. The main steps are:
Let us assume that it is desired to remove all records for which the Country field is either blank or refers to Italy (both of which are assumed to be irrelevant for all subsequent analysis purposes). Figure 26.11 shows the part of the process in which the data has been filtered and the filtered rows selected in preparation for their deletion. Figure 26.12 shows the result once these rows have been deleted and the filters removed, thus giving the cleaned data set.
Note that this process can be automated relatively easily with VBA macros. For example, when the macro is run, the user could be asked to click on a specific cell which contains the identifier of the items to be deleted within that field (e.g. a blank cell, or a cell containing the word Italy), with the macro taking care of all other steps behind the scenes. In this way, data sets can be rapidly cleaned: the copying, filtering and removal of filters is embedded as part of the overall process, and one can run such a process many times very quickly.
In some cases, one may wish to use filters to extract items rather than to delete them. In such a case, the initial part of the process can be performed as above (up to that shown in Figure 26.11), whereas the latter part would involve copying the data to a new range, rather than deleting it. This is in principle a very straightforward and standard operation if basic Excel elements are respected (such as ensuring that any copy operation does not overwrite the original data). Note that Excel will replace any calculated items with their values to ensure integrity of the process.
Although Excel provides several in-built options regarding filters, it is often more convenient to add a new criterion to the data set.
The file Ch26.5.DaysLateAnalysis.NewCriteria.xlsx shows an example in which it is desired to identify and delete all rows in which two or more field elements are blank. A new column has been added which uses the COUNTBLANK function to calculate the number of blank fields (see Figure 26.13). For simplicity of presentation, Conditional Formatting has been applied to highlight those cells where the identified number of blanks in the row is equal to the threshold figure (defined in Cell F2); in other words, it identifies the rows that one may wish to delete. Note that (since the original data set in the example was not explicitly defined as an Excel Table), once this column is added, the Data/Filter menu will need to be reapplied to ensure that a filter is included in the new column. The same procedure(s) as in the above examples can then be used to filter out and delete (or copy) these rows.
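The formula in the new column would be of broadly the following form (the cell references are illustrative of the general approach, rather than taken directly from the file):

```excel
=COUNTBLANK(B5:E5)    number of blank fields within the record on row 5
```

copied down for each row of the data set, with the result then compared against the threshold value (Cell F2) to identify the rows to be filtered out.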
Generally, it is preferable to define the data set as an Excel Table. Note that one needs to take care to ensure that, if an original data set contains blank rows, then the Table range needs to be extended to include all data (as the Insert/Table operation will have selected only the Current Region of a data point by default).
The file Ch26.6.DaysLateAnalysis.Table1.xlsx shows an example of the process to create a Table (see Figure 26.14), noting that the default range would have been only up to Row 15 (as Row 16 is blank), so that the range was manually extended to include the rows up to Row 104.
Once defined as a Table, any new contiguous rows or columns that are added will automatically become part of the Table. In addition, a formula entered in the first data cell of a field will be copied to all rows of the Table. Indeed, simply making an entry in such a cell (such as in Cell F5 of the original, non-extended, Table) will result in a new column being added automatically. Figure 26.15 shows the same process as above (in which the COUNTBLANK function is used as a new field), when applied to a Table.
Note that the syntax of the formula in every cell within the Table range of column F is the same (and is not row-dependent):
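The formula is a structured (Table) reference of broadly the following form (the Table and field names used here are assumed for illustration):

```excel
=COUNTBLANK(Table1[@[Date]:[Comment]])
```

where the [@…] notation refers to the cells of the current row between the two named fields, so that exactly the same formula text appears in every row of column F.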
If one were to change the formula, such as to count the blanks from different points, then deleting the original field name (as shown in Figure 26.16) will produce a drop-down menu of field names from which the required formula can be rebuilt.
Note also that, when using a Table, one can create formulae using the Header Names that the Table generates, such as:
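For example (assuming, purely for illustration, a Table named Table1 with a field named Amount):

```excel
=SUM(Table1[Amount])
=AVERAGE(Table1[Amount])
```

Such formulae refer to the whole column of the Table by name, and so adjust automatically as rows are added to or deleted from the Table.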
Finally, note also that when one has selected a cell within the Table, the Table Tools Design Tab appears. This contains icons to Remove Duplicates and to create a PivotTable summary. The tab also allows one to rename the Table, although experience suggests that this process is best done when the Table is first created, and not once formulae have been built that refer to the name.
In some earlier examples, we used the filtering process embedded within the drop-down menus both to delete unwanted data and to extract data that may be wanted for separate analysis. In practice, such approaches are generally most appropriate only for the deletion of data, whereas the extraction of wanted items to another range is more effectively implemented using the Data/Advanced Filter tool.
The file Ch26.7.DaysLateAnalysis.ExtractCriteriaRange.xlsx shows an example (see Figure 26.17). Note that there is a separate range to define the criteria for the extraction and another range in which to copy the extracted data (this latter range is optional; as per the dialog box, the list can otherwise be filtered in place, so that the original full data set would simply be filtered as if done with a drop-down menu).
Note that:
Excel's Database functions return calculations relating to a specified field in a database, when a set of criteria is applied to determine which records are to be included. The available functions are DAVERAGE, DCOUNT, DCOUNTA, DGET, DMIN, DMAX, DPRODUCT, DSTDEV, DSTDEVP, DSUM, DVAR and DVARP. For example, DSUM provides a conditional summation in the database, i.e. a summation of a field where only items that meet the defined criteria are included. In a sense, DSUM is similar to the SUMIFS function (similarly, DAVERAGE is similar to AVERAGEIFS, and DMIN or DMAX to MINIFS or MAXIFS). The other Database functions generally do not have simple conditional non-database equivalents (although the use of array functions does allow them to be reproduced in principle, as discussed in Chapter 18).
As for the Advanced Filter, Database functions require one to specify a Database (which may have been defined as an Excel Table), a Criteria Range (headers and at least one row, with additional contiguous rows representing OR criteria) and a field on which the conditional calculations are to be conducted.
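As a sketch (using names broadly in line with the later example file, rather than a definitive implementation), a conditional summation might be written:

```excel
=DSUM(DataSet1, "Amount £", I4:L5)
```

where DataSet1 is the database (here an Excel Table), "Amount £" is the field to be summed, and I4:L5 is a Criteria Range consisting of a header row and one row of criteria values.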
Database functions and their non-database equivalents provide calculations which are dynamic or live-linked to the data set (in contrast to PivotTables). Each has advantages over the other:
The file Ch26.8.DaysLateAnalysis.DatabaseFunctions1.xlsx contains an example. Figure 26.18 shows part of a data set (which has been defined as a Table, and given the name DataSet1) as well as the additional parameter information required to use the Database functions, i.e. the name of the field which is to be analysed (in this case the field “Amount £”) and the Criteria Range (cells I4:L5). Figure 26.19 shows the implementation of some of the functions.
If one wishes to implement a between-type query (e.g. sum all items whose amounts are between £5,000 and £10,000), the simplest way is usually to extend the Criteria Range by adding a column and applying two tests to the same criteria field; this is analogous to the procedure when using a SUMIFS function, as discussed earlier in the text. (An alternative is to implement two separate queries and take the difference between them.)
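For example, the extended Criteria Range could contain two columns that both use the Amount £ header, with the two tests placed in the same row (so that they act as an AND condition):

```excel
Amount £    Amount £
>=5000      <=10000
```

Placing the tests in the same row applies them jointly to each record; placing them in separate rows would instead be interpreted as an OR condition.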
The file Ch26.9.DaysLateAnalysis.DatabaseFunctions2.xlsx contains an example, which is essentially self-explanatory; Figure 26.20 shows the extended Criteria Range that could be used in this case.
When using Database functions, the Criteria Range consists of field headings and one or more contiguous rows below this. Thus, if a second set of criteria were required, a new Criteria Range would be needed, and if many such sets of criteria were required, there would be several such ranges, so that the space taken up in the workbook could become quite large (i.e. each Criteria Range would require at least three rows of Excel – the header, the criteria values and a blank row before the next Criteria Range).
An alternative is to have a single Criteria Range, and to place in sequence the set of values to be used for each query within this range, each time also recording the results of the query (as if doing a sensitivity analysis). Unfortunately, such a process cannot in general be simplified by using lookup functions to place the data within the criteria range: if the criteria range contains functions which return blank cells, then the Database functions do not consider such calculated blanks as being the same as if the cell containing the same criteria value were truly empty.
Thus, in practice, such a process needs to be done either manually or automated using VBA macros (see Chapter 32), because the copying in Excel (or assigning in VBA) of a blank cell into the Criteria Range leads to its value being treated as genuinely empty.
The file Ch26.10.DaysLateAnalysis.DatabaseFunctions3.xlsx shows examples of this. Figure 26.21 shows that the Database functions return zero (cells O5:R5) when a lookup process is used to place the desired values in the Criteria Range, even when an additional step is included so that references to blank cells within the input ranges return blanks rather than zeros (see the Formula Bar). Since calculated blanks within the Criteria Range are not treated as if they were genuinely empty cells, no matching records are found. However, when the queries are pasted individually into the Criteria Range (for demonstration purposes, a new range has been set up in cells I8:M9), no such issue arises. One can repeat the process to copy the individual queries and each time paste the results into a storage range; in the Figure, the first three criteria sets have been completed, with the results pasted in cells O13:O15. Clearly, such a process can be automated using a VBA macro, as discussed in Chapter 32.
PivotTables can be used to produce cross-tabulation reports which summarise aspects of a database by category. Their main advantages include:
Despite their power, PivotTables have some properties which can be potential disadvantages:
The file Ch26.11.DaysLateAnalysis.PivotTable1.xlsx shows an example of a simple PivotTable. The underlying database has been defined as an Excel Table called DataSet1 (although doing so is not necessary to create a PivotTable). The PivotTable was created by selecting any cell in the database and using Insert/PivotTable, and then filling in the data or choices required by the dialog box. When doing so, one may in general choose to place the PivotTable on a new worksheet to keep the data separate from the analysis, and to allow adaptation of the data set (insertion of new rows) without potentially causing conflicts. Figure 26.22 shows the foundation structure that is created automatically.
Thereafter, the user can select field names using the check boxes and drag these to the Rows, Columns, Values or Filters areas, as desired. Figure 26.23 shows an example report which summarises the figures by country and customer.
Note that the report would simply be transposed if the row and column positions of the Country and Customer fields were switched.
Figure 26.24 shows how one may choose to use a filter approach to present the results. In this example, the Country field was moved from the Columns area to the Filters area, with the result that a new filter appeared in Row 1; this has been applied to show only the data relating to Germany.
Some additional important points include:
Figure 26.25 shows an example in which some of these points have been implemented: a second PivotTable has been inserted, a Filter is used in place of a column structure, and the order of the countries has been manually changed. Note that the PivotTable Fields dialog appears only once but is specific to the PivotTable that is active.
An important feature of PivotTables is the ability to rapidly see the individual items that make up the summarised figure, i.e. to list the relevant detailed individual records in the underlying database. One can create such a list either by:
Figure 26.26 shows the result of applying this process to Cell B23 (summary data for France) in the above example (shown in the DrillDown worksheet of the example file).
In their basic form, Slicers are rather like Filters, but simply provide an enhanced visual display, as well as more flexibility to rapidly experiment with options.
Slicers can be created using the menu on the PivotTable Tools/Analyze tab. Figure 26.27 shows an example of adding a Slicer on the Country field of the second PivotTable, and then showing the summary results for the selected set of countries (holding the Ctrl key to select multiple items).
In Figure 26.28, we show the result of including a second Slicer (based on Customer), in which all items for the second Slicer are selected. Figure 26.29 shows the result of using the second Slicer to report only the items that relate to the customer Cust01; one sees that the Country slicer adjusts to show that such a customer is not present in Italy. This highlights one of the main advantages of slicers over filter approaches (i.e. reflecting correctly the interactions between subsets of the data).
Timeline Slicers are simply a special form of Slicer that can be used (inserted using the PivotTable Tools/Analyze tab) when a date field is to be analysed. Rather than having to adapt the data (such as by using the MONTH or YEAR functions), these Slicers automatically provide options to analyse the data according to different levels of time granularity. Figure 26.30 shows the result of inserting a Timeline Slicer (in which the full Customer list is used), with the drop-down menu of the Slicer set to the “Months” option.
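For comparison, the non-Slicer approach would require helper fields built with formulae such as (the cell reference being illustrative):

```excel
=YEAR(C5)
=MONTH(C5)
```

with the analysis then conducted on these derived fields; the Timeline Slicer avoids this extra adaptation step.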
One useful feature of PivotTables is the ability to rapidly include or exclude data from the summary reports (using either filters or slicers). For example, a comment field within the data set may indicate that some items need to be treated in a special way.
The file Ch26.12.DaysLateAnalysis.PivotTable2.xlsx shows an example. Figure 26.31 shows a database in which there is a comment field. Figure 26.32 shows a PivotTable in which a Slicer has been introduced, with the analysis intended to include only those items in the comment field that are either blank or marked “OK” or marked “Checked”.
Note that if one were to conduct such summary analysis using functions (database or non-database), the bespoke treatment of such “corrected” items would be much more cumbersome: for example, one would generally need to create a special field in the database that acts as a criterion to determine whether items would be included or not in the calculation. However, since the comments made could essentially be anything, it could be quite cumbersome, time-consuming and error-prone to create the applicable formulae.
One can obtain (or lookup) a value in any visible PivotTable using the GETPIVOTDATA function. The function has a number of parameters:
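For example, a formula of broadly the following form (the cell and item names being illustrative) would return the summarised amount for Germany from a PivotTable whose top-left area includes Cell B3:

```excel
=GETPIVOTDATA("Amount £", $B$3, "Country", "Germany")
```

The first argument is the data field to be retrieved, the second is a reference to any cell within the PivotTable, and the remaining arguments are field/item pairs that identify the required value.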
The file Ch26.13.DaysLateAnalysis.PivotTable3.xlsx contains an example, shown in Figure 26.33.
Since the data must already be visible, one might ask what the benefit of this is. One case where it may be useful is where there are many large PivotTables, but one wishes to extract only selected data from them. Even in such cases, functions such as SUMIFS may often be more appropriate, unless the filter and slicer functionality is required simultaneously.
Once a PivotTable has been created, a PivotChart can also be added using the PivotChart icon on the Analyze tab of the PivotTable Tools. The two objects are then inherently linked: the application of a filter on one will adjust the corresponding filter on the other, and the insertion of a slicer will affect both.
The file Ch26.14.DaysLateAnalysis.PivotTable4.xlsx contains an example. Figure 26.34 shows a PivotChart in which a country filter is applied, and the column format of the PivotTable adjusts automatically. The Timeline slicer that has been inserted for the PivotTable also affects the display shown on the chart.
Where one wishes to produce reports based on multiple databases or tables, it may be cumbersome to have to consolidate them together into a single database. Excel's Data Model allows for PivotTable analysis of multiple data sets by linking them through relationships (of course, one would still need to have, or to create, some form of identifier or key that enables the matching process to take place through the relationship, but the cumbersome and memory-intensive lookup functions can often be avoided).
The file Ch26.15.DataModel.xlsx shows an example. Figure 26.35 shows three data sets that have been defined as Excel Tables, with the first providing a complete list of unique identifiers for each customer, the second providing information on revenues and dates, and the third providing cost information.
To create a PivotTable report that uses information from these tables (without explicitly joining them into a single data set), the following steps are necessary: