Producing a robust and efficient data query is not just about finding the appropriate load and transform functions and placing them in the correct sequence. It is also about extending, maintaining, and updating the process. This can be either to correct an error once the query is being tested or to adapt a query to new requirements. This chapter will introduce you to some of the techniques that you can apply to handle the various stages of the query life cycle.
Delivering revelatory analytics can mean sourcing data from a large range and variety of queries. It may also imply that these queries have to be linked together to create a cascade of data transformations that prepares the core elements of a practical and usable data model collated from multiple sources. It follows that you will therefore need to know how to manage the queries that you create to use them efficiently and to keep your queries under control in real-world situations.
Managing the Transformation Process
Pretty nearly all the transformation steps that we have applied so far have been individual elements that can be applied to just about any data table. However, when you are carrying out even a simple data load and transform process, you are likely to want to step through several transformations in order to shape, cleanse, and filter the data to get the result you want. This is where the Power Query approach is so malleable, because you can apply most data transformation steps to just about any data table. The art consists of placing them in a sequence that can then be reused any time that the data changes to reprocess the new source data and deliver an up-to-date output.
Renamed
Deleted
Moved (in certain cases)
The even better news is that, in many cases, steps can be modified. This way you are not stuck with the choices that you made initially, but have the opportunity of tweaking and improving individual steps in a process. This can avoid your having to rebuild an entire sequence of steps in an ETL routine simply by replacing one element in the ETL process.
In order to experiment with the various ways that you can modify queries, you are going to need some initial data. So, to start with, I suggest that you use the following Excel source file: C:DataMashupWithExcelSamplesCarSalesDataForQueries.xlsx. This source file contains queries that connect to six source tables in another Excel file, thereby imitating a real-world scenario (where the data sources could come from multiple different sources and have different origins: database, text, etc.).
Once you have opened this file, switch to the Query Editor window by double-clicking any of the existing queries in the Queries & Connections window.
Modifying a Step
How you alter a step will depend on how the original transformation was applied. This becomes second nature after a little practice and will always involve first clicking the step that you wish to modify and then applying a different modification. If you invoke a ribbon option, such as altering the data type, for instance, then you change the data type by simply applying another data type directly from the ribbon. If you used an option that displayed a dialog (such as splitting a column, among others), then you can right-click the step in the Applied Steps list and select Edit Settings from the context menu. Alternatively, and if you prefer, you can click the “gear” icon that is displayed to the right of most (but not all) steps to display a dialog where you can adjust the step settings. This dialog will show all the options and settings that you applied initially; in it, you can make any modifications that you consider necessary.
A final possibility that makes it easy to alter the settings for a process is to edit the formula that appears in the formula bar each time you click a step. This, however, involves understanding all the complexities of each piece of the code that underpins the data transformation process. I will provide a short overview of code modification in Chapter 12.
If you can force yourself to organize the process that you are writing with Power Query, then a little forethought and planning can reap major dividends. For instance, certain tasks, such as setting data types, can be carried out in a single operation. This means that you only have to look in one place for a similar set of data transformations. Not just that, but if you need to alter a data type for a column at a later stage, I suggest that you click the Changed Type step before you make any further alterations. This way, you extend the original step, rather than creating other steps—which can make the process more confusing and needlessly voluminous.
Renaming a Step
- 1.
Select the query (or source table or worksheet, if you prefer). I will use the Clients query in this example.
- 2.
Right-click the step that you want to rename, Changed Type, for instance.
- 3.
Select Rename from the context menu.
- 4.
Type in the new name. I will use NewDataTypes.
- 5.
Press Enter.
The step is renamed and the new name will appear in the Applied Steps list in the Query Settings pane. This way you can ensure that when you come back to a data transformation process days, weeks, or months later, you are able to understand more intuitively the process that you defined, as well as why you shaped the data like you did.
You can use upper- or lowercase characters—or a mixture of both—when naming steps in Power Query. You can also add spaces and special characters.
Deleting a Step or a Series of Steps
Deleting a step is all too easy, but doing so can have serious consequences. This is because an ETL process is often an extremely tightly coupled series of events, where each event depends intimately on the preceding one. So deleting a step can make every subsequent step fail. Knowing which events you can delete without drastic consequences will depend on the types of process that you are developing as well as your experience with Power Query. In any case, this is what you should do if you need to delete a step:
- 1.
Place the pointer over the process step that you want to delete.
- 2.
Click the cross (×) icon that appears.
- 3.
Select Delete. The Delete Step dialog might appear (if deleting this step might have unexpectedly negative consequences), as shown in Figure 10-1.
- 4.
Confirm by clicking the Delete button. The step is deleted.
If—and it is highly possible—deleting this step causes issues for the rest of the process, you will see that the data table is replaced by an error message. This message will vary depending on the type of error that Power Query has encountered.
When describing this technique, I was careful to state that you might see the Delete Step dialog. If you are deleting the final step in a sequence of steps, then you will probably not see it, since there should not be any potentially horrendous consequences; at worst, you will have to re-create the step. If you are deleting a step in the middle of a process, then you might want to think seriously about doing so before you cause a potentially vast number of problems. Consequently, you are asked to confirm the deletion in these cases.
An alternative technique is to right-click the step that you want to delete and select Delete. You may still have to confirm the deletion.
If you realize that an error in a process step has invalidated all your work up until the end of the process, rather than deleting multiple elements one by one, click Delete Until End from the context menu at step 2 in the preceding exercise.
Discarding Changes
- 1.
Click the close button (the small cross) at the top right of the Power Query Editor window. The dialog shown in Figure 10-2 will appear.
- 2.
Click Discard to close Power Query without applying any changes.
Don’t count on using an undo function as you can in other desktop applications. To lower your blood pressure, you may prefer to save a copy of a file containing an intricate data transformation process before deleting any steps. You can also make copies of the entire data transformation process as “M” code—as you will learn in Chapter 12.
Modifying an Existing Step
Power Query does not try and lock you into a rigid sequence of events when you create a series of applied steps to create and transform a data flow. This really becomes obvious when you discover that you need to alter a step in a process.
Suppose, for instance, that you discover that you have loaded a wrong Excel worksheet when you selected the initial data from an Excel file. You do not want to repeat the process when you can simply substitute one worksheet name for another.
- 1.
Select the query that you want to modify (Clients in this example).
- 2.
Click the step to modify (in this case, it will be Navigation).
- 3.
Click the gear (or cog) icon to the right of the step name. The appropriate dialog will appear. In this case, it will be the Navigation dialog that you can see in Figure 10-3.
- 4.
Click the table or worksheet that you want to use instead of the current dataset (Table1 in this example).
- 5.
Click OK.
The Query Editor will replace one source dataset with another. It might also add extra steps to ensure that the data is adapted for use in the query.
If the Query Editor can modify a step, the gear icon will be displayed to the right of the step name.
Clicking the modification (the gear) icon will display the dialog that was used to create the step (even if the step was created automatically by Power Query)—or a dialog that allows you to modify the step.
- 1.
Select the query that you want to modify (Clients in this example).
- 2.
Click the last step.
- 3.
Right-click the Address2 column and select Remove.
A new step will appear in the Applied Steps list, named Removed Columns. This step does not have the modification icon. So, for the moment, you can remove it, but not modify it—at least, not using the graphical user interface. You can, however, modify the code for a step as you will learn in Chapter 12.
Modifying existing steps is not a “magic bullet.” This is because a series of data transformations can be highly dependent on a tailored logic that has been developed for a specific data structure. It follows, for instance, that you can only replace a data source with another one that has a virtually identical structure. However, modifying a step can avoid your having to rewrite an entire data flow sequence in many cases.
Adding a Step
You can add a step anywhere in the sequence. All you have to do is click the step that precedes the new step that you want to insert before clicking the icon in any of the ribbons that corresponds to the new step. As is the case when you delete a step, Power Query will display an alert warning you that this action could cause problems with the process from this new step on.
Altering Process Step Sequencing
- 1.
Right-click the step that you want to resequence.
- 2.
Select Move Up or Move Down from the context menu.
I remain pessimistic that this can work miracles, but it is good to know that it is there.
Remember that before tweaking the order in which the process is applied, clicking any process step causes the table in the Power Query window to refresh to show you the state of the data up to and including the selected step. This is a very clear visual guide to the process and how the ETL process is carried out. Indeed, clicking the steps one after another will “scroll through” the changes in the data and demonstrate exactly how the while process is structured and works.
An Approach to Sequencing
- 1.
Load the sample data into Power Query.
- 2.
Promote or add comprehensible column headers. For example, you really do not want to be looking at step 47 of a process and wondering what Column29 is, when it could read (for instance) ClientName.
- 3.
Remove any columns that you do not need. The smaller the dataset, the faster the processing. What is more, you will find it easier to concentrate on, and understand, the data if you are only looking at information that you really need. Any columns that have been removed can be returned to the dataset simply by deleting or editing the step that removed them.
- 4.
Alter the data types for every column in the table. Correct data types are fundamental for many transformation steps and are essential for filtering, so it’s best to get them sorted out early on.
- 5.
Filter out any records that you do not need. Once again, the smaller the dataset, the faster the processing. This includes deduplication.
- 6.
Parse any complex JSON or XML elements.
- 7.
Carry out any necessary data cleansing.
- 8.
Carry out any necessary transforms.
- 9.
Carry out any necessary column splits or adding custom columns.
- 10.
Add any derived columns.
- 11.
Add any calculations or logical transformations of data.
- 12.
Handle any error records that the ETL process has thrown up.
Once again, I must stress that this is not a definitive guide. I hope, however, that it will help you to see “the wood for the trees” when you are creating data load and transformation processes using Power Query.
Error Records
Some data transformation operations will cause errors. This can be a fact of life when mashing up source data. For instance, you could have a few rows in a large dataset where a date column contains a few records that are texts or numbers. If you convert the column to a date data type, then any values that cannot be converted will appear as error values.
Managing Queries
Once you have used Power Query for any length of time, you will probably become addicted to creating more and deeper analyses based on wider-ranging data sources. Inevitably, this will mean learning to manage the data sources that feed into your data models efficiently and productively.
Organizing and grouping queries into folders
Duplicating queries
Referencing queries
Documenting queries
Adding a column as a new query
Let’s take a look at these functions, one by one.
Query management is heavily dependent on the Queries pane. Power Query hides this pane by default, so you will need to display it by clicking the chevron at the top right of the collapsed Queries pane on the left of the data. What is more, many of the query management functions are also available directly from inside Excel in the popup menu of the Queries & Connections pane.
Organizing Queries
When you have anything from a handful to a few dozen queries that you are using in the Power Query Editor, you may want to exercise some control over how they are organized . To begin with, you can modify the order in which queries appear in the Queries pane on the left of the Power Query Editor window. This lets you override the default order, which is that the most recently added data source appears at the bottom of the list.
- 1.
Display the Queries pane if it is not already visible.
- 2.
Right-click the query that you want to move.
- 3.
Select Move Up (or Move Down) from the context menu.
You have to carry out this operation a number of times to move a query up or down a number of places. So the alternative—dragging queries up and down in the list—is probably worth using as well.
Grouping Queries
You can also create custom groups to better organize the queries that you are using in an Excel file. This will not have any effect on how the queries work. Grouping queries is simply an organizational technique, and it will not change in any way the data tables that you see in report mode in Excel. You will, however, see the groups that you created reflected in the Queries & Connections pane in Excel.
Creating a New Group
- 1.
Right-click the query that you want to add to a new group. I will use the Colors query in the Query Editor—opened from the Excel file CarSalesDataForQueries.xlsx.
- 2.
Select Move To Group ➤ New Group from the context menu. The New Group dialog will appear.
- 3.
Enter a name for the group and (optionally) a description. I will name the group ReferenceData. The dialog will look something like Figure 10-4.
- 4.
Click OK.
By default, all other queries are added to a group named Other Queries.
If you have created dozens of queries, this technique can really help you to manage a complex data load process. As you might expect, you can expand or close a group by clicking the triangle to the left of the folder name.
Renaming Groups
- 1.
Right-click the group that you want to rename.
- 2.
Select Rename from the context menu.
- 3.
Edit or replace the name.
- 4.
Press Enter.
The Other Queries group cannot be renamed or deleted. By default, all new queries will be added to this group. You can also double-click the group to rename it directly.
Adding a Query to a Group
- 1.
Right-click the query that you want to add to another existing group.
- 2.
Select Move To Group ➤ Destination Group Name from the context menu.
The selected query is moved to the chosen group. The group structure will also be visible in the Queries & Connections pane in Excel.
Duplicating Queries
- 1.
Right-click the query that you want to copy.
- 2.
Select Duplicate from the context menu.
The query is copied and the duplicate appears in the list of queries inside the same group as the source query. It has the same name as the original query, with a number in parentheses appended. You can always rename it in the Query Settings pane, in the Queries pane on the left of the Query Editor window, or in the Queries & Connections pane in Excel itself.
You can copy and paste queries if you prefer. The advantage of this technique is that you can choose the destination group for the copied query simply by clicking the folder icon for the required group before pasting the copy of the query.
Referencing Queries
If you are building a complex ETL (Extract, Transform, Load) routine, you might conceivably organize your work in stages to better manage the process. To help you with this, the Power Query Editor allows you to use the output from one query as the source for another query. This enables you to break down different parts of the process (e.g., structure, filters, then cleansing) into separate queries so that you can concentrate on different aspects of the transformation in different queries.
- 1.
Right-click the query that you want to use as the source data for a new query.
- 2.
Select Reference from the context menu. A new query is created in the list of queries in the Queries pane.
- 3.
Right-click the new query, select Rename, and give it a meaningful name.
Unless you rename the query, the new query has the same name as the original query, with a number in parentheses appended. If you click the new query, you see exactly the same data in the referenced query as you can see if you click the final step in the source query.
From now on, any modifications that you make in the referenced (source) query produce an effect on the data that is used as the source for the second query. In other words, you have created a sequence of queries in a data ingestion process.
- 1.
In the Data ribbon, click Get Data ➤ From File ➤ From Workbook.
- 2.
Select the file BrilliantBritishCars.xlsx and click Import.
- 3.
In the Navigator dialog, select the source table BaseData.
- 4.
Click the popup menu at the right of the Load button and select Load To. The Import Data dialog will be displayed.
- 5.
Click Only create connection.
- 6.
Click OK.
The source query will appear in the Queries & Connections pane, but will be flagged as Connection only.
You cannot hide queries in the Queries & Connections pane in Excel—but you can place reference queries in a custom group to isolate them visually.
You want to isolate complex data transformations into more manageable subsets. You may, for instance, want one intermediate query that transforms the data while a subsequent query cleanses the data.
You could want to apply a common set of initial transformations that then feed into two separate data preparation paths—a detailed view of the data and an aggregated view.
The way to convert an existing query to a connection only—ready for use as a reference query—is to delete the worksheet (or Power Pivot sheet) containing the output data.
Documenting Queries
- 1.
Right-click the query that you want to annotate.
- 2.
Select Properties from the context menu. The Query Properties dialog will appear.
- 3.
Add a description. The result could be like the dialog shown in Figure 10-6.
- 4.
Click OK.
The description that you added is now visible as a tooltip if you hover the cursor over the query name in the list of queries in the Queries pane of the Query Editor. You will also see the description in the Peek window for this query if you hover the mouse pointer over the query in the Queries & Connections pane in Excel.
Clicking Fast Data Load will attempt to load the data faster—but Power Query could remain unresponsive for some time while the data is loaded.
Adding a Column as a New Query
- 1.
In the Queries list on the left, select the query containing the column that you want to isolate as a new query.
- 2.
Right-click the title of the column containing the data that you want to isolate.
- 3.
Select Add as New Query from the context menu. A new query is created. It is named after the original query and the source column.
- 4.
In the Transform ribbon, click To Table. The To Table dialog will appear, as you can see in Figure 10-7.
- 5.
Click OK. The new query will become a table of data and will have the name of the column that you selected.
- 6.
Rename the query, if you judge this necessary.
You can now use this query in your data model and as part of a linked set of query processes.
A query created in this way is completely disconnected from the source query from where the data was taken. Put another way, any refresh of the source data will have no effect on the new query that you created from a column.
Managing Queries from the Queries & Connections Pane
You are not obliged to switch to the Query Editor to carry out many of the query management tasks that you discovered in this chapter. You can perform several operations directly from the Queries & Connections pane.
- 1.
Open the Excel file Chapter10Sample.xlsx, and display the Queries & Connections pane.
- 2.
Right-click one of the queries. The popup menu will appear, as shown in Figure 10-8.
- 3.
Click the menu option that you want.
Create, delete, and modify groups of queries or queries
Move queries between groups and inside groups
Expand and collapse groups
Alter group properties
Copy and paste queries and groups
Duplicate and reference queries
Merge and append queries
Alter query and group properties
Refresh queries
Show the Peek window for a query
The contents of the context menu will vary depending on whether you have right-clicked a group or an individual query.
Conclusion
In this chapter, you saw how to manage and extend the contents of the queries that you can create using Power Query. Specifically, you saw how to modify individual steps in a data load and transformation process. This ranged from renaming steps to changing the order of steps in a process—or even altering the specification of what a step actually does.
Then you saw how to manage whole queries. You learned how to rename and group queries as well as how to chain queries so that the output from one query became the source of data for another query.
Finally, you learned how to reference—or link—queries to isolate parts of a data ingestion process or to break down a complex process into manageable parts.
It is now time to learn how to add interactivity to your processes using parameters in Power Query. You will discover this in the next chapter.