© Adam Aspin 2020
A. AspinData Mashup with Microsoft Excel Using Power Query and Mhttps://doi.org/10.1007/978-1-4842-6018-0_10

10. Organizing and Managing Queries

Adam Aspin1 
(1)
Stafford, UK
 

Producing a robust and efficient data query is not just about finding the appropriate load and transform functions and placing them in the correct sequence. It is also about extending, maintaining, and updating the process. This can be either to correct an error once the query is being tested or to adapt a query to new requirements. This chapter will introduce you to some of the techniques that you can apply to handle the various stages of the query life cycle.

Delivering revelatory analytics can mean sourcing data from a large range and variety of queries. It may also imply that these queries have to be linked together to create a cascade of data transformations that prepares the core elements of a practical and usable data model collated from multiple sources. It follows that you will therefore need to know how to manage the queries that you create to use them efficiently and to keep your queries under control in real-world situations.

Managing the Transformation Process

Pretty nearly all the transformation steps that we have applied so far have been individual elements that can be applied to just about any data table. However, when you are carrying out even a simple data load and transform process, you are likely to want to step through several transformations in order to shape, cleanse, and filter the data to get the result you want. This is where the Power Query approach is so malleable, because you can apply most data transformation steps to just about any data table. The art consists of placing them in a sequence that can then be reused any time that the data changes to reprocess the new source data and deliver an up-to-date output.

The key to appreciating and managing this process is to get well acquainted with the Applied Steps list in the Query Settings pane. This list contains the details of every step that you applied, in the order in which you applied it. Each step retains the name that Power Query gave it when it was created, and each can be altered in the following ways:
  • Renamed

  • Deleted

  • Moved (in certain cases)

The even better news is that, in many cases, steps can be modified. This way you are not stuck with the choices that you made initially, but have the opportunity of tweaking and improving individual steps in a process. This can avoid your having to rebuild an entire sequence of steps in an ETL routine simply by replacing one element in the ETL process.

In order to experiment with the various ways that you can modify queries, you are going to need some initial data. So, to start with, I suggest that you use the following Excel source file: C:DataMashupWithExcelSamplesCarSalesDataForQueries.xlsx. This source file contains queries that connect to six source tables in another Excel file, thereby imitating a real-world scenario (where the data sources could come from multiple different sources and have different origins: database, text, etc.).

Once you have opened this file, switch to the Query Editor window by double-clicking any of the existing queries in the Queries & Connections window.

Modifying a Step

How you alter a step will depend on how the original transformation was applied. This becomes second nature after a little practice and will always involve first clicking the step that you wish to modify and then applying a different modification. If you invoke a ribbon option, such as altering the data type, for instance, then you change the data type by simply applying another data type directly from the ribbon. If you used an option that displayed a dialog (such as splitting a column, among others), then you can right-click the step in the Applied Steps list and select Edit Settings from the context menu. Alternatively, and if you prefer, you can click the “gear” icon that is displayed to the right of most (but not all) steps to display a dialog where you can adjust the step settings. This dialog will show all the options and settings that you applied initially; in it, you can make any modifications that you consider necessary.

A final possibility that makes it easy to alter the settings for a process is to edit the formula that appears in the formula bar each time you click a step. This, however, involves understanding all the complexities of each piece of the code that underpins the data transformation process. I will provide a short overview of code modification in Chapter 12.

Tip

If you can force yourself to organize the process that you are writing with Power Query, then a little forethought and planning can reap major dividends. For instance, certain tasks, such as setting data types, can be carried out in a single operation. This means that you only have to look in one place for a similar set of data transformations. Not just that, but if you need to alter a data type for a column at a later stage, I suggest that you click the Changed Type step before you make any further alterations. This way, you extend the original step, rather than creating other steps—which can make the process more confusing and needlessly voluminous.

Renaming a Step

Power Query names steps using the name of the transformation that was applied. This means that if another similar step is applied later, Power Query uses the same name with a numeric increment. As this is not always comprehensible when reviewing a sequence of transformation steps, you may prefer to give more user-friendly names to individual steps. This is done as follows:
  1. 1.

    Select the query (or source table or worksheet, if you prefer). I will use the Clients query in this example.

     
  2. 2.

    Right-click the step that you want to rename, Changed Type, for instance.

     
  3. 3.

    Select Rename from the context menu.

     
  4. 4.

    Type in the new name. I will use NewDataTypes.

     
  5. 5.

    Press Enter.

     

The step is renamed and the new name will appear in the Applied Steps list in the Query Settings pane. This way you can ensure that when you come back to a data transformation process days, weeks, or months later, you are able to understand more intuitively the process that you defined, as well as why you shaped the data like you did.

Note

You can use upper- or lowercase characters—or a mixture of both—when naming steps in Power Query. You can also add spaces and special characters.

Deleting a Step or a Series of Steps

Deleting a step is all too easy, but doing so can have serious consequences. This is because an ETL process is often an extremely tightly coupled series of events, where each event depends intimately on the preceding one. So deleting a step can make every subsequent step fail. Knowing which events you can delete without drastic consequences will depend on the types of process that you are developing as well as your experience with Power Query. In any case, this is what you should do if you need to delete a step:

  1. 1.

    Place the pointer over the process step that you want to delete.

     
  2. 2.

    Click the cross (×) icon that appears.

     
  3. 3.

    Select Delete. The Delete Step dialog might appear (if deleting this step might have unexpectedly negative consequences), as shown in Figure 10-1.

     
../images/497001_1_En_10_Chapter/497001_1_En_10_Fig1_HTML.jpg
Figure 10-1

The Delete Step dialog

  1. 4.

    Confirm by clicking the Delete button. The step is deleted.

     

If—and it is highly possible—deleting this step causes issues for the rest of the process, you will see that the data table is replaced by an error message. This message will vary depending on the type of error that Power Query has encountered.

When describing this technique, I was careful to state that you might see the Delete Step dialog. If you are deleting the final step in a sequence of steps, then you will probably not see it, since there should not be any potentially horrendous consequences; at worst, you will have to re-create the step. If you are deleting a step in the middle of a process, then you might want to think seriously about doing so before you cause a potentially vast number of problems. Consequently, you are asked to confirm the deletion in these cases.

An alternative technique is to right-click the step that you want to delete and select Delete. You may still have to confirm the deletion.

If you realize that an error in a process step has invalidated all your work up until the end of the process, rather than deleting multiple elements one by one, click Delete Until End from the context menu at step 2 in the preceding exercise.

Discarding Changes

If, when working with Power Query, you realize at any point that you have just destroyed hours of work, then (after drawing a deep breath)
  1. 1.

    Click the close button (the small cross) at the top right of the Power Query Editor window. The dialog shown in Figure 10-2 will appear.

     
../images/497001_1_En_10_Chapter/497001_1_En_10_Fig2_HTML.jpg
Figure 10-2

The discard changes dialog in Power Query

  1. 2.

    Click Discard to close Power Query without applying any changes.

     

Don’t count on using an undo function as you can in other desktop applications. To lower your blood pressure, you may prefer to save a copy of a file containing an intricate data transformation process before deleting any steps. You can also make copies of the entire data transformation process as “M” code—as you will learn in Chapter 12.

Modifying an Existing Step

Power Query does not try and lock you into a rigid sequence of events when you create a series of applied steps to create and transform a data flow. This really becomes obvious when you discover that you need to alter a step in a process.

Suppose, for instance, that you discover that you have loaded a wrong Excel worksheet when you selected the initial data from an Excel file. You do not want to repeat the process when you can simply substitute one worksheet name for another.

Assuming that you have opened the Excel workbook CarSalesDataForQueries.xlsx and have switched to the Query Editor:
  1. 1.

    Select the query that you want to modify (Clients in this example).

     
  2. 2.

    Click the step to modify (in this case, it will be Navigation).

     
  3. 3.

    Click the gear (or cog) icon to the right of the step name. The appropriate dialog will appear. In this case, it will be the Navigation dialog that you can see in Figure 10-3.

     
../images/497001_1_En_10_Chapter/497001_1_En_10_Fig3_HTML.jpg
Figure 10-3

The Navigation dialog displayed for step modification

  1. 4.

    Click the table or worksheet that you want to use instead of the current dataset (Table1 in this example).

     
  2. 5.

    Click OK.

     

The Query Editor will replace one source dataset with another. It might also add extra steps to ensure that the data is adapted for use in the query.

As you saw in the previous nine chapters, Power Query offers a vast range of data ingestion and modification possibilities. So I cannot, here, describe every possible option as far as modifying an Applied Step is concerned. Nonetheless, the principle is simple:
  • If the Query Editor can modify a step, the gear icon will be displayed to the right of the step name.

  • Clicking the modification (the gear) icon will display the dialog that was used to create the step (even if the step was created automatically by Power Query)—or a dialog that allows you to modify the step.

Certain steps do not display the modification icon. This is because the step cannot be modified, only removed (at least, using the Query Editor interface). As an example of this, add the following step:
  1. 1.

    Select the query that you want to modify (Clients in this example).

     
  2. 2.

    Click the last step.

     
  3. 3.

    Right-click the Address2 column and select Remove.

     

A new step will appear in the Applied Steps list, named Removed Columns. This step does not have the modification icon. So, for the moment, you can remove it, but not modify it—at least, not using the graphical user interface. You can, however, modify the code for a step as you will learn in Chapter 12.

Note

Modifying existing steps is not a “magic bullet.” This is because a series of data transformations can be highly dependent on a tailored logic that has been developed for a specific data structure. It follows, for instance, that you can only replace a data source with another one that has a virtually identical structure. However, modifying a step can avoid your having to rewrite an entire data flow sequence in many cases.

Adding a Step

You can add a step anywhere in the sequence. All you have to do is click the step that precedes the new step that you want to insert before clicking the icon in any of the ribbons that corresponds to the new step. As is the case when you delete a step, Power Query will display an alert warning you that this action could cause problems with the process from this new step on.

Altering Process Step Sequencing

It is possible—technically—to resequence steps in a process. However, in my experience, this is not always practical, since changing the order of steps in a process can cause as much damage as deleting a step. Nonetheless, you can always try it like this:
  1. 1.

    Right-click the step that you want to resequence.

     
  2. 2.

    Select Move Up or Move Down from the context menu.

     

I remain pessimistic that this can work miracles, but it is good to know that it is there.

Tip

Remember that before tweaking the order in which the process is applied, clicking any process step causes the table in the Power Query window to refresh to show you the state of the data up to and including the selected step. This is a very clear visual guide to the process and how the ETL process is carried out. Indeed, clicking the steps one after another will “scroll through” the changes in the data and demonstrate exactly how the while process is structured and works.

An Approach to Sequencing

Given the array of available data transformation options, you may well be wondering how best to approach a new ETL project using Power Query. I realize that all projects are different, but as a rough and ready guide, I suggest attempting to order your project like this:
  1. 1.

    Load the sample data into Power Query.

     
  2. 2.

    Promote or add comprehensible column headers. For example, you really do not want to be looking at step 47 of a process and wondering what Column29 is, when it could read (for instance) ClientName.

     
  3. 3.

    Remove any columns that you do not need. The smaller the dataset, the faster the processing. What is more, you will find it easier to concentrate on, and understand, the data if you are only looking at information that you really need. Any columns that have been removed can be returned to the dataset simply by deleting or editing the step that removed them.

     
  4. 4.

    Alter the data types for every column in the table. Correct data types are fundamental for many transformation steps and are essential for filtering, so it’s best to get them sorted out early on.

     
  5. 5.

    Filter out any records that you do not need. Once again, the smaller the dataset, the faster the processing. This includes deduplication.

     
  6. 6.

    Parse any complex JSON or XML elements.

     
  7. 7.

    Carry out any necessary data cleansing.

     
  8. 8.

    Carry out any necessary transforms.

     
  9. 9.

    Carry out any necessary column splits or adding custom columns.

     
  10. 10.

    Add any derived columns.

     
  11. 11.

    Add any calculations or logical transformations of data.

     
  12. 12.

    Handle any error records that the ETL process has thrown up.

     

Once again, I must stress that this is not a definitive guide. I hope, however, that it will help you to see “the wood for the trees” when you are creating data load and transformation processes using Power Query.

Error Records

Some data transformation operations will cause errors. This can be a fact of life when mashing up source data. For instance, you could have a few rows in a large dataset where a date column contains a few records that are texts or numbers. If you convert the column to a date data type, then any values that cannot be converted will appear as error values.

Managing Queries

Once you have used Power Query for any length of time, you will probably become addicted to creating more and deeper analyses based on wider-ranging data sources. Inevitably, this will mean learning to manage the data sources that feed into your data models efficiently and productively.

Fortunately, Power Query comes replete with a small arsenal of query management tools to help you. These include
  • Organizing and grouping queries into folders

  • Duplicating queries

  • Referencing queries

  • Documenting queries

  • Adding a column as a new query

Let’s take a look at these functions, one by one.

Note

Query management is heavily dependent on the Queries pane. Power Query hides this pane by default, so you will need to display it by clicking the chevron at the top right of the collapsed Queries pane on the left of the data. What is more, many of the query management functions are also available directly from inside Excel in the popup menu of the Queries & Connections pane.

Organizing Queries

When you have anything from a handful to a few dozen queries that you are using in the Power Query Editor, you may want to exercise some control over how they are organized . To begin with, you can modify the order in which queries appear in the Queries pane on the left of the Power Query Editor window. This lets you override the default order, which is that the most recently added data source appears at the bottom of the list.

Do the following to change the position of a query in the list:
  1. 1.

    Display the Queries pane if it is not already visible.

     
  2. 2.

    Right-click the query that you want to move.

     
  3. 3.

    Select Move Up (or Move Down) from the context menu.

     

You have to carry out this operation a number of times to move a query up or down a number of places. So the alternative—dragging queries up and down in the list—is probably worth using as well.

Grouping Queries

You can also create custom groups to better organize the queries that you are using in an Excel file. This will not have any effect on how the queries work. Grouping queries is simply an organizational technique, and it will not change in any way the data tables that you see in report mode in Excel. You will, however, see the groups that you created reflected in the Queries & Connections pane in Excel.

Creating a New Group

Here is how to create a new group:
  1. 1.

    Right-click the query that you want to add to a new group. I will use the Colors query in the Query Editor—opened from the Excel file CarSalesDataForQueries.xlsx.

     
  2. 2.

    Select Move To Group ➤ New Group from the context menu. The New Group dialog will appear.

     
  3. 3.

    Enter a name for the group and (optionally) a description. I will name the group ReferenceData. The dialog will look something like Figure 10-4.

     
../images/497001_1_En_10_Chapter/497001_1_En_10_Fig4_HTML.jpg
Figure 10-4

The New Group dialog

  1. 4.

    Click OK.

     
The new group is created and the selected query will appear in the group. The Queries pane will look something like Figure 10-5.
../images/497001_1_En_10_Chapter/497001_1_En_10_Fig5_HTML.jpg
Figure 10-5

The Queries pane with a new group added

Note

By default, all other queries are added to a group named Other Queries.

If you have created dozens of queries, this technique can really help you to manage a complex data load process. As you might expect, you can expand or close a group by clicking the triangle to the left of the folder name.

Renaming Groups

You can rename any groups that you have added.
  1. 1.

    Right-click the group that you want to rename.

     
  2. 2.

    Select Rename from the context menu.

     
  3. 3.

    Edit or replace the name.

     
  4. 4.

    Press Enter.

     
Note

The Other Queries group cannot be renamed or deleted. By default, all new queries will be added to this group. You can also double-click the group to rename it directly.

Adding a Query to a Group

To move a query from its current group to another group, you can carry out the following steps:
  1. 1.

    Right-click the query that you want to add to another existing group.

     
  2. 2.

    Select Move To Group ➤ Destination Group Name from the context menu.

     

The selected query is moved to the chosen group. The group structure will also be visible in the Queries & Connections pane in Excel.

Duplicating Queries

If you have done a lot of work transforming data, you could well want to keep a copy of the original query before trying out any potentially risky alterations to your work. Fortunately, this is extremely simple.
  1. 1.

    Right-click the query that you want to copy.

     
  2. 2.

    Select Duplicate from the context menu.

     

The query is copied and the duplicate appears in the list of queries inside the same group as the source query. It has the same name as the original query, with a number in parentheses appended. You can always rename it in the Query Settings pane, in the Queries pane on the left of the Query Editor window, or in the Queries & Connections pane in Excel itself.

Note

You can copy and paste queries if you prefer. The advantage of this technique is that you can choose the destination group for the copied query simply by clicking the folder icon for the required group before pasting the copy of the query.

Referencing Queries

If you are building a complex ETL (Extract, Transform, Load) routine, you might conceivably organize your work in stages to better manage the process. To help you with this, the Power Query Editor allows you to use the output from one query as the source for another query. This enables you to break down different parts of the process (e.g., structure, filters, then cleansing) into separate queries so that you can concentrate on different aspects of the transformation in different queries.

To use the output of one query as the source data for another, you need to reference a query , like this:
  1. 1.

    Right-click the query that you want to use as the source data for a new query.

     
  2. 2.

    Select Reference from the context menu. A new query is created in the list of queries in the Queries pane.

     
  3. 3.

    Right-click the new query, select Rename, and give it a meaningful name.

     

Unless you rename the query, the new query has the same name as the original query, with a number in parentheses appended. If you click the new query, you see exactly the same data in the referenced query as you can see if you click the final step in the source query.

From now on, any modifications that you make in the referenced (source) query produce an effect on the data that is used as the source for the second query. In other words, you have created a sequence of queries in a data ingestion process.

In practice, I suspect, you will not want to use two copies of the same query to create reports. Indeed, if a query is being used as an “intermediate” query, the data that it contains might not even be fully usable. So you could want to prevent the intermediate query from outputting data to Excel. To do this:
  1. 1.

    In the Data ribbon, click Get Data ➤ From File ➤ From Workbook.

     
  2. 2.

    Select the file BrilliantBritishCars.xlsx and click Import.

     
  3. 3.

    In the Navigator dialog, select the source table BaseData.

     
  4. 4.

    Click the popup menu at the right of the Load button and select Load To. The Import Data dialog will be displayed.

     
  5. 5.

    Click Only create connection.

     
  6. 6.

    Click OK.

     

The source query will appear in the Queries & Connections pane, but will be flagged as Connection only.

Tip

You cannot hide queries in the Queries & Connections pane in Excel—but you can place reference queries in a custom group to isolate them visually.

You may be wondering why you would want to create “intermediate” queries. Some ideas are
  • You want to isolate complex data transformations into more manageable subsets. You may, for instance, want one intermediate query that transforms the data while a subsequent query cleanses the data.

  • You could want to apply a common set of initial transformations that then feed into two separate data preparation paths—a detailed view of the data and an aggregated view.

Note

The way to convert an existing query to a connection only—ready for use as a reference query—is to delete the worksheet (or Power Pivot sheet) containing the output data.

Documenting Queries

In a complex ETL process, it is easy to get confused—or simply forget—which query does what. Consequently, I always advise documenting queries by adding a meaningful description.
  1. 1.

    Right-click the query that you want to annotate.

     
  2. 2.

    Select Properties from the context menu. The Query Properties dialog will appear.

     
  3. 3.

    Add a description. The result could be like the dialog shown in Figure 10-6.

     
../images/497001_1_En_10_Chapter/497001_1_En_10_Fig6_HTML.jpg
Figure 10-6

Adding a description to a query

  1. 4.

    Click OK.

     

The description that you added is now visible as a tooltip if you hover the cursor over the query name in the list of queries in the Queries pane of the Query Editor. You will also see the description in the Peek window for this query if you hover the mouse pointer over the query in the Queries & Connections pane in Excel.

Note

Clicking Fast Data Load will attempt to load the data faster—but Power Query could remain unresponsive for some time while the data is loaded.

Adding a Column as a New Query

There are occasions when you might want to extract a column of data and use it as a separate query. It could be that you need the data that it contains as reference data for another query, for example. The following steps explain how you can do this:
  1. 1.

    In the Queries list on the left, select the query containing the column that you want to isolate as a new query.

     
  2. 2.

    Right-click the title of the column containing the data that you want to isolate.

     
  3. 3.

    Select Add as New Query from the context menu. A new query is created. It is named after the original query and the source column.

     
  4. 4.

    In the Transform ribbon, click To Table. The To Table dialog will appear, as you can see in Figure 10-7.

     
../images/497001_1_En_10_Chapter/497001_1_En_10_Fig7_HTML.jpg
Figure 10-7

The To Table dialog

  1. 5.

    Click OK. The new query will become a table of data and will have the name of the column that you selected.

     
  2. 6.

    Rename the query, if you judge this necessary.

     

You can now use this query in your data model and as part of a linked set of query processes.

Note

A query created in this way is completely disconnected from the source query from where the data was taken. Put another way, any refresh of the source data will have no effect on the new query that you created from a column.

Managing Queries from the Queries & Connections Pane

You are not obliged to switch to the Query Editor to carry out many of the query management tasks that you discovered in this chapter. You can perform several operations directly from the Queries & Connections pane.

For example:
  1. 1.

    Open the Excel file Chapter10Sample.xlsx, and display the Queries & Connections pane.

     
  2. 2.

    Right-click one of the queries. The popup menu will appear, as shown in Figure 10-8.

     
../images/497001_1_En_10_Chapter/497001_1_En_10_Fig8_HTML.jpg
Figure 10-8

The Queries & Connections pane popup menu

  1. 3.

    Click the menu option that you want.

     
As you can see, not all the query management options are available. However, you can use this context menu to
  • Create, delete, and modify groups of queries or queries

  • Move queries between groups and inside groups

  • Expand and collapse groups

  • Alter group properties

  • Copy and paste queries and groups

  • Duplicate and reference queries

  • Merge and append queries

  • Alter query and group properties

  • Refresh queries

  • Show the Peek window for a query

Note

The contents of the context menu will vary depending on whether you have right-clicked a group or an individual query.

Conclusion

In this chapter, you saw how to manage and extend the contents of the queries that you can create using Power Query. Specifically, you saw how to modify individual steps in a data load and transformation process. This ranged from renaming steps to changing the order of steps in a process—or even altering the specification of what a step actually does.

Then you saw how to manage whole queries. You learned how to rename and group queries as well as how to chain queries so that the output from one query became the source of data for another query.

Finally, you learned how to reference—or link—queries to isolate parts of a data ingestion process or to break down a complex process into manageable parts.

It is now time to learn how to add interactivity to your processes using parameters in Power Query. You will discover this in the next chapter.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.142.166.55