Discovering and Loading Data with Power Query
Before you can present any analysis or insight, you need source data. Your source data could be in many places and in many formats. Nonetheless, you need to access it, look at it, and quite possibly clean it up to some extent. You may also need to join separate data sources before you can shape the data into a coherent data set using PowerPivot, deliver the results using Power View or Power Map, and then share it using Power BI.
Discovering, loading, cleaning, and modifying source data is where Power Query comes in. Using this, the latest addition to the Microsoft self-service business intelligence stack, you can carry out
Although I have outlined these three steps as if they are completely separate and sequential, the reality is that they often blend into a single process. Indeed there could be many occasions when you will examine the data after it has been loaded into Power Query—or join data tables before you clean them. The core objective will, however, always remain the same: find some data and then load it into Power Query where you can tweak, clean, and shape it.
This process could be described simplistically as “First, catch your data.” In the world of data warehousing, the specialists call it ETL, which is short for Extract Transform Load. Despite the reassuring confidence that the acronym brings, this process is rarely a smooth logical progression through a clear-cut series of processes. The reality is often far messier than that. You may often find yourself importing some data, cleaning it, importing some more data from another source, combining the second data set with the first one, cleaning some more, and then repeating many of these operations several times.
In this chapter and the following one then, I will try and show you how the process can work in an ideal world. I hope that this will make the various steps that comprise an ETL process clearer. All I am asking is that you remain aware that all the options and possibilities that Power Query offers make it a multifaceted and tremendously capable tool. The science is to know which options to use. The art is to know when to use them.
In this chapter we will begin by seeing how to find and load data from a variety of sources. Once again I will be using a set of example files that you can find on the Apress web site. If you have followed the instructions in Appendix A, then these files will be in the C:HighImpactDataVisualizationWithPowerBI folder. Now, although I will be re-creating the source data set that we have used up until now in this book, I will also be going beyond that and importing data that is not used in the Power View, PowerPivot, and Power Map chapters. This is because the sheer variety of techniques that can be applied to load and transform data goes beyond those that you can demonstrate with a simple data set. Consequently, at times, the examples I use will not necessarily lead into a structured data set, but I will use them as just that—examples—to show you a specific solution to a particular ETL problem.
The Power Query Interface
You access Power Query using the Power Query tab and ribbon that appear in Excel once you’ve downloaded the Power Query add-in and installed it (as described in Chapter 1). Similar to the way we worked with PowerPivot, you begin working in Excel, using the options available in the Power Query ribbon, and then you progress to the Power Query window where you will carry out most of your work.
The Power Query ribbon essentially lets you find data and connect to it. It will also show a list of all the queries that you have developed in the current workbook and let you manage queries. It will be your starting point for both developing new queries and using or modifying existing queries.
The Power Query ribbon contains the elements that are explained in Figure 12-1.
Figure 12-1. The Power Query ribbon
Table 12-1. Power Query Ribbon Options
Option |
Comments |
---|---|
Online Search |
Search for data from public sources. |
From Web |
Import tabular data from web pages. |
From File |
Import data from CSV, text, XML, and Excel files. |
From Database |
Import data from relational databases. |
From Other Sources |
Import data from a variety of sources. |
From Table |
Import data from an Excel table to the current workbook. |
Merge |
Join two queries to add columns from one query output to another. |
Append |
Add data from an identically structured query output to the selected query. |
Workbook |
Display—or hide—the list of queries in the current workbook. |
Shared |
Display any shared queries. |
Locale |
Define the regional settings used by Power Query. |
Fast Combine |
Bypass security to accelerate data loads. |
Data Source Settings |
Manage data source settings. |
Update |
Update Power Query with the latest version. |
Options |
Choose Power Query options. |
Sign In |
Sign in to Power BI to access shared queries and data. |
Power Query groups potential data sources into the following categories:
You can also list the contents of folders on any available local disk or network share (even if it is not always a data source) and then leverage this to import files. Similarly (if you have the necessary permissions), you can list the databases and data available on the database servers you connect to. This way Power Query can provide not only the data, but also the metadata—or data about data—which can help you get the job done.
Unfortunately the sheer range of data sources from which Power Query can read data is such that we do not have space here to examine the minutiae of every one. As a consequence, we will take a rapid tour of some of the most frequently used data sources in the next few pages. You will probably be using some types of data source much more than others including, I hope, those that are outlined here. What is more, many of the data sources can be used in a similar way, and the Power Query interface does a wonderful job of making the arcane connection details as unobtrusive as possible. As a result, once you have grasped the way in which Power Query connects to data sources in general, it should not be difficult for you to use some of the more rare connections types without detailed explanation.
Sending files across networks and over the Internet or via email has become second nature to most of us. As long as the files that you have obtained conform to some of the widely-recognized standards currently in use (of which you will learn more later), you should have little difficulty in loading them into Power Query.
The file sources that Power Query can currently read and from which it can load data are given in Table 12-2.
Table 12-2. File Sources
File Sources |
Comments |
---|---|
Excel |
Allows you to read Microsoft Excel files (versions 97 to 2013) and load worksheets, named ranges, and tables |
CSV |
Lets you load text files that conform to the CSV (comma separated values) format |
XML |
Allows you to load XML data |
Text |
Lets you load text files using a variety of separators |
Much corporate data currently resides in relational databases. It follows that being able to look at this data is essential for much of today’s business intelligence. In the real world, connecting to corporate data will require, at the very least, you to have a logon name and possibly a password that will let you connect. I imagine that it will also require permissions to read the tables and views (which, without going into the grisly details, are essentially the same thing for Power Query). So the techniques described here are probably the easy bit. The hard part is having to be pleasant with the guardians of corporate data so that they concede that you actually need the data and should be allowed to see it.
The databases that Power Query can currently connect to, and can preview and load data from, are given in Table 12-3.
Table 12-3. Database Sources
Database |
Comments |
---|---|
SQL Server |
Lets you connect to a Microsoft SQL Server on-premises database and import records from all the data tables and views that you are authorized to access |
Windows Azure SQL Database |
Lets you connect to a Microsoft SQL Server cloud-based database and import records from all the data tables and views that you are authorized to access |
Access Database |
Lets you connect to a Microsoft Access file on your network and load queries and tables |
Oracle Database |
Lets you connect to an Oracle database and import records from all the data tables and views that you are authorized to access |
DB2 Database |
Lets you connect to an IBM DB2 database and import records from all the data tables and views that you are authorized to access |
MySQL |
Lets you connect to a MySQL database and import records from all the data tables and views that you are authorized to access |
PostgreSQL |
Lets you connect to a PostgreSQL database and import records from all the data tables and views that you are authorized to access |
Sybase |
Lets you connect to a Sybase database and import records from all the data tables and views that you are authorized to access |
Teradata |
Lets you connect to a Teradata database and import records from all the data tables and views that you are authorized to access |
Connecting to Oracle, DB2, MySQL, PostgreSQL, Sybase, or Teradata will require not only that the database administrator has given you the necessary permissions, but also that connection software (known as drivers or providers) has been installed on your PC. Given the “corporate” nature of the requirements, I suggest that you talk directly to your IT department to get this set up.
Other Sources
Up until now we have seen some of the more traditional sources of data that you might need for your analysis. As the world changes, the available sources evolve, and the current trend is toward less “structured” (and controlled) data sources and toward more varied and often less corporate sources.
Power Query can connect to, and read data from, a whole host of these less classic sources. Those to which it can currently connect are listed in Table 12-4.
Table 12-4. Other Sources
Source |
Comments |
---|---|
SharePoint List |
Loads a SharePoint List as a data table. You will need SharePoint permissions to access SharePoint data. |
OData Feed |
Connects to an OData feed to read and load the data it contains. OData is a standardized protocol for creating and consuming data, especially over the Internet. |
Windows Azure Marketplace |
Lets you load data that you are authorized to access on the Windows Azure Marketplace. This will require a Windows Azure Marketplace subscription. |
Hadoop File |
Reads Hadoop (“Big Data”) files. |
Windows Azure HDInsight |
Reads cloud-based Hadoop files in the Microsoft Azure environment. |
Windows Azure Blob Storage |
Reads from a cloud-based unstructured data store. |
Windows Azure Table Storage |
Reads from Windows Azure tables. |
Active Directory |
Reads data from the Enterprise Active Directory. This will probably require custom access rights. |
MS Exchange |
Reads data from the MS Exchange Email system. |
|
Reads Facebook data. |
These data sources are so varied, and often personal, that I will not be going through them here. Suffice it to say that the connection principles and approach are as identical as possible to those that you will see in the next few pages. So it is up to you to try connecting to this brave new world and using the data it offers.
Loading Data
It is time to start looking at the heavy lifting aspect of Power Query, and how you can use it to load data from a variety of different sources. As always, I will begin on the bunny slopes with a simple example of “scraping” data from a web page. Then, given the plethora of available data sources, and so we can better give the process some structure, we will begin by loading the five tables that make up the data set you used in the chapters on Power View and PowerPivot. However, each table will come from one of the classic data sources that currently can be found in most workplaces:
After we have loaded these five tables, we will look into getting data from a relational database—SQL Server will be the example here—as databases are, in my experience, a frequent source of core data for analysis, and we can apply several tricks and techniques to database sources. Finally, we will look at a technique for loading multiple files of the same format, as this can be a frequent requirement.
As a first and extremely simple example, let’s grab some data from a web page. Since I want to concentrate on the method rather than the data, I will use a web page that has nothing to do with the sample data in the book, and that we will not be using at all, other than as a simple introduction to the process of loading data using Power Query.
Figure 12-2. The From Web source dialog
Figure 12-3. The Navigator window
Figure 12-4. The Peek popup
Figure 12-5. The Power Query window
In a few simple steps you have seen just what Power Query can do and how it is done. You have found data and loaded it into Excel—a complete ETL process in a few clicks. In a very short time, you have seen lots of things happen and a whole new world open up in front of you. Consequently, I think that it would be a good idea to explain some of these new elements before going any further.
The Navigator window will appear when connecting to many, but not all, data sources. It is there to let you
Depending on the data source to which you have connected, you might see only a few data tables in the Navigator window, or hundreds of them. In any case, what you can see are the structured data sets that Power Query can recognize and is confident that it can import. Equally dependent on the data source is the level of complexity of what you will see in the Navigator window. If you are looking at a database server, for instance, then you may start out with a list of databases, and may need to dig deeper into the arborescence of the data by expanding databases to list the available data tables and views.
The Peek popup (also known as the fly-out) is, as its name implies, a preview of the data in a table. It provides
Note The Peek popup is a brilliant data discovery tool. Without having to load any data, you can take a quick look at the data source and any data that it contains that can probably be loaded by Power Query into Excel or PowerPivot. You can then decide if it is worth loading, and so you do not waste time on a data load that could be superfluous.
The Peek popup will only show you a tiny snippet of the available data. This is deliberate because it is only designed to provide a preview. One useful trick, if you have a table containing many columns, is to click on a column name at the bottom of the Peek popup. The chosen column will appear in the window so that you can see the data it contains.
Power Query will not forget how it loaded data. Each separate load process is remembered as a “query” and appears in the WorkBook Queries window, which is normally to the right of the current workbook. Each query in the WorkBook Queries window has a name and a last activated date, as well as an indication of whether it loads data into Excel or PowerPivot or remains a query process, but without actually loading the data anywhere from Power Query.
The WorkBook Queries window also has a Peek popup. So if you hover the cursor over a query, you will see what the data returned by this query will look like. If you want to go back to a query to continue modifying it, all you have to do is double-click on it, and you will open the Power Query window and see the data as you left it. Figure 12-6 shows the WorkBook Queries window in more detail.
Figure 12-6. The WorkBook Queries window
Since you will be using the various possibilities of the WorkBook Queries window in your day-to-day work using Power Query, Table 12-5 explains the available options in the context menu that appears when you right-click on a WorkBook query.
Table 12-5. Query Options
Option |
Description |
---|---|
Edit |
Opens the Power Query window where you can modify the query. Double-clicking will also carry out this action. This is also possible by clicking on Edit in the Peek popup. |
Refresh |
Runs the query and reprocesses the data. |
Duplicate |
Makes a copy of the query. |
Reference |
Creates a new query that references the selected query. |
Delete |
Deletes the query. You will be asked for confirmation. This is also possible by clicking on Delete in the Peek popup. |
Merge |
Lets you join the data from two queries. This is described in the section “Merge.” |
Append |
Lets you add identically structured data from another query to the selected query. |
Share |
Lets the user share this query (and its output) using Power BI. This is described in Chapter 15. |
Show the Peek |
Displays the Peek popup. |
If you wish to rename a query, you can do this from inside the Power Query window.
Note If you have returned data to a worksheet, and then you delete the worksheet, you will also delete the query—without any warning.
The scenario is as follows: you have been given a comma-separated text file (also known as a CSV file) containing a list of data. You now want to load this into Power Query so that you can look at the data and consider what needs doing (if anything) to make it useable. Here is what you have to do:
Figure 12-7. The Power Query window
And that, for the moment, is that. You have loaded the file into Power Query and, in the next chapter, you will soon learn how to shape the data and load it into either Excel or PowerPivot. If you really want to jump ahead to this discussion now, all is explained in the sections “Dataset Shaping” and “Data Cleansing and Modification” in Chapter 13.
What Is a CSV File?
Before we move on to other file types, there are a few comments I need to make about CSV files. There is a technical specification of what a “true” CSV file is, but I won’t bore you with that. What’s more many programs that generate CSV files do not always follow the definition exactly. What matters is that Power Query can handle files that
If you have followed the process for loading a CSV file in the previous section, then you will find it considerably similar to loading a text file. This is not surprising. Both are text files, and both should contain a single list of data. The core differences are these:
Here, then, is how to load a text file into Power Query:
Where Power Query is really clever is that it can make a very educated guess as to how the text file is designed. That is, it can nearly always guess the field separator (the character that isolates each element in a list from the other elements) and so not only will it break the list into columns, but it will also avoid importing the column separator.
XML, or Extensible Markup Language, is a standard means of sending data between IT systems. Consequently you have every chance of having to load an XML file one day. Although an XML file is just text, it is text that has been formatted in a very specific way, as you can see if you ever open a XML file in a text editor such as notepad. To load an XML file,
Figure 12-8. The Power Query window after loading an XML file
The actual internal format of an XML file can get extremely complex. Sometimes an XML file will contain only one data table, sometimes it will contain many data separate tables. On other occasions it will contain one table whose records contain nested levels of data that you need to handle by expanding or aggregating. These techniques are described later in this chapter in the context of database sources.
If you are using Power Query you are probably already a major Excel user and have many, many spreadsheets full of data that you want to rationalize and use for analysis and presentation. So, let’s see how to load the contents of an Excel file. To begin with, I will show you how to get data from another workbook:
As you can see from this simple example, having Power Query read Excel data is really not difficult. However, you might still be wondering about a couple of things that you saw during this process, so here are some anticipatory comments:
The Navigator window will display
Sometimes these can, in effect, be duplicate references to the same data, so you should really use the most precise data source that you can. For instance, I advise using a named table or a range name rather than a worksheet source, as the latter could easily end up containing “noise” data, which would make the load process more complex than it really needs to be.
Power Query will list, and use, data connections in a source Excel workbook if the data connection is active and has returned data to the workbook. Once a link to Power Query has been established, you can delete the data table itself in the source Excel workbook—and still use Power Query to load the data over the data connection in the source workbook.
Power Query will not take into account any data filters on an Excel data table. Consequently you will have to reapply any filters (of which you’ll learn more later) in Power Query if you want to subset the source data.
Excel Data from the Current Workbook
Loading a table from an existing worksheet into Power Query is, if anything, easier than the process that you just saw, but is, nonetheless, slightly different. Here is what you have to do:
Figure 12-9. Loading data from the current spreadsheet—the From Table dialog
The source list will be converted into an Excel table and its data loaded into Power Query.
Another well-used data repository that proliferates in many corporations today is Microsoft Access. It is a powerful relational desktop database and can contain multiple tables, each containing millions of records. So we need to see how to load data from this particular source.
I am sure that you can see a pattern emerging here. Indeed, this pattern will continue as you progress to loading tables from relational databases in a few pages time.
Note Power Query cannot see linked tables, only imported tables or tables that are actually in the Access database. It can, however, read queries overlaid upon native, linked, or imported data.
An Access database can contain hundreds of tables or queries (which, like database views, can be considered as if they are tables). So what if you want to import several tables at once? No problem:
Each table or query in Access will become a separate Power Query query and will be listed in the WorkBook Queries window. Each query will take the name of the table it has loaded into Power Query. If you want to see the contents of the query in the Power Query window and continue transforming the data, then all you have to do is to double-click on the query.
Loading the List of Tables
Another option that you may find useful is loading the list of tables and queries into Power Query rather than the data. There are two potential reasons for this:
Here is how to do this:
How you can use this list of tables to load data is handled later in this chapter in the section “Database Metadata”. For the moment, let’s just say that it is a smart way to get a handle on your data.
Enterprise relational databases still hold much of the world’s data, so you really need to know how to tap into the potentially vast mines of information that they contain. The bad news is that there are many, many databases out there, each with their intricacies and quirks. The good news is that once you have learned to load data from one of them, you should be able to use any of them.
As Power BI and Power Query are core Microsoft technologies, I will use the Microsoft enterprise relational database—SQL Server—as an example to show the principles of database access. The first advantage of this setup is that you probably do not need to install any software to enable access to SQL Server (though this is not always the case, so talk this through with your IT department). A second advantage is that the techniques are pretty similar to those used and applied by Oracle, DB2, and the other databases to which Power Query can connect. Furthermore, you can load multiple tables or views (each as a separate query) from a database. To see this in action (and presuming you have created the database CarSalesData as described in Appendix A), take the following steps:
Figure 12-10. The Microsoft SQL Database dialog
Figure 12-11. The Navigator window when selecting multiple items
Since this is very similar to the way in which you loaded data from Access, I imagine that you are getting the hang of things by now. As always, though, there are a few comments to make.
First, let’s cover the initial connection to the server. The options are explained in Table 12-6.
Table 12-6. Database Connection Options
Option |
Comments |
---|---|
Server |
You cannot browse to find the server and need to type or paste the server name. If the server has an instance name, you need to enter the server and the instance. |
Database |
If you know the database, then you can enter (or paste) it here. This will restrict the number of available tables in the Navigator window and will make finding the correct table or view easier. |
SQL Statement |
If you enter an SQL SELECT statement that is fully qualified—that is, it contains the database reference if you have not added this already in the dialog—then Power Query will open the Power Query window and load the table directly. |
Note These options will vary depending on the make of database you are connecting to.
Second, remember that the world of relational databases is huge; much, much more could be said at this point. I am afraid that I simply do not have the space to devote to all the subtleties of how you can use the available relational database sources, however. One important point to note is that you can use stored procedures (with SQL Server or Oracle, for instance) to return data. These are complex pieces of business logic that are stored in a database. To call a SQL Server stored procedure, you would enter the following elements into the Microsoft SQL Database dialog:
This way, either you or your IT department can create complex and secure ways to allow data from the corporate databases to be read into Power Query from enterprise databases.
Finally, you can create multiple workbook queries at once if you have selected multiple items by clicking the Load button. This will create a query for each selected table which you can modify and extend later.
There could well be occasions when you want to connect to a database, but not yet load the data. Perhaps you want merely to memorize the connection parameters, or alternatively, you want to examine the data more thoroughly. This is possible if you load not a table, but the database metadata, into Power Query, like this:
Figure 12-12. Database metadata
From here, you can look at the tables in the database and choose any single table that you wish to load by clicking on it in the Navigator window (or by right-clicking and selecting Drill-down in the table data). Creating a query like this, and then copying it to load the data tables, saves you from having to reconnect to the source repeatedly.
Note Here I was using MS Access as a demonstration. The same principles apply to all relational databases which Power Query can connect to.
Data Discovery or Data Load?
You have now seen how to load data from several different sources, although far from all of them. What you have been able to see, I hope, is that Power Query applies a common interface to the art and science of loading data, whatever the source.
Yet you may remember that, at the start of this chapter, I mentioned that Power Query is for discovering data as well as loading it. This covers the following points:
Peeking at Data
Only data sources containing multiple data tables can be previewed. If a source contains, by definition, only one data table, then you cannot “peek” at it but can only load it into Power Query and take a look at it there. As we have seen, Excel files, Access database files, and relational databases allow you to see the structure of the available data before you even see the data itself. However, I wanted to warn you first that the following file types cannot be previewed:
There are already many thousands of publicly available data sources, and the number is growing at an incredible rate. It could well be that you need data from the public domain to add to your corporate data to provide comparisons, or perhaps you need it just on its own to make a point. Power Query can search for available public data and attempt to get you the information that you are looking for. As an example of this
Figure 12-13. The Search window
The list of data sources could cover many pages of the Online Search window. You can flip from page to page by clicking the page number figures at the bottom of the window. Hovering the mouse pointer over a data source will display the fly-out showing the data. You can then load the data that interests you by right-clicking on it and selecting Edit.
Another facet of the Search facility is its ability to scan corporate data sources that have been made available using Power BI (this is explained in Chapter 15). If you or your colleagues have made searches available, and you have access rights, then Power Query will attempt to find these too.
The Search ribbon also lets you refine your search. Essentially you can
The Search ribbon is shown in Figure 12-14 and its options explained in Table 12-7.
Figure 12-14. The Search ribbon
Table 12-7. Search Ribbon Options
Option |
Description |
---|---|
All |
Searches across all areas—public and organizational |
My Shared |
Only uses searches for queries that you have shared in Power BI. |
Public |
Searches for public data only |
Organization |
Searches for organizational data only |
Name |
Filters the search using the search name |
Description |
Filters the search using the search description |
From |
Filters the search to specific people or web sites |
Data Source |
Filters the search to specific sources of data |
This Week |
Filters the search to queries updated in a specific time frame |
Column Names |
Show results based on column name filtering |
Certified |
Only displays certified searches |
Recent Searches |
Lists any recent searches |
Close Search |
Closes the search window |
Note Shared and organizational shared queries are explained in Chapter 15.
Conclusion
In this chapter you have seen how the latest addition to the Power BI toolset, Power Query, can help you find and load data from a variety of sources. These sources can be more traditional—such as Access, Excel, or text files—or they could come from big data repositories or social media sources. Indeed, data could even be found in public data repositories or from commercial cloud-based sources. Power Query will even help you find available data (from both inside and outside the enterprise) and remember any recent searches you have made. Yet this is only the first part of the story. Now you need to learn how to shape and tweak the data to prepare it for further use. This is the subject of the next chapter.
3.22.181.47