© Ásgeir Gunnarsson and Michael Johnson 2020
Á. Gunnarsson and M. Johnson, Pro Microsoft Power BI Administration, https://doi.org/10.1007/978-1-4842-6567-3_13

13. Datasets and Dataflows

Ásgeir Gunnarsson1   and Michael Johnson2
(1)
Hafnarfjordur, Iceland
(2)
St. Andrews, South Africa
 
Read this chapter if you would like to find out more about how to
  • Deploy and manage datasets and dataflows

  • Configure data refreshes

  • Enable large models in Power BI Premium

  • Enable and use the XMLA endpoint

As part of the Application Lifecycle Management process, administrators are often responsible for the deployment and management of report artifacts. Datasets and dataflows are two important artifacts that often require more care than reports and dashboards do. Critical tasks relating to the management of datasets and dataflows include
  • Publishing to the service

  • Configuring data sources

  • Refreshing data

  • Managing access

  • Promoting and certifying data

In this chapter, we discuss some of these key administration activities as they relate to Power BI datasets and dataflows.

Power BI datasets

Power BI datasets, also referred to as the Power BI model, consist of tables, relationships, calculations, formatting, and data. Datasets are created in Power BI Desktop together with a report as part of a single PBIX file. When a report is published to the Power BI Service, the dataset is included as part of that deployment, unless the report was built on a dataset that has already been published to the Power BI Service. Such reports are sometimes referred to as lean reports because they do not contain an embedded data model. Once the PBIX file has been published, the model is separated from the report, and both inherit the name of the PBIX file. It is possible to change the name of the report but not the dataset.

Publishing to the service

Depending on the deployment policy of your organization, the job of publishing Power BI artifacts may become the responsibility of an administration or operations team. This separation of duties can play an essential role in ensuring that a well-governed Power BI environment is maintained.

You can read more about deployment processes in Chapter 5.

There are many ways to deploy datasets and dataflows to the Power BI environment.
  • Publishing using Power BI Desktop

  • Uploading from the portal

  • Publishing using APIs

  • Publishing using the XMLA endpoint

Publishing using Power BI Desktop

The most common approach to publishing reports is to use Power BI Desktop. When you are signed in, Power BI Desktop provides a Publish button on the ribbon.

Before the PBIX can be published to the service using the Desktop application, all changes must first be saved. After selecting the Publish button, a dialog window is displayed, listing all workspaces that the user has permission to publish to. Once the workspace has been selected, the process will first ensure that there is no existing report with that name; if the report name already exists, you will be asked if you want to overwrite the published file or cancel. The PBIX is then uploaded to the service. This may take some time if the model is large. Once Power BI Desktop has completed the upload, a dialog is displayed that confirms the upload is complete along with a link to the report.

Uploading from the portal

It is also possible to upload Power BI reports using the portal. This option is useful when the administrator does not have Power BI Desktop installed, for example, when using an operating system that is not supported.
Figure 13-1. Uploading a PBIX file using the portal

Using this option, the administrator must ensure that they are in the correct workspace before uploading the report.

Publishing using the REST APIs or PowerShell Cmdlets

Manual deployments may be appropriate for small organizations with few reports, but in any large organization with many reports and datasets, an automated approach to these deployments is preferable. The Power BI Service provides REST APIs that can be used to perform many operations including the publishing of PBIX files. Many of these functions are also available as PowerShell Cmdlets. Chapter 15 discusses these APIs in more detail.

Using these automated tools, it is possible to upload many reports in a consistent manner. These automation options can also be integrated with many of the automated deployment tools that your organization may already use.
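As a minimal sketch, the following shows how a PBIX file could be published with the MicrosoftPowerBIMgmt PowerShell module; the workspace name, file path, and conflict behavior are illustrative assumptions rather than values from this chapter.

# Requires the MicrosoftPowerBIMgmt module: Install-Module MicrosoftPowerBIMgmt
# Sign in interactively; a service principal can be used instead in automated pipelines
Connect-PowerBIServiceAccount

# Resolve the target workspace by name (placeholder name)
$workspace = Get-PowerBIWorkspace -Name "Sales Analytics"

# Upload the PBIX, overwriting an existing report and dataset with the same name
New-PowerBIReport -Path ".\SalesReport.pbix" -WorkspaceId $workspace.Id -ConflictAction CreateOrOverwrite

The same call can be embedded in an Azure DevOps or other deployment pipeline so that every environment is published in exactly the same way.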

Publishing using the XMLA endpoint

If your organization uses Power BI Premium, it is also possible to create and modify datasets directly using the new XMLA endpoint functionality. This endpoint is discussed in more detail later in this chapter. The XMLA endpoint exposes the underlying Analysis Services engine, allowing traditional tools such as Visual Studio or Tabular Editor to make changes to the model. This is the preferred way to deploy large models, as this approach does not force the upload of all data, making it a faster way to apply changes.

Keeping data up-to-date

Once a dataset has been deployed to the Power BI Service, often the next step is to ensure that the dataset is kept up-to-date. This is done by ensuring that the dataset can connect to its source systems and load the latest data. To achieve this, the user configuring the report needs to
  • Configure data sources and gateways for the dataset

  • Invoke data refresh

Configuring data sources

When Power BI reports and their underlying datasets are published to the Power BI Service, the credentials for the associated data sources are not included. This means that the credentials need to be provided before data access can occur. For datasets using DirectQuery or live connections, the reports will remain unusable until credentials are supplied; for datasets using import mode, the data will become stale. The dataset's data source configuration can be accessed via the settings page of the dataset.
Figure 13-2. Data source configuration for datasets

There are three important sections on this settings page.

Gateway connections

Gateways provide a bridge between the Power BI Service and on-premises data sources. The gateway can also be used to refresh cloud-based data sources when they are combined with on-premises data or when the gateway is explicitly configured to do so. One advantage of the Power BI Gateway is that data sources can be configured once and reused by reports and datasets that have access to the data source.
Figure 13-3. Selecting an On-premises data gateway

Gateways are discussed in greater detail in Chapter 14.

Data source credentials

When data gateways are not required, the data source credentials can be manually provided per data source.
Figure 13-4. Updating data source credentials

Parameters

Parameters can be used in many ways to define configurable variables and are a great way of creating dynamic connection details in Power BI datasets. This is useful for organizations that have separate environments for development, testing, and production. Connection strings in Power BI are embedded in the dataset and cannot be changed during deployment as they can be in other tools such as SQL Server Integration Services. Parameters can be used to dynamically modify the connection string by providing configurable values, such as in the example in Figure 13-5.
Figure 13-5. Using parameters to configure the source database

During development, these parameters point to the development servers; when the dataset is deployed to a production workspace, the parameter values are updated to point to the production data sources.
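One hedged sketch of automating this is to update the parameter values as part of the deployment, using the UpdateParameters REST API through the Power BI PowerShell cmdlets; the IDs and the parameter name ServerName are placeholders, and the call applies only to import-mode datasets.

# Assumes Connect-PowerBIServiceAccount has already been run
$workspaceId = "00000000-0000-0000-0000-000000000000"   # placeholder workspace ID
$datasetId   = "11111111-1111-1111-1111-111111111111"   # placeholder dataset ID

# Point the hypothetical 'ServerName' parameter at the production server
$body = '{ "updateDetails": [ { "name": "ServerName", "newValue": "prod-sql-01" } ] }'

Invoke-PowerBIRestMethod -Method Post -Url "groups/$workspaceId/datasets/$datasetId/Default.UpdateParameters" -Body $body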

Once the data sources and gateway have been correctly configured, the next step is to refresh the data.

Data refreshes

For reports that use import mode to remain up-to-date, the data in those models must be refreshed. How these refreshes are scheduled is determined by the size of the dataset and by how frequently, and by what means, the source data changes.

Manual refresh

Manual refreshes are performed by selecting the refresh now option for the dataset. The refresh now button adds the dataset into a queue of datasets to be refreshed, so it may not start immediately.
Figure 13-6. Manually refreshing a Power BI dataset

This option may be appropriate for datasets that are not expected to be refreshed often or whose data source is not consistently online.

Scheduled refresh

Refreshing datasets manually is not a sustainable practice when your users require constantly up-to-date data. For this, Power BI supports scheduled refreshes. Scheduled refresh can be configured to trigger refreshes multiple times per day, and the capacity type determines how many times a dataset can be refreshed: datasets in shared capacity can be refreshed up to 8 times per day, while datasets published to Premium workspaces can be refreshed up to 48 times per day.

Schedules can be created to run either daily or weekly, with the time of day set at intervals of 30 minutes. This means that a dataset could be set to refresh every hour during core business hours, once every 3 hours throughout the day, or on whatever custom refresh schedule works for your business. Premium-backed workspaces can be set to refresh every 30 minutes throughout the day. In Figure 13-7 you see how to schedule a data refresh twice daily, at 8:00 AM and 10:30 AM, in the selected time zone.
Figure 13-7. Creating a refresh schedule
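The schedule in Figure 13-7 can also be applied programmatically. The following is a rough sketch using the Update Refresh Schedule REST API; the IDs, days, and time zone are placeholder values.

# Assumes Connect-PowerBIServiceAccount has already been run
$workspaceId = "00000000-0000-0000-0000-000000000000"   # placeholder workspace ID
$datasetId   = "11111111-1111-1111-1111-111111111111"   # placeholder dataset ID

# Refresh every weekday at 08:00 and 10:30 in the given time zone
$schedule = '{ "value": { "enabled": true,
                          "days": [ "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" ],
                          "times": [ "08:00", "10:30" ],
                          "localTimeZoneId": "UTC",
                          "notifyOption": "MailOnFailure" } }'

Invoke-PowerBIRestMethod -Method Patch -Url "groups/$workspaceId/datasets/$datasetId/refreshSchedule" -Body $schedule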

Warning While it is possible to process the model every 30 minutes, careful attention needs to be paid to the added load that these refreshes place on the source systems, as these refreshes can cause high resource usage. If this is the case, then as the administrator, you may want to recommend the use of incremental refresh.

Automated refresh

Refreshing at set times during the day does not suit every data source. For example, if your organization runs a large end-of-day process that completes at inconsistent times, it is not possible to know when the best time to schedule the load is. In this case, it is possible to invoke a dataset refresh using the Power BI APIs. Using the APIs, it is possible to include the Power BI dataset refresh as a step in your ETL process. This ensures that the refresh happens as soon as new data is available, while not being called repetitively or before the data is ready.
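As an illustrative sketch, a refresh can be queued from the final step of an ETL process with a single REST call through the Power BI PowerShell cmdlets; the IDs below are placeholders.

# Assumes Connect-PowerBIServiceAccount has already been run
$workspaceId = "00000000-0000-0000-0000-000000000000"   # placeholder workspace ID
$datasetId   = "11111111-1111-1111-1111-111111111111"   # placeholder dataset ID

# Queue a refresh of the dataset and ask for an email if it fails
Invoke-PowerBIRestMethod -Method Post -Url "groups/$workspaceId/datasets/$datasetId/refreshes" -Body '{ "notifyOption": "MailOnFailure" }'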

Incremental refresh

Incremental refresh is a feature in Power BI that allows tables to be split into smaller sections called partitions. These partitions can then be selectively processed; how these partitions are selected is defined using an incremental refresh policy.

An incremental refresh policy cannot be created once a dataset has been published to the service and needs to be created using Power BI Desktop. Creating this policy may be the responsibility of the Power BI developers.

Reasons to consider incremental refresh:
  • Smaller volumes of data retrieved mean faster data refreshes.

  • A shorter refresh period results in more reliable refreshes with fewer timeouts.

  • Fewer rows loaded during refreshes results in lower impacts on source systems.

Incremental refresh cannot be used on every data source, and there are a few limitations.

Incremental refresh requirements:
  • Table must include a date or datetime field.

  • Refreshes can only work on dates in the past.

  • The data source should support query folding, such as SQL Server.

Step-by-step

For step-by-step instructions on how to set up incremental refresh, see https://docs.microsoft.com/en-us/power-bi/admin/service-premium-incremental-refresh.

Alerting

With any automated solution, it is essential to ensure that the appropriate people are notified of any errors with the refresh so that these issues can be addressed promptly to minimize user impact. When creating the refresh schedule, it is possible to supply a list of email addresses that should be notified in the event of an error.
Figure 13-8. Configuring failure notifications

It is also possible to send such notifications using the APIs by passing the list of users to be notified. See Chapter 15 for more on the Power BI APIs.

XMLA endpoint

In March 2020, the public preview of the read-write XMLA endpoint became available for Premium capacities (the read-only XMLA endpoint had already been available for several months). The XMLA endpoint supports programmatic access to the underlying dataset in much the same way as we are able to connect directly to Azure Analysis Services. Before the XMLA endpoint can be used for a workspace hosted on a Premium capacity, it first needs to be enabled in the Capacity admin portal.
Figure 13-9. Enabling the XMLA endpoint

There are many reasons why an organization would want to enable the XMLA endpoint. We look at some of the features that the XMLA endpoint enables.

Read-only XMLA endpoint

Using the read-only XMLA endpoint, it is possible to use several common tools to connect directly to the dataset; tools such as SQL Server Management Studio can be used to query the metadata of the database. More practically, other reporting tools that support XMLA can query the model directly. Such tools include Excel and Tableau, allowing business users to continue using the tools that they know while supporting a single version of the truth across the enterprise. When using the XMLA endpoint, authentication is still managed through Azure Active Directory, and users still need to be given access to the underlying workspace.

To be able to use the XMLA endpoint, all that is needed is a valid user account and the URL for the endpoint. The URL for the endpoint can be obtained by navigating to the Premium tab of the workspace and copying the workspace connection. An example of this endpoint can be seen in Figure 13-10.
Figure 13-10. Getting the workspace connection
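As a minimal sketch, assuming the SqlServer PowerShell module is installed, its Analysis Services cmdlets can connect to this endpoint and run DMV queries against a published dataset; the workspace and dataset names are placeholders, and depending on your environment you may need to pass a -Credential for Azure Active Directory authentication.

# Requires the SqlServer module: Install-Module SqlServer
# Workspace connection copied from the Premium tab (placeholder workspace name)
$endpoint = "powerbi://api.powerbi.com/v1.0/myorg/Sales Analytics"

# Return the table metadata of a published dataset (placeholder dataset name)
Invoke-ASCmd -Server $endpoint -Database "SalesModel" -Query 'SELECT * FROM $SYSTEM.TMSCHEMA_TABLES'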

Read-write XMLA endpoint

Setting the XMLA endpoint to read-write supports all the features available in the read-only endpoint as well as many functions that allow the dataset to be updated using either XMLA or TOM (Tabular Object Model). Some of the tools that can use the XMLA endpoint are
  • SQL Server Management Studio : Management Studio is often used by SQL Server administrators and developers to connect to and manage SQL Server environments. Management Studio can be used to perform tasks such as exploring the dataset, returning metadata about the model such as its size, creating and deleting partitions, as well as processing partitions.

  • SQL Server profiler : Profiler can be used to connect to Power BI workspaces and monitor the activities taking place. This can be used to help troubleshoot performance problems.

  • Visual Studio : For years, Visual Studio has been the primary tool used to develop Microsoft Business Intelligence solutions, including multidimensional and tabular models, and it provides an easy-to-use graphical interface.

  • Tabular Editor : Tabular Editor is an open-source application that makes developing models easy and supports automation and testing. Tabular Editor can also be used to deploy changes to the model without always requiring the model to be re-uploaded or refreshed.

  • DAX Studio : DAX Studio is another third-party tool, developed primarily to write and test DAX queries. DAX Studio includes tools to help understand the performance and resource usage of measures.

  • PowerShell : Administrators can use PowerShell to automate the creation and processing of partitions to manage custom refresh scenarios. A minimal sketch follows this list.
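Assuming the XMLA endpoint has been set to read-write, a rough sketch of processing a single table with the SqlServer module's Analysis Services cmdlets looks like this; the workspace, dataset, and table names are placeholders, and parameter names may differ slightly between module versions.

# Requires the SqlServer module (Analysis Services cmdlets)
$endpoint = "powerbi://api.powerbi.com/v1.0/myorg/Sales Analytics"   # placeholder workspace connection

# Fully process one table in the published dataset (placeholder database and table names)
Invoke-ProcessTable -Server $endpoint -DatabaseName "SalesModel" -TableName "FactSales" -RefreshType Full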

Large data models in Power BI Premium

Reports published to a workspace in shared capacity have a 1 GB limit on the size of the dataset. Moving that workspace to a Premium capacity increases the maximum size to which the dataset can grow. In Figure 13-11 you can see the maximum size that each SKU supports.
Figure 13-11. Maximum dataset size per SKU

Using Power BI Premium, it is possible to publish data models larger than these limits. Using the large models feature in Power BI, it is possible to grow the model to the maximum amount of memory available to the capacity.
Figure 13-12. Memory limits per Premium SKU

This means that it is possible to support datasets up to 400 GB using a P5 Premium SKU, although this would not be advisable, as it would not leave any memory for other processes. Large datasets are enabled per dataset, and this can currently only be done using the Power BI PowerShell cmdlets.
Figure 13-13. Enabling large datasets using PowerShell
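The command shown in Figure 13-13 is likely along the lines of the following sketch, which switches the dataset's storage mode to Premium Files; the dataset ID is a placeholder.

# Assumes Connect-PowerBIServiceAccount has already been run (MicrosoftPowerBIMgmt module)
# Enable the large dataset storage format for a single dataset (placeholder ID)
Set-PowerBIDataset -Id "11111111-1111-1111-1111-111111111111" -TargetStorageMode PremiumFiles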

When large datasets are enabled, the file upload sizes are still limited to the sizes listed in Figure 13-12. However, refreshes on that dataset are supported up until the max memory size for the capacity. For this reason, it is recommended that large datasets be combined with an incremental refresh or custom partitioning.

Managing access

Controlling access to data is an essential administrative function; datasets can contain sensitive data that must not be accessible to unauthorized users. There are two primary mechanisms that we use to secure data. The first is limiting access to workspaces or workspace apps to only those users who require it. The second is row-level security, which is used to prevent legitimate users from accessing data that they should not view, such as preventing a sales agent from seeing the sales of other agents.

In Chapter 11, we looked at how access to workspaces can be managed by assigning roles to users for that workspace. To promote reuse of datasets within an organization, it is possible to build reports on top of datasets that are published in other workspaces. This promotes the idea of having one central model that serves as a single version of the truth. To enable this feature, a Power BI administrator needs to enable the setting in the Tenant settings of the admin portal.
Figure 13-14. Enabling cross-workspace datasets

Once this setting is enabled (it may take up to 15 minutes to take effect), users can be added to the dataset by selecting the Manage Permissions option on the dataset. A new dialog is displayed where you can capture the users or groups that you wish to grant access to.
Figure 13-15. Adding a user to a dataset

Users or groups who are added to this list can consume reports that are based on datasets that are in workspaces that they do not have access to. Two additional settings are available for this:
  • Allow recipients to reshare the artifact: This allows users who have been granted access to reshare this access with others. Resharing may be appropriate in a smaller organization, but it should not be allowed in more controlled environments where there is usually a formal process to request access.

  • Allow recipients to build new content from the underlying dataset: With this option enabled, users can build new reports connected to the data model. This setting would be required for Power BI report developers who need to reuse existing datasets.

Power BI dataflows

Like Power BI datasets, Power BI dataflows are a way of moving data into the Power BI Service. However, unlike datasets, Power BI reports cannot be built directly against dataflows. Instead, dataflows provide an easy-to-use self-service data preparation process that allows for the creation of standardized entities that can be reused as the source for multiple datasets.

Advantages of using Power BI dataflows:
  • Consolidate and standardize repeated data transformations

  • Reduce impact on source systems

  • Use the Common Data Model to create standardized entities

The development process for Power BI dataflows differs somewhat from that of Power BI datasets. The only supported development environment for dataflows is Power Query Online, which is a web-based version of the Power Query editor that we see in Power BI Desktop. The code to import data is identical, and it is even possible to copy and paste queries directly from Power BI Desktop into a dataflow.

When Power BI dataflows ingest data, the data is not loaded directly into a dataset. Instead, the data is landed as CSV files together with metadata about the data, its source query, and its storage location. By default, the data is stored within the Power BI Service and is not accessible through any other mechanism. This can be changed to allow the storage of dataflows within an Azure Data Lake Storage Gen 2 account that is linked to the organization's Azure account.

Configuring dataflows to use external storage

By default, all dataflows are stored within the Power BI Service and managed by Microsoft. Therefore, it is not possible to interact with these dataflows in any way except through the Power BI Service. It is possible to change the storage location of the dataflows to an Azure Data Lake Storage (ADLS) Gen 2 storage account in your organization's Azure Subscription. When this is done, it is possible to
  • Extend the amount of storage available for dataflows

  • Access the dataflow storage directly through other tools such as Azure Machine Learning

  • Create CDM folders using external tools such as Azure Data Factory, Azure Databricks, or any other tool

Before the external storage can be set up, the ADLS account needs to be configured. This is a task that usually cannot be performed by a Power BI administrator, so the details are not discussed here. See the link to the step-by-step instructions later in this section for details on how this can be done. The next step is to configure the dataflow settings in the Power BI admin portal. This needs to be done by someone with Global administrator permissions on the tenant.
Figure 13-16. Configuring ADLS details

Provide the details for the
  • Subscription ID

  • Resource Group

  • Storage Account

Once this step is complete, Power BI is connected to the ADLS. The next step is to enable workspace administrators to assign workspaces to this storage account.
Figure 13-17. Allowing workspace admins to use the storage account

Step-by-step

For detailed instructions on how to set up external storage, see https://docs.microsoft.com/en-us/power-bi/transform-model/service-dataflows-connect-azure-data-lake-storage-gen2.

Enhanced compute engine

The enhanced compute engine allows Power BI Premium users to optimize the performance of their dataflows by changing the underlying storage mechanism. The primary benefits of this are
  • Faster loads

  • DirectQuery access to dataflows

The enhanced compute engine can be configured per capacity in the Capacity settings of the Power BI admin portal.
Figure 13-18. Configuring the enhanced compute engine

In addition to enabling the enhanced compute engine, you also need to set the container size; this is the container that executes the enhanced compute engine process. If this process is given too little memory, there will be a large amount of paging between the container and disk, which results in slower dataflows.

Publishing dataflows

Unlike Power BI reports and datasets, dataflows are authored in the Power BI Service using Power Query Online. If the dataflow needs to be moved from one environment to another, for example, from development to production, the dataflow can first be exported to a JSON file that contains the definitions for the entities as well as the M code for those data sources. It is important to note that, as with a PBIX file, no credentials are stored in the file.

Once the dataflow has been exported to this .json file, it should also be checked into your organization’s source code repository.
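A hedged sketch of scripting this export with the Dataflows REST API, so that the definition can be committed to source control automatically, follows; the IDs and output path are placeholders.

# Assumes Connect-PowerBIServiceAccount has already been run
$workspaceId = "00000000-0000-0000-0000-000000000000"   # placeholder workspace ID
$dataflowId  = "22222222-2222-2222-2222-222222222222"   # placeholder dataflow ID

# The Get Dataflow API returns the dataflow definition (model.json) as text
Invoke-PowerBIRestMethod -Method Get -Url "groups/$workspaceId/dataflows/$dataflowId" | Out-File -FilePath ".\SalesDataflow.json"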
Figure 13-19. Exporting a dataflow to a .json file

Uploading the dataflow is not like uploading a PBIX file. Instead, you select the option to create a new dataflow inside the workspace into which you want to upload the dataflow. You are then given four options, the third of which is to import a model.
Figure 13-20. Uploading a dataflow model

After importing the model, you will then be directed to the settings page for the model where you can supply missing credential details and select a gateway as you would for a newly published dataset.

Refreshing dataflows

Refreshing a dataflow is very similar to the dataset refresh process and can be done by
  • Manual refresh

  • Scheduled refresh

  • Automated refresh

  • Incremental refresh

The only significant difference between refreshing datasets and dataflows comes from incremental refreshes. Incremental refresh is only available for dataflows when using Power BI Premium.
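As with datasets, a rough sketch of triggering a dataflow refresh from an ETL process uses the Dataflows REST API; the IDs are placeholders.

# Assumes Connect-PowerBIServiceAccount has already been run
$workspaceId = "00000000-0000-0000-0000-000000000000"   # placeholder workspace ID
$dataflowId  = "22222222-2222-2222-2222-222222222222"   # placeholder dataflow ID

# Queue a refresh of the dataflow and ask for an email if it fails
Invoke-PowerBIRestMethod -Method Post -Url "groups/$workspaceId/dataflows/$dataflowId/refreshes" -Body '{ "notifyOption": "MailOnFailure" }'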

Endorsing datasets or dataflows

To increase the visibility, use, and trust of datasets and dataflows, Power BI introduced the ability to mark these items as either promoted or certified. It is not possible to mark a dataset as both certified and promoted simultaneously.

Promotion

Promoting a dataset is an easy way for content creators or any user with write permission on the workspace to highlight the dataset or dataflow. As an administrator, there is no way to prevent users from promoting datasets or dataflows in this way. When a dataset or dataflow is promoted, it is moved to the top of search lists, such as the list shown when users connect to a Power BI dataset or dataflow from Power BI Desktop.
Figure 13-21. Promoting a dataset

As a Power BI administrator, you would want to monitor the use of this functionality to ensure that it is not being abused, as promoting too many or even all datasets would have the same effect as not promoting them at all. To manage this, decide on a policy covering the use of this feature and how long a dataset or dataflow should remain promoted.

Certification

The certification of a dataset or dataflow signals to users that the data within is safe and should be considered a proven data source. Unlike promotion, the process of certifying a dataset should be more rigorous and should be restricted to designated individuals, if it is enabled at all. Before a dataset can be certified, a setting in the Power BI admin portal first needs to be enabled and configured.
Figure 13-22. Enabling certification of datasets

  • URL for documentation: You can provide a link to a knowledge repository such as SharePoint, Confluence, or even a workflow tool where users can find out more about getting their datasets certified.

  • Apply to: It is possible to specify who can certify datasets and dataflows. It is not recommended that you enable this for the entire organization. The second option is to specify a list of security groups that are either able to or prohibited from using this functionality. It is important to note that this requires security groups or Office 365 groups; individuals cannot be entered here as they can be for some other options in the Power BI admin portal.

The key to this setting is having a formal process that developers need to go through to certify their datasets. This process could be as informal as sending an email to the members of the certification team or as structured as creating workflows to manage the process.

Part of your organization's Power BI governance documentation needs to specify how certification is done and how certifications are evaluated and monitored. Once a policy for certification has been established and the certification functionality is enabled, it is possible to certify the dataset or dataflow by navigating to its settings page. This is done by expanding the endorsement pane and selecting the certify option. A description should also be provided to aid users in understanding how the dataset should be used.
Figure 13-23. Certifying a dataset

Once certified, the dataset will appear at the top of the search list when browsing available datasets.

Certification and promotion are a great way to enhance the usage of, and trust in, the datasets and dataflows built by your organization.

Summary

The creation of datasets and dataflows is primarily a development task, but managing these artifacts once they are in production is an administrative one. The two key responsibilities here are
  • Ensuring that the data is accessible and up-to-date

  • Ensuring that the impact on the source system is acceptable and does not affect the operations of the organization.

There are many tools available to the administrator to help ensure the availability of data.

Next, we look at the On-premises data gateway that is used to connect the Power BI Service to your on-premises data sources.

Call to action

  • Formalize deployment and monitoring processes for datasets and dataflows.

  • Ensure appropriate refresh schedules for datasets and dataflows.

  • Configure external storage for dataflows to enable new enhanced data preparation functionality.

  • Monitor for large data refreshes and evaluate if incremental refresh is possible.
