As data increases in both volume and variety, organizations from all verticals are adopting cloud analytics services for their data analytics. AWS offers a number of analytics services covering data lakes, data warehousing, big data processing, extract, transform, load (ETL), and data visualization. In this chapter, we will introduce the AWS analytics ecosystem. Some of the services we discuss here will be mentioned again later in the book.
First, we will map the AWS services into categories. Then, we will discuss how Amazon QuickSight fits into the wider AWS analytics ecosystem. We will look more closely at a modern modern data architecture and we will discuss its benefits and its components. Finally, we will provide a step-by-step guide to set up a modern data architecture on AWS and load and query a demo data sample. Some of this information may already be familiar to you, but let's go back over the basics.
In this chapter, we will cover the following topics:
To follow along with this chapter, you will need the following pre-requisites:
The code sample for this chapter can be accessed on the GitHub repository for this book at https://github.com/PacktPublishing/Actionable-Insights-with-Amazon-QuickSight/tree/main/chapter_1.
AWS provides a large number of analytics services. In addition to that, AWS has a number of partners who specialize in data analytics and offer analytics solutions that run on the AWS infrastructure. Partner solutions are not in the scope of this section, however. This section focuses on the AWS fully managed analytics services. In order to list the services, we will first define the specific categories related to analytics functions. Machine learning and predictive analytics are also out of the scope of this chapter. For every service category, we will then list the AWS services available, and for each service, we will provide a high-level description. Figure 1.1 depicts the commonly used AWS analytics services.
More and more organizations aspire to be data-driven and use data to drive their strategic decisions. Business intelligence (BI) tools help organizations to transform data into actionable insights. With the use of BI tools, users can analyze data and then present their findings in reports or dashboards. These reports or dashboards can then be consumed by business users who are interested in getting a picture of the state of the business.
In 2015, AWS launched Amazon QuickSight, a cloud-native BI tool. Since then, AWS has added new features to QuickSight, enriching the standard dashboard functionality with machine learning capabilities and offering embedded dashboard functionality. Amazon QuickSight is the main technology we will be covering in this book. Over the next few chapters, we will start with the basic functionality of Amazon QuickSight, and then we will explore more advanced features. Where possible, we will use practical examples that can be repeated in your own development environment, to give you hands-on experience with Amazon QuickSight.
Data warehouses are repositories of data; they are important components of the BI process. Data stored in data warehouses is typically structured. Traditionally, data is ingested and centralized into data warehouses from different operational data stores. Data warehouses are optimized to run analytical queries over large amounts of data. The results of analytical queries are usually calculated after an aggregation over multiple rows from one or more tables. BI applications use analytical queries to aggregate data and visualize it. It is a common architectural approach to use a data warehouse to serve data to a BI application.
Back in 2012, AWS launched Amazon Redshift, a cloud-native, fully managed data warehouse service. Today, Redshift is one of the most popular cloud data warehouses with thousands of organizations from different verticals using it to analyze their data. Other popular cloud data warehouses include Snowflake and Google BigQuery. Amazon Redshift integrates with most BI tools and it integrates natively with Amazon QuickSight. We will discuss this topic in more detail in Chapter 3, Preparing Data with Amazon QuickSight, when we look more closely into Amazon QuickSight-supported data sources.
A data lake is a repository of data where organizations can easily centralize all of their data and apply it in different use cases such as reporting, visualization, big data analytics, and predictive analytics. Data stored in data lakes can be structured or semi-structured. Usually, data is ingested into the data lake in its raw format, and is then transformed and stored back into the data lake for further processing and analysis. A cloud data lake typically uses a cloud object store to store data. AWS introduced Amazon Simple Storage Service (S3) in March 2006, offering developers a highly scalable, reliable, and low-latency data storage infrastructure at very low cost. Amazon S3 can store an unlimited amount of data, a particularly useful feature for data lakes. Organizations have one less thing to worry about because they don't need to think about scaling their storage as the amount of data stored grows.
While scaling data lake storage is something that organizations and CIOs don't need to worry about much anymore, data lake governance needs to be considered carefully. Data lakes do not enforce data schemas or data formats and, without any governance, data lakes can degrade into unusable data repositories, often referred to as data swamps. AWS offers a number of services for data governance.
The AWS Glue Catalog is part of the AWS Glue service. It is a fully managed Apache Hive metastore-compatible data catalog. Big data applications (for example, Apache Spark, Apache Hive, Presto, and so on) use the metadata in the catalog to locate and parse data. The AWS Glue Catalog is a technical metadata repository and can catalog data in Amazon S3, and a number of relational or non-relational data stores including Redshift, Aurora, and DynamoDB, among others.
AWS Lake Formation runs on top of AWS Glue and Amazon S3 and provides a governance layer and access layer for data lakes on Amazon S3. It also provides a set of reusable ETL jobs, called blueprints, that can be used to perform common ETL tasks (for example, loading data from a relational data store into an S3 data lake). Lake Formation allows users to manage access permissions, using a familiar GRANT REVOKE syntax that you might have seen in relational database management systems (RDBMSes).
Amazon Macie is an AWS service for data protection. It provides an inventory of Amazon S3 buckets and it uses machine learning to identify and alert its users about sensitive data, such as personally identifiable information (PII).
Finally, and perhaps most importantly, AWS Identity and Access Management (IAM) is a fundamental AWS service that allows users to assign permissions to principals (for example, users, groups, or roles) and explicitly allow or deny access to AWS resources including data lake locations or tables in the data catalog.
Ad hoc analytics refers to getting answers from the data on an as-needed basis. Contrary with what happens with scheduled reports, ad hoc querying is initiated by a user when they need to get specific answers from their data. The user typically uses SQL via a workbench type of application or other analytics frameworks (for instance, Apache Spark) using notebook environments or other BI applications. AWS has a number of analytics services that can be used for ad hoc analytics.
Amazon Redshift can be used for ad hoc analysis of data. For ad hoc querying, users will typically connect to Amazon Redshift using a query editor application with the Redshift JDBC/ODBC drivers. Notebook integrations or BI tool integrations are also possible for ad hoc analysis. AWS offers a number of managed notebook environments such as EMR notebooks and SageMaker notebooks. Amazon Redshift also allows its users to query data that is stored outside the data warehouse. Amazon Redshift Spectrum allows Redshift users to query data stored in Amazon S3, eliminating the need to load the data first before querying. Redshift's federated querying capability allows users to query live data in operational data stores such as PostgreSQL and MySQL.
For big data and data lakes, Presto is a popular choice for ad hoc analysis. Presto provides a high-performance parallel SQL query engine. Amazon Athena lets users run Presto queries in a scalable serverless environment. Amazon QuickSight natively supports Amazon Athena. We will talk more about this native integration in Chapter 3, Preparing Data with Amazon QuickSight. Amazon EMR is a fully managed Hadoop cluster, and it comes with a range of applications from the open source big data ecosystem. Presto has two community projects, PrestoSQL and PrestoDB, both of which are part of the Amazon EMR service. Other options included with EMR are Hive on EMR and Spark on EMR.
ETL is a term used to describe a set of processes to extract, transform, and load data usually for analytical purposes. Organizations gather data from different data sources and centralize them in a central data repository. Data from different sources typically has different schemas and different conventions and standards, and therefore it can be challenging to combine them to get the required answers. For that reason, data needs to transformed so that it can work together. For example, cleaning the data, applying certain data quality thresholds, and standardizing to a specific standard (for instance, date and time formats used) are all important tasks to ensure the data is useable. A visual representation of the ETL process is shown in the following figure.
AWS Glue is a fully managed ETL service offered by AWS. When it was first introduced in 2017, Glue ETL offered an Apache Spark environment optimized for ETL. Now, Glue ETL offers a wider range of options:
Amazon EMR transient clusters, with applications such as Spark or Hive, can be leveraged for ETL workloads. ETL workloads can be bulk or streaming: streaming ETL workloads usually need to be up and running constantly, or at least for as long as the source stream is on; batch ETL workloads don't need to run at all times and they can stop once the data is loaded into the target system. This type of workload fits nicely with the flexibility of the cloud. With the cloud, data architects don't need to think of Hadoop clusters as big monolithic clusters. Instead, users prefer purpose-built transient clusters, optimized and sized to handle specific workloads and data loads.
Now that we've had our overview of the AWS analytics ecosystem, let's learn about modern data architecture and how they are built.
The modern data architecture is a modern data analytics architecture: as the name suggests, it combines the data lake and the data warehouse into a seamless system. This approach extends the traditional data warehouse approach and opens up new possibilities for data analytics. For this reason, it is important to understand this architecture, which can be used as a data backend for Amazon QuickSight or other BI applications. To understand the architecture better, let's first start by understanding the differences between a data lake and data warehouse.
Data lakes and data warehouses are designed to consume large amounts of data for analytics purposes. Data warehouses are traditional database systems, used by organizations and enterprises for years. Data lakes, on the other side, are relatively young implementations that emerged from the big data and Hadoop ecosystems. Tables stored in data warehouses need to have clearly defined schemas. The schema needs to be defined upfront, before any data is added. This approach is called schema on write, and it ensures that data conforms to a specific structure before being ingested into the data warehouse. However, it can be less flexible, and it may introduce complexity when dealing with evolving schemas. Evolving schemas are an increasingly common scenario because organizations need to capture more and more data points from their customer interactions to drive data-driven decisions.
On the other side, data lakes don't enforce a schema upfront. Instead, applications that have the required permissions can write data to a data lake. Structure and data formats aren't enforced by the data lake: it is a responsibility of the writing application.
Data stored in a data lake has few to no limitations regarding its format: it can be structured, semi-structured, or completely unstructured. For many datasets, a schema can be inferred, either because the data is semi-structured (CSV, JSON, and others), or they follow patterns that can be identified after applying regular expressions and extracting specific columns. In data lakes, the schema is inferred when the data is read by the querying application. This approach is called schema on read, and it gives an organization flexibility regarding the data type stored. However, it also introduces challenges with data complexity and enforcing data quality.
For that reason, it is common that data that lands into the data lake goes through a series of transformations to get to a stage where it is useable. The first stage, often referred to as the raw layer, is where the data first lands, and it is stored as is.
After the data has landed, the first series of transformations is applied and the data is stored at the processed layer. Since the data can be of any format, the types of possible transformations are limitless. To give just some examples, data quality functions can be applied at this stage to remove incomplete rows and standardize the data in line with a specific datetime or time zone format. Other data engineering activities can also be performed at this stage, such as converting data into different file data formats optimized for analytics, or organizing them into folders using specific information (usually temporal) that can be later used as a partition column by the querying application.
Finally, data can then be converted for specific use cases and landed into the target layer. As an example, data can be transformed in a way that is relevant for a specific machine learning algorithm to work with the data. Another use case could be BI applications, such as Amazon QuickSight, where data can be pre-joined or aggregated and therefore reduced from a large dataset into a smaller dataset that is easier to visualize. Additional data engineering can be applied at this stage to optimize for performance.
The data warehouse and data lake architectures are now being challenged by a new, hybrid type of storage: the modern data architecture.
This section will look more closely at an example modern data architecture on AWS using AWS managed services. Let's start by defining the key components of the modern data architecture:
Note
AWS Glue Catalog tables can be stored in Amazon Redshift, providing a unified metadata catalog across both the data warehouse and the S3 data lake.
Amazon Redshift supports functionality that allows it to interact with the data warehouse. Let's look at those features in more detail.
Redshift Spectrum is a feature of Redshift that allows you to perform SQL queries against data in the S3 data lake. The queries are triggered directly from the data warehouse, and therefore you don't need to connect to a different environment to submit your queries. You need to define the Spectrum tables as external tables on their data warehouse. The Redshift cluster also needs to have permission to access the data lake S3 location(s). The Redshift cluster will need to be assigned an IAM role, which needs to have access to the desired S3 locations.
Another key characteristic of Redshift Spectrum is that the Spectrum queries are running in the Spectrum nodes that are outside of the Redshift cluster. This effectively extends the Redshift cluster with additional compute capacity when data lake data needs to be queried.
Finally, Spectrum tables and Redshift tables can be combined and joined. Without this feature, you would have to move data and collocate it before joining it.
Redshift can efficiently load data from the S3 data lake. Specifically, Redshift's COPY command can load data in parallel from Amazon S3. You (at a minimum) need to define a table name, the data location (commonly S3), and the authorization to access the data in the source location. When loading multiple files from S3, Redshift parallelizes the loading by allocating each file to a Redshift slice (the unit of processing in Redshift).
Redshift also comes with the ability to unload data from the data warehouse back to the data lake. Specifically, the UNLOAD command unloads the result of the query onto Amazon S3. You (as a minimum) need to specify the S3 location and the authorization. There are more options, such as defining the file format (using the FORMAT AS option) or applying partitioning (using the PARTITION BY option), and others.
In the following diagram, we see an example data pipeline that is using both a data warehouse and a data lake on AWS. Data is loaded from the operational data stores into the Amazon S3 object store in the raw layer of the data lake. Then, with a set of ETL jobs, the data reaches a stage that can be loaded into the data warehouse for BI purposes. For cost-effectiveness, you might not want to load all the data into the warehouse. Instead, you might want to leave the data in the data lake but have the ability to query the data when needed. This architecture considers the temperature of the data (how frequently the data is accessed) to determine the best storage. Hot data that needs to be accessed frequently is loaded into the data warehouse, while colder data remains in the data lake, a cheaper long-term storage option.
Now that we have had an overview of the modern data architecture on AWS, let's build a basic modern data architecture on AWS.
In this section, we will go through a hands-on example to create a basic modern data architecture. This tutorial will use the AWS CLI and the AWS console. By the end of this section, we will have spun up a working data lake and a data warehouse environment with demo data loaded.
Important note
The resources for this tutorial might introduce charges to your AWS account. Once you finish with the exercise, make sure you clean up the resources to prevent incurring further charges.
In this step, we will add the data lake storage. Then we will upload a demo dataset and will discover its schema automatically.
Let's begin:
We will use the random string in the data lake name, to ensure it is globally unique.
Let's use my-data-lake-<random string> as the bucket name.
% aws s3api create-bucket --bucket data-lake-xxxxxxxxxx --region us-east-1
And the response should look like this:
{
"Location": "/data-lake-xxxxxxxxxx"
}
Now let's add some data. For this exercise, we will use a subset of the New York City Taxi and Limousine Commission (TLC) Trip Record Data:
This command will return all the files in the open S3 location:
2016-08-11 15:32:21 85733063 fhv_tripdata_2015-01.csv
2016-08-11 15:33:04 97863482 fhv_tripdata_2015-02.csv
2016-08-11 15:33:40 102220197 fhv_tripdata_2015-03.csv
…
2021-02-26 16:54:00 138989555 yellow_tripdata_2020-11.csv
2021-02-26 16:54:00 134481400 yellow_tripdata_2020-12.csv
We don't need to download all of them. For this tutorial, we will copy only the files for 2020.
% aws s3 cp "s3://nyc-tlc/trip data/" s3://data-lake-xxxxxxxxxx/yellowtrips/ --recursive --exclude "*" --include "yellow_tripdata_2020*"
% aws s3 ls s3://data-lake-xxxxxxxxxx/yellowtrips/
2021-03-27 16:53:41 593610736 yellow_tripdata_2020-01.csv
2021-03-27 16:53:41 584190585 yellow_tripdata_2020-02.csv
2021-03-27 16:53:42 278288608 yellow_tripdata_2020-03.csv
2021-03-27 16:53:41 21662261 yellow_tripdata_2020-04.csv
2021-03-27 16:53:43 31641590 yellow_tripdata_2020-05.csv
2021-03-27 16:53:42 50277193 yellow_tripdata_2020-06.csv
2021-03-27 16:53:44 73326707 yellow_tripdata_2020-07.csv
2021-03-27 16:53:46 92411545 yellow_tripdata_2020-08.csv
2021-03-27 16:53:50 123394595 yellow_tripdata_2020-09.csv
2021-03-27 16:53:54 154917592 yellow_tripdata_2020-10.csv
2021-03-27 16:53:57 138989555 yellow_tripdata_2020-11.csv
2021-03-27 16:53:58 134481400 yellow_tripdata_2020-12.csv
Note
You can use data in a shared data lake as part of your data lake without the need to actually copy it across to your data lake.
The next step is to identify the schema of the dataset. For this purpose, we will use the AWS Glue crawlers. AWS Glue crawlers crawl through the data to detect the schema. If a schema can be determined (remember there is no guarantee that the data has a specific schema) then Glue crawlers will populate the Glue Catalog with the schemas identified after crawling the data. Glue tables always belong to a Glue database. A database in Glue is just a logical repository of tables in the Glue Catalog:
% aws glue create-database --database-input "{"Name":"my-data-lake-db"}" --region us-east-1
% aws glue get-databases --region us-east-1
{
"DatabaseList": [
{
"Name": "default",
"CreateTime": 1553517157.0
},
{
"Name": "my-data-lake-db",
"CreateTime": 1616865129.0
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::data-lake-xxxxxxxxxx",
"arn:aws:s3:::data-lake-xxxxxxxxxx/*"
]
}
]
}
The preceding policy document allows the policy holder to use the S3 ListBucket and the GetObject API. The crawler will use ListBucket to list the objects in our data lake bucket and getObject to read objects as it crawls data. This policy restricts access to the data lake bucket only.
% vim policy
% aws iam create-policy --policy-name DataLakeReadAccess --policy-document file://policy
The preceding command created the policy and we should get a confirmation JSON object back. Note the policy ARN, as we will use it in a later step.
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Effect": "Allow",
"Principal": {
"Service": "glue.amazonaws.com"
}
}
]
}
% vim role-policy
This role policy document allows the Glue service to assume the role we will create.
% aws iam create-role --role-name GlueCrawlerRole --assume-role-policy-document file://role-policy
We should get a confirmation JSON message after running the command.
% aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
arn:aws:iam::<accountid>:policy/DataLakeReadAccess
And the command should look like the following:
% aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/DataLakeReadAccess
% aws glue create-crawler --name qs-book-crawler --role arn:aws:iam::xxxxxxxxxxxx:role/GlueCrawlerRole --database-name my-data-lake-db --targets "{"S3Targets":[{"Path":"s3://data-lake-xxxxxxxxxx/yellowtrips"}]}" --region us-east-1
% aws glue start-crawler --name qs-book-crawler --region us-east-1
After 1-2 minutes, the Glue crawler should populate the database.
% aws glue get-tables --database-name my-data-lake-db
Tip
You can click the checkbox to select the table and then, under Action, you can choose Preview Data. This will open the Amazon Athena console and run an Athena query that returns 10 values from the table.
Let's create our data warehouse next.
To create the data warehouse, we will use the redshift create-cluster CLI command, or you can use the AWS Console:
%aws redshift create-cluster --node-type dc2.large --number-of-nodes 2 --master-username admin --master-user-password R3dsh1ft --cluster-identifier mycluster --region us-east-1
This command should give a response with the cluster metadata. After a few minutes, our cluster will be up and running.
Note
This command will create a Redshift cluster with a public IP address. This is something that should be avoided in real-world scenarios. The instructions provided are oversimplified for the purposes of this tutorial as this book is not focused on Amazon Redshift.
First, let's create an IAM role that we will assign to the Redshift cluster. We will use this role when using the Redshift Spectrum feature to query data in S3. We want the cluster to be able to write and read to our S3 location. We also want the cluster to be able to have read access to the Glue Catalog:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Effect": "Allow",
"Principal": {
"Service": "redshift.amazonaws.com"
}
}
]
}
% vim role-policy-redshift
% aws iam create-role --role-name RedshiftSpectrumRole --assume-role-policy-document file://role-policy-redshift
Note the role ARN, as we will use it later to attach it to the cluster.
% aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
% aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
% aws redshift modify-cluster-iam-roles --cluster-identifier mycluster --add-iam-roles arn:aws:iam::<ACCOUNT-ID>:role/RedshiftSpectrumRole --region us-east-1
Note
We didn't set a database name earlier. Redshift uses dev as the default value.
create table yellowtrips_3mo
(vendorid varchar(10),
tpep_pickup_datetime datetime,
tpep_dropoff_datetime datetime,
passenger_count int,
trip_distance float,
ratecodeid varchar(10),
store_and_fwd_flag char(1),
pulocationid varchar(10),
dolocationid varchar(10),
payment_type varchar(10),
fare_amount float,
extra float,
mta_tax float,
tip_amount float,
tolls_amount float,
improvement_surcharge float,
total_amount float,
congestion_surcharge float);
copy yellowtrips_3mo from
's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-10.csv'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
copy yellowtrips_3mo from
's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-11.csv'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
copy yellowtrips_3mo from
's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-12.csv'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
create external schema spectrum_schema from data catalog
database 'my-data-lake-db'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSpectrumRole'
create external database if not exists;
select count(*) from public.yellowtrips_3mo;
select count(*) from spectrum_schema.yellowtrips;
The first query will run against the recent data in the data warehouse. The second will run against the first query using the Spectrum nodes using the data in the data lake. As expected, the number of records in the data lake should be much higher than the number of records in the data warehouse. Specifically, the query result was 24,648,499 for the year 2020 and 4,652,013 records for the last 3 months of 2020.
Note
The Spectrum queries use the Spectrum nodes and are charged separately from the Redshift cluster. Every query incurs an added cost based on the data it needs to scan. Refer to the AWS pricing for details.
Feel free to experiment with the data and trigger a few queries to understand the dataset. When you finish with the Redshift cluster, you can pause the cluster so that you stop the on-demand billing. Once the cluster is paused you will only pay for the cluster storage.
Congratulations, you have reached the end of the first chapter! By now, you should have a good understanding of the AWS analytics ecosystem and its data lake and data warehousing options. In this chapter, we discussed in detail the key differences between data warehouses and data lakes. We also discussed the modern data architecture on AWS, and we looked at its main components in more detail. Finally, during the step-by-step section in this chapter, you had a chance to create a data lake and a data warehouse from scratch, and you loaded an open dataset for further analysis later on. We also defined Spectrum tables and queried the data lake directly for the data warehouse.
In the next chapter, we will discuss the basic concepts of Amazon QuickSight, understand its main benefits, and learn how to set up a QuickSight account.
3.138.137.127