Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 1: Introducing the AWS Analytics Ecosystem

As data increases in both volume and variety, organizations from all verticals are adopting cloud analytics services for their data analytics. AWS offers a number of analytics services covering data lakes, data warehousing, big data processing, extract, transform, load (ETL), and data visualization. In this chapter, we will introduce the AWS analytics ecosystem. Some of the services we discuss here will be mentioned again later in the book.

First, we will map the AWS services into categories. Then, we will discuss how Amazon QuickSight fits into the wider AWS analytics ecosystem. We will look more closely at a modern modern data architecture and we will discuss its benefits and its components. Finally, we will provide a step-by-step guide to set up a modern data architecture on AWS and load and query a demo data sample. Some of this information may already be familiar to you, but let's go back over the basics.

In this chapter, we will cover the following topics:

Discovering the AWS analytics ecosystem
Exploring the modern data architecture on AWS
Creating a basic modern data architecture

Technical requirements

To follow along with this chapter, you will need the following pre-requisites:

An AWS account with console access
AWS CLI access

The code sample for this chapter can be accessed on the GitHub repository for this book at https://github.com/PacktPublishing/Actionable-Insights-with-Amazon-QuickSight/tree/main/chapter_1.

Discovering the AWS analytics ecosystem

AWS provides a large number of analytics services. In addition to that, AWS has a number of partners who specialize in data analytics and offer analytics solutions that run on the AWS infrastructure. Partner solutions are not in the scope of this section, however. This section focuses on the AWS fully managed analytics services. In order to list the services, we will first define the specific categories related to analytics functions. Machine learning and predictive analytics are also out of the scope of this chapter. For every service category, we will then list the AWS services available, and for each service, we will provide a high-level description. Figure 1.1 depicts the commonly used AWS analytics services.

Figure 1.1 – AWS analytics services

Business intelligence

More and more organizations aspire to be data-driven and use data to drive their strategic decisions. Business intelligence (BI) tools help organizations to transform data into actionable insights. With the use of BI tools, users can analyze data and then present their findings in reports or dashboards. These reports or dashboards can then be consumed by business users who are interested in getting a picture of the state of the business.

In 2015, AWS launched Amazon QuickSight, a cloud-native BI tool. Since then, AWS has added new features to QuickSight, enriching the standard dashboard functionality with machine learning capabilities and offering embedded dashboard functionality. Amazon QuickSight is the main technology we will be covering in this book. Over the next few chapters, we will start with the basic functionality of Amazon QuickSight, and then we will explore more advanced features. Where possible, we will use practical examples that can be repeated in your own development environment, to give you hands-on experience with Amazon QuickSight.

Data warehousing

Data warehouses are repositories of data; they are important components of the BI process. Data stored in data warehouses is typically structured. Traditionally, data is ingested and centralized into data warehouses from different operational data stores. Data warehouses are optimized to run analytical queries over large amounts of data. The results of analytical queries are usually calculated after an aggregation over multiple rows from one or more tables. BI applications use analytical queries to aggregate data and visualize it. It is a common architectural approach to use a data warehouse to serve data to a BI application.

Back in 2012, AWS launched Amazon Redshift, a cloud-native, fully managed data warehouse service. Today, Redshift is one of the most popular cloud data warehouses with thousands of organizations from different verticals using it to analyze their data. Other popular cloud data warehouses include Snowflake and Google BigQuery. Amazon Redshift integrates with most BI tools and it integrates natively with Amazon QuickSight. We will discuss this topic in more detail in Chapter 3, Preparing Data with Amazon QuickSight, when we look more closely into Amazon QuickSight-supported data sources.

Data lake storage and governance

A data lake is a repository of data where organizations can easily centralize all of their data and apply it in different use cases such as reporting, visualization, big data analytics, and predictive analytics. Data stored in data lakes can be structured or semi-structured. Usually, data is ingested into the data lake in its raw format, and is then transformed and stored back into the data lake for further processing and analysis. A cloud data lake typically uses a cloud object store to store data. AWS introduced Amazon Simple Storage Service (S3) in March 2006, offering developers a highly scalable, reliable, and low-latency data storage infrastructure at very low cost. Amazon S3 can store an unlimited amount of data, a particularly useful feature for data lakes. Organizations have one less thing to worry about because they don't need to think about scaling their storage as the amount of data stored grows.

While scaling data lake storage is something that organizations and CIOs don't need to worry about much anymore, data lake governance needs to be considered carefully. Data lakes do not enforce data schemas or data formats and, without any governance, data lakes can degrade into unusable data repositories, often referred to as data swamps. AWS offers a number of services for data governance.

The AWS Glue Catalog is part of the AWS Glue service. It is a fully managed Apache Hive metastore-compatible data catalog. Big data applications (for example, Apache Spark, Apache Hive, Presto, and so on) use the metadata in the catalog to locate and parse data. The AWS Glue Catalog is a technical metadata repository and can catalog data in Amazon S3, and a number of relational or non-relational data stores including Redshift, Aurora, and DynamoDB, among others.

AWS Lake Formation runs on top of AWS Glue and Amazon S3 and provides a governance layer and access layer for data lakes on Amazon S3. It also provides a set of reusable ETL jobs, called blueprints, that can be used to perform common ETL tasks (for example, loading data from a relational data store into an S3 data lake). Lake Formation allows users to manage access permissions, using a familiar GRANT REVOKE syntax that you might have seen in relational database management systems (RDBMSes).

Amazon Macie is an AWS service for data protection. It provides an inventory of Amazon S3 buckets and it uses machine learning to identify and alert its users about sensitive data, such as personally identifiable information (PII).

Finally, and perhaps most importantly, AWS Identity and Access Management (IAM) is a fundamental AWS service that allows users to assign permissions to principals (for example, users, groups, or roles) and explicitly allow or deny access to AWS resources including data lake locations or tables in the data catalog.

Ad hoc analytics

Ad hoc analytics refers to getting answers from the data on an as-needed basis. Contrary with what happens with scheduled reports, ad hoc querying is initiated by a user when they need to get specific answers from their data. The user typically uses SQL via a workbench type of application or other analytics frameworks (for instance, Apache Spark) using notebook environments or other BI applications. AWS has a number of analytics services that can be used for ad hoc analytics.

Amazon Redshift can be used for ad hoc analysis of data. For ad hoc querying, users will typically connect to Amazon Redshift using a query editor application with the Redshift JDBC/ODBC drivers. Notebook integrations or BI tool integrations are also possible for ad hoc analysis. AWS offers a number of managed notebook environments such as EMR notebooks and SageMaker notebooks. Amazon Redshift also allows its users to query data that is stored outside the data warehouse. Amazon Redshift Spectrum allows Redshift users to query data stored in Amazon S3, eliminating the need to load the data first before querying. Redshift's federated querying capability allows users to query live data in operational data stores such as PostgreSQL and MySQL.

For big data and data lakes, Presto is a popular choice for ad hoc analysis. Presto provides a high-performance parallel SQL query engine. Amazon Athena lets users run Presto queries in a scalable serverless environment. Amazon QuickSight natively supports Amazon Athena. We will talk more about this native integration in Chapter 3, Preparing Data with Amazon QuickSight. Amazon EMR is a fully managed Hadoop cluster, and it comes with a range of applications from the open source big data ecosystem. Presto has two community projects, PrestoSQL and PrestoDB, both of which are part of the Amazon EMR service. Other options included with EMR are Hive on EMR and Spark on EMR.

Extract, transform, load

ETL is a term used to describe a set of processes to extract, transform, and load data usually for analytical purposes. Organizations gather data from different data sources and centralize them in a central data repository. Data from different sources typically has different schemas and different conventions and standards, and therefore it can be challenging to combine them to get the required answers. For that reason, data needs to transformed so that it can work together. For example, cleaning the data, applying certain data quality thresholds, and standardizing to a specific standard (for instance, date and time formats used) are all important tasks to ensure the data is useable. A visual representation of the ETL process is shown in the following figure.

Figure 1.2 – The ETL process

AWS Glue is a fully managed ETL service offered by AWS. When it was first introduced in 2017, Glue ETL offered an Apache Spark environment optimized for ETL. Now, Glue ETL offers a wider range of options:

PySpark – Apache Spark using Python
Spark with Scala – Apache Spark with Scala
Python shell – For smaller ETL jobs that don't need a Spark cluster
Glue Studio and Glue Databrew – Visual approach to ETL without the need to write code

Amazon EMR transient clusters, with applications such as Spark or Hive, can be leveraged for ETL workloads. ETL workloads can be bulk or streaming: streaming ETL workloads usually need to be up and running constantly, or at least for as long as the source stream is on; batch ETL workloads don't need to run at all times and they can stop once the data is loaded into the target system. This type of workload fits nicely with the flexibility of the cloud. With the cloud, data architects don't need to think of Hadoop clusters as big monolithic clusters. Instead, users prefer purpose-built transient clusters, optimized and sized to handle specific workloads and data loads.

Now that we've had our overview of the AWS analytics ecosystem, let's learn about modern data architecture and how they are built.

Exploring the modern data architecture on AWS

The modern data architecture is a modern data analytics architecture: as the name suggests, it combines the data lake and the data warehouse into a seamless system. This approach extends the traditional data warehouse approach and opens up new possibilities for data analytics. For this reason, it is important to understand this architecture, which can be used as a data backend for Amazon QuickSight or other BI applications. To understand the architecture better, let's first start by understanding the differences between a data lake and data warehouse.

Data lakes versus data warehouses

Data lakes and data warehouses are designed to consume large amounts of data for analytics purposes. Data warehouses are traditional database systems, used by organizations and enterprises for years. Data lakes, on the other side, are relatively young implementations that emerged from the big data and Hadoop ecosystems. Tables stored in data warehouses need to have clearly defined schemas. The schema needs to be defined upfront, before any data is added. This approach is called schema on write, and it ensures that data conforms to a specific structure before being ingested into the data warehouse. However, it can be less flexible, and it may introduce complexity when dealing with evolving schemas. Evolving schemas are an increasingly common scenario because organizations need to capture more and more data points from their customer interactions to drive data-driven decisions.

On the other side, data lakes don't enforce a schema upfront. Instead, applications that have the required permissions can write data to a data lake. Structure and data formats aren't enforced by the data lake: it is a responsibility of the writing application.

Data stored in a data lake has few to no limitations regarding its format: it can be structured, semi-structured, or completely unstructured. For many datasets, a schema can be inferred, either because the data is semi-structured (CSV, JSON, and others), or they follow patterns that can be identified after applying regular expressions and extracting specific columns. In data lakes, the schema is inferred when the data is read by the querying application. This approach is called schema on read, and it gives an organization flexibility regarding the data type stored. However, it also introduces challenges with data complexity and enforcing data quality.

For that reason, it is common that data that lands into the data lake goes through a series of transformations to get to a stage where it is useable. The first stage, often referred to as the raw layer, is where the data first lands, and it is stored as is.

After the data has landed, the first series of transformations is applied and the data is stored at the processed layer. Since the data can be of any format, the types of possible transformations are limitless. To give just some examples, data quality functions can be applied at this stage to remove incomplete rows and standardize the data in line with a specific datetime or time zone format. Other data engineering activities can also be performed at this stage, such as converting data into different file data formats optimized for analytics, or organizing them into folders using specific information (usually temporal) that can be later used as a partition column by the querying application.

Finally, data can then be converted for specific use cases and landed into the target layer. As an example, data can be transformed in a way that is relevant for a specific machine learning algorithm to work with the data. Another use case could be BI applications, such as Amazon QuickSight, where data can be pre-joined or aggregated and therefore reduced from a large dataset into a smaller dataset that is easier to visualize. Additional data engineering can be applied at this stage to optimize for performance.

Figure 1.3 – Data lake layers

The data warehouse and data lake architectures are now being challenged by a new, hybrid type of storage: the modern data architecture.

modern data architecture on AWS

This section will look more closely at an example modern data architecture on AWS using AWS managed services. Let's start by defining the key components of the modern data architecture:

Amazon Redshift is the data warehouse service.
Amazon S3 is the object store that can be used for cloud data lake storage.
AWS Glue is the data lake catalog to store technical metadata.
Note
AWS Glue Catalog tables can be stored in Amazon Redshift, providing a unified metadata catalog across both the data warehouse and the S3 data lake.

Amazon Redshift supports functionality that allows it to interact with the data warehouse. Let's look at those features in more detail.

Ability to query the data lake from the data warehouse

Redshift Spectrum is a feature of Redshift that allows you to perform SQL queries against data in the S3 data lake. The queries are triggered directly from the data warehouse, and therefore you don't need to connect to a different environment to submit your queries. You need to define the Spectrum tables as external tables on their data warehouse. The Redshift cluster also needs to have permission to access the data lake S3 location(s). The Redshift cluster will need to be assigned an IAM role, which needs to have access to the desired S3 locations.

Another key characteristic of Redshift Spectrum is that the Spectrum queries are running in the Spectrum nodes that are outside of the Redshift cluster. This effectively extends the Redshift cluster with additional compute capacity when data lake data needs to be queried.

Finally, Spectrum tables and Redshift tables can be combined and joined. Without this feature, you would have to move data and collocate it before joining it.

Ability to load data from the data lake

Redshift can efficiently load data from the S3 data lake. Specifically, Redshift's COPY command can load data in parallel from Amazon S3. You (at a minimum) need to define a table name, the data location (commonly S3), and the authorization to access the data in the source location. When loading multiple files from S3, Redshift parallelizes the loading by allocating each file to a Redshift slice (the unit of processing in Redshift).

Ability to unload data to the data lake

Redshift also comes with the ability to unload data from the data warehouse back to the data lake. Specifically, the UNLOAD command unloads the result of the query onto Amazon S3. You (as a minimum) need to specify the S3 location and the authorization. There are more options, such as defining the file format (using the FORMAT AS option) or applying partitioning (using the PARTITION BY option), and others.

In the following diagram, we see an example data pipeline that is using both a data warehouse and a data lake on AWS. Data is loaded from the operational data stores into the Amazon S3 object store in the raw layer of the data lake. Then, with a set of ETL jobs, the data reaches a stage that can be loaded into the data warehouse for BI purposes. For cost-effectiveness, you might not want to load all the data into the warehouse. Instead, you might want to leave the data in the data lake but have the ability to query the data when needed. This architecture considers the temperature of the data (how frequently the data is accessed) to determine the best storage. Hot data that needs to be accessed frequently is loaded into the data warehouse, while colder data remains in the data lake, a cheaper long-term storage option.

Figure 1.4 – Example data pipeline on AWS

Now that we have had an overview of the modern data architecture on AWS, let's build a basic modern data architecture on AWS.

Creating a basic modern data architecture

In this section, we will go through a hands-on example to create a basic modern data architecture. This tutorial will use the AWS CLI and the AWS console. By the end of this section, we will have spun up a working data lake and a data warehouse environment with demo data loaded.

Important note

The resources for this tutorial might introduce charges to your AWS account. Once you finish with the exercise, make sure you clean up the resources to prevent incurring further charges.

Creating the data lake storage

In this step, we will add the data lake storage. Then we will upload a demo dataset and will discover its schema automatically.

Step 1 – creating the S3 bucket

Let's begin:

If you haven't installed it already, follow the AWS documentation to install and configure the AWS CLI. To complete this tutorial, you will need to use a role that has access to the S3, Glue, Redshift, and IAM services: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html.
First, let's create the S3 bucket as the data lake storage. Your S3 bucket needs to have a globally unique name. For that reason, we should introduce some randomness to it. Let's pick a random set of 10 characters and numbers. For this tutorial, you should choose your own random set of characters for this string; for example, SF482XHS7M.

We will use the random string in the data lake name, to ensure it is globally unique.

Let's use my-data-lake-<random string> as the bucket name.

To create a bucket, we can type the following command into the CLI. Replace the following random string with your random string:
% aws s3api create-bucket --bucket data-lake-xxxxxxxxxx --region us-east-1

And the response should look like this:

{

"Location": "/data-lake-xxxxxxxxxx"

}

Step 2 – adding data into the data lake

Now let's add some data. For this exercise, we will use a subset of the New York City Taxi and Limousine Commission (TLC) Trip Record Data:

Let's first have a look at the dataset:
% aws s3 ls 's3://nyc-tlc/trip data/' --no-sign-request

This command will return all the files in the open S3 location:

2016-08-11 15:32:21 85733063 fhv_tripdata_2015-01.csv

2016-08-11 15:33:04 97863482 fhv_tripdata_2015-02.csv

2016-08-11 15:33:40 102220197 fhv_tripdata_2015-03.csv

…

2021-02-26 16:54:00 138989555 yellow_tripdata_2020-11.csv

2021-02-26 16:54:00 134481400 yellow_tripdata_2020-12.csv

We don't need to download all of them. For this tutorial, we will copy only the files for 2020.

We can use the S3 CLI exclude and include parameters to apply a pattern to match the desired filenames. The command to copy is as follows:
% aws s3 cp "s3://nyc-tlc/trip data/" s3://data-lake-xxxxxxxxxx/yellowtrips/ --recursive --exclude "*" --include "yellow_tripdata_2020*"
Once completed, we can then verify that the files exist in our environment with the aws s3 ls command, which lists the files under a specific S3 location:
% aws s3 ls s3://data-lake-xxxxxxxxxx/yellowtrips/
2021-03-27 16:53:41  593610736 yellow_tripdata_2020-01.csv
2021-03-27 16:53:41  584190585 yellow_tripdata_2020-02.csv
2021-03-27 16:53:42  278288608 yellow_tripdata_2020-03.csv
2021-03-27 16:53:41   21662261 yellow_tripdata_2020-04.csv
2021-03-27 16:53:43   31641590 yellow_tripdata_2020-05.csv
2021-03-27 16:53:42   50277193 yellow_tripdata_2020-06.csv
2021-03-27 16:53:44   73326707 yellow_tripdata_2020-07.csv
2021-03-27 16:53:46   92411545 yellow_tripdata_2020-08.csv
2021-03-27 16:53:50  123394595 yellow_tripdata_2020-09.csv
2021-03-27 16:53:54  154917592 yellow_tripdata_2020-10.csv
2021-03-27 16:53:57  138989555 yellow_tripdata_2020-11.csv
2021-03-27 16:53:58  134481400 yellow_tripdata_2020-12.csv
Note
You can use data in a shared data lake as part of your data lake without the need to actually copy it across to your data lake.

Step 3 – identifying the schema

The next step is to identify the schema of the dataset. For this purpose, we will use the AWS Glue crawlers. AWS Glue crawlers crawl through the data to detect the schema. If a schema can be determined (remember there is no guarantee that the data has a specific schema) then Glue crawlers will populate the Glue Catalog with the schemas identified after crawling the data. Glue tables always belong to a Glue database. A database in Glue is just a logical repository of tables in the Glue Catalog:

Let's start by creating a database using the create-database command:
% aws glue create-database --database-input "{"Name":"my-data-lake-db"}" --region us-east-1
We can verify the successful database creation using the get-databases command:
% aws glue get-databases --region us-east-1
{
    "DatabaseList": [
        {
            "Name": "default",
            "CreateTime": 1553517157.0
        },

        {
            "Name": "my-data-lake-db",
            "CreateTime": 1616865129.0
        }
    ]
}
Before we create the Glue Crawler, we need to create an IAM role that will be assigned to the Crawler and allow it to access the data in the data lake. The crawler doesn't need to write to the data lake location, therefore only the read access permission is needed. To give the required permissions to a role, we need to attach policies that define the permissions. Let's define a policy document that allows read access to our data lake:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::data-lake-xxxxxxxxxx",
                "arn:aws:s3:::data-lake-xxxxxxxxxx/*"
            ]
        }
        ]
}

The preceding policy document allows the policy holder to use the S3 ListBucket and the GetObject API. The crawler will use ListBucket to list the objects in our data lake bucket and getObject to read objects as it crawls data. This policy restricts access to the data lake bucket only.

Now, let's create a file and copy the policy text. Replace the random string in the data lake name with the random string in your environment. I used vim, but you can use any text editor:
% vim policy
Then, let's create the IAM policy using the create-policy CLI command:
% aws iam create-policy --policy-name DataLakeReadAccess --policy-document file://policy

The preceding command created the policy and we should get a confirmation JSON object back. Note the policy ARN, as we will use it in a later step.

Next, let's create the IAM role that the Glue crawler will assume. First, let's define the role policy document:
{
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "sts:AssumeRole",
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "glue.amazonaws.com"
                    }
                }
            ]
        }
Then create a file called role-policy and copy in the preceding JSON document:
% vim role-policy

This role policy document allows the Glue service to assume the role we will create.

To create the role, we will use the iam create-role CLI command:
% aws iam create-role --role-name GlueCrawlerRole --assume-role-policy-document file://role-policy

We should get a confirmation JSON message after running the command.

Capture the role ARN, as it will be used later when defining the crawler.
Then, let's attach the required policies to this role. For this role, we want to allocate two policies: the AWSGlueServiceRole policy (this is managed by AWS) and the DataLakeReadAccess policy we created earlier. To attach policies to the IAM role we will use the iam attach-role-policy command. Let's start with the AWSGlueServiceRole policy:
% aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
Then we will attach the DataLakeReadAccess policy. We will need the policy ARN that we captured earlier. The policy ARN should look like the following line:
arn:aws:iam::<accountid>:policy/DataLakeReadAccess

And the command should look like the following:

% aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/DataLakeReadAccess

Now, let's create the AWS Glue crawler. For this purpose, we will use the glue create-crawler CLI command. Make sure you replace the role ARN and the data lake location with the values for your environment:
% aws glue create-crawler --name qs-book-crawler --role arn:aws:iam::xxxxxxxxxxxx:role/GlueCrawlerRole --database-name my-data-lake-db --targets "{"S3Targets":[{"Path":"s3://data-lake-xxxxxxxxxx/yellowtrips"}]}" --region us-east-1
Then, just start the crawler using the glue start-crawler command:
% aws glue start-crawler --name qs-book-crawler --region us-east-1

After 1-2 minutes, the Glue crawler should populate the database.

We can confirm this by calling the glue get-tables cli command:
% aws glue get-tables --database-name my-data-lake-db
You can view the Catalog from the AWS Console. Log in to the AWS Console and navigate to AWS Glue.
Then on the left-hand side menu, under Data Catalog, choose Databases and then find my-data-lake-db. Then click on View tables under my-data-lake-db. It should look like the following screenshot:

Figure 1.5 – Glue console

Tip

You can click the checkbox to select the table and then, under Action, you can choose Preview Data. This will open the Amazon Athena console and run an Athena query that returns 10 values from the table.

Step 4 – creating the data warehouse

Let's create our data warehouse next.

To create the data warehouse, we will use the redshift create-cluster CLI command, or you can use the AWS Console:

%aws redshift create-cluster --node-type dc2.large --number-of-nodes 2 --master-username admin --master-user-password R3dsh1ft --cluster-identifier mycluster --region us-east-1

This command should give a response with the cluster metadata. After a few minutes, our cluster will be up and running.

Note

This command will create a Redshift cluster with a public IP address. This is something that should be avoided in real-world scenarios. The instructions provided are oversimplified for the purposes of this tutorial as this book is not focused on Amazon Redshift.

Step 5 – loading the data into the data warehouse

First, let's create an IAM role that we will assign to the Redshift cluster. We will use this role when using the Redshift Spectrum feature to query data in S3. We want the cluster to be able to write and read to our S3 location. We also want the cluster to be able to have read access to the Glue Catalog:

Similarly to what we did earlier, we will create the following role policy document to allow the role to be assumed by the Redshift service:
{
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "sts:AssumeRole",
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "redshift.amazonaws.com"
                    }
                }
            ]
        }
Then copy this JSON object into a policy document:
% vim role-policy-redshift
Now, let's create the role using the iam create-role command:
% aws iam create-role --role-name RedshiftSpectrumRole --assume-role-policy-document file://role-policy-redshift

Note the role ARN, as we will use it later to attach it to the cluster.

Next, we need to give the desired permissions by attaching the correct policies. This time, for simplicity, we will just attach two AWS managed policies. These policies are overly permissive, and normally we would attach policies with narrower permissions, as we did for the Glue crawlers in Step 3. Let's attach AWSFullS3Access and AWSFullGlueAccess:
% aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
% aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
Next, we will attach this role to our cluster using the redshift modify-cluster-iam-roles CLI command. Note that you need to update the role ARN with the value from your environment:
% aws redshift modify-cluster-iam-roles --cluster-identifier mycluster --add-iam-roles arn:aws:iam::<ACCOUNT-ID>:role/RedshiftSpectrumRole --region us-east-1
The cluster change will take a few minutes to be applied. After the change is applied, the cluster will be ready to fetch data from the S3 data lake. To connect to the cluster, we will use the built-in query editor found in the AWS Management Console. To find the editor, navigate to the Redshift console, and see the left-hand side menu. The editor will need to establish a connection. Make sure you select the cluster we created earlier, and type dev as the database name and admin as the username.
Note
We didn't set a database name earlier. Redshift uses dev as the default value.
In the editor page, we will need to create a table to store the data. Let's name the table yellowtrips_3mo, as we will only store 3 months' worth of data:
create table yellowtrips_3mo
(vendorid varchar(10),
tpep_pickup_datetime datetime,
tpep_dropoff_datetime datetime,
passenger_count int,
trip_distance float,
ratecodeid varchar(10),
store_and_fwd_flag char(1),
pulocationid varchar(10),
dolocationid varchar(10),
payment_type varchar(10),
fare_amount float,
extra float,
mta_tax float,
tip_amount float,
tolls_amount float,
improvement_surcharge float,
total_amount float,
congestion_surcharge float);
Then, let's copy 3 months' worth of data into the data warehouse. Let's use the COPY command, as follows:
copy yellowtrips_3mo from
's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-10.csv'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
copy yellowtrips_3mo from
's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-11.csv'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
copy yellowtrips_3mo from
's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-12.csv'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
At this stage, we have a data lake with 12 months' worth of data and a data warehouse that contains only the most recent data (3 months). One of the characteristics of the modern data architecture is that it allows its users to query the data lake from the data warehouse. Feel free to query the data and start getting an understanding of the dataset. Let's create the external schema so that we can enable the Spectrum feature. Use the following command in your Redshift editor. Replace the role ARN with the values from your environment:
create external schema spectrum_schema from data catalog
database 'my-data-lake-db'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSpectrumRole'
create external database if not exists;
Let's just compare the size of the two tables using a simple count(*) query:
select count(*) from public.yellowtrips_3mo;
select count(*) from spectrum_schema.yellowtrips;

The first query will run against the recent data in the data warehouse. The second will run against the first query using the Spectrum nodes using the data in the data lake. As expected, the number of records in the data lake should be much higher than the number of records in the data warehouse. Specifically, the query result was 24,648,499 for the year 2020 and 4,652,013 records for the last 3 months of 2020.

Note

The Spectrum queries use the Spectrum nodes and are charged separately from the Redshift cluster. Every query incurs an added cost based on the data it needs to scan. Refer to the AWS pricing for details.

Feel free to experiment with the data and trigger a few queries to understand the dataset. When you finish with the Redshift cluster, you can pause the cluster so that you stop the on-demand billing. Once the cluster is paused you will only pay for the cluster storage.

Summary

Congratulations, you have reached the end of the first chapter! By now, you should have a good understanding of the AWS analytics ecosystem and its data lake and data warehousing options. In this chapter, we discussed in detail the key differences between data warehouses and data lakes. We also discussed the modern data architecture on AWS, and we looked at its main components in more detail. Finally, during the step-by-step section in this chapter, you had a chance to create a data lake and a data warehouse from scratch, and you loaded an open dataset for further analysis later on. We also defined Spectrum tables and queried the data lake directly for the data warehouse.

In the next chapter, we will discuss the basic concepts of Amazon QuickSight, understand its main benefits, and learn how to set up a QuickSight account.

Questions

What is the difference between data lakes and data warehouses?
What is schema on read and what is schema on write?
How can we identify the schema of new data in a data lake on AWS?
Which AWS storage service is ideal for a data lake on AWS?
What data is better served from the data lake in the modern data architecture?
How do Redshift Spectrum tables differ from Redshift tables?

Table of Contents for
Chapter 1: Introducing the AWS Analytics Ecosystem

Chapter 1: Introducing the AWS Analytics Ecosystem

Technical requirements

Discovering the AWS analytics ecosystem

Business intelligence

Data warehousing

Data lake storage and governance

Ad hoc analytics

Extract, transform, load

Exploring the modern data architecture on AWS

Data lakes versus data warehouses

modern data architecture on AWS

Ability to query the data lake from the data warehouse

Ability to load data from the data lake

Ability to unload data to the data lake

Creating a basic modern data architecture

Creating the data lake storage

Step 1 – creating the S3 bucket

Step 2 – adding data into the data lake

Step 3 – identifying the schema

Step 4 – creating the data warehouse

Step 5 – loading the data into the data warehouse

Summary

Questions

Further reading

Table of Contents for Chapter 1: Introducing the AWS Analytics Ecosystem

Create new playlist

Sign In

Sign Up

Chapter 1: Introducing the AWS Analytics Ecosystem

Technical requirements

Discovering the AWS analytics ecosystem

Business intelligence

Data warehousing

Data lake storage and governance

Ad hoc analytics

Extract, transform, load

Exploring the modern data architecture on AWS

Data lakes versus data warehouses

modern data architecture on AWS

Ability to query the data lake from the data warehouse

Ability to load data from the data lake

Ability to unload data to the data lake

Creating a basic modern data architecture

Creating the data lake storage

Step 1 – creating the S3 bucket

Step 2 – adding data into the data lake

Step 3 – identifying the schema

Step 4 – creating the data warehouse

Step 5 – loading the data into the data warehouse

Summary

Questions

Further reading

Table of Contents for
Chapter 1: Introducing the AWS Analytics Ecosystem