The lake house is an architectural pattern that makes data easily accessible across customers' analytics solutions, thereby preventing data silos. Amazon Redshift is the backbone of the lake house architecture—it allows enterprise customers to query data across data lakes, operational databases, and multiple data warehouses to build an analytics solution without having to move data in and out of these different systems. In this chapter, you will learn how you can leverage the lake house architecture to extend the data warehouse to services outside Amazon Redshift to build your solution, while taking advantage of the built-in integration. For example, you can use the Federated Query capability to join the operational data in your relational systems to historical data in Amazon Redshift to analyze a promotional trend.
The following recipes are discussed in this chapter:
Building a data lake catalog using AWS Lake Formation
Exporting a data lake from Amazon Redshift
Extending a data warehouse using Amazon Redshift Spectrum
Data sharing across multiple Amazon Redshift clusters
Querying operational sources using Federated Query
Here are the technical requirements in order to complete the recipes in this chapter:
a. An IAM policy attached to the IAM user that will give them access to Amazon Redshift, Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon Simple Notification Service (Amazon SNS), Amazon CloudWatch, Amazon CloudWatch Logs, AWS Key Management Service (AWS KMS), AWS IAM, AWS CloudFormation, AWS CloudTrail, Amazon Relational Database Service (Amazon RDS), AWS Lake Formation, AWS Secrets Manager, and AWS Glue
b. An IAM policy attached to the IAM role that will allow the Amazon Redshift cluster to access Amazon S3, Amazon RDS, and AWS Glue
The data lake design pattern has been widely adopted in the industry. Data lakes help to break down data silos by allowing you to store all of your data in a single, unified place. You can collect data from different sources, arriving at different frequencies, such as clickstream data. The data format can be structured, semi-structured, or unstructured. Analyzing a unified view of the data helps you derive more insight and drive more business value.
Your data lake should be secure and should meet your compliance requirements, with a centralized catalog that allows you to search and easily find data that is stored in the lake. One of the advantages of data lakes is that you can run a variety of analytical tools against them. You may also want to do new types of analysis on your data. For example, you may want to move from answering questions on what happened in the past to what is happening in real time, and using statistical models and forecasting techniques to understand and answer what could happen in the future. To do this, you need to incorporate machine learning (ML), big data processing, and real-time analytics. The pattern that allows you to integrate your analytics into a data lake is the lake house architecture. Amazon S3 object storage is used for centralized data lakes due to its scalability, high availability, and durability.
You can see an overview of the lake house architecture here:
Typical challenges and steps involved in building a data lake include the following:
Here is a representation of a lake house workflow moving data from raw format to analytics:
The AWS Lake Formation service simplifies building the data lake, centralizes its management, and lets you configure security policies in one place. AWS Lake Formation leverages AWS Glue for cataloging, data ingestion, and data transformation.
In this recipe, you will learn how to use Lake Formation to hydrate the data lake from a relational database, catalog the data, and apply security policies.
To complete this recipe, you will need the following to be set up:
In this recipe, the version of the MySQL engine is 5.7.31.
In this recipe, we will learn how to set up a data flow from a MySQL-based transactional database so that it is cataloged using the Lake Formation catalog and can be queried easily using Amazon Redshift:
mysql -h [yourMySQLRDSEndPoint] -u admin -p
create database ods;
CREATE TABLE ods.part
(
P_PARTKEY BIGINT NOT NULL,
P_NAME VARCHAR(55),
P_MFGR VARCHAR(25),
P_BRAND VARCHAR(10),
P_TYPE VARCHAR(25),
P_SIZE INTEGER,
P_CONTAINER VARCHAR(10),
P_RETAILPRICE DECIMAL(18,4),
P_COMMENT VARCHAR(23)
);
LOAD DATA LOCAL INFILE 'part.tbl'
INTO TABLE ods.part
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';
MySQL [(none)]> select count(*) from ods.part;
+----------+
| count(*) |
+----------+
| 20000 |
+----------+
1 row in set (0.00 sec)
Once successfully connected, you will see a connected successfully to your instance message, as shown here:
a. For Database connection, from the drop-down menu select datalake-mysql.
b. For Source data path, enter ods/part:
a. The workflow will crawl the MySQL table metadata and catalog it in the cookbook-data-lake database.
b. It will then unload the data from the MySQL ods.part table in Parquet format to the S3 location you provided.
c. Finally, it will crawl the Parquet data on S3 and create a table in the cookbook-data-lake database.
You can see an overview of this here:
Later in the chapter, using the Extending a data warehouse using Amazon Redshift Spectrum recipe, you will learn how to query this data using Amazon Redshift.
AWS Lake Formation simplifies the management and configuration of data lakes in a centralized place. AWS Glue's extract, transform, load (ETL) functionality, including Python shell jobs, Apache Spark jobs, and ML transforms, enables you to customize workflows to meet your needs. The AWS Glue/Lake Formation catalog integrates with Amazon Redshift for data warehousing, Amazon Athena for ad hoc analysis, Amazon SageMaker for predictive analysis, and Amazon EMR for big data processing.
Amazon Redshift powers a lake house architecture, allowing you to query data in both the data warehouse and the data lake using Amazon Redshift Spectrum, and also to export data back to the data lake on Amazon S3 so that it can be used by other analytical and ML services. When performing a data lake export, you can store the data in open file formats in your Amazon S3 data lake to integrate with your existing data lake formats.
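For example, a data lake export can also be performed directly with the UNLOAD command. The following is a minimal sketch, assuming a simplified aggregation query and placeholder values for the S3 bucket and IAM role:
UNLOAD ('SELECT c_mktsegment, COUNT(o_orderkey) AS orders_count
         FROM orders JOIN customer ON o_custkey = c_custkey
         GROUP BY c_mktsegment')
TO 's3://[Your-Amazon_S3_Bucket]/unload/order_counts/'        -- placeholder output prefix
IAM_ROLE 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift_Role]'
FORMAT AS PARQUET;                                            -- write open-format Parquet files
This chapter's recipe instead writes to the data lake through an external table, which has the advantage of registering the output in the catalog as part of the same step.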
To complete this recipe, you will need the following to be set up:
In this recipe, we will use the sample dataset created in Chapter 3, Loading and Unloading Data, to write data back to the Amazon S3 data lake:
SELECT c_mktsegment,
COUNT(o_orderkey) AS orders_count,
SUM(l_quantity) AS quantity,
COUNT(DISTINCT P_PARTKEY) AS parts_count,
COUNT(DISTINCT L_SUPPKEY) AS supplier_count,
COUNT(DISTINCT o_custkey) AS customer_count
FROM lineitem
JOIN orders ON l_orderkey = o_orderkey
JOIN customer c ON o_custkey = c_custkey
JOIN dwdate
ON d_date = l_commitdate
AND d_year = 1992
JOIN part ON P_PARTKEY = l_PARTKEY
JOIN supplier ON L_SUPPKEY = S_SUPPKEY
GROUP BY c_mktsegment limit 5;
Here's the expected sample output:
c_mktsegment | orders_count | quantity | parts_count | supplier_count | customer_count
--------------+--------------+--------------+-------------+----------------+----------------
MACHINERY | 82647 | 2107972.0000 | 75046 | 72439 | 67404
AUTOMOBILE | 82692 | 2109248.0000 | 75039 | 72345 | 67306
HOUSEHOLD | 82521 | 2112594.0000 | 74879 | 72322 | 67035
BUILDING | 83140 | 2115677.0000 | 75357 | 72740 | 67411
FURNITURE | 83405 | 2129150.0000 | 75759 | 73048 | 67876
CREATE external SCHEMA datalake_ext_schema
FROM data catalog DATABASE 'datalake_ext_schema'
iam_role 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift_Role]'
CREATE external DATABASE if not exists;
CREATE external TABLE datalake_ext_schema.order_summary
(c_mktsegment VARCHAR(10),
orders_count BIGINT,
quantity numeric(38,4),
parts_count BIGINT,
supplier_count BIGINT,
customer_count BIGINT
)
STORED AS PARQUET
LOCATION 's3://[Your-Amazon_S3_Bucket]/order_summary/';
Note
You are able to specify the output data format as PARQUET. You can use any of the supported data formats—see https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html for more information.
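As an illustration, a minimal sketch of how the same summary table could be defined as delimited text instead of Parquet (the table name and comma delimiter here are illustrative choices, not part of the recipe):
CREATE EXTERNAL TABLE datalake_ext_schema.order_summary_csv
(c_mktsegment VARCHAR(10),
orders_count BIGINT,
quantity NUMERIC(38,4),
parts_count BIGINT,
supplier_count BIGINT,
customer_count BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- plain delimited text instead of Parquet
STORED AS TEXTFILE
LOCATION 's3://[Your-Amazon_S3_Bucket]/order_summary_csv/';
Delimited text is easier to inspect manually, but columnar formats such as Parquet are typically more efficient for analytical scans. The recipe continues with the Parquet-based table created previously.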
INSERT INTO datalake_ext_schema.order_summary
SELECT c_mktsegment,
COUNT(o_orderkey) AS orders_count,
SUM(l_quantity) AS quantity,
COUNT(DISTINCT P_PARTKEY) AS parts_count,
COUNT(DISTINCT L_SUPPKEY) AS supplier_count,
COUNT(DISTINCT o_custkey) AS customer_count
FROM lineitem
JOIN orders ON l_orderkey = o_orderkey
JOIN customer c ON o_custkey = c_custkey
JOIN dwdate
ON d_date = l_commitdate
AND d_year = 1992
JOIN part ON P_PARTKEY = l_PARTKEY
JOIN supplier ON L_SUPPKEY = S_SUPPKEY
GROUP BY c_mktsegment;
select * from datalake_ext_schema.order_summary limit 5;
Here's the expected sample output:
c_mktsegment | orders_count | quantity | parts_count | supplier_count | customer_count
--------------+--------------+--------------+-------------+----------------+----------------
HOUSEHOLD | 82521 | 2112594.0000 | 74879 | 72322 | 67035
MACHINERY | 82647 | 2107972.0000 | 75046 | 72439 | 67404
FURNITURE | 83405 | 2129150.0000 | 75759 | 73048 | 67876
BUILDING | 83140 | 2115677.0000 | 75357 | 72740 | 67411
AUTOMOBILE | 82692 | 2109248.0000 | 75039 | 72345 | 67306
$ aws s3 ls s3://[Your-Amazon_S3_Bucket]/order_summary/
Here is the expected output:
2021-03-02 00:00:11 1588 20210302_000002_331241_25860550_0002_part_00.parquet
2021-03-02 00:00:11 1628 20210302_000002_331241_25860550_0013_part_00.parquet
2021-03-02 00:00:11 1581 20210302_000002_331241_25860550_0016_part_00.parquet
2021-03-02 00:00:11 1581 20210302_000002_331241_25860550_0020_part_00.parquet
The preceding sample output shows a list of all the Parquet files underlying the external table.
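You can also confirm the table registration from within Amazon Redshift. Here is a minimal sketch using the SVV_EXTERNAL_TABLES system view, assuming the external schema created earlier:
SELECT schemaname, tablename, location, input_format, serialization_lib
FROM svv_external_tables                 -- lists external tables visible to the cluster
WHERE schemaname = 'datalake_ext_schema';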
Amazon Redshift Spectrum allows Amazon Redshift customers to query data directly from an Amazon S3 data lake. This makes it possible to combine data warehouse data with data lake data stored in open file formats such as Parquet, comma-separated values (CSV), SequenceFile, Avro, and so on. Amazon Redshift Spectrum is a serverless solution, so customers don't have to provision or manage any additional infrastructure. It allows customers to perform unified analytics on data in an Amazon Redshift cluster and data in an Amazon S3 data lake, and easily create insights from disparate datasets.
To complete this recipe, you will need the following to be set up:
In this recipe, we will create an external table in an external schema and query the data directly from Amazon S3 using Amazon Redshift:
create external schema packt_spectrum
from data catalog
database 'packtspectrumdb'
iam_role 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift_Role]'
create external database if not exists;
aws s3 cp s3://packt-redshift-cookbook/spectrum/sales s3://[Your-Amazon_S3_Bucket]/spectrum/sales --recursive
create external table packt_spectrum.sales(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://[Your-Amazon_S3_Bucket]/spectrum/sales/'
table properties ('numRows'='172000');
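Note
The numRows table property gives the query planner an estimated row count for the external data. If the files under the S3 location change, the estimate can be updated. Here is a minimal sketch, assuming the external table created previously:
ALTER TABLE packt_spectrum.sales
SET TABLE PROPERTIES ('numRows'='172462');   -- illustrative updated row-count estimate
A more accurate estimate helps the optimizer choose better join and distribution plans.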
select count(*) from packt_spectrum.sales;
Here's the expected sample output:
count
------
172462
create table packt_event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);
copy packt_event from 's3://packt-redshift-cookbook/spectrum/event/allevents_pipe.txt'
iam_role 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift_Role]'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' Region 'us-east-1';
SELECT top 10 packt_spectrum.sales.eventid,
SUM(packt_spectrum.sales.pricepaid)
FROM packt_spectrum.sales,
packt_event
WHERE packt_spectrum.sales.eventid = packt_event.eventid
AND packt_spectrum.sales.pricepaid > 30
GROUP BY packt_spectrum.sales.eventid
ORDER BY 2 DESC;
Here's the expected output:
eventid | sum
--------+---------
289 | 51846.00
7895 | 51049.00
1602 | 50301.00
851 | 49956.00
7315 | 49823.00
6471 | 47997.00
2118 | 47863.00
984 | 46780.00
7851 | 46661.00
5638 | 46280.00
Now, Amazon Redshift is able to join the external and local tables to produce the desired results.
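If you want to see how much data Redshift Spectrum scanned from Amazon S3 for the preceding query, one option is the SVL_S3QUERY_SUMMARY system view. A minimal sketch, run in the same session immediately after the query:
SELECT query, segment, elapsed, s3_scanned_rows, s3_scanned_bytes
FROM svl_s3query_summary                  -- per-segment summary of Spectrum S3 scans
WHERE query = pg_last_query_id()          -- ID of the most recent query in this session
ORDER BY segment;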
Amazon Redshift RA3 clusters decouple storage and compute, giving you the ability to scale each independently. The decoupled storage allows the data to be read by different consumer clusters, providing workload isolation. The producer cluster controls access to the data that is shared. This feature opens up the possibility of setting up a flexible multi-tenant system. For example, within an organization, data produced by one business unit can be shared with different teams, such as marketing, finance, and data science, each of which can consume it independently using their own Amazon Redshift clusters.
To complete this recipe, you will need the following:
a. The first cluster should be deployed with the retail sample dataset from Chapter 3, Loading and Unloading Data. This cluster will be called the Producer Amazon Redshift cluster, where data will be shared from (outbound). Note down the namespace of this cluster—this can be found by running a SELECT current_namespace command. Let's say this cluster namespace value is [Your_Redshift_Producer_Namespace].
b. The second cluster can be an empty cluster. This cluster will be called the Consumer Amazon Redshift cluster, where data will be consumed (inbound). Note down the namespace of this cluster—this can be found by running a SELECT current_namespace command. Let's say this cluster namespace value is [Your_Redshift_Consumer_Namespace].
In this recipe, we will use the producer Amazon Redshift RA3 cluster to share the sample dataset with the consumer cluster:
SELECT DATE_TRUNC('month',l_shipdate),
SUM(l_quantity) AS quantity
FROM lineitem
WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-06-30'
GROUP BY DATE_TRUNC('month',l_shipdate);
--Sample output dataset
date_trunc | quantity
---------------------+----------------
1992-05-01 00:00:00 | 196639390.0000
1992-06-01 00:00:00 | 190360957.0000
1992-03-01 00:00:00 | 122122161.0000
1992-02-01 00:00:00 | 68482319.0000
1992-04-01 00:00:00 | 166017166.0000
1992-01-01 00:00:00 | 24426745.0000
CREATE DATASHARE SSBDataShare;
ALTER DATASHARE SSBDataShare ADD TABLE lineitem;
GRANT USAGE ON DATASHARE SSBDataShare TO NAMESPACE '[Your_Redshift_Consumer_Namespace]';
SHOW DATASHARES;
Here's the expected output:
owner_account,owner_namespace,sharename,shareowner,share_type,createdate,publicaccess
123456789012,redshift-cluster-data-share-1,ssbdatashare,100,outbound,2021-02-26 19:03:16.0,false
DESC DATASHARE ssbdatashare OF NAMESPACE '[Your_Redshift_Producer_Namespace]';
Here's the expected output:
producer_account | producer_namespace | share_type | share_name | object_type | object_name
-------------------+--------------------------------------+------------+------------+-------------+---------------------------------
123456789012 | [Your_Redshift_Producer_Namespace]| INBOUND | ssbdatashare | table | public.lineitem
CREATE DATABASE ssb_db FROM DATASHARE ssbdatashare OF NAMESPACE '[Your_Redshift_Producer_Namespace]';
CREATE EXTERNAL SCHEMA ssb_schema FROM REDSHIFT DATABASE 'ssb_db' SCHEMA 'public';
SELECT DATE_TRUNC('month',l_shipdate),
SUM(l_quantity) AS quantity
FROM ssb_db.public.lineitem
WHERE l_shipdate BETWEEN '1992-01-01' AND '1992-06-30'
GROUP BY DATE_TRUNC('month',l_shipdate);
Here's the sample dataset:
date_trunc | quantity
---------------------+----------------
1992-05-01 00:00:00 | 196639390.0000
1992-06-01 00:00:00 | 190360957.0000
1992-03-01 00:00:00 | 122122161.0000
1992-02-01 00:00:00 | 68482319.0000
1992-04-01 00:00:00 | 166017166.0000
1992-01-01 00:00:00 | 24426745.0000
As you can see from the preceding code snippet, the data shared by the producer cluster is now available for querying in the consumer cluster.
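Because we also created the ssb_schema external schema in the previous step, the same shared table can be referenced with a two-part name instead of the three-part database.schema.table notation. A minimal sketch:
SELECT COUNT(*) FROM ssb_schema.lineitem;   -- query the shared table through the external schema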
With Amazon Redshift, you can share data at different levels. These levels include databases, schemas, tables, views (including regular, late-binding, and materialized views), and SQL user-defined functions (UDFs). You can create multiple datashares for a given database. A datashare can contain objects from multiple schemas in the database on which sharing is created.
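As an illustration, objects can be added to or removed from an existing datashare incrementally. Here is a hedged sketch against the datashare created earlier, assuming the public schema on the producer cluster:
ALTER DATASHARE SSBDataShare ADD SCHEMA public;              -- share the schema itself
ALTER DATASHARE SSBDataShare ADD ALL TABLES IN SCHEMA public; -- share all current tables in it
Views and SQL UDFs can be shared with similar ALTER DATASHARE clauses.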
By having this flexibility in sharing data, you get fine-grained access control. You can tailor this control for different users and businesses that need access to Amazon Redshift data. Amazon Redshift provides transactional consistency on all producer and consumer clusters and shares up-to-date and consistent views of the data with all consumers. You can also use SVV_DATASHARES, SVV_DATASHARE_CONSUMERS, and SVV_DATASHARE_OBJECTS to view datashares, the objects within the datashares, and the datashare consumers.
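A minimal sketch of inspecting these system views on the producer cluster (the column lists are shortened for readability):
SELECT share_name, share_type, is_publicaccessible FROM svv_datashares;
SELECT share_name, object_type, object_name FROM svv_datashare_objects;
SELECT share_name, consumer_account, consumer_namespace FROM svv_datashare_consumers;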
Amazon Redshift Federated Query enables unified analytics across databases, data warehouses, and data lakes. With the Federated Query feature in Amazon Redshift, you can query live data in Amazon RDS and Aurora PostgreSQL databases. For example, you might have up-to-date customer address data that you want to join with historical order data to enrich your reports; this can easily be done using the Federated Query feature.
To complete this recipe, you will need the following:
In this recipe, we will use an Amazon Aurora serverless PostgreSQL database as the operational data store to federate with Amazon Redshift:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AccessSecret",
"Effect": "Allow",
"Action": [
"secretsmanager:GetResourcePolicy",
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret",
"secretsmanager:ListSecretVersionIds"
],
"Resource": "arn:aws:secretsmanager:us-east-1:[Your-AWS_Account_Id]:secret:aurora-pg/RedshiftCookbook"
},
{
"Sid": "VisualEditor1",
"Effect": "Allow",
"Action": [
"secretsmanager:GetRandomPassword",
"secretsmanager:ListSecrets"
],
"Resource": "*"
}
]
}
DROP SCHEMA IF EXISTS ext_postgres;
CREATE EXTERNAL SCHEMA ext_postgres
FROM POSTGRES
DATABASE 'postgres'
URI '[AuroraClusterEndpoint]'
IAM_ROLE 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift-Role]'
SECRET_ARN '[AuroraPostgreSQLSecretsManagerARN]';
select *
from svv_external_schemas;
select *
from svv_external_tables
where schemaname = 'ext_postgres';
select count(*) from ext_postgres.orders;
Here's the expected output:
1000
SELECT O_ORDERSTATUS,
COUNT(o_orderkey) AS orders_count
FROM ext_postgres.orders
JOIN dwdate
ON d_date = O_ORDERDATE
AND d_year = 1992
GROUP BY O_ORDERSTATUS;
Here's the expected output:
o_orderstatus | orders_count
---------------+--------------
F             |         1000
create materialized view public.live_orders as
SELECT O_ORDERSTATUS,
COUNT(o_orderkey) AS orders_count
FROM ext_postgres.orders
JOIN dwdate
ON d_date = O_ORDERDATE
AND d_year = 1992
GROUP BY O_ORDERSTATUS;
As observed, the materialized view can federate data between the Aurora PostgreSQL database and Amazon Redshift.
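Because the materialized view reads live data through the federated schema, it can be refreshed on demand to pick up changes in the operational database. A minimal sketch:
REFRESH MATERIALIZED VIEW public.live_orders;   -- re-run the federated query and store the results
SELECT * FROM public.live_orders;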