Chapter 1. Ingesting Data Into The Cloud

In this chapter, we will show you how to ingest data into the cloud. For that purpose, we will look at a typical scenario in which an application writes files into an Amazon S3 data lake; those files then need to be accessed by both the ML Engineering/Data Science team and the Business Intelligence/Data Analyst team, as shown in Figure 1-1.

Figure 1-1. Chapter Overview > Data is being written to S3, and accessed by different teams.

Amazon Simple Storage Service (Amazon S3) is object storage built to store and retrieve any amount of data. It offers extreme durability, high availability, and virtually unlimited scalability at very low cost. Hence, it is the perfect foundation for data lakes and for storing all of your machine learning data, whether it's training and test datasets, model code, or even the model files themselves. You will learn more about the advantages of building data lakes on Amazon S3 in the next section.

Let’s assume our application continually captures some form of log data (e.g., customer interactions on our website) and writes it to S3 in tab-separated values (TSV) format.

As a data scientist or machine learning engineer, you want to have access to all of the raw data, and be able to quickly explore it. We will introduce you to Amazon Athena and show you how to leverage Athena as an interactive query service to analyze data in S3 using standard SQL, without moving the data. In the first step, we will register the TSV data in our S3 bucket with Athena, and then run some ad-hoc queries on the dataset. We will also show how you can easily convert the TSV data into the more query-optimized, columnar file format Apache Parquet.

Your business intelligence team and data analysts might also want to have a subset of the data in a data warehouse, which they can then transform and query with their standard SQL clients to create reports and visualize trends. We will introduce you to Amazon Redshift, a fully managed data warehouse service, and show you how to insert TSV data into Redshift, as well as how to combine data warehouse queries with the less frequently accessed data that’s still in our S3 data lake via Amazon Redshift Spectrum. Your Business Intelligence team can also use Amazon Redshift’s data lake export functionality to unload (transformed, enriched) data back into our S3 data lake in Parquet file format.

We will also discuss key services and functionalities that can help you secure your data, in particular the AWS Identity and Access Management (IAM) service, the AWS Key Management Service (KMS) for encryption, and S3 Access Points to control access to data in shared S3 buckets.

We will conclude this chapter with some tips and tricks on how you can increase performance using compression algorithms and reduce cost by leveraging S3 Intelligent-Tiering.

Note

In order to run the code examples in your own environment, make sure you have access to an AWS account with the corresponding privileges. Please see Appendix C for how to set up a new AWS account and the AWS services discussed in this book.

Data Lakes

In the previous chapter, we discussed the democratization of Artificial Intelligence and Data Science over the last few years, the explosion of data, and how cloud services provide the infrastructure agility to store and process data of any amount.

Yet, in order to use all this data efficiently, companies are tasked with breaking down existing data silos and finding ways to analyze very diverse datasets, both structured and unstructured, while ensuring the highest standards of data governance, data security, and compliance with privacy regulations. These (big) data challenges set the stage for data lakes.

As illustrated in Figure 1-2, a data lake is a centralized and secure repository that enables you to store, govern, discover, and share data at any scale. One of the biggest advantages of data lakes is that you don’t need to pre-define any schemas. You can store your raw data at scale and then decide later in which ways you need to process and analyze it. Data lakes may contain structured relational data, files, and any form of semi-structured and unstructured data. You can also ingest data in real time.

Figure 1-2. Data Lake.

Data lakes provide a perfect base for Data Science and Machine Learning, as they give you access to large and diverse datasets to train and deploy more accurate models.

Building a data lake typically consists of the following (high-level) steps, as shown in Figure 1-3:

  1. Set up storage.

  2. Move your data.

  3. Cleanse, prepare and catalog the data.

  4. Configure and enforce security and compliance policies.

  5. Make data available for analytics.

Figure 1-3. Typical steps of building a data lake.

Each of those steps involves a range of tools and technologies, and while you can build a data lake manually from the ground up, there are cloud services available to help you streamline this process, such as AWS Lake Formation.

Lake Formation helps you to collect and catalog data from databases and object storage, move the data into your Amazon S3 data lake, clean and classify your data using machine learning algorithms, and secure access to your sensitive data.

Note

You can find more information on AWS Lake Formation here: https://aws.amazon.com/lake-formation

From a data analysis perspective, another key benefit of storing your data in Amazon S3 is that it dramatically shortens the “time to insight”: you can run ad-hoc queries directly on the data in S3 without going through complex Extract-Transform-Load (ETL) processes and data pipelines to get your data into a traditional enterprise data warehouse, as we will see in the next sections of this chapter.

Import Your Data into the S3 Data Lake

We are now ready to import our data into S3. We have chosen the Amazon Customer Reviews Dataset as the primary dataset for this book.

The Amazon Customer Reviews Dataset consists of more than 130 million customer reviews of products across 43 different product categories, collected on the Amazon.com website between 1995 and 2015. It is a great resource for demonstrating Machine Learning concepts in general, and Natural Language Processing (NLP) techniques in particular.

You have likely seen these customer reviews on the Amazon.com website before when contemplating whether to purchase a certain product via the Amazon.com marketplace. Figure 1-4 shows the product reviews section on the Amazon.com website for an Amazon Echo Dot device.

Figure 1-4. Product reviews section on the Amazon.com website.

Let’s have a closer look at our dataset and its schema.

Describe the Dataset

Customer reviews are one of Amazon’s most valuable tools for helping customers make informed purchase decisions.

In Amazon’s annual shareholder letters, Jeff Bezos (Amazon Founder and CEO) regularly elaborates on the importance of “word of mouth” as a customer acquisition tool, and his love for “customers’ constant discontent” as he calls it:

“We now offer customers … vastly more reviews, content, browsing options, and recommendation features. ... Word of mouth remains the most powerful customer acquisition tool we have, and we are grateful for the trust our customers have placed in us. Repeat purchases and word of mouth have combined to make Amazon.com the market leader in online bookselling.”

Jeff Bezos, 1997 Shareholder (“Share Owner”) Letter

Here is the schema for the dataset:

marketplace
2-letter country code (in this case all “US”).
customer_id
Random identifier that can be used to aggregate reviews written by a single author.
review_id
A unique ID for the review.
product_id
The Amazon Standard Identification Number (ASIN). http://www.amazon.com/dp/<ASIN> links to the product’s detail page.
product_parent
The parent of that ASIN. Multiple ASINs (color or format variations of the same product) can roll up into a single product parent.
product_title
Title description of the product.
product_category
Broad product category that can be used to group reviews.
star_rating
The review’s rating (1 to 5 stars).
helpful_votes
Number of helpful votes for the review.
total_votes
Number of total votes the review received.
vine
Was the review written as part of the Vine program?
verified_purchase
Was the review from a verified purchase?
review_headline
The title of the review itself.
review_body
The text of the review.
review_date
The date the review was written.

The dataset is shared in a public Amazon S3 bucket, and is available in two file formats:

  • Tab separated value (TSV), a text format - s3://amazon-reviews-pds/tsv/

  • Parquet, an optimized columnar binary format - s3://amazon-reviews-pds/parquet/

The Parquet dataset is partitioned (divided into subfolders) by the column product_category to further improve query performance. With this, you can use a WHERE clause on product_category in your SQL queries to only read data specific to that category.
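To illustrate the effect of this partitioning from Python, here is a hedged sketch that reads just one partition of the public Parquet dataset into a pandas DataFrame; it assumes the pandas, pyarrow, and s3fs packages are installed and that AWS credentials are configured:

import pandas as pd

# Only the files under product_category=Gift_Card/ are read,
# thanks to the filter on the partition column.
df = pd.read_parquet(
    "s3://amazon-reviews-pds/parquet/",
    engine="pyarrow",
    filters=[("product_category", "=", "Gift_Card")],
)
print(df.shape)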

We can use the AWS Command Line Interface (CLI) to list the S3 bucket content using the following CLI commands:

  • aws s3 ls s3://amazon-reviews-pds/tsv/

  • aws s3 ls s3://amazon-reviews-pds/parquet/

Note

The AWS CLI tool provides a unified command line interface to Amazon Web Services. You can find more information on how to install and configure the tool here: https://aws.amazon.com/cli/.
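If you prefer Python over the CLI, a minimal boto3 sketch that produces a similar listing might look like this (it assumes boto3 is installed and AWS credentials are configured):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Page through the public bucket and print each object under the tsv/ prefix.
for page in paginator.paginate(Bucket="amazon-reviews-pds", Prefix="tsv/"):
    for obj in page.get("Contents", []):
        print(obj["LastModified"], obj["Size"], obj["Key"])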

The listings below show us the available dataset files in TSV format, and the Parquet partitioning folder structure.

Dataset Files in TSV Format

2017-11-24 13:48:03  241896005 amazon_reviews_multilingual_DE_v1_00.tsv.gz
2017-11-24 13:48:17   70583516 amazon_reviews_multilingual_FR_v1_00.tsv.gz
2017-11-24 13:48:34   94688992 amazon_reviews_multilingual_JP_v1_00.tsv.gz
2017-11-24 13:49:14  349370868 amazon_reviews_multilingual_UK_v1_00.tsv.gz
2017-11-24 13:48:47 1466965039 amazon_reviews_multilingual_US_v1_00.tsv.gz
2017-11-24 13:49:53  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2017-11-24 13:56:36  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2017-11-24 14:04:02  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2017-11-24 14:08:11  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2017-11-24 14:17:41 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2017-11-24 14:45:50 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2017-11-24 15:10:21 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2017-11-24 15:22:13  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2017-11-24 15:27:13 2689739299 amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz
2017-11-24 15:49:13 1294879074 amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz
2017-11-24 15:59:24  253570168 amazon_reviews_us_Digital_Music_Purchase_v1_00.tsv.gz
2017-11-24 16:01:47   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2017-11-24 16:01:53  506979922 amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz
2017-11-24 16:06:31   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2017-11-24 16:06:42  698828243 amazon_reviews_us_Electronics_v1_00.tsv.gz
2017-11-24 16:12:44  148982796 amazon_reviews_us_Furniture_v1_00.tsv.gz
2017-11-24 16:13:52   12134676 amazon_reviews_us_Gift_Card_v1_00.tsv.gz
2017-11-24 16:13:59  401337166 amazon_reviews_us_Grocery_v1_00.tsv.gz
2017-11-24 19:55:29 1011180212 amazon_reviews_us_Health_Personal_Care_v1_00.tsv.gz
2017-11-24 20:30:55  193168458 amazon_reviews_us_Home_Entertainment_v1_00.tsv.gz
2017-11-24 20:37:56  503339178 amazon_reviews_us_Home_Improvement_v1_00.tsv.gz
2017-11-24 20:55:43 1081002012 amazon_reviews_us_Home_v1_00.tsv.gz
2017-11-24 21:47:51  247022254 amazon_reviews_us_Jewelry_v1_00.tsv.gz
2017-11-24 21:59:56  930744854 amazon_reviews_us_Kitchen_v1_00.tsv.gz
2017-11-24 23:41:48  486772662 amazon_reviews_us_Lawn_and_Garden_v1_00.tsv.gz
2017-11-24 23:59:42   60320191 amazon_reviews_us_Luggage_v1_00.tsv.gz
2017-11-25 00:01:59   24359816 amazon_reviews_us_Major_Appliances_v1_00.tsv.gz
2017-11-25 00:02:45  557959415 amazon_reviews_us_Mobile_Apps_v1_00.tsv.gz
2017-11-25 00:22:19   22870508 amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz
2017-11-25 00:23:06 1521994296 amazon_reviews_us_Music_v1_00.tsv.gz
2017-11-25 00:58:36  193389086 amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz
2017-11-25 01:03:14  512323500 amazon_reviews_us_Office_Products_v1_00.tsv.gz
2017-11-25 07:21:21  448963100 amazon_reviews_us_Outdoors_v1_00.tsv.gz
2017-11-25 07:32:46 1512903923 amazon_reviews_us_PC_v1_00.tsv.gz
2017-11-25 08:10:33   17634794 amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv.gz
2017-11-25 08:11:02  515815253 amazon_reviews_us_Pet_Products_v1_00.tsv.gz
2017-11-25 08:22:26  642255314 amazon_reviews_us_Shoes_v1_00.tsv.gz
2017-11-25 08:39:15   94010685 amazon_reviews_us_Software_v1_00.tsv.gz
2017-11-27 10:36:58  872478735 amazon_reviews_us_Sports_v1_00.tsv.gz
2017-11-25 08:52:11  333782939 amazon_reviews_us_Tools_v1_00.tsv.gz
2017-11-25 09:06:08  838451398 amazon_reviews_us_Toys_v1_00.tsv.gz
2017-11-25 09:42:13 1512355451 amazon_reviews_us_Video_DVD_v1_00.tsv.gz
2017-11-25 10:50:22  475199894 amazon_reviews_us_Video_Games_v1_00.tsv.gz
2017-11-25 11:07:59  138929896 amazon_reviews_us_Video_v1_00.tsv.gz
2017-11-25 11:14:07  162973819 amazon_reviews_us_Watches_v1_00.tsv.gz
2017-11-26 15:24:07 1704713674 amazon_reviews_us_Wireless_v1_00.tsv.gz
2017-11-28 15:55:13       6162 index.txt
2017-11-27 11:08:16      17553 sample_fr.tsv
2017-11-27 11:08:17      15906 sample_us.tsv

Dataset Files in Parquet Format

PRE product_category=Apparel/
PRE product_category=Automotive/
PRE product_category=Baby/
PRE product_category=Beauty/
PRE product_category=Books/
PRE product_category=Camera/
PRE product_category=Digital_Ebook_Purchase/
PRE product_category=Digital_Music_Purchase/
PRE product_category=Digital_Software/
PRE product_category=Digital_Video_Download/
PRE product_category=Digital_Video_Games/
PRE product_category=Electronics/
PRE product_category=Furniture/
PRE product_category=Gift_Card/
PRE product_category=Grocery/
PRE product_category=Health_&_Personal_Care/
PRE product_category=Home/
PRE product_category=Home_Entertainment/
PRE product_category=Home_Improvement/
PRE product_category=Jewelry/
PRE product_category=Kitchen/
PRE product_category=Lawn_and_Garden/
PRE product_category=Luggage/
PRE product_category=Major_Appliances/
PRE product_category=Mobile_Apps/
PRE product_category=Mobile_Electronics/
PRE product_category=Music/
PRE product_category=Musical_Instruments/
PRE product_category=Office_Products/
PRE product_category=Outdoors/
PRE product_category=PC/
PRE product_category=Personal_Care_Appliances/
PRE product_category=Pet_Products/
PRE product_category=Shoes/
PRE product_category=Software/
PRE product_category=Sports/
PRE product_category=Tools/
PRE product_category=Toys/
PRE product_category=Video/
PRE product_category=Video_DVD/
PRE product_category=Video_Games/
PRE product_category=Watches/
PRE product_category=Wireless/

Note that PRE stands for prefix. For now, you can think of prefixes as folders in S3.

Note

At any time, you can use EXPLAIN on your queries to make sure the partitions are being utilized. If your query patterns change over time, you may want to revisit updating the existing partitions - or even adding new partitions to match your business needs.

So which data format should you choose? The query-optimized Parquet columnar file format is definitely preferred when running analytics queries, as it improves query performance. On the other hand, many applications write out data as simple comma-separated (CSV) or tab-separated (TSV) files, e.g., application log files. So let’s assume we don’t have the Parquet files ready to use, as this allows us to show you how to easily get there from CSV or TSV files.

As a first step, let’s copy the TSV data from Amazon’s public S3 bucket into a privately hosted S3 bucket to simulate that process, as shown in Figure 1-5.

Figure 1-5. Copy our dataset from the public S3 bucket to a privately hosted S3 bucket.

You can use the AWS CLI tool again to perform the following steps.

  1. Create a new private S3 bucket:

aws s3 mb s3://data-science-on-aws

  2. Copy the content of the public S3 bucket to our newly created private S3 bucket as follows (only include the files starting with amazon_reviews_us_, i.e., skipping any index, multilingual, and sample data files in that directory):

aws s3 cp --recursive s3://amazon-reviews-pds/tsv/ s3://data-science-on-aws/amazon-reviews-pds/tsv/ --exclude "*" --include "amazon_reviews_us_*"
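If you would rather perform the copy from Python, here is a hedged boto3 sketch that does the same thing; it assumes the destination bucket already exists and your credentials can read the public bucket and write to your own:

import boto3

s3 = boto3.client("s3")
source_bucket = "amazon-reviews-pds"
dest_bucket = "data-science-on-aws"   # replace with your own bucket name

# List only the US review files and copy them server-side into our bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=source_bucket, Prefix="tsv/amazon_reviews_us_"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # s3.copy() performs a managed (multipart-capable) copy for large files.
        s3.copy(
            CopySource={"Bucket": source_bucket, "Key": key},
            Bucket=dest_bucket,
            Key=f"amazon-reviews-pds/{key}",
        )
        print(f"Copied {key}")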

We are now ready to use Amazon Athena to register and query the data, and transform the TSV files into Parquet.

Query the S3 Data Lake with Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so you don’t need to manage any infrastructure, and you only pay for the queries you run.

With Athena, you can query data wherever it is stored (S3 in our case) without needing to move the data to a relational database. Athena separates compute from storage, lowering the overall time-to-insight (TTI) between the moment your application lands data in S3 and the moment a user derives an insight for your business. When you register an Athena table with your S3 data, Athena stores the table-to-S3 mapping in a data catalog; Athena uses the AWS Glue Data Catalog service for that purpose. You can think of the Glue Data Catalog as a persistent metadata store in your AWS account that other AWS services, such as Athena and Redshift Spectrum, can use to locate and query data.

See the AWS Glue Data Catalog in Figure 1-6.

Figure 1-6. AWS Glue Data Catalog.

Athena queries run in parallel over a dynamic, serverless cluster which makes Athena extremely fast -- even on large datasets. Athena will automatically scale the cluster depending on the query and dataset -- freeing the user from worrying about these details.

In addition, Athena supports the Parquet columnar file format with tens of millions of partitions (e.g., by product_category, year, or marketplace) to improve the performance of your queries. For example, if you plan to run frequent queries that group the results by product_category, then you should create a partition in Athena for product_category. Upon creation, Athena will update the AWS Glue Data Catalog accordingly so that future queries inherit the performance benefits of this new partition.

Athena is based on Presto, an open source, distributed SQL query engine designed for fast, ad-hoc data analytics on large datasets. Similar to Apache Spark, Presto uses high-RAM clusters to perform its queries. However, Presto does not require a large amount of disk, as it is designed for ad-hoc queries (vs. automated, repeatable queries) and therefore does not perform the checkpointing required for fault tolerance.

Note

You can find more information on Presto here: https://aws.amazon.com/big-data/what-is-presto/

While Athena is very fast for ad-hoc queries, it is perhaps not the best option for mission-critical queries that need to recover from partial failures. This is the classic trade-off between performance and fault tolerance per the CAP Theorem, as shown in Figure 1-7.

Figure 1-7. CAP Theorem.

If you truly need fault-tolerant queries, you will need something like Apache Spark on Amazon EMR (Elastic MapReduce) or SageMaker Processing Jobs. Apache Spark, while known for its efficient cluster-memory utilization, actually uses a combination of checkpointing and disk-spilling to recover from cluster-node failures. The drawback is performance: Apache Spark is slower than Athena for many ad-hoc queries.

For longer-running Athena jobs, you can listen for query-completion events using CloudWatch Events. When the query completes, all listeners are notified with the event details including query success status, total execution time, and total bytes scanned.
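As a rough sketch of how such a listener could be wired up with boto3, the rule below matches Athena query state-change events and routes them to a Lambda function. The "Athena Query State Change" detail type and the Lambda ARN are assumptions to verify for your account, and the Lambda function also needs a resource policy allowing events.amazonaws.com to invoke it:

import json
import boto3

events = boto3.client("events")

# Rule matching Athena query state-change events (detail type assumed).
events.put_rule(
    Name="athena-query-state-change",
    EventPattern=json.dumps({
        "source": ["aws.athena"],
        "detail-type": ["Athena Query State Change"],
    }),
)

# Route matching events to a (hypothetical) Lambda function.
events.put_targets(
    Rule="athena-query-state-change",
    Targets=[{
        "Id": "athena-query-listener",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:process-athena-events",
    }],
)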

With a functionality called Athena Federated Query, you can also run SQL queries across data stored in relational databases (such as Amazon RDS and Amazon Aurora), non-relational databases (such as Amazon DynamoDB), object storage (Amazon S3), and custom data sources. This gives you a unified analytics view across data stored in your data warehouse, data lake and operational databases without the need to actually move the data.

You can access Athena via the AWS Management Console, an API, or an ODBC or JDBC driver for programmatic access. Let’s have a look at how to use Amazon Athena via the AWS Management Console.

Access Athena from the AWS Console

In order for us to use Amazon Athena, we first need to set up the service as follows:

  1. Click on Amazon Athena in the AWS Management Console.

    If this is your first time visiting the Athena console, you’ll see a “Getting Started” page as shown in Figure 1-8. Choose “Get Started” to open the Query Editor.

Figure 1-8. Amazon Athena Console > “Getting Started” page.

You might also be asked to set up a query result location in Amazon S3 if you are using Amazon Athena for the first time. In that case, click on the link shown in Figure 1-9 and define an S3 location, e.g., s3://data-science-on-aws/athena/staging.

Figure 1-9. Amazon Athena Console > define a S3 bucket to store results.
  2. Create a database.

    In the Athena Query Editor, you see a query pane with an example query. Start typing your query anywhere in the query pane, as shown in Figure 1-10.

Figure 1-10. Amazon Athena Query Editor.

To create our database, enter the following CREATE DATABASE statement, and then choose “Run Query”:

CREATE DATABASE dsoaws;

Confirm that the catalog display refreshes and dsoaws appears in the DATABASE list in the Catalog dashboard on the left side, as shown in Figure 1-11.

Figure 1-11. Amazon Athena Console. Database created successfully.

When we run CREATE DATABASE and CREATE TABLE queries in Athena with the AWS Glue Data Catalog as our source, we automatically see the database and table metadata entries being created in the AWS Glue Data Catalog, as shown in Figure 1-12.

Figure 1-12. AWS Glue Console. Database entry created successfully.
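If you prefer to work programmatically rather than through the console, here is a minimal boto3 sketch that issues the same CREATE DATABASE statement through the Athena API; the staging location is the one we defined earlier, so adjust it to your own bucket:

import time
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="CREATE DATABASE IF NOT EXISTS dsoaws",
    ResultConfiguration={"OutputLocation": "s3://data-science-on-aws/athena/staging/"},
)

query_execution_id = response["QueryExecutionId"]

# Poll until the DDL statement finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        print(state)
        break
    time.sleep(1)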

Register Your S3 Data as an Athena Table

Now that we have a database, we’re ready to create a table that’s based on the Amazon Customer Reviews Dataset. We define the columns that map to the data, specify how the data is delimited, and provide the location in Amazon S3 for the file(s). We’re basically defining a “schema-on-read” here.

In the Athena Console, make sure that dsoaws is selected for DATABASE and then choose “New Query” and run the following SQL statement:

CREATE EXTERNAL TABLE IF NOT EXISTS dsoaws.amazon_reviews_tsv(
         marketplace string,
         customer_id string,
         review_id string,
         product_id string,
         product_parent string,
         product_title string,
         product_category string,
         star_rating int,
         helpful_votes int,
         total_votes int,
         vine string,
         verified_purchase string,
         review_headline string,
         review_body string,
         review_date string 
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://data-science-on-aws/amazon-reviews-pds/tsv'
TBLPROPERTIES ('compressionType'='gzip', 'skip.header.line.count'='1')

You should see the newly created table amazon_reviews_tsv appear on the left under “Tables”, as shown in Figure 1-13.

Figure 1-13. Amazon Athena Console. TSV table created successfully.

You can run a sample query like this to check if everything works correctly:

SELECT *
FROM dsoaws.amazon_reviews_tsv
WHERE product_category = 'Digital_Video_Download' LIMIT 10

You should see a result similar to what is shown in Figure 1-14.

Figure 1-14. Amazon Athena showing query result.
Note

When using LIMIT, you can get a better sample of the rows by adding TABLESAMPLE BERNOULLI(10) after the FROM clause. Otherwise, you will always get the data back in the same order it was ingested into S3, which could be skewed toward a single product_category, for example. To reduce code clutter, we will just use LIMIT without TABLESAMPLE.

Create a Parquet-based Table in Athena

As a next step, we will show you how to easily convert that data into the Apache Parquet columnar file format to improve query performance. Parquet is optimized for columnar-based queries such as counts, sums, averages, and other aggregation-based summary statistics that focus on column values vs. row information.

Note

While we already have the data in Parquet format from the public dataset, we feel that creating a Parquet table is an important enough topic to demonstrate in this book. If you are doing any type of large scale data analytics, you should be using a columnar file format like Parquet. We discuss the benefits of Parquet in the performance section.

  • Again, make sure that dsoaws is selected for DATABASE, then choose “New Query” and run the following “CREATE TABLE AS” (CTAS, for short) SQL statement:

CREATE TABLE IF NOT EXISTS dsoaws.amazon_reviews_parquet
WITH (format = 'PARQUET', external_location = 's3://data-science-on-aws/amazon-reviews-pds/parquet', partitioned_by = ARRAY['product_category']) AS
SELECT marketplace,
         customer_id,
         review_id,
         product_id,
         product_parent,
         product_title,
         star_rating,
         helpful_votes,
         total_votes,
         vine,
         verified_purchase,
         review_headline,
         review_body,
         CAST(YEAR(DATE(review_date)) AS INTEGER) AS year,
         DATE(review_date) AS review_date,
         product_category
FROM dsoaws.amazon_reviews_tsv

As you can see from the query, we’re also adding a new year column to our dataset by converting the review_date string to a date and then casting the year out of that date. We store the year value as an integer.

  • You should now see the newly created table amazon_reviews_parquet appear as well on the left under “Tables”, as shown in Figure 1-15.

Figure 1-15. Amazon Athena. Parquet table created successfully.
  • As a last step, we need to load the Parquet partitions. To do so, just issue the following SQL command:

MSCK REPAIR TABLE amazon_reviews_parquet;

Note

You can automate the MSCK REPAIR TABLE command to load the partitions after each data ingest from any workflow manager, or from an AWS Lambda function that is triggered when new data is uploaded to the S3 bucket, by executing the following AWS CLI command:

aws athena start-query-execution --query-string "MSCK REPAIR TABLE <database>.<table>" --result-configuration "OutputLocation=s3://<bucket>"
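A minimal sketch of such a Lambda handler in Python might look like the following; the database, table, and staging location mirror the ones used above, so adjust them to your environment and give the function's execution role the required Athena and S3 permissions:

import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    # Triggered by an S3 upload event; re-discover any new partitions.
    response = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE dsoaws.amazon_reviews_parquet",
        ResultConfiguration={"OutputLocation": "s3://data-science-on-aws/athena/staging/"},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}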

  • You can run our sample query again to check if everything works correctly:

SELECT *
FROM dsoaws.amazon_reviews_parquet
WHERE product_category = 'Digital_Video_Download' LIMIT 10
  • We can also double-check that both tables’ metadata has been stored in the AWS Glue Data Catalog, as shown in Figure 1-16.

Figure 1-16. AWS Glue console. Both tables’ metadata information was created.

Done! In just a few steps we have set up Amazon Athena to connect to our Amazon Customer Reviews TSV files, and transformed them into Apache Parquet file format.

You might have noticed that the second sample query finished in a fraction of the time of the first one, which ran on the TSV table. We accelerated the query response time by leveraging data stored in Parquet and partitioned by product_category.

Now, for the next section, let’s assume our business intelligence team wants to load the last 2 years of customer review data into our data warehouse to analyze customer behavior and buying trends.

Load Dataset into Redshift Data Warehouse

One of the fundamental differences between data lakes and data warehouses is that while you ingest and store huge amounts of raw, unprocessed data in your data lake, you normally only load some fraction of your recent data into your data warehouse. Depending on your business and analytics use case, this might be data from the past couple of months, a year, or maybe the past 2 years. Let’s assume we want to have the past 2 years of our Amazon Customer Reviews Dataset in a data warehouse to analyze year-over-year customer behavior and review trends. We will use Amazon Redshift as our data warehouse for this.

Amazon Redshift is a fully managed data warehouse which allows you to run complex analytic queries against petabytes of structured data. Your queries are distributed and parallelized across multiple nodes. In contrast to relational databases which are optimized to store data in rows and mostly serve transactional applications, Redshift implements columnar data storage which is optimized for analytical applications where you are mostly interested in the data within the individual columns.

Amazon Redshift also includes Amazon Redshift Spectrum, which allows you to directly execute SQL queries from Redshift against exabytes of unstructured data in your Amazon S3 data lake without the need to physically move the data. Amazon Redshift Spectrum automatically scales the compute resources needed based on how much data is being retrieved, so queries against Amazon S3 run fast, regardless of the size of your data. We will use Amazon Redshift Spectrum to access our data in S3, and then show you how you can combine data that is stored in Redshift with data that is still in S3.

This might sound similar to the approach we showed earlier with Amazon Athena, but note that in this case we show how your Business Intelligence team can enrich their queries with data that is not stored in the data warehouse itself.

Note

Before you can run the queries shown next, you need to have an Amazon Redshift cluster set up in your AWS account. Please refer to Appendix C - Setup Amazon Redshift for detailed instructions.

Query the Data Lake and Data Warehouse with Redshift Spectrum

Once you have your Redshift cluster set up and configured, you can navigate to Amazon Redshift in the AWS Console and click on the Query Editor to execute SQL commands, as shown in Figure 1-17.

Figure 1-17. AWS Console > Amazon Redshift > Query Editor.

We can leverage our previously created table in Amazon Athena with its metadata and schema information stored in the AWS Glue Data Catalog to access our data in S3 through Redshift Spectrum. All we need to do is create an external schema in Redshift, point it to our AWS Glue Data Catalog, and point Redshift to the database we’ve created.

In the Redshift Query Editor (or via any other ODBC/JDBC SQL client that you might prefer to use), execute the following command:

          CREATE EXTERNAL SCHEMA IF NOT EXISTS athena FROM DATA CATALOG
          DATABASE 'dsoaws'
          IAM_ROLE '<YOUR-IAM-ROLE>'
          CREATE EXTERNAL DATABASE IF NOT EXISTS

With this command, we are creating a new schema in Redshift called athena to highlight the data access we set up through our tables in Amazon Athena.

FROM DATA CATALOG indicates that the external database is defined in the Amazon Athena (AWS Glue) Data Catalog.

DATABASE refers to our previously created database in Athena.

IAM_ROLE needs to point to the Amazon Resource Name (ARN) of an IAM role that your cluster uses for authentication and authorization.

Note

IAM is AWS’s Identity and Access Management service which enables you to manage and control access to AWS services and resources in your account. With an IAM role you can specify the permissions a user or service is granted. In this example, the IAM role must have at a minimum permission to perform a LIST operation on the Amazon S3 bucket to be accessed and a GET operation on the Amazon S3 objects the bucket contains. If the external database is defined in an Amazon Athena data catalog, the IAM role must have permission to access Athena unless CATALOG_ROLE is specified. We will go into more details on IAM in a later section of this chapter when we discuss how you can secure your data.

If you now select athena in the Schema dropdown menu in the Redshift Query Editor view, as shown in Figure 1-18, you can see the two tables, amazon_reviews_tsv and amazon_reviews_parquet, which we created with Amazon Athena.

Figure 1-18. External schema athena shows our two tables previously created in Athena.

Let’s go ahead and run a sample query again to make sure everything works. In the Query Editor, run the following command:

          SELECT
            product_category,
          COUNT(star_rating) AS count_star_rating
          FROM
            athena.amazon_reviews_tsv
          GROUP BY
           product_category
          ORDER BY
           count_star_rating DESC

You should see a response similar to the one shown in Figure 1-19.

Figure 1-19. Redshift query response reading from Amazon S3.

So with just one command, we now have access and can query our S3 data lake from Amazon Redshift without moving any data into our data warehouse. This is the power of Redshift Spectrum.
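If you prefer to run this query programmatically rather than in the Query Editor, here is a hedged boto3 sketch that uses the Redshift Data API; the cluster identifier, database name, and database user are assumptions for illustration:

import time
import boto3

redshift_data = boto3.client("redshift-data")

response = redshift_data.execute_statement(
    ClusterIdentifier="dsoaws-redshift-cluster",   # assumed cluster name
    Database="dev",                                # assumed database
    DbUser="dsoaws",                               # assumed database user
    Sql="""
        SELECT product_category, COUNT(star_rating) AS count_star_rating
        FROM athena.amazon_reviews_tsv
        GROUP BY product_category
        ORDER BY count_star_rating DESC
    """,
)

# Wait for the statement to finish, then fetch the result rows.
while True:
    status = redshift_data.describe_statement(Id=response["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

for record in redshift_data.get_statement_result(Id=response["Id"])["Records"]:
    print(record)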

But now, let’s actually copy some data from S3 into Amazon Redshift. Let’s pull in customer reviews data from the year 2015.

First, create another Redshift schema called redshift with the following SQL command:

CREATE SCHEMA IF NOT EXISTS redshift

Next, we will create a new table that represents our customer reviews data. We will also add a new year column to the table.

 
CREATE TABLE IF NOT EXISTS redshift.amazon_reviews_tsv_2015(
    marketplace varchar(2) ENCODE zstd,
    customer_id varchar(8) ENCODE zstd,
    review_id varchar(14) ENCODE zstd,
    product_id varchar(10) ENCODE zstd DISTKEY,
    product_parent varchar(10) ENCODE zstd,
    product_title varchar(400) ENCODE zstd,
    product_category varchar(24) ENCODE raw,
    star_rating int ENCODE az64,
    helpful_votes int ENCODE zstd,
    total_votes int ENCODE zstd,
    vine varchar(1) ENCODE zstd,
    verified_purchase varchar(1) ENCODE zstd,
    review_headline varchar(128) ENCODE zstd,
    review_body varchar(65535) ENCODE zstd,
    review_date varchar(10) ENCODE bytedict,
    year int ENCODE az64)
SORTKEY (product_category)

In the performance section, we will dive deep into the SORTKEY, DISTKEY, and ENCODE attributes above. For now, let’s copy the data from S3 into our new Redshift table and run some sample queries.

For such bulk inserts, you can either use a COPY command, or an INSERT INTO command. In general, the COPY command is preferred, as it loads data in parallel and more efficiently from Amazon S3, or other supported data sources.

If you are loading data or a subset of data from one table into another, you can use the INSERT INTO command with a SELECT clause for high-performance data insertion. As we’re loading our data from the athena.amazon_reviews_tsv table, let’s choose this option.

To actually copy the data from S3 into our new Redshift table, we will run the following INSERT INTO command:

INSERT INTO redshift.amazon_reviews_tsv_2015
SELECT
    marketplace,
    customer_id,
    review_id,
    product_id,
    product_parent,
    product_title,
    product_category,
    star_rating,
    helpful_votes,
    total_votes,
    vine,
    verified_purchase,
    review_headline,
    review_body,
    review_date,
    CAST(DATE_PART_YEAR(TO_DATE(review_date, 'YYYY-MM-DD')) AS INTEGER) AS year
FROM athena.amazon_reviews_tsv
WHERE year = 2015

You might notice that we use a date conversion to parse the year out of our review_date column and store it in a separate year column, which we then use to filter all records from 2015. This is an example of how you can simplify ETL tasks by putting the data transformation logic directly into a SELECT query and ingesting the result into Redshift.

Another way to optimize our tables would be to create them as a sequence of time-series tables, especially when our data has a fixed retention period. Let’s say we want to store the data of the last 2 years (24 months) in our data warehouse and update it with new data once a month.

If you create one table per month, you can easily remove old data simply by running a DROP TABLE command on the corresponding table. This approach is much faster than running a large-scale DELETE process and also saves you from having to run a subsequent VACUUM process to reclaim space and re-sort the rows.

To combine query results across tables, we can use a UNION ALL view. Similarly, when we need to delete old data, we remove the dropped table from the UNION ALL view.

Here’s an example of a UNION ALL view across two tables with customer reviews from year 2014 and 2015 (assuming we have one table with data for 2014, and one with data for 2015):

SELECT
    product_category,
    COUNT(star_rating) AS count_star_rating,
    year
FROM redshift.amazon_reviews_tsv_2014
GROUP BY redshift.amazon_reviews_tsv_2014.product_category, year
UNION ALL
SELECT
    product_category,
    COUNT(star_rating) AS count_star_rating,
    year
FROM redshift.amazon_reviews_tsv_2015
GROUP BY redshift.amazon_reviews_tsv_2015.product_category, year
ORDER BY count_star_rating DESC, year ASC

The query result should look similar to this:

product_category count_star_rating year
Digital_Ebook_Purchase 6615914 2014
Digital_Ebook_Purchase 4533519 2015
Books 3472631 2014
Wireless 2998518 2015
Wireless 2830482 2014
Books 2808751 2015
Apparel 2369754 2015
Home 2172297 2015
Apparel 2122455 2014
Home 1999452 2014

Now, let’s actually run a query that combines data from Redshift with data that is still in S3. Let’s take the data from the previous query for the years 2015 and 2014, and query Athena/S3 for the years 1995 through 2013 by running this command:

SELECT
    year,
    product_category,
    COUNT(star_rating) AS count_star_rating
FROM redshift.amazon_reviews_tsv_2015
GROUP BY redshift.amazon_reviews_tsv_2015.product_category, year
UNION ALL
SELECT
    year,
    product_category,
    COUNT(star_rating) AS count_star_rating
FROM redshift.amazon_reviews_tsv_2014
GROUP BY redshift.amazon_reviews_tsv_2014.product_category, year
UNION ALL
SELECT
    CAST(DATE_PART_YEAR(TO_DATE(review_date, 'YYYY-MM-DD')) AS INTEGER) AS year,
    product_category,
    COUNT(star_rating) AS count_star_rating
FROM athena.amazon_reviews_tsv
WHERE year <= 2013
GROUP BY athena.amazon_reviews_tsv.product_category, year
ORDER BY product_category ASC, year DESC

The query result should look similar to this:

year product_category count_star_rating
2015 Apparel 4739508
2014 Apparel 4244910
2013 Apparel 854813
2012 Apparel 273694
2011 Apparel 109323
2010 Apparel 57332
2009 Apparel 42967
2008 Apparel 33761
2007 Apparel 25986
2006 Apparel 7293
2005 Apparel 3533
2004 Apparel 2357
2003 Apparel 2147
2002 Apparel 907
2001 Apparel 5
2000 Apparel 6
2015 Automotive 2609750
2014 Automotive 2350246

Export Redshift Data to S3 Data Lake as Parquet

Redshift Data Lake Export gives you the ability to unload the result of a Redshift query to your S3 data lake in the optimized Apache Parquet columnar file format. This enables you to share any data transformation and enrichment you have done in Redshift back into your S3 data lake in an open format.

You can specify one or more partition columns so that unloaded data is automatically partitioned into folders in your Amazon S3 bucket. For example, you can choose to unload our customer reviews data and partition it by product_category.

Simply run the following SQL command to unload our 2015 customer reviews data in Parquet file format into S3, partitioned by product_category:

         UNLOAD (
          'SELECT marketplace, customer_id, review_id, product_id, product_parent, 
          product_title, product_category, star_rating, helpful_votes, total_votes, 
          vine, verified_purchase, review_headline, review_body, review_date, year 
          FROM redshift.amazon_reviews_tsv_2015') 
         TO 's3://data-science-on-aws/amazon-reviews-pds/parquet-from-redshift/2015/' 
         IAM_ROLE '<IAM_ROLE>' 
         PARQUET PARALLEL ON 
         PARTITION BY (product_category)

You can use the AWS CLI tool again to list the S3 folder and see our unloaded data from 2015 in Parquet format:

aws s3 ls s3://data-science-on-aws/amazon-reviews-pds/parquet-from-redshift/2015/

This concludes the section on how to load data from our S3 data lake into our Redshift data warehouse, how to query data in Redshift and across both Redshift and S3, and how to export data from Redshift back to S3 in the optimized Parquet columnar file format.

You might ask yourself now, when should I use Athena, and when should I use Redshift? Let’s discuss.

Choosing Athena vs. Redshift

Amazon Athena should be your preferred choice when running ad-hoc SQL queries on data that is stored in Amazon S3. It doesn’t require you to set up or manage any infrastructure resources, and you don’t need to move any data. It supports structured, unstructured, and semi-structured data. With Athena, you are defining a “schema on read” -- you basically just log in, create a table and you are good to go.

Amazon Redshift is targeted at modern data analytics on large, petabyte-scale sets of structured data. Here, you need to have a predefined “schema on write”. Unlike serverless Athena, Redshift requires you to create a cluster (compute and storage resources), ingest the data, and build tables before you can start to query, but it caters to performance and scale. So for highly relational data with a transactional nature (data gets updated), workloads that involve complex joins, and sub-second latency requirements, Redshift is the right choice.

Athena and Redshift are optimized for read-heavy analytics workloads and are therefore not replacements for write-heavy, relational databases such as Amazon Relational Database Service (RDS) and Aurora. At a high level, use Athena for exploratory analytics and operational debugging; use Redshift for business-critical reports and dashboards.

Note

Some ultra-write-heavy workloads may require a “non-relational” or NoSQL database such as Amazon DynamoDB or Managed Apache Cassandra. These databases offer reduced referential integrity for increased performance. The referential integrity is usually managed at the application layer - perhaps as a “reaper” background process that periodically fixes or identifies dangling references. NoSQL databases also allow flexible schemas for an ever-changing business domain. Similar to referential integrity, schema validation usually occurs at the application level for NoSQL databases. These are the trade-offs you sometimes need to make when designing a data solution at scale.

Secure Your Dataset

In today’s world, keeping data secure and safe is a top priority. AWS is designed to help you build secure, high-performing, resilient, and efficient infrastructure for your applications. AWS implements the shared responsibility model, through which it provides a secure global infrastructure and foundational compute, storage, networking, and database services, as well as a range of security services that you can use to secure anything you build and run on top of these services.

Looking at S3, AWS provides a number of security controls and guidelines you can leverage when defining your own security policies:

  • Ensure that your Amazon S3 buckets use the correct policies and are not publicly accessible

  • Implement least privilege access

  • Use IAM roles for applications and AWS services that require Amazon S3 access

  • Enable multi-factor authentication (MFA) Delete

  • Consider encryption of data at rest

  • Enforce encryption of data in transit

  • Consider Amazon S3 Object Lock

  • Enable versioning

  • Consider Amazon S3 cross-region replication

  • Consider VPC endpoints and S3 Access Points for Amazon S3 access

Let’s briefly discuss how you can use AWS Identity and Access Management (IAM) roles, KMS encryption, and S3 Access Points in this context.

Authenticating and Authorizing with Identity and Access Management (IAM) Roles

AWS Identity and Access Management (IAM) is a service that helps you to manage access to AWS resources. IAM controls who is authenticated and authorized to use resources.

You can create individual IAM users for people accessing your AWS account. Each user will have unique security credentials. You can also assign IAM users to IAM groups with defined access permissions (i.e. for specific job functions) and the IAM users inherit those permissions.

The preferred way to delegate access permissions is via IAM roles. In contrast to an IAM user, which is uniquely associated with one person, a role can be assumed by anyone who needs it and provides temporary security credentials only for the duration of the role session. AWS Service Roles control which actions a service can perform on your behalf.

Access permissions are defined using IAM policies. It’s a standard security best practice to grant least privilege, in other words, to grant only the permissions required to perform a task.

By default, all Amazon S3 resources are private: only the resource owner, the AWS account that created them, can access the resource. The resource owner can optionally grant access permissions to others by writing an access policy.

Here’s a sample policy that shows how you can grant a user additional access (add, update, and delete objects) to our sample S3 bucket called data-science-on-aws by attaching this policy to the corresponding IAM role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets"
      ],
      "Resource": "arn:aws:s3:::*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::data-science-on-aws"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::data-science-on-aws/*"
    }
  ]
}
        

In addition to users, you also need to manage the AWS Service Role privileges. Please see Appendix C with more details on which privileges Amazon Athena and Amazon Redshift require, and how you can customize them.
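If you manage these permissions programmatically, a hedged boto3 sketch of attaching an inline S3 access policy (a condensed version of the policy above) to a role might look like this; the role and policy names are hypothetical:

import json
import boto3

iam = boto3.client("iam")

s3_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::data-science-on-aws",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::data-science-on-aws/*",
        },
    ],
}

# Attach the policy inline to a (hypothetical) IAM role.
iam.put_role_policy(
    RoleName="DataScienceOnAWSRole",
    PolicyName="DataScienceOnAWSS3Access",
    PolicyDocument=json.dumps(s3_access_policy),
)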

Storing Keys with Key Management Service

AWS Key Management Service (KMS) is a managed service that enables you to easily create and control the keys used for cryptographic operations. There are two ways to use AWS KMS with Amazon S3 to implement data-at-rest encryption. You can use server-side encryption to protect your data with a master key or you can use an AWS KMS customer master key (CMK) with the Amazon S3 Encryption Client to protect your data on the client side.

If you select Server-Side Encryption, you can choose between the following options:

SSE-S3

Requires that Amazon S3 manage the data and master encryption keys

SSE-C

Requires that you manage the encryption key.

SSE-KMS

Requires that AWS manage the data key but you manage the customer master key (CMK) in AWS KMS.

To require server-side encryption of all objects in a particular Amazon S3 bucket (enforcing data-at-rest encryption), use a bucket policy. For example, the following bucket policy denies upload object (s3:PutObject) permission to everyone if the request does not include the x-amz-server-side-encryption header requesting server-side encryption with SSE-KMS:

{
  "Version": "2012-10-17",
  "Id": "PutObjPolicy",
  "Statement": [
    {
      "Sid": "DenyUnEncryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::data-science-on-aws/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    }
  ]
}
        

When you upload an object, you can specify the AWS KMS CMK using the x-amz-server-side-encryption-aws-kms-key-id header. If the header is not present in the request, Amazon S3 assumes the AWS managed CMK.
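From Python, a minimal sketch of uploading an object with SSE-KMS might look like this; the KMS key ID is a placeholder, and you can omit SSEKMSKeyId to fall back to the AWS managed CMK for S3:

import boto3

s3 = boto3.client("s3")

with open("sample_us.tsv", "rb") as f:
    s3.put_object(
        Bucket="data-science-on-aws",
        Key="encrypted/sample_us.tsv",
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="1234abcd-12ab-34cd-56ef-1234567890ab",  # placeholder CMK ID
    )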

Figures 1-20 and 1-21 show how to enable data-at-rest encryption with default encryption at bucket creation time via the AWS Console, and the two default options of server-side encryption, SSE-S3 and SSE-KMS.

Figure 1-20. AWS Console > S3 > Create Bucket > Default Encryption > Use SSE-S3.
Figure 1-21. AWS Console > S3 > Create Bucket > Default Encryption > Use SSE-KMS

Securing Buckets with S3 Access Points

Amazon S3 Access Points simplify access control for large, shared buckets such as data lakes. Up until recently, you accessed your S3 buckets through a unique bucket hostname and defined access control with a combination of IAM policies and a single bucket policy. You can imagine that for shared datasets with a growing number of users, teams, and applications needing access, this could quickly become a complex environment to maintain.

Amazon S3 Access Points now simplify managing data access by providing customized paths into a bucket, each with a unique hostname and an IAM access policy that enforces the specific permissions and network controls for any request made through the access point. This is particularly useful if you need to manage access to shared datasets.

You can also require that all access points be restricted to a Virtual Private Cloud (VPC), providing an extra level of security by basically firewalling your data to within your private networks.

Let’s assume we have our sample S3 bucket called data-science-on-aws with prefixes (subfolders) called feature-store and data-warehouse. Our data science team needs read/write access to the feature store data, and our Business Intelligence team needs read access to the data-warehouse data stored in that bucket.

Figure 1-22 shows how that scenario would look without the use of S3 Access Points.

Figure 1-22. Accessing objects in Amazon S3 previously.

A single S3 bucket policy might have looked like this:

          "Sid":”PrefixBasedAccessDataScience",
          "Effect":"Allow",
          "Principal":{"AWS":”arn:aws:iam::123456789012:group/ds},
          "Action":["s3:GetObject","s3:PutObject"],
          "Resource":"arn:aws:s3:::data-science-on-aws/feature-store/*"
          ...
          "Sid":”TagBasedAccessBusinessIntelligence",
          "Effect":"Allow",
          "Principal":{"AWS":”arn:aws:iam::123456789012:group/bi},
          "Action":["s3:GetObject"],
          "Resource":"arn:aws:s3:::data-science-on-aws/data-warehouse/*”
          ...
        

Now let’s see how we can simplify this with the use of S3 Access Points. The following sample commands show how to create access points called ap1-ds and ap2-bi on our sample bucket data-science-on-aws via the AWS CLI:

aws s3control create-access-point --name ap1-ds --account-id 123456789012 --bucket data-science-on-aws

aws s3control create-access-point --name ap2-bi --account-id 123456789012 --bucket data-science-on-aws
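The same access points can also be created programmatically; here is a minimal boto3 sketch using the s3control API (the account ID is the placeholder from the example):

import boto3

s3control = boto3.client("s3control")

# Create one access point for the data science team and one for the BI team.
for name in ("ap1-ds", "ap2-bi"):
    s3control.create_access_point(
        AccountId="123456789012",      # placeholder account ID from the example
        Name=name,
        Bucket="data-science-on-aws",
    )

The access point policies shown next can then be attached with the put_access_point_policy() call of the same client.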

In an access point policy we then grant the IAM group for our Data Scientist team (‘ds’) in account 123456789012 permissions to GET and PUT objects with the prefix feature-store/ through access point ap1-ds, and the IAM group for our Business Intelligence team (‘bi’) permissions to GET objects with the prefix data-warehouse/ through access point ap2-bi:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:group/ds"
      },
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:us-east-1:123456789012:accesspoint/ap1-ds/object/feature-store/*"
    }
  ]
}

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:group/bi"
      },
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:us-east-1:123456789012:accesspoint/ap2-bi/object/data-warehouse/*"
    }
  ]
}
        

Figure 1-23 shows how you can now manage access to your S3 objects with S3 Access Points.

Figure 1-23. Accessing objects in Amazon S3 with Access Points.

An AWS CLI request to an object in that bucket through the S3 Access Point would then look like this (assuming we are in the us-east-1 region and have the access permissions):

aws s3api get-object --bucket arn:aws:s3:us-east-1:123456789012:accesspoint/ap1-ds --key feature-store/raw/sample_us.tsv sample_us.tsv

You can also access the objects in an Amazon S3 bucket with an access point using the AWS Management Console, AWS SDKs, or the S3 REST APIs. For an application or user to be able to access objects through an access point, both the access point and the underlying bucket must permit the request.

Increase Performance and Reduce Cost

In this section, we want to provide some tips and tricks to increase performance and reduce cost during data ingestion, including file formats, partitions, compression, and sort/distribution keys.

We will also demonstrate how to use Amazon S3 Intelligent-Tiering to lower your storage bill.
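As a quick preview, here is a hedged boto3 sketch of one way to move objects under a prefix into the S3 Intelligent-Tiering storage class using a lifecycle rule; the bucket name, prefix, and transition timing are assumptions to adapt to your environment:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="data-science-on-aws",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-amazon-reviews",
            "Status": "Enabled",
            "Filter": {"Prefix": "amazon-reviews-pds/"},
            # Transition objects to Intelligent-Tiering right away (Days=0).
            "Transitions": [{
                "Days": 0,
                "StorageClass": "INTELLIGENT_TIERING",
            }],
        }]
    },
)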

Parquet Partitions and Compression

Athena supports the Parquet columnar format for large-scale analytics workloads. Parquet enables the following performance optimizations for your queries:

Partitions and Pushdowns

Partitions are physical groupings of data on disk that match your query patterns (e.g., SELECT * FROM reviews WHERE product_category='Books'). Modern query engines like Athena, Redshift, and Apache Spark will "push down" the WHERE clause into the physical storage system, allowing the disk controller to seek once and read all relevant data in one scan without randomly skipping to different areas of the disk. This improves query performance even with solid state drives (SSDs), which have a lower seek time than traditional, spinning-media disks.

Dictionary encoding/compression

When a small number of categorical values are stored together on disk (e.g., product_category, which has 43 distinct values in our dataset), the values can be compressed into a small number of bits representing each value (e.g., Books, Lawn_and_Garden, Software, etc.) instead of storing the entire string.

Type compression

When values of a similar type (e.g., string, date, integer) are stored together on disk, the values can be compressed together: (String, String), (Date, Date), (Integer, Integer). This compression is more efficient than if the values were stored separately on disk in a row-wise manner: (String, Date, Integer), (String, Date, Integer).

Vectorized aggregations

Because column values are stored together on disk, the disk controller needs to perform only one disk seek to find the beginning of the data. From that point, it scans the data to perform the aggregation. Additionally, modern processors offer high-performance vectorized (SIMD) instructions that perform calculations on large blocks of data at once, rather than repeatedly flushing data in and out of the CPU caches (L1, L2) and main memory.

Figure 3-24 shows an example of row-based vs. columnar storage.

Figure 1-24. Row Storage (CSV/TSV File) vs. Columnar Storage (Parquet).
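To make the partitioning and compression ideas concrete, here is a small sketch that writes reviews data to S3 as Snappy-compressed Parquet partitioned by product_category; the bucket name and toy DataFrame are placeholders, and it assumes the pyarrow and s3fs packages are installed:

    import pandas as pd

    # A tiny stand-in DataFrame for the Amazon Customer Reviews data; in
    # practice you would load the TSV files shown earlier in this chapter.
    df = pd.DataFrame({
        "review_id": ["R1", "R2", "R3"],
        "product_category": ["Books", "Software", "Books"],
        "star_rating": [5, 3, 4],
    })

    # Write Snappy-compressed Parquet, partitioned by product_category.
    # Each category becomes its own S3 prefix (.../product_category=Books/),
    # so a query filtering on product_category only scans matching partitions.
    df.to_parquet(
        "s3://<your-bucket>/amazon-reviews-parquet/",  # placeholder bucket
        engine="pyarrow",
        compression="snappy",
        partition_cols=["product_category"],
    )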

Redshift Table Design and Compression

Here is the CREATE TABLE script we used to create the Redshift tables:

          CREATE TABLE IF NOT EXISTS redshift.amazon_reviews_tsv_2015( 
           marketplace varchar(2) ENCODE zstd,
           customer_id varchar(8) ENCODE zstd,
           review_id varchar(14) ENCODE zstd,
           product_id varchar(10) ENCODE zstd DISTKEY,
           product_parent varchar(9) ENCODE zstd,
           product_title varchar(400) ENCODE zstd,
           product_category varchar(24) ENCODE raw,
           star_rating int ENCODE az64,
           helpful_votes int ENCODE zstd,
           total_votes int ENCODE zstd,
           vine varchar(1) ENCODE zstd,
           verified_purchase varchar(1) ENCODE zstd,
           review_headline varchar(128) ENCODE zstd,
           review_body varchar(65535) ENCODE zstd,
           review_date varchar(10) ENCODE bytedict,
           year int ENCODE az64) SORTKEY (product_category)
         

When you create a table, you can specify one or more columns as the SORT KEY. Amazon Redshift stores your data on disk in sorted order according to the SORT KEY. This means you can optimize your table by choosing a SORT KEY that reflects your most frequently used query patterns. If you query a lot of recent data, you can specify a timestamp column as the sort key. If you frequently filter on ranges or equality conditions on one column, you should choose that column as the sort key. As we are going to run a lot of queries in the next chapter that filter on product_category, let’s choose that column as our sort key.

Note

Redshift Advisor continuously recommends SORT KEYs for frequently queried tables. Advisor generates an ALTER TABLE command that you can run without having to re-create the table, and without impacting concurrent read and write queries. Note that Advisor does not provide recommendations if it hasn’t seen enough queries or if the expected benefit is relatively small.
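To illustrate, the following sketch submits a hypothetical ALTER TABLE of this kind through the Amazon Redshift Data API (boto3’s redshift-data client). The cluster identifier, database, user, and the suggested sort key columns are all placeholders; use the exact statement Advisor generates for your table.

    import boto3

    redshift_data = boto3.client("redshift-data", region_name="us-east-1")

    # Hypothetical Advisor-style recommendation: change the sort key without
    # re-creating the table. Cluster, database, and user are placeholders.
    response = redshift_data.execute_statement(
        ClusterIdentifier="<your-cluster>",
        Database="<your-database>",
        DbUser="<your-db-user>",
        Sql="ALTER TABLE redshift.amazon_reviews_tsv_2015 "
            "ALTER SORTKEY (product_category, review_date)",
    )
    print(response["Id"])  # statement ID; poll it with describe_statement()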

You can also define a distribution style for every table. When you load data into a table, Redshift distributes the rows of the table among your cluster nodes according to the table’s distribution style. When you run a query, the query optimizer redistributes the rows to the cluster nodes as needed to perform any joins and aggregations. So our goal should be to optimize the row distribution to minimize data movement. There are three distribution styles to choose from:

KEY distribution

Distribute the rows according to the values in one column

ALL distribution

Distribute a copy of the entire table to every node

EVEN distribution

The rows are distributed across all nodes in a round-robin fashion; this is the default distribution style

For our table, we’ve chosen KEY distribution based on product_id, as this column has high cardinality, is evenly distributed, and can be used to join with other tables.

Note

At any time, you can run EXPLAIN on your queries to make sure the DIST KEY and SORT KEY are actually being used. If your query patterns change over time, you may want to revisit these keys.
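For example, here is a minimal sketch of running EXPLAIN through the same Redshift Data API (cluster, database, and user are again placeholders; any SQL client connected to the cluster works just as well):

    import time
    import boto3

    redshift_data = boto3.client("redshift-data", region_name="us-east-1")

    # EXPLAIN a typical filtered query to verify that the SORT KEY
    # (product_category) and DIST KEY (product_id) appear in the plan.
    stmt = redshift_data.execute_statement(
        ClusterIdentifier="<your-cluster>",
        Database="<your-database>",
        DbUser="<your-db-user>",
        Sql="EXPLAIN SELECT product_id, AVG(star_rating) "
            "FROM redshift.amazon_reviews_tsv_2015 "
            "WHERE product_category = 'Books' "
            "GROUP BY product_id",
    )

    # The Data API is asynchronous, so poll until the statement completes.
    while True:
        status = redshift_data.describe_statement(Id=stmt["Id"])["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)

    if status == "FINISHED":
        result = redshift_data.get_statement_result(Id=stmt["Id"])
        for record in result["Records"]:
            print(record[0]["stringValue"])  # one line of the query plan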

In addition, we are using compression for most columns to reduce the overall storage footprint and reduce our cost. Table 3-1 analyzes the compression used for each Redshift column in our schema.

Table 1-1. Compression used for each Redshift column in our schema.
Column            | Data Type      | Encoding | Explanation
marketplace       | varchar(2)     | zstd     | Low cardinality, too small to incur higher compression overhead
customer_id       | varchar(8)     | zstd     | High cardinality, relatively few repeat values
review_id         | varchar(14)    | zstd     | Unique, unbounded cardinality, no repeat values
product_id        | varchar(10)    | zstd     | Unbounded cardinality, relatively low number of repeat values
product_parent    | varchar(10)    | zstd     | Unbounded cardinality, relatively low number of repeat values
product_title     | varchar(400)   | zstd     | Unbounded cardinality, relatively low number of repeat words
product_category  | varchar(24)    | raw      | Low cardinality, many repeat values, but left uncompressed (raw) because it is our first SORT KEY
star_rating       | int            | az64     | Low cardinality, many repeat values
helpful_votes     | int            | zstd     | Relatively high cardinality
total_votes       | int            | zstd     | Relatively high cardinality
vine              | varchar(1)     | zstd     | Low cardinality, too small to incur higher compression overhead
verified_purchase | varchar(1)     | zstd     | Low cardinality, too small to incur higher compression overhead
review_headline   | varchar(128)   | zstd     | Varying-length text, high cardinality, few repeat words
review_body       | varchar(65535) | zstd     | Varying-length text, high cardinality, few repeat words
review_date       | varchar(10)    | bytedict | Fixed length, relatively low cardinality, many repeat values
year              | int            | az64     | Low cardinality, many repeat values
Note

While our CEO, Andy Jassy, maintains that “there is no compression algorithm for experience,” there is a compression algorithm for data. Compression is a powerful tool for the ever-growing world of big data. All modern big data processing tools are compression-friendly, including Amazon Athena, Redshift, Parquet, Pandas, and Apache Spark. Using compression on very small values such as varchar(1) may not improve performance. However, thanks to native hardware support, there are almost no drawbacks to using compression.

zstd is a generic compression algorithm that works well across many different data types and column sizes. The star_rating and year fields use az64, the default encoding applied to most numeric and date fields. For the remaining columns, we gain a quick win by keeping the default az64 encoding for integers and overriding the default lzo encoding in favor of the more flexible zstd encoding for everything else, including text.

We are using bytedict for review_date to perform dictionary encoding on the string-based dates (YYYY-MM-DD). While review_date seemingly has a large number of unique values, it actually contains relatively few: there are only ~7,300 days (365 days per year * 20 years) in a 20-year span. This cardinality is low enough to represent each possible date in just a few bits instead of a full varchar(10).

While product_category is a great candidate for bytedict dictionary encoding, product_category is our first (and only, in this case) SORTKEY. As a performance best practice, the first SORTKEY should not be compressed.

While marketplace, vine, and verified_purchase also seem to be good candidates for bytedict, they are too small to benefit from the extra overhead of dictionary encoding. For now, we leave them as zstd.

If you have an existing Redshift table that you would like to optimize, you can run the ANALYZE COMPRESSION command in Redshift to generate a report of suggested compression encodings, as follows:

ANALYZE COMPRESSION redshift.customer_reviews_tsv_2015

The result is a table like the following, showing the estimated percentage reduction in storage if you switch to the suggested encoding:

Column            | Encoding | Estimated Reduction (%)
marketplace       | zstd     | 90.84
customer_id       | zstd     | 38.88
review_id         | zstd     | 36.56
product_id        | zstd     | 44.15
product_parent    | zstd     | 44.03
product_title     | zstd     | 30.72
product_category  | zstd     | 99.95
star_rating       | az64     | 0
helpful_votes     | zstd     | 47.58
total_votes       | zstd     | 39.75
vine              | zstd     | 85.03
verified_purchase | zstd     | 73.09
review_headline   | zstd     | 30.55
review_body       | zstd     | 32.19
review_date       | bytedict | 64.1
year              | az64     | 0
Note

We performed the preceding analysis on a version of the CREATE TABLE that did not specify any ENCODE attributes. By default, Redshift uses az64 for numeric and date columns and lzo for everything else (hence the 0% gain for the az64 suggestions above).

Keep in mind that these are just suggestions and not always appropriate for your specific environment. You should try different encodings for your dataset and query the STV_BLOCKLIST table to compare the reduction in the physical number of blocks. For example, the analyzer recommends zstd for our SORT KEY, product_category, but our experience shows that query performance suffers when we compress the SORT KEY. We are trading a bit of extra disk space for better query performance.
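As a rough sketch of that comparison, the following query counts the 1 MB blocks each column occupies and could be submitted through the Data API shown earlier (STV_BLOCKLIST is visible only to superusers; cluster, database, and user remain placeholders):

    import boto3

    redshift_data = boto3.client("redshift-data", region_name="us-east-1")

    # Count how many 1 MB blocks each column of the table occupies on disk
    # (b.col is the column's ordinal position). Run this before and after an
    # encoding change and compare the totals per column.
    blocks_sql = """
        SELECT b.col, COUNT(*) AS blocks
        FROM stv_blocklist b
        JOIN svv_table_info t ON b.tbl = t.table_id
        WHERE t."table" = 'amazon_reviews_tsv_2015'
        GROUP BY b.col
        ORDER BY b.col
    """

    stmt = redshift_data.execute_statement(
        ClusterIdentifier="<your-cluster>",
        Database="<your-database>",
        DbUser="<your-db-user>",
        Sql=blocks_sql,
    )
    # Fetch the rows with describe_statement()/get_statement_result() as shown earlier.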

S3 Intelligent Tiering

We introduced Amazon Simple Storage Service (Amazon S3) in this chapter as a scalable, durable storage service for building shared datasets such as data lakes in the cloud. And while we keep the S3 usage fairly simple in this book, the service actually offers you a variety of options to optimize your storage cost as your data grows.

Depending on your data’s access frequency patterns and SLA needs, you can choose from various Amazon S3 storage classes. At the time of writing, these include S3 Standard for general-purpose storage of frequently accessed data (which is the default storage class); S3 Intelligent-Tiering for data with unknown or changing access patterns; S3 Standard-Infrequent Access (S3 Standard-IA) and S3 One Zone-Infrequent Access (S3 One Zone-IA) for long-lived, but less frequently accessed data; and Amazon S3 Glacier (S3 Glacier) and Amazon S3 Glacier Deep Archive (S3 Glacier Deep Archive) for long-term digital archival.

Figure 3-25 compares the Amazon S3 storage classes.

Figure 1-25. Amazon S3 storage classes.
Note

Amazon S3 Pricing might have changed since the time of writing. Please refer to https://aws.amazon.com/s3/pricing/ for up-to-date information.

But how do you know which objects to move? Imagine your S3 data lake has grown over time, and you now have billions of objects across several S3 buckets in the S3 Standard storage class. Some of those objects are extremely important, while you might not have accessed others in months or even years. This is where S3 Intelligent-Tiering comes into play.

Amazon S3 Intelligent-Tiering automatically optimizes your storage cost for data with changing access patterns by moving objects between a frequent-access tier optimized for frequently used data and a lower-cost infrequent-access tier optimized for rarely accessed data. Intelligent-Tiering monitors your access patterns and tiers objects at a granular, per-object level, without performance impact or operational overhead.
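As a minimal sketch (with a placeholder bucket name), you can opt data into Intelligent-Tiering either by setting the storage class at upload time or by adding a lifecycle rule for existing prefixes:

    import boto3

    s3 = boto3.client("s3")

    # Option 1: upload new objects directly into the Intelligent-Tiering
    # storage class (bucket name is a placeholder; sample_us.tsv is the
    # local file downloaded earlier in this chapter).
    with open("sample_us.tsv", "rb") as data:
        s3.put_object(
            Bucket="<your-bucket>",
            Key="feature-store/raw/sample_us.tsv",
            Body=data,
            StorageClass="INTELLIGENT_TIERING",
        )

    # Option 2: add a lifecycle rule that transitions objects under a given
    # prefix into Intelligent-Tiering shortly after they are created.
    s3.put_bucket_lifecycle_configuration(
        Bucket="<your-bucket>",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "intelligent-tiering-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "feature-store/raw/"},
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }]
        },
    )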

Summary

In this chapter, we provided an overview of how you can load your data into Amazon S3, discussed the value of an S3 data lake, and showed how you can leverage services like Amazon Athena to run ad hoc SQL queries across the data in S3 without physically moving the data. We also introduced our dataset, the Amazon Customer Reviews Dataset, which we will be using throughout the rest of this book.

As different use cases require data in different formats, we elaborated on how you can use Athena to convert tab-separated data into query-optimized, columnar Parquet data.

As data in your S3 data lake often needs to be accessed not only by the Data Science / Machine Learning teams, but also by your Business Intelligence teams, we introduced you to Amazon Redshift, AWS’s petabyte-scale cloud data warehouse. We transformed and enriched the data in Redshift, and showed how you can use Redshift Spectrum to combine queries across the data you keep in Redshift and the data you keep in your S3 data lake.

We highlighted some important concepts you should leverage to secure your data. As part of this discussion, we touched on AWS Identity and Access Management (IAM) roles, implementing data-at-rest encryption with AWS Key Management Service (KMS), and how S3 Access Points can help manage access to data in shared S3 buckets (i.e., your S3 data lake).

To conclude this chapter, we discussed the various data compression options and how they can improve query performance, as well as how you can reduce your AWS costs with Amazon S3 Intelligent-Tiering.

In the next chapter we will explore the dataset in more detail. We will run queries to understand our data better, and visualize the results. We will also show you how you can detect data anomalies with Apache Spark and Amazon SageMaker Processing Jobs.
