Preface

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL, without needing to manage any infrastructure.

This book begins with an overview of the serverless analytics experience offered by Athena and teaches you how to build and tune an S3 data lake using Athena, including how to structure your tables using open source file formats such as Parquet. You'll learn how to build, secure, and connect to a data lake with Athena and Lake Formation. Next, you'll cover key tasks such as ad hoc data analysis, working with ETL pipelines, monitoring and alerting on KPI breaches using CloudWatch metrics, running customizable connectors with AWS Lambda, and more. Moving ahead, you'll work through easy integrations, troubleshooting and performance tuning, and the most common reasons for query failure, along with tips for diagnosing and correcting failing queries in your pursuit of operational excellence. Finally, you'll explore advanced concepts such as Athena Query Federation and Athena ML to generate powerful insights without needing to touch a single server.

By the end of this book, you'll be able to build and use a data lake with Amazon Athena to add data-driven features to your app and perform the kind of ad hoc data analysis that often precedes many of today's ML modeling exercises.

Who this book is for

BI analysts, application developers, and system administrators who are looking to generate insights from an ever-growing sea of data while controlling costs and limiting operational burdens will find this book helpful. Basic SQL knowledge is needed to get the most out of this book.

What this book covers

Chapter 1, Your First Query, is all about orienting you to the serverless analytics experience offered by Amazon Athena. For now, we will simplify things in order to run your first queries and demonstrate why so many people choose Amazon Athena for their workloads. This will help establish your mental model for the deeper discussions, features, and examples of later sections.

Chapter 2, Introduction to Amazon Athena, continues your introduction to Athena by discussing the service's capabilities, scalability, and pricing. You'll learn when to use Amazon Athena and how to estimate the performance and costs of your workloads before building them on Athena. We'll also take a look behind the scenes to see how Athena uses PrestoDB, an open source SQL engine from Facebook, to process your queries.

Chapter 3, Key Features, Query Types, and Functions, concludes our introduction to Amazon Athena by exploring built-in features you can use to make your reports or application more powerful. This includes approximate query techniques to speed up analysis of large datasets and Create Table As Select (CTAS) statements for running queries that generate significant amounts of result data.

Chapter 4, Metastores, Data Sources, and Data Lakes, teaches you what a metastore is and what it contains. We will introduce the Apache Hive and AWS Glue Data Catalog implementations of a metastore. You'll then learn how to create tables through Athena or discover datasets in S3 using AWS Glue crawlers. Finally, we focus on a typical data lake architecture, which contains three different stages for data.

Chapter 5, Securing Your Data, covers the various methods that can be employed to secure your data and ensure it can only be viewed by those who have permission to do so.

Chapter 6, AWS Glue and AWS Lake Formation, demonstrates step by step how to build a secure data lake in Lake Formation and how Athena interacts with Lake Formation to keep data safe.

Chapter 7, Ad Hoc Analytics, focuses on how you can use Athena to quickly get to know your data, look for patterns, find outliers, and generally surface insights that will help you get the most from your data.

Chapter 8, Querying Unstructured and Semi-Structured Data, shows how Amazon Athena combines a traditional query engine, and its requirement for an upfront schema, with extensions that allow it to handle data with a varying schema or no schema at all.

Chapter 9, Serverless ETL Pipelines, continues with the theme of controlling chaos by using automation to normalize newly arrived data through a process known as extract, transform, load (ETL).

Chapter 10, Building Applications with Amazon Athena, tells you what to do when integrating Amazon Athena into your applications. How will the application make Athena calls? How should credentials be stored? Should you use JDBC, ODBC, or Athena's SDK? What are the best practices and security considerations for setting up connectivity between your application and Athena? Lastly, what is the best way to store your data on S3 to optimize speed and cost? This chapter will answer all these questions and give examples, including working code, to get you started integrating with Athena quickly, easily, and securely.

Chapter 11, Operational Excellence – Maintenance, Optimization, and Troubleshooting, focuses on operational excellence by looking at what could go wrong when using Athena in a production environment. We'll learn how to monitor and alert on KPI breaches, such as queue dwell times, using CloudWatch metrics so you can avoid surprises. You'll also see how to optimize your data and queries to avoid problems before they happen. We'll then look at how the layout of data stored in S3 can have a significant impact on both cost and performance. Lastly, we will look at the most common reasons for query failure and review tips to help diagnose and correct failing queries.

Chapter 12, Athena Query Federation, is all about getting the most out of Amazon Athena by using Athena's Query Federation capabilities to expand beyond queries over data in S3. We will illustrate how Query Federation allows you to combine data from multiple sources (for example, S3 and Elasticsearch) to provide a single source of truth for your queries. Then we will look under the hood and explain how Amazon Athena uses AWS Lambda to run customizable connectors. We will even write our own connector in order to show you how easy it is to customize Athena with your own code.

Chapter 13, Athena UDFs and ML, continues the theme of enhancing Amazon Athena with our own functionality by adding our own user-defined functions and machine learning models. These capabilities allow us to do everything from applying ML inference to identify suspicious records in our dataset to converting port numbers in a VPC flow log to the common name for that port (for example, HTTP). In all of these examples, we add our own logic to Athena's row-level processing without the need to run any servers of our own.

Chapter 14, Lake Formation – Advanced Topics, covers some of the advanced features that Lake Formation brings to the table, and explores various use cases that are enabled by these features.

To get the most out of this book

To work with the technologies in this book, you will need a computer with a Chrome, Safari, or Microsoft Edge browser and AWS CLI version 2 installed.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Please ensure that you shut down any outstanding AWS resources after you are done working with them so that you don't incur unnecessary expenses.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Serverless-Analytics-with-Amazon-Athena. Any updates to the code will be made in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781800562349_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We simply specify a SYSTEM_TIME that Athena will use to set the read point in the transaction log."

A block of code is set as follows:

try:
    sink.writeFrame(new_and_updated_impressions_dataframe)
    glueContext.commit_transaction(txid1)
except:
    glueContext.abort_transaction(txid1)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

"inventory_id","item_name","available_count"
"1","A simple widget","5"
"2","A more advanced widget","10"
"3","The most advanced widget","1"
"4","A premium widget","0"
"5","A gold plated widget","9"

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Administrators can set a workgroup to encrypt query results. In the workgroup settings, set query results to be encrypted using SSE-KMS, CSE-KMS, or SSE-S3 and check the Override client-side settings."

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Serverless Analytics with Amazon Athena, we'd love to hear your thoughts! Please visit the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
