Chapter 14: Lake Formation – Advanced Topics

You've reached the final chapter in our journey through Serverless Analytics with Amazon Athena. Some authors like to start each chapter with a thought-provoking quote. The pressure to find good, relevant quotes from well-known people was too much for us, so we opted not to employ that pattern until now. I recently came across a quote from Stephen King that does a great job distilling this chapter:

"Sooner or later, everything old is new."

– Stephen King

You see, many of Lake Formation's "new" features are a reimagining of well-known database technologies from the 70s and 80s but scaled up for modern data lakes. As a Lake Formation launch partner, Athena is often one of the first services to support new Lake Formation features. In this chapter, we will learn about Lake Formation's newest features, including row-level security and a new Amazon S3 table type that supports ACID transactions. AWS Lake Formation transactions provide for atomic, consistent, isolated, and durable queries via snapshot isolation, regardless of how many tables your query uses or how many concurrent queries you run. To complement this new table type, Lake Formation also introduced an automatic storage optimizer that continually monitors your tables and reorganizes the underlying storage for optimal performance.

Each of these features has been available in most traditional database systems for decades. However, these capabilities frequently reduced performance or scalability. The early days of data lakes and their accompanying query engines, such as Athena, shed many of these auxiliary features in the name of scaling. As these systems and their usage evolved beyond solving the scaling problems of traditional databases, the need for advanced features such as ACID transactions and row-level security has reemerged.

As many of these features are not generally available yet and should "just work" for your existing queries by toggling a setting, this chapter will focus less on exercises and more on what use cases these capabilities enable. Depending on your AWS Region of choice, these features may not be available to you yet or may still be in preview. Lastly, you may be wondering why we have repeatedly discussed Lake Formation in a book about Athena. Regardless of the analytics engine you choose, AWS looks to Lake Formation as the tide that raises all ships. Put another way, Lake Formation is increasingly where new and foundational data lake features are being built so that customers can seamlessly transition between any of the AWS analytics offerings that support Lake Formation.

In this chapter, we will cover the following topics:

  • Reinforcing your data perimeter with Lake Formation
  • Understanding the benefits of governed tables

Reinforcing your data perimeter with Lake Formation

We were first introduced to AWS Lake Formation in Chapter 3, Key Features, Query Types, and Functions, where we explored Lake Formation's ability to go beyond S3 object-level IAM policies to offer fine-grained access control for tables. While security is a focal point for the Lake Formation product, you may not realize that its ambitions extend far beyond this one essential element of data lakes. As we will see later in this chapter, Lake Formation's mandate is to make every aspect of building and managing data lakes simpler, faster, and cheaper. This has led the Lake Formation team to focus on the most frustrating parts of operating a data lake, such as access control.

Before discussing the most significant changes to Lake Formation since it went GA in 2019, let's make sure we genuinely understand how things worked before these new features. The following diagram illustrates the high-level interactions between Athena, Lake Formation, Glue Data Catalog, and S3 during the execution of a simple query:

Figure 14.1 – Lake Formation

As with all Athena queries, the process begins with Athena's engine parsing the query and forming a logical plan. This logical plan contains a list of tables and columns that need to be read and a sequence of operators to apply to the resulting data. During the planning process, Athena calls Lake Formation to obtain policy metadata for each referenced table. This metadata, along with column projections from the query, is used to enforce access control. Assuming the access check passes, Athena moves on to forming a physical query plan, where it gathers partitioning information for each table from Glue Data Catalog. Before starting the actual query execution, Athena needs to call Lake Formation to obtain scoped-down temporary credentials for reading the required S3 objects. The Lake Formation API calls to get temporary credentials are the second place where an access enforcement check occurs. At this point, Athena has everything it needs to execute the query.
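To make the credential-vending step more concrete, the following boto3 sketch shows roughly how an engine might request scoped-down credentials from Lake Formation. Athena performs this internally on your behalf; the table ARN is a made-up example, and the request and response shapes shown are our best approximation of the Lake Formation credential-vending API, so treat the details as assumptions:

import boto3

lakeformation = boto3.client("lakeformation")

# Request temporary credentials scoped down to the S3 objects backing one
# table. The ARN below is an illustrative example value.
response = lakeformation.get_temporary_glue_table_credentials(
    TableArn="arn:aws:glue:us-east-1:111122223333:table/ads/impressions",
    Permissions=["SELECT"],
    DurationSeconds=900)

# The vended credentials can only read the objects Lake Formation allows.
s3 = boto3.client("s3",
    aws_access_key_id=response["AccessKeyId"],
    aws_secret_access_key=response["SecretAccessKey"],
    aws_session_token=response["SessionToken"])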

Much of the control flow shown in the preceding diagram is unsurprising, but a couple of nuances may have snuck by if you weren't looking closely. Firstly, Lake Formation is involved in both temporary credential vending and metadata operations, such as getting the list of columns in a table. The initial iteration of Lake Formation's fine-grained access control mechanisms enabled fully managed engines such as Athena and Redshift Spectrum to improve permission management. While this was a marked improvement over the previously available solutions, many customers still found themselves contorting their data models to create effective data perimeters.

Establishing a data perimeter

You've undoubtedly heard many vendors talk about democratizing access to data across your organization. We explored this topic with some hands-on exercises in Chapter 7, Ad Hoc Analytics, but we sidestepped a pervasive issue that comes with increasing access to your data. As we improve the accessibility of data, so too must we elevate our understanding of data perimeters. The word perimeter has historically referred to the outer edges of a company's physical assets, such as office buildings. When the internet and e-commerce revolutionized how business was conducted, companies erected virtual perimeters using firewalls. These concepts work well if your assets can be easily compartmentalized from those who should and shouldn't have access to them. In practice, the threats to your data are not always clear and certainly not always external to your company. There are different classes of data, and times when you will need to control access at a department level and even down to individual employees. For example, has the data left your perimeter if an HR employee runs a payroll report and leaves intermediate data on storage, where it is later accessed by someone outside HR? What if that same HR employee is working from home and downloads that payroll report to their laptop? At this point, you don't even have the protection of your company's physical security.

In these cases, data lake security is more important than ever. Lake Formation can help companies balance security and compliance needs with their growing desire to share data across departments, groups, and individuals. In many cases, safely sharing data across individuals with different job functions requires making redacted copies of the data. Aside from the additional storage and compute costs to ETL these copies, organizations must manage consistency and correctness across a web of dependencies. We routinely help customers who have dozens of important datasets but somehow find themselves with thousands of derivative datasets, simply to accommodate different levels of access. Until recently, this was the state-of-the-art approach to creating a robust data perimeter because you get fine-grained control over which use cases and entities have access to specific slices of your data. Paradoxically, this approach created so many subtle variations of the original data that customers feared making mistakes that could lead to unintended information disclosure.

It's probably already pretty clear that security is far from easy to define, let alone build. It can be even more challenging when basic computational building blocks we all depend on seemingly stop playing by the rules. We'll dig deeper into this topic as part of understanding how customers often overlook their part in shared responsibility models.

Shared responsibility security model

Simply put, a shared responsibility security model refers to the basic idea that the customer and the service provider must work together to ensure any given workload is secure. We're using the word secure a bit tongue in cheek here because most security-conscious individuals will recoil at the thought of distilling all the complex nuances of security into one word. Security is rarely binary, meaning it's uncommon for any application to be described as simply secure or not secure. It's more common to think of these things as a gradient or, even better, in terms of specific threats and mitigations.

For example, one use case may require that data be encrypted at rest. The reasons vary, but a typical example is that the underlying storage does not encrypt the data replication traffic that's generated when storage nodes fail over. Another application may run workloads from multiple internal teams on shared infrastructure to improve costs. Since these workloads are all internal, the business valued utilization above protecting internal workloads from one another. If that same application started running workloads from external entities on that same, shared infrastructure, the definition of secure would likely change.

We've already called out that fully managed engines such as Athena and Redshift Spectrum carry less of the shared security model's burden for the customer. Still, the reason has less to do with being fully managed and more to do with the level of control or abstraction these services offer. Both Athena and Redshift Spectrum essentially operate over SQL, whereas EMR and Glue ETL offer far more customer control. An EMR or Glue ETL customer can choose to run arbitrary code in their jobs. If you've ever used spark-submit or a Jupyter notebook with EMR, then you've executed arbitrary code on your EMR cluster. So, why the big fuss over arbitrary code? Well, the ability to run arbitrary code provides fairly low-level access to the machines that run your workloads.

Suppose you are running your analytics applications in a shared Spark cluster. During a Spark job, the state of any given node is represented as shown in the following diagram, with your arbitrary Spark code running side by side with the arbitrary code from some other workload:

Figure 14.2 – Process-level isolation of Spark workloads

Running each workload in separate processes that run as a different user improves the security posture by limiting how neighboring workloads can interact. If your organization is mainly concerned with avoiding accidental data leakage from bugs or typos, this level of isolation may be sufficient. But how do you know whether you've set it up correctly? If you depend on process-level isolation, it becomes increasingly important to ensure your customers cannot tamper with the operating system or Spark itself. Ensuring only administrators have root access to the host is a good start, but it isn't always easy to know whether that is enough.

Now, let's suppose that you'd like to go a step further and ensure customers can only access data they are authorized for. You might choose a tool such as Apache Ranger for access control. With Apache Ranger, policy enforcement takes place within Spark, alongside your workload. This pluggability makes it easy to get started with Apache Ranger for Spark, but what level of protection does it provide? For example, what prevents someone from running a Spark job that hijacks the Java classpath and injects their own copy of the RangerHiveAuthorizerFactory class? The RangerHiveAuthorizerFactory class plays a central role in data access policy enforcement. If an attacker can replace this class with one they control, the workload can bypass access control policies. Because their workload includes arbitrary code and has access to the Java class loader, such attacks become possible.
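To see the shape of this attack in runnable form, the following Python sketch, which we constructed purely for illustration, uses Python's module search path as a stand-in for the Java classpath. Whichever implementation appears first wins, so code that can prepend to the search path can shadow a trusted policy module with its own:

import os
import sys
import tempfile

# Write a malicious stand-in for a hypothetical "policy" enforcement module.
shadow_dir = tempfile.mkdtemp()
with open(os.path.join(shadow_dir, "policy.py"), "w") as f:
    f.write("def is_authorized(user, column):\n")
    f.write("    return True  # always allow\n")

# Arbitrary code running in-process can prepend to the module search path,
# just as a Spark job can prepend a JAR to the Java classpath.
sys.path.insert(0, shadow_dir)

import policy  # resolves to the malicious copy, not the legitimate one

print(policy.is_authorized("attacker", "salary"))  # prints True

The details differ in the JVM, but the lesson is the same: enforcement that runs in the same process as untrusted code is only as strong as that process's weakest extension point.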

An analogy may be helpful here. In-process enforcement of this kind is akin to the lock on the front door of a house. It will keep most people from entering your home without permission, but it won't stop a determined adversary. There is a stark difference between keeping honest people honest and mitigating attacks from sophisticated adversaries such as nation states. If you aren't using a managed service, your organization must play a more significant role in deciding where to draw the line.

This is one of the big distinctions between a service such as Athena, which offers a fully managed runtime, and services such as Glue ETL and EMR, which let you run highly customizable environments, including your own Spark code. The attack surface is much different, so the customer shoulders a larger share of the responsibility in the security model.

It may be hard to believe, but we've only discussed the obvious examples that feed into the shared security model so far. Next, we will discuss the more insidious examples that have contributed, in part, to Lake Formation's release of governed tables. In recent years, the computing world as a whole has learned that processor design is not immune to security flaws. While we've seen exploits in software for decades, many of us had been spoiled by the reliability of hardware security controls. When Spectre and Meltdown were announced to the world on January 3, 2018, our ability to depend on previously trusted operating system process-level isolation was shaken. Researchers had managed to use timing variations in memory cache reads to extract information from mispredicted code branches. There is a lot to unpack in that one statement, and while this is not a book on security or processor design, this is a topic worth understanding a bit more deeply. Recognizing the fundamental issues at play here will also help you understand the motivation for several of the new Lake Formation features we'll be discussing shortly.

The following diagram shows two possible ways an x86 processor could order the instructions your query engine may need to perform while enforcing column-level access policies. As you read this, please keep in mind that we have greatly simplified what your processor and an attacker would do during an exploit. We recommend reading online to learn more about side-channel exploits such as Spectre and Meltdown:

Figure 14.3 – Speculative execution example

On the left-hand side of the preceding diagram, we can see the instruction ordering that's been requested by our query engine. Naturally, it begins with checking whether the caller has access to read the column. Assuming that conditional passes, the query engine then attempts to read the data and compute a result. The right-hand side of the diagram shows how your CPU likely executed these instructions. Notice that the order changed! Some of these pseudo-instructions take more time, measured in clock cycles, than others. Modern x86 CPUs can work on multiple instructions each clock cycle. While one instruction is fetching its operands from the cache, another instruction might be using a floating-point unit to calculate the result of a division operation. Coordinating which instructions are utilizing each part of the CPU is often referred to as pipelining.

The deeper the pipeline, the more efficient a CPU can be, and the faster customers will perceive the CPU. The trick is keeping all the parts of the CPU busy by guessing what instructions might be run in the future when earlier instructions take too long and stall execution. Your CPU is making a bet. It can remain idle while waiting for the earlier instruction to finish, or it can guess at what it will be asked to do next. Naively waiting has a 100% probability of wasting CPU cycles. Guessing is highly likely to perform better than waiting. Chasing this opportunity is what has driven modern x86 CPUs to reorder instructions and, at times, speculatively execute instructions.

In our access check example shown in the preceding diagram, the memory read and compute result steps have to wait for the access check branch to decide which path to take. While that branch is being evaluated, the memory dispatcher is idle, despite having an impending memory read. A surprisingly large area of your CPU's physical chip is dedicated to branch prediction so that it can guess whether read operations will be required. So, your CPU will start reading and maybe even calculating the result while waiting to find out whether the branch will need those instructions to be carried out. This might seem like a bad idea, especially when instructions have side effects such as writing to memory. Luckily, your CPU can unwind mispredicted branches so that they have no materialized side effects.

Unfortunately, Spectre and Meltdown highlighted subtle side effects in the form of changing the cache's state. Imagine that I can fool your CPU into speculatively executing a conditional read of a memory address I don't own – maybe even the address where you are storing an encryption key. Later, I can run a similar operation and use the timing of when the instruction was completed to tell me whether the value was already in the CPU cache. If the value was in the cache, I can infer the result of the conditional check and thus learn about the value that was stored at an address I don't have access to – all because the CPU cache state wasn't rolled back. In this example, the cache created a side channel between the erased world of the failed branch prediction and the resumed execution path.
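The following toy Python simulation, which we wrote purely for illustration, captures the inference step. The "cache" remembers what speculative execution touched, and the attacker uses fast versus slow access times to recover a bit of a secret they could never read directly:

# A toy model of a cache-timing side channel. Real exploits measure
# nanosecond-scale timings against actual CPU caches; here, the cache is a
# simple set so that the inference logic is easy to follow.
cache = set()
SECRET_BIT = 1  # the value the attacker wants but cannot read directly

def speculative_read():
    # The CPU mispredicts a branch and speculatively touches one of two
    # addresses based on the secret. The architectural state is rolled
    # back, but the cache still remembers the access.
    cache.add("addr_1" if SECRET_BIT else "addr_0")

def timed_access(address):
    # Cached reads are fast; uncached reads are slow. The attacker only
    # observes timing, never the secret itself.
    return "fast" if address in cache else "slow"

speculative_read()
recovered_bit = 1 if timed_access("addr_1") == "fast" else 0
print(f"Recovered secret bit: {recovered_bit}")  # prints 1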

With this primitive memory gadget, an attacker can steal a few bits of memory from a neighboring process at a time. In practice, this class of vulnerability has been used to crack cryptographic keys that are used for SSH, SSL, and credential storage. Many organizations lack the deep security expertise to identify or even worry about these kinds of vulnerabilities. Luckily, Lake Formation can help you stay a step ahead in the race to secure your data lake.

What Is a Gadget?

In the context of malicious code exploits, a gadget is a utility that can be used to exploit a known vulnerability. Most gadgets are small, typically comprising a few dozen lines of code, and appear pretty innocuous on their own. When a malicious actor first accesses a system they intend to compromise, either through legitimate means or via an initial vulnerability, they often begin constructing gadgets that allow them to elevate their privileges or extract information from the target system.

How Lake Formation can help

At re:Invent 2020, the AWS Lake Formation team announced a preview release of Lake Formation's next-generation security features. Among these new features were a set of APIs for reading and writing data to Lake Formation-managed tables, with the ability to enforce row-level access. The following screenshot shows how to grant access to US customer data in a table containing data from customers around the globe:

Figure 14.4 – Row-level access control

These two features can be combined to address many of the challenges we discussed earlier as part of the shared responsibility security model and data perimeters. The new APIs essentially offload the TableScan operation from your analytics engine into Lake Formation's secure filtering fleet. By doing so, Lake Formation can make strong security guarantees, regardless of the analytics engine you are using. Since Lake Formation's read and write APIs apply policy enforcement remotely, away from any arbitrary and potentially untrusted code within your workload, the attack surface is much smaller. You no longer need to worry about side channels or admin access to the underlying host. This model also makes it easier to build multi-tenant analytics applications. Its built-in filtering capability also allows Lake Formation to enforce row-level access control policies that were previously impossible without ETLing redacted copies of your dataset.
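As a sketch of how a row-level policy such as the one shown in Figure 14.4 could be defined programmatically, the following boto3 call creates a data cells filter that exposes only US rows. The account ID, database, table, and column names are illustrative assumptions, and the API shape reflects the feature as it matured after the preview, so the details may differ in your Region:

import boto3

lakeformation = boto3.client("lakeformation")

# Define a row-level filter over a hypothetical global customers table.
# Grantees of this filter see only the rows matching the expression.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",  # example account ID
        "DatabaseName": "sales",
        "TableName": "customers",
        "Name": "us_customers_only",
        "RowFilter": {"FilterExpression": "country = 'US'"},
        "ColumnWildcard": {},  # expose all columns; only rows are filtered
    })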

This functionality is slated to become generally available in late 2021, alongside Lake Formation's new ACID-compliant governed table type.

Understanding the benefits of governed tables

The entire AWS analytics suite of services, including Athena, EMR, Glue, Redshift, and Lake Formation, continually makes building and managing data lakes on S3 easier. What used to take months with traditional data warehouses can be accomplished in days using these tools with S3. Despite all the advances in these services, customers still face difficult choices when it comes to the following:

  • Ingesting streaming data such as Change-Data-Capture (CDC), click data, or application logs
  • Complying with new regulations such as GDPR and CCPA
  • Understanding how your data changes over time
  • Adapting table storage to meet evolving usage and access patterns

In addition to the security-oriented features we discussed earlier in this chapter, Lake Formation's new governed table type takes several steps toward addressing these common sources of data lake frustration. Governed tables are a new Amazon S3 table type that supports atomic, consistent, isolated, and durable (ACID) transactions and automatic storage optimization. To the uninitiated, this may seem like a home run of marketing buzzwords, but governed tables are poised to change how we build everything, from ETL pipelines to interactive analytics applications. Next, we'll look at a common problem that governed tables and their ACID transactions can help us overcome.
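Before we get there, the following sketch shows one way a governed table could be created with boto3. The database, table name, schema, and location are illustrative assumptions we chose for the example; what matters is the TableType setting, which opts the table into Lake Formation transactions and storage optimization:

import boto3

glue = boto3.client("glue")

# Create a hypothetical governed table. Aside from the TableType, this looks
# just like creating any other Glue Data Catalog table.
glue.create_table(
    DatabaseName="ads",
    TableInput={
        "Name": "impressions",
        "TableType": "GOVERNED",
        "StorageDescriptor": {
            "Location": "s3://my_bucket/ads/impressions/",
            "Columns": [
                {"Name": "campaign_id", "Type": "bigint"},
                {"Name": "impression_time", "Type": "timestamp"},
                {"Name": "linger_time_ms", "Type": "bigint"},
            ],
        },
    })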

ACID transactions on S3-backed tables

Have you ever queried multiple data lake tables in the same query, perhaps via a join clause? If different source systems or ETL jobs populated those tables, there is a significant probability that any query against them reads inconsistent data. The picture becomes even bleaker when you factor in partial failures, which can be just as challenging to identify as they are to repair. This might be a good time for an example.

Suppose we work for an advertising company and routinely track the performance of different advertising campaigns by joining three tables. The first table contains details about all the campaigns, including their total budget, start date, end date, and sponsor. This table is relatively stable, changing only when new campaigns are booked. Next, the impressions table contains a row representing every time we served an ad placement from a campaign. This table changes rapidly, with new entries appearing in near-real time. The final table contains conversion data that identifies which impressions resulted in an ad click or, better still, a purchase! This table doesn't change as often as we'd like, but it is far from static and is mostly populated with data from third-party systems.

When you open your Athena console and run your company's conversion rate reporting query in preparation for a client meeting, you are rolling the dice that the result you get is an accurate representation of the world. Suppose the impressions table has fallen behind because of a traffic surge leading up to the holiday season. The conversion table has a much lower volume and doesn't encounter any issues. Even if your query uses set date ranges, you may still find yourself pulling more conversion data than impression data, resulting in an overly optimistic view of how well the campaign is doing. The opposite can also be true when an unexpected issue causes the third-party source data to be late or incomplete. In that case, you may be scrambling to make up for an inexplicably underperforming campaign and give unnecessary concessions to an important client.

In our experience, all data lake use cases fall into one of three buckets concerning consistency:

  • Consistency is irrelevant: The data is typically historical (backward-looking), immutable, or consistency is inherent due to the records containing correlation IDs that self-identify consistency issues.
  • Consistency is unknown: The producers and consumers do not know or understand the implications of datasets being used together. The organization spends many hours chasing phantom data quality heisenbugs that seem to resolve themselves when investigated.
  • Consistency is required and designed for: Producers and consumers take steps to ensure that the data in the lake is consistent. This often includes publishing metadata alongside the data that describes its currency. Many organizations also produce snapshot datasets that simplify consumption by treating data as immutable, at the expense of increased ETL compute and storage costs.

    Heisenbug

    This is one of our favorite pieces of computer science jargon. It plays on the famous observer effect of quantum mechanics that Werner Heisenberg first described in his uncertainty principle. The theory asserts that the act of observing a quantum particle changes its behavior and reduces the reliability of multi-variable measurements. Naturally, frustrated software engineers rallied behind this idea because it accurately describes a class of bugs that are usually timing-related. In such cases, adding a new log line or attaching a debugger to observe the bug changes how the system behaves and causes the bug to disappear. In practice, the typical mechanisms that are used to observe a bug also change the speed or timing of program execution, which has a real effect on timing bugs resulting from race conditions.

Now that we have a better understanding of data lake consistency, we can look at an example of how to use transactions against Lake Formation governed tables to simplify how we produce and consume data. At the time of writing, Athena can read governed tables, but support for writing to them has not been released yet. Since most of the interesting consistency work is taken on by the producer or writer, we'll use an Apache Spark example from Glue ETL instead.

In the following code block, we are creating a Glue Spark context and then calling Lake Formation's new begin_transaction API. This API returns a transaction identifier that represents a specific point in time within our data lake, commonly called an epoch. With this single API call, we've established a point of observation that will be applied to all reads and writes that are performed within this script. This is important enough that it warrants repeating. No matter what any other reader or writer does to any table in our data lake, we are guaranteed a view of the world as soon as we start the transaction, thanks to the snapshot isolation mechanism offered by governed tables.

The script then uses the transaction ID to configure a Spark sink that points to our impressions table in the ads database. This is primarily boilerplate and is no different from non-governed table use cases, except for passing the transaction ID to the creation function:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Begin a read-write transaction; the returned ID pins our view of the lake.
txid1 = glueContext.begin_transaction(read_only=False)

# Configure a sink for the governed impressions table, scoped to our
# transaction so that the writes remain invisible until we commit.
sink = glueContext.getSink(connection_type="s3",
                           path="s3://my_bucket/ads/impressions/",
                           enableUpdateCatalog=True,
                           updateBehavior="UPDATE_IN_DATABASE",
                           transactionId=txid1)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="ads",
                    catalogTableName="impressions")

Once the sink has been created, the script uses it to write new and updated impression data into the data lake via a DataFrame that we loaded offscreen from a third-party source. In the following code block, the script uses a try-except block to ensure that it either commits or aborts the transaction, depending on the success of the write operation. As the developer of the script, you can choose when to call commit_transaction or abort_transaction. For extra protection, you may choose to query the newly written data to ensure it's available before declaring the write successful and committing the transaction. Since governed tables support read-your-own-write semantics, you can easily add this quality check and simplify operations by automatically rolling back the errant or partial data without human intervention:

try:
    # Write the new and updated impressions within the transaction.
    sink.writeFrame(new_and_updated_impressions_dataframe)
    glueContext.commit_transaction(txid1)
except Exception:
    # Abort so that no partial or errant data ever becomes visible.
    glueContext.abort_transaction(txid1)
    raise

There are many other use cases where having transactional capabilities is helpful. Combining Lake Formation's new data read and write APIs with ACID transactions enables compliance with data protection laws such as GDPR, something that was previously hampered by the immutable nature of S3 objects.

Despite S3 objects being inherently immutable, organizations have been modifying data in their data lakes for years. Most customers are familiar with adding new data as it arrives, but some must also apply backfills or restate past values by rewriting entire files or tables. With all these modifications flying around, we often find ourselves wondering, "what did that table contain when this job ran?". Your compliance officer might even mandate that specific tables be versioned, even though few, if any, tools exist to automate reading past versions of essentially random S3 objects. The same machinery that Lake Formation uses to create ACID transactions enables reading your data lake through any committed transaction. This is the basic building block of time-travel capabilities, which we will discuss in the next section.

Time-traveling queries

To resolve transaction conflicts and support rollbacks, most ACID-compliant transaction managers maintain a transaction log of some kind. This ledger records every change, addition, or deletion that occurred as part of each transaction. With this information, the system can rebuild its state as it was before or after each transaction. Normally, this aids in error recovery or transaction rollback when you call the abort_transaction API. Lake Formation extends the utility of the transaction log to offer time-traveling capabilities.
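To make the mechanism concrete, here is a minimal toy sketch, written purely for illustration, of an append-only transaction log. Recording the objects each transaction added and removed is enough to rebuild a table's contents as of any committed transaction:

# A toy append-only transaction log. Replaying the log up to a transaction
# ID reconstructs the set of S3 objects that made up the table at that time.
log = [
    {"txid": 1, "added": {"part-0001.parquet"}, "removed": set()},
    {"txid": 2, "added": {"part-0002.parquet"}, "removed": set()},
    {"txid": 3,  # a compaction rewrote the two small files as one
     "added": {"part-0003.parquet"},
     "removed": {"part-0001.parquet", "part-0002.parquet"}},
]

def snapshot_as_of(txid):
    """Replay the log to compute the live objects as of a transaction ID."""
    objects = set()
    for entry in log:
        if entry["txid"] > txid:
            break
        objects |= entry["added"]
        objects -= entry["removed"]
    return objects

print(snapshot_as_of(2))  # the two original small files
print(snapshot_as_of(3))  # only the compacted file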

When activated, time travel allows queries against one or more governed tables to read from a consistent snapshot of the data lake, as of a specified time or transaction ID. The following code block shows how to run an Athena query against the advertising impressions table from the previous section. Despite what 80s movies may have taught you, you won't need a DeLorean or 1.21 gigawatts of power to calculate the number of impressions for our advertising campaign as of 30 days ago. We can simply specify a SYSTEM_TIME value that Athena will use to set the read point in the transaction log:

SELECT campaign_id,
  count(*) AS total_impressions,
  avg(linger_time_ms) AS avg_impression_duration
FROM
  lakeformation.ads.impressions
WHERE
  campaign_id = 87348519457
FOR SYSTEM_TIME AS OF date_add('day', -30, now())
GROUP BY campaign_id

We can use such queries to debug updates to a dataset, observing when and how data changed. If an update was applied incorrectly, then the transaction that caused the data quality issue can be rolled back. For example, if you have impression data that gets updated regularly and a customer suggests that the data is incorrect, time-traveling queries can pinpoint when the inaccurate data was introduced.

As you might imagine, the underlying storage for transactional tables is more complex than a basic list of S3 objects. Luckily, governed tables are supported by Lake Formation's new storage optimizer.

Automated compaction of data

We first covered the role of physical table layouts as part of Chapter 2, Introduction to Amazon Athena, and Built-In Functions. This subject resurfaced in Chapter 11, Operational Excellence – Maintenance, Optimization, and Troubleshooting, where partitioning and file formats became a focal point of operating Athena workloads at scale. The size of S3 objects and how they are arranged into partitions and tables dictate both the performance and cost of your analytics queries. When customers ask why their queries are not running as quickly as expected, file size is one of the first things we check. Most of the time, the files being read are tiny, 10 KB to 10 MB. Small files can be detrimental to query performance because there is overhead associated with each object in the form of metadata, connection time, and data roundtrips from the underlying storage. This overhead can account for as much as 80% of the overall time taken to read the data for small objects.

When enabled for your governed tables, Lake Formation monitors the file sizes and read performance to identify opportunities where reorganizing the data would improve performance. The first such optimization comes in the form of small file compaction. If you've ever processed a data stream from the likes of Kinesis or Kafka, you'll likely have dealt with an accumulation of thousands or millions of small files. Lake Formation will automatically rewrite the small files into more appropriately sized ones, according to the given format's recommended file size. Since these compaction operations happen as part of an ACID transaction, they all occur seamlessly, without your producers or consumers needing to be aware of the activity.
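Before automatic compaction, many customers scheduled jobs of their own to do this. A minimal Spark sketch of such a self-managed compaction job, with an illustrative path and a target file count we chose for the example, might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

# Read the accumulated small files from one partition of the table.
df = spark.read.parquet("s3://my_bucket/ads/impressions/dt=2021-11-01/")

# Rewrite the partition as a handful of appropriately sized files. Common
# Parquet guidance targets files of roughly 128 MB to 512 MB.
df.coalesce(8).write.mode("overwrite") \
    .parquet("s3://my_bucket/ads/impressions_compacted/dt=2021-11-01/")

Lake Formation's storage optimizer removes the need to build, schedule, and monitor jobs like this one yourself.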

While this is the final Lake Formation feature we'll cover, it is far from the least, given the proliferation of self-managed compaction jobs that many customers run.

Summary

In this chapter, we concluded our exploration of Athena by looking at upcoming Lake Formation features. AWS is increasingly positioning Lake Formation as their one-stop shop for data lake creation and management. If they succeed in making Lake Formation a foundational component of AWS data lakes, customers could expect increased interoperability across the various AWS analytics engines.

It may not be the flashiest feature, but we expect to see many applications mimic Lake Formation's new security features. Using dedicated data access APIs to decouple policy enforcement from workload execution is like an easy button for reducing your attack surface. The addition of ACID transactions with the new governed table type will open a host of new possibilities such as time travel. Look for these features to reach general availability in late 2021.

If you'd like to learn more, consult the Further reading section and consider signing up for the public preview of these features.

Further reading

In this section, we've gathered links to additional materials that you may find helpful in diving deeper into some of the primary sources for topics mentioned in this chapter:
