Chapter 49. Mind the Gap: Your Data Lake Provides No ACID Guarantees

Einat Orr

The modern data lake architecture is built on object storage as the lake, streaming and replication technologies that pour data into it, and a rich ecosystem of applications that either consume data directly from the lake or use it as their deep storage. This architecture is cost-effective and allows high throughput when ingesting or consuming data.

So why is it still extremely challenging to work with data? Here are some reasons:

  • We’re missing isolation. The only ways to ensure isolation are permissions and copying the data. Relying on permissions limits the value we can extract from the data, because it keeps us from opening the data to everyone who might benefit from it. Copying is unmanageable: you quickly lose track of what lives where in your lake.

  • We have no atomicity; in other words, we can’t rely on transactions executing safely. For example, there is no native way to guarantee that no one starts reading a collection before it has been fully written (see the sketch after this list).

  • We can’t ensure cross-collection consistency (and, in some cases, consistency even within a single collection). Denormalizing data in a data lake is common, often for performance reasons: we may write the same data in two formats, or index it differently, to optimize for two different applications or use cases. (This is required because object storage has poor support for secondary indexes.) If one of those write processes fails while the other succeeds, the lake becomes inconsistent, and you risk serving data consumers an inconsistent view of the world.

  • We have no reproducibility. Nothing guarantees that an operation over data + code is reproducible, because data identified in a certain way may change without its identity changing: an object can be replaced by another object with the same name but different content.

  • We have low manageability. The lineage between datasets, and the association of each dataset with the code that created it, is managed manually or through human-defined, human-configured naming conventions.
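To make the atomicity and reproducibility gaps concrete, here is a minimal sketch of two conventions teams commonly layer on top of object storage: a _SUCCESS marker object that a writer publishes only after a collection is fully written, and reads pinned to a specific object version (assuming bucket versioning is enabled). The bucket and key names here are hypothetical, and the marker is only a convention between cooperating jobs, not a storage-level guarantee:

```python
# Sketch only: a cooperative convention, not a storage-level guarantee.
# Assumes a hypothetical bucket "my-lake" and prefix "events/dt=2021-06-01/".
import boto3

s3 = boto3.client("s3")
BUCKET = "my-lake"
PREFIX = "events/dt=2021-06-01/"

def write_collection(objects):
    """Write all data objects, then publish a _SUCCESS marker last.

    Readers that honor the convention will not see a half-written
    collection, but nothing stops a reader that ignores the marker.
    """
    for key, body in objects.items():
        s3.put_object(Bucket=BUCKET, Key=PREFIX + key, Body=body)
    # The marker is written only after every data object succeeded.
    s3.put_object(Bucket=BUCKET, Key=PREFIX + "_SUCCESS", Body=b"")

def collection_is_ready():
    """Check for the marker before reading the collection."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX + "_SUCCESS")
    return resp.get("KeyCount", 0) > 0

def read_pinned(key, version_id):
    """Reproducible read: pin to an immutable object version.

    Requires bucket versioning; the version id would be recorded at
    write time alongside the code that produced the data.
    """
    resp = s3.get_object(Bucket=BUCKET, Key=PREFIX + key, VersionId=version_id)
    return resp["Body"].read()
```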

How do we work around these limitations? First and foremost, we must know not to expect those guarantees. Once you know what you are in for, you can put in place the guarantees you need, depending on your requirements and on the contracts you have with the customers who consume data from the lake.

In addition, centralized metastores such as the Hive metastore can alleviate the pain of a lack of atomicity and consistency across collections, and versioned data formats such as Hudi can ease the strain of isolation and separation of readers from writers. Frameworks like lakeFS can provide a safe data lake management environment with guarantees of atomicity, consistency, isolation, and durability (ACID), while supporting the use of Hive and Hudi.
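The common thread in those tools is an atomic “pointer flip”: data is written under a new, unique location, and then a single small piece of metadata is switched to point at the new snapshot, so readers always resolve to one complete, consistent version. The sketch below imitates that idea directly on object storage with a hypothetical pointer object; real systems such as the Hive metastore, Hudi, and lakeFS manage this metadata far more robustly, handling concurrent writers, conflicts, and history:

```python
# Sketch of the "pointer flip" pattern under hypothetical names.
# A single small object, "tables/events/CURRENT", holds the prefix of
# the latest complete snapshot. Overwriting one object is atomic from
# a reader's point of view, but this sketch ignores concurrent writers
# (last writer wins) -- exactly the gap metastores and lakeFS close.
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-lake"
POINTER_KEY = "tables/events/CURRENT"

def publish_snapshot(objects):
    """Write a full snapshot under a fresh prefix, then flip the pointer."""
    snapshot_prefix = f"tables/events/snapshots/{uuid.uuid4()}/"
    for key, body in objects.items():
        s3.put_object(Bucket=BUCKET, Key=snapshot_prefix + key, Body=body)
    # Readers switch from the old snapshot to the new one in a single step.
    s3.put_object(Bucket=BUCKET, Key=POINTER_KEY,
                  Body=snapshot_prefix.encode("utf-8"))

def current_snapshot_prefix():
    """Readers resolve the pointer first, then read only that snapshot."""
    resp = s3.get_object(Bucket=BUCKET, Key=POINTER_KEY)
    return resp["Body"].read().decode("utf-8")
```

Under this convention, a reader never mixes objects from two snapshots, which is precisely the isolation and consistency property that, as the list above shows, object storage lacks on its own.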
