Analytics and machine learning models are only as good as the data they're built on. Querying processed data and getting insights from it requires a robust data pipeline, as well as an effective storage solution that ensures data quality, data integrity, and performance.

This guide introduces you to Delta Lake, an open-source format that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. Delta Lake enhances Apache Spark and makes it easy to store and manage massive amounts of complex data by supporting data integrity, data quality, and performance. Data engineers, data scientists, and data practitioners will learn how to build reliable data lakes and data pipelines at scale using Delta Lake.

  • Understand key data reliability challenges and how to tackle them
  • Learn how to use Delta Lake to realize data reliability improvements
  • Concurrently run streaming and batch jobs against your data lake
  • Execute update, delete, and merge commands against your data lake
  • Use time travel to roll back and examine previous versions of your data
  • Learn best practices to build effective, high-quality end-to-end data pipelines for real-world use cases
  • Integrate with other data technologies such as Presto, Athena, and Redshift, as well as BI tools
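To give a flavor of the operations listed above, here is a brief sketch in Spark SQL of updates, deletes, merges, and time travel against a Delta table; the table names `events` and `updates` are hypothetical placeholders, not examples from the book:

```sql
-- Update rows in place
UPDATE events SET status = 'closed' WHERE event_date < '2023-01-01';

-- Delete rows that fail a quality check
DELETE FROM events WHERE status = 'invalid';

-- Upsert changes from a staging table
MERGE INTO events AS target
USING updates AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: query an earlier version of the table
SELECT * FROM events VERSION AS OF 5;
```

Each statement is recorded in the Delta transaction log, which is what makes the versioned reads and concurrent batch/streaming access described above possible.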

Learn how thousands of companies process exabytes of data per month with a lakehouse architecture built on Delta Lake.

Table of Contents

  1. Basic Operations on Delta Lakes
    1. What is Delta Lake?
    2. How to start using Delta Lake
    3. Using Delta Lake via local Spark shells
    4. Leveraging GitHub or Maven
    5. Using Databricks Community Edition
    6. Basic operations
    7. Creating your first Delta table
    8. Unpacking the Transaction Log
    9. What Is the Delta Lake Transaction Log?
    10. How Does the Transaction Log Work?
    11. Dealing With Multiple Concurrent Reads and Writes
    12. Other Use Cases
    13. Diving further into the transaction log
    14. Table Utilities
    15. Review table history
    16. Vacuum History
    17. Retrieve Delta table details
    18. Generate a manifest file
    19. Convert a Parquet table to a Delta table
    20. Convert a Delta table to a Parquet table
    21. Restore a table version
    22. Summary
  2. Time Travel with Delta Lake
    1. Introduction
    2. Under the hood of a Delta Table
    3. The Delta Directory
    4. Delta Logs Directory
    5. The files of a Delta table
    6. Time Travel
    7. Common Challenges with Changing Data
    8. Working with Time Travel
    9. Time travel use cases
    10. Time travel considerations
    11. Summary
  3. Continuous Applications with Delta Lake
    1. Make All Your Streams Come True
    2. Spark Streaming Was Built to Unify Batch and Streaming
    3. Exactly-Once Semantics
    4. Putting Some Structure Around Streaming
    5. Streaming with Delta
    6. Delta as a Stream Source
    7. Ignore Updates and Deletes
    8. Delta Table as a Sink
    9. Appendix