Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you'll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available in the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You'll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, governance, and deployment that are critical in any data environment regardless of the underlying technology.

This book will help you:

  • Assess data engineering problems using an end-to-end data framework of best practices
  • Cut through marketing hype when choosing data technologies, architecture, and processes
  • Use the data engineering lifecycle to design and build a robust architecture
  • Incorporate data governance and security across the data engineering lifecycle

Table of Contents

  1. Data Engineering Described
    1. What Is Data Engineering?
    2. Evolution of the Data Engineer
    3. Data Engineering Defined
    4. Data Engineering and Data Science
    5. Data Engineering Skills and Activities
    6. Data Maturity and the Data Engineer
    7. The Background and Skills of a Data Engineer
    8. Business Responsibilities
    9. Technical Responsibilities
    10. The Continuum of Data Engineering Roles, from A to B
    11. Data Engineers Inside an Organization
    12. Internal-Facing Versus External-Facing Data Engineers
    13. Data Engineers and Other Technical Roles
    14. Data Engineers and Business Leadership
    15. Summary
    16. Further Reading
    17. Links
  2. The Data Engineering Lifecycle
    1. What Is the Data Engineering Lifecycle?
    2. The Data Lifecycle Versus the Data Engineering Lifecycle
    3. Generation: Source Systems
    4. Ingestion
    5. Storage
    6. Transformation
    7. Serving Data for Analytics, Machine Learning, and Reverse ETL
    8. The Major Undercurrents Across the Data Engineering Lifecycle
    9. Data Management
    10. Orchestration
    11. DataOps
    12. Data Architecture
    13. Software Engineering
    14. Chapter Summary
    15. Further Reading
    16. Data Transformation and Processing
    17. Undercurrents
    18. Further Watching
  3. Choosing Technologies Across the Data Engineering Lifecycle
    1. Cost Optimization: Total Cost of Ownership and Opportunity Cost
    2. Total Cost of Ownership
    3. Opportunity Cost
    4. Today Versus the Future: Immutable Versus Transitory Technologies
    5. Our Advice
    6. Location: On-Prem, Cloud, Hybrid, Multi-Cloud, and More
    7. On-Premises
    8. Cloud
    9. Hybrid Cloud
    10. Multi-Cloud
    11. Decentralized: Blockchain and the Edge
    12. Be Cautious with Repatriation Arguments
    13. Our Advice
    14. Build Versus Buy
    15. Open-Source Software (OSS)
    16. Proprietary Walled Gardens
    17. Our Advice
    18. Monolith Versus Modular
    19. Monolith
    20. Modularity
    21. The Distributed Monolith Pattern
    22. Our Advice
    23. Serverless Versus Servers
    24. Serverless
    25. Containers
    26. When Servers Make Sense
    27. Our Advice
    28. Undercurrents and How They Impact Choosing Technologies
    29. Data Management
    30. DataOps
    31. Data Architecture
    32. Orchestration
    33. Software Engineering
    34. Summary