
In Data Engineering on Azure you’ll learn the skills you need to build and maintain big data platforms in massive enterprises. This invaluable guide offers clear, practical advice for setting up infrastructure, orchestration, workloads, and governance. As you go, you’ll set up efficient machine learning pipelines, then master time-saving automation and DevOps solutions. The Azure-based examples are easy to reproduce on other cloud platforms.
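
The hands-on chapters lean on the Azure CLI (az), which the book introduces in section 1.4.4 and uses throughout (for example, when creating an Azure Data Lake Storage account in section 2.3.1). As a flavor of that workflow, here is a minimal sketch of provisioning a resource group and a Data Lake Storage Gen2 account from the command line; the resource names and region below are placeholders for illustration, not values taken from the book.

    # Create a resource group to hold the data platform resources
    # (names and region are hypothetical examples)
    az group create --name rg-dataplatform --location westus2

    # Create a storage account with hierarchical namespace enabled,
    # which makes it an Azure Data Lake Storage Gen2 account
    az storage account create \
      --name adlsdataplatform \
      --resource-group rg-dataplatform \
      --location westus2 \
      --sku Standard_LRS \
      --kind StorageV2 \
      --hns true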

Table of Contents

  1. inside front cover
    1. Data Platform Architecture
  2. Data Engineering on Azure
  3. Copyright
  4. dedication
  5. brief contents
  6. contents
  7. front matter
    1. preface
    2. acknowledgments
    3. about this book
    4. about the author
    5. about the cover illustration
  8. 1 Introduction
    1. 1.1 What is data engineering?
    2. 1.2 Who this book is for
    3. 1.3 What is a data platform?
    4. 1.3.1 Anatomy of a data platform
    5. 1.3.2 Infrastructure as code, codeless infrastructure
    6. 1.4 Building in the cloud
    7. 1.4.1 IaaS, PaaS, SaaS
    8. 1.4.2 Network, storage, compute
    9. 1.4.3 Getting started with Azure
    10. 1.4.4 Interacting with Azure
    11. 1.5 Implementing an Azure data platform
    12. Summary
  9. Part 1 Infrastructure
  10. 2 Storage
    1. 2.1 Storing data in a data platform
    2. 2.1.1 Storing data across multiple data fabrics
    3. 2.1.2 Having a single source of truth
    4. 2.2 Introducing Azure Data Explorer
    5. 2.2.1 Deploying an Azure Data Explorer cluster
    6. 2.2.2 Using Azure Data Explorer
    7. 2.2.3 Working around query limits
    8. 2.3 Introducing Azure Data Lake Storage
    9. 2.3.1 Creating an Azure Data Lake Storage account
    10. 2.3.2 Using Azure Data Lake Storage
    11. 2.3.3 Integrating with Azure Data Explorer
    12. 2.4 Ingesting data
    13. 2.4.1 Ingestion frequency
    14. 2.4.2 Load type
    15. 2.4.3 Restatements and reloads
    16. Summary
  11. 3 DevOps
    1. 3.1 What is DevOps?
    2. 3.1.1 DevOps in data engineering
    3. 3.2 Introducing Azure DevOps
    4. 3.2.1 Using the az azure-devops extension
    5. 3.3 Deploying infrastructure
    6. 3.3.1 Exporting an Azure Resource Manager template
    7. 3.3.2 Creating Azure DevOps service connections
    8. 3.3.3 Deploying Azure Resource Manager templates
    9. 3.3.4 Understanding Azure Pipelines
    10. 3.4 Deploying analytics
    11. 3.4.1 Using Azure DevOps marketplace extensions
    12. 3.4.2 Storing everything in Git; deploying everything automatically
    13. Summary
  12. 4 Orchestration
    1. 4.1 Ingesting the Bing COVID-19 open dataset
    2. 4.2 Introducing Azure Data Factory
    3. 4.2.1 Setting up the data source
    4. 4.2.2 Setting up the data sink
    5. 4.2.3 Setting up the pipeline
    6. 4.2.4 Setting up a trigger
    7. 4.2.5 Orchestrating with Azure Data Factory
    8. 4.3 DevOps for Azure Data Factory
    9. 4.3.1 Deploying Azure Data Factory from Git
    10. 4.3.2 Setting up access control
    11. 4.3.3 Deploying the production data factory
    12. 4.3.4 DevOps for the Azure Data Factory recap
    13. 4.4 Monitoring with Azure Monitor
    14. Summary
  13. Part 2 Workloads
  14. 5 Processing
    1. 5.1 Data modeling techniques
    2. 5.1.1 Normalization and denormalization
    3. 5.1.2 Data warehousing
    4. 5.1.3 Semistructured data
    5. 5.1.4 Data modeling recap
    6. 5.2 Identity keyrings
    7. 5.2.1 Building an identity keyring
    8. 5.2.2 Understanding keyrings
    9. 5.3 Timelines
    10. 5.3.1 Building a timeline view
    11. 5.3.2 Using timelines
    12. 5.4 Continuous data processing
    13. 5.4.1 Tracking processing functions in Git
    14. 5.4.2 Keyring building in Azure Data Factory
    15. 5.4.3 Scaling out
    16. Summary
  15. 6 Analytics
    1. 6.1 Structuring storage
    2. 6.1.1 Providing development data
    3. 6.1.2 Replicating production data
    4. 6.1.3 Providing read-only access to the production data
    5. 6.1.4 Storage structure recap
    6. 6.2 Analytics workflow
    7. 6.2.1 Prototyping
    8. 6.2.2 Development and user acceptance testing
    9. 6.2.3 Production
    10. 6.2.4 Analytics workflow recap
    11. 6.3 Self-serve data movement
    12. 6.3.1 Support model
    13. 6.3.2 Data contracts
    14. 6.3.3 Pipeline validation
    15. 6.3.4 Postmortems
    16. 6.3.5 Self-serve data movement recap
    17. Summary
  16. 7 Machine learning
    1. 7.1 Training a machine learning model
    2. 7.1.1 Training a model using scikit-learn
    3. 7.1.2 High spender model implementation
    4. 7.2 Introducing Azure Machine Learning
    5. 7.2.1 Creating a workspace
    6. 7.2.2 Creating an Azure Machine Learning compute target
    7. 7.2.3 Setting up Azure Machine Learning storage
    8. 7.2.4 Running ML in the cloud
    9. 7.2.5 Azure Machine Learning recap
    10. 7.3 MLOps
    11. 7.3.1 Deploying from Git
    12. 7.3.2 Storing pipeline IDs
    13. 7.3.3 DevOps for Azure Machine Learning recap
    14. 7.4 Orchestrating machine learning
    15. 7.4.1 Connecting Azure Data Factory with Azure Machine Learning
    16. 7.4.2 Machine learning orchestration
    17. 7.4.3 Orchestrating recap
    18. Summary
  17. Part 3 Governance
  18. 8 Metadata
    1. 8.1 Making sense of the data
    2. 8.2 Introducing Azure Purview
    3. 8.3 Maintaining a data inventory
    4. 8.3.1 Setting up a scan
    5. 8.3.2 Browsing the data dictionary
    6. 8.3.3 Data dictionary recap
    7. 8.4 Managing a data glossary
    8. 8.4.1 Adding a new glossary term
    9. 8.4.2 Curating terms
    10. 8.4.3 Custom templates and bulk import
    11. 8.4.4 Data glossary recap
    12. 8.5 Understanding Azure Purview's advanced features
    13. 8.5.1 Tracking lineage
    14. 8.5.2 Classification rules
    15. 8.5.3 REST API
    16. 8.5.4 Advanced features recap
    17. Summary
  19. 9 Data quality
    1. 9.1 Testing data
    2. 9.1.1 Availability tests
    3. 9.1.2 Correctness tests
    4. 9.1.3 Completeness tests
    5. 9.1.4 Detecting anomalies
    6. 9.1.5 Testing data recap
    7. 9.2 Running data quality checks
    8. 9.2.1 Testing using Azure Data Factory
    9. 9.2.2 Executing tests
    10. 9.2.3 Creating and using a template
    11. 9.2.4 Running data quality checks recap
    12. 9.3 Scaling out data testing
    13. 9.3.1 Supporting multiple data fabrics
    14. 9.3.2 Testing at rest and during movement
    15. 9.3.3 Authoring tests
    16. 9.3.4 Storing tests and results
    17. Summary
  20. 10 Compliance
    1. 10.1 Data classification
    2. 10.1.1 Feature data
    3. 10.1.2 Telemetry
    4. 10.1.3 User data
    5. 10.1.4 User-owned data
    6. 10.1.5 Business data
    7. 10.1.6 Data classification recap
    8. 10.2 Changing classification through processing
    9. 10.2.1 Aggregation
    10. 10.2.2 Anonymization
    11. 10.2.3 Pseudonymization
    12. 10.2.4 Masking
    13. 10.2.5 Processing classification changes recap
    14. 10.3 Implementing an access model
    15. 10.3.1 Security groups
    16. 10.3.2 Securing Azure Data Explorer
    17. 10.3.3 Access model recap
    18. 10.4 Complying with GDPR and other considerations
    19. 10.4.1 Data handling
    20. 10.4.2 Data subject requests
    21. 10.4.3 Other considerations
    22. Summary
  21. 11 Distributing data
    1. 11.1 Data distribution overview
    2. 11.2 Building a data API
    3. 11.2.1 Introducing Azure Cosmos DB
    4. 11.2.2 Populating the Cosmos DB collection
    5. 11.2.3 Retrieving data
    6. 11.2.4 Data API recap
    7. 11.3 Serving machine learning
    8. 11.4 Sharing data for bulk copy
    9. 11.4.1 Separating compute resources
    10. 11.4.2 Introducing Azure Data Share
    11. 11.4.3 Sharing data for bulk copy recap
    12. 11.5 Data sharing best practices
    13. Summary
  22. Appendix A. Azure services
    1. Azure Storage
    2. Azure SQL
    3. Azure Synapse Analytics
    4. Azure Data Explorer
    5. Azure Databricks
    6. Azure Cosmos DB
  23. Appendix B. KQL quick reference
    1. Common query reference
    2. SQL to KQL
  24. Appendix C. Running code samples
  25. index
  26. inside back cover
    1. MLOps