
In Data Engineering on Azure you’ll learn the skills you need to build and maintain big data platforms in massive enterprises. This invaluable guide offers clear, practical advice for setting up infrastructure, orchestration, workloads, and governance. As you go, you’ll set up efficient machine learning pipelines, then master time-saving automation and DevOps solutions. The Azure-based examples are easy to reproduce on other cloud platforms.
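
The hands-on chapters lean on the Azure CLI (az), which the book introduces in section 1.4.4 and uses throughout (for example, when creating an Azure Data Lake Storage account in section 2.3.1). As a flavor of that workflow, here is a minimal sketch of provisioning a resource group and a Data Lake Storage Gen2 account from the command line; the resource names and region below are placeholders for illustration, not values taken from the book.

    # Create a resource group to hold the data platform resources
    # (names and region are hypothetical examples)
    az group create --name rg-dataplatform --location westus2

    # Create a storage account with hierarchical namespace enabled,
    # which makes it an Azure Data Lake Storage Gen2 account
    az storage account create \
      --name adlsdataplatform \
      --resource-group rg-dataplatform \
      --location westus2 \
      --sku Standard_LRS \
      --kind StorageV2 \
      --hns true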

Table of Contents

  1. inside front cover
    1. Data Platform Architecture
  2. Data Engineering on Azure
  3. Copyright
  4. dedication
  5. brief contents
  6. contents
  7. front matter
    1. preface
    2. acknowledgments
    3. about this book
    4. about the author
    5. about the cover illustration
  8. 1 Introduction
    1. 1.1 What is data engineering?
    2. 1.2 Who this book is for
    3. 1.3 What is a data platform?
    4. 1.3.1 Anatomy of a data platform
    5. 1.3.2 Infrastructure as code, codeless infrastructure
    6. 1.4 Building in the cloud
    7. 1.4.1 IaaS, PaaS, SaaS
    8. 1.4.2 Network, storage, compute
    9. 1.4.3 Getting started with Azure
    10. 1.4.4 Interacting with Azure
    11. 1.5 Implementing an Azure data platform
    12. Summary
  9. Part 1 Infrastructure
  10. 2 Storage
    1. 2.1 Storing data in a data platform
    2. 2.1.1 Storing data across multiple data fabrics
    3. 2.1.2 Having a single source of truth
    4. 2.2 Introducing Azure Data Explorer
    5. 2.2.1 Deploying an Azure Data Explorer cluster
    6. 2.2.2 Using Azure Data Explorer
    7. 2.2.3 Working around query limits
    8. 2.3 Introducing Azure Data Lake Storage
    9. 2.3.1 Creating an Azure Data Lake Storage account
    10. 2.3.2 Using Azure Data Lake Storage
    11. 2.3.3 Integrating with Azure Data Explorer
    12. 2.4 Ingesting data
    13. 2.4.1 Ingestion frequency
    14. 2.4.2 Load type
    15. 2.4.3 Restatements and reloads
    16. Summary
  11. 3 DevOps
    1. 3.1 What is DevOps?
    2. 3.1.1 DevOps in data engineering
    3. 3.2 Introducing Azure DevOps
    4. 3.2.1 Using the az azure-devops extension
    5. 3.3 Deploying infrastructure
    6. 3.3.1 Exporting an Azure Resource Manager template
    7. 3.3.2 Creating Azure DevOps service connections
    8. 3.3.3 Deploying Azure Resource Manager templates
    9. 3.3.4 Understanding Azure Pipelines
    10. 3.4 Deploying analytics
    11. 3.4.1 Using Azure DevOps marketplace extensions
    12. 3.4.2 Storing everything in Git; deploying everything automatically
    13. Summary
  12. 4 Orchestration
    1. 4.1 Ingesting the Bing COVID-19 open dataset
    2. 4.2 Introducing Azure Data Factory
    3. 4.2.1 Setting up the data source
    4. 4.2.2 Setting up the data sink
    5. 4.2.3 Setting up the pipeline
    6. 4.2.4 Setting up a trigger
    7. 4.2.5 Orchestrating with Azure Data Factory
    8. 4.3 DevOps for Azure Data Factory
    9. 4.3.1 Deploying Azure Data Factory from Git
    10. 4.3.2 Setting up access control
    11. 4.3.3 Deploying the production data factory
    12. 4.3.4 DevOps for the Azure Data Factory recap
    13. 4.4 Monitoring with Azure Monitor
    14. Summary
  13. Part 2 Workloads
  14. 5 Processing
    1. 5.1 Data modeling techniques
    2. 5.1.1 Normalization and denormalization
    3. 5.1.2 Data warehousing
    4. 5.1.3 Semistructured data
    5. 5.1.4 Data modeling recap
    6. 5.2 Identity keyrings
    7. 5.2.1 Building an identity keyring
    8. 5.2.2 Understanding keyrings
    9. 5.3 Timelines
    10. 5.3.1 Building a timeline view
    11. 5.3.2 Using timelines
    12. 5.4 Continuous data processing
    13. 5.4.1 Tracking processing functions in Git
    14. 5.4.2 Keyring building in Azure Data Factory
    15. 5.4.3 Scaling out
    16. Summary
  15. 6 Analytics
    1. 6.1 Structuring storage
    2. 6.1.1 Providing development data
    3. 6.1.2 Replicating production data
    4. 6.1.3 Providing read-only access to the production data
    5. 6.1.4 Storage structure recap
    6. 6.2 Analytics workflow
    7. 6.2.1 Prototyping
    8. 6.2.2 Development and user acceptance testing
    9. 6.2.3 Production
    10. 6.2.4 Analytics workflow recap
    11. 6.3 Self-serve data movement
    12. 6.3.1 Support model
    13. 6.3.2 Data contracts
    14. 6.3.3 Pipeline validation
    15. 6.3.4 Postmortems
    16. 6.3.5 Self-serve data movement recap
    17. Summary
  16. 7 Machine learning
    1. 7.1 Training a machine learning model
    2. 7.1.1 Training a model using scikit-learn
    3. 7.1.2 High spender model implementation
    4. 7.2 Introducing Azure Machine Learning
    5. 7.2.1 Creating a workspace
    6. 7.2.2 Creating an Azure Machine Learning compute target
    7. 7.2.3 Setting up Azure Machine Learning storage
    8. 7.2.4 Running ML in the cloud
    9. 7.2.5 Azure Machine Learning recap
    10. 7.3 MLOps
    11. 7.3.1 Deploying from Git
    12. 7.3.2 Storing pipeline IDs
    13. 7.3.3 DevOps for Azure Machine Learning recap
    14. 7.4 Orchestrating machine learning
    15. 7.4.1 Connecting Azure Data Factory with Azure Machine Learning
    16. 7.4.2 Machine learning orchestration
    17. 7.4.3 Orchestrating recap
    18. Summary
  17. Part 3 Governance
  18. 8 Metadata
    1. 8.1 Making sense of the data
    2. 8.2 Introducing Azure Purview
    3. 8.3 Maintaining a data inventory
    4. 8.3.1 Setting up a scan
    5. 8.3.2 Browsing the data dictionary
    6. 8.3.3 Data dictionary recap
    7. 8.4 Managing a data glossary
    8. 8.4.1 Adding a new glossary term
    9. 8.4.2 Curating terms
    10. 8.4.3 Custom templates and bulk import
    11. 8.4.4 Data glossary recap
    12. 8.5 Understanding Azure Purview's advanced features
    13. 8.5.1 Tracking lineage
    14. 8.5.2 Classification rules
    15. 8.5.3 REST API
    16. 8.5.4 Advanced features recap
    17. Summary
  19. 9 Data quality
    1. 9.1 Testing data
    2. 9.1.1 Availability tests
    3. 9.1.2 Correctness tests
    4. 9.1.3 Completeness tests
    5. 9.1.4 Detecting anomalies
    6. 9.1.5 Testing data recap
    7. 9.2 Running data quality checks
    8. 9.2.1 Testing using Azure Data Factory
    9. 9.2.2 Executing tests
    10. 9.2.3 Creating and using a template
    11. 9.2.4 Running data quality checks recap
    12. 9.3 Scaling out data testing
    13. 9.3.1 Supporting multiple data fabrics
    14. 9.3.2 Testing at rest and during movement
    15. 9.3.3 Authoring tests
    16. 9.3.4 Storing tests and results
    17. Summary
  20. 10 Compliance
    1. 10.1 Data classification
    2. 10.1.1 Feature data
    3. 10.1.2 Telemetry
    4. 10.1.3 User data
    5. 10.1.4 User-owned data
    6. 10.1.5 Business data
    7. 10.1.6 Data classification recap
    8. 10.2 Changing classification through processing
    9. 10.2.1 Aggregation
    10. 10.2.2 Anonymization
    11. 10.2.3 Pseudonymization
    12. 10.2.4 Masking
    13. 10.2.5 Processing classification changes recap
    14. 10.3 Implementing an access model
    15. 10.3.1 Security groups
    16. 10.3.2 Securing Azure Data Explorer
    17. 10.3.3 Access model recap
    18. 10.4 Complying with GDPR and other considerations
    19. 10.4.1 Data handling
    20. 10.4.2 Data subject requests
    21. 10.4.3 Other considerations
    22. Summary
  21. 11 Distributing data
    1. 11.1 Data distribution overview
    2. 11.2 Building a data API
    3. 11.2.1 Introducing Azure Cosmos DB
    4. 11.2.2 Populating the Cosmos DB collection
    5. 11.2.3 Retrieving data
    6. 11.2.4 Data API recap
    7. 11.3 Serving machine learning
    8. 11.4 Sharing data for bulk copy
    9. 11.4.1 Separating compute resources
    10. 11.4.2 Introducing Azure Data Share
    11. 11.4.3 Sharing data for bulk copy recap
    12. 11.5 Data sharing best practices
    13. Summary
  22. Appendix A. Azure services
    1. Azure Storage
    2. Azure SQL
    3. Azure Synapse Analytics
    4. Azure Data Explorer
    5. Azure Databricks
    6. Azure Cosmos DB
  23. Appendix B. KQL quick reference
    1. Common query reference
    2. SQL to KQL
  24. Appendix C. Running code samples
  25. index
  26. inside back cover
    1. MLOps