Book Description

Azure Storage, Streaming, and Batch Analytics shows you how to build state-of-the-art data solutions with tools from the Microsoft Azure platform. Read along to construct a cloud-native data warehouse, adding features like real-time data processing. Based on the Lambda architecture for big data, the design uses scalable services such as Event Hubs, Stream Analytics, and SQL databases. Along the way, you’ll cover most of the topics needed to earn an Azure data engineering certification.

Table of Contents

  1. Azure Storage, Streaming, and Batch Analytics
  2. Copyright
  3. dedication
  4. brief contents
  5. contents
  6. front matter
    1. preface
    2. acknowledgments
    3. about this book
      1. Who should read this book
      2. How this book is organized: a roadmap
      3. About the code
      4. Author online
    4. about the author
    5. about the cover illustration
  7. 1 What is data engineering?
    1. 1.1 What is data engineering?
    2. 1.2 What do data engineers do?
    3. 1.3 How does Microsoft define data engineering?
      1. 1.3.1 Data acquisition
      2. 1.3.2 Data storage
      3. 1.3.3 Data processing
      4. 1.3.4 Data queries
      5. 1.3.5 Orchestration
      6. 1.3.6 Data retrieval
    4. 1.4 What tools does Azure provide for data engineering?
    5. 1.5 Azure Data Engineers
    6. 1.6 Example application
    7. Summary
  8. 2 Building an analytics system in Azure
    1. 2.1 Fundamentals of Azure architecture
      1. 2.1.1 Azure subscriptions
      2. 2.1.2 Azure regions
      3. 2.1.3 Azure naming conventions
      4. 2.1.4 Resource groups
      5. 2.1.5 Finding resources
    2. 2.2 Lambda architecture
    3. 2.3 Azure cloud services
      1. 2.3.1 Azure analytics system architecture
      2. 2.3.2 Event Hubs
      3. 2.3.3 Stream Analytics
      4. 2.3.4 Data Lake Storage
      5. 2.3.5 Data Lake Analytics
      6. 2.3.6 SQL Database
      7. 2.3.7 Data Factory
      8. 2.3.8 Azure PowerShell
    4. 2.4 Walk-through of processing a series of event data records
      1. 2.4.1 Hot path
      2. 2.4.2 Cold path
      3. 2.4.3 Choosing abstract Azure services
    5. 2.5 Calculating cloud hosting costs
      1. 2.5.1 Event Hubs
      2. 2.5.2 Stream Analytics
      3. 2.5.3 Data Lake Storage
      4. 2.5.4 Data Lake Analytics
      5. 2.5.5 SQL Database
      6. 2.5.6 Data Factory
    6. Summary
  9. 3 General storage with Azure Storage accounts
    1. 3.1 Cloud storage services
      1. 3.1.1 Before you begin
    2. 3.2 Creating an Azure Storage account
      1. 3.2.1 Using Azure portal
      2. 3.2.2 Using Azure PowerShell
      3. 3.2.3 Azure Storage replication
    3. 3.3 Storage account services
      1. 3.3.1 Blob storage
      2. 3.3.2 Creating a Blobs service container
      3. 3.3.3 Blob tiering
      4. 3.3.4 Copy tools
      5. 3.3.5 Queues
      6. 3.3.6 Creating a queue
      7. 3.3.7 Azure Storage queue options
    4. 3.4 Storage account access
      1. 3.4.1 Blob container security
      2. 3.4.2 Designing Storage account access
    5. 3.5 Exercises
      1. 3.5.1 Exercise 1
      2. 3.5.2 Exercise 2
    6. Summary
  10. 4 Azure Data Lake Storage
    1. 4.1 Creating an Azure Data Lake store
      1. 4.1.1 Using Azure portal
      2. 4.1.2 Using Azure PowerShell
    2. 4.2 Data Lake store access
      1. 4.2.1 Access schemes
      2. 4.2.2 Configuring access
      3. 4.2.3 Hierarchy structure in the Data Lake store
    3. 4.3 Storage folder structure and data drift
      1. 4.3.1 Hierarchy structure revisited
      2. 4.3.2 Data drift
    4. 4.4 Copy tools for Data Lake stores
      1. 4.4.1 Data Explorer
      2. 4.4.2 ADLCopy tool
      3. 4.4.3 Azure Storage Explorer tool
    5. 4.5 Exercises
      1. 4.5.1 Exercise 1
      2. 4.5.2 Exercise 2
    6. Summary
  11. 5 Message handling with Event Hubs
    1. 5.1 How does an Event Hub work?
    2. 5.2 Collecting data in Azure
    3. 5.3 Creating an Event Hubs namespace
      1. 5.3.1 Using Azure PowerShell
      2. 5.3.2 Throughput units
      3. 5.3.3 Event Hub geo-disaster recovery
      4. 5.3.4 Failover with geo-disaster recovery
    4. 5.4 Creating an Event Hub
      1. 5.4.1 Using Azure portal
      2. 5.4.2 Using Azure PowerShell
      3. 5.4.3 Shared access policy
    5. 5.5 Event Hub partitions
      1. 5.5.1 Multiple consumers
      2. 5.5.2 Why specify a partition?
      3. 5.5.3 Why not specify a partition?
      4. 5.5.4 Event Hubs message journal
      5. 5.5.5 Partitions and throughput units
    6. 5.6 Configuring Capture
      1. 5.6.1 File name formats
      2. 5.6.2 Secure access for Capture
      3. 5.6.3 Enabling Capture
      4. 5.6.4 The importance of time
    7. 5.7 Securing access to Event Hubs
      1. 5.7.1 Shared Access Signature policies
      2. 5.7.2 Writing to Event Hubs
    8. 5.8 Exercises
      1. 5.8.1 Exercise 1
      2. 5.8.2 Exercise 2
      3. 5.8.3 Exercise 3
    9. Summary
  12. 6 Real-time queries with Azure Stream Analytics
    1. 6.1 Creating a Stream Analytics service
      1. 6.1.1 Elements of a Stream Analytics job
      2. 6.1.2 Creating an ASA job using the Azure portal
      3. 6.1.3 Creating an ASA job using Azure PowerShell
    2. 6.2 Configuring inputs and outputs
      1. 6.2.1 Event Hub job input
      2. 6.2.2 ASA job outputs
    3. 6.3 Creating a job query
      1. 6.3.1 Starting the ASA job
      2. 6.3.2 Failure to start
      3. 6.3.3 Output exceptions
    4. 6.4 Writing job queries
      1. 6.4.1 Window functions
      2. 6.4.2 Machine learning functions
    5. 6.5 Managing performance
      1. 6.5.1 Streaming units
      2. 6.5.2 Event ordering
    6. 6.6 Exercises
      1. 6.6.1 Exercise 1
      2. 6.6.2 Exercise 2
    7. Summary
  13. 7 Batch queries with Azure Data Lake Analytics
    1. 7.1 U-SQL language
      1. 7.1.1 Extractors
      2. 7.1.2 Outputters
      3. 7.1.3 File selectors
      4. 7.1.4 Expressions
    2. 7.2 U-SQL jobs
      1. 7.2.1 Selecting the biometric data files
      2. 7.2.2 Schema extraction
      3. 7.2.3 Aggregation
      4. 7.2.4 Writing files
    3. 7.3 Creating a Data Lake Analytics service
      1. 7.3.1 Using Azure portal
      2. 7.3.2 Using Azure PowerShell
    4. 7.4 Submitting jobs to ADLA
      1. 7.4.1 Using Azure portal
      2. 7.4.2 Using Azure PowerShell
    5. 7.5 Efficient U-SQL job executions
      1. 7.5.1 Monitoring a U-SQL job
      2. 7.5.2 Analytics units
      3. 7.5.3 Vertexes
      4. 7.5.4 Scaling the job execution
    6. 7.6 Using Blob Storage
      1. 7.6.1 Constructing Blob file selectors
      2. 7.6.2 Adding a new data source
      3. 7.6.3 Filtering rowsets
    7. 7.7 Exercises
      1. 7.7.1 Exercise 1
      2. 7.7.2 Exercise 2
    8. Summary
  14. 8 U-SQL for complex analytics
    1. 8.1 Data Lake Analytics Catalog
      1. 8.1.1 Simplifying U-SQL queries
      2. 8.1.2 Simplifying data access
      3. 8.1.3 Loading data for reuse
    2. 8.2 Window functions
    3. 8.3 Local C# functions
    4. 8.4 Exercises
      1. 8.4.1 Exercise 1
      2. 8.4.2 Exercise 2
    5. Summary
  15. 9 Integrating with Azure Data Lake Analytics
    1. 9.1 Processing unstructured data
      1. 9.1.1 Azure Cognitive Services
      2. 9.1.2 Managing assemblies in the Data Lake
      3. 9.1.3 Image data extraction with Advanced Analytics
    2. 9.2 Reading different file types
      1. 9.2.1 Adding custom libraries with a Catalog
      2. 9.2.2 Creating a catalog database
      3. 9.2.3 Building the U-SQL DataFormats solution
      4. 9.2.4 Code folders
      5. 9.2.5 Using custom assemblies
    3. 9.3 Connecting to remote sources
      1. 9.3.1 External databases
      2. 9.3.2 Credentials
      3. 9.3.3 Data Source
      4. 9.3.4 Tables and views
    4. 9.4 Exercises
      1. 9.4.1 Exercise 1
      2. 9.4.2 Exercise 2
    5. Summary
  16. 10 Service integration with Azure Data Factory
    1. 10.1 Creating an Azure Data Factory service
    2. 10.2 Secure authentication
      1. 10.2.1 Azure Active Directory integration
      2. 10.2.2 Azure Key Vault
    3. 10.3 Copying files with ADF
      1. 10.3.1 Creating a Files storage container
      2. 10.3.2 Adding secrets to AKV
      3. 10.3.3 Creating a Files storage linkedservice
      4. 10.3.4 Creating an ADLS linkedservice
      5. 10.3.5 Creating a pipeline and activity
      6. 10.3.6 Creating a scheduled trigger
    4. 10.4 Running an ADLA job
      1. 10.4.1 Creating an ADLA linkedservice
      2. 10.4.2 Creating a pipeline and activity
    5. 10.5 Exercises
      1. 10.5.1 Exercise 1
      2. 10.5.2 Exercise 2
    6. Summary
  17. 11 Managed SQL with Azure SQL Database
    1. 11.1 Creating an Azure SQL Database
      1. 11.1.1 Creating a SQL Server and SQLDB
    2. 11.2 Securing SQLDB
    3. 11.3 Availability and recovery
      1. 11.3.1 Restoring and moving SQLDB
      2. 11.3.2 Database safeguards
      3. 11.3.3 Creating alerts for SQLDB
    4. 11.4 Optimizing costs for SQLDB
      1. 11.4.1 Pricing structure
      2. 11.4.2 Scaling SQLDB
      3. 11.4.3 Serverless
      4. 11.4.4 Elastic Pools
    5. 11.5 Exercises
      1. 11.5.1 Exercise 1
      2. 11.5.2 Exercise 2
      3. 11.5.3 Exercise 3
      4. 11.5.4 Exercise 4
    6. Summary
  18. 12 Integrating Data Factory with SQL Database
    1. 12.1 Before you begin
    2. 12.2 Importing data with external data sources
      1. 12.2.1 Creating a database scoped credential
      2. 12.2.2 Creating an external data source
      3. 12.2.3 Creating an external table
      4. 12.2.4 Importing Blob files
    3. 12.3 Importing file data with ADF
      1. 12.3.1 Authenticating between ADF and SQLDB
      2. 12.3.2 Creating SQL Database linkedservice
      3. 12.3.3 Creating datasets
      4. 12.3.4 Creating a copy activity and pipeline
    4. 12.4 Exercises
      1. 12.4.1 Exercise 1
      2. 12.4.2 Exercise 2
      3. 12.4.3 Exercise 3
    5. Summary
  19. 13 Where to go next
    1. 13.1 Data catalog
      1. 13.1.1 Data Catalog as a service
      2. 13.1.2 Data locations
      3. 13.1.3 Data definitions
      4. 13.1.4 Data frequency
      5. 13.1.5 Business drivers
    2. 13.2 Version control and backups
      1. 13.2.1 Blob Storage
      2. 13.2.2 Data Lake Storage
      3. 13.2.3 Stream Analytics
      4. 13.2.4 Data Lake Analytics
      5. 13.2.5 Data Factory configuration files
      6. 13.2.6 SQL Database
    3. 13.3 Microsoft certifications
    4. 13.4 Signing off
    5. Summary
  20. appendix A. Setting up Azure services through PowerShell
    1. A.1 Setting up Azure PowerShell
    2. A.2 Creating a subscription
    3. A.3 Azure naming conventions
    4. A.4 Setting up common Azure resources using PowerShell
      1. A.4.1 Creating a new resource group
      2. A.4.2 Creating a new Azure Active Directory user
      3. A.4.3 Creating a new Azure Active Directory group
    5. A.5 Setting up Azure services using PowerShell
      1. A.5.1 Creating a new Storage account
      2. A.5.2 Creating a new Data Lake store
      3. A.5.3 Creating a new Event Hub
      4. A.5.4 Creating a new Stream Analytics job
      5. A.5.5 Creating a new Data Lake Analytics account
      6. A.5.6 Creating a new SQL Server and Database
      7. A.5.7 Creating a new Data Factory service
      8. A.5.8 Creating a new App registration
      9. A.5.9 Creating a new key vault
      10. A.5.10 Creating a new SQL Server and Database with lookup data
  21. appendix B. Configuring the Jonestown Sluggers analytics system
    1. B.1 Solution design
      1. B.1.1 Hot path
      2. B.1.2 Cold path
    2. B.2 Naming convention
    3. B.3 Creation script
    4. B.4 Configure Azure services using PowerShell
      1. B.4.1 Stream Analytics Managed Identity
      2. B.4.2 Data Lake store
      3. B.4.3 Stream Analytics job configuration
      4. B.4.4 SQL Database
      5. B.4.5 Data Factory
    5. B.5 Load event data
    6. B.6 Output of batch and stream processing
    7. B.7 Removing services
  22. index