The Artificial Intelligence Infrastructure Workshop

Copyright © 2020 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Chinmay Arankalle, Gareth Dwyer, Bas Geerdink, Kunal Gera, Kevin Liao, and Anand N.S.

Reviewers: Brent Broadnax, John Wesley Doyle, Tim Hoolihan, Rochit Jain, Sasikanth Kotti, Asheesh Mehta, Arunkumar Nair, Madhav Pandya, Ashish Patel, Shovon Sengupta, and Ashish Tulsankar

Managing Editor: Ashish James

Acquisitions Editors: Manuraj Nair, Royluis Rodrigues, Kunal Sawant, Anindya Sil, Archie Vankar, and Karan Wadekar

Production Editor: Shantanu Zagade

Editorial Board: Megan Carlisle, Samuel Christa, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Dominic Pereira, Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First published: August 2020

Production reference: 1130820

ISBN: 978-1-80020-984-8

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface   i

1. Data Storage Fundamentals   1

Introduction   2

Problems Solved by Machine Learning   3

Image Processing – Detecting Cancer in Mammograms with Computer Vision    3

Text and Language Processing – Google Translate   4

Audio Processing – Automatically Generated Subtitles   6

Time Series Analysis   7

Optimizing the Storing and Processing of Data for Machine Learning Problems   7

Diving into Text Classification   8

Looking at TF-IDF Vectorization    9

Looking at Terminology in Text Classification Tasks   11

Exercise 1.01: Training a Machine Learning Model to Identify Clickbait Headlines   12

Designing for Scale – Choosing the Right Architecture and Hardware   21

Optimizing Hardware – Processing Power, Volatile Memory, and Persistent Storage   21

Optimizing Volatile Memory   23

Optimizing Persistent Storage   24

Optimizing Cloud Costs – Spot Instances and Reserved Instances   25

Using Vectorized Operations to Analyze Data Fast   26

Exercise 1.02: Applying Vectorized Operations to Entire Matrices   27

Activity 1.01: Creating a Text Classifier for Movie Reviews   33

Summary   36

2. Artificial Intelligence Storage Requirements   39

Introduction   40

Storage Requirements   41

The Three Stages of Digital Data   43

Data Layers   44

From Data Warehouse to Data Lake    45

Exercise 2.01: Designing a Layered Architecture for an AI System   47

Requirements per Infrastructure Layer   49

Raw Data   50

Security   50

Basic Protection   50

The AIC Rating   51

Role-Based Access   52

Encryption   53

Exercise 2.02: Defining the Security Requirements for Storing Raw Data   54

Scalability   55

Time Travel   56

Retention   57

Metadata and Lineage    59

Historical Data   60

Security   60

Scalability   61

Availability   61

Exercise 2.03: Analyzing the Availability of a Data Store   62

Availability Consequences   63

Time Travel   65

Locality of Data   67

Metadata and Lineage   67

Streaming Data   68

Security   68

Performance   69

Availability   69

Retention   70

Exercise 2.04: Setting the Requirements for Data Retention   71

Analytics Data   72

Performance   72

Cost-Efficiency   73

Quality   73

Model Development and Training   74

Security   75

Availability   75

Retention   75

Activity 2.01: Requirements Engineering for a Data-Driven Application   76

Summary   77

3. Data Preparation   79

Introduction   80

ETL   80

Data Processing Techniques   82

Exercise 3.01: Creating a Simple ETL Bash Script   83

Traditional ETL with Dedicated Tooling   93

Distributed, Parallel Processing with Apache Spark   94

Exercise 3.02: Building an ETL Job Using Spark   95

Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages   102

Source to Raw: Importing Data from Source Systems   104

Raw to Historical: Cleaning Data   105

Raw to Historical: Modeling Data   106

Historical to Analytics: Filtering and Aggregating Data   107

Historical to Analytics: Flattening Data   107

Analytics to Model: Feature Engineering   107

Analytics to Model: Splitting Data   109

Streaming Data   110

Windows   110

Event Time   112

Late Events and Watermarks   113

Exercise 3.03: Streaming Data Processing with Spark   114

Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics   123

Summary   125

4. The Ethics of AI Data Storage   127

Introduction   128

Case Study 1: Cambridge Analytica   130

Summary and Takeaways    135

Case Study 2: Amazon's AI Recruiting Tool   136

Imbalanced Training Sets   136

Summary and Takeaways   139

Case Study 3: COMPAS Software   140

Summary and Takeaways   142

Finding Built-In Bias in Machine Learning Models   143

Exercise 4.01: Observing Prejudices and Biases in Word Embeddings   145

Exercise 4.02: Testing Our Sentiment Classifier on Movie Reviews   151

Activity 4.01: Finding More Latent Prejudices   158

Summary   160

5. Data Stores: SQL and NoSQL Databases   163

Introduction   164

Database Components   165

SQL Databases   166

MySQL   167

Advantages of MySQL   167

Disadvantages of MySQL   167

Query Language   167

Terminology   168

Data Definition Language (DDL)   168

Data Manipulation Language (DML)   169

Data Control Language (DCL)   170

Transaction Control Language (TCL)   171

Data Retrieval   172

SQL Constraints   176

Exercise 5.01: Building a Relational Database for the FashionMart Store   179

Data Modeling   186

Normalization   187

Dimensional Data Modeling   190

Performance Tuning and Best Practices   193

Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query   194

NoSQL Databases   198

Need for NoSQL   199

Consistency Availability Partitioning (CAP) Theorem   200

MongoDB   201

Advantages of MongoDB   201

Disadvantages of MongoDB   202

Query Language   202

Terminology   202

Exercise 5.02: Managing the Inventory of an E-Commerce Website Using a MongoDB Query   210

Data Modeling   217

Lack of Joins   217

Joins   219

Performance Tuning and Best Practices   221

Activity 5.02: Data Model to Capture User Information   221

Cassandra   226

Advantages of Cassandra   226

Disadvantages of Cassandra   227

Dealing with Denormalizations in Cassandra   227

Query Language   228

Terminology   228

Exercise 5.03: Managing Visitors of an E-Commerce Site Using Cassandra   231

Data Modeling   237

Column Family Design   238

Distributing Data Evenly across Clusters   239

Considering Write-Heavy Scenarios   239

Performance Tuning and Best Practices   240

Activity 5.03: Managing Customer Feedback Using Cassandra   240

Exploring the Collective Knowledge of Databases   242

Summary   245

6. Big Data File Formats   247

Introduction   248

Common Input Files   248

CSV – Comma-Separated Values   249

JSON – JavaScript Object Notation   249

Choosing the Right Format for Your Data   250

Orientation – Row-Based or Column-Based   251

Row-Based   251

Column-Based   252

Partitions   253

Schema Evolution   254

Compression   254

Introduction to File Formats   255

Parquet   255

Exercise 6.01: Converting CSV and JSON Files into the Parquet Format   260

Avro   266

Exercise 6.02: Converting CSV and JSON Files into the Avro Format   267

ORC   274

Exercise 6.03: Converting CSV and JSON Files into the ORC Format   276

Query Performance   282

Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs   284

Summary   285

7. Introduction to Analytics Engine (Spark) for Big Data   287

Introduction   288

Apache Spark   289

Fundamentals and Terminology   290

How Does Spark Work?   294

Apache Spark and Databricks   295

Exercise 7.01: Creating Your Databricks Notebook   296

Understanding Various Spark Transformations   304

Exercise 7.02: Applying Spark Transformations to Analyze the Temperature in California   306

Understanding Various Spark Actions   311

Spark Pipeline   312

Exercise 7.03: Applying Spark Actions to the Gettysburg Address   313

Activity 7.01: Exploring and Processing a Movie Locations Database Using Transformations and Actions   319

Best Practices   321

Summary   322

8. Data System Design Examples   325

Introduction   326

The Importance of System Design   327

Components to Consider in System Design   328

Features   328

Hardware   329

Data   329

Architecture   330

Security   330

Scaling   331

Examining a Pipeline Design for an AI System   331

Reproducibility – How Pipelines Can Help Us Keep Track of Each Component   334

Exercise 8.01: Designing an Automatic Trading System    334

Making a Pipeline System Highly Available   342

Exercise 8.02: Adding Queues to a System to Make It Highly Available   344

Activity 8.01: Building the Complete System with Pipelines and Queues   348

Summary   350

9. Workflow Management for AI   353

Introduction   354

Creating Your Data Pipeline   355

Exercise 9.01: Implementing a Linear Pipeline to Get the Top 10 Trending Videos   357

Exercise 9.02: Creating a Nonlinear Pipeline to Get the Daily Top 10 Trending Video Categories   363

Challenges in Managing Processes in the Real World   372

Automation   372

Failure Handling   373

Retry Mechanism   373

Exercise 9.03: Creating a Multi-Stage Data Pipeline   375

Automating a Data Pipeline   382

Exercise 9.04: Automating a Multi-Stage Data Pipeline Using a Bash Script   382

Automating Asynchronous Data Pipelines   385

Exercise 9.05: Automating an Asynchronous Data Pipeline    388

Workflow Management with Airflow   392

Exercise 9.06: Creating a DAG for Our Data Pipeline Using Airflow   394

Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category   405

Summary   408

10. Introduction to Data Storage on Cloud Services (AWS)   411

Introduction   412

Interacting with Cloud Storage   413

Exercise 10.01: Uploading a File to an AWS S3 Bucket Using AWS CLI   416

Exercise 10.02: Copying Data from One Bucket to Another Bucket    421

Exercise 10.03: Downloading Data from Your S3 Bucket   423

Exercise 10.04: Creating a Pipeline Using AWS SDK Boto3 and Uploading the Result to S3   425

Getting Started with Cloud Relational Databases   430

Exercise 10.05: Creating an AWS RDS Instance via the AWS Console   431

Exercise 10.06: Accessing and Managing the AWS RDS Instance   442

Introduction to NoSQL Data Stores on the Cloud   450

Key-Value Data Stores   452

Document Data Stores   452

Columnar Data Store   453

Graph Data Store   454

Data in Document Format   455

Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage   456

Summary   458

11. Building an Artificial Intelligence Algorithm   461

Introduction   462

Machine Learning Algorithms   462

Model Training   463

Closed-Form Solution   463

Non-Closed-Form Solutions   464

Gradient Descent   464

Exercise 11.01: Implementing a Gradient Descent Algorithm in NumPy   467

Getting Started with PyTorch   478

Exercise 11.02: Gradient Descent with PyTorch   481

Mini-Batch SGD with PyTorch   488

Exercise 11.03: Implementing Mini-Batch SGD with PyTorch   492

Building a Reinforcement Learning Algorithm to Play a Game   500

Exercise 11.04: Implementing a Deep Q-Learning Algorithm in PyTorch to Solve the Classic Cart Pole Problem   506

Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem   513

Summary   516

12. Productionizing Your AI Applications   519

Introduction   520

pickle and Flask   521

Exercise 12.01: Creating a Machine Learning Model API with pickle and Flask That Predicts Survivors of the Titanic   522

Activity 12.01: Predicting the Class of a Passenger on the Titanic   536

Deploying Models to Production   537

Docker   538

Kubernetes   539

Exercise 12.02: Deploying a Dockerized Machine Learning API to a Kubernetes Cluster   541

Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers   555

Model Execution in Streaming Data Applications   557

PMML   558

Apache Flink   559

Exercise 12.03: Exporting a Model to PMML and Loading it in the Flink Stream Processing Engine for Real-time Execution   559

Activity 12.03: Predicting the Class of Titanic Passengers in Real Time   572

Summary   575

Appendix   579
