The Artificial Intelligence Infrastructure Workshop

Copyright © 2020 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Chinmay Arankalle, Gareth Dwyer, Bas Geerdink, Kunal Gera, Kevin Liao, and Anand N.S.

Reviewers: Brent Broadnax, John Wesley Doyle, Tim Hoolihan, Rochit Jain, Sasikanth Kotti, Asheesh Mehta, Arunkumar Nair, Madhav Pandya, Ashish Patel, Shovon Sengupta, and Ashish Tulsankar

Managing Editor: Ashish James

Acquisitions Editors: Manuraj Nair, Royluis Rodrigues, Kunal Sawant, Anindya Sil, Archie Vankar, and Karan Wadekar

Production Editor: Shantanu Zagade

Editorial Board: Megan Carlisle, Samuel Christa, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Dominic Pereira, Shiny Poojary, Abhishek Rane, Brendan Rodrigues, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First published: August 2020

Production reference: 1130820

ISBN: 978-1-80020-984-8

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface   i

1. Data Storage Fundamentals   1

Introduction   2

Problems Solved by Machine Learning   3

Image Processing – Detecting Cancer in Mammograms with Computer Vision    3

Text and Language Processing – Google Translate   4

Audio Processing – Automatically Generated Subtitles   6

Time Series Analysis   7

Optimizing the Storing and Processing of Data for Machine Learning Problems   7

Diving into Text Classification   8

Looking at TF-IDF Vectorization    9

Looking at Terminology in Text Classification Tasks   11

Exercise 1.01: Training a Machine Learning Model to Identify Clickbait Headlines   12

Designing for Scale – Choosing the Right Architecture and Hardware   21

Optimizing Hardware – Processing Power, Volatile Memory, and Persistent Storage   21

Optimizing Volatile Memory   23

Optimizing Persistent Storage   24

Optimizing Cloud Costs – Spot Instances and Reserved Instances   25

Using Vectorized Operations to Analyze Data Fast   26

Exercise 1.02: Applying Vectorized Operations to Entire Matrices   27

Activity 1.01: Creating a Text Classifier for Movie Reviews   33

Summary   36

2. Artificial Intelligence Storage Requirements   39

Introduction   40

Storage Requirements   41

The Three Stages of Digital Data   43

Data Layers   44

From Data Warehouse to Data Lake    45

Exercise 2.01: Designing a Layered Architecture for an AI System   47

Requirements per Infrastructure Layer   49

Raw Data   50

Security   50

Basic Protection   50

The AIC Rating   51

Role-Based Access   52

Encryption   53

Exercise 2.02: Defining the Security Requirements for Storing Raw Data   54

Scalability   55

Time Travel   56

Retention   57

Metadata and Lineage    59

Historical Data   60

Security   60

Scalability   61

Availability   61

Exercise 2.03: Analyzing the Availability of a Data Store   62

Availability Consequences   63

Time Travel   65

Locality of Data   67

Metadata and Lineage   67

Streaming Data   68

Security   68

Performance   69

Availability   69

Retention   70

Exercise 2.04: Setting the Requirements for Data Retention   71

Analytics Data   72

Performance   72

Cost-Efficiency   73

Quality   73

Model Development and Training   74

Security   75

Availability   75

Retention   75

Activity 2.01: Requirements Engineering for a Data-Driven Application   76

Summary   77

3. Data Preparation   79

Introduction   80

ETL   80

Data Processing Techniques   82

Exercise 3.01: Creating a Simple ETL Bash Script   83

Traditional ETL with Dedicated Tooling   93

Distributed, Parallel Processing with Apache Spark   94

Exercise 3.02: Building an ETL Job Using Spark   95

Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages   102

Source to Raw: Importing Data from Source Systems   104

Raw to Historical: Cleaning Data   105

Raw to Historical: Modeling Data   106

Historical to Analytics: Filtering and Aggregating Data   107

Historical to Analytics: Flattening Data   107

Analytics to Model: Feature Engineering   107

Analytics to Model: Splitting Data   109

Streaming Data   110

Windows   110

Event Time   112

Late Events and Watermarks   113

Exercise 3.03: Streaming Data Processing with Spark   114

Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics   123

Summary   125

4. The Ethics of AI Data Storage   127

Introduction   128

Case Study 1: Cambridge Analytica   130

Summary and Takeaways    135

Case Study 2: Amazon's AI Recruiting Tool   136

Imbalanced Training Sets   136

Summary and Takeaways   139

Case Study 3: COMPAS Software   140

Summary and Takeaways   142

Finding Built-In Bias in Machine Learning Models   143

Exercise 4.01: Observing Prejudices and Biases in Word Embeddings   145

Exercise 4.02: Testing Our Sentiment Classifier on Movie Reviews   151

Activity 4.01: Finding More Latent Prejudices   158

Summary   160

5. Data Stores: SQL and NoSQL Databases   163

Introduction   164

Database Components   165

SQL Databases   166

MySQL   167

Advantages of MySQL   167

Disadvantages of MySQL   167

Query Language   167

Terminology   168

Data Definition Language (DDL)   168

Data Manipulation Language (DML)   169

Data Control Language (DCL)   170

Transaction Control Language (TCL)   171

Data Retrieval   172

SQL Constraints   176

Exercise 5.01: Building a Relational Database for the FashionMart Store   179

Data Modeling   186

Normalization   187

Dimensional Data Modeling   190

Performance Tuning and Best Practices   193

Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query   194

NoSQL Databases   198

Need for NoSQL   199

Consistency Availability Partitioning (CAP) Theorem   200

MongoDB   201

Advantages of MongoDB   201

Disadvantages of MongoDB   202

Query Language   202

Terminology   202

Exercise 5.02: Managing the Inventory of an E-Commerce Website Using a MongoDB Query   210

Data Modeling   217

Lack of Joins   217

Joins   219

Performance Tuning and Best Practices   221

Activity 5.02: Data Model to Capture User Information   221

Cassandra   226

Advantages of Cassandra   226

Disadvantages of Cassandra   227

Dealing with Denormalizations in Cassandra   227

Query Language   228

Terminology   228

Exercise 5.03: Managing Visitors of an E-Commerce Site Using Cassandra   231

Data Modeling   237

Column Family Design   238

Distributing Data Evenly across Clusters   239

Considering Write-Heavy Scenarios   239

Performance Tuning and Best Practices   240

Activity 5.03: Managing Customer Feedback Using Cassandra   240

Exploring the Collective Knowledge of Databases   242

Summary   245

6. Big Data File Formats   247

Introduction   248

Common Input Files   248

CSV – Comma-Separated Values   249

JSON – JavaScript Object Notation   249

Choosing the Right Format for Your Data   250

Orientation – Row-Based or Column-Based   251

Row-Based   251

Column-Based   252

Partitions   253

Schema Evolution   254

Compression   254

Introduction to File Formats   255

Parquet   255

Exercise 6.01: Converting CSV and JSON Files into the Parquet Format   260

Avro   266

Exercise 6.02: Converting CSV and JSON Files into the Avro Format   267

ORC   274

Exercise 6.03: Converting CSV and JSON Files into the ORC Format   276

Query Performance   282

Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs   284

Summary   285

7. Introduction to Analytics Engine (Spark) for Big Data   287

Introduction   288

Apache Spark   289

Fundamentals and Terminology   290

How Does Spark Work?   294

Apache Spark and Databricks   295

Exercise 7.01: Creating Your Databricks Notebook   296

Understanding Various Spark Transformations   304

Exercise 7.02: Applying Spark Transformations to Analyze the Temperature in California   306

Understanding Various Spark Actions   311

Spark Pipeline   312

Exercise 7.03: Applying Spark Actions to the Gettysburg Address   313

Activity 7.01: Exploring and Processing a Movie Locations Database Using Transformations and Actions   319

Best Practices   321

Summary   322

8. Data System Design Examples   325

Introduction   326

The Importance of System Design   327

Components to Consider in System Design   328

Features   328

Hardware   329

Data   329

Architecture   330

Security   330

Scaling   331

Examining a Pipeline Design for an AI System   331

Reproducibility – How Pipelines Can Help Us Keep Track of Each Component   334

Exercise 8.01: Designing an Automatic Trading System    334

Making a Pipeline System Highly Available   342

Exercise 8.02: Adding Queues to a System to Make It Highly Available   344

Activity 8.01: Building the Complete System with Pipelines and Queues   348

Summary   350

9. Workflow Management for AI   353

Introduction   354

Creating Your Data Pipeline   355

Exercise 9.01: Implementing a Linear Pipeline to Get the Top 10 Trending Videos   357

Exercise 9.02: Creating a Nonlinear Pipeline to Get the Daily Top 10 Trending Video Categories   363

Challenges in Managing Processes in the Real World   372

Automation   372

Failure Handling   373

Retry Mechanism   373

Exercise 9.03: Creating a Multi-Stage Data Pipeline   375

Automating a Data Pipeline   382

Exercise 9.04: Automating a Multi-Stage Data Pipeline Using a Bash Script   382

Automating Asynchronous Data Pipelines   385

Exercise 9.05: Automating an Asynchronous Data Pipeline    388

Workflow Management with Airflow   392

Exercise 9.06: Creating a DAG for Our Data Pipeline Using Airflow   394

Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category   405

Summary   408

10. Introduction to Data Storage on Cloud Services (AWS)   411

Introduction   412

Interacting with Cloud Storage   413

Exercise 10.01: Uploading a File to an AWS S3 Bucket Using AWS CLI   416

Exercise 10.02: Copying Data from One Bucket to Another Bucket    421

Exercise 10.03: Downloading Data from Your S3 Bucket   423

Exercise 10.04: Creating a Pipeline Using AWS SDK Boto3 and Uploading the Result to S3   425

Getting Started with Cloud Relational Databases   430

Exercise 10.05: Creating an AWS RDS Instance via the AWS Console   431

Exercise 10.06: Accessing and Managing the AWS RDS Instance   442

Introduction to NoSQL Data Stores on the Cloud   450

Key-Value Data Stores   452

Document Data Stores   452

Columnar Data Store   453

Graph Data Store   454

Data in Document Format   455

Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage   456

Summary   458

11. Building an Artificial Intelligence Algorithm   461

Introduction   462

Machine Learning Algorithms   462

Model Training   463

Closed-Form Solution   463

Non-Closed-Form Solutions   464

Gradient Descent   464

Exercise 11.01: Implementing a Gradient Descent Algorithm in NumPy   467

Getting Started with PyTorch   478

Exercise 11.02: Gradient Descent with PyTorch   481

Mini-Batch SGD with PyTorch   488

Exercise 11.03: Implementing Mini-Batch SGD with PyTorch   492

Building a Reinforcement Learning Algorithm to Play a Game   500

Exercise 11.04: Implementing a Deep Q-Learning Algorithm in PyTorch to Solve the Classic Cart Pole Problem   506

Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem   513

Summary   516

12. Productionizing Your AI Applications   519

Introduction   520

pickle and Flask   521

Exercise 12.01: Creating a Machine Learning Model API with pickle and Flask That Predicts Survivors of the Titanic   522

Activity 12.01: Predicting the Class of a Passenger on the Titanic   536

Deploying Models to Production   537

Docker   538

Kubernetes   539

Exercise 12.02: Deploying a Dockerized Machine Learning API to a Kubernetes Cluster   541

Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers   555

Model Execution in Streaming Data Applications   557

PMML   558

Apache Flink   559

Exercise 12.03: Exporting a Model to PMML and Loading it in the Flink Stream Processing Engine for Real-time Execution   559

Activity 12.03: Predicting the Class of Titanic Passengers in Real Time   572

Summary   575

Appendix   579
