Book Description

Explore how a data storage system works – from data ingestion to representation

Key Features

  • Understand how artificial intelligence, machine learning, and deep learning are different from one another
  • Discover the data storage requirements of different AI apps using case studies
  • Explore popular data solutions such as Hadoop Distributed File System (HDFS) and Amazon Simple Storage Service (S3)

Book Description

Social networking sites see an average of 350 million uploads daily, a quantity impossible for humans to scan and analyze. Only AI can do this job at the required speed, and to use an AI application to its full potential, you need an efficient and scalable data storage pipeline. The Artificial Intelligence Infrastructure Workshop will teach you how to build and manage one.

The Artificial Intelligence Infrastructure Workshop begins by taking you through some real-world applications of AI. You'll explore the layers of a data lake and get to grips with security, scalability, and maintainability. With the help of hands-on exercises, you'll learn how to define the requirements for AI applications in your organization. This AI book will show you how to select a database for your system and run common queries on databases such as MySQL, MongoDB, and Cassandra. You'll also design your own AI trading system to get a feel for the pipeline-based architecture. As you learn to implement a deep Q-learning algorithm to play the CartPole game, you'll gain hands-on experience with PyTorch. Finally, you'll explore ways to run machine learning models in production as part of an AI application.

By the end of the book, you'll have learned how to build and deploy your own AI software at scale, using various tools, API frameworks, and serialization methods.

What you will learn

  • Get to grips with the fundamentals of artificial intelligence
  • Understand the importance of data storage and architecture in AI applications
  • Build data storage and workflow management systems with open source tools
  • Containerize your AI applications with tools such as Docker
  • Discover commonly used data storage solutions and best practices for AI on Amazon Web Services (AWS)
  • Use the AWS CLI and AWS SDK to perform common data tasks

Who this book is for

If you are looking to develop the data storage skills needed for machine learning and AI and want to learn AI best practices in data engineering, this workshop is for you. Experienced programmers can use this book to advance their careers in AI. Familiarity with programming, along with knowledge of exploratory data analysis and of reading and writing files using Python, will help you understand the key concepts covered.

Table of Contents

  1. The Artificial Intelligence Infrastructure Workshop
  2. Preface
    1. About the Book
      1. Audience
      2. About the Chapters
      3. Conventions
      4. Code Presentation
      5. Setting up Your Environment
      6. Installing Anaconda
      7. Installing Scikit-Learn
      8. Installing gawk
      9. Installing Apache Spark
      10. Installing PySpark
      11. Installing Tweepy
      12. Installing spaCy
      13. Installing MySQL
      14. Installing MongoDB
      15. Installing Cassandra
      16. Installing Apache Spark and Scala
      17. Installing Airflow
      18. Installing AWS
      19. Registering Your AWS Account
      20. Creating an IAM Role for Programmatic AWS Access
      21. Installing the AWS CLI
      22. Installing an AWS Python SDK – Boto
      23. Installing MySQL Client
      24. Installing pytest
      25. Installing Moto
      26. Installing PyTorch
      27. Installing Gym
      28. Installing Docker
      29. Kubernetes – Minikube
      30. Installing Maven
      31. Installing JDK
      32. Installing Netcat
      33. Installing Libraries
      34. Accessing the Code Files
  3. 1. Data Storage Fundamentals
    1. Introduction
    2. Problems Solved by Machine Learning
      1. Image Processing – Detecting Cancer in Mammograms with Computer Vision
      2. Text and Language Processing – Google Translate
      3. Audio Processing – Automatically Generated Subtitles
      4. Time Series Analysis
    3. Optimizing the Storing and Processing of Data for Machine Learning Problems
    4. Diving into Text Classification
      1. Looking at TF-IDF Vectorization
    5. Looking at Terminology in Text Classification Tasks
      1. Exercise 1.01: Training a Machine Learning Model to Identify Clickbait Headlines
    6. Designing for Scale – Choosing the Right Architecture and Hardware
      1. Optimizing Hardware – Processing Power, Volatile Memory, and Persistent Storage
      2. Optimizing Volatile Memory
      3. Optimizing Persistent Storage
      4. Optimizing Cloud Costs – Spot Instances and Reserved Instances
    7. Using Vectorized Operations to Analyze Data Fast
      1. Exercise 1.02: Applying Vectorized Operations to Entire Matrices
      2. Activity 1.01: Creating a Text Classifier for Movie Reviews
    8. Summary
  4. 2. Artificial Intelligence Storage Requirements
    1. Introduction
    2. Storage Requirements
      1. The Three Stages of Digital Data
    3. Data Layers
      1. From Data Warehouse to Data Lake
      2. Exercise 2.01: Designing a Layered Architecture for an AI System
      3. Requirements per Infrastructure Layer
    4. Raw Data
      1. Security
        1. Basic Protection
        2. The AIC Rating
        3. Role-Based Access
        4. Encryption
      2. Exercise 2.02: Defining the Security Requirements for Storing Raw Data
      3. Scalability
      4. Time Travel
      5. Retention
      6. Metadata and Lineage
    5. Historical Data
      1. Security
      2. Scalability
      3. Availability
      4. Exercise 2.03: Analyzing the Availability of a Data Store
        1. Availability Consequences
      5. Time Travel
      6. Locality of Data
      7. Metadata and Lineage
    6. Streaming Data
      1. Security
      2. Performance
      3. Availability
      4. Retention
      5. Exercise 2.04: Setting the Requirements for Data Retention
    7. Analytics Data
      1. Performance
      2. Cost-Efficiency
      3. Quality
    8. Model Development and Training
      1. Security
      2. Availability
      3. Retention
      4. Activity 2.01: Requirements Engineering for a Data-Driven Application
    9. Summary
  5. 3. Data Preparation
    1. Introduction
    2. ETL
    3. Data Processing Techniques
      1. Exercise 3.01: Creating a Simple ETL Bash Script
      2. Traditional ETL with Dedicated Tooling
      3. Distributed, Parallel Processing with Apache Spark
      4. Exercise 3.02: Building an ETL Job Using Spark
      5. Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
      6. Source to Raw: Importing Data from Source Systems
      7. Raw to Historical: Cleaning Data
      8. Raw to Historical: Modeling Data
      9. Historical to Analytics: Filtering and Aggregating Data
      10. Historical to Analytics: Flattening Data
      11. Analytics to Model: Feature Engineering
      12. Analytics to Model: Splitting Data
    4. Streaming Data
      1. Windows
        1. Event Time
        2. Late Events and Watermarks
      2. Exercise 3.03: Streaming Data Processing with Spark
      3. Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics
    5. Summary
  6. 4. The Ethics of AI Data Storage
    1. Introduction
      1. Case Study 1: Cambridge Analytica
      2. Summary and Takeaways
      3. Case Study 2: Amazon's AI Recruiting Tool
      4. Imbalanced Training Sets
      5. Summary and Takeaways
      6. Case Study 3: COMPAS Software
      7. Summary and Takeaways
      8. Finding Built-In Bias in Machine Learning Models
      9. Exercise 4.01: Observing Prejudices and Biases in Word Embeddings
      10. Exercise 4.02: Testing Our Sentiment Classifier on Movie Reviews
      11. Activity 4.01: Finding More Latent Prejudices
    2. Summary
  7. 5. Data Stores: SQL and NoSQL Databases
    1. Introduction
    2. Database Components
    3. SQL Databases
    4. MySQL
      1. Advantages of MySQL
      2. Disadvantages of MySQL
      3. Query Language
        1. Terminology
        2. Data Definition Language (DDL)
        3. Data Manipulation Language (DML)
        4. Data Control Language (DCL)
        5. Transaction Control Language (TCL)
        6. Data Retrieval
        7. SQL Constraints
      4. Exercise 5.01: Building a Relational Database for the FashionMart Store
      5. Data Modeling
        1. Normalization
        2. Dimensional Data Modeling
      6. Performance Tuning and Best Practices
      7. Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query
    5. NoSQL Databases
      1. Need for NoSQL
      2. Consistency Availability Partitioning (CAP) Theorem
    6. MongoDB
      1. Advantages of MongoDB
      2. Disadvantages of MongoDB
      3. Query Language
        1. Terminology
      4. Exercise 5.02: Managing the Inventory of an E-Commerce Website Using a MongoDB Query
      5. Data Modeling
        1. Lack of Joins
        2. Joins
      6. Performance Tuning and Best Practices
      7. Activity 5.02: Data Model to Capture User Information
    7. Cassandra
      1. Advantages of Cassandra
      2. Disadvantages of Cassandra
      3. Dealing with Denormalizations in Cassandra
      4. Query Language
        1. Terminology
      5. Exercise 5.03: Managing Visitors of an E-Commerce Site Using Cassandra
      6. Data Modeling
        1. Column Family Design
        2. Distributing Data Evenly across Clusters
        3. Considering Write-Heavy Scenarios
      7. Performance Tuning and Best Practices
      8. Activity 5.03: Managing Customer Feedback Using Cassandra
    8. Exploring the Collective Knowledge of Databases
    9. Summary
  8. 6. Big Data File Formats
    1. Introduction
    2. Common Input Files
      1. CSV – Comma-Separated Values
      2. JSON – JavaScript Object Notation
    3. Choosing the Right Format for Your Data
      1. Orientation – Row-Based or Column-Based
      2. Row-Based
      3. Column-Based
      4. Partitions
      5. Schema Evolution
      6. Compression
    4. Introduction to File Formats
      1. Parquet
      2. Exercise 6.01: Converting CSV and JSON Files into the Parquet Format
      3. Avro
      4. Exercise 6.02: Converting CSV and JSON Files into the Avro Format
      5. ORC
      6. Exercise 6.03: Converting CSV and JSON Files into the ORC Format
      7. Query Performance
      8. Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs
    5. Summary
  9. 7. Introduction to Analytics Engine (Spark) for Big Data
    1. Introduction
    2. Apache Spark
      1. Fundamentals and Terminology
      2. How Does Spark Work?
    3. Apache Spark and Databricks
      1. Exercise 7.01: Creating Your Databricks Notebook
    4. Understanding Various Spark Transformations
      1. Exercise 7.02: Applying Spark Transformations to Analyze the Temperature in California
    5. Understanding Various Spark Actions
      1. Spark Pipeline
      2. Exercise 7.03: Applying Spark Actions to the Gettysburg Address
      3. Activity 7.01: Exploring and Processing a Movie Locations Database Using Transformations and Actions
    6. Best Practices
    7. Summary
  10. 8. Data System Design Examples
    1. Introduction
    2. The Importance of System Design
    3. Components to Consider in System Design
      1. Features
      2. Hardware
      3. Data
      4. Architecture
      5. Security
      6. Scaling
    4. Examining a Pipeline Design for an AI System
      1. Reproducibility – How Pipelines Can Help Us Keep Track of Each Component
      2. Exercise 8.01: Designing an Automatic Trading System
    5. Making a Pipeline System Highly Available
      1. Exercise 8.02: Adding Queues to a System to Make It Highly Available
      2. Activity 8.01: Building the Complete System with Pipelines and Queues
    6. Summary
  11. 9. Workflow Management for AI
    1. Introduction
    2. Creating Your Data Pipeline
      1. Exercise 9.01: Implementing a Linear Pipeline to Get the Top 10 Trending Videos
      2. Exercise 9.02: Creating a Nonlinear Pipeline to Get the Daily Top 10 Trending Video Categories
    3. Challenges in Managing Processes in the Real World
      1. Automation
      2. Failure Handling
      3. Retry Mechanism
      4. Exercise 9.03: Creating a Multi-Stage Data Pipeline
    4. Automating a Data Pipeline
      1. Exercise 9.04: Automating a Multi-Stage Data Pipeline Using a Bash Script
    5. Automating Asynchronous Data Pipelines
      1. Exercise 9.05: Automating an Asynchronous Data Pipeline
    6. Workflow Management with Airflow
      1. Exercise 9.06: Creating a DAG for Our Data Pipeline Using Airflow
      2. Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category
    7. Summary
  12. 10. Introduction to Data Storage on Cloud Services (AWS)
    1. Introduction
    2. Interacting with Cloud Storage
      1. Exercise 10.01: Uploading a File to an AWS S3 Bucket Using AWS CLI
      2. Exercise 10.02: Copying Data from One Bucket to Another Bucket
      3. Exercise 10.03: Downloading Data from Your S3 Bucket
      4. Exercise 10.04: Creating a Pipeline Using AWS SDK Boto3 and Uploading the Result to S3
    3. Getting Started with Cloud Relational Databases
      1. Exercise 10.05: Creating an AWS RDS Instance via the AWS Console
      2. Exercise 10.06: Accessing and Managing the AWS RDS Instance
    4. Introduction to NoSQL Data Stores on the Cloud
      1. Key-Value Data Stores
      2. Document Data Stores
      3. Columnar Data Store
      4. Graph Data Store
    5. Data in Document Format
      1. Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage
    6. Summary
  13. 11. Building an Artificial Intelligence Algorithm
    1. Introduction
    2. Machine Learning Algorithms
    3. Model Training
      1. Closed-Form Solution
      2. Non-Closed-Form Solutions
    4. Gradient Descent
      1. Exercise 11.01: Implementing a Gradient Descent Algorithm in NumPy
    5. Getting Started with PyTorch
      1. Exercise 11.02: Gradient Descent with PyTorch
    6. Mini-Batch SGD with PyTorch
      1. Exercise 11.03: Implementing Mini-Batch SGD with PyTorch
      2. Building a Reinforcement Learning Algorithm to Play a Game
      3. Exercise 11.04: Implementing a Deep Q-Learning Algorithm in PyTorch to Solve the Classic Cart Pole Problem
      4. Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem
    7. Summary
  14. 12. Productionizing Your AI Applications
    1. Introduction
    2. pickle and Flask
      1. Exercise 12.01: Creating a Machine Learning Model API with pickle and Flask That Predicts Survivors of the Titanic
      2. Activity 12.01: Predicting the Class of a Passenger on the Titanic
    3. Deploying Models to Production
      1. Docker
      2. Kubernetes
      3. Exercise 12.02: Deploying a Dockerized Machine Learning API to a Kubernetes Cluster
      4. Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers
    4. Model Execution in Streaming Data Applications
      1. PMML
      2. Apache Flink
      3. Exercise 12.03: Exporting a Model to PMML and Loading it in the Flink Stream Processing Engine for Real-time Execution
      4. Activity 12.03: Predicting the Class of Titanic Passengers in Real Time
    5. Summary
  15. Appendix
    1. 1. Data Storage Fundamentals
      1. Activity 1.01: Creating a Text Classifier for Movie Reviews
    2. 2. Artificial Intelligence Storage Requirements
      1. Activity 2.01: Requirements Engineering for a Data-Driven Application
    3. 3. Data Preparation
      1. Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
      2. Activity 3.02: Counting the Words in a Twitter Data Stream to Determine the Trending Topics
    4. 4. The Ethics of AI Data Storage
      1. Activity 4.01: Finding More Latent Prejudices
    5. 5. Data Stores: SQL and NoSQL Databases
      1. Activity 5.01: Managing the Inventory of an E-Commerce Website Using a MySQL Query
      2. Activity 5.02: Data Model to Capture User Information
      3. Activity 5.03: Managing Customer Feedback Using Cassandra
    6. 6. Big Data File Formats
      1. Activity 6.01: Selecting an Appropriate Big Data File Format for Game Logs
    7. 7. Introduction to Analytics Engine (Spark) for Big Data
      1. Activity 7.01: Exploring and Processing a Movie Locations Database by Using Spark's Transformations and Actions
    8. 8. Data System Design Examples
      1. Activity 8.01: Building the Complete System with Pipelines and Queues
    9. 9. Workflow Management for AI
      1. Activity 9.01: Creating a DAG in Airflow to Calculate the Ratio of Likes-Dislikes for Each Category
    10. 10. Introduction to Data Storage on Cloud Services (AWS)
      1. Activity 10.01: Transforming a Table Schema into Document Format and Uploading It to Cloud Storage
    11. 11. Building an Artificial Intelligence Algorithm
      1. Activity 11.01: Implementing a Double Deep Q-Learning Algorithm to Solve the Cart Pole Problem
    12. 12. Productionizing Your AI Applications
      1. Activity 12.01: Predicting the Class of a Passenger on the Titanic
      2. Activity 12.02: Deploying a Machine Learning Model to a Kubernetes Cluster to Predict the Class of Titanic Passengers
      3. Activity 12.03: Predicting the Class of Titanic Passengers in Real Time