Chapter 5: Installing Pachyderm on a Cloud Platform

In the previous chapter, you learned the process for installing Pachyderm locally to get started quickly and test Pachyderm on your computer.

Production use cases require additional compute resources and scalability that can be efficiently achieved using cloud platforms and managed Kubernetes platform services provided by the popular cloud vendors. Pachyderm can run on a Kubernetes cluster, irrespective of whether it is deployed manually on cloud instances or as a managed Kubernetes service. We will discuss the most popular and easy-to-configure methods on cloud providers.

This chapter walks you through the cloud-based installation of Pachyderm and explains the software requirements needed to run a Pachyderm cluster in production. This chapter will cover the installation on the following most popular cloud platforms: Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), and Microsoft Azure Kubernetes Service (AKS).

In this chapter, we're going to cover the following main topics:

Installing the required tools
Deploying Pachyderm on Amazon EKS
Deploying Pachyderm on GKE
Deploying Pachyderm on Microsoft AKS
Accessing the Pachyderm console

Technical requirements

If you are on macOS, verify that you have an up-to-date version of macOS. If you are using Linux, you must be on 64-bit versions of recent distributions of CentOS, Fedora, or Ubuntu. If you are on Windows, run all the commands described in this section from Windows Subsystem for Linux (WSL). You should have the following tools installed from the previous chapters:

Homebrew (macOS only)
The Kubernetes Command-Line Interface (CLI) – kubectl
Helm
Pachyderm CLI – pachctl
WSL (Windows only)

We will need to install the following tools:

The Amazon Web Services (AWS) CLI – aws
The Amazon Identity and Access Management (AWS IAM) authenticator for Kubernetes
The EKS command-line tool – eksctl
The Google Cloud SDK – gcloud
The Azure CLI – az
A JSON processor – jq

We will go into the specifics regarding the installation and configuration of these tools as we go through this chapter. If you already know how to do this, you can go ahead and set them up now.

Installing the required tools

In this section, we will cover the installation of the system tools that we will use to prepare our environment before deploying a Kubernetes cluster and installing Pachyderm on cloud platforms.

Installing the AWS Command Line Interface to manage AWS

The AWS Command Line Interface, aws-cli, is required to execute commands in your AWS account. For additional information, you can refer to the AWS Command Line Interface official documentation at https://docs.aws.amazon.com/cli/latest/userguide/. Let's install aws-cli on your computer:

Execute the following commands to install aws-cli version 2 on your computer:

If you are using macOS:

$ curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"

$ sudo installer -pkg AWSCLIV2.pkg -target /

If you are on Linux (x86) or WSL on Windows:

$ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"

$ unzip awscliv2.zip

$ sudo ./aws/install

Execute the following command to verify the successful installation of the AWS CLI:
$ aws --version

The output of the preceding command should look as follows:

aws-cli/2.4.7 Python/3.8.8 Linux/5.11.0-41-generic exe/x86_64.ubuntu.20 prompt/off

Execute the following command to configure the AWS CLI, and enter your AWS access key ID and secret access key when prompted. The AWS CLI saves these values in a credentials file at ~/.aws/credentials and reuses them for the commands you run later:
$ aws configure

The output of the preceding command should look as follows:

AWS Access Key ID [None]: YOURACCESSKEYHERE

AWS Secret Access Key [None]: YOURSECRETACCESSKEYHERE

Default region name [None]: us-east-1

Default output format [None]: json

To use a different account from the one configured in the credentials file, set the following environment variables. They take precedence over the saved configuration until the end of your current shell session:
$ export AWS_ACCESS_KEY_ID=YOURACCESSKEY2HERE
$ export AWS_SECRET_ACCESS_KEY=YOURSECRETACCESS2KEYHERE
$ export AWS_DEFAULT_REGION=us-west-2
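
If you only need the alternate settings for a single command, a subshell keeps the override from leaking into the rest of your session. The following is a minimal sketch under that assumption; with_aws_region is a hypothetical helper name and is not part of the AWS CLI:

```shell
# Run one command with a temporary AWS_DEFAULT_REGION.
# The parentheses start a subshell, so the exported value
# disappears as soon as the command finishes.
with_aws_region() {
  (
    AWS_DEFAULT_REGION="$1"
    export AWS_DEFAULT_REGION
    shift
    "$@"
  )
}

# Example usage (requires configured AWS credentials):
#   with_aws_region us-west-2 aws s3api list-buckets
```

The same pattern works for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY if you want to scope a second account to one command.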

Now that you have installed the AWS Command Line Interface on your computer, let's install the AWS IAM authenticator for Kubernetes.

Installing the AWS IAM authenticator for Kubernetes

Amazon EKS leverages AWS IAM to provide access to the Kubernetes clusters created through EKS. To be able to make the kubectl command work with Amazon EKS IAM roles, the Amazon IAM authenticator for Kubernetes needs to be installed. Let's install the IAM authenticator on your computer:

Execute the following commands to install aws-iam-authenticator:

If you are using macOS:

$ brew install aws-iam-authenticator

If you are using Linux (x86) or WSL on Windows:

$ curl -o aws-iam-authenticator https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/aws-iam-authenticator

$ chmod +x ./aws-iam-authenticator

$ mkdir -p $HOME/bin && cp ./aws-iam-authenticator $HOME/bin/aws-iam-authenticator && export PATH=$PATH:$HOME/bin

$ echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc

Verify its version and make sure that aws-iam-authenticator is installed:
$ aws-iam-authenticator version

The output of the preceding command should look as follows. To be able to perform the following steps, the aws-iam-authenticator version should be 0.5.0 or later:

{"Version":"v0.5.0","Commit":"1cfe2a90f68381eacd7b6dcfa2bf689e76eb8b4b"}

Now you have installed aws-iam-authenticator on your computer.

Installing eksctl to manage Amazon EKS

Amazon EKS is a managed Kubernetes service on Amazon EC2. To manage Amazon EKS over the terminal and execute commands, the official CLI for Amazon EKS, eksctl, is used. For additional information, you can refer to the AWS eksctl official documentation at https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html. Let's install eksctl on your computer:

Execute the following commands to install eksctl on your computer:

If you are using macOS:

$ brew tap weaveworks/tap

$ brew install weaveworks/tap/eksctl

If you are using Linux (x86) or WSL on Windows:

$ curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp

$ sudo mv /tmp/eksctl /usr/local/bin

Verify its version and make sure that eksctl is installed:
$ eksctl version

The output of the preceding command should look as follows:

0.77.0

To be able to perform the following steps, the eksctl version should be 0.77.0 or later. Now you have installed eksctl to manage Amazon EKS on your computer.
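
Several tools in this chapter have a minimum version. The following is a minimal sketch of a comparison helper, assuming your sort command supports the -V (version sort) option, as GNU coreutils does. version_ge is a hypothetical helper name, not part of any of the CLIs discussed here:

```shell
# Succeed when version $1 is greater than or equal to version $2.
# sort -V orders version strings numerically per component, so the
# required version must come out first (or tie) for the check to pass.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (requires eksctl to be installed):
#   version_ge "$(eksctl version)" 0.77.0 && echo "eksctl is recent enough"
```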

Installing the Google Cloud SDK to manage Google Cloud

The Google Cloud SDK, gcloud, is required to execute commands in your Google Cloud account. The following instructions assume that you have an active GCP account with billing enabled. If you don't have an account already, go to https://console.cloud.google.com and create an account. Let's install gcloud on your computer:

Execute the following commands to install gcloud on your computer:

If you are using macOS:

$ brew install --cask google-cloud-sdk

If you are using Linux (x86) or WSL on Windows:

$ curl https://sdk.cloud.google.com | bash

Restart your shell by executing the following command:
$ exec -l $SHELL
Verify its version and make sure gcloud is installed:
$ gcloud version

The output of the preceding command should look as follows. To be able to perform the following steps, the Google Cloud SDK version should be 339.0.0 or later:

Google Cloud SDK 367.0.0

bq 2.0.72

core 2021.12.10

gsutil 5.5

Execute the following command to initialize the SDK and follow the instructions:
$ gcloud init
Execute the following command to set a default zone for future deployments. In our example, the compute zone is set to us-central1-a:
$ gcloud config set compute/zone us-central1-a

Now you have installed gcloud to manage GKE on your computer.

Installing the Azure CLI to manage Microsoft Azure

The Azure CLI, az, is required to execute commands in your Microsoft Azure account. The following instructions assume that you have an active Azure account with billing enabled. If you don't have an account already, go to https://portal.azure.com and create an account. Let's install the Azure CLI on your computer:

Execute the following commands to install the Azure CLI and jq on your computer:

If you are using macOS:

$ brew update && brew install azure-cli

$ brew install jq

If you are using Linux (x86) or WSL on Windows:

$ curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

$ sudo apt-get install jq

Verify its version and make sure the Azure CLI is installed:
$ az version

The output of the preceding command should look as follows. To be able to perform the following steps, the Azure CLI version should be 2.0.1 or later:

{

"azure-cli": "2.31.0",

"azure-cli-core": "2.31.0",

"azure-cli-telemetry": "1.0.6",

"extensions": {}

}

Execute the following command to initialize the Azure CLI and follow the instructions:
$ az login
Execute the following command to create a resource group with a unique name in the location you want to use for future deployments. In our example, the location is set to centralus:
$ az group create --name="pachyderm-group" --location=centralus

Now you have installed the Azure CLI to manage AKS on your computer.
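
Before moving on, it can help to confirm that every tool is on your PATH. The following is a minimal sketch; missing_tools is a hypothetical helper name, and the tool list in the usage comment is taken from the Technical requirements section:

```shell
# Print the name of each tool that cannot be found on the PATH.
# Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example usage:
#   missing_tools kubectl helm pachctl aws aws-iam-authenticator eksctl gcloud az jq
```

You only need the tools for the cloud platform you plan to use, so a missing gcloud, for example, is not a problem if you deploy to Amazon EKS.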

Deploying Pachyderm on Amazon EKS

Kubernetes is an open source container orchestration platform and, by itself, is a large topic to cover. In this section, we take the topic of containerization from a data scientist's perspective and will only focus on running our workload, Pachyderm, on the most common managed platforms available in the market. There are various ways and tools to provision and manage the life cycle of production-grade Kubernetes clusters on the AWS cloud platform, such as kOps, kubespray, k3s, Terraform, and others. For additional configuration details, you can refer to Kubernetes' official documentation at https://kubernetes.io/docs/setup/production-environment/. Let's learn the simplest way to get the services required by Pachyderm up and running on AWS's managed Kubernetes service, Amazon EKS.

Preparing an Amazon EKS cluster to run Pachyderm

Follow these steps to provision an Amazon EKS cluster using eksctl. Initially developed as a third-party open source tool, eksctl is now the official CLI for deploying and managing EKS clusters. You will need to have the AWS CLI and the AWS IAM authenticator for Kubernetes installed and their credentials configured. If you already have a cluster, you can skip these instructions and jump to the Creating an S3 object storage bucket section. You can also refer to the official eksctl documentation at https://eksctl.io/introduction/:

Execute the following command to simply deploy an EKS cluster with default parameters. This command will generate a cluster with two m5.large worker nodes using the official AWS EKS Amazon Machine Image (AMI):
$ eksctl create cluster

The output of the preceding command should be similar to the following:

...

kubectl command should work with "/home/norton/.kube/config", try 'kubectl get nodes'

[✔] EKS cluster "exciting-badger-1620255089" in "us-east-1" region is ready

Important note

To customize the EKS cluster configuration, you can pass additional parameters to eksctl as follows:

$ eksctl create cluster --name <name> --version <version> \
  --nodegroup-name <name> --node-type <vm-flavor> \
  --nodes <number-of-nodes> --nodes-min <min-number-nodes> \
  --nodes-max <max-number-nodes> --node-ami auto

Verify the cluster deployment by executing the following command:
$ kubectl cluster-info && kubectl get nodes

The output of the preceding command should look as follows:

Kubernetes control plane is running at https://ABCD.gr7.us-east-1.eks.amazonaws.com

CoreDNS is running at https://ABCD.gr7.us-east-1.eks.amazonaws.com/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

NAME STATUS ROLES AGE VERSION

ip-192-168-17-133.ec2.internal Ready <none> 21m v1.21.5-eks-bc4871b

ip-192-168-63-179.ec2.internal Ready <none> 21m v1.21.5-eks-bc4871b

Now, your Amazon EKS cluster is provisioned and ready to deploy Pachyderm.
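
If you script the cluster check, you can count the nodes that report a Ready status instead of reading the table by eye. The following is a minimal sketch that parses the default kubectl get nodes table layout shown previously; count_ready_nodes is a hypothetical helper name:

```shell
# Read `kubectl get nodes` output on stdin and print the number of
# nodes whose STATUS column is exactly "Ready". The header row is
# skipped, and nodes such as "Ready,SchedulingDisabled" are not counted.
count_ready_nodes() {
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Example usage (requires a configured kubectl context):
#   kubectl get nodes | count_ready_nodes
```

With the default eksctl parameters described earlier, the expected count is 2.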

Creating an S3 object storage bucket

Pachyderm uses S3-compliant object storage to store data. Follow these steps to create an S3 object storage bucket:

Generate variables that will be used to create your S3 bucket and to configure Pachyderm storage later. Make sure to use a globally unique bucket name. In our example, we will use s3.pachyderm as our bucket name, located in the same region as our EKS cluster, us-east-1, and 200 GB as the size of the EBS storage volume:
$ export S3_BUCKET_NAME=s3.pachyderm
$ export EBS_STORAGE_SIZE=200
$ export AWS_REGION=us-east-1
In order for Pachyderm to store pipeline data, a dedicated S3 bucket is required. Execute the following command to create an S3 bucket with the parameters defined by your variables:
$ aws s3api create-bucket --bucket ${S3_BUCKET_NAME} \
  --region ${AWS_REGION}
Execute the following command to confirm the creation of the bucket:
$ aws s3api list-buckets --query 'Buckets[].Name'

The output of the preceding command should look as follows:

[
    "s3.pachyderm"
]

Now that we have an S3 bucket created, we are ready to deploy Pachyderm on Amazon EKS.
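
The create-bucket command shown in this section works as written for us-east-1. For any other region, the S3 API also expects a LocationConstraint that matches the region. The following is a minimal dry-run sketch that prints the appropriate command for review; create_bucket_cmd is a hypothetical helper name:

```shell
# Print (do not run) the create-bucket command for a bucket and region.
# us-east-1 is the only region that must omit the LocationConstraint.
create_bucket_cmd() {
  cb_bucket="$1"
  cb_region="$2"
  if [ "$cb_region" = "us-east-1" ]; then
    printf 'aws s3api create-bucket --bucket %s --region %s\n' \
      "$cb_bucket" "$cb_region"
  else
    printf 'aws s3api create-bucket --bucket %s --region %s --create-bucket-configuration LocationConstraint=%s\n' \
      "$cb_bucket" "$cb_region" "$cb_region"
  fi
}

# Example usage:
#   create_bucket_cmd "$S3_BUCKET_NAME" "$AWS_REGION"
```

Review the printed command, and then paste it into your terminal to create the bucket.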

Deploying the cluster

When you start learning Pachyderm, it is recommended to run experiments in a small local cluster. We have previously covered the local deployment of Pachyderm in Chapter 4, Installing Pachyderm Locally. In this chapter, we focus on a scalable production-grade deployment of Pachyderm using IAM roles on Amazon EKS clusters.

Follow these steps to install Pachyderm on your Amazon EKS cluster:

The AWS IAM role assigned to your EKS cluster needs to have access to the S3 bucket you created in the Creating an S3 object storage bucket section. Log in to your AWS Management Console and go to the EKS dashboard, as shown in the following screenshot:

Figure 5.1 – Amazon EKS Clusters dashboard

Click on the cluster. Locate the EC2 instance in the cluster. Find the IAM role on the Instance description page. Click on IAM Role:

Figure 5.2 – IAM role assigned to EC2 instances

Replace the <s3-bucket> placeholder below with your bucket name. Click on Add inline policy to create a new inline policy similar to the following code:
{
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<s3-bucket>",
                "arn:aws:s3:::<s3-bucket>/*"
            ]
        }
    ]
}
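
If you prefer the command line to the console, the policy document can be generated from your bucket name. The following is a minimal sketch that scopes the object-level actions to a single bucket; render_bucket_policy and the policy name pachyderm-s3-access are hypothetical names chosen for illustration, and <node-instance-role> stands for the role you located in the previous step:

```shell
# Print an inline policy that grants Pachyderm access to one bucket.
# $1 is the bucket name. The bucket ARN covers s3:ListBucket, and the
# "/*" ARN covers the object-level actions.
render_bucket_policy() {
  cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::$1",
                "arn:aws:s3:::$1/*"
            ]
        }
    ]
}
EOF
}

# Example usage (requires IAM permissions on the role):
#   render_bucket_policy "$S3_BUCKET_NAME" > pachyderm-s3-policy.json
#   aws iam put-role-policy --role-name <node-instance-role> \
#     --policy-name pachyderm-s3-access \
#     --policy-document file://pachyderm-s3-policy.json
```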
Switch to the Trust relationships tab and click on the Edit trust relationship button. Confirm that the trust relationship is similar to the following. If it is not, make the changes and click on the Update Trust Policy button:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Now, execute the following command to add the Pachyderm Helm Chart repos to your local repository:
$ helm repo add pach https://helm.pachyderm.com

Execute the following command to get the latest Chart information from the Chart repository:
$ helm repo update
For a quick deployment, replace the Amazon S3 bucket name, access key ID, and secret key below and execute the command to deploy the latest version of Pachyderm on your cluster without the console:
$ helm install pachd pach/pachyderm \
  --set deployTarget=AMAZON \
  --set pachd.storage.amazon.bucket="AWS_S3_BUCKET" \
  --set pachd.storage.amazon.id="AWS_ACCESS_KEY" \
  --set pachd.storage.amazon.secret="AWS_SECRET" \
  --set pachd.storage.amazon.region="us-east-1" \
  --set pachd.externalService.enabled=true

If you have an enterprise key and would like to deploy it with Pachyderm's console user interface, execute the following command:

$ helm install pachd pach/pachyderm \
  --set deployTarget=AMAZON \
  --set pachd.storage.amazon.bucket="AWS_S3_BUCKET" \
  --set pachd.storage.amazon.id="AWS_ACCESS_KEY" \
  --set pachd.storage.amazon.secret="AWS_SECRET" \
  --set pachd.storage.amazon.region="us-east-1" \
  --set pachd.enterpriseLicenseKey=$(cat license.txt) \
  --set console.enabled=true

Once the console is deployed successfully, follow the instructions under the Accessing the Pachyderm console section to access the console.

The commands return the following output:

Figure 5.3 – Pachyderm Helm Chart getting deployed on Kubernetes

Optional: Customizing Installation Parameters

You can also download and customize the values.yaml file in the Helm Chart repository, https://github.com/pachyderm/pachyderm/tree/master/etc/helm/pachyderm, to further optimize the components needed to run Pachyderm.

Execute the following command to create a local copy of the values.yaml file:

$ wget https://raw.githubusercontent.com/pachyderm/pachyderm/master/etc/helm/pachyderm/values.yaml

Once customized, you can use the same YAML file and install your Helm Chart by executing the following command instead:

$ helm install pachyderm -f values.yaml pach/pachyderm
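
Rather than editing the full values.yaml file, you can keep a small override file that contains only the settings you change. Helm maps each dotted --set path to nested YAML keys, so the following minimal sketch mirrors the flags used in the quick deployment. render_values and my-values.yaml are hypothetical names:

```shell
# Print a minimal Helm values override for an Amazon deployment.
# $1 is the S3 bucket name and $2 is the AWS region.
render_values() {
  cat <<EOF
deployTarget: AMAZON
pachd:
  storage:
    amazon:
      bucket: "$1"
      region: "$2"
  externalService:
    enabled: true
EOF
}

# Example usage (requires a reachable cluster and the pach repo):
#   render_values "$S3_BUCKET_NAME" "$AWS_REGION" > my-values.yaml
#   helm install pachd pach/pachyderm -f my-values.yaml \
#     --set pachd.storage.amazon.id="AWS_ACCESS_KEY" \
#     --set pachd.storage.amazon.secret="AWS_SECRET"
```

Passing the credentials with --set at install time keeps them out of the file, which makes the override file safer to store alongside your other configuration.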

A Kubernetes Deployment is a controller that rolls out a ReplicaSet of Pods based on the requirements defined in your manifest file. A ReplicaSet is a group of instances of the same service. Execute the following command to verify the state of Deployments created during the installation:
$ kubectl get deployments

The output of the preceding command should look as follows:

Figure 5.4 – List of Pachyderm Deployment objects

Execute the following command to verify a successful installation and see the Pods created as part of the Deployments:
$ kubectl get pods

The output of the preceding command should look as follows:

Figure 5.5 – List of Pachyderm Pods
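
The Pods can take a few minutes to reach the Running state after the Helm Chart is installed. The following is a minimal sketch of a generic retry helper that you could wrap around any of the verification commands in this section; retry is a hypothetical helper name, and the attempt count and delay in the usage comment are arbitrary:

```shell
# Run a command until it succeeds or the attempts are used up.
# $1 is the maximum number of attempts, $2 is the delay in seconds
# between attempts, and the remaining arguments are the command.
retry() {
  retry_attempts="$1"
  retry_delay="$2"
  shift 2
  retry_count=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$retry_count" -ge "$retry_attempts" ]; then
      return 1
    fi
    retry_count=$((retry_count + 1))
    sleep "$retry_delay"
  done
}

# Example usage (requires a configured kubectl context):
#   retry 30 10 kubectl rollout status deployment/pachd --timeout=10s
```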

Execute the following command to verify the persistent volumes created as part of the Deployment:
$ kubectl get pv

The output of the preceding command should look as follows:

NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE

pvc-cab1f435-02fb-42df-85d9-d49f6151b281 200Gi RWO Delete Bound default/etcd-storage-etcd-0 etcd-storage-class 81m

Important note

When Pachyderm is deployed using the --dynamic-etcd-nodes flag, it creates an etcd deployment to manage administrative metadata. Block storage used by etcd Pods is provisioned using the default AWS StorageClass, gp2. To use a different StorageClass during deployment, you will need to deploy an Amazon EBS CSI driver to your cluster and update the etcd.storageclass parameter to gp3 during the Helm Chart deployment.

Execute the following command to verify the successful installation of Pachyderm:
$ pachctl version

The output of the preceding command should look as follows:

COMPONENT VERSION

pachctl 2.0.0

pachd 2.0.0

Now that we have installed Pachyderm on our AWS EKS cluster, we are ready to create our first pipeline.

Deleting a Pachyderm deployment on Amazon EKS

If you need to delete your Pachyderm deployment or start afresh, you can wipe out your environment and start over again from the Preparing an Amazon EKS cluster to run Pachyderm instructions. Let's perform the following steps to delete your existing Pachyderm deployment:

If you have used a different name for your Helm instance, execute the following command to find the Pachyderm instance name deployed using the Helm Chart:
$ helm ls | grep pachyderm

The output of the preceding command should look as follows:

pachd default 1 2021-12-27 20:20:33.168535538 -0800 PST deployed pachyderm-2.0.0 2.0.0

Execute the following command using your Pachyderm instance name to remove Pachyderm components from your cluster:
$ helm uninstall pachd
Execute the following command to retrieve the list of EKS clusters deployed and identify the name of your cluster:
$ eksctl get cluster

The output of the preceding command should look as follows:

2021-05-05 21:53:56 [ℹ] eksctl version 0.47.0

2021-05-05 21:53:56 [ℹ] using region us-east-1

NAME REGION EKSCTL CREATED

exciting-badger-1620255089 us-east-1 True

If you would like to remove the cluster completely, copy the name of your cluster from the preceding output and execute the following command after replacing the name to delete the Amazon EKS cluster. Note that all the other workloads on this cluster will also be destroyed:
$ eksctl delete cluster --name <name>

The output of the preceding command should be similar to the following:

...

2021-05-05 22:00:54 [ℹ] will delete stack "eksctl-exciting-badger-1620255089-cluster"

2021-05-05 22:00:54 [✔] all cluster resources were deleted

Now you have completely removed Pachyderm and your EKS cluster from your AWS account.

Deploying Pachyderm on GKE

If you use Google Cloud, a Kubernetes cluster can be deployed on Google Cloud Platform (GCP) using automation and command-line tools such as kOps, kubespray, and Terraform. For additional configuration details, you can refer to Kubernetes' official documentation at https://kubernetes.io/docs/setup/production-environment/. Let's learn the simplest way to get the services required by Pachyderm up and running on Google Cloud's managed Kubernetes service, GKE.

Preparing a GKE cluster to run Pachyderm

Follow these steps to provision a GKE cluster using the Google Cloud SDK. You will need to have the Google Cloud SDK installed and its credentials configured. If you have a cluster, you can skip these instructions and jump to the Deploying the cluster section. Also, you can refer to the Google Cloud SDK official documentation at https://cloud.google.com/sdk/docs/install:

Execute the following command to deploy a GKE cluster with default parameters. This command will generate a three-node cluster with the recommended n2-standard-4 instance type using a Container-Optimized OS (COS) with Docker in your default compute zone:
$ gcloud container clusters create pachyderm-cluster \
  --scopes compute-rw,storage-rw,service-management,service-control,logging-write,monitoring \
  --machine-type n2-standard-4

The output of the preceding command should be similar to the following:

...

kubeconfig entry generated for pachyderm-cluster.

NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS

pachyderm-cluster us-central1-a 1.18.16-gke.2100 35.238.200.52 n2-standard-4 1.18.16-gke.2100 3 RUNNING

Important note

To simply customize the GKE cluster parameters, you can use the GCP console and the Kubernetes Engine creation wizard. After you configure the parameters on the wizard, click on the command-line button in the wizard to convert the configuration into a CLI command to use with the gcloud command.

Verify deployment of the cluster by executing the following command:
$ kubectl cluster-info && kubectl get nodes

The output of the preceding command should look as follows:

Kubernetes control plane is running at https://<IP_ADDRESS>

GLBCDefaultBackend is running at https://<IP_ADDRESS>/api/v1/namespaces/kube-system/services/default-http-backend:http/proxy

KubeDNS is running at https://<IP_ADDRESS>/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

Metrics-server is running at https://<IP_ADDRESS>/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

NAME STATUS ROLES AGE VERSION

gke-pachyderm-cluster-default-pool-26cf3a77-1vr1 Ready <none> 12m v1.18.16-gke.2100

gke-pachyderm-cluster-default-pool-26cf3a77-5sgs Ready <none> 12m v1.18.16-gke.2100

gke-pachyderm-cluster-default-pool-26cf3a77-lkr4 Ready <none> 12m v1.18.16-gke.2100

Now, your GKE cluster is provisioned and ready to deploy Pachyderm.

Creating a Google Cloud object storage bucket

Pachyderm uses object storage to store data. Follow these steps to create a Google Cloud object storage bucket:

Generate variables that will be used to create your Google Cloud Storage (GCS) bucket and to configure Pachyderm storage later. Make sure to use a globally unique bucket name. In our example, we will use pachyderm-bucket as our bucket name, located in the same region as our GKE cluster, us-central1, and 200 GB as the size of the persistent storage volume:
$ export GCS_BUCKET_NAME=pachyderm-bucket
$ export GKE_STORAGE_SIZE=200
In order for Pachyderm to store pipeline data, a dedicated GCS bucket is required. Execute the following command to create a GCS bucket with the parameters defined by your variables:
$ gsutil mb gs://${GCS_BUCKET_NAME}
Execute the following command to confirm the creation of the bucket:
$ gsutil ls

The output of the preceding command should look as follows:

gs://pachyderm-bucket/

Now, you have a GCS bucket created to store Pachyderm data. We are ready to deploy Pachyderm on GKE.

Deploying the cluster

When you start learning Pachyderm, it is suggested to run experiments in a small local cluster. We have previously covered the local deployment of Pachyderm in Chapter 4, Installing Pachyderm Locally. In this chapter, we are going to focus on a scalable production-grade deployment of Pachyderm using IAM roles on GKE clusters.

Follow these steps to install Pachyderm on your GKE cluster:

If you don't already have a service account, execute the following command to create a service account:
$ gcloud iam service-accounts create my-service-account --display-name=my-account
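Replace pachyderm-book in both places in the following command with your project ID to bind a role to your service account. The example grants roles/owner; a narrower role such as roles/storage.admin is a safer choice for production use:
$ gcloud projects add-iam-policy-binding pachyderm-book \
  --role roles/owner \
  --member serviceAccount:my-service-account@pachyderm-book.iam.gserviceaccount.com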
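
User-managed service accounts have an email address of the form NAME@PROJECT_ID.iam.gserviceaccount.com, which is why the project ID appears twice in the binding command. The following is a minimal sketch that builds the address from its two parts; service_account_email is a hypothetical helper name:

```shell
# Print the email address of a user-managed service account.
# $1 is the service account name and $2 is the project ID.
service_account_email() {
  printf '%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Example usage (requires an authenticated gcloud session):
#   gcloud projects add-iam-policy-binding pachyderm-book \
#     --role roles/owner \
#     --member "serviceAccount:$(service_account_email my-service-account pachyderm-book)"
```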
Now, execute the following command to add Pachyderm Helm Chart repos to your local repository:
$ helm repo add pach https://helm.pachyderm.com
Execute the following command to get the latest Chart information from the Chart repository:
$ helm repo update
For a quick deployment, replace the Google Cloud bucket name, and Google Cloud credentials, and execute the following command to deploy the latest version of Pachyderm on your cluster without the console:
$ helm install pachd pach/pachyderm \
  --set deployTarget=GOOGLE \
  --set pachd.storage.google.bucket="GOOGLE_BUCKET" \
  --set pachd.storage.google.cred="GOOGLE_CRED" \
  --set pachd.externalService.enabled=true

If you have an enterprise key and would like to deploy it with Pachyderm's console user interface, execute the following command:

$ helm install pachd pach/pachyderm \
  --set deployTarget=GOOGLE \
  --set pachd.storage.google.bucket="GOOGLE_BUCKET" \
  --set pachd.storage.google.cred="GOOGLE_CRED" \
  --set pachd.enterpriseLicenseKey=$(cat license.txt) \
  --set console.enabled=true

Once the console is deployed successfully, follow the instructions under the Accessing the Pachyderm console section to access the console.

A Kubernetes Deployment is a controller that rolls out a ReplicaSet of Pods based on the requirements defined in your manifest file. Execute the following command to verify the state of Deployments created during installation:
$ kubectl get deployments

The output of the preceding command should look as follows:

NAME READY UP-TO-DATE AVAILABLE AGE

dash 1/1 1 1 44s

pachd 1/1 1 1 45s

Execute the following command to verify a successful installation and see the Pods created as part of the Deployments:
$ kubectl get pods

The output of the preceding command should look as follows:

NAME READY STATUS RESTARTS AGE

dash-cf6f47d7d-xpvvp 2/2 Running 0 104s

etcd-0 1/1 Running 0 104s

pachd-6c99f6fb7-dnjhn 1/1 Running 0 104s

Execute the following command to verify the persistent volumes created as part of the Deployment:
$ kubectl get pv

The output of the preceding command should look as follows:

NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE

pvc-c4eac147-8571-4ccb-8cd0-7c1cb68a627d 200Gi RWO Delete Bound default/etcd-storage-etcd-0 etcd-storage-class 3m4s

Execute the following command to verify the successful installation of Pachyderm:
$ pachctl version

The output of the preceding command should look as follows:

COMPONENT VERSION

pachctl 2.0.0

pachd 2.0.0

Now that you have installed Pachyderm on your GKE cluster, you are ready to create your first pipeline.

Deleting a Pachyderm deployment on GKE

If you need to delete your deployment and start afresh, you can wipe out your environment and start over again using the Preparing a GKE cluster to run Pachyderm instructions. Let's perform the following steps to delete your existing Pachyderm deployment:

If you have used a different name for your Helm instance, execute the following command to find the Pachyderm instance name deployed using the Helm Chart:
$ helm ls | grep pachyderm

The output of the preceding command should look as follows:

pachd default 1 2021-12-27 20:20:33.168535538 -0800 PST deployed pachyderm-2.0.0 2.0.0

Execute the following command using your Pachyderm instance name to remove the Pachyderm components from your cluster:
$ helm uninstall pachd
Execute the following command to retrieve the list of GKE clusters deployed and identify the name of your cluster:
$ gcloud container clusters list

The output of the preceding command should look as follows:

NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS

pachyderm-cluster us-central1-a 1.18.16-gke.2100 35.238.200.52 n2-standard-4 1.18.16-gke.2100 3 RUNNING

If you would like to remove the cluster completely, copy the name of your cluster from the preceding output and execute the following command after replacing the name to delete the GKE cluster. Note that all the other workloads on this cluster will also be destroyed:
$ gcloud container clusters delete <name>

The output of the preceding command should be similar to the following:

...

Deleting cluster pachyderm-cluster...done.

Deleted [https://container.googleapis.com/v1/projects/pachydermbook/zones/us-central1-a/clusters/pachyderm-cluster].

Now you have completely removed Pachyderm and your Kubernetes cluster from your GCP account.

Deploying Pachyderm on Microsoft AKS

If you use Microsoft Azure, a Kubernetes cluster can be deployed on the Azure platform using automation and command-line tools such as kOps, kubespray, and Terraform. For additional configuration details, you can refer to Kubernetes' official documentation at https://kubernetes.io/docs/setup/production-environment/. Let's learn the simplest way to get the services required by Pachyderm up and running on Azure's managed Kubernetes service, AKS.

Preparing an AKS cluster to run Pachyderm

Follow these steps to provision an AKS cluster using the Azure CLI. You will need to have the Azure CLI installed and its credentials configured. If you have a cluster, you can skip these instructions and jump to the Deploying Pachyderm on Microsoft AKS section. Also, you can refer to the Azure CLI official documentation at https://docs.microsoft.com/en-us/cli/azure/:

Execute the following command in your previously specified resource-group to deploy an AKS cluster with default parameters. This command will generate a three-node cluster with the recommended Standard_DS4_v2 instance type in your default compute zone:
$ az aks create --resource-group pachyderm-group --name pachyderm-cluster --generate-ssh-keys --node-vm-size Standard_DS4_v2

The output of the preceding command should look similar to the following:

...

"privateFqdn": null,

"provisioningState": "Succeeded",

"resourceGroup": "pachyderm-group",

"servicePrincipalProfile": {

"clientId": "msi",

"secret": null

},...

Important note

If you don't remember your resource group name, you can use the az group list command to list the previously created resource groups.

Execute the following command to connect to your cluster:
$ az aks get-credentials --resource-group pachyderm-group --name pachyderm-cluster
Verify the cluster deployment by executing the following command:
$ kubectl get nodes

The output of the preceding command should look as follows:

NAME STATUS ROLES AGE VERSION

aks-nodepool1-34139531-vmss000000 Ready agent 5m57s v1.19.9

aks-nodepool1-34139531-vmss000001 Ready agent 5m58s v1.19.9

aks-nodepool1-34139531-vmss000002 Ready agent 5m58s v1.19.9

Now your AKS cluster is provisioned and ready to deploy Pachyderm.
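
If you prefer to script the node check rather than read the table by eye, the following is a minimal sketch. The function names count_ready_nodes and check_nodes are hypothetical helpers, not part of kubectl, and the default of three nodes is an assumption based on the cluster created above:

```shell
#!/bin/sh
# Count nodes whose STATUS column is exactly "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeed only when the expected number of nodes
# is Ready. The default of 3 matches the three-node cluster above.
check_nodes() {
  expected="${1:-3}"
  ready="$(kubectl get nodes --no-headers | count_ready_nodes)"
  [ "$ready" -eq "$expected" ]
}
```

For example, check_nodes 3 && echo "cluster ready" prints the message only when all three nodes report Ready. A node shown as Ready,SchedulingDisabled is deliberately not counted.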

Creating an Azure storage container

Pachyderm uses blob storage to store data and block storage for metadata. It is recommended to use SSDs rather than the Standard HDD-based slower storage option.

Follow these steps to create a Premium LRS Block blobs storage container:

Define the variables that you will use to create your Azure block blob storage and later pass to the Pachyderm deployment. Make sure to use a unique storage account name and container name. In our example, we will use pachydermstorageaccount as our STORAGE_ACCOUNT and pachydermblobcontainer as our CONTAINER_NAME, and the storage account will be located in the centralus region:
$ export RESOURCE_GROUP=pachyderm-group
$ export STORAGE_ACCOUNT=pachydermstorageaccount
$ export CONTAINER_NAME=pachydermblobcontainer
$ export AZURE_STORAGE_SIZE=200
Execute the following command to create your Azure storage account with the parameters defined by your variables:
$ az storage account create \
  --resource-group="${RESOURCE_GROUP}" \
  --location="centralus" \
  --sku=Premium_LRS \
  --name="${STORAGE_ACCOUNT}" \
  --kind=BlockBlobStorage
Execute the following command to confirm the creation of the storage account:
$ az storage account list

The output of the preceding command should look as follows:

...

"web": "https://pachydermstorageaccount.z19.web.core.windows.net/"

"primaryLocation": "centralus",

"privateEndpointConnections": [],

"provisioningState": "Succeeded",

"resourceGroup": "pachyderm-group",

...

The deployment of Pachyderm requires a storage account key. Execute the following command to store the first storage account key in a variable:
$ STORAGE_KEY="$(az storage account keys list \
              --account-name="${STORAGE_ACCOUNT}" \
              --resource-group="${RESOURCE_GROUP}" \
              --output=json \
              | jq -r '.[0].value'
            )"
Execute the following command to create a data storage container in your storage account:
$ az storage container create --name "${CONTAINER_NAME}" \
  --account-name "${STORAGE_ACCOUNT}" \
  --account-key "${STORAGE_KEY}"

The output of the preceding command should look as follows:

{

"created": true

}

Now, you have an Azure data storage container created in your Azure storage account to store Pachyderm data.
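
Azure rejects storage account and container names that break its naming rules, so it can save a failed run to check the variables before calling az. The following is a minimal sketch of such a check. The function names are hypothetical, and the rules encoded here are the documented Azure limits: 3 to 24 lowercase letters and digits for storage accounts, and 3 to 63 lowercase letters, digits, and single hyphens for containers. The sketch does not check global uniqueness of the account name:

```shell
#!/bin/sh
# Storage account names: 3-24 characters, lowercase letters and digits only.
is_valid_storage_account() {
  name="$1"
  len=${#name}
  [ "$len" -ge 3 ] && [ "$len" -le 24 ] || return 1
  case "$name" in
    *[!abcdefghijklmnopqrstuvwxyz0123456789]*) return 1 ;;  # illegal character
  esac
  return 0
}

# Container names: 3-63 characters, lowercase letters, digits, and hyphens;
# no leading, trailing, or consecutive hyphens.
is_valid_container_name() {
  name="$1"
  len=${#name}
  [ "$len" -ge 3 ] && [ "$len" -le 63 ] || return 1
  case "$name" in
    *[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;  # illegal character
    -*|*-|*--*) return 1 ;;                                  # misplaced hyphen
  esac
  return 0
}
```

A possible use is is_valid_storage_account "${STORAGE_ACCOUNT}" || echo "invalid storage account name" before running az storage account create.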

Deploying the cluster

Follow these steps to install Pachyderm on your AKS cluster:

If you have not connected to your Kubernetes cluster, execute the following command to connect to your cluster:
$ az aks get-credentials --resource-group pachyderm-group --name pachyderm-cluster
For a quick deployment, replace CONTAINER_NAME, AZURE_ID, and AZURE_SECRET with your Azure storage container name, storage account name, and storage account key, and execute the following command to deploy the latest version of Pachyderm on your cluster without the console:
$ helm install pachd pach/pachyderm \
  --set deployTarget=MICROSOFT \
  --set pachd.storage.microsoft.container="CONTAINER_NAME" \
  --set pachd.storage.microsoft.id="AZURE_ID" \
  --set pachd.storage.microsoft.secret="AZURE_SECRET" \
  --set pachd.externalService.enabled=true

If you have an enterprise key and would like to deploy Pachyderm with the console user interface, execute the following command:
$ helm install pachd pach/pachyderm \
  --set deployTarget=MICROSOFT \
  --set pachd.storage.microsoft.container="CONTAINER_NAME" \
  --set pachd.storage.microsoft.id="AZURE_ID" \
  --set pachd.storage.microsoft.secret="AZURE_SECRET" \
  --set pachd.enterpriseLicenseKey="$(cat license.txt)" \
  --set console.enabled=true

Once the console is deployed successfully, follow the instructions under the Accessing the Pachyderm console section to access the console.

A Kubernetes Deployment is a controller that rolls out a ReplicaSet of Pods based on the requirements defined in your manifest file. Execute the following command to verify the state of Deployments created during the installation:
$ kubectl get deployments

The output of the preceding command should look as follows:

NAME READY UP-TO-DATE AVAILABLE AGE

dash 1/1 1 1 39s

pachd 1/1 1 1 39s

Execute the following command to verify a successful installation and see the Pods created as part of the Deployments:
$ kubectl get pods

The output of the preceding command should look as follows:

NAME READY STATUS RESTARTS AGE

dash-866fd997-z79jj 2/2 Running 0 54s

etcd-0 1/1 Running 0 54s

pachd-8588c44f56-skmkl 1/1 Running 0 54s

Execute the following command to verify the persistent volumes created as part of the Deployment:
$ kubectl get pv

The output of the preceding command should look as follows:

NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE

pvc-9985a602-789d-40f3-9249-7445a9c15bc3 200Gi RWO Delete Bound default/etcd-storage-etcd-0 default 89s

Important note

When Pachyderm is deployed using the --dynamic-etcd-nodes flag, it creates an etcd deployment to manage administrative metadata. In Azure, the block storage used by etcd Pods is provisioned using the default StorageClass. This StorageClass uses the azure-disk provisioner with StandardSSD_LRS volumes. To use a different StorageClass during the deployment, you can customize the values.yaml file and update the etcd.storageClass parameter prior to the deployment.

Execute the following command to verify the successful installation of Pachyderm:
$ pachctl version

The output of the preceding command should look as follows:

COMPONENT VERSION

pachctl 2.0.0

pachd 2.0.0

Now that we have installed Pachyderm on your AKS cluster, you are ready to create your first pipeline.
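
The quick deployment in the steps above can also be scripted so that the placeholders are filled from the variables set in the storage section. The following is a minimal sketch, not the official installation procedure. The function name deploy_pachyderm_aks is hypothetical, and it assumes that AZURE_ID corresponds to the storage account name and AZURE_SECRET to the storage account key, and that the pach Helm repository has already been added:

```shell
#!/bin/sh
# Hypothetical wrapper around the quick deployment command.
# Assumes CONTAINER_NAME, STORAGE_ACCOUNT, and STORAGE_KEY were set
# in the storage section, and aborts if any of them is missing.
deploy_pachyderm_aks() {
  : "${CONTAINER_NAME:?set CONTAINER_NAME first}"
  : "${STORAGE_ACCOUNT:?set STORAGE_ACCOUNT first}"
  : "${STORAGE_KEY:?set STORAGE_KEY first}"
  helm install pachd pach/pachyderm \
    --set deployTarget=MICROSOFT \
    --set pachd.storage.microsoft.container="${CONTAINER_NAME}" \
    --set pachd.storage.microsoft.id="${STORAGE_ACCOUNT}" \
    --set pachd.storage.microsoft.secret="${STORAGE_KEY}" \
    --set pachd.externalService.enabled=true &&
  # Block until the pachd Deployment reports a completed rollout.
  kubectl rollout status deployment/pachd --timeout=300s
}
```

Because the rollout check runs only if helm install succeeds, a failed installation is reported immediately instead of waiting for the timeout. The 300-second timeout is an arbitrary choice for illustration.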

Deleting a Pachyderm deployment on AKS

If you need to delete your deployment or start afresh, you can wipe out your environment and start over again using the Preparing an AKS cluster to run Pachyderm instructions.

Let's perform the following steps to delete your existing Pachyderm deployment:

If you used a different name for your Helm release, execute the following command to find the name of the Pachyderm release deployed with the Helm chart:
$ helm ls | grep pachyderm

The output of the preceding command should look as follows:

pachd default 1 2021-12-27 20:20:33.168535538 -0800 PST deployed pachyderm-2.0.0 2.0.0

Execute the following command using your Pachyderm release name to remove the Pachyderm components from your cluster:
$ helm uninstall pachd
Execute the following command to retrieve the list of AKS clusters deployed and identify the name of your cluster:
$ az aks list

The output of the preceding command should look as follows:

...

"location": "centralus",

"maxAgentPools": 100,

"name": "pachyderm-cluster",

"networkProfile": {

…

If you would like to remove the cluster completely, copy the name of your cluster from the preceding output and execute the following command after replacing the name to delete the AKS cluster. Note that all the other workloads on this cluster will also be destroyed:
$ az aks delete --name <name> --resource-group pachyderm-group

The az aks delete command asks for confirmation before it deletes the cluster. Type y to confirm and wait for the command to return.

Now you have completely removed Pachyderm and your Kubernetes cluster from your Azure account.
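
The deletion steps above can be combined into one function if you tear down test clusters often. The following is a minimal sketch. The function name teardown_pachyderm_aks is hypothetical, and the --yes flag skips the confirmation prompt of az aks delete, so use it with care:

```shell
#!/bin/sh
# Hypothetical helper: uninstall the Helm release, then delete the AKS cluster.
# Usage: teardown_pachyderm_aks RELEASE CLUSTER RESOURCE_GROUP
teardown_pachyderm_aks() {
  release="${1:?usage: teardown_pachyderm_aks RELEASE CLUSTER RESOURCE_GROUP}"
  cluster="${2:?missing cluster name}"
  group="${3:?missing resource group}"
  # Stop here if the release cannot be removed, leaving the cluster intact.
  helm uninstall "$release" || return 1
  # --yes answers the confirmation prompt; every workload on the cluster is destroyed.
  az aks delete --name "$cluster" --resource-group "$group" --yes
}
```

With the names used in this chapter, the call would be teardown_pachyderm_aks pachd pachyderm-cluster pachyderm-group. The storage account created earlier is not removed by this sketch.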

Accessing the Pachyderm console

Pachyderm Enterprise Edition offers a graphical user interface where you can see pipelines and repositories. Accessing the Pachyderm console using port forwarding was covered in Chapter 4, Installing Pachyderm Locally.

In addition, for cloud deployments, you can deploy a Kubernetes ingress to access the Pachyderm console securely. For more information, refer to the official Pachyderm documentation.

Summary

In this chapter, we learned the software prerequisites for getting Pachyderm up and running on managed Kubernetes services from major cloud providers including AWS, Google Cloud, and Microsoft Azure.

We acquired basic knowledge of the cloud providers' command-line tools and learned how to install and operate them on a local machine to provision production-grade Kubernetes clusters.

We created an object storage bucket and also deployed highly available multi-node managed Kubernetes clusters using the most common configuration options. And finally, we deployed a Pachyderm instance.

In the next chapter, we will learn in detail about creating your first pipeline. You will work through a simple data science example and the pipeline creation workflow.
