Chapter 10: Pachyderm Language Clients

In previous chapters, we have learned how to use Pachyderm through the pachctl Command-Line Interface (CLI). We have deployed and managed multiple Pachyderm pipelines by using mostly pachctl. We briefly looked into the Pachyderm User Interface (UI)—or the dashboard—as well, although we did not use it extensively. The Pachyderm CLI enables you to perform all Pachyderm management operations and, in general, provides more functionality than the Pachyderm UI.

However, many users may decide to take Pachyderm even further by using Pachyderm Application Programming Interfaces (APIs) directly without relying on pachctl or the dashboard. Many Pachyderm users develop scripts and call the Pachyderm API directly from these scripts. As of today, Pachyderm provides two official Pachyderm programming language clients, Golang (Go) and Python, to enable advanced users to extend Pachyderm functionality further.

In addition, if you are familiar with Protocol Buffers (Protobuf), a language-neutral and platform-neutral mechanism for serializing structured data, you can use the Pachyderm pps.proto file to access Pachyderm from other languages, such as C, C++, and Java.

In this chapter, you will learn how to use both Python and Go Pachyderm clients. You will learn how to run basic operations by using both of these clients, including how to create repositories and pipelines.

This chapter is intended to demonstrate how to use official Pachyderm language clients.

We will cover the following topics:

  • Using the Pachyderm Go client
  • Cloning the Pachyderm source repository
  • Using the Pachyderm Python client

Technical requirements

You should have already installed the components listed next.

For a local macOS installation, you will need the following:

  • macOS Mojave, Catalina, Big Sur, or later
  • Docker Desktop for Mac 10.14
  • minikube v1.9.0 or later
  • pachctl 2.0.0 or later
  • Pachyderm 2.0.0 or later

For a local Windows installation, you will need the following:

  • Windows Pro 64-bit v10 or later
  • Windows Subsystem for Linux (WSL) 2 or later
  • Microsoft PowerShell v6.2.1 or later
  • Hyper-V
  • minikube v1.9.0 or later
  • kubectl v1.18 or later
  • pachctl 2.0.0 or later
  • Pachyderm 2.0.0 or later

For an Amazon Elastic Kubernetes Service (Amazon EKS) installation, you will need the following:

  • kubectl v1.18 or later
  • eksctl
  • aws-iam-authenticator
  • pachctl 2.0.0 or later
  • Pachyderm 2.0.0 or later

For a Microsoft Azure Kubernetes Service (AKS) cloud installation, you will need the following:

  • kubectl v1.18 or later
  • The Azure CLI
  • pachctl 2.0.0 or later
  • Pachyderm 2.0.0 or later
  • jq 1.5 or later

For a Google Kubernetes Engine (GKE) cloud installation, you will need the following:

  • Google Cloud Software Development Kit (SDK) v124.0.0 or later
  • kubectl v1.18 or later
  • pachctl 2.0.0 or later
  • Pachyderm 2.0.0 or later

Downloading the source files

All scripts for this chapter are available at https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/tree/main/Chapter10-Pachyderm-Language-Clients.

We will use the image processing example that we had in Chapter 6, Creating Your First Pipeline. If you do not have them already, download the files for this example from here: https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/tree/main/Chapter06-Creating-Your-First-Pipeline.

Using the Pachyderm Go client

The Pachyderm Go client enables you to communicate with Pachyderm through the Go API. Go is a programming language developed by Google that has become widely popular in the developer community in recent years. In this section, we will learn how to set up the Pachyderm Go client and how to run basic Pachyderm operations with it.

The main source files that you can use for reference are located in the https://github.com/pachyderm/pachyderm/tree/master/src/client directory of the Pachyderm source repository. These files define the methods that you can use to communicate with Pachyderm objects and primitives, including the ones that we will use in the following sections.

Installing Go on your computer

To get started, we need to verify that we have a valid Go installation in our environment. Go is supported in all major operating systems, including Microsoft Windows, Linux, and macOS. Check that you have Go installed on your computer by running the following command:

go version

You should see an output similar to this:

go version go1.16.4 darwin/amd64

If you do not have Go installed on your machine, follow these next steps to install it:

  1. Go to https://golang.org/doc/install and download the version of Go for your operating system.
  2. Open the downloaded package and follow the prompts to install Go in your system. You should see the following screen when you are done:
Figure 10.1 – Go installation

  3. Follow the instructions for your operating system at https://golang.org/doc/install to verify that Go was installed.
  4. Restart your terminal and run go version again to verify your installation. You should see an output similar to this:

    go version go1.16.4 darwin/amd64

Now that we have Go installed, let's configure $GOPATH.

Configuring $GOPATH

If you have never used Go before, you need to make sure that you have your $GOPATH directory properly set up; otherwise, none of the scripts described in this section will work. When you installed Go, it might have already been configured. However, you might still want to configure the following:

  1. Verify that your shell configuration file (~/.bash_profile, ~/.profile, or ~/.zprofile) contains the following:

    export GOPATH="$HOME/go"

    PATH="$GOPATH/bin:$PATH"

  2. If your respective shell configuration file did not have this configuration, add it, and then source your shell configuration file, like this:

    source ~/.<shell-profile>

  3. Check your $GOPATH directory by running the following command:

    go env

This command prints your Go environment configuration. If you are on macOS, your $GOPATH directory should be `GOPATH="/Users/<username>/go"`.

  4. If you don't have it already, create a src directory under your $GOPATH directory, as follows:

    mkdir $GOPATH/src

  5. Under $GOPATH/src, create a github.com directory, as follows:

    mkdir $GOPATH/src/github.com

You will need to clone the Pachyderm repository into this directory, as described in the next section.

  6. Update to the latest version of grpc, as follows:

    go get google.golang.org/grpc

After you have configured $GOPATH, you need to clone the Pachyderm source repository.

Cloning the Pachyderm source repository

Before you can use Pachyderm language clients, you need to have a copy of the Pachyderm source repository on your machine to be able to use the APIs. You will run the client methods against an existing Pachyderm cluster. The Pachyderm repository is stored in GitHub at https://github.com/pachyderm/pachyderm. In addition, you need to make sure that you switch to the branch and tag that matches your pachd and pachctl version. In this section, we will learn how to clone the Pachyderm repository and how to switch to the required branch and tag.

To be able to run the Go modules used in the scripts in this section, you need to clone the Pachyderm repository under the $GOPATH directory on your computer. On macOS, the default $GOPATH directory is /Users/<username>/go, so you can clone the Pachyderm repository into /Users/<username>/go/src/github.com/.

To clone the Pachyderm repository, complete the following steps:

  1. Go to https://github.com/pachyderm/pachyderm.
  2. Click on the Code tab, as shown in the following screenshot:
Figure 10.2 – Pachyderm source repository

  3. In the drop-down menu, select an option to clone with HTTPS or SSH, and click the Clone icon.

    Important note

    If you decide to clone with Secure Shell (SSH) and this is your first time cloning from GitHub, you will likely need to configure an SSH key pair. For more information, see https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh.

  4. Go to your terminal and run the git clone command with the HyperText Transfer Protocol Secure (HTTPS) or SSH address you copied in Step 3, as follows:

    git clone git@github.com:pachyderm/pachyderm.git

    You should see an output similar to this:

    Cloning into 'pachyderm'...
    remote: Enumerating objects: 226153, done.
    remote: Counting objects: 100% (171/171), done.
    ...

    The Pachyderm source code will be cloned to the pachyderm directory.

  5. Go to the pachyderm directory by running the following command:

    cd pachyderm

  6. Check the branch you are on by running the following command:

    git branch

  7. Get a list of tags by running the following command:

    git fetch --tags

  8. Verify the version of pachctl and pachd that you are using by running the following command:

    pachctl version

    Your output might look like this:

    COMPONENT           VERSION
    pachctl             2.0.1
    pachd               2.0.1

  9. Check out the tag that corresponds to the version of pachctl and pachd that you are using. In this example, we need to check out the v2.0.1 tag:

    git checkout tags/v2.0.1

  10. Check that you have switched to the correct version by running the following command:

    git branch

    You should see the following output:

    * (HEAD detached at v2.0.1)
      master

We now have valid Pachyderm source code that we will use to access our Pachyderm cluster. Next, let's connect to Pachyderm with the Go client.
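If you switch between clusters often, matching the tag to the cluster version can be scripted. The following is a minimal Python sketch, not part of the chapter's scripts: tag_for_version is a hypothetical helper name, the sample text is the pachctl version output shown earlier, and the sketch assumes that tags follow the v<version> pattern used in this example.

```python
def tag_for_version(version_output: str, component: str = "pachd") -> str:
    """Return the Git tag (for example, "v2.0.1") for a component listed
    in the tabular output of `pachctl version`."""
    for line in version_output.splitlines():
        fields = line.split()
        # Each data row has exactly two columns: COMPONENT and VERSION.
        if len(fields) == 2 and fields[0] == component:
            return "v" + fields[1]
    raise ValueError(f"component {component!r} not found in version output")


# Sample text copied from the `pachctl version` output shown earlier.
sample_output = """COMPONENT           VERSION
pachctl             2.0.1
pachd               2.0.1
"""

print(tag_for_version(sample_output))  # v2.0.1
```

The returned string can then be passed to git checkout, for example as tags/v2.0.1.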

Connecting to Pachyderm with the Go client

You must have a running Pachyderm cluster to be able to use the Go API client. If you have followed previous sections, you likely have one running on a cloud platform of your choice or locally. If not, go back to Chapter 4, Installing Pachyderm Locally, or Chapter 5, Installing Pachyderm on a Cloud Platform, and deploy a cluster.

We will use the access.go script to access Pachyderm. Let's look at the script to understand how it works. The first part of the script imports the required components, as we can see here:

package main

import (
    "fmt"
    "log"

    "github.com/gogo/protobuf/types"
    "github.com/pachyderm/pachyderm/v2/src/client"
)

The second part of the script defines a main function. Every executable Go program must define a main function in the main package; otherwise, it cannot be run. The main function specifies the Internet Protocol (IP) address and port of your cluster. In the following code example, the address is localhost, or 127.0.0.1, and 30650 is the pachd port:

func main() {
    c, err := client.NewFromURI("grpc://localhost:30650")
    if err != nil {
        log.Fatal(err)
    }

The third part of the script, shown here, gets the version of your cluster:

    version, err := c.VersionAPIClient.GetVersion(c.Ctx(), &types.Empty{})
    if err != nil {
        panic(err)
    }
    fmt.Println(version)
}

To connect to a Pachyderm cluster, you need to know the IP address of your cluster. If running these examples on a local machine, grpc://localhost:30650 should work.

Now, let's run the script. Follow these next steps:

  1. Unless you have a load balancer deployed that enables access to your cluster, you need to keep Pachyderm port-forwarding running for as long as you access your cluster through the API. To start Pachyderm port-forwarding, run the following command in a separate terminal window:

    pachctl port-forward

  2. Run the access.go script, as follows:

    go run access.go

This is an example response that you should get:

major:2 micro:1

We have successfully accessed our cluster through the Go API. Our cluster is running version 2.0.1. Your version might be different.

Now, let's use the Go API to create a Pachyderm repository.

Creating a repository with the Go client

Now that we know how to connect to Pachyderm, let's create a repository with the code we have in the create-repo.go script.

Here is what the script imports:

package main

import (
    "fmt"
    "log"

    "github.com/pachyderm/pachyderm/v2/src/client"
    "github.com/pachyderm/pachyderm/v2/src/pfs"
)

The next part of the script defines a main function that performs the following operations:

  1. Connects to the Pachyderm cluster.
  2. Creates a repository called photos.
  3. Lists the repositories on this cluster.

Here is how it looks:

func main() {
    // Connect by using the settings from the local pachctl configuration.
    c, err := client.NewOnUserMachine("user")
    if err != nil {
        log.Fatal(err)
    }

    // Create the photos repository, or update it if it already exists.
    if _, err := c.PfsAPIClient.CreateRepo(
        c.Ctx(),
        &pfs.CreateRepoRequest{
            Repo:        client.NewRepo("photos"),
            Description: "A repository that stores images.",
            Update:      true,
        },
    ); err != nil {
        panic(err)
    }

    // List all repositories on the cluster.
    repos, err := c.ListRepo()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(repos)
}

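You must keep port-forwarding running while you run the script. This script uses client.NewOnUserMachine, which reads the connection settings from your local pachctl configuration instead of a hardcoded address, so make sure that pachctl is configured to reach your cluster. If you are running the cluster in minikube, you probably don't need to change anything.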

  1. Run the create-repo.go script, as follows:

    go run create-repo.go

This command returns the following output:

[repo:<name:"photos" type:"user" > created:<seconds:1637264349 nanos:440180000 > description:"A repository that stores images." auth_info:<permissions:REPO_READ permissions:REPO_INSPECT_COMMIT permissions:REPO_LIST_COMMIT permissions:REPO_LIST_BRANCH permissions:REPO_LIST_FILE permissions:REPO_INSPECT_FILE permissions:REPO_ADD_PIPELINE_READER permissions:REPO_REMOVE_PIPELINE_READER permissions:PIPELINE_LIST_JOB permissions:REPO_WRITE permissions:REPO_DELETE_COMMIT permissions:REPO_CREATE_BRANCH permissions:REPO_DELETE_BRANCH permissions:REPO_ADD_PIPELINE_WRITER permissions:REPO_MODIFY_BINDINGS permissions:REPO_DELETE roles:"repoOwner" > ]

Now that we have created a repository, let's put some data into it.

Putting data into a Pachyderm repository with the Go client

In the previous section, we created a Pachyderm repository called photos. Let's add the files that we have used in Chapter 6, Creating Your First Pipeline, to this repository. We will use the put-files.go script to add the files. Here is what the script imports:

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/pachyderm/pachyderm/v2/src/client"
)

The next part of the script connects to the Pachyderm cluster and adds the landscape.png, brown_vase.png, and hand.png files to the master branch of the photos repository.

Here is the part that connects to the cluster. Like the previous script, it uses client.NewOnUserMachine, so make sure that your pachctl configuration points to your cluster:

func main() {
    c, err := client.NewOnUserMachine("user")
    if err != nil {
        log.Fatal(err)
    }

This part adds the files:

    // Target the master branch of the photos repository.
    myCommit := client.NewCommit("photos", "master", "")

    // Upload landscape.png.
    f1, err := os.Open("landscape.png")
    if err != nil {
        panic(err)
    }
    defer f1.Close()
    if err := c.PutFile(myCommit, "landscape.png", f1); err != nil {
        panic(err)
    }

    // Upload brown_vase.png.
    f2, err := os.Open("brown_vase.png")
    if err != nil {
        panic(err)
    }
    defer f2.Close()
    if err := c.PutFile(myCommit, "brown_vase.png", f2); err != nil {
        panic(err)
    }

    // Upload hand.png.
    f3, err := os.Open("hand.png")
    if err != nil {
        panic(err)
    }
    defer f3.Close()
    if err := c.PutFile(myCommit, "hand.png", f3); err != nil {
        panic(err)
    }

And the last part, shown here, lists the files in the master branch of the photos repository:

    files, err := c.ListFileAll(myCommit, "/")
    if err != nil {
        panic(err)
    }
    fmt.Println(files)
}

Let's run this script with the following command:

go run put-files.go

This script returns the following output:

[file:<commit:<branch:<repo:<name:"photos" type:"user" > name:"master" > id:"2c15226b838f48cabd2ae13b43c26517" > path:"/brown_vase.png" datum:"default" > file_type:FILE committed:<seconds:1637299733 nanos:503150000 > size_bytes:93481 hash:"20612326376O&323313212215226Ra346245=Er _@E23360352240275}204235346"  file:<commit:<branch:<repo:<name:"photos" type:"user" > name:"master" > id:"2c15226b838f48cabd2ae13b43c26517" > path:"/hand.png" datum:"default" > file_type:FILE committed:<seconds:1637299733 nanos:503150000 > size_bytes:856063 hash:"14X22432251260(263267234345{16353a310357935432337213357yFg27425600}"  file:<commit:<branch:<repo:<name:"photos" type:"user" > name:"master" > id:"2c15226b838f48cabd2ae13b43c26517" > path:"/landscape.png" datum:"default" > file_type:FILE committed:<seconds:1637299733 nanos:503150000 > size_bytes:54009 hash:"320:265363363z&264324]364unfv24330001[206347344b257274366220JnR04" ]

Great! We have a repository with data in it. Now, let's learn how to create a pipeline.

Creating pipelines with the Go client

Finally, we can create pipelines for our example from Chapter 6, Creating Your First Pipeline.

Here is what the create-pipeline.go script imports:

package main

import (
    "fmt"
    "log"

    "github.com/pachyderm/pachyderm/v2/src/client"
    "github.com/pachyderm/pachyderm/v2/src/pps"
)

The second part of the script connects to the Pachyderm cluster by using the pachd IP address, as follows:

func main() {
    c, err := client.NewFromAddress("127.0.0.1:30650")
    if err != nil {
        log.Fatal(err)
    }

The next part of the script creates a contour pipeline. You can see that the script uses the svekars/contour-histogram:1.0 image and gets the data from the photos repository with the / glob pattern. One important thing to note in the example shown here is that you need to specify the parallelism_spec for all pipelines:

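    if err := c.CreatePipeline(
        "contour",                            // pipeline name
        "svekars/contour-histogram:1.0",      // Docker image
        []string{"python3", "/contour.py"},   // command
        []string{},                           // stdin
        &pps.ParallelismSpec{
            Constant: 1,
        },
        client.NewPFSInput("photos", "/"),    // input repository and glob pattern
        "",                                   // output branch (default)
        false,                                // update
    ); err != nil {
        panic(err)
    }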

Next, the script creates a histogram pipeline, as follows:

    if err := c.CreatePipeline(
        "histogram",
        "svekars/contour-histogram:1.0",
        []string{"python3", "/histogram.py"},
        []string{},
        &pps.ParallelismSpec{
            Constant: 1,
        },
        client.NewPFSInput("contour", "/"),
        "",
        false,
    ); err != nil {
        panic(err)
    }

And finally, the script lists all the created pipelines, as follows:

    pipelines, err := c.ListPipeline(true)
    if err != nil {
        panic(err)
    }
    fmt.Println(pipelines)
}

Run the following command:

go run create-pipeline.go

Here is an example response:

[pipeline:<name:"histogram" > version:1 spec_commit:<branch:<repo:<name:"histogram" type:"spec" > name:"master" > id:"44945b0d0e2944e3b1015617e224e3e3" > state:PIPELINE_STARTING job_counts:<key:1 value:1 > last_job_state:JOB_CREATED parallelism:1 type:PIPELINE_TYPE_TRANSFORM details:<transform:<image:"svekars/contour-histogram:1.0" cmd:"python3" cmd:"/histogram.py" > parallelism_spec:<constant:1 > created_at:<seconds:1637300756 nanos:806783300 > output_branch:"master" input:<pfs:<name:"contour" repo:"contour" repo_type:"user" branch:"master" glob:"/" > > salt:"0715a02027ba4489a79bd8a400f349ad" datum_tries:3 reprocess_spec:"until_success" >  

pipeline:<name:"contour" > version:1 spec_commit:<branch:<repo:<name:"contour" type:"spec" > name:"master" > id:"f3f8bf226e5a4dda8a9f27da10b7fd87" > state:PIPELINE_STARTING job_counts:<key:1 value:1 > last_job_state:JOB_CREATED parallelism:1 type:PIPELINE_TYPE_TRANSFORM details:<transform:<image:"svekars/contour-histogram:1.0 " cmd:"python3" cmd:"/contour.py" > parallelism_spec:<constant:1 > created_at:<seconds:1637300756 nanos:592992600 > output_branch:"master" input:<pfs:<name:"photos" repo:"photos" repo_type:"user" branch:"master" glob:"/" > > salt:"98c0a867ea56439eb1f2466fbf1aa838" datum_tries:3 reprocess_spec:"until_success" > ]

You can see that the script has created two of our pipelines, as intended. We have the whole example uploaded in one file called contour-go-example.go in the chapter's GitHub repository. Now that you have learned how to do it, you can just run that one script to create a whole contour pipeline example from one command. Next, we'll learn how to clean up our cluster.

Cleaning up the cluster with the Go client

The cleanup.go script cleans up your cluster and deletes all the pipelines, data, and repositories. Only run it if you do not want to preserve the data anymore.

In addition to the standard fmt and log packages, this script only needs to import the client package from the Pachyderm repository. For this, the following code is required:

package main

import (
    "fmt"
    "log"

    "github.com/pachyderm/pachyderm/v2/src/client"
)

The next part of the script connects to the cluster and deletes all the repositories and pipelines. We set the force flag to true for all pipelines and repositories so that Pachyderm does not interrupt deletion due to downstream pipeline dependencies. The code is illustrated in the following snippet:

func main() {
    // Connect in the same way as the previous scripts.
    c, err := client.NewOnUserMachine("user")
    if err != nil {
        log.Fatal(err)
    }

    if err := c.DeleteRepo("contour", true); err != nil {
        panic(err)
    }
    if err := c.DeleteRepo("photos", true); err != nil {
        panic(err)
    }
    if err := c.DeleteRepo("histogram", true); err != nil {
        panic(err)
    }
    if err := c.DeletePipeline("contour", true); err != nil {
        panic(err)
    }
    if err := c.DeletePipeline("histogram", true); err != nil {
        panic(err)
    }

The final part of the script lists the remaining pipelines and repositories. Both lists are empty because we have deleted everything, as illustrated in the following code snippet:

    pipelines, err := c.ListPipeline(true)
    if err != nil {
        panic(err)
    }
    fmt.Println(pipelines)

    repos, err := c.ListRepo()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(repos)
}

Run the following command:

go run cleanup.go

This command returns the following output:

[]

[]

In this section, we have learned how to use the Go client to create Pachyderm pipelines and repositories. Next, let's learn how to do this with the Pachyderm Python client.

Using the Pachyderm Python client

Python is probably one of the most popular languages within the software engineering and data science community. Pachyderm provides an officially supported Python client through the python-pachyderm package. You can find the Python Pachyderm source repository on GitHub at https://github.com/pachyderm/python-pachyderm and on the Python Package Index (PyPI) at https://pypi.org/project/python-pachyderm/.

The main files that you can use as a Python client reference are located in the https://github.com/pachyderm/python-pachyderm/tree/master/src/python_pachyderm/mixin directory of the python-pachyderm source repository.

Before you proceed, you must have the following components configured on your machine:

  • A copy of the Pachyderm repository (see the Cloning the Pachyderm source repository section). With Python Pachyderm, you can clone your repository to any directory on your machine. It does not have to be $GOPATH.
  • Python 3.6 or later installed on your machine.
  • Access to an active Pachyderm cluster. If it's a local installation, you need to keep Pachyderm port-forwarding running for as long as you work with the cluster through the APIs. If it's a cloud installation, you need either a load balancer that enables access to your cluster or, in some cases, Pachyderm port-forwarding.

We have reviewed the prerequisites for this section. Now, let's install the python-pachyderm client.

Installing the Pachyderm Python client

Before you can start using the Pachyderm Python client, you need to install it on your machine.

To install the Python Pachyderm client, complete the following steps:

  1. Open a terminal window.
  2. If you are on macOS or Linux, run the following command:

    pip install python-pachyderm

  You should see the following output:

Collecting python-pachyderm

  Downloading python-pachyderm-6.2.0.tar.gz (409 kB)

  ...

Successfully installed grpcio-1.38.0 protobuf-3.17.0 python-pachyderm-6.2.0

Your version of the python-pachyderm package might be different.
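If you want to confirm from Python which version was installed, the standard library can report it. The following is a minimal sketch rather than part of the chapter's scripts: installed_version is a hypothetical helper name, and the sketch assumes Python 3.8 or later, because that is when importlib.metadata was added to the standard library.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("python-pachyderm")
if version is None:
    print("python-pachyderm is not installed")
else:
    print(f"python-pachyderm {version} is installed")
```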

Now that we have python-pachyderm installed, let's connect to Pachyderm by using python-pachyderm.

Connecting to your Pachyderm cluster with the Python client

To get started, let's use the access.py script to connect to your cluster. Make sure port-forwarding is running on your machine. Here is the script:

import python_pachyderm

# Connect to pachd with the default settings.
client = python_pachyderm.Client()
print(client.get_remote_version())

This script connects to pachd, which listens on localhost:30650, by using the python_pachyderm.Client() invocation with its default host and port, and prints the version of Pachyderm that you are running.

Let's run this script and see what output it returns.

Run the access.py script with the following command:

python access.py

You should see output similar to this:

major: 2

micro: 1

This output means that we are on version 2.0.1. Your output might be different.
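The output has no minor field, most likely because Protobuf text output omits numeric fields whose value is 0. The following minimal sketch shows how the printed fields map to the dotted version string. The format_version function is a hypothetical helper written for illustration and is not part of python-pachyderm.

```python
def format_version(major: int = 0, minor: int = 0, micro: int = 0) -> str:
    """Join the numeric version fields into a dotted version string.

    Fields that are missing from the printed message default to 0.
    """
    return f"{major}.{minor}.{micro}"


# The output above lists only major and micro, so minor falls back to 0.
print(format_version(major=2, micro=1))  # 2.0.1
```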

Now that we know how to access the cluster, let's go ahead and create a Pachyderm repository.

Creating a Pachyderm repository with the Python client

We will use the create-repo.py script to create a Pachyderm repository called photos.

Here is the code of the script:

import python_pachyderm

client = python_pachyderm.Client()
client.create_repo("photos")
print(list(client.list_repo()))

Run the create-repo.py script with the following command:

python create-repo.py

Here is an example output:

[repo {

  name: "photos"

  type: "user"

}

created {

  seconds: 1637207890

  nanos: 80987000

}

auth_info {

  permissions: REPO_READ

  permissions: REPO_INSPECT_COMMIT

...

Now that we have a repository created, let's put some data into it.

Putting data into a Pachyderm repository with the Python client

We will put the same files we used in Chapter 6, Creating Your First Pipeline, to the photos repository we have just created. Here is the script that we will use:

import python_pachyderm

client = python_pachyderm.Client()

# Add three files to the master branch of the photos repository in one commit.
with client.commit('photos', 'master') as i:
    client.put_file_url(i, 'landscape.png', 'https://i.imgur.com/zKo9Mdl.jpg')
    client.put_file_url(i, 'hand.png', 'https://i.imgur.com/HtZ8FyG.png')
    client.put_file_url(i, 'red_vase.png', 'https://i.imgur.com/d45jop9.jpg')

# List the files in the master branch after the commit is finished.
print(list(client.list_file(("photos", "master"), "/")))

The script uses the client.commit method to start a commit to the master branch of the photos repository, and the client.put_file_url method adds three files to the repository from their URLs. Note that the commit argument of client.list_file is passed as a (repository, branch) tuple and not as a string, and that the result is wrapped in list() so that it can be printed.

Let's run this script.

Run the put-files.py script with the following command:

python put-files.py

Here is the system response that you should get:

[file {

  commit {

    branch {

      repo {

        name: "photos"

        type: "user"

      }

      name: "master"

    }

    id: "e29c6f5c49244ce193fe5f86c9df0297"

  }

  path: "/hand.png"

  datum: "default"

}

file_type: FILE

committed {

  seconds: 1637208291

  nanos: 161527000

}

size_bytes: 856063

hash: "14X22432251260(263267234345{16353a310357935432337213357yFg27425600}"  

...

]

The preceding output is truncated. You should see the same output for each file that we added to the repository.

Now that we have added the files, let's create pipelines for this example.

Creating pipelines with the Pachyderm Python client

Now that we have a repository and files uploaded to it, let's use the create-pipeline.py script to create two pipelines from the example we had in Chapter 6, Creating Your First Pipeline.

python-pachyderm provides two methods to create pipelines, as outlined here:

  • create_pipeline: This method works for pipelines written in any language and is equivalent to the pachctl create pipeline command.
  • create_python_pipeline: This method is designed for pipelines that run Python code and provides a slightly different User Experience (UX). You can read more about this method in the Pachyderm documentation, at https://docs.pachyderm.com.

We will use the standard create_pipeline method to create this pipeline.

The first part of the script creates a contour pipeline, as follows:

import python_pachyderm
from python_pachyderm.service import pps_proto

client = python_pachyderm.Client()

client.create_pipeline(
    pipeline_name="contour",
    transform=pps_proto.Transform(
        cmd=["python3", "contour.py"],
        image="svekars/contour-histogram:1.0",
    ),
    input=pps_proto.Input(
        pfs=pps_proto.PFSInput(glob="/", repo="photos")
    ),
)

The second part of the script creates a histogram pipeline, as follows:

client.create_pipeline(
    pipeline_name="histogram",
    transform=pps_proto.Transform(
        cmd=["python3", "histogram.py"],
        image="svekars/contour-histogram:1.0",
    ),
    input=pps_proto.Input(
        pfs=pps_proto.PFSInput(glob="/", repo="contour")
    ),
)
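Because create_pipeline mirrors pachctl create pipeline, it can help to see the same two pipelines written as JSON pipeline specifications. The following is a minimal sketch that uses only the standard library and does not contact a cluster. The pipeline_spec function is a hypothetical helper, and the field layout is an assumption based on the arguments used in the two calls above.

```python
import json


def pipeline_spec(name: str, image: str, cmd: list, repo: str, glob: str = "/") -> dict:
    """Build a dictionary shaped like a Pachyderm pipeline specification."""
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        "input": {"pfs": {"repo": repo, "glob": glob}},
    }


# The same values that are passed to create_pipeline in this section.
contour = pipeline_spec(
    "contour", "svekars/contour-histogram:1.0", ["python3", "contour.py"], "photos"
)
histogram = pipeline_spec(
    "histogram", "svekars/contour-histogram:1.0", ["python3", "histogram.py"], "contour"
)

print(json.dumps(contour, indent=2))
print(json.dumps(histogram, indent=2))
```

Writing the specification as plain data makes it easy to compare the Python call with a specification file that you would pass to pachctl.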

And the last part of the script prints a list of the pipelines that were created, as illustrated in the following code snippet:

print(list(client.list_pipeline()))

Let's run this script.

Run the create-pipeline.py script with the following command:

python create-pipeline.py

Here is a fragment of the output:

[pipeline {

  name: "histogram"

}

version: 1

spec_commit {

  branch {

    repo {

      name: "histogram"

      type: "spec"

    }

    name: "master"

  }

  id: "94286ef36318425c8177bd4e0f959c57"

}

state: PIPELINE_STARTING

job_counts {

  key: 1

  value: 1

}...

In this section, we have learned how to create pipelines by using the python-pachyderm client. Next, let's clean up our cluster.

Cleaning up the cluster with the Python client

We have successfully recreated our contour and histogram pipeline example. The whole example is available as one file called contour-histogram-example.py in the GitHub repository. You can download it at https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/tree/main/Chapter10-Pachyderm-Language-Clients and recreate it as many times as needed.

In this section, we will clean up our cluster so that we have a clean installation for Chapter 11, Using Pachyderm Notebooks. We will use the cleanup.py script for that, as follows:

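import python_pachyderm

client = python_pachyderm.Client()

# Delete the input repository and both pipelines with their output repositories.
client.delete_repo("photos", force=True)
client.delete_pipeline(pipeline_name="contour", force=True, keep_repo=False)
client.delete_pipeline(pipeline_name="histogram", force=True, keep_repo=False)

# Both lists should now be empty.
print(list(client.list_repo()))
print(list(client.list_pipeline()))

This script deletes the photos repository and both pipelines one by one, with the force flag set so that dependencies between them do not block the deletion. Alternatively, you can use the delete_all_pipelines method, which deletes all pipelines in the cluster, or delete_all, which deletes all objects and primitives in the cluster.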
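If you prefer not to force deletion, one option is to delete pipelines from the downstream end first, so that nothing that is still in use gets removed. The following is a minimal sketch of that ordering in plain Python. It does not call Pachyderm, deletion_order is a hypothetical helper, and the sketch assumes that each pipeline writes to an output repository with the same name as the pipeline.

```python
def deletion_order(inputs: dict) -> list:
    """Order pipelines so that each one comes before the pipelines it reads from.

    `inputs` maps a pipeline name to the names of its input repositories.
    """
    order = []
    remaining = dict(inputs)
    while remaining:
        # A pipeline is safe to delete when no remaining pipeline reads its output.
        consumed = {repo for repos in remaining.values() for repo in repos}
        leaves = sorted(name for name in remaining if name not in consumed)
        if not leaves:
            raise ValueError("circular dependency between pipelines")
        for name in leaves:
            order.append(name)
            del remaining[name]
    return order


# Hypothetical description of this chapter's example: contour reads from
# photos, and histogram reads from contour.
print(deletion_order({"contour": ["photos"], "histogram": ["contour"]}))
# ['histogram', 'contour']
```

With this order, histogram is deleted before contour, and the photos repository can be deleted last.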

Let's run this script.

Run the cleanup.py script with the following command:

python cleanup.py

This command should return the following output:

[]

[]

That's it! We have successfully cleaned up our cluster.

Summary

In this chapter, we've learned about how to use the two officially supported Pachyderm language clients—the Pachyderm Go client and the Python client. We've learned how to clone the Pachyderm repository and switch to the correct branch and tag. We've learned how to connect, create repositories, put files into repositories, and create pipelines, as well as delete all the objects after we are done. There is much more that you can do with these two language clients, but the examples in this chapter give you a general idea about how to use them.

In the next chapter, we will learn how to integrate Pachyderm with JupyterHub, a popular data science Integrated Development Environment (IDE) for which Pachyderm has a special plugin. We will also work more with the python-pachyderm client.
