Chapter 3: Pachyderm Pipeline Specification

A Machine Learning (ML) pipeline is an automated workflow that enables you to execute the same code continuously against different combinations of data and parameters. A pipeline ensures that every cycle is automated and goes through the same sequence of steps. Like in many other technologies, in Pachyderm, an ML pipeline is defined by a single configuration file called the pipeline specification, or the pipeline spec.

The Pachyderm pipeline specification is the most important configuration in Pachyderm as it defines what your pipeline does, how often it runs, how the work is spread across Pachyderm workers, and where to output the result.

This chapter is intended as a pipeline specification reference and will walk you through all the parameters you can specify for your pipeline. To do this, we will cover the following topics:

  • Pipeline specification overview
  • Understanding inputs
  • Exploring informational parameters
  • Exploring transformation
  • Optimizing your pipeline
  • Exploring service parameters
  • Exploring output parameters

Pipeline specification overview

Typically, when you conduct an ML experiment, it involves multiple sequential steps. In the simplest scenario, your pipeline takes input from an input repository, applies your transformation code, and outputs the result in the output repository. For example, you may have a set of images to apply a monochrome filter to and then output the result in an output repository that goes by the same name as the pipeline. This workflow performs only one operation and can be called a one-step pipeline, or one-step workflow. A diagram for such a pipeline would look like this:

Figure 3.1 – One-step workflow

The specification for this simple pipeline, in YAML format, would look like this:

---

pipeline:

  name: apply-photo-filter

transform:

  cmd:

  - python3

  - "/photo-filter.py"

  image: myregistry/filter

input:

  pfs:

    repo: photos

    glob: "/"

This is the simplest pipeline specification that you can create. It must include the following parameters:

  • name: A descriptive name for your pipeline. Often, the name of the pipeline describes a step in your ML workflow. For example, if this pipeline applies a photo filter to your images, you can call it apply-photo-filter. Alternatively, if it validates your model, you could call it cross-validation.
  • transform: This parameter includes your transformation code, which can be specified as a reference to a file or directly inline. We will discuss this parameter in more detail in the next section.
  • input: This parameter refers to an existing input repository that contains the files for the pipeline. Inputs are exposed as a filesystem inside your pipeline worker under the pfs/ directory. For example, if your input repository is called photos, its files appear under pfs/photos on the pipeline worker. The output repository is created automatically by the pipeline and has the same name as the pipeline, and your code writes to it through pfs/out. For example, if your pipeline is called apply-photo-filter, anything your code writes to pfs/out is committed to the apply-photo-filter output repository (see the sketch after this list).
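
To make this layout concrete, the following is a hedged sketch of a simplified variant of the same pipeline that uses an inline Bash command instead of a custom script, just to show where the input and output repositories appear inside the worker; the base image is an assumption chosen only because it ships with Bash:

---
pipeline:
  name: apply-photo-filter
transform:
  image: ubuntu:20.04   # assumption: any image that provides bash would do
  cmd:
  - bash
  stdin:
  # Files from the photos input repository appear under /pfs/photos;
  # anything written to /pfs/out is committed to the apply-photo-filter output repository.
  - cp /pfs/photos/* /pfs/out/
input:
  pfs:
    repo: photos
    glob: "/"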

This is a very simple example of a pipeline. In a more realistic use case, you would likely have more than one pipeline step. In a typical ML pipeline, you need to perform pre-processing, training, and cross-validation steps, among others. When you have multiple pipelines chained together, this is called a multi-step workflow. For example, if you are creating an NLP pipeline, your pipeline may have the following structure:

Figure 3.2 – Multi-step workflow

In the preceding diagram, each pipeline has a pipeline specification with a name, input repository, transformation code, and other parameters defined.
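
Chaining steps together is simply a matter of pointing each pipeline's input at the output repository of the previous step, which carries that pipeline's name. The following is a hedged sketch of what a downstream training step might look like; the pipeline, script, and image names are invented for illustration and assume an upstream pipeline called data-cleaning:

---
pipeline:
  name: train-model
transform:
  cmd:
  - python3
  - "/train.py"
  image: myregistry/trainer
input:
  pfs:
    # The output repository of the upstream data-cleaning pipeline
    # serves as the input repository for this step.
    repo: data-cleaning
    glob: "/"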

All pipeline specifications must be written in YAML Ain't Markup Language (YAML) or JavaScript Object Notation (JSON) format. These formats are easy for people to read and write, are widely used for configuration files across the industry, and are easier to author than Extensible Markup Language (XML) and similar formats.

Now that we have reviewed the minimum pipeline specification and have looked at a more realistic example, let's review the other parameters that you can specify.

Understanding inputs

We described inputs in Chapter 2, Pachyderm Basics, in detail by providing examples. Therefore, in this section, we'll just mention that inputs define the type of your pipeline. You can specify the following types of Pachyderm inputs:

  • PFS is the generic input that defines a standard single-input pipeline and is the building block of the inputs in all multi-input pipelines.
  • Cross is an input that creates a cross-product of the datums from two input repositories. The resulting output will include all possible combinations of all datums from the input repositories.
  • Union is an input that adds datums from one repository to the datums in another repository.
  • Join is an input that matches datums with a specific naming pattern.
  • Spout is an input that consumes data from a third-party source and adds it to the Pachyderm filesystem for further processing.
  • Group is an input that combines datums from multiple repositories based on a configured naming pattern.
  • Cron is an input that triggers the pipeline at a specified time interval.
  • Git is an input that enables you to ingest data from a Git repository.

For all inputs except Cron and Git, you define the underlying input repositories with a pfs parameter, as the cross-input sketch below illustrates.
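
For example, a cross input that pairs every datum in one repository with every datum in another could look like the following hedged sketch; the repository names are invented for illustration:

input:
  cross:
  - pfs:
      repo: parameters
      glob: "/*"
  - pfs:
      repo: images
      glob: "/*"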

pfs

The pfs parameter, which stands for Pachyderm File System, defines the input and all its attributes, such as name, repository, branch, and others. A simple Pachyderm pipeline takes only one input, while the multi-input pipelines take multiple pfs inputs. You need to define one or multiple pfs inputs for all pipelines, except for Cron and Git.

Here are the sub-parameters of the pfs input, followed by a sketch that combines them:

  • name: The name of the input. It defaults to the name of the input repository.
  • repo: A Pachyderm input repository where the data is stored.
  • branch: A branch in the Pachyderm input repository where the data is stored. Often, this will be the master branch.
  • glob: A parameter that defines how to break the data into chunks for processing. You can read more about the glob parameter in the Datums section of Chapter 2, Pachyderm Basics.
  • lazy: A parameter that enables lazy, less aggressive downloading of data, so that files are only downloaded when they are actually read. The lazy parameter is useful when you only need to look into a subset of your data.
  • s3: This parameter defines whether to include an S3 gateway sidecar on the pipeline. When you integrate with third-party applications through the S3 gateway, this ensures that the pipeline's provenance is preserved.
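
The following is a hedged sketch of a pfs input that combines several of these sub-parameters; the input and repository names are invented for illustration:

input:
  pfs:
    name: training-data
    repo: photos
    branch: master
    glob: "/*"
    lazy: true
    s3: false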

You can read more about the types of pipelines and inputs and view example pipelines in Chapter 2, Pachyderm Basics. The next section describes informational parameters that you can define for your pipeline.

Exploring informational parameters

Pipeline informational parameters define basic information about the pipeline. Out of all of them, only the name parameter is required in any pipeline specification. All other parameters are optional and can be omitted. Let's look at these parameters in more detail.

name

The name parameter is the descriptive name of your pipeline. Typically, you want to name a pipeline after the type of transformation it performs. For example, if your pipeline performs image classification, you may want to call it image-classification. A pipeline name must consist of alphanumeric characters, dashes, and underscores, and cannot exceed 63 characters.

The following is an example of the name parameter in YAML format:

---

pipeline:

  name: image-classification

The following is an example of the name parameter in JSON format:

{

  "pipeline": {

    "name": "image-classification"

  },

Next, let's look at the description parameter.

description

The description parameter provides additional information about the pipeline. Although it is an optional parameter, it is good practice to add a short description to your pipeline. For example, if your pipeline performs image classification, you can add the following description: A pipeline that performs image classification by using scikit-learn.

The following is an example of the description parameter in YAML format:

description: A pipeline that performs image classification by using scikit-learn.

The following is an example of the description parameter in JSON format:

"description": "A pipeline that performs image classification by using scikit-learn.",

Next, let's learn about metadata.

metadata

The metadata parameter enables you to specify Kubernetes labels or annotations. Labels are typically used to group Kubernetes objects into some sort of category and help simplify the management of those objects. Labels can be queried to display objects of the same type.

Annotations, on the other hand, can be used to specify arbitrary key-value pairs that are not defined within Kubernetes and can be consumed by external applications. You can use annotations to define the type of service, but things such as Identity and Access Management (IAM) roles should be specified through pod_patch or pod_spec instead. Multiple labels and annotations can be specified in each pipeline specification.

Here is an example of how to specify annotations in YAML format:

metadata:

  annotations:

    annotation: data

The following example shows how to specify annotations in JSON format:

"metadata": {

    "annotations": {

        "annotation": "data"

    }

  },

The following example shows how to specify labels in YAML format:

metadata:

  labels:

    label: object

Finally, the following example shows how to specify labels in JSON format:

"metadata": {

     "labels": {

        "label": "object"

    }

  },

Now that we've learned about the informational parameters, let's look at the transformation section of the Pachyderm pipeline.

Exploring transformation

The transformation section is where you define your pipeline transformation code. It is the core of your pipeline's functionality. Most pipelines, unless they are a connector between two pipelines or a pipeline that exports results outside of Pachyderm, must have a transformation section.

The most important and most commonly used parameters of the transformation section are image, cmd or stdin, env, and secrets.

Let's look at these parameters in more detail.

image

The image parameter defines a Docker image that your pipeline will run. A Docker image contains information about the environment in your pipeline container. For example, if you are running Python code, you will need to have some version of Python in your pipeline image. There are many publicly available containers that you can use for your pipeline.

You can also include your scripts in that container. Unless your code is just a Bash script that can be specified through the stdin parameter inline, you will likely need to build your own Docker image, include your code in that image, and store it in a public or private container registry. Docker images are built from a Dockerfile, which describes the container environment and what you can run in the container. You can read more about Docker images at https://docs.docker.com/.

Important note

Do not use the Docker CMD instruction; instead, use RUN. The CMD instruction will fail.

The following code shows how to specify a Docker image in YAML format:

transform:

  image: my-image

The following code shows how to specify a Docker image in JSON format:

"transform": {

    "image": "my-image",

However, just specifying a Docker image is not enough. You must define what to run, either through the cmd or stdin parameter.

cmd

The cmd parameter defines the code that the pipeline will run against your data. There is a lot of flexibility around what you can put in the cmd line. Typically, you want to specify the type of command you want to run, such as python, or set it to run a command-line shell, such as Bourne Shell (sh) or Bourne Again Shell (bash), and then specify the list of commands that you want to run in the stdin parameter.

Functionally, there is no strong preference between the two approaches. The practical difference is that if you specify a file in the cmd parameter, you will need to build a Docker image and include that file in the image.

For example, if you have a Python 3 file that contains the code that you want to run against your data, you can specify it like this in YAML format:

cmd:

  - python3

  - "/test.py"

Alternatively, you can specify the same thing in JSON format:

"transform": {

    "cmd": [ "python3", "/test.py" ],

However, if you want to specify your commands inline in the stdin parameter, just have the format specified in the cmd parameter like this, in YAML format:

cmd:

  - python3

You can do the same in JSON format:

"transform": {

    "cmd": [ "python3" ],

See the stdin section for examples of how you can specify your inline code.

Your cmd field can get even more complex than that, however. For example, you can specify a script inside the cmd parameter.

The following text is an example of the syntax you can use in the cmd field in YAML format:

transform:

  image: my-image

  cmd:

  - tree

  - "/pfs"

  - "&&"

  - python

  - my-script.py

  - "--outdir"

  - "/pfs/out"

  - "--input"

  - "/pfs/data "

The following text is an example of the syntax you can use in the cmd field in JSON format:

"transform": {

    "image": "my-image",

    "cmd": ["tree",

        "/pfs",

        "&&",

        "python",

        "my-script.py",

        "--outdir", "/pfs/out",

              "--input", "/pfs/data "]

  },

Next, let's review the stdin parameter.

stdin

The stdin parameter is similar to the UNIX standard input (stdin), and it enables communication between the Pachyderm environment and the pipeline worker. This means that you can put code, written in the language set by the cmd parameter, inline in the stdin field. You can also specify paths to your code files, similar to the cmd parameter.

This method does not require you to build a Docker image and allows you to configure your pipeline entirely through the pipeline specification. If you are unfamiliar with the Docker image-building process, this approach may feel more appealing. However, for more complex pipelines, you likely want to save your scripts in files, build Docker images, and use them in your pipeline.

The following code shows the syntax you can use in the stdin field in YAML format:

transform:

  cmd:

  - bash

  stdin:

  - for f in /pfs/data/*

  - do

  - filename=$(basename "$f")

  - cp "$f" /pfs/out/"$filename"

  - done

The following is an example of the syntax you can use in the stdin field in JSON format:

"transform": {

    "cmd": ["bash"],

    "stdin": [

        "for f in /pfs/data/*",

        "do",

        "filename=$(basename "$f")",

        "cp /pfs/data/* pfs/out/mypipeline/*",

        "done"]

  },

Because the preceding examples do not reference any files, you do not need to build a specific Docker image for them and include the file in there.

However, if you do reference any files or any environment prerequisites that are even slightly more complex than Bash, you likely need a Docker image. For example, if you have a my-script.py file that contains your code, you need to build a Docker image that will include that script, and you must reference it in your pipeline specification.
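
As a hedged sketch (the image and registry names are assumptions), referencing such a script in the transformation section could look like this:

transform:
  image: myregistry/my-image:latest
  cmd:
  - python3
  - "/my-script.py"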

err_cmd

The err_cmd parameter enables you to define how Pachyderm handles failed datums. It ultimately allows you to treat failed datums as non-critical errors, so that a job with failed datums can succeed and trigger the next job with only the healthy datums. err_cmd does not write any data to the output repository. The err_cmd field is often used in combination with the err_stdin field, where you specify the actual error-handling code, though you can also refer to a file that contains your error-handling code. If you simply want your pipeline to succeed even when a job contains failed datums, you can set err_cmd to run the Unix true command, for example, "err_cmd": ["true"], with no err_stdin.

The following is the syntax you can use in the err_cmd field in YAML format:

transform:

  ...

  err_cmd:

  - bash

  - "/my-error-code.sh"

The following is the syntax you can use in the err_cmd field in JSON format:

"transform": {

    ...

    "err_cmd": [ "bash", "/my-error-code.sh"],

We will look at an example of how to use err_cmd in combination with err_stdin in the next section.

err_stdin

The err_stdin parameter is used in combination with the err_cmd parameter to specify the error-handling code to run against failed datums. Similar to the stdin parameter, you can specify inline code to handle failed datums. For example, you may want to check whether a failed datum is in a specific directory and, if so, mark it as recovered so that the job can continue; otherwise, leave it marked as failed. Typically, you can write a simple Bash script with an if… then condition to handle this.

The following code shows the syntax you can use in err_stdin with the err_cmd field in YAML format:

transform:    

    ...

    err_cmd:

    - bash

    err_stdin:

    - if

    - "[-a /pfs/repo1]"

    - then

    - exit 0

    - fi

    - exit 1

The following code shows the syntax you can use in err_stdin with the err_cmd field in JSON format:

"transform": {

    ...

    "err_cmd": [

      "bash"

    ],

    "err_stdin": [

      "if",

      "[-a /pfs/repo1]",

      "then",

      "exit 0",

      "fi",

      "exit 1"

    ]

Next, let's learn about the env parameter.

env

The env parameter enables you to specify Pachyderm environment variables, as well as arbitrary variables that your code needs in order to communicate with third-party tools. These may include paths to directories and files, hostnames and ports, secret access keys, various identifiers, and many others.

Pachyderm variables can be included as well. For example, you can use the LOG_LEVEL environment variable to specify the verbosity of your log messages for pachd. As another example, you can also specify an AWS region and a bucket in the env field.

The following code shows the syntax you can use in the env field in YAML format:

transform:    

    ...

    env:

        AWS_REGION: us-west-2

        S3_BUCKET: s3://my-bucket/

The following code shows the syntax you can use in the env field in JSON format:

"transform": {

    ...

    "env": {

             "AWS_REGION": "us-west-2",

             "S3_BUCKET": "s3://my-bucket/"

         }

  },

For a complete list of Pachyderm variables, see the Pachyderm documentation at https://docs.pachyderm.com/latest/deploy-manage/deploy/environment-variables/.

secrets

The secrets field enables you to specify Kubernetes secrets, which contain sensitive information such as passwords or SSH keys. You define a secret by using either the env_var and key parameters or the mount_path parameter.

The following code shows the syntax you can use in the name and mount_path fields to set the path to the secrets file in YAML format:

transform:    

    ...

    secrets:
    - name: my-ssh-key
      mount_path: "/path/to/file"

The following code shows how to specify these parameters in JSON format:

"transform": {

    ...

    "secrets": [{

        "name": "my-ssh-key",

        "mount_path": "/path/to/file"

    }]

The following code shows the syntax you can use in the env_var and key parameters to set secrets in YAML format:

transform:    

    ...

    secrets:
    - name: my-ssh-key
      env_var: MY_KEY
      key: my_key

Here is how to do the same in JSON format:

"transform": {

    ...

    "secrets": {

        "name": "my-ssh-key",

        "env_var": "MY_KEY",

        "key": "my_key"

    }

Next, let's learn about image_pull_secrets.

image_pull_secrets

The image_pull_secrets parameter enables you to configure your Pachyderm pipeline to pull images from a private Docker registry. To specify this parameter, you need to create a Kubernetes secret with a Docker config, as described in the Kubernetes documentation at https://kubernetes.io/docs/concepts/containers/images/#creating-a-secret-with-a-docker-config, and then specify the secret in the pipeline specification under the image_pull_secrets parameter. You will need to use a full path to the Docker image for the pipeline to pull the image correctly.

The following code shows the syntax you can use in the image_pull_secrets parameter to enable the pipeline to pull images from a private Docker registry in YAML format:

transform:    

    ...

    image: my-private-docker-registry.com/my-project/my-image:latest

    image_pull_secrets:

      - my-secret

This is how you would write the same in JSON format:

"transform": {

    ...

    "image": "my-private-docker-registry.com/my-project/my-image",        

    "image_pull_secrets": ["my-secret"]

The next parameter that we will review is accept_return_code.

accept_return_code

The accept_return_code parameter enables you to specify an array of integer return codes that should still be treated as successful. You can use this functionality in cases where you want your code to be considered successful even if some part of it fails with a particular exit code. This parameter is similar to the err_cmd functionality.

The following code shows the syntax you can use in the accept_return_code parameter to specify error codes in YAML format:

transform:    

    ...

    accept_return_code:

    - 256

Here is the same example in JSON format:

"transform": {

    ...

    "accept_return_code": [256]

The next parameter we will look at is debug.

debug

The debug parameter enables you to set the verbosity of the pipeline's logging output. Basic logging is enabled by default, but if you'd like to include more detailed messaging, set this parameter to true. By default, this parameter is set to false.

Here is how you can enable debug logging for your pipeline in YAML format:

transform:    

    ...

    debug: true

This is how you would enable debug logging for your pipeline in JSON format:

"transform": {

    ...

    "debug": true

Next, let's learn how to use the user parameter.

user

The user parameter enables you to define a user and a group that runs the container's code. This parameter is similar to the Docker USER directive, and you can also define it through Dockerfile. By default, Pachyderm checks what's in your Dockerfile first and sets this value for the user parameter in the pipeline specification. If nothing is specified in your Dockerfile and the pipeline specification, the default parameter is used, which is root. The only time that you must explicitly specify a user in the pipeline specification is when you deploy Pachyderm with the --no-expose-docker-socket parameter.

You can read more about the Docker USER parameter at https://docs.docker.com/engine/reference/builder/#user.

Here is how you can specify user in YAML format:

transform:    

    ...

    user: test-user

Here is how you can specify user in JSON format:

"transform": {

    ...

    "user": "test-user"

Now, let's learn about the working_dir parameter.

working_dir

The working_dir parameter enables you to specify a working directory for your pipeline container. This parameter is similar to the Docker WORKDIR directive, and you can also define it through Dockerfile. By default, Pachyderm checks what's in your Dockerfile first and sets this value for the working_dir parameter in the pipeline specification. If nothing is specified in Dockerfile and the pipeline specification, the default parameter is used, which is the root directory (/) or the directory that the Docker image inherits from the base image. The only time that you must explicitly specify a working directory in the pipeline specification is when you deploy Pachyderm with the --no-expose-docker-socket parameter.

You can read more about the Docker Workdir parameter at https://docs.docker.com/engine/reference/builder/#workdir.

Here is how you can specify workdir in YAML format:

transform:    

    ...

    working_dir: /usr/src/test

The same parameter in JSON format would look like this:

"transform": {

    ...

    "working_dir": "/usr/src/test"

Next, we'll look at the dockerfile parameter.

dockerfile

The dockerfile parameter enables you to specify a path to the location of the Dockerfile for your pipeline. This is useful when you use Pachyderm's pachctl update-pipeline --build -f <pipeline-spec> command to build new Docker images for your pipeline. By default, Pachyderm looks for a Dockerfile in the same directory as the pipeline specification, but with the dockerfile parameter, you can set any path for it.

Here is how you can specify a path to Dockerfile in YAML format:

transform:    

    ...

    dockerfile: /path/to/dockerfile

You can do the same in JSON format like this:

"transform": {

    ...

    "dockerfile": "/path/to/dockerfile "

In this section, we learned about all the parameters that you can specify for your transformation code. In the next section, we will review how to control pipeline worker performance and assign resource limits to optimize your pipeline.  

Optimizing your pipeline

This section will walk you through the pipeline specification parameters that may help you optimize your pipeline to perform better. Because Pachyderm runs on top of Kubernetes, it is a highly scalable system that can help you use your underlying hardware resources wisely.

One of the biggest advantages of Pachyderm is that you can specify resources for each pipeline individually, as well as define how many workers your pipeline will spin up for each run and how they behave when they are idle, waiting for new work to come in.

If you are just testing Pachyderm to understand whether or not it would work for your use case, the optimization parameters may not be as important. But if you are working on implementing an enterprise-level data science platform with multiple pipelines and massive amounts of data being injected into Pachyderm, knowing how to optimize your pipeline becomes a priority.

You must understand the concept of Pachyderm datums before you proceed. Datums play a major role in pipeline scalability. If you have not read Chapter 2, Pachyderm Basics, yet, you may want to read it before you continue.

parallelism_spec

parallelism_spec defines the number of workers to spin off for your pipeline. You can specify either a coefficient or constant policy. By default, Pachyderm deploys one worker per pipeline with a constant policy of 1.

The coefficient policy means that Pachyderm creates a number of workers proportional to the specified coefficient. For example, if you have 50 nodes in your Kubernetes cluster and set the coefficient to 1, Pachyderm will run one worker per node and use all 50 nodes for this pipeline. If you use the coefficient policy, your pipeline needs access to the Kubernetes administrative nodes. If you are running Pachyderm on a hosted version of Kubernetes, such as on the AWS or GKE platform, you may not have access to these, and the pipeline will constantly restart. In that case, you will have to use the constant policy instead.

The constant policy enables you to specify the exact number of worker nodes for your pipeline, such as 3, 25, or 100. These workers run for this pipeline indefinitely. However, if you want your cluster to spin them down when they are idle, you can set the standby: true parameter so that your cluster resizes dynamically based on the workload, as in the sketch below.
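
As a hedged sketch (the worker count is illustrative, and it assumes the standby field mentioned above sits at the top level of the pipeline specification, next to parallelism_spec), a fixed pool of workers that is spun down when idle could be configured like this:

parallelism_spec:
  constant: 3
standby: true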

The following code shows the syntax you can use in the parallelism_spec parameter to specify the coefficient policy in YAML format:

parallelism_spec:

  coefficient: '1'

The following code shows the syntax you can use in the parallelism_spec parameter to specify the coefficient policy in JSON format:

"parallelism_spec": {

    "coefficient": "1"

  },

This is how you can define the parallelism_spec parameter to specify the constant policy in YAML format:

parallelism_spec:
  constant: 3

The following code shows how to use the parallelism_spec parameter to specify the constant policy in JSON format:

"parallelism_spec": {
    "constant": 3
  },

Next, let's learn how to use reprocess_spec.

reprocess_spec

reprocess_spec enables you to force your pipeline to reprocess all datums on every job. By default, this parameter is set to until_success, which means Pachyderm skips datums that have already been processed successfully. However, if your pipeline interacts with an external application or system, you may want to set it to every_job so that all datums are reprocessed. This behavior protects the pipeline from connection and other transient errors.

The following is the syntax you can use in reprocess_spec in YAML format:

reprocess_spec: every_job

The following is the syntax you can use in reprocess_spec in JSON format:

"reprocess_spec": "every_job"

Next, we'll learn how to use cache_size.

cache_size

The cache_size parameter enables you to define the amount of cache memory for the user and storage container. Pachyderm pre-downloads the data before processing it and increasing cache_size may help increase the download speed. The default value is 64M, and you can increase this as needed to cache your datums. This is a fine-tuning parameter and should only be used once you have optimized your pipeline through glob and parallelism_spec.

The following is an example of the syntax you can use in the cache_size parameter to increase the cache's size in YAML format:

cache_size: 1G

In JSON format, the same parameter would look like this:

"cache_size": "1G",

Now, let's review the max_queue_size parameter.

max_queue_size

The max_queue_size parameter defines how many datums the pipeline can download at the same time. You can use max_queue_size to make the pipeline pre-download the next datums while other datums are being processed. By default, Pachyderm sets max_queue_size to 1, meaning that only one datum is downloaded at a time. This is a fine-tuning parameter that can improve the download speed of datums into the worker if the download time is significantly longer than the processing time. However, you should only adjust this parameter once you have configured the correct glob and parallelism_spec.

The following code shows how to use the max_queue_size parameter to increase the number of datums that are downloaded concurrently, in YAML format:

max_queue_size: 5

The same key-value pair in JSON format looks as follows:

"max_queue_size": "5",

Next, we'll learn about chunk_spec.

chunk_spec

The chunk_spec parameter defines how many datums to send to each worker for processing at a time. By default, this parameter is set to 1. A chunk can be defined either by number (a number of datums) or by size_bytes (a total size in bytes). For example, if you set number to 3, each worker will get a chunk of three datums at a time.

With size_bytes, you can create evenly sized chunks, which is useful when the runtime of a datum is proportional to its size. If that is not the case, use the number parameter instead.

The following code shows how to set number in the chunk_spec parameter to make the workers process this number of datums at a time in YAML format:

chunk_spec:

  number: '10'

Here's how you can set number in the chunk_spec parameter to make the workers process this number of datums at a time in JSON format:

"chunk_spec": {

    "number": "10"

  },

The following code shows how to set size_bytes in the chunk_spec parameter so that each worker receives chunks of approximately this total size, in YAML format:

chunk_spec:

  size_bytes: '1210'

If you prefer to write in JSON format, setting size_bytes in the chunk_spec parameter would look like this:

"chunk_spec": {

    "size_bytes": "1210"

  },

Next, we will learn how to set resource limits.

resource_limits

The resource_limits parameter enables you to restrict the amount of Central Processing Unit (CPU), Graphics Processing Unit (GPU), and memory resources that your pipeline worker can use. You can specify a type for your GPU resources. The pipeline worker cannot use more resources than the limit you specify. The resource_requests parameter is similar, but the worker can go over the requested amount of resources if they are available. You can read more about resource_limits and resource_requests in the Kubernetes documentation at https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits.

The following code shows how to set the resource_limits parameter to limit Pachyderm worker resources in YAML format:

resource_limits:
  cpu: 1
  gpu:
    type: nvidia.com/gpu
    number: 1
  memory: 16G

The following code shows how to set the resource_limits parameter to limit Pachyderm worker resources in JSON format:

"resource_limits": {

    "cpu": 1,

    "gpu": {

        "type": "nvidia.com/gpu",

        "number": 1,

    "memory": "16G"

    }

  },

If you need to set a specific flavor of cloud resource, such as TPU in Google Kubernetes Engine, you can do so by configuring pod_patch. See the upcoming pod_patch section for more information.

resource_requests

The resource_requests parameter specifies the amount of resources that a pipeline worker requests to process a unit of work. Unlike the resource_limits parameter, if more resources are available, the worker can use them. The syntax for this parameter is the same as for resource_limits, as the following sketch shows.
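
The following is a minimal, hedged sketch of resource_requests in YAML format; the values are illustrative:

resource_requests:
  cpu: 0.5
  memory: 2G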

sidecar_resource_limits

This parameter is similar to resource_limits and defines the resource limits for the pipeline's sidecar container. For example syntax, see the resource_limits section.

scheduling_spec

The scheduling_spec parameter enables you to specify which nodes your pipeline pods run on, based on a node_selector or a priority_class. node_selector enables you to target a specific group of nodes that share the same Kubernetes label, while priority_class enables you to schedule the pipeline on a group of nodes that matches a Kubernetes PriorityClass. The scheduling_spec parameter is typically used to schedule pipelines on specific nodes because of the resources they provide. For more information about these properties, see https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector and https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass.

The following code shows how to define nodeSelector in YAML format:

scheduling_spec:

  node_selector:

    kubernetes.io/my-hostname: my-node

The following code shows how to define nodeSelector in JSON format:

"scheduling_spec": {

      "node_selector": {"kubernetes.io/my-hostname": "my-node"

      }

    }, 

To define the PriorityClass parameter in YAML format, use the following code:

scheduling_spec:

  priority_class: high-priority

Or, if you prefer to write in JSON format, set PriorityClass like this:

"scheduling_spec": {

      "priority_class": "high-priority"

      }

Next, we'll learn how to set a timeout for a job.

job_timeout

The job_timeout parameter enables you to set a timeout for a Pachyderm job run. This means that if your job does not finish within the specified period, it will fail. By default, this parameter is disabled. You can set it to your preferred time value.

The following code shows how to define a job_timeout in YAML format:

job_timeout: 10m

Here is the same example in JSON format:

"job_timeout": "10m"

Next, we'll learn about datum_timeout.

datum_timeout

The datum_timeout parameter is similar to job_timeout, except that the timeout is set at the level of an individual datum. The syntax is the same, as the sketch below shows.
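
As a minimal, hedged sketch, a 10-minute timeout applied at the datum level would look like this in YAML format:

datum_timeout: 10m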

datum_tries

The datum_tries parameter defines how many times Pachyderm will try to rerun a pipeline on a failed datum. By default, this parameter is set to 3. If you want your pipeline to run only once and not try to process the failed datums again, set this value to 1.

The following code shows how to define the datum_tries parameter in YAML format:

datum_tries: 1

If you prefer to write in JSON format, you can use the following code instead:

"datum_tries": "1"

In this section, we learned how to achieve optimal performance by fine-tuning our pipeline specification. In the next section, you will learn how to configure some of the service parameters that assist in pipeline operations.

Exploring service parameters

Now, let's look at service parameters. Service parameters include the parameters that let you collect statistics about your data, as well as patch your pipeline's Kubernetes configuration.

enable_stats

The enable_stats parameter, as its name suggests, enables pipeline statistics logging. By default, this parameter is disabled. For debugging purposes, it is recommended that you set this parameter to true. Once you enable statistics collection, the statistics are saved in the stats branch of the output repository, and statistics collection cannot be disabled again for that pipeline.

The following code shows how to define the enable_stats parameter in YAML format:

enable_stats: true

The following code shows how to define the enable_stats parameter in JSON format:

"enable_stats": true,

Next, we'll learn about pod_patch.

pod_patch

The pod_patch parameter enables you to rewrite any field in your pipeline Pods. This can be useful for many things, but one example is mounting a volume in your pipeline. To create a pod_patch, you would typically use a JSON patch builder, convert it into a one-liner, and add it to your pipeline specification.

The following code shows how to define the pod_patch parameter in YAML format:

pod_patch: '[{"op": "add", "path": "/spec/initContainers/0/resources/limits/my-volume"}]'

The same in JSON format looks like this:

"pod_patch": "[{"op": "add","path": "spec/initContainers/0/resources/limits/my-volume"}]"

This is all you need to know about service parameters. In the next section, we will look at some parameters that enable you to configure the output branch and write your pipeline results to external storage.

Exploring output parameters

Output parameters enable you to configure what happens to your processed data after the result lands in the output repository. You can have the results written to an external S3 bucket or configure an egress to external storage.

s3_out

The s3_out parameter enables your Pachyderm pipeline to write its output through the S3 gateway instead of the standard pfs/out filesystem. This parameter takes a Boolean value. To access the output repository from your code, you use an S3 protocol address, such as s3://<output-repo>. The output repository still has the same name as your pipeline.

The following code shows how to define an s3_out parameter in YAML format:

s3_out: true

Here's how to do the same in JSON format:

"s3_out": true

Now, let's learn about egress.

egress

The egress parameter enables you to specify an external location for your output data. Pachyderm supports Amazon S3 (the s3:// protocol), Google Cloud Storage (the gs:// protocol), and Azure Blob Storage (the wasb:// protocol).

The following code shows how to define an egress parameter in YAML format:

egress:

  URL: gs://mystorage/mydir

Here's the same example but in JSON format:

"egress": {

        "URL": "gs://mystorage/mydir"

  },

Next, let's learn about how to configure an output branch in Pachyderm.

output_branch

The output_branch parameter enables you to write your results into a branch in an output repository that is different from the default master branch. You may want to do this if you want to create an experiment or a development output that you don't want the downstream pipeline to pick up.

The following code shows how to define the output_branch parameter in YAML format:

output_branch: test

The following code shows how to define the output_branch parameter in JSON format:

"output_branch": "test",

This concludes our overview of the Pachyderm pipeline specification.

Summary

In this chapter, we learned about all the parameters you can specify in a Pachyderm pipeline, how to optimize it, and how to configure the transformation section. The pipeline specification is the most important configuration attribute of your pipeline as you will use it to create your pipeline. As you have learned, the pipeline specification provides a lot of flexibility regarding performance optimization. While it may be tricky to find the right parameters for your type of data right away, Pachyderm provides a lot of fine-tuning options that can help you achieve the best performance for your ML workflow.

In the next chapter, you will learn how to install Pachyderm on your local computer.
