Choosing the right compute options for your training job

It is important to choose the right compute options for a training job in order to utilize the platform resources optimally; doing so minimizes both the training time and the cost. These compute options are set as runtime attributes of the training job, which is a standard object on the AI Platform. The structure of the training job is as follows (input parameters are highlighted in bold font—please find the complete configuration at this link: https://github.com/PacktPublishing/Hands-On-Artificial-Intelligence-on-Google-Cloud-Platform):

{
  "jobId": string,            // Required: user-defined identifier for the job
  "createTime": string,       // Output parameter: indicates when the job was created
  "labels": {                 // Optional input parameter: recommended for organizing
    string: string,           // and troubleshooting runtime jobs
    ...
  },
  "trainingInput": {          // Specifies the input parameters for a training job
    object (TrainingInput)    // (required when submitting a training job)
  },
  "predictionInput": {        // Specifies the input parameters for a prediction job
    object (PredictionInput)  // (required when submitting a prediction job)
  }
}
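For illustration, a minimal training job request might look like the following sketch. The job ID, label values, bucket paths, and module name here are hypothetical placeholders, not values from the platform documentation:

{
  "jobId": "census_train_20200101_001",
  "labels": {
    "team": "analytics",
    "phase": "poc"
  },
  "trainingInput": {
    "scaleTier": "BASIC",
    "packageUris": ["gs://my-bucket/packages/trainer-0.1.tar.gz"],
    "pythonModule": "trainer.task",
    "region": "us-central1"
  }
}

Note that only one of trainingInput and predictionInput appears in the request body, depending on the type of job being submitted.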

Specifically, we need to populate either the TrainingInput or the PredictionInput resource for a runtime job configuration. These are mutually exclusive for a specific request, and only one of them can be used at runtime. Let's look at the JSON structure of the TrainingInput parameter in detail (please find the complete configuration at this link: https://github.com/PacktPublishing/Hands-On-Artificial-Intelligence-on-Google-Cloud-Platform):

{
  "scaleTier": enum (ScaleTier), // Required: specifies the machine types and the count
                                 // of replicas, workers, and parameter servers
  "packageUris": [               // Required: the Google Cloud Storage locations of the
    string                       // packages containing the training program along with
  ],                             // any additional dependencies
  "pythonModule": string,        // Required: the Python module to run after importing
                                 // all the packages and resolving the dependencies
  "args": [                      // Optional: command-line arguments passed to the
    string                       // Python module
  ],
  "hyperparameters": {           // Optional: the set of hyperparameters to be tuned
    object (HyperparameterSpec)
  },
  "region": string               // Required: the Compute Engine region in which the
}                                // training job will run
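As an illustration, a filled-in TrainingInput block for a simple distributed job might look as follows. The bucket paths, module name, and argument values are hypothetical and would be replaced with your own:

{
  "scaleTier": "STANDARD_1",
  "packageUris": [
    "gs://my-bucket/packages/trainer-0.1.tar.gz"
  ],
  "pythonModule": "trainer.task",
  "args": [
    "--train-files", "gs://my-bucket/data/train.csv",
    "--num-epochs", "10"
  ],
  "region": "us-central1"
}

The args list is forwarded verbatim to the Python module, so any flags the training program parses can be supplied there.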

We will take a look at the ScaleTier and HyperParameterSpec objects in detail. Before that, let's understand the JSON structure of the PredictionInput object that is used while submitting the prediction job, as shown in the following code block: 

{
  "dataFormat": enum (DataFormat),       // Required: format of the input data files (JSON, TEXT, and so on)
  "outputDataFormat": enum (DataFormat), // Optional: format of the output data files (default: JSON)
  "inputPaths": [                        // Required: Cloud Storage locations of the input data files
    string
  ],
  "maxWorkerCount": string,              // Optional: maximum number of workers (default: 10)
  "region": string,                      // Required: Google Compute Engine region
  "runtimeVersion": string,              // Optional: AI Platform runtime version
  "batchSize": string,                   // Optional: number of records per batch (default: 64)
  "signatureName": string,               // Optional: name of the signature defined in the saved model
  "modelName": string,                   // Use exactly one of modelName, versionName,
  "versionName": string,                 // or uri to identify the model that serves
  "uri": string,                         // the predictions
  "outputPath": string                   // Required: Cloud Storage location for the output files
}
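A concrete PredictionInput block for a batch prediction job might look like the following sketch. The project, model, and bucket names are hypothetical placeholders:

{
  "dataFormat": "JSON",
  "inputPaths": [
    "gs://my-bucket/batch/input-*.json"
  ],
  "region": "us-central1",
  "modelName": "projects/my-project/models/census",
  "outputPath": "gs://my-bucket/batch/output/"
}

Because modelName is set here, versionName and uri are omitted; the model's default version serves the predictions.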

Training and prediction performance improves a great deal when the right parameters are chosen. We need to understand the various scale tiers and hyperparameters in order to further optimize performance and cost. It is important to select the appropriate scale tier based on the volume of the training data and the complexity of the algorithm. The idea is to use just the right amount of resources for training and prediction, which helps to minimize the cost of training units. The advanced tiers provide additional capacity in terms of the number of CPU cores and the use of GPUs, as desired; however, the cost also increases as the tier advances. The following scale tiers are available on GCP:

  • BASIC: Provides a single worker instance and is suitable for learning and experimenting. This can also be used for a small-size proof of concept (POC). 
  • STANDARD_1: Provides many worker nodes and a few parameter servers. 
  • PREMIUM_1: Provides a large number of workers, with many parameter servers. 
  • BASIC_GPU: Provides a single worker instance with a GPU. 
  • CUSTOM: This tier allows for the setting of custom values for the master machine type, worker count, parameter server count, and the parameter server type. When the CUSTOM scale tier is selected, the master machine type must be specified in TrainingInput, and the worker and parameter server fields are configured as required. 
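When the CUSTOM tier is used, the machine configuration is spelled out directly in TrainingInput. The following sketch illustrates this; the machine type names shown (complex_model_m, standard_gpu, large_model) are examples of AI Platform machine types, and the bucket path and module name are hypothetical:

{
  "scaleTier": "CUSTOM",
  "masterType": "complex_model_m",
  "workerType": "standard_gpu",
  "workerCount": "4",
  "parameterServerType": "large_model",
  "parameterServerCount": "2",
  "packageUris": ["gs://my-bucket/packages/trainer-0.1.tar.gz"],
  "pythonModule": "trainer.task",
  "region": "us-central1"
}

Here, four GPU-backed workers handle the computation while two large-memory parameter servers hold the shared model state.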

Apart from the scale tier, we need to carefully select the hyperparameter values to further optimize training performance.
