The GPU is capable of executing kernels from concurrent CPU processes. By default, however, kernels from different processes are only executed in a time-sliced manner, even when no single kernel fully utilizes the GPU's compute resources. To remove this unnecessary serialization, the GPU provides Multi-Process Service (MPS) mode. This enables different processes to execute their kernels simultaneously on a GPU so that its resources are fully utilized. When MPS is enabled, the nvidia-cuda-mps-control daemon monitors the target GPU and manages the kernel operations of the processes using that GPU. This feature is only available on Linux. Here, we can see MPS in action, with multiple processes sharing the same GPU:
As we can see, each process has a part that runs on the GPU (green bars) and a part that runs on the CPU (blue bars). Ideally, the GPU work of one process should overlap with the CPU work of another, so that neither the blue nor the green bars leave the GPU idle. This is made possible by the MPS feature, which is supported on recent GPUs (compute capability 3.5 and higher).
The good thing about this is that no changes need to be made to the application to make use of MPS. The MPS process runs as a daemon, which is started with the following commands:
$ nvidia-smi -c EXCLUSIVE_PROCESS
$ nvidia-cuda-mps-control -d
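When you are finished with MPS, the daemon can be shut down and the compute mode restored. The following is a typical sequence (root privileges are usually required, and the daemon must be stopped by the same user that started it):

$ echo quit | nvidia-cuda-mps-control
$ nvidia-smi -c DEFAULT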
After running these commands, all processes submit their CUDA commands to the MPS daemon, which takes care of submitting them to the GPU. From the GPU's perspective, there is only one process accessing it (the MPS daemon), and hence kernels from multiple processes can run concurrently. This also allows memory copies issued by one process to overlap with kernel executions from other MPI processes.
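To make the scenario concrete, the following is a minimal, hypothetical sketch of the kind of program whose kernels benefit from MPS: each MPI rank (or plain process) runs the same binary, and its single small grid leaves most of the GPU idle. The file name `mps_demo.cu` and the kernel `scale` are illustrative, not from the original text.

```cuda
// mps_demo.cu -- hypothetical sketch. Each concurrent process runs this
// program; with the MPS daemon active, their kernels can share the GPU
// instead of being time-sliced.
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately small kernel that underutilizes the GPU, leaving room
// for concurrent kernels from other processes to run alongside it.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // Without MPS, this launch from several processes is serialized by
    // time-slicing; with MPS, the launches can overlap on the GPU.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    printf("rank finished\n");
    return 0;
}
```

Launching several instances at once, for example with `mpirun -np 4 ./mps_demo`, and comparing profiler timelines with the MPS daemon running versus stopped, makes the concurrency (or lack of it) directly visible.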