This feature is currently only available in the v4 API using INF2 or TRN1.

When deploying your model on Cerebrium, setting up multi-GPU inference is as simple as specifying the gpu_count parameter. If you are using Hugging Face Transformers, you then just need to set device_map to "auto" and the model will be automatically distributed across all available GPUs.
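
As a rough sketch of what that looks like with Hugging Face Transformers (the model name below is only a placeholder, and gpu_count is assumed to be set in your deployment config):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID - substitute the model you are deploying
model_id = "meta-llama/Llama-2-13b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" spreads the layers across all GPUs made available
# by the gpu_count setting
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")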

While you can select any type of GPU for your multi-GPU deployment, if you are looking for the best performance possible, we recommend using A100 GPUs as these are connected with NVLink. This means the GPUs can communicate with each other much faster than on GPU types without NVLink.

You can look at an example of a multi-GPU deployment here.

Speeding up inference

Larger models are, by nature, slower: more parameters mean more calculations for every forward pass. However, there are a few ways to speed up inference.

The first is to use mixed precision. This is a technique that uses half-precision floating point numbers (float16) instead of the standard float32. Since the weights take up half the memory, the model is smaller and faster to run.
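
For example (a minimal sketch that reuses the placeholder model name from above), you can load the weights directly in float16 by passing torch_dtype to from_pretrained:

import torch
from transformers import AutoModelForCausalLM

# Same placeholder model as before, now loaded with half-precision weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)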

If you are using Hugging Face Transformers, you can use the Accelerate library to prepare your model after loading the weights. This is done with:

from accelerate import Accelerator

# Enable float16 mixed precision and wrap the already-loaded model
accelerator = Accelerator(mixed_precision="fp16")
model = accelerator.prepare(model)
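
As a usage sketch (assuming a model and tokenizer have already been loaded and prepared as shown, and with a purely illustrative prompt), inference can then be run inside the Accelerator's autocast context so the forward pass executes in float16:

inputs = tokenizer("Hello, world", return_tensors="pt").to(accelerator.device)

# Run generation under float16 autocast
with accelerator.autocast():
    outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))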

Most Transformers models also allow you to use BetterTransformer to speed up inference. This is done with: