GPUs provide specialized computing power that dramatically accelerates computational workloads. While originally designed for graphics rendering, modern GPUs excel at parallel processing tasks, making them essential for a wide range of applications.

Applications deployed on Cerebrium can access GPU computing power without managing complex infrastructure. The platform supports sophisticated AI models, large-scale data processing, and GPU-accelerated applications through simple configuration.

Specifying GPUs

GPU configuration in cerebrium.toml is handled through the [cerebrium.hardware] section, where you specify both the GPU type (the compute parameter) and the number of GPUs (gpu_count) for your app. Additional deployment configurations and GPU scaling considerations are covered in the sections below.

Available GPUs

The platform offers a range of GPUs to match various computational needs and budgets, from cost-effective development options to high-end enterprise hardware.

| GPU Model        | Identifier       | VRAM (GB) | Max GPUs | Plan required | Provider |
| ---------------- | ---------------- | --------- | -------- | ------------- | -------- |
| NVIDIA H100      | HOPPER_H100      | 80        | 8        | Enterprise    | AWS      |
| NVIDIA A100      | AMPERE_A100_80GB | 80        | 8        | Enterprise    | AWS      |
| NVIDIA A100      | AMPERE_A100_40GB | 40        | 8        | Enterprise    | AWS      |
| NVIDIA L40s      | ADA_L40          | 48        | 8        | Hobby+        | AWS      |
| NVIDIA L4        | ADA_L4           | 24        | 8        | Hobby+        | AWS      |
| NVIDIA A10       | AMPERE_A10       | 24        | 8        | Hobby+        | AWS      |
| NVIDIA T4        | TURING_T4        | 16        | 8        | Hobby+        | AWS      |
| AWS Inferentia 2 | INF2             | 32        | 8        | Hobby+        | AWS      |
| AWS Trainium     | TRN1             | 32        | 8        | Hobby+        | AWS      |

The identifier is the value used in the cerebrium.toml file. It combines the GPU's architecture generation and model name to avoid ambiguity.

A GPU can also be selected with the --compute and --gpu-count flags during application initialization, as shown below.
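For example, initializing an app with two L4 GPUs might look like the following (the app name is a placeholder, and the exact init command is an assumption based on the standard Cerebrium CLI; the flags are as described above):

cerebrium init my-app --compute ADA_L4 --gpu-count 2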

Multi-GPU Configuration

Multiple GPUs can enhance application performance through parallel processing and increased memory capacity.

Use Cases

Multiple GPUs become essential when:

  • Models exceed single GPU memory capacity
  • Workloads require parallel inference processing
  • Applications need distributed training capabilities
  • Production environments demand high availability

Configuration

Multiple GPUs are configured in the cerebrium.toml file:

[cerebrium.hardware]
compute = "AMPERE_A100_80GB"  # GPU identifier from the table above
gpu_count = 4                 # Number of GPUs per instance
cpu = 8                       # vCPU cores
memory = 128.0                # System memory in GB

Selecting the Right GPU

Selecting appropriate hardware requires balancing performance requirements with resource efficiency. GPU selection involves calculating VRAM usage based on model parameters and input requirements. This is particularly important for:

  • LLMs and transformer architectures: Account for attention mechanisms and positional encodings
  • CNNs: Consider filter numbers and input sizes
  • Batch processing: Factor in concurrent processing requirements

VRAM Requirement Calculation

Base VRAM requirements can be calculated using:

modelVRAM = numParams × numBytesPerDataType

Common data types:

  • FP32 (32-bit floating point): 4 bytes
  • FP16 (16-bit floating point): 2 bytes
  • INT8 (8-bit quantization): 1 byte

Example calculations:

  1. 7B parameter model with FP32:
     modelVRAM = 7B × 4 bytes = 28GB
  2. Same model with INT8 quantization:
     modelVRAM = 7B × 1 byte = 7GB
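The base formula is simple enough to script. The sketch below is a minimal Python illustration (the function name is ours, not part of any Cerebrium API); it uses the shorthand that billions of parameters × bytes per parameter gives gigabytes directly:

def model_vram_gb(num_params_billions: float, bytes_per_param: int) -> float:
    # 1B parameters at 1 byte each is ~1 GB, so the units cancel neatly.
    return num_params_billions * bytes_per_param

print(model_vram_gb(7, 4))  # FP32: 28.0 GB
print(model_vram_gb(7, 1))  # INT8: 7.0 GB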

Safety Buffer

A 1.5x VRAM buffer should be added to account for runtime memory requirements, attention mechanisms, intermediate computations, and input variations:

recommendedVRAM = modelVRAM × 1.5

Using the 7B parameter example:

  • FP32: 28GB × 1.5 = 42GB (requires an ADA_L40 or larger)
  • INT8: 7GB × 1.5 = 10.5GB (T4 or higher is sufficient)
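
Combining the buffered estimate with the Available GPUs table, a small helper can suggest the smallest single GPU that fits a model. This is an illustrative sketch, not a Cerebrium API; the VRAM figures are copied from the table above:

# VRAM per identifier (GB), from the Available GPUs table.
GPU_VRAM_GB = {
    "TURING_T4": 16,
    "ADA_L4": 24,
    "AMPERE_A10": 24,
    "INF2": 32,
    "TRN1": 32,
    "AMPERE_A100_40GB": 40,
    "ADA_L40": 48,
    "AMPERE_A100_80GB": 80,
    "HOPPER_H100": 80,
}

def recommend_gpu(num_params_billions: float, bytes_per_param: int,
                  buffer: float = 1.5) -> str:
    # Buffered estimate: modelVRAM × 1.5, as described above.
    needed = num_params_billions * bytes_per_param * buffer
    for gpu, vram in sorted(GPU_VRAM_GB.items(), key=lambda kv: kv[1]):
        if vram >= needed:
            return gpu
    raise ValueError(f"No single GPU fits {needed:.1f} GB; consider gpu_count > 1")

print(recommend_gpu(7, 4))  # FP32 -> ADA_L40 (42 GB needed)
print(recommend_gpu(7, 1))  # INT8 -> TURING_T4 (10.5 GB needed)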

The Pricing Calculator provides detailed cost comparisons for different GPU configurations.

For custom GPU configurations, contact Cerebrium support.