Specifying GPUs
GPU configuration in the `cerebrium.toml` file is handled through the `[cerebrium.hardware]` section, where you can specify both the GPU type (using the `compute` parameter) and the number of GPUs (`gpu_count`) for your app. Additional deployment configurations and GPU scaling considerations are addressed in the sections below.
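For example, a minimal hardware section might look like this (the GPU identifier and count shown are illustrative; other hardware fields are omitted):

```toml
[cerebrium.hardware]
compute = "AMPERE_A10"  # GPU identifier from the table below
gpu_count = 1           # number of GPUs attached to the app
```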
Available GPUs
The platform offers a range of GPUs to match various computational needs and budgets, from cost-effective development options to high-end enterprise hardware.

| GPU Model | Identifier | VRAM (GB) | Max GPUs | Plan required | Provider |
|---|---|---|---|---|---|
| NVIDIA H100 | HOPPER_H100 | 80 | 8 | Enterprise | AWS |
| NVIDIA A100 | AMPERE_A100_80GB | 80 | 8 | Enterprise | AWS |
| NVIDIA A100 | AMPERE_A100_40GB | 40 | 8 | Enterprise | AWS |
| NVIDIA L40s | ADA_L40 | 48 | 8 | Hobby+ | AWS |
| NVIDIA L4 | ADA_L4 | 24 | 8 | Hobby+ | AWS |
| NVIDIA A10 | AMPERE_A10 | 24 | 8 | Hobby+ | AWS |
| NVIDIA T4 | TURING_T4 | 16 | 8 | Hobby+ | AWS |
| AWS Inferentia 2 | INF2 | 32 | 8 | Hobby+ | AWS |
| AWS Trainium | TRN1 | 32 | 8 | Hobby+ | AWS |
The identifier is what you use in the `cerebrium.toml` file. It combines the GPU architecture generation and model name to avoid ambiguity. GPU selection is also possible using the `--compute` and `--gpu-count` flags during application initialization, for example:
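A sketch of the CLI form, assuming the standard `cerebrium init` command (the app name and values are illustrative):

```bash
# Initialize an app with a single NVIDIA A10
cerebrium init my-app --compute AMPERE_A10 --gpu-count 1
```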
Multi-GPU Configuration

Multiple GPUs can enhance application performance through parallel processing and increased memory capacity.

Use Cases
Multiple GPUs become essential when:

- Models exceed single-GPU memory capacity
- Workloads require parallel inference processing
- Applications need distributed training capabilities
- Production environments demand high availability
Configuration
Multiple GPUs are configured in the `cerebrium.toml` file:
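A minimal sketch using the `compute` and `gpu_count` parameters described above (the specific model and count are illustrative):

```toml
[cerebrium.hardware]
compute = "AMPERE_A100_80GB"  # identifier from the table above
gpu_count = 4                 # up to 8 GPUs per app, per the table
```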
Selecting the Right GPU
Selecting appropriate hardware requires balancing performance requirements with resource efficiency. GPU selection involves calculating VRAM usage based on model parameters and input requirements. This is particularly important for:

- LLMs and transformer architectures: Account for attention processes and positional encoding.
- CNNs: Consider filter numbers and input sizes.
- Batch processing: Factor in concurrent processing requirements.
VRAM Requirement Calculation
Base VRAM requirements can be calculated by multiplying the parameter count by the bytes per parameter for the chosen precision:

- FP32 (32-bit floating point): 4 bytes per parameter
- FP16 (16-bit floating point): 2 bytes per parameter
- INT8 (8-bit quantization): 1 byte per parameter

For example:

- 7B parameter model with FP32: 7 × 10⁹ parameters × 4 bytes ≈ 28GB
- Same model with INT8 quantization: 7 × 10⁹ parameters × 1 byte ≈ 7GB
Safety Buffer
A 1.5× VRAM buffer should be added to account for runtime memory requirements, attention mechanisms, intermediate computations, and input variations:

- FP32: 28GB × 1.5 = 42GB (exceeds the 40GB A100; requires an L40s, A100_80GB, or H100)
- INT8: 7GB × 1.5 = 10.5GB (a T4 or higher is sufficient)
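These estimates are easy to script. Below is a minimal sketch in Python (the `estimate_vram_gb` helper is illustrative, not part of any platform API):

```python
# Bytes per parameter for common precisions, as listed above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_vram_gb(num_params: float, precision: str, buffer: float = 1.5) -> float:
    """Estimate VRAM in GB: parameters x bytes-per-param x 1.5 safety buffer."""
    base_bytes = num_params * BYTES_PER_PARAM[precision]
    return base_bytes * buffer / 1e9  # decimal GB, matching the figures above

# 7B-parameter model:
print(estimate_vram_gb(7e9, "fp32"))  # 42.0 -> needs a 48GB+ GPU
print(estimate_vram_gb(7e9, "int8"))  # 10.5 -> a 16GB T4 is sufficient
```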