The configuration is organized into the following main sections:
  • [cerebrium.deployment] Core settings like app name, Python version, and file inclusion rules
  • [cerebrium.runtime.custom] Custom web server settings and app startup behavior
  • [cerebrium.hardware] Compute resources including CPU, memory, and GPU specifications
  • [cerebrium.scaling] Auto-scaling behavior and replica management
  • [cerebrium.dependencies] Package management for Python (pip), system (apt), and Conda dependencies

Deployment Configuration

The [cerebrium.deployment] section defines core deployment settings.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | required | Desired app name |
| `python_version` | string | `"3.12"` | Python version to use (3.10, 3.11, 3.12) |
| `disable_auth` | boolean | `false` | Disable default token-based authentication on app endpoints |
| `include` | string[] | `["*"]` | Files/patterns to include in deployment |
| `exclude` | string[] | `[".*"]` | Files/patterns to exclude from deployment |
| `shell_commands` | string[] | `[]` | Commands to run at the end of the build |
| `pre_build_commands` | string[] | `[]` | Commands to run before dependencies install |
| `docker_base_image_url` | string | `"debian:bookworm-slim"` | Base Docker image |
| `use_uv` | boolean | `false` | Use UV for faster Python package installation |
| `deployment_initialization_timeout` | integer | `600` (10 minutes) | Maximum time in seconds to wait for app initialization during build before timing out. Value must be between 60 and 830 |
Changes to python_version or docker_base_image_url trigger full rebuilds since they affect the base environment.
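For instance, a minimal `[cerebrium.deployment]` block might look like the following (the app name and the `tests/*` exclusion pattern are placeholders):

```toml
[cerebrium.deployment]
name = "my-app"             # placeholder app name
python_version = "3.12"
include = ["*"]             # deploy everything...
exclude = [".*", "tests/*"] # ...except hidden files and, e.g., tests
use_uv = true               # optional: faster installs via UV
```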

UV Package Manager

UV is a fast Python package installer written in Rust that can significantly speed up deployment times. When enabled, UV will be used instead of pip for installing Python dependencies.
UV typically installs packages 10-100x faster than pip, which is especially beneficial for:
  • Large dependency trees
  • Multiple packages
  • Clean builds without cache
Example with UV enabled:
[cerebrium.deployment]
use_uv = true

Monitoring UV Usage

Check your build logs for these indicators:
  • UV_PIP_INSTALL_STARTED - UV is successfully being used
  • PIP_INSTALL_STARTED - Standard pip installation (when use_uv is false)
While UV is compatible with most packages, some edge cases may cause build failures, such as legacy packages with non-standard metadata.

Deploying with UV Lock Files

Read this section only if you're using `pyproject.toml` and `uv.lock`.
1. Generate your lock file locally. This creates a `uv.lock` file with exact dependency versions.
# In your project directory with pyproject.toml
uv sync
2. Export your locked dependencies to `requirements.txt`:
uv pip compile pyproject.toml -o requirements.txt
# Or if you want to use the lock file:
uv pip compile uv.lock -o requirements.txt
3. Include in your deployment:
  • Ensure `requirements.txt` is in your project directory
  • Deploy with UV enabled
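The deployment side of the steps above can be sketched in config as follows, assuming the exported `requirements.txt` sits at your project root:

```toml
[cerebrium.deployment]
use_uv = true              # install the exported requirements with UV

[cerebrium.dependencies.paths]
pip = "requirements.txt"   # the file exported from your uv.lock
```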

Runtime Configuration

The [cerebrium.runtime.custom] section configures custom web servers and runtime behavior.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `port` | integer | required | Port the application listens on |
| `entrypoint` | string[] | required | Command to start the application |
| `healthcheck_endpoint` | string | `""` | HTTP path for health checks (empty uses TCP). Failure causes the instance to restart |
| `readycheck_endpoint` | string | `""` | HTTP path for readiness checks (empty uses TCP). Failure stops the load balancer from routing to the instance |
The port specified in entrypoint must match the port parameter. All endpoints will be available at `https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint`
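As a sketch, a FastAPI app served by uvicorn on port 8000 (the entrypoint and endpoint paths are illustrative) could be configured as:

```toml
[cerebrium.runtime.custom]
port = 8000  # must match the port in entrypoint
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/health"  # failing health checks restarts the instance
readycheck_endpoint = "/ready"    # failing ready checks removes it from routing
```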

Hardware Configuration

The [cerebrium.hardware] section defines compute resources.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `cpu` | float | required | Number of CPU cores |
| `memory` | float | required | Memory allocation in GB |
| `compute` | string | `"CPU"` | Compute type (CPU, AMPERE_A10, etc.) |
| `gpu_count` | integer | `0` | Number of GPUs |
| `provider` | string | `"aws"` | Cloud provider |
| `region` | string | `"us-east-1"` | Deployment region |
Memory refers to RAM, not GPU VRAM. Ensure sufficient memory for your workload.
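For example, a GPU workload on an A10 (the resource values here are illustrative, not recommendations) could request:

```toml
[cerebrium.hardware]
cpu = 4                 # CPU cores
memory = 16.0           # RAM in GB (not GPU VRAM)
compute = "AMPERE_A10"
gpu_count = 1
provider = "aws"
region = "us-east-1"
```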

Scaling Configuration

The [cerebrium.scaling] section controls auto-scaling behavior.
| Option | Type | Default | CLI Requirement | Description |
| --- | --- | --- | --- | --- |
| `min_replicas` | integer | `0` | 2.1.2+ | Minimum running instances |
| `max_replicas` | integer | `2` | 2.1.2+ | Maximum running instances |
| `replica_concurrency` | integer | `10` | 2.1.2+ | Concurrent requests per replica |
| `response_grace_period` | integer | `3600` | 2.1.2+ | Grace period in seconds |
| `cooldown` | integer | `1800` | 2.1.2+ | Time window (seconds) that must pass at reduced concurrency before scaling down. Helps avoid cold starts from brief traffic dips |
| `scaling_metric` | string | `"concurrency_utilization"` | 2.1.2+ | Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization) |
| `scaling_target` | integer | `100` | 2.1.2+ | Target value for scaling metric (percentage for utilization metrics, absolute value for requests_per_second) |
| `scaling_buffer` | integer | optional | 2.1.2+ | Additional replica capacity above what the scaling metric suggests |
| `evaluation_interval` | integer | `30` | 2.1.5+ | Time window in seconds over which metrics are evaluated before scaling decisions (6-300s) |
| `load_balancing` | string | `""` | 2.1.5+ | Algorithm for distributing traffic across replicas. Default: round-robin if replica_concurrency > 3, first-available otherwise. Options: round-robin, first-available, min-connections, random-choice-2 |
| `roll_out_duration_seconds` | integer | `0` | 2.1.2+ | Gradually send traffic to a new revision after a successful build. Max 600s. Keep at 0 during development |
Setting min_replicas > 0 maintains warm instances for immediate response but increases costs.
The scaling_metric options are:
  • concurrency_utilization: Maintains a percentage of your replica_concurrency across instances. For example, with replica_concurrency=200 and scaling_target=80, maintains 160 requests per instance.
  • requests_per_second: Maintains a specific request rate across all instances. For example, scaling_target=5 maintains 5 requests/s average across instances.
  • cpu_utilization: Maintains CPU usage as a percentage of cerebrium.hardware.cpu. For example, with cpu=2 and scaling_target=80, maintains 80% CPU utilization (1.6 CPUs) per instance.
  • memory_utilization: Maintains RAM usage as a percentage of cerebrium.hardware.memory. For example, with memory=10 and scaling_target=80, maintains 80% memory utilization (8GB) per instance.
The scaling_buffer option is only available with the concurrency_utilization and requests_per_second metrics. It ensures extra capacity is maintained above what the scaling metric suggests. For example, with min_replicas=0 and scaling_buffer=3, the system will maintain 3 replicas as baseline capacity.
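Putting the metric and buffer together, a sketch with illustrative values: with `replica_concurrency = 200` and `scaling_target = 80`, each replica is held near 160 in-flight requests (80% of 200), while `scaling_buffer = 3` keeps three replicas of headroom even though `min_replicas = 0`:

```toml
[cerebrium.scaling]
min_replicas = 0
max_replicas = 10          # illustrative upper bound
replica_concurrency = 200
scaling_metric = "concurrency_utilization"
scaling_target = 80        # target 80% of 200 = 160 requests per replica
scaling_buffer = 3         # keep 3 replicas above what the metric suggests
```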

Dependencies

Pip Dependencies

The [cerebrium.dependencies.pip] section lists Python package requirements.
[cerebrium.dependencies.pip]
torch = "==2.0.0"      # Exact version
numpy = "latest"       # Latest version
pandas = ">=1.5.0"     # Minimum version

APT Dependencies

The [cerebrium.dependencies.apt] section specifies system packages.
[cerebrium.dependencies.apt]
ffmpeg = "latest"
libopenblas-base = "latest"

Conda Dependencies

The [cerebrium.dependencies.conda] section manages Conda packages.
[cerebrium.dependencies.conda]
cuda = ">=11.7"
cudatoolkit = "11.7"

Dependency Files

The [cerebrium.dependencies.paths] section allows using requirement files.
[cerebrium.dependencies.paths]
pip = "requirements.txt"
apt = "pkglist.txt"
conda = "conda_pkglist.txt"

Complete Example

[cerebrium.deployment]
name = "llm-inference"
python_version = "3.12"
disable_auth = false
include = ["*"]
exclude = [".*"]
shell_commands = []
pre_build_commands = []
docker_base_image_url = "debian:bookworm-slim"
use_uv = true  # Enable fast package installation with UV (omit or set to false to use pip)

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/health"
readycheck_endpoint = "/ready"

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"
gpu_count = 1
provider = "aws"
region = "us-east-1"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
replica_concurrency = 10
response_grace_period = 3600
cooldown = 1800
scaling_metric = "concurrency_utilization"
scaling_target = 100
evaluation_interval = 30
# load_balancing = ""  # Auto-selects based on replica_concurrency
roll_out_duration_seconds = 0

[cerebrium.dependencies.pip]
torch = "latest"
transformers = "latest"
uvicorn = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"

[cerebrium.dependencies.conda]
# Optional conda dependencies

[cerebrium.dependencies.paths]
# Optional paths to dependency files
# pip = "requirements.txt"
# apt = "pkglist.txt"
# conda = "conda_pkglist.txt"