The configuration is organized into the following main sections (a minimal skeleton follows the list):

  • [cerebrium.deployment] Core settings like app name, Python version, and file inclusion rules
  • [cerebrium.runtime.custom] Custom web server settings and app startup behavior
  • [cerebrium.hardware] Compute resources including CPU, memory, and GPU specifications
  • [cerebrium.scaling] Auto-scaling behavior and replica management
  • [cerebrium.dependencies] Package management for Python (pip), system (apt), and Conda dependencies
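
A minimal cerebrium.toml skeleton showing how these sections fit together (all values here are illustrative, not defaults):

[cerebrium.deployment]
name = "my-app"                        # illustrative app name

[cerebrium.runtime.custom]
port = 8080
entrypoint = ["python", "server.py"]   # illustrative startup command

[cerebrium.hardware]
cpu = 2
memory = 8.0

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2

[cerebrium.dependencies.pip]
requests = "latest"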

Deployment Configuration

The [cerebrium.deployment] section defines core deployment settings.

Option                   Type       Default                  Description
name                     string     required                 Desired app name
python_version           string     "3.12"                   Python version to use (3.10, 3.11, 3.12)
disable_auth             boolean    false                    Disable default token-based authentication on app endpoints
include                  string[]   ["*"]                    Files/patterns to include in deployment
exclude                  string[]   [".*"]                   Files/patterns to exclude from deployment
shell_commands           string[]   []                       Commands to run at the end of the build
pre_build_commands       string[]   []                       Commands to run before dependencies install
docker_base_image_url    string     "debian:bookworm-slim"   Base Docker image

Changes to python_version or docker_base_image_url trigger full rebuilds since they affect the base environment.
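
For instance, a deployment block that pins the base image and runs commands around the dependency install might look like this (the command strings are illustrative):

[cerebrium.deployment]
name = "my-app"
python_version = "3.12"
docker_base_image_url = "debian:bookworm-slim"
include = ["*"]
exclude = [".*"]
pre_build_commands = ["apt-get update"]       # runs before dependencies install
shell_commands = ["echo 'build finished'"]    # runs at the end of the build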

Runtime Configuration

The [cerebrium.runtime.custom] section configures custom web servers and runtime behavior.

Option                 Type       Default    Description
port                   integer    required   Port the application listens on
entrypoint             string[]   required   Command to start the application
healthcheck_endpoint   string     ""         HTTP path for health checks (empty uses TCP)

The port specified in entrypoint must match the port parameter. All endpoints will be available at https://api.cortex.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint
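
Putting that together, a uvicorn-served FastAPI app might be configured as below; note that the port in entrypoint matches the port field (the main:app module path is an assumption about your project layout):

[cerebrium.runtime.custom]
port = 8080
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
healthcheck_endpoint = "/health"   # leave "" to fall back to TCP health checks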

Hardware Configuration

The [cerebrium.hardware] section defines compute resources.

Option      Type      Default       Description
cpu         float     required      Number of CPU cores
memory      float     required      Memory allocation in GB
compute     string    "CPU"         Compute type (CPU, AMPERE_A10, etc.)
gpu_count   integer   0             Number of GPUs
provider    string    "aws"         Cloud provider
region      string    "us-east-1"   Deployment region

Memory refers to RAM, not GPU VRAM. Ensure sufficient memory for your workload.
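
For example, a single-GPU inference workload might request the following (values illustrative):

[cerebrium.hardware]
cpu = 4
memory = 16.0            # RAM in GB, not GPU VRAM
compute = "AMPERE_A10"
gpu_count = 1
provider = "aws"
region = "us-east-1"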

Scaling Configuration

The [cerebrium.scaling] section controls auto-scaling behavior.

Option                      Type      Default                     Description
min_replicas                integer   0                           Minimum running instances
max_replicas                integer   2                           Maximum running instances
replica_concurrency         integer   10                          Concurrent requests per replica
response_grace_period       integer   3600                        Grace period in seconds
cooldown                    integer   1800                        Time in seconds to wait before scaling down an idle container
scaling_metric              string    "concurrency_utilization"   Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization)
scaling_target              integer   100                         Target value for the scaling metric (percentage for utilization metrics, absolute value for requests_per_second)
scaling_buffer              integer   optional                    Additional replica capacity above what the scaling metric suggests
roll_out_duration_seconds   integer   0                           Gradually shift traffic to a new revision after a successful build (max 600s; keep at 0 during development)

Setting min_replicas > 0 maintains warm instances for immediate response but increases costs.

The scaling_metric options are:

  • concurrency_utilization: Keeps per-instance concurrency at a target percentage of replica_concurrency. For example, with replica_concurrency=200 and scaling_target=80, the autoscaler maintains about 160 concurrent requests per instance.
  • requests_per_second: Maintains a specific request rate across all instances. For example, scaling_target=5 maintains an average of 5 requests/s across instances.
  • cpu_utilization: Maintains CPU usage as a percentage of cerebrium.hardware.cpu. For example, with cpu=2 and scaling_target=80, maintains 80% CPU utilization (1.6 CPUs) per instance.
  • memory_utilization: Maintains RAM usage as a percentage of cerebrium.hardware.memory. For example, with memory=10 and scaling_target=80, maintains 80% memory utilization (8GB) per instance.

The scaling_buffer option is only available with concurrency_utilization and requests_per_second metrics. It ensures extra capacity is maintained above what the scaling metric suggests.

For example, with min_replicas=0 and scaling_buffer=3, the system will maintain 3 replicas as baseline capacity.
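
Expressed in configuration, the concurrency example above combined with a buffer might look like this (max_replicas is illustrative):

[cerebrium.scaling]
min_replicas = 0
max_replicas = 10
replica_concurrency = 200
scaling_metric = "concurrency_utilization"
scaling_target = 80     # aim for ~160 concurrent requests per replica
scaling_buffer = 3      # keep 3 replicas of headroom beyond what the metric suggests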

Dependencies

Pip Dependencies

The [cerebrium.dependencies.pip] section lists Python package requirements.

[cerebrium.dependencies.pip]
torch = "==2.0.0"      # Exact version
numpy = "latest"       # Latest version
pandas = ">=1.5.0"     # Minimum version

APT Dependencies

The [cerebrium.dependencies.apt] section specifies system packages.

[cerebrium.dependencies.apt]
ffmpeg = "latest"
libopenblas-base = "latest"

Conda Dependencies

The [cerebrium.dependencies.conda] section manages Conda packages.

[cerebrium.dependencies.conda]
cuda = ">=11.7"
cudatoolkit = "11.7"

Dependency Files

The [cerebrium.dependencies.paths] section allows using requirement files.

[cerebrium.dependencies.paths]
pip = "requirements.txt"
apt = "pkglist.txt"
conda = "conda_pkglist.txt"

Complete Example

[cerebrium.deployment]
name = "llm-inference"
python_version = "3.12"
include = ["*"]
exclude = [".*"]

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/health"

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
replica_concurrency = 10
cooldown = 1800

[cerebrium.dependencies.pip]
torch = "latest"
transformers = "latest"
uvicorn = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"