The configuration is organized into the following main sections:

  • [cerebrium.deployment] Core settings like app name, Python version, and file inclusion rules
  • [cerebrium.runtime.custom] Custom web server settings and app startup behavior
  • [cerebrium.hardware] Compute resources including CPU, memory, and GPU specifications
  • [cerebrium.scaling] Auto-scaling behavior and replica management
  • [cerebrium.dependencies] Package management for Python (pip), system (apt), and Conda dependencies

Deployment Configuration

The [cerebrium.deployment] section defines core deployment settings.

Option                  Type      Default                  Description
name                    string    required                 Desired app name
python_version          string    "3.12"                   Python version to use (3.10, 3.11, 3.12)
disable_auth            boolean   false                    Disable default token-based authentication on app endpoints
include                 string[]  ["*"]                    Files/patterns to include in deployment
exclude                 string[]  [".*"]                   Files/patterns to exclude from deployment
shell_commands          string[]  []                       Commands to run at the end of the build
pre_build_commands      string[]  []                       Commands to run before dependencies install
docker_base_image_url   string    "debian:bookworm-slim"   Base Docker image

Changes to python_version or docker_base_image_url trigger full rebuilds since they affect the base environment.
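
For example, a minimal sketch of a deployment section that customizes the build (the app name and commands here are illustrative, not prescribed values):

[cerebrium.deployment]
name = "my-app"
python_version = "3.12"
docker_base_image_url = "debian:bookworm-slim"
pre_build_commands = ["apt-get update"]                           # runs before dependencies install
shell_commands = ["python -c 'import sys; print(sys.version)'"]  # runs at the end of the build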

Runtime Configuration

The [cerebrium.runtime.custom] section configures custom web servers and runtime behavior.

Option                 Type      Default    Description
port                   integer   required   Port the application listens on
entrypoint             string[]  required   Command to start the application
healthcheck_endpoint   string    ""         HTTP path for health checks (empty uses a TCP check). A failed check restarts the instance
readycheck_endpoint    string    ""         HTTP path for readiness checks (empty uses a TCP check). A failed check stops the load balancer from routing to the instance

The port specified in entrypoint must match the port option. All endpoints are available at https://api.cortex.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint
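
For example, a sketch of a uvicorn-based server listening on port 8080 (the module name and endpoint paths are illustrative):

[cerebrium.runtime.custom]
port = 8080
# The port passed to the server must match the `port` option above.
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
healthcheck_endpoint = "/health"   # instance restarts if this check fails
readycheck_endpoint = "/ready"     # instance receives no traffic until this check passes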

Hardware Configuration

The [cerebrium.hardware] section defines compute resources.

Option      Type      Default       Description
cpu         float     required      Number of CPU cores
memory      float     required      Memory allocation in GB
compute     string    "CPU"         Compute type (CPU, AMPERE_A10, etc.)
gpu_count   integer   0             Number of GPUs
provider    string    "aws"         Cloud provider
region      string    "us-east-1"   Deployment region

Memory refers to RAM, not GPU VRAM. Ensure sufficient memory for your workload.
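
For example, a CPU-only sketch that spells out the documented defaults for provider and region (the resource values are illustrative):

[cerebrium.hardware]
cpu = 2.0
memory = 8.0            # RAM in GB, not GPU VRAM
compute = "CPU"         # no GPU, so gpu_count stays at its default of 0
provider = "aws"
region = "us-east-1"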

Scaling Configuration

The [cerebrium.scaling] section controls auto-scaling behavior.

Option                      Type      Default                     Description
min_replicas                integer   0                           Minimum running instances
max_replicas                integer   2                           Maximum running instances
replica_concurrency         integer   10                          Concurrent requests per replica
response_grace_period       integer   3600                        Seconds in-flight requests are given to complete before an instance is terminated
cooldown                    integer   1800                        Seconds to wait before scaling down an idle container
scaling_metric              string    "concurrency_utilization"   Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization)
scaling_target              integer   100                         Target value for the scaling metric (a percentage for utilization metrics, an absolute value for requests_per_second)
scaling_buffer              integer   optional                    Additional replica capacity above what the scaling metric suggests
roll_out_duration_seconds   integer   0                           Seconds over which traffic gradually shifts to a new revision after a successful build (max 600). Keep at 0 during development

Setting min_replicas > 0 maintains warm instances for immediate response but increases costs.

The scaling_metric options are:

  • concurrency_utilization: Maintains a percentage of your replica_concurrency across instances. For example, with replica_concurrency=200 and scaling_target=80, maintains 160 requests per instance.
  • requests_per_second: Maintains a specific request rate across all instances. For example, scaling_target=5 maintains 5 requests/s average across instances.
  • cpu_utilization: Maintains CPU usage as a percentage of cerebrium.hardware.cpu. For example, with cpu=2 and scaling_target=80, maintains 80% CPU utilization (1.6 CPUs) per instance.
  • memory_utilization: Maintains RAM usage as a percentage of cerebrium.hardware.memory. For example, with memory=10 and scaling_target=80, maintains 80% memory utilization (8GB) per instance.

The scaling_buffer option is only available with concurrency_utilization and requests_per_second metrics. It ensures extra capacity is maintained above what the scaling metric suggests.

For example, with min_replicas=0 and scaling_buffer=3, the system will maintain 3 replicas as baseline capacity.
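
Putting this together, a sketch that matches the concurrency example above (max_replicas is illustrative):

[cerebrium.scaling]
min_replicas = 0
max_replicas = 10
replica_concurrency = 200
scaling_metric = "concurrency_utilization"
scaling_target = 80    # aim for ~160 concurrent requests per instance (80% of 200)
scaling_buffer = 3     # keep 3 extra replicas beyond what the metric suggests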

Dependencies

Pip Dependencies

The [cerebrium.dependencies.pip] section lists Python package requirements.

[cerebrium.dependencies.pip]
torch = "==2.0.0"      # Exact version
numpy = "latest"       # Latest version
pandas = ">=1.5.0"     # Minimum version

APT Dependencies

The [cerebrium.dependencies.apt] section specifies system packages.

[cerebrium.dependencies.apt]
ffmpeg = "latest"
libopenblas-base = "latest"

Conda Dependencies

The [cerebrium.dependencies.conda] section manages Conda packages.

[cerebrium.dependencies.conda]
cuda = ">=11.7"
cudatoolkit = "11.7"

Dependency Files

The [cerebrium.dependencies.paths] section allows using requirement files.

[cerebrium.dependencies.paths]
pip = "requirements.txt"
apt = "pkglist.txt"
conda = "conda_pkglist.txt"

Complete Example

[cerebrium.deployment]
name = "llm-inference"
python_version = "3.12"
include = ["*"]
exclude = [".*"]

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/health"
readycheck_endpoint = "/ready"

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
replica_concurrency = 10
cooldown = 1800

[cerebrium.dependencies.pip]
torch = "latest"
transformers = "latest"
uvicorn = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"