The configuration is organized into the following main sections:

  • [cerebrium.deployment] Core settings like app name, Python version, and file inclusion rules
  • [cerebrium.runtime.custom] Custom web server settings and app startup behavior
  • [cerebrium.hardware] Compute resources including CPU, memory, and GPU specifications
  • [cerebrium.scaling] Auto-scaling behavior and replica management
  • [cerebrium.dependencies] Package management for Python (pip), system (apt), and Conda dependencies

Deployment Configuration

The [cerebrium.deployment] section defines core deployment settings.

Option                  Type      Default                  Description
name                    string    required                 Desired app name
python_version          string    "3.12"                   Python version to use (3.10, 3.11, 3.12)
disable_auth            boolean   false                    Disable default token-based authentication on app endpoints
include                 string[]  ["*"]                    Files/patterns to include in deployment
exclude                 string[]  [".*"]                   Files/patterns to exclude from deployment
shell_commands          string[]  []                       Commands to run at the end of the build
pre_build_commands      string[]  []                       Commands to run before dependencies install
docker_base_image_url   string    "debian:bookworm-slim"   Base Docker image

Changes to python_version or docker_base_image_url trigger full rebuilds since they affect the base environment.
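
For example, a minimal sketch of a deployment section that customizes the build (the app name and commands here are illustrative, not prescribed values):

[cerebrium.deployment]
name = "my-app"
python_version = "3.12"
docker_base_image_url = "debian:bookworm-slim"
pre_build_commands = ["apt-get update"]                           # runs before dependencies install
shell_commands = ["python -c 'import sys; print(sys.version)'"]  # runs at the end of the build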

Runtime Configuration

The [cerebrium.runtime.custom] section configures custom web servers and runtime behavior.

Option                 Type      Default    Description
port                   integer   required   Port the application listens on
entrypoint             string[]  required   Command to start the application
healthcheck_endpoint   string    ""         HTTP path for health checks (empty uses a TCP check). A failed check restarts the instance
readycheck_endpoint    string    ""         HTTP path for readiness checks (empty uses a TCP check). A failed check stops the load balancer from routing to the instance

The port specified in entrypoint must match the port option. All endpoints are available at https://api.cortex.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint
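
For example, a sketch of a uvicorn-based server listening on port 8080 (the module name and endpoint paths are illustrative):

[cerebrium.runtime.custom]
port = 8080
# The port passed to the server must match the `port` option above.
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
healthcheck_endpoint = "/health"   # instance restarts if this check fails
readycheck_endpoint = "/ready"     # instance receives no traffic until this check passes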

Hardware Configuration

The [cerebrium.hardware] section defines compute resources.

Option      Type      Default       Description
cpu         float     required      Number of CPU cores
memory      float     required      Memory allocation in GB
compute     string    "CPU"         Compute type (CPU, AMPERE_A10, etc.)
gpu_count   integer   0             Number of GPUs
provider    string    "aws"         Cloud provider
region      string    "us-east-1"   Deployment region

Memory refers to RAM, not GPU VRAM. Ensure sufficient memory for your workload.
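
For example, a CPU-only sketch that spells out the documented defaults for provider and region (the resource values are illustrative):

[cerebrium.hardware]
cpu = 2.0
memory = 8.0            # RAM in GB, not GPU VRAM
compute = "CPU"         # no GPU, so gpu_count stays at its default of 0
provider = "aws"
region = "us-east-1"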

Scaling Configuration

The [cerebrium.scaling] section controls auto-scaling behavior.

Option                      Type      Default                     Description
min_replicas                integer   0                           Minimum running instances
max_replicas                integer   2                           Maximum running instances
replica_concurrency         integer   10                          Concurrent requests per replica
response_grace_period       integer   3600                        Seconds in-flight requests are given to complete before an instance is terminated
cooldown                    integer   1800                        Seconds to wait before scaling down an idle container
scaling_metric              string    "concurrency_utilization"   Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization)
scaling_target              integer   100                         Target value for the scaling metric (a percentage for utilization metrics, an absolute value for requests_per_second)
scaling_buffer              integer   optional                    Additional replica capacity above what the scaling metric suggests
roll_out_duration_seconds   integer   0                           Seconds over which traffic gradually shifts to a new revision after a successful build (max 600). Keep at 0 during development

Setting min_replicas > 0 maintains warm instances for immediate response but increases costs.

The scaling_metric options are:

  • concurrency_utilization: Maintains a percentage of your replica_concurrency across instances. For example, with replica_concurrency=200 and scaling_target=80, maintains 160 requests per instance.
  • requests_per_second: Maintains a specific request rate across all instances. For example, scaling_target=5 maintains 5 requests/s average across instances.
  • cpu_utilization: Maintains CPU usage as a percentage of cerebrium.hardware.cpu. For example, with cpu=2 and scaling_target=80, maintains 80% CPU utilization (1.6 CPUs) per instance.
  • memory_utilization: Maintains RAM usage as a percentage of cerebrium.hardware.memory. For example, with memory=10 and scaling_target=80, maintains 80% memory utilization (8GB) per instance.

The scaling_buffer option is only available with concurrency_utilization and requests_per_second metrics. It ensures extra capacity is maintained above what the scaling metric suggests.

For example, with min_replicas=0 and scaling_buffer=3, the system will maintain 3 replicas as baseline capacity.
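
Putting this together, a sketch that matches the concurrency example above (max_replicas is illustrative):

[cerebrium.scaling]
min_replicas = 0
max_replicas = 10
replica_concurrency = 200
scaling_metric = "concurrency_utilization"
scaling_target = 80    # aim for ~160 concurrent requests per instance (80% of 200)
scaling_buffer = 3     # keep 3 extra replicas beyond what the metric suggests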

Dependencies

Pip Dependencies

The [cerebrium.dependencies.pip] section lists Python package requirements.

[cerebrium.dependencies.pip]
torch = "==2.0.0"      # Exact version
numpy = "latest"       # Latest version
pandas = ">=1.5.0"     # Minimum version

APT Dependencies

The [cerebrium.dependencies.apt] section specifies system packages.

[cerebrium.dependencies.apt]
ffmpeg = "latest"
libopenblas-base = "latest"

Conda Dependencies

The [cerebrium.dependencies.conda] section manages Conda packages.

[cerebrium.dependencies.conda]
cuda = ">=11.7"
cudatoolkit = "11.7"

Dependency Files

The [cerebrium.dependencies.paths] section allows using requirement files.

[cerebrium.dependencies.paths]
pip = "requirements.txt"
apt = "pkglist.txt"
conda = "conda_pkglist.txt"

Complete Example

[cerebrium.deployment]
name = "llm-inference"
python_version = "3.12"
include = ["*"]
exclude = [".*"]

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
healthcheck_endpoint = "/health"
readycheck_endpoint = "/ready"

[cerebrium.hardware]
cpu = 4
memory = 16.0
compute = "AMPERE_A10"
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
replica_concurrency = 10
cooldown = 1800

[cerebrium.dependencies.pip]
torch = "latest"
transformers = "latest"
uvicorn = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"