TOML Reference
Complete reference for all parameters available in Cerebrium's default cerebrium.toml configuration file.
The configuration is organized into the following main sections:
- [cerebrium.deployment]: Core settings like app name, Python version, and file inclusion rules
- [cerebrium.runtime.custom]: Custom web server settings and app startup behavior
- [cerebrium.hardware]: Compute resources including CPU, memory, and GPU specifications
- [cerebrium.scaling]: Auto-scaling behavior and replica management
- [cerebrium.dependencies]: Package management for Python (pip), system (apt), and Conda dependencies
Deployment Configuration
The [cerebrium.deployment] section defines core deployment settings.
Option | Type | Default | Description |
---|---|---|---|
name | string | required | Desired app name |
python_version | string | "3.12" | Python version to use (3.10, 3.11, 3.12) |
disable_auth | boolean | false | Disable default token-based authentication on app endpoints |
include | string[] | ["*"] | Files/patterns to include in deployment |
exclude | string[] | [".*"] | Files/patterns to exclude from deployment |
shell_commands | string[] | [] | Commands to run at the end of the build |
pre_build_commands | string[] | [] | Commands to run before dependencies install |
docker_base_image_url | string | "debian:bookworm-slim" | Base Docker image |
Changes to python_version or docker_base_image_url trigger full rebuilds since they affect the base environment.
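As an illustration, the options in the table above could be combined as follows. This is a minimal sketch: the app name `my-app` is a hypothetical placeholder, and every other value simply repeats the documented default.

```toml
[cerebrium.deployment]
name = "my-app"                  # hypothetical app name; this field is required
python_version = "3.12"          # one of 3.10, 3.11, 3.12
disable_auth = false             # keep token-based authentication enabled
include = ["*"]                  # include every file by default
exclude = [".*"]                 # skip dotfiles such as .git or .env
pre_build_commands = []          # run before dependencies install
shell_commands = []              # run at the end of the build
docker_base_image_url = "debian:bookworm-slim"
```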
Runtime Configuration
The [cerebrium.runtime.custom] section configures custom web servers and runtime behavior.
Option | Type | Default | Description |
---|---|---|---|
port | integer | required | Port the application listens on |
entrypoint | string[] | required | Command to start the application |
healthcheck_endpoint | string | "" | HTTP path for health checks (empty uses TCP) |
The port specified in entrypoint must match the port parameter. All endpoints will be available at https://api.cortex.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint
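The following sketch shows one way the port rule might look in practice. It assumes a hypothetical ASGI app exposed as `main:app` and served with uvicorn, plus a hypothetical `/health` route; none of these names come from the reference itself.

```toml
[cerebrium.runtime.custom]
port = 8000
# The --port value in the command must match the port option above.
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# Hypothetical health route; leave as "" to fall back to a TCP check.
healthcheck_endpoint = "/health"
```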
Hardware Configuration
The [cerebrium.hardware] section defines compute resources.
Option | Type | Default | Description |
---|---|---|---|
cpu | float | required | Number of CPU cores |
memory | float | required | Memory allocation in GB |
compute | string | "CPU" | Compute type (CPU, AMPERE_A10, etc.) |
gpu_count | integer | 0 | Number of GPUs |
provider | string | "aws" | Cloud provider |
region | string | "us-east-1" | Deployment region |
Memory refers to RAM, not GPU VRAM. Ensure sufficient memory for your workload.
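A hardware block for a single-GPU app might look like the sketch below. The CPU and memory figures are illustrative assumptions, not recommendations; the compute type, provider, and region are taken from the table above.

```toml
[cerebrium.hardware]
cpu = 2.0                # CPU cores (illustrative value)
memory = 8.0             # RAM in GB, not GPU VRAM (illustrative value)
compute = "AMPERE_A10"   # use "CPU" for CPU-only workloads
gpu_count = 1
provider = "aws"
region = "us-east-1"
```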
Scaling Configuration
The [cerebrium.scaling] section controls auto-scaling behavior.
Option | Type | Default | Description |
---|---|---|---|
min_replicas | integer | 0 | Minimum running instances |
max_replicas | integer | 2 | Maximum running instances |
replica_concurrency | integer | 10 | Concurrent requests per replica |
response_grace_period | integer | 3600 | Grace period in seconds |
cooldown | integer | 1800 | Time to wait before scaling down an idle container |
scaling_metric | string | "concurrency_utilization" | Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization) |
scaling_target | integer | 100 | Target value for scaling metric (percentage for utilization metrics, absolute value for requests_per_second) |
scaling_buffer | integer | optional | Additional replica capacity above what scaling metric suggests |
roll_out_duration_seconds | integer | 0 | Gradually send traffic to new revision after successful build. Max 600s. Keep at 0 during development. |
Setting min_replicas > 0 maintains warm instances for immediate response but increases costs.
The scaling_metric options are:
- concurrency_utilization: Maintains a percentage of your replica_concurrency across instances. For example, with replica_concurrency=200 and scaling_target=80, maintains 160 requests per instance.
- requests_per_second: Maintains a specific request rate across all instances. For example, scaling_target=5 maintains 5 requests/s average across instances.
- cpu_utilization: Maintains CPU usage as a percentage of cerebrium.hardware.cpu. For example, with cpu=2 and scaling_target=80, maintains 80% CPU utilization (1.6 CPUs) per instance.
- memory_utilization: Maintains RAM usage as a percentage of cerebrium.hardware.memory. For example, with memory=10 and scaling_target=80, maintains 80% memory utilization (8GB) per instance.
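The concurrency_utilization example above could be written as the following sketch. The replica limits repeat the documented defaults; the other values come from the example in the text.

```toml
[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
replica_concurrency = 200
scaling_metric = "concurrency_utilization"
# 80% of replica_concurrency: 0.80 * 200 = 160 requests per instance.
scaling_target = 80
```

Switching to requests_per_second would mean changing scaling_metric and reading scaling_target as an absolute request rate rather than a percentage.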
The scaling_buffer option is only available with concurrency_utilization and requests_per_second metrics. It ensures extra capacity is maintained above what the scaling metric suggests.
For example, with min_replicas=0 and scaling_buffer=3, the system will maintain 3 replicas as baseline capacity.
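The buffer example can be sketched as the fragment below. The max_replicas value is an illustrative assumption, chosen only so that there is headroom above the buffer.

```toml
[cerebrium.scaling]
min_replicas = 0
max_replicas = 5                             # illustrative; leaves room above the buffer
scaling_metric = "concurrency_utilization"   # the buffer requires this metric or requests_per_second
scaling_buffer = 3                           # keeps 3 replicas as baseline capacity
```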
Dependencies
Pip Dependencies
The [cerebrium.dependencies.pip] section lists Python package requirements.
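The reference does not show the entry format here. The sketch below assumes each entry maps a package name to a version constraint string, with "latest" meaning unpinned; the packages and versions are illustrative only.

```toml
[cerebrium.dependencies.pip]
# Assumed format: package = "version constraint"
torch = "==2.0.0"    # pinned to an exact version (illustrative)
numpy = "latest"     # assumed keyword for the newest available version
```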
APT Dependencies
The [cerebrium.dependencies.apt] section specifies system packages.
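A possible shape for this section, assuming the same name-to-version mapping as an illustration (the package names are examples only):

```toml
[cerebrium.dependencies.apt]
# Assumed format: package = "version constraint"
ffmpeg = "latest"
git = "latest"
```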
Conda Dependencies
The [cerebrium.dependencies.conda] section manages Conda packages.
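As a sketch under the same assumption about entry format (package name mapped to a version constraint), a Conda section might look like this; the package and version are illustrative:

```toml
[cerebrium.dependencies.conda]
# Assumed format: package = "version constraint"
cudatoolkit = "==11.7"
```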
Dependency Files
The [cerebrium.dependencies.paths] section allows using requirement files.
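The key names are not listed in this reference. The sketch below assumes one key per package manager, each pointing to a file relative to the project root; the file names are hypothetical.

```toml
[cerebrium.dependencies.paths]
# Assumed keys: one per package manager, each pointing to a requirements file.
pip = "requirements.txt"
apt = "apt_packages.txt"       # hypothetical file name
conda = "conda_packages.txt"   # hypothetical file name
```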