- [cerebrium.deployment] Core settings like app name, Python version, and file inclusion rules
- [cerebrium.runtime.custom] Custom web server settings and app startup behavior
- [cerebrium.hardware] Compute resources including CPU, memory, and GPU specifications
- [cerebrium.scaling] Auto-scaling behavior and replica management
- [cerebrium.dependencies] Package management for Python (pip), system (apt), and Conda dependencies
Deployment Configuration
The `[cerebrium.deployment]` section defines core deployment settings.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Desired app name |
| python_version | string | "3.12" | Python version to use (3.10, 3.11, 3.12) |
| disable_auth | boolean | false | Disable default token-based authentication on app endpoints |
| include | string[] | ["*"] | Files/patterns to include in deployment |
| exclude | string[] | [".*"] | Files/patterns to exclude from deployment |
| shell_commands | string[] | [] | Commands to run at the end of the build |
| pre_build_commands | string[] | [] | Commands to run before dependencies are installed |
| docker_base_image_url | string | "debian:bookworm-slim" | Base Docker image |
| use_uv | boolean | false | Use UV for faster Python package installation |
Changes to python_version or docker_base_image_url trigger full rebuilds since
they affect the base environment.
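A minimal `[cerebrium.deployment]` sketch using only the options listed above; the app name and include/exclude patterns are illustrative.

```toml
[cerebrium.deployment]
name = "my-app"                                # illustrative app name
python_version = "3.12"
docker_base_image_url = "debian:bookworm-slim"
include = ["*"]                                # deploy everything in the project directory
exclude = [".*"]                               # skip hidden files
```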
UV Package Manager
UV is a fast Python package installer written in Rust that can significantly speed up deployment times. When enabled, UV will be used instead of pip for installing Python dependencies. UV typically installs packages 10-100x faster than pip, which is especially beneficial for:
- Large dependency trees
- Multiple packages
- Clean builds without cache
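Enabling UV is a single flag in the deployment section; the app name below is illustrative.

```toml
[cerebrium.deployment]
name = "my-app"          # illustrative
use_uv = true            # install Python dependencies with UV instead of pip
```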
Monitoring UV Usage
Check your build logs for these indicators:
- `UV_PIP_INSTALL_STARTED` - UV is successfully being used
- `PIP_INSTALL_STARTED` - Standard pip installation (when `use_uv=false`)
While UV is compatible with most packages, some edge cases may cause build
failures, such as legacy packages with non-standard metadata.
Runtime Configuration
The `[cerebrium.runtime.custom]` section configures custom web servers and runtime behavior.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| port | integer | required | Port the application listens on |
| entrypoint | string[] | required | Command to start the application |
| healthcheck_endpoint | string | "" | HTTP path for health checks (empty uses TCP). Failure causes the instance to restart |
| readycheck_endpoint | string | "" | HTTP path for readiness checks (empty uses TCP). Failure ensures the load balancer does not route to the instance |
The port specified in entrypoint must match the port parameter. All endpoints
will be available at
https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint
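A sketch of a custom runtime section, assuming a FastAPI app served by uvicorn; the module path, port, and health/ready paths are illustrative and must match what your server actually exposes.

```toml
[cerebrium.runtime.custom]
port = 8080
entrypoint = ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]  # port here must match `port` above
healthcheck_endpoint = "/health"   # illustrative paths served by the app
readycheck_endpoint = "/ready"
```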
Hardware Configuration
The `[cerebrium.hardware]` section defines compute resources.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| cpu | float | required | Number of CPU cores |
| memory | float | required | Memory allocation in GB |
| compute | string | "CPU" | Compute type (CPU, AMPERE_A10, etc.) |
| gpu_count | integer | 0 | Number of GPUs |
| provider | string | "aws" | Cloud provider |
| region | string | "us-east-1" | Deployment region |
Memory refers to RAM, not GPU VRAM. Ensure sufficient memory for your
workload.
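A hardware sketch for a single-GPU app; the resource sizes are illustrative and should be tuned to your workload.

```toml
[cerebrium.hardware]
cpu = 4
memory = 16.0            # RAM in GB, not GPU VRAM
compute = "AMPERE_A10"
gpu_count = 1
provider = "aws"
region = "us-east-1"
```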
Scaling Configuration
The `[cerebrium.scaling]` section controls auto-scaling behavior.
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| min_replicas | integer | 0 | Minimum running instances |
| max_replicas | integer | 2 | Maximum running instances |
| replica_concurrency | integer | 10 | Concurrent requests per replica |
| response_grace_period | integer | 3600 | Grace period in seconds |
| cooldown | integer | 1800 | Time in seconds to wait before scaling down an idle container |
| scaling_metric | string | "concurrency_utilization" | Metric for scaling decisions (concurrency_utilization, requests_per_second, cpu_utilization, memory_utilization) |
| scaling_target | integer | 100 | Target value for the scaling metric (percentage for utilization metrics, absolute value for requests_per_second) |
| scaling_buffer | integer | optional | Additional replica capacity above what the scaling metric suggests |
| roll_out_duration_seconds | integer | 0 | Gradually send traffic to the new revision after a successful build. Max 600s. Keep at 0 during development. |
Setting min_replicas > 0 maintains warm instances for immediate response but
increases costs.
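A scaling sketch that keeps one warm instance and scales on concurrency; the values are illustrative.

```toml
[cerebrium.scaling]
min_replicas = 1                             # keep one warm instance (costs more, responds immediately)
max_replicas = 5
replica_concurrency = 10
scaling_metric = "concurrency_utilization"
scaling_target = 80                          # scale out when instances average 8 in-flight requests
```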
The scaling_metric options are:
- `concurrency_utilization`: Maintains a percentage of your replica_concurrency across instances. For example, with `replica_concurrency=200` and `scaling_target=80`, maintains 160 requests per instance.
- `requests_per_second`: Maintains a specific request rate across all instances. For example, `scaling_target=5` maintains 5 requests/s on average across instances.
- `cpu_utilization`: Maintains CPU usage as a percentage of cerebrium.hardware.cpu. For example, with `cpu=2` and `scaling_target=80`, maintains 80% CPU utilization (1.6 CPUs) per instance.
- `memory_utilization`: Maintains RAM usage as a percentage of cerebrium.hardware.memory. For example, with `memory=10` and `scaling_target=80`, maintains 80% memory utilization (8 GB) per instance.
The scaling_buffer option is only available with the concurrency_utilization and requests_per_second metrics.
It ensures extra capacity is maintained above what the scaling metric suggests. For example, with
`min_replicas=0` and `scaling_buffer=3`, the system will maintain 3 replicas as baseline capacity.
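A sketch of that buffer scenario, scaling on request rate; the target value is illustrative.

```toml
[cerebrium.scaling]
min_replicas = 0
scaling_buffer = 3                        # always keep 3 replicas of headroom
scaling_metric = "requests_per_second"
scaling_target = 5
```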
Dependencies
Pip Dependencies
The `[cerebrium.dependencies.pip]` section lists Python package requirements.
APT Dependencies
The `[cerebrium.dependencies.apt]` section specifies system packages.
Conda Dependencies
The `[cerebrium.dependencies.conda]` section manages Conda packages.
Dependency Files
The `[cerebrium.dependencies.paths]` section lets you reference existing requirement files instead of listing packages inline.
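A combined sketch of the four dependency sections, assuming the name-to-version-constraint mapping style used in cerebrium.toml; the package names, version constraints, and the requirements.txt path are illustrative.

```toml
[cerebrium.dependencies.pip]
torch = ">=2.2.0"          # illustrative packages and constraints
transformers = "latest"

[cerebrium.dependencies.apt]
ffmpeg = "latest"

[cerebrium.dependencies.conda]
cuda = ">=11.7"

[cerebrium.dependencies.paths]
pip = "requirements.txt"   # assumes a requirements.txt at the project root
```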