Learn to optimise for cost and performance by scaling out apps
The `cerebrium.toml` file controls scaling behavior through several key parameters:
The `min_replicas` parameter defines how many instances remain active at all times. Setting this to 1 or higher maintains warm instances ready for immediate response, eliminating cold starts but increasing costs. This configuration suits apps that require consistent response times or need to meet specific SLA requirements.
The `max_replicas` parameter sets an upper limit on concurrent instances, controlling costs and protecting backend systems. When traffic increases, new instances start automatically up to this configured maximum.
The `cooldown` parameter defines how many seconds an idle instance stays alive after serving its last request. Each new request resets this timer. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.
The number of requests a single replica handles at once is controlled by the `replica_concurrency` parameter. This is a hard limit, and an individual replica will not accept more than this number of requests at a time. By default, once this concurrency limit is reached on an instance and there are still in-flight requests to be processed, the system will scale out by the number of new instances required to fulfil those requests. For example, if `replica_concurrency=1` and there are 3 requests in flight with no replicas currently available, Cerebrium will scale out 3 instances of the application to meet that demand.
For GPU workloads, it is recommended that `replica_concurrency` is set to 1. If the workload requires a GPU but higher throughput is desired, `replica_concurrency` may be increased so long as access to GPU resources is controlled within the application through batching.

The `response_grace_period` parameter stipulates the maximum number of seconds a request needs to finish, and provides time for instances to complete active requests during normal operation and shutdown. During normal replica operation, this simply corresponds to a request timeout value. During replica shutdown, the Cerebrium system sends a SIGTERM signal to the replica, waits for the specified grace period, issues a SIGKILL command if the instance has not stopped, and terminates any active requests with a GatewayTimeout error.
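Putting these parameters together, a scaling block in `cerebrium.toml` might look like the following sketch (the values shown are illustrative, not recommendations):

```toml
[cerebrium.scaling]
min_replicas = 1            # keep one warm instance to avoid cold starts
max_replicas = 10           # hard cap on concurrent instances
cooldown = 60               # seconds of inactivity before an idle replica is stopped
replica_concurrency = 1     # each replica accepts one request at a time
response_grace_period = 300 # max seconds a request may take to finish
```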
Out of the box, Cerebrium scales based on `replica_concurrency`. However, this strategy may be insufficient for some use cases, and so Cerebrium currently provides four scaling metrics to choose from:
- `concurrency_utilization`
- `requests_per_second`
- `cpu_utilization`
- `memory_utilization`
These are configured in the `cerebrium.scaling` section by specifying one of the metrics and a target, like so:
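A minimal sketch (the `scaling_metric` key name is an assumption here; `scaling_target` appears throughout the examples below):

```toml
[cerebrium.scaling]
scaling_metric = "concurrency_utilization"  # assumed key name for choosing the metric
scaling_target = 70                         # aim for 70% of replica_concurrency per instance
```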
`concurrency_utilization` is the default scaling metric, and defaults to a target of 100% if not set explicitly. This scaling metric works by maintaining a maximum percentage of your `replica_concurrency`, averaged across every instance of the app. For example, if an application has `replica_concurrency=1` and `scaling_target=70`, Cerebrium will attempt to maintain 0.7 requests per instance across your entire deployed service. This strategy ensures an extra 30% of capacity is always provisioned.
As a different example, say an app has `replica_concurrency=200` and `scaling_target=80`. In this case, Cerebrium will maintain 160 requests per instance, and will begin to scale out once that target has been exceeded.
`requests_per_second` is a straightforward criterion which aims to maintain a maximum application throughput, measured in requests per second averaged over every application instance. This can be a more effective scaling metric than `concurrency_utilization` if appropriate performance evaluation has been done on the application to determine its throughput. This criterion is not recommended for most GPU applications, since it does not enforce concurrency limits. For example, if `scaling_target=5`, Cerebrium will attempt to maintain an average of 5 requests/s across all app instances.
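A sketch of this configuration (same assumed `scaling_metric` key name as above):

```toml
[cerebrium.scaling]
scaling_metric = "requests_per_second"  # assumed key name, as above
scaling_target = 5                      # hold an average of 5 requests/s per instance
```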
`cpu_utilization` scales out based on a maximum CPU percentage utilization averaged over all instances of an application, relative to the `cerebrium.hardware.cpu` value. For example, if an application has `cpu=2` and `scaling_target=80`, Cerebrium will attempt to maintain 80% CPU utilization (1.6 CPUs) per instance across your entire deployed service. Since there is no notion of scaling relative to 0 CPU units, `min_replicas=1` is required when using this metric.
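A sketch of this scenario (assumed `scaling_metric` key name; other values mirror the example above):

```toml
[cerebrium.hardware]
cpu = 2                             # scaling_target is measured relative to this value

[cerebrium.scaling]
min_replicas = 1                    # required: this metric cannot scale relative to 0 CPUs
scaling_metric = "cpu_utilization"  # assumed key name, as above
scaling_target = 80                 # aim for 80% of 2 CPUs = 1.6 CPUs per instance
```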
`memory_utilization` scales out based on a maximum memory percentage utilization averaged over all instances of an application, relative to the `cerebrium.hardware.memory` value. Note that this refers to RAM, not GPU VRAM, utilization. For example, if an application has `memory=10` and `scaling_target=80`, Cerebrium will attempt to maintain 80% memory utilization (8GB) per instance across your entire deployed service. Since there is no notion of scaling relative to 0GB of memory, `min_replicas=1` is required when using this metric.
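The analogous sketch for memory (same assumptions as the CPU example):

```toml
[cerebrium.hardware]
memory = 10                            # GB of RAM; scaling_target is relative to this

[cerebrium.scaling]
min_replicas = 1                       # required: this metric cannot scale relative to 0GB
scaling_metric = "memory_utilization"  # assumed key name, as above
scaling_target = 80                    # aim for 80% of 10GB = 8GB per instance
```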
A static number of extra replicas can be kept on top of the autoscaler's suggestion with the `scaling_buffer` option in the config. Currently, this is only available when using the following scaling metrics:

- `concurrency_utilization`
- `requests_per_second`

It is configured in the `cerebrium.scaling` section by specifying `scaling_buffer`, like so:
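A sketch matching the scenario described below (`scaling_metric` key name assumed, as before):

```toml
[cerebrium.scaling]
min_replicas = 0
replica_concurrency = 1
scaling_metric = "concurrency_utilization"  # assumed key name, as above
scaling_target = 100
scaling_buffer = 3   # always add 3 replicas on top of the autoscaler's suggestion
```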
With this configuration, the app maintains a baseline of 3 replicas even though `min_replicas=0`, since `scaling_buffer` is actually modifying the autoscaler's suggested replica count of 0, and 0+3=3.
Since the config specifies a target of 100 for `concurrency_utilization` and `replica_concurrency=1`, if the application now receives 1 request, the autoscaler will suggest 1 replica for scale out. However, since we have `scaling_buffer=3`, the application will actually scale to (1+3)=4 replicas.
In other words, the scaling buffer simply adds a static number of replicas to the count the autoscaler suggests based on the scaling target.
Once this request has completed, the usual `cooldown` period will apply, and the app replica count will scale back down to the baseline of 3 replicas.