Keep models warm
Based on traffic and implementation you might want to keep instances running
While we strive to lower cold start times and improve model loading onto the GPU, you may prefer to keep your instances warm and waiting to handle incoming requests. There are two ways to do this based on your use case:
- Set min replicas to 1 or more.
This is set through the min_replicas option in your cerebrium.toml
file. This is typically the best option if you would like to sustain a base load or would like
to meet minimum SLA’s with customers. Please note that you are charged for 24/7 usage of the instances
- Set your cooldown period
You set this using the cooldown parameter in your cerebrium.toml
and is by default set to 60 seconds. This is the number of seconds of inactivity from when your last
request finishes that a container must experience before terminating. Every time you get a new request, this time is reset. It is important to note that you are charged
for the cooldown time since your container is constantly running.
Example
[cerebrium.scaling]
min_replicas = 1
cooldown = 60