While we strive to lower cold start times and improve model loading onto the GPU, you may prefer to keep your instances warm and waiting to handle incoming requests. There are two ways to do this based on your use case:

  1. Set min replicas to 1 or more.

This is set through the min_replicas option in your cerebrium.toml file. It is typically the best option if you want to sustain a base load or meet minimum SLAs with customers. Please note that you are charged for 24/7 usage of these instances.

  2. Set your cooldown period

You set this using the cooldown parameter in your cerebrium.toml; it defaults to 60 seconds. This is the number of seconds of inactivity, measured from when your last request finishes, that a container must experience before it terminates. Every new request resets this timer. It is important to note that you are charged for the cooldown time, since the container keeps running throughout it.

Example

[cerebrium.scaling]
min_replicas = 1
cooldown = 60
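
To make the cooldown behaviour concrete, here is a minimal Python sketch that estimates how long a single container stays up (and is therefore billed) for a given pattern of requests. It is a simplified model, not Cerebrium's actual billing logic: the function name `billable_seconds` and the request format are hypothetical, and it assumes one replica, ignores cold start time, and treats min_replicas as 0.

```python
from typing import List, Tuple


def billable_seconds(
    requests: List[Tuple[float, float]], cooldown: float = 60.0
) -> float:
    """Estimate container up-time under a simplified cooldown model.

    `requests` is a list of (start, end) times in seconds. Hypothetical
    model: the container starts when a request arrives while it is down,
    and terminates once `cooldown` seconds pass after the last request
    finishes with no new request arriving.
    """
    total = 0.0
    session_start = None  # when the current warm period began
    shutdown_at = None    # when the container would terminate if idle

    for start, end in sorted(requests):
        if shutdown_at is None or start > shutdown_at:
            # Container was down: bill the previous warm period, start a new one.
            if shutdown_at is not None:
                total += shutdown_at - session_start
            session_start = start
            shutdown_at = end + cooldown
        else:
            # Container was still warm: the new request resets the timer.
            shutdown_at = max(shutdown_at, end + cooldown)

    if shutdown_at is not None:
        total += shutdown_at - session_start
    return total


# Two requests close together share one warm period; the third arrives
# after the container has terminated and starts a new one.
print(billable_seconds([(0, 10), (30, 40), (200, 205)], cooldown=60))
```

Under this model, a longer cooldown reduces cold starts for bursty traffic at the cost of more billed idle time, whereas min_replicas = 1 keeps an instance up continuously (86,400 seconds per day) regardless of traffic.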