Scaling of Deployments

Deployments on Cerebrium are scaled up and down automatically based on the number of requests your deployment is receiving. This is done to ensure that you are not paying for idle resources and that your deployment is always available to serve requests.
There are three parameters that regulate the scaling of your deployment:

  • minReplicas: The minimum number of replicas you would like to allow for your deployment. Set to 0 if you would like serverless deployments. Otherwise, latency sensitive applications, you can set this to a higher number to skip scale-up time and keep servers waiting however you will be charged for runtime. Defaults to 0.
  • maxReplicas: The maximum number of replicas you would like to allow for your deployment. Useful for cost-sensitive applications when you need to limit the number of replicas that can be created. By default this is not set and we scale your model to fit the your volume of requests.
  • cooldown: Cooldown period in seconds is the period of inactivity before the number of replicas for your deployment is scaled down by 1. Defaults to 60s.

While these parameters are set at deployment time, you can change them at any time using the Cerebrium CLI. To change the scaling parameters for your deployment, run the following command:

cerebrium model-scaling <<<YOUR_APP_NAME>>>                 \
                        --cooldown  <<<TIME_IN_SECONDS >>>  \
                        --min-replicas <<INTEGER>>          \
                        --max-replicas <<INTEGER>>

Each of the parameters are optional and your current values will be preserved if you do not specify them.

Updating model scaling using a rest endpoint

You may want to update your model using a rest endpoint. To do so, you can use the following endpoint:

curl --location 'https://rest-api.cerebrium.ai/update-model-scaling' \
--header 'Authorization: Bearer <JWT_TOKEN>' \
--header 'Content-Type: application/json' \
--data '{
    "name": "your-models-unique-name-here",
    "minReplicaCount": 0,
    "maxReplicaCount": 1,
    "cooldownPeriodSeconds": 180
}'

Replace the values for minReplicaCount, maxReplicaCount and cooldownPeriodSeconds with your desired values. All values are optional so if you don’t want to update your max replicas, just leave it out of the request body. Also make sure that name matches the name you gave your model when you deployed it!

You’ll receive the following confirmation response if successful:

{
    "message": "✅ Model scaling successfully"
}