Introduction
We are phasing out support for Python 3.8 at the end of September 2023. If you are using Python 3.8, please update your dependencies and move your deployment to Python >= 3.9.
Many of our more hardcore users wanted finer-grained control over their code and package dependencies, which is why we opened up our underlying architecture to allow engineers and data scientists to deploy any custom Python code.
We have been focusing hard on the developer experience so that if your code runs locally, you can deploy it to serverless CPUs/GPUs with one line. Initial deploys take about a minute (installing packages, etc.); thereafter, every change deploys in under 20 seconds, and inference requests have a cold start of under 1 second!
Model deployment and startup time is proportional to model size (including dependencies), so for lightning-fast deployments, start-ups, and scale-ups, keep your model as small as possible!
You can get started with our simple tutorial below:
Below is a brief outline of the setup.
Components
You can get started with your first project by running the command below. It will create a starter template for you.
cerebrium init-cortex first-project
Currently, our implementation has four components:
- main.py - This is where your Python code lives. This is mandatory to include.
- requirements.txt - This is where you define your Python packages. Deployment will be quicker if you pin specific versions. This is optional to include.
- pkglist.txt - This is where you define the Linux packages you would like to install. We run apt install for each item listed here. This is optional to include.
- conda_pkglist.txt - This is where you define the Conda packages you would like to install, if you prefer Conda over pip for some libraries. You can use conda and pip in conjunction. This is optional to include.
In addition to these files, you can include images and other Python files in the same directory, and they will be packaged and uploaded when you deploy.
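Putting this together, after running the init command your project directory should look roughly like the sketch below (illustrative; the exact starter template contents may differ):
first-project/
├── main.py            # mandatory: your Python code and predict function
├── requirements.txt   # optional: pip packages, one per line
├── pkglist.txt        # optional: Linux (apt) packages, one per line
└── conda_pkglist.txt  # optional: Conda packages, one per line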
Every main.py you deploy needs the following mandatory layout:
from pydantic import BaseModel

class Item(BaseModel):
    # Declare each request parameter and its type here
    parameter: str

def predict(item, run_id, logger):
    item = Item(**item)

    if not item.parameter:
        logger.info('User did not send specific parameter in request')
        return {"status_code": 422}  # returns a 422 status code

    # Do something with the parameters from item
    return {"key": "value"}
The Item class is where you define the parameters your model receives as well as their types. Item needs to inherit from BaseModel. You also need to define a function named predict, which receives three parameters: item, run_id, and logger.
- item: This contains the request parameters you defined above
- run_id: This is a unique identifier for the user request if you want to use it to track predictions through another system
- logger: Cerebrium supports logging via the logger (print() statements are also supported); however, using the logger will format your logs more cleanly. It exposes the three levels common to most loggers:
- logger.info
- logger.debug
- logger.error
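As a quick local sanity check before deploying (a sketch, not part of the required layout), you can call predict yourself, substituting Python's standard logging module for the Cerebrium-provided logger:
import logging

from main import predict  # assumes the main.py layout shown above

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("local-test")

# Simulate the payload and run identifier Cerebrium would pass in
result = predict({"parameter": "hello"}, "local-run-id", logger)
print(result)  # {'key': 'value'}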
Status Codes
To return a specific status code such as 422 or 404, include a "status_code" key in the JSON you return. If no status_code value is returned and the function executes successfully, a 200 status code is returned by default.
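For example, a sketch of a predict function that returns a 404 when a lookup fails (lookup_record and the message/result keys are purely illustrative, not part of the Cerebrium API):
def predict(item, run_id, logger):
    item = Item(**item)

    record = lookup_record(item.parameter)  # hypothetical helper, for illustration only
    if record is None:
        logger.info('No record found for the given parameter')
        return {"status_code": 404, "message": "Record not found"}

    return {"result": record}  # no status_code key, so this defaults to 200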
Both requirements.txt and pkglist.txt follow the standard layout: each package goes on its own line.
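For example, a requirements.txt and a pkglist.txt might look like this (the packages and versions are purely illustrative):
requirements.txt:
transformers==4.30.2
torch==2.0.1
pydantic==1.10.9

pkglist.txt:
ffmpeg
libgl1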
Deploy model
To deploy a model, install our pip package if you haven't already:
pip install --upgrade cerebrium
Then navigate to where your model code (specifically your main.py) is located and run the following command:
cerebrium deploy <MODEL_NAME> <PRIVATE_API_KEY> --hardware=<HARDWARE>
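For example, deploying the starter project onto an A5000 might look like this (the model name is illustrative; replace the placeholder with your own private API key):
cerebrium deploy first-project <PRIVATE_API_KEY> --hardware=AMPERE_A5000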
For the parameters above:
- <MODEL_NAME>: Give your model a name for you to recognize on your Cerebrium dashboard
- <PRIVATE_API_KEY>: This is the private API key you can find on your Cerebrium dashboard
- <cpu>: This is the number of CPU cores you want to allocate to your model. Optional as it defaults to 2. Can be an integer between 1 and 32
- <memory>: This is the number of GB of memory you’d like to allocate to your model. Optional as it defaults to 8.0GB. Depending on your hardware selection, this float can be between 2.0 and 256.0
- <hardware>: This can be any of the following options:
- TURING_4000 : An 8GB GPU that is great for lightweight models with less than 3B parameters in FP16.
- TURING_5000 : A 16GB GPU that is great for small models with less than 7B parameters in FP16. Most small HuggingFace models can run on this.
- AMPERE_A4000 : A 16GB GPU that is great for small models with less than 7B parameters in FP16. Significantly faster than an RTX 4000. Most small HuggingFace models can run on this.
- AMPERE_A5000 : A 24GB GPU that is great for medium models with less than 10B parameters in FP16. A great option for almost all HuggingFace models.
- AMPERE_A6000 : A 48GB GPU offering a great cost to performance ratio. This is great for medium models with less than 21B parameters in FP16. A great option for almost all HuggingFace models.
- A100 : A 40GB GPU offering some of the highest performance available. This is great for large models with less than 18B parameters in FP16. A great option for almost all HuggingFace models especially if inference speed is your priority.
- <cooldown>: Cooldown period, in seconds since the last request is completed, before an inactive replica of your deployment is scaled down. Defaults to 60s.
- <min_replicas>: The minimum number of replicas you would like to allow for your deployment. Set to 0 if you would like serverless deployments. Otherwise, for high volume applications, you can set this to a higher number to skip scale-up time and keep servers waiting. The minimum number of replicas is dependent on your subscription plan. Defaults to 0.
- <max_replicas>: The maximum number of replicas you would like to allow for your deployment. Useful for cost-sensitive applications when you need to limit the number of replicas that can be created. The maximum number of replicas is dependent on your subscription plan.
- <forced-rebuild>: This will force a rebuild of your deployment, even if the code has not changed. This is particularly useful if you would like to clear the cached dependencies of your deployment or if something has gone wrong and you would like to start from scratch. This flag will not clear your persistent storage, so don't worry about losing your persistent files!
View model statistics and logs
Once you deploy a model, navigate back to the dashboard and click on the name of the model you just deployed. You will see the usual overview statistics of your model, but most importantly, you will see two tabs titled builds and runs.
- Builds: This is where you can see the logs regarding the creation of your environment and the code specified in the Init function. These logs appear only on each deploy.
- Runs: This is where you will see logs concerning every API call to your model endpoint. You can therefore debug every run based on input parameters and the model output.
We recommend you check out the advanced functionality section of Cortex, as well as work through the examples to see how deployments were done for various sample applications.
Advanced Functionality
Now that we have covered the basics of deploying a model, the links below outline some of the more advanced functionality that Cortex provides:
- Using config files to deploy models. See here.
- Handling long-running tasks. See here.
- Quickly initialising a new Cortex deployment using the CLI. See here.
- Utilising persistent storage to share files between builds and runs. See here.
- Utilising secrets to store sensitive information. See here.
- Running async functions in Cortex. See here.