We are phasing out support for Python 3.8 at the end of September 2023. If you are using Python 3.8, please update your dependencies and move your deployment to Python >= 3.9.

The Cerebrium Conduit object provides an abstraction that lets you deploy a model from any supported framework seamlessly with just a few lines of code.

Below is the general layout of the Conduit object:

from cerebrium import Conduit, model_type, hardware

# Some model training logic...

c = Conduit(
    name="<MODEL_NAME>",
    api_key="<API_KEY>",
    flow=("<MODEL_TYPE>", "<MODEL_PATH>"),
    hardware="<HARDWARE>",
)

c.deploy()

The Conduit object takes the following parameters (a fuller example sketch follows the list):

  • FLOW: A tuple of the model type and the model path, made up of:
    • MODEL_TYPE: This parameter specifies the type of model you are supplying to Cerebrium and must be a model_type. This ensures that Cerebrium knows how to handle your model. The currently supported model types are:
      • model_type.SKLEARN: Expects a .pkl file (the model is not required to be a regressor or classifier).
      • model_type.SKLEARN_CLASSIFIER: Expects a .pkl file (the model must be a classifier; returns a class probability distribution instead of a single class prediction).
      • model_type.SKLEARN_PREPROCESSOR: Expects a .pkl file. This is a special model type used to preprocess data with the .transform method before it is sent to the model, such as a scaler or a one-hot encoder.
      • model_type.TORCH: Expects a .pkl file serialized with cloudpickle or a JIT-scripted TorchScript .pt file.
      • model_type.XGBOOST_REGRESSOR: Expects a serialized .pkl file or an XGBoost .json file.
      • model_type.XGBOOST_CLASSIFIER: Expects a serialized .pkl file or an XGBoost .json file.
      • model_type.ONNX: Expects a serialized .onnx file.
      • model_type.SPACY: Expects a folder path to your spaCy model.
      • model_type.HUGGINGFACE_PIPELINE: Expects the task identifier and model identifier of the Hugging Face model.
    • MODEL_PATH: The path to your weights, either on your local machine or in the cloud. For Hugging Face models, this is the Hugging Face model identifier, e.g. 'meta-llama/Llama-2-13b-chat-hf'.
  • MODEL_NAME: The name you would like to give your model (alphanumeric, with hyphens, and fewer than 20 characters). This is a unique identifier for your model and will be used to call your model in the future.
  • API_KEY: This is the API key that can be found on your profile in the Cerebrium dashboard.
  • HARDWARE: The hardware parameter is an enum that can be one of the following:
    • hardware.CPU: This will run your model on a CPU. This is the default option for SKLearn, XGBoost, and SpaCy models.
    • hardware.GPU: (Deprecated) This will run your model on a T4 GPU. This is the default option for Torch, ONNX, and HuggingFace models.
    • hardware.A10: (Deprecated) This will run your model on an A10 GPU, which provides 24GB of VRAM. You should use this option if you are using a model that is too large to fit on the 16GB of VRAM that a T4 GPU provides. This will include most large HuggingFace models.
    • hardware.TURING_4000: An 8GB GPU that is great for lightweight models with less than 3B parameters in FP16.
    • hardware.TURING_5000: A 16GB GPU that is great for small models with less than 7B parameters in FP16. Most small HuggingFace models can run on this.
    • hardware.AMPERE_A4000: A 16GB GPU that is great for small models with less than 7B parameters in FP16. Significantly faster than an RTX 4000. Most small HuggingFace models can run on this.
    • hardware.AMPERE_A5000: A 24GB GPU that is great for medium models with less than 10B parameters in FP16. A great option for almost all HuggingFace models.
    • hardware.AMPERE_A6000: A 48GB GPU offering a great cost-to-performance ratio. This is great for medium models with less than 21B parameters in FP16. A great option for almost all HuggingFace models.
    • hardware.A100: A 40GB GPU offering some of the highest performance available. This is great for large models with less than 18B parameters in FP16. A great option for almost all HuggingFace models, especially if inference speed is your priority.
  • CPU: This is the number of CPU cores you want to allocate to your model. Optional as it defaults to 2. Can be an integer between 1 and 32.
  • MEMORY: This is the number of GB of memory you would like to allocate to your model. Optional as it defaults to 8.0GB. Depending on your hardware selection, this float can be between 2.0 and 256.0.
  • COOLDOWN: Cooldown period, in seconds since the last request is completed, before an inactive replica of your deployment is scaled down. Defaults to 60s.
  • MIN_REPLICAS: The minimum number of replicas you would like to keep active. Defaults to 0 to allow serverless execution. Can be set above 0 to keep one or more replicas active at all times. The maximum number of replicas is dependent on your subscription plan.
  • MAX_REPLICAS: The maximum number of replicas you would like to allow for your deployment. Useful for cost-sensitive applications when you need to limit the number of replicas that can be created. The maximum number of replicas is dependent on your subscription plan.
  • REQUIREMENTS_FILE: Optional path to a requirements.txt file that will be installed in the deployment environment. This is useful when you need additional libraries or packages to run your model. Defaults to None.
  • FORCE_REBUILD: Optional boolean to force a rebuild of the deployment environment. This is useful when you need to have a clean environment without any of the cached dependencies from previous deployments. Don’t worry, your persistent storage is safe. Defaults to False.
  • PYTHON_VERSION: You can choose the version of Python that you would like to use for your deployment by using this optional parameter. We support Python 3.8, 3.9, 3.10, and 3.11. If you do not specify a Python version, we will default to Python 3.10.
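
For a fuller picture, below is a hedged sketch of a deployment that sets several of the optional parameters above. The lowercase keyword names (cpu, memory, cooldown, min_replicas, max_replicas, requirements_file, python_version) are assumed to mirror the documented parameters; check the SDK reference for the exact signatures.

from cerebrium import Conduit, model_type, hardware

# Hedged sketch only: the lowercase keyword arguments below are assumed to
# correspond to the CPU, MEMORY, COOLDOWN, etc. parameters documented above.
c = Conduit(
    name="onnx-classifier",                  # alphanumeric + hyphens, fewer than 20 characters
    api_key="<API_KEY>",
    flow=(model_type.ONNX, "./model.onnx"),  # (MODEL_TYPE, MODEL_PATH) tuple
    hardware=hardware.AMPERE_A5000,          # 24GB GPU, suits most medium models
    cpu=4,                                   # assumed keyword for CPU cores
    memory=16.0,                             # assumed keyword for memory in GB
    cooldown=120,                            # seconds of inactivity before scale-down
    min_replicas=0,                          # 0 allows serverless execution
    max_replicas=3,                          # cap replicas for cost control
    requirements_file="requirements.txt",    # assumed keyword for extra dependencies
    python_version="3.10",                   # assumed keyword for the Python version
)

c.deploy()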

Every unique model name will create a separate deployment with a separate endpoint. It is important to keep track of the model names you have used so that you can call the correct model in the future. If you deploy a model with the same name as a previous model, the previous model will be archived and the new model will be deployed automatically. This is useful for versioning your models.

Once you’ve run the deploy function, give it a minute, and it should be deployed - easy-peasy! If your deployment is successful, you will see the following output:

✅ Authenticated with Cerebrium!
⬆️  Uploading conduit artifacts...
100%|██████████| 179k/179k [00:04<00:00, 42.7kB/s]
✅ Conduit artifacts uploaded successfully.
✅ Conduit deployed!
🌍 Endpoint: https://run.cerebrium.ai/v1/YOUR-PROJECT-ID/YOUR-MODEL-NAME/predict

Our deploy function will also return the endpoint of your model directly. This is the URL that you will use to call your model in the future.
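
For example, you can capture the returned endpoint directly (a minimal sketch, assuming deploy() returns the endpoint URL as a string, as described above):

endpoint = c.deploy()
print(endpoint)  # e.g. https://run.cerebrium.ai/v1/<PROJECT_ID>/<MODEL_NAME>/predict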

API Specification and Helper methods

Below you can see an example of the request and response objects for calls made to your model. It should feel much like calling your model locally in your own Python environment.

Request Parameters

  curl --location --request POST '<ENDPOINT>' \
      --header 'Authorization: <API_KEY>' \
      --header 'Content-Type: application/json' \
      --data-raw '[<DATA_INPUT>]'

  • Authorization (string, required): This is the Cerebrium API key used to authenticate your request. You can get it from your Cerebrium dashboard.
  • Content-Type (string, required): The content type of your request. Must be application/json, or multipart/form-data if sending files.
  • data (array, required): A list of data points you would like to send to your model, e.g. for 1 data point of 3 features: [[1,2,3]].
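
If you prefer calling your endpoint from Python rather than curl, below is a minimal sketch using the requests library; the endpoint URL, API key, and input features are placeholders.

import requests

# Minimal sketch: replace the placeholders with your own endpoint and API key.
endpoint = "https://run.cerebrium.ai/v1/<PROJECT_ID>/<MODEL_NAME>/predict"
headers = {
    "Authorization": "<API_KEY>",
    "Content-Type": "application/json",
}
payload = [[1, 2, 3]]  # one data point with three features

response = requests.post(endpoint, headers=headers, json=payload)
response.raise_for_status()
print(response.json())

A successful request returns a response body like the following: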

{
  "result": [<MODEL_PREDICTION>],
  "run_id": "<RUN_ID>",
  "run_time_ms": <RUN_TIME_MS>
  "prediction_ids": ["<PREDICTION_ID>"]
}

Response Parameters

  • result (array, required): The result of your model prediction.
  • run_id (string, required): The run ID associated with your model predictions.
  • run_time_ms (float, required): The amount of time it took your model to run, down to the millisecond. This is what we charge you based on.
  • prediction_ids (array, required): The prediction IDs associated with each of your model predictions. Used to track your model predictions with monitoring tools.
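
As a quick illustration of consuming these fields, here is a minimal sketch continuing the requests example above:

body = response.json()
predictions = body["result"]          # model output for each data point
print(body["run_id"], body["run_time_ms"])
for pid in body["prediction_ids"]:    # IDs for tracking predictions in monitoring tools
    print(pid)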

You can test out your model endpoint quickly with our utility function supplied in Cerebrium, model_api_request.

from cerebrium import model_api_request
model_api_request(endpoint, data, '<API_KEY>')

The function takes in the following parameters:

  • endpoint: The endpoint of your model that was returned by the deploy function.
  • data: The data you would like to send to your model. You may feed an ndarray or Tensor directly into this function, as shown in the sketch after this list.
  • api_key: This is the Cerebrium API key used to authenticate your request.
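
For example, below is a hedged sketch that sends a NumPy array to the endpoint; the endpoint URL and API key are placeholders, and the three-feature input is illustrative.

from cerebrium import model_api_request
import numpy as np

# Hedged sketch: placeholders for the endpoint and API key; the helper is
# assumed to return the endpoint's response.
endpoint = "https://run.cerebrium.ai/v1/<PROJECT_ID>/<MODEL_NAME>/predict"
data = np.array([[1.0, 2.0, 3.0]])  # one data point with three features
response = model_api_request(endpoint, data, "<API_KEY>")
print(response)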

To get started, see how easy it is to deploy any of the frameworks below:

Start with a framework