To run your model locally and check that it works as intended before deploying, use the load and run methods. load reads your pipeline into memory from a base path, using the file paths you specified in deploy, while run executes the loaded pipeline sequentially.

from cerebrium import deploy, model_type

# deploy returns a Conduit object for the given (model type, model file) pair
conduit = deploy(('<MODEL_TYPE>', '<MODEL_FILE>'), '<MODEL_NAME>', '<API_KEY>')
conduit.load('./')  # load the pipeline into memory from the base path
conduit.run(data)   # execute the loaded pipeline on your test data

Here, data is the input you would send to your model: typically a numerical 2D/3D array for most models, or a list of strings for a language model. You can feed an ndarray or Tensor directly into this function. However, if you are using a custom data pipeline that expects another type, you may need to convert your input into the appropriate format for your model.
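For example, a minimal sketch assuming a model that expects a 2D feature array (the feature values below are made up for illustration):

import numpy as np

# Hypothetical test input: one row of four feature values
data = np.array([[5.1, 3.5, 1.4, 0.2]])
result = conduit.run(data)
print(result)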

You can also define a Conduit object directly using the Conduit class, then call its run method to test the model locally, or its deploy method to deploy the Conduit’s model flow to Cerebrium. When you construct the Conduit directly, you can specify the hardware your model runs on through the hardware parameter in the constructor. The hardware parameter is an enum that can be one of the following (see the example after this list):

  • hardware.CPU: This will run your model on a CPU. This is the default option for SKLearn, XGBoost, and SpaCy models.
  • hardware.GPU: (Deprecated) This will run your model on a T4 GPU. This is the default option for Torch, ONNX, and HuggingFace models.
  • hardware.A10: (Deprecated) This will run your model on an A10 GPU, which provides 24GB of VRAM. You should use this option if you are using a model that is too large to fit on the 16GB of VRAM that a T4 GPU provides. This will include most large HuggingFace models.
  • hardware.TURING_4000: An 8GB GPU that is great for lightweight models with less than 3B parameters in FP16.
  • hardware.TURING_5000: A 16GB GPU that is great for small models with less than 7B parameters in FP16. Most small HuggingFace models can run on this.
  • hardware.AMPERE_A4000: A 16GB GPU that is great for small models with less than 7B parameters in FP16. Significantly faster than an RTX 4000. Most small HuggingFace models can run on this.
  • hardware.AMPERE_A5000: A 24GB GPU that is great for medium models with less than 10B parameters in FP16. A great option for almost all HuggingFace models.
  • hardware.AMPERE_A6000: A 48GB GPU offering a great cost-to-performance ratio. This is great for medium models with less than 21B parameters in FP16. A great option for almost all HuggingFace models.
  • hardware.A100: An 80GB GPU offering some of the highest performance available. This is great for large models with less than 18B parameters in FP16. A great option for almost all HuggingFace models, especially if inference speed is your priority.

from cerebrium import Conduit, model_type, hardware

# Build the Conduit directly, specifying the hardware tier it should run on
conduit = Conduit(
  '<MODEL_NAME>',
  '<API_KEY>',
  [('<MODEL_TYPE>', '<MODEL_FILE>')],
  hardware=hardware.<HARDWARE_TYPE>
)
conduit.load('./')  # load the pipeline locally
conduit.run(data)   # test the model locally
conduit.deploy()    # deploy the model flow to Cerebrium
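
As a concrete sketch, assuming a scikit-learn model saved as a pickle (the deployment name, file path, and test input below are placeholders of our own, and model_type.SKLEARN is assumed to be the matching enum value):

from cerebrium import Conduit, model_type, hardware
import numpy as np

# A sketch: run a hypothetical scikit-learn model on the 24GB Ampere A5000 tier
conduit = Conduit(
  'my-sklearn-model',                     # hypothetical deployment name
  '<API_KEY>',
  [(model_type.SKLEARN, './model.pkl')],  # hypothetical model file
  hardware=hardware.AMPERE_A5000
)
conduit.load('./')
print(conduit.run(np.array([[0.1, 0.2, 0.3]])))  # hypothetical 2D test input
conduit.deploy()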

Additionally, defining a Conduit object directly allows you to add more models to your flow dynamically using the add_model method.

conduit.add_model('<MODEL_TYPE>', '<MODEL_FILE>', {<PROCESSING_FUNCTIONS>})
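
For instance, a sketch of appending a second model to an existing flow; the enum value, file path, and the processing-functions dictionary schema below are assumptions for illustration, not a documented API:

from cerebrium import model_type

# Hypothetical post-processing step for the second model's output;
# the {'post_process': ...} key is an assumed schema, not confirmed by Cerebrium
def to_list(result):
    return result.tolist()

# Append a hypothetical second model to the existing conduit
conduit.add_model(model_type.SKLEARN, './second_model.pkl', {'post_process': to_list})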