Do's and Don'ts for Faster Deployments
Some guidelines to boost the speed of your deployments and cold starts.
Introduction
In today’s fast-paced digital landscape, where productivity and efficiency reign supreme, the ability to deploy your code quickly and reliably saves valuable time. Whether you’re a seasoned developer or a tech enthusiast, these guidelines can help you save time in deploying, building, and running inference with your Cerebrium model.
In this brief guide, we delve into the strategies, best practices, and pitfalls to avoid when it comes to the deployment process.
Setting Up Your Cerebrium Dependencies
- Do freeze the versions of your Python libraries.
Resolving unpinned requirements can extend your build time unnecessarily. Setting up your environment during development has already determined which versions you need; use those versions when creating your `cerebrium.dependencies` in your `cerebrium.toml` file (see the example after this list).
- Don't include built-in modules in your requirements, as this wastes time during your build.
- Do reduce unused requirements where possible. They are extra baggage that slows down your build when you need to move fast.
- Don't dump your entire Python environment into the requirements.
Deploying Your Code
- Do check and lint your code before deploying.
Spelling mistakes and simple errors are easy to avoid and can significantly slow down your development cycle.
- Do upload unchanging files to cloud storage when possible.
Downloading weights from cloud servers in your `main.py` (instead of uploading them from your local computer) leverages faster internet speeds, yielding quicker deployments.
- Do use the `--include` and `--exclude` flags to upload only the files you need.
- Do perform prerequisite setup once in the body of `main.py`, outside your predict function (see the sketch after this list).
- Do use global variables where possible to prevent recomputing values on each inference call.
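A minimal sketch of this pattern, assuming a Hugging Face pipeline; the function name, signature, and task are illustrative, not a prescribed Cerebrium interface:

```python
# main.py -- module-level setup runs once at startup, not on every request
from transformers import pipeline

# Global setup: the pipeline (download + model construction) is built a single
# time, then reused by every inference call.
sentiment = pipeline("sentiment-analysis")  # placeholder task/model

def predict(text: str):
    # Only inference happens here -- no downloads, no model construction
    return sentiment(text)[0]
```

Because `sentiment` is a module-level global, repeated calls to `predict` reuse the same loaded weights instead of rebuilding the pipeline each time.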
Downloading Files and Setting Up Models
- Do use persistent storage.
- Do set the Hugging Face cache directory to `/persistent-storage` for your models, tokenizers, and datasets (a sketch follows this list).
- Don't re-download models, tokenizers, or other assets if possible.
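One way to do this, assuming persistent storage is mounted at `/persistent-storage` (the subdirectory name below is illustrative), is to point the Hugging Face cache there before loading anything:

```python
import os

# Point the Hugging Face cache at persistent storage so downloads survive restarts
os.environ["HF_HOME"] = "/persistent-storage/hf-cache"

from transformers import AutoTokenizer, AutoModel

# Subsequent from_pretrained calls read from the persistent cache instead of
# re-downloading; cache_dir can also be passed explicitly per call.
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased", cache_dir="/persistent-storage/hf-cache"
)
model = AutoModel.from_pretrained(
    "distilbert-base-uncased", cache_dir="/persistent-storage/hf-cache"
)
```

Setting `HF_HOME` before the first `transformers` import matters, since the cache location is read when the library loads.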
For Faster Inference
- Do use inference calls only for inference. Perform all setup outside the predict function so it runs once during the build process.
- Do take advantage of available resources. Ensure your model runs on the GPU when available:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
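For the device check to pay off, the model and the inputs both need to live on that device, and the model should be moved there once at startup rather than inside the handler. A minimal sketch, where the toy model and handler name are placeholders:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model; in practice this is the model loaded at startup.
# Moving the weights to the GPU once avoids a host-to-GPU transfer per request.
model = nn.Linear(16, 2).to(device)
model.eval()

def predict(features: list[float]):
    x = torch.tensor(features, device=device).unsqueeze(0)  # inputs on the same device
    with torch.inference_mode():                            # skip autograd bookkeeping
        return model(x).argmax(dim=-1).item()
```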