Decrease Model Loading Time
Decrease the time it takes to load your model from storage into GPU memory
One of the biggest factors in model startup time is loading the model from storage into GPU memory. For example, a larger model (20B+ parameters) can take over 40 seconds to load with a standard Hugging Face load, even with 2GB/s transfer speeds from persistent storage.
While we’ve optimized the underlying hardware to load models as fast as possible, there are several ways to decrease model loading time and reduce cold-start times.
Using Serialization and Zero-Copy Initialization Libraries
Tensorizer (recommended)
Tensorizer is a library that loads models from storage into GPU memory in a single step. While initially built to fetch models from S3, it can also load models from Cerebrium’s persistent storage, which features nearly 2GB/s read speed. For large models (20B+ parameters), we’ve observed a 30–50% decrease in loading time, with even greater improvements for larger models. For more information on the underlying methods, see their GitHub page.
Below, we show how to use Tensorizer to load your model from storage straight into GPU memory in a single step.
Installation
Add the following to the [cerebrium.dependencies.pip] section of your cerebrium.toml file to install Tensorizer in your deployment:
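A minimal sketch of that entry is shown below; the "latest" pin is illustrative, and you may prefer to pin a specific version you have tested.

```toml
[cerebrium.dependencies.pip]
# Tensorizer from PyPI; replace "latest" with a pinned version for reproducible builds
tensorizer = "latest"
```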
Usage
To use Tensorizer, you first need to serialize your model and save it to your persistent storage.
This converts your model to a protocol buffer serialized format that is optimized for faster transfer speeds and fast loading into GPU memory.
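The snippet below is a sketch of that one-off serialization step. The model ID and the /persistent-storage path are illustrative assumptions; substitute your own model and your persistent storage mount path.

```python
import torch
from transformers import AutoModelForCausalLM
from tensorizer import TensorSerializer

MODEL_ID = "EleutherAI/gpt-j-6B"                      # illustrative model
OUTPUT_PATH = "/persistent-storage/gpt-j-6B.tensors"  # illustrative path on persistent storage

# Load the model once using the slow path, then serialize it to persistent storage.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)

serializer = TensorSerializer(OUTPUT_PATH)
serializer.write_module(model)
serializer.close()
```

You only need to run this once (for example, in a setup step or a separate job); subsequent deployments read the serialized file directly.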
Then, the next time your deployment starts, you can load the serialized model from storage into GPU memory in a single step. You would do this as follows:
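The sketch below assumes the same illustrative model and path as above. The no_init_or_tensor helper from tensorizer.utils builds the model skeleton without spending time initializing weights, since the deserializer overwrites them anyway.

```python
import time
from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer import TensorDeserializer
from tensorizer.utils import no_init_or_tensor

MODEL_ID = "EleutherAI/gpt-j-6B"                       # illustrative model
TENSORS_PATH = "/persistent-storage/gpt-j-6B.tensors"  # illustrative path on persistent storage

# Build an empty model from its config only, skipping weight initialization.
config = AutoConfig.from_pretrained(MODEL_ID)
with no_init_or_tensor():
    model = AutoModelForCausalLM.from_config(config)

# Stream the serialized weights from storage straight into GPU memory.
start = time.time()
deserializer = TensorDeserializer(TENSORS_PATH, device="cuda")
deserializer.load_into_module(model)
deserializer.close()
print(f"Weights restored in {time.time() - start:.1f}s")
```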
Note that your model does not need to be a Transformers or even a Hugging Face model. If you have a Diffusers, scikit-learn, or custom PyTorch model, you can still use Tensorizer to load your model from storage into GPU memory in a single step. The only requirement for deserialization speedup is that you can initialize an empty model. The deserializer object will then restore the weights into the empty model.
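For instance, a custom PyTorch module can follow the same pattern; the SmallNet class and path below are purely hypothetical.

```python
from torch import nn
from tensorizer import TensorSerializer, TensorDeserializer

# Hypothetical custom model, used only to illustrate the pattern.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10))

    def forward(self, x):
        return self.layers(x)

TENSORS_PATH = "/persistent-storage/smallnet.tensors"  # illustrative path

# One-off: serialize the trained weights (train or load your real weights first).
trained = SmallNet()
serializer = TensorSerializer(TENSORS_PATH)
serializer.write_module(trained)
serializer.close()

# At startup: construct an empty instance and restore the weights into it.
empty = SmallNet()
deserializer = TensorDeserializer(TENSORS_PATH, device="cuda")
deserializer.load_into_module(empty)
deserializer.close()
```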