When deploying machine learning models, you have two main options for storing model weights:
Inside the Container: Packaging model weights directly in your container image
Pros:
Faster initial startup as weights are already in the container
No need to download or transfer weights from external storage
Cons:
Much larger container size, leading to longer deployment times
Less flexibility to update model weights without rebuilding container
Storage Volume: Storing weights in a persistent storage volume
Pros:
Smaller container sizes and faster deployments
Easy to update model weights without rebuilding container
Cons:
Initial cold start includes time to load weights from storage
Requires managing separate storage infrastructure
Storing model weights in a storage volume works best for most applications; for smaller models where minimal cold-start time matters, packaging the weights in the container may be more appropriate. The sketch below illustrates how the two options differ in code.
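To make the trade-off concrete, here is a minimal sketch of a loading function that prefers a mounted storage volume and falls back to weights baked into the image. The paths and the use of Hugging Face's AutoModelForCausalLM are illustrative assumptions, not part of any particular platform's API.

import os
from transformers import AutoModelForCausalLM

# Illustrative paths only: a volume mounted at /persistent-storage versus
# weights copied into the image at /app/model-weights during the build.
VOLUME_WEIGHTS = "/persistent-storage/models/my-model"
BAKED_IN_WEIGHTS = "/app/model-weights/my-model"

def load_weights():
    """Prefer the persistent volume if present, else use the image copy."""
    if os.path.isdir(VOLUME_WEIGHTS):
        # Volume option: small image and weights can be updated in place,
        # but the first load pays the cost of reading from storage.
        return AutoModelForCausalLM.from_pretrained(VOLUME_WEIGHTS)
    # In-image option: nothing to fetch at startup, but any weight change
    # means rebuilding and redeploying the container image.
    return AutoModelForCausalLM.from_pretrained(BAKED_IN_WEIGHTS)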
Increasing core counts can parallelize downloads, improving pull-through times
for large images. This benefit becomes particularly notable when handling
large files from the storage layer, as multiple cores process different parts
simultaneously, reducing overall download time.
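As a rough illustration of why extra cores help, the sketch below splits a large file into byte ranges and downloads them concurrently; the URL, chunk size, and worker count are placeholder assumptions rather than platform defaults.

import concurrent.futures
import requests

# Placeholder values for illustration only.
FILE_URL = "https://example.com/weights/model.safetensors"
CHUNK_SIZE = 256 * 1024 * 1024  # 256 MB per range request

def fetch_range(byte_range):
    """Download one byte range of the file via an HTTP Range request."""
    start, end = byte_range
    headers = {"Range": f"bytes={start}-{end}"}
    return requests.get(FILE_URL, headers=headers, timeout=600).content

def parallel_download(total_size, workers=8):
    """Split the file into ranges and fetch them on separate workers.

    With more cores available, more of these requests run truly in
    parallel, which is where the pull-through improvement for large
    files comes from.
    """
    ranges = [
        (offset, min(offset + CHUNK_SIZE, total_size) - 1)
        for offset in range(0, total_size, CHUNK_SIZE)
    ]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(fetch_range, ranges))
    return b"".join(parts)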
One of the biggest factors in model startup time is loading the model from storage into GPU memory. For larger models of 20B+ parameters, a standard Hugging Face load can take over 40 seconds, even with 2GB/s transfer speeds from persistent storage. While we've optimized the underlying hardware to load models as fast as possible, there are several ways to decrease model loading time and reduce cold-start times.
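If you want to measure your own baseline before optimizing, a simple timing of a standard Hugging Face load looks like the sketch below; the model ID is a placeholder.

import time
from transformers import AutoModelForCausalLM

MODEL_ID = "your-org/your-20b-model"  # placeholder model ID

start = time.time()
# A standard Hugging Face load: weights are read from storage, materialized
# in CPU memory, then copied to the GPU -- the multi-step path that dominates
# cold starts for 20B+ parameter models.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
print(f"Baseline load took {time.time() - start:.1f} seconds")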
Tensorizer is a library that loads models from storage into GPU memory in a single step. While initially built to fetch models from S3, it can also load models from Cerebrium's persistent storage, which offers nearly 2GB/s read speeds. For large models (20B+ parameters), we've observed a 30–50% decrease in loading time, with even greater improvements for larger models. For more information on the underlying methods, see their GitHub page. Below, we show how to use Tensorizer to load your model from storage straight into GPU memory in a single step.
To use Tensorizer, you first need to serialize your model and save it to your persistent storage.
import sys
import time

from tensorizer import TensorSerializer

def serialize_model(model, save_path):
    """Serialize the model and save the weights to the save_path."""
    try:
        serializer = TensorSerializer(save_path)
        start = time.time()
        # Write the module's weights to persistent storage in Tensorizer's format
        serializer.write_module(model)
        end = time.time()
        print(f"Serializing model took {end - start} seconds", file=sys.stderr)
        serializer.close()
        return True
    except Exception as e:
        print("Serialization failed with error:", e, file=sys.stderr)
        return False
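For example, you might run the serialization once as a one-off step. The model ID and storage path below are illustrative placeholders, assuming a Hugging Face model and a mount for persistent storage.

import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "your-org/your-model"                      # placeholder model ID
SAVE_PATH = "/persistent-storage/your-model.tensors"  # assumed storage path

# Load the model once (for example in a one-off job), then serialize it
# to persistent storage using the function defined above.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
serialize_model(model, SAVE_PATH)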
This converts your model to Tensorizer's serialized format, which is optimized for faster transfer speeds and fast loading into GPU memory. Then, the next time your deployment starts, you can load the serialized model from storage into GPU memory in a single step.
You would do this as follows:
import sys
import time

from tensorizer import TensorDeserializer
from tensorizer.utils import no_init_or_tensor
from transformers import AutoConfig, AutoModelForCausalLM

def deserialize_saved_model(model_path, model_id, plaid=True):
    """Deserialize the model from the model_path and load it into GPU memory."""
    # Create a config object that we can use to init an empty model
    config = AutoConfig.from_pretrained(model_id)

    # Initialize an empty model without loading weights into GPU memory
    print("Initializing empty model", file=sys.stderr)
    start = time.time()
    with no_init_or_tensor():
        # Build the model skeleton from the config only
        model = AutoModelForCausalLM.from_config(config)
    end_init = time.time() - start

    # Create the deserializer object
    # Note: plaid_mode enables faster deserialization but isn't safe for training
    deserializer = TensorDeserializer(model_path, plaid_mode=plaid)

    # Deserialize the weights directly into the empty model on the GPU (zero-copy)
    print("Loading model", file=sys.stderr)
    start = time.time()
    deserializer.load_into_module(model)
    end = time.time()
    deserializer.close()

    # Report timings
    print(f"Initializing empty model took {end_init} seconds", file=sys.stderr)
    print(f"\nDeserializing model took {end - start} seconds\n", file=sys.stderr)

    return model
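At startup you would then call the function above with the same path you serialized to; both values below are placeholders.

# Placeholder values; use the same path you serialized to earlier.
MODEL_PATH = "/persistent-storage/your-model.tensors"
MODEL_ID = "your-org/your-model"

model = deserialize_saved_model(MODEL_PATH, MODEL_ID)
model.eval()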
Note that your model does not need to be a Transformers or even a Hugging Face model. If you have a Diffusers, scikit-learn, or custom PyTorch model, you can still use Tensorizer to load your model from storage into GPU memory in a single step. The only requirement for deserialization speedup is that you can initialize an empty model. The deserializer object will then restore the weights into the empty model.
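As a sketch of that pattern for a plain PyTorch module, the example below serializes and then restores an illustrative custom model; the model class and storage path are purely hypothetical.

import torch
from tensorizer import TensorSerializer, TensorDeserializer

# An illustrative custom PyTorch model -- any torch.nn.Module works.
class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(512, 2048),
            torch.nn.ReLU(),
            torch.nn.Linear(2048, 512),
        )

    def forward(self, x):
        return self.layers(x)

SAVE_PATH = "/persistent-storage/tiny-net.tensors"  # assumed storage path

# One-off: serialize the trained model's weights to persistent storage.
trained = TinyNet()  # stands in for your trained model
serializer = TensorSerializer(SAVE_PATH)
serializer.write_module(trained)
serializer.close()

# At startup: build an empty instance, then stream the weights into it.
empty = TinyNet()
deserializer = TensorDeserializer(SAVE_PATH)
deserializer.load_into_module(empty)
deserializer.close()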