Understanding Concurrency
Concurrency in Cerebrium allows each instance to process multiple requests simultaneously. The replica_concurrency setting in the cerebrium.toml file determines how many requests each instance handles in parallel.
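As a minimal sketch, the setting might look like this (assuming it lives under the [cerebrium.scaling] section alongside the replica counts; the values shown are illustrative):

```toml
[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
# Each instance accepts up to 8 requests in parallel
replica_concurrency = 8
```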
Understanding Batching
Batching determines how concurrent requests are processed together within an instance. While concurrency controls the number of simultaneous requests, batching manages how those requests are grouped and executed. (The default concurrency is 1 request per container.) Cerebrium supports two approaches to request batching.
Framework-native Batching
Many frameworks include features for processing multiple requests efficiently. vLLM, for example, automatically batches model inference requests. Check out the complete vLLM batching example for more information.
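A minimal sketch of what this might look like (the model name and sampling settings are illustrative, not taken from the linked example):

```python
# Sketch: vLLM batches inference automatically via continuous batching.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(max_tokens=128, temperature=0.7)

# Passing a list of prompts lets vLLM schedule them as one batch;
# the engine also interleaves requests arriving from concurrent calls.
prompts = ["What is concurrency?", "What is batching?"]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```

No batching logic is needed in application code; the engine groups whatever requests are in flight.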
Custom Batching
Applications requiring precise control over request processing can implement custom batching through Cerebrium’s custom runtime feature. This approach allows for specific batching strategies and custom processing logic. As an example, an implementation with LitServe requires additional configuration in the cerebrium.toml file.
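As a sketch, the custom runtime entry might look like this (the section name, port, and entrypoint shown are assumptions, not taken from the linked example):

```toml
[cerebrium.runtime.custom]
# Command Cerebrium runs to start the LitServe server (illustrative)
entrypoint = ["python", "main.py"]
# Port the LitServe app listens on
port = 8000
healthcheck_endpoint = "/health"
```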
Check out the complete LitServe example for more information.