Understanding Concurrency
Each instance can process multiple requests simultaneously. The `replica_concurrency` setting in `cerebrium.toml` determines how many requests each instance handles in parallel:
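The configuration example that followed appears to have been lost in extraction. A minimal sketch, assuming the setting lives in a `[cerebrium.scaling]` section as in Cerebrium's scaling configuration (all values illustrative):

```toml
# cerebrium.toml -- illustrative values; section name assumed
[cerebrium.scaling]
min_replicas = 1
max_replicas = 5
replica_concurrency = 8   # each instance handles up to 8 requests in parallel
```

With `replica_concurrency = 8`, the autoscaler only adds a new instance once existing instances are each saturated with 8 in-flight requests.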
Understanding Batching
Batching determines how concurrent requests are grouped and executed within an instance. Concurrency controls the number of simultaneous requests; batching controls how those requests are processed together. The default concurrency is 1 request per container. Cerebrium supports two approaches to request batching.

Framework-native Batching
Many frameworks handle batched processing natively. vLLM, for example, automatically batches model inference requests:
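The code that followed appears to have been dropped in extraction. Since vLLM's continuous batching happens inside the engine, here is a simplified plain-Python sketch of the underlying idea, grouping requests that arrive close together into a single model call (all names here are illustrative, not a vLLM or Cerebrium API):

```python
import asyncio

class MicroBatcher:
    """Illustrative micro-batcher: requests that arrive within a short
    window are grouped into one batched model call, mimicking what
    engines like vLLM do internally."""

    def __init__(self, model_fn, max_batch=8, window_s=0.01):
        self.model_fn = model_fn      # takes a list of inputs, returns a list
        self.max_batch = max_batch
        self.window_s = window_s
        self._queue = []              # pending (input, future) pairs
        self._flush_task = None

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        self._queue.append((item, fut))
        if len(self._queue) >= self.max_batch:
            self._flush()             # batch is full: run it now
        elif self._flush_task is None:
            self._flush_task = asyncio.create_task(self._delayed_flush())
        return await fut

    async def _delayed_flush(self):
        await asyncio.sleep(self.window_s)
        self._flush()                 # window expired: run whatever we have

    def _flush(self):
        batch, self._queue = self._queue, []
        if self._flush_task is not None:
            self._flush_task.cancel()
            self._flush_task = None
        if not batch:
            return
        inputs = [x for x, _ in batch]
        outputs = self.model_fn(inputs)   # ONE call for the whole batch
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

async def main():
    batcher = MicroBatcher(model_fn=lambda xs: [x * 2 for x in xs])
    results = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    print(results)  # → [0, 2, 4, 6, 8], answered from a single batched call

asyncio.run(main())
```

The point of the sketch is that callers still see one request in, one result out; the grouping is invisible to them, which is exactly why framework-native batching composes cleanly with instance-level concurrency.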
Custom Batching

Implement custom batching through Cerebrium’s custom runtime feature when you need precise control over how requests are grouped and processed. Running LitServe as a custom runtime requires additional configuration in `cerebrium.toml`:
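The configuration block that followed appears to have been lost in extraction. A minimal sketch, assuming Cerebrium's custom-runtime settings live in a `[cerebrium.runtime.custom]` section with `entrypoint`, `port`, and `healthcheck_endpoint` keys (all values illustrative):

```toml
# cerebrium.toml -- illustrative custom-runtime sketch for a LitServe server
[cerebrium.runtime.custom]
port = 8000                           # port the LitServe server listens on
entrypoint = ["python", "main.py"]    # command that starts the server
healthcheck_endpoint = "/health"      # endpoint Cerebrium polls for readiness
```

With a custom runtime, batching behavior (batch size, wait window, grouping strategy) is defined entirely in your server code, for example via LitServe's server-side batching options, rather than by Cerebrium itself.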