Understanding Concurrency

Concurrency in Cerebrium allows each instance to process multiple requests simultaneously. The replica_concurrency setting in the cerebrium.toml file determines how many requests each instance handles in parallel:

[cerebrium.scaling]
replica_concurrency = 4    # Process up to 4 requests simultaneously

When requests arrive at an instance that hasn’t reached its concurrency limit, they begin processing immediately. Once an instance reaches its maximum concurrent requests, additional requests queue until capacity becomes available. This parallel processing capability helps applications maintain consistent performance during periods of high traffic.
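To see this behavior from the outside, you can send several requests to an app at once. The sketch below is a hypothetical client (the endpoint URL, auth token, and payload are placeholders, not values from this guide) that fires eight requests in parallel; with replica_concurrency = 4, an instance works on four at a time while the rest queue:

import concurrent.futures
import requests

# Placeholder endpoint and token; substitute your own app's values
ENDPOINT = "https://api.cortex.cerebrium.ai/v4/<project-id>/<app-name>/predict"
HEADERS = {"Authorization": "Bearer <your-api-key>"}

def call_endpoint(i):
    # Each in-flight call occupies one concurrency slot on an instance
    response = requests.post(ENDPOINT, json={"item": f"request {i}"}, headers=HEADERS)
    return response.status_code

# Eight requests at once: four begin processing immediately,
# the other four queue until a slot frees up
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(call_endpoint, range(8))))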

Modern GPUs excel at parallel processing, making concurrent request handling particularly effective for inference workloads. For instance, when an instance processes multiple image classification requests concurrently, it utilizes GPU resources far more efficiently than processing those requests sequentially.
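As a rough, framework-agnostic illustration (a standalone PyTorch sketch with a stock ResNet, not Cerebrium-specific code), compare running four images through a model one at a time with stacking them into a single batched forward pass; the batched call keeps the GPU busy with one large kernel launch instead of several small ones:

import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(weights=None).to(device).eval()

# Four incoming "requests", each a single 224x224 RGB image
images = [torch.randn(1, 3, 224, 224, device=device) for _ in range(4)]

with torch.no_grad():
    # Sequential: one underutilized forward pass per request
    sequential = [model(img) for img in images]

    # Batched: one forward pass over all four requests at once,
    # which makes much better use of the GPU's parallelism
    batch = torch.cat(images, dim=0)   # shape (4, 3, 224, 224)
    batched = model(batch)

print(batched.shape)  # torch.Size([4, 1000])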

Understanding Batching

Batching determines how concurrent requests are processed together within an instance. While concurrency controls how many requests an instance accepts at once (by default, one request per container), batching manages how those requests are grouped and executed.

Cerebrium supports two approaches to request batching.

Framework-Native Batching

Many frameworks include built-in features for processing multiple requests efficiently. vLLM, for example, batches inference requests automatically, so the Cerebrium side only needs concurrency enabled and the right dependencies installed:

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
cooldown = 10
replica_concurrency = 4 # Each container can now handle multiple requests

[cerebrium.dependencies.pip]
sentencepiece = "latest"
torch = "latest"
vllm = "latest"
transformers = "latest"
accelerate = "latest"
xformers = "latest"

When multiple requests arrive, vLLM automatically combines them into optimal batch sizes and processes them together, maximizing GPU utilization through its internal batching functionality.
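As a rough sketch of what the engine does inside the container (the model name here is only an example), handing vLLM several prompts in one call lets it schedule them into batches on the GPU; in a served app, vLLM performs the equivalent grouping across concurrent requests:

from vllm import LLM, SamplingParams

# Example model; substitute whichever model your app serves
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=64, temperature=0.8)

# Four "concurrent requests" handed to the engine together;
# vLLM schedules them into batches internally on a single GPU
prompts = [
    "Summarize the benefits of batching:",
    "Explain GPU parallelism in one sentence:",
    "What is request concurrency?",
    "Why do inference servers queue requests?",
]
outputs = llm.generate(prompts, params)

for output in outputs:
    print(output.outputs[0].text)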

Check out the complete vLLM batching example for more information.

Custom Batching

Applications requiring precise control over request processing can implement custom batching through Cerebrium’s custom runtime feature. This approach allows for specific batching strategies and custom processing logic.

For example, an implementation using LitServe requires additional configuration in the cerebrium.toml file:

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["python", "app/main.py"]
healthcheck_endpoint = "/health"

[cerebrium.dependencies.pip]
litserve = "latest"
fastapi = "latest"
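A minimal app/main.py to pair with this configuration might look like the sketch below; the stand-in model, batch size, and timeout are illustrative assumptions rather than a prescribed implementation. With max_batch_size greater than 1, LitServe collects up to that many concurrent requests and passes them to predict as one batch:

import litserve as ls

class BatchedAPI(ls.LitAPI):
    def setup(self, device):
        # Load your model once per worker; a trivial stand-in here
        self.model = lambda batch: [x * 2 for x in batch]

    def decode_request(self, request):
        # Extract the input from each incoming request
        return request["input"]

    def predict(self, batch):
        # With batching enabled, `batch` is a list of decoded inputs
        return self.model(batch)

    def encode_response(self, output):
        # Called once per item after LitServe unbatches the outputs
        return {"output": output}

if __name__ == "__main__":
    # Group up to 4 concurrent requests, waiting at most 50 ms to fill a batch
    server = ls.LitServer(BatchedAPI(), max_batch_size=4, batch_timeout=0.05)
    server.run(port=8000)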

Check out the complete LitServe example for more information.

Custom batching provides complete control over request grouping and processing, particularly valuable for frameworks without native batching support or applications with specific processing requirements. The Container Images Guide provides detailed implementation instructions.

Together, batching and concurrency create an efficient request processing system. Concurrency enables parallel request handling, while batching optimizes how these concurrent requests are processed, leading to better resource utilization and application performance.