
Graceful Termination

Cerebrium runs in a shared, multi-tenant environment. To scale efficiently, optimize compute usage, and roll out updates, the platform continuously adjusts its capacity, spinning down nodes and launching new ones as needed. During this process, workloads are migrated to new nodes. In addition, your application has its own metric-based autoscaling criteria that dictate when instances scale or remain active, and instances are also shifted during new app deployments. To prevent requests from ending prematurely when app instances are marked for termination, you need to implement graceful termination.

Understanding Instance Termination

For both application autoscaling and our own internal node scaling, we send your application a SIGTERM signal as a warning that we intend to shut down this instance. For Cortex applications (Cerebrium's default runtime), this is handled for you. On custom runtimes, you need to catch and handle this signal yourself if you want to shut down gracefully. Once at least response_grace_period has elapsed, we send your application a SIGKILL signal, terminating the instance immediately.

When Cerebrium needs to terminate a container, we do the following:
  1. Stop routing new requests to the container.
  2. Send a SIGTERM signal to your container.
  3. Wait for response_grace_period seconds to elapse.
  4. Send SIGKILL if the container hasn't stopped.
Below is a chart that illustrates this sequence:

If you do not handle SIGTERM in a custom runtime, Cerebrium terminates containers immediately after sending SIGTERM, which can interrupt in-flight requests and cause 502 errors.
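For a custom runtime that is not built on a web framework, the handling described above can be sketched with Python's standard signal module. This is a minimal illustration rather than Cerebrium's own implementation: the names shutdown_requested, handle_sigterm, install_sigterm_handler, and drain are hypothetical, and in_flight stands in for whatever counter your application keeps of requests still being processed.

```python
import signal
import threading
import time

# Set when SIGTERM arrives; request-handling code can check it to refuse new work.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request instead of letting the process exit at once."""
    shutdown_requested.set()


def install_sigterm_handler():
    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain(in_flight, grace_period_s, poll_interval_s=0.1):
    """Wait until in_flight() returns 0 or the time budget runs out.

    in_flight is a hypothetical callable returning the number of active requests.
    Returns True if all work finished in time, False otherwise.
    """
    deadline = time.monotonic() + grace_period_s
    while in_flight() > 0:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
    return True
```

A process would call install_sigterm_handler() at startup, then call drain() once shutdown_requested is set, with a budget somewhat shorter than response_grace_period so that it exits before SIGKILL is sent.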

Example: FastAPI Implementation

For custom runtimes using FastAPI, implement the lifespan pattern to respond to SIGTERM. The code below tracks active requests using a counter and prevents new requests during shutdown. When SIGTERM is received, it sets a shutdown flag and waits for all active requests to complete before the application terminates.
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.responses import JSONResponse

active_requests = 0      # number of requests currently being processed
shutting_down = False    # set once shutdown has started
lock = asyncio.Lock()    # guards updates to active_requests

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield  # Application startup complete

    # Shutdown: runs when Cerebrium sends SIGTERM
    global shutting_down
    shutting_down = True

    # Wait for active requests to complete
    while active_requests > 0:
        await asyncio.sleep(1)

app = FastAPI(lifespan=lifespan)

@app.middleware("http")
async def track_requests(request, call_next):
    global active_requests
    if shutting_down:
        # Return the response directly: an HTTPException raised inside
        # middleware bypasses FastAPI's exception handlers and becomes a 500.
        return JSONResponse(status_code=503, content={"detail": "Shutting down"})

    async with lock:
        active_requests += 1
    try:
        return await call_next(request)
    finally:
        async with lock:
            active_requests -= 1
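Your entrypoint must use exec or SIGTERM won't reach your application.
In your Dockerfile, use the exec (JSON array) form of ENTRYPOINT, which starts the process directly without a shell. Do not put the word exec inside the array: it is a shell builtin, not an executable.
ENTRYPOINT ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]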
Or in cerebrium.toml:
[cerebrium.runtime.custom]
entrypoint = ["fastapi", "run", "app.py", "--port", "8000"]

In bash scripts:
exec fastapi run app.py --port ${PORT:-8000}
Without exec, SIGTERM is sent to the bash script (PID 1) instead of FastAPI, so your shutdown code never runs and Cerebrium force-kills the container after the grace period.
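If your launcher is a Python script rather than a bash script, os.execvp plays the same role as exec: it replaces the launcher process with the server, so the server keeps the launcher's PID and receives SIGTERM directly. The sketch below assumes the same fastapi run app.py command and PORT variable as the bash example; build_server_command and exec_server are hypothetical helper names.

```python
import os
import sys


def build_server_command(port=None):
    """Return the server argv, falling back to $PORT and then 8000."""
    port = port or os.environ.get("PORT") or "8000"
    return ["fastapi", "run", "app.py", "--port", str(port)]


def exec_server(argv):
    """Replace the current process with argv; does not return on success."""
    # Buffered output is not flushed automatically across exec, so flush first.
    sys.stdout.flush()
    sys.stderr.flush()
    os.execvp(argv[0], argv)


# In a launcher script, the final statement would be:
#     exec_server(build_server_command())
```

Starting the server with subprocess.run instead would leave the Python launcher as the process that receives SIGTERM, which is the same problem as omitting exec in bash.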
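Test SIGTERM handling locally before deploying: start your app, send it SIGTERM (for example, with kill -TERM <pid> from another terminal, since Ctrl+C sends SIGINT rather than SIGTERM), and verify that you see graceful shutdown logs.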
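The local check can also be scripted. The following is a rough sketch using only the standard library, for POSIX systems; terminate_and_wait is a hypothetical helper, and the fixed startup delay is a crude stand-in for waiting until your app is actually ready.

```python
import signal
import subprocess
import time


def terminate_and_wait(cmd, startup_delay_s=2.0, timeout_s=30.0):
    """Start cmd, send it SIGTERM, and return (exit_code, seconds_until_exit)."""
    proc = subprocess.Popen(cmd)
    time.sleep(startup_delay_s)  # give the app time to boot and install handlers
    started = time.monotonic()
    proc.send_signal(signal.SIGTERM)
    try:
        exit_code = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Mimic the platform: force-kill once the grace period is exhausted.
        proc.kill()
        proc.wait()
        raise
    return exit_code, time.monotonic() - started
```

For example, terminate_and_wait(["fastapi", "run", "app.py", "--port", "8000"]) returns the exit code and how long shutdown took. On POSIX, a negative exit code such as -15 means the process was killed by the signal itself rather than exiting through its own shutdown path, which suggests the handler never ran.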