Architecture Overview
Our application consists of two main components:
- A frontend interface running on CPU instances using FastAPI and Gradio.
- A separate Llama model endpoint running on GPU instances. (While this is beyond the scope of this article, you can find a comprehensive example for deploying Llama 8B with TensorRT here.)

This separation lets us:
- Keep the frontend always available while minimizing costs (CPU-only).
- Scale our GPU-intensive model independently based on demand.
- Optimize resource allocation for different components.
Prerequisites
Before starting, you’ll need:
- A Cerebrium account (sign up here).
- The Cerebrium CLI installed:
```bash
pip install --upgrade cerebrium
```
- A Llama model endpoint (or other LLM API endpoint).
Basic Setup
First, create a new directory for your project and initialize it:
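A minimal sketch of that step, assuming the CLI's `cerebrium init` command scaffolds a new project directory and that the project name is a placeholder:

```bash
cerebrium init gradio-frontend
cd gradio-frontend
```

This generates a `cerebrium.toml` file; configure it so that it does the following (a full sketch of the file follows this list):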
- Disables the default JWT authentication that is automatically placed on all Cerebrium endpoints, making your Gradio interface publicly accessible.
- Sets the entrypoint for the ASGI server to run through Uvicorn.
- Sets the default port to 8080 for serving your app.
- Sets the health endpoint to `/health` for checking app availability through our FastAPI application.
- Configures hardware settings for the CPU instance running your app.
- Defines scaling configuration with minimum and maximum replicas, cooldown period, and replica concurrency (set to 10 requests per replica).
- Specifies required dependencies: Gradio, FastAPI, Requests, HTTPX, Uvicorn, and Starlette.
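Here's a rough sketch of what that configuration could look like. The section and key names (for example `disable_auth` and `[cerebrium.runtime.custom]`) are assumptions based on the list above; check the Cerebrium docs for the exact schema, and treat the hardware and scaling values as placeholders:

```toml
[cerebrium.deployment]
name = "gradio-frontend"          # placeholder project name
python_version = "3.11"
disable_auth = true               # disable default JWT auth so the UI is public

[cerebrium.runtime.custom]
port = 8080                       # port the ASGI server listens on
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
healthcheck_endpoint = "/health"  # FastAPI health route defined below

[cerebrium.hardware]
compute = "CPU"                   # CPU-only instance for the frontend
cpu = 2
memory = 8.0

[cerebrium.scaling]
min_replicas = 1                  # keep the frontend always available
max_replicas = 3
cooldown = 30                     # seconds of inactivity before scaling down
replica_concurrency = 10          # 10 concurrent requests per replica

[cerebrium.dependencies.pip]
gradio = "latest"
fastapi = "latest"
requests = "latest"
httpx = "latest"
uvicorn = "latest"
starlette = "latest"
```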
Next, create your application file (`main.py`). To start, let's create our FastAPI application (a sketch follows this list). It:
- Initializes a FastAPI application to forward requests to our Gradio app running as a subprocess on a different port.
- Sets up a health check endpoint at `/health`.
- Creates a catchall proxy that routes all requests to Gradio, including headers.
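A minimal sketch of that proxy application, assuming Gradio will run locally on port 7860; the port and the header-filtering details are illustrative, not the article's verbatim code:

```python
import httpx
from fastapi import FastAPI, Request, Response

GRADIO_PORT = 7860  # assumed port for the Gradio subprocess

app = FastAPI()


@app.get("/health")
def health():
    # Health endpoint referenced by the healthcheck setting in cerebrium.toml.
    # Defined before the catch-all route so it is matched first.
    return {"status": "healthy"}


@app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE", "HEAD", "OPTIONS"])
async def proxy(request: Request, path: str):
    # Catch-all proxy: forward every other request (headers included) to Gradio.
    # Note: streaming/websocket traffic would need additional handling.
    async with httpx.AsyncClient(base_url=f"http://127.0.0.1:{GRADIO_PORT}") as client:
        upstream = await client.request(
            request.method,
            f"/{path}",
            params=dict(request.query_params),
            headers={k: v for k, v in request.headers.items() if k.lower() != "host"},
            content=await request.body(),
        )
    # Drop hop-by-hop/encoding headers, since httpx returns a decoded body
    excluded = {"content-encoding", "content-length", "transfer-encoding", "connection"}
    headers = {k: v for k, v in upstream.headers.items() if k.lower() not in excluded}
    return Response(content=upstream.content, status_code=upstream.status_code, headers=headers)
```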
Next, in the same `main.py`, let's add the Gradio server code (a sketch follows this list). This adds:
- A class `GradioServer` that handles the communication with the Llama model endpoint.
- A `chat_with_llama` method that sends a message to the Llama model and returns the response.
- A `run_server` method that creates a Gradio chat interface.
- A `start` method that starts the Gradio server in a separate process.
- A `stop` method that stops the Gradio server.
- An `on_event` startup and shutdown event that starts and stops the Gradio server respectively.
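Continuing the sketch above (it reuses the `app` object), here's what that class might look like. The endpoint URL, environment variable names, payload shape, and model name are placeholders that assume an OpenAI-compatible chat completions API for your Llama deployment:

```python
import multiprocessing
import os

import gradio as gr
import requests

# Placeholder endpoint and key; point these at your deployed Llama model
LLAMA_ENDPOINT = os.environ.get("LLAMA_ENDPOINT", "https://your-llama-endpoint/v1/chat/completions")
LLAMA_API_KEY = os.environ.get("LLAMA_API_KEY", "")


class GradioServer:
    def __init__(self):
        self.process: multiprocessing.Process | None = None

    def chat_with_llama(self, message, history):
        # Send the user's message to the Llama endpoint and return its reply
        response = requests.post(
            LLAMA_ENDPOINT,
            headers={"Authorization": f"Bearer {LLAMA_API_KEY}"},
            json={
                "model": "llama-3.1-8b",  # placeholder model name
                "messages": [{"role": "user", "content": message}],
            },
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

    def run_server(self):
        # Build a Gradio chat interface backed by chat_with_llama
        demo = gr.ChatInterface(fn=self.chat_with_llama)
        demo.launch(server_name="0.0.0.0", server_port=GRADIO_PORT)

    def start(self):
        # Run Gradio in a separate process so FastAPI stays responsive
        self.process = multiprocessing.Process(target=self.run_server, daemon=True)
        self.process.start()

    def stop(self):
        if self.process is not None:
            self.process.terminate()
            self.process = None


gradio_server = GradioServer()


@app.on_event("startup")
def on_startup():
    gradio_server.start()


@app.on_event("shutdown")
def on_shutdown():
    gradio_server.stop()
```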
Your completed `main.py` file should look like the two sketches above combined: the FastAPI proxy application, followed by the `GradioServer` class and its startup and shutdown hooks.