In this tutorial, we’ll create an OpenAI-compatible endpoint that works with any open-source model. This lets you use your existing OpenAI code with Cerebrium serverless functions by changing just two lines of code.
To see the final code implementation, you can view it here
Cerebrium setup
If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get set up.
In your IDE, run the following command to create our Cerebrium starter project: cerebrium init 1-openai-compatible-endpoint
. This creates two files:
main.py
: Our entrypoint file where our code lives
cerebrium.toml
: A configuration file that contains all our build and environment settings
Add the following pip packages and hardware requirements to your cerebrium.toml
to create your deployment environment:
[cerebrium.hardware]
cpu = 2
memory = 12.0
compute = "AMPERE_A10"
[cerebrium.dependencies.pip]
vllm = "latest"
pydantic = "latest"
Let’s define our imports and initialize our model. We’ll use Meta’s Llama 3.1 model, which requires Hugging Face authorization. Add your HF token to your secrets in the Cerebrium dashboard, then add this code to main.py
:
from vllm import SamplingParams, AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs
from pydantic import BaseModel
from typing import Any
import time
import json
import os
from huggingface_hub import login
# Your huggingface token (HF_AUTH_TOKEN) should be stored in your project secrets on your dashboard
login(token=os.environ.get("HF_AUTH_TOKEN"))
engine_args = AsyncEngineArgs(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
gpu_memory_utilization=0.9, # Increase GPU memory utilization
max_model_len=8192 # Decrease max model length
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
Next, define the required output format for OpenAI endpoints using Pydantic:
class Message(BaseModel):
role: str
content: str
class ChatCompletionResponse(BaseModel):
id: str
object: str
created: int
model: str
choices: List[Any]
async def run(messages: list, model: str, run_id: str, stream: bool = True, temperature: float = 0.8, top_p: float = 0.95):
prompt = " ".join([f"{Message(**msg).role}: {Message(**msg).content}" for msg in messages])
sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
results_generator = engine.generate(prompt, sampling_params, run_id)
previous_text = ""
full_text = "" # Collect all generated text here
async for output in results_generator:
prompt = output.outputs
new_text = prompt[0].text[len(previous_text):]
previous_text = prompt[0].text
full_text += new_text # Append new text to full_text
response = ChatCompletionResponse(
id=run_id,
object="chat.completion",
created=int(time.time()),
model=model,
choices=[{
"text": new_text,
"index": 0,
"logprobs": None,
"finish_reason": prompt[0].finish_reason or "stop"
}]
)
print(response.model_dump())
yield f"data: {json.dumps(response.model_dump())}\n\n"
# Send the final [DONE] message
yield "data: [DONE]\n\n"
The function:
- Takes parameters through its signature, with optional and default values available
- Automatically receives a unique
run_id
for each request
- Processes the entire prompt through the model
- Streams results when
stream=True
using async functionality
- Returns the complete result at the end if streaming is disabled
Deploy & Inference
To deploy the model use the following command:
After deployment, you’ll see a curl command like this:
curl --location 'https://api.cortex.cerebrium.ai/v4/p-<YOUR PROJECT ID>/5-openai-compatible-endpoint/{function}' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <YOUR TOKEN HERE>' \
--data '{"..."}'
In Cerebrium, each function name becomes an endpoint (ending with /run
). While OpenAI-compatible endpoints typically end with /chat/completions
, we’ve made all endpoints OpenAI-compatible. Here’s how to call the endpoint:
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.cortex.cerebrium.ai/v4/p-xxxxxxx/5-openai-compatible-endpoint/run",
api_key="<CEREBRIUM_JWT_TOKEN>",
)
chat_completion = client.chat.completions.create(
messages=[
{"role": "user", "content": "What is a mistral?"},
{"role": "assistant", "content": "A mistral is a type of cold, dry wind that blows across the southern slopes of the Alps from the Valais region of Switzerland into the Ligurian Sea near Genoa. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
{"role": "user", "content": "How does the mistral wind form?"}
],
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
stream=True
)
for chunk in chat_completion:
print(chunk)
print("Finished receiving chunks.")
Set the base URL to the one from your deploy command (ending in /run
). Use your JWT token from either the curl command or your Cerebrium dashboard’s API Keys section.
Voilà! You now have an OpenAI-compatible endpoint that you can customize to your needs!