All Cerebrium endpoints are OpenAI-compatible, supporting both /chat/completions and /embeddings. Below is a basic implementation of a streaming OpenAI-compatible endpoint; for a full example using vLLM, see the OpenAI-compatible endpoint example. A streaming-compatible Cerebrium function must:
  • Specify all the parameters that OpenAI sends in the function signature (a sketch of such a signature follows this list).
  • Use yield data, where yield signals streaming and data is the JSON-serializable object returned to the caller.
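As a rough sketch of the first requirement, the signature below lists parameters an OpenAI client commonly sends. Only messages and model appear in the snippet further down; the remaining names and defaults are illustrative assumptions, not a required interface:

async def run(
    messages: list,
    model: str,
    temperature: float = 1.0,       # sampling temperature (assumed default)
    top_p: float = 1.0,             # nucleus-sampling cutoff (assumed default)
    max_tokens: int | None = None,  # optional cap on response length
    stream: bool = True,            # this endpoint always streams
    **kwargs,                       # absorb any other fields OpenAI sends
):
    ...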
Here’s a small snippet from the example listed above:

async def run(messages: list, model: str, ...):
  # ... existing setup code (vLLM engine, request, previous_text, etc.) ...

  async for output in results_generator:
    prompt = output.outputs
    # Emit only the text generated since the previous iteration.
    new_text = prompt[0].text[len(previous_text):]
    previous_text = prompt[0].text
    full_text += new_text

    response = ChatCompletionResponse(
        id=run_id,
        object="chat.completion",
        created=int(time.time()),
        model=model,
        choices=[{
            "text": new_text,
            "index": 0,
            "logprobs": None,
            "finish_reason": prompt[0].finish_reason or "stop"
        }]
    )
    # Each yielded string is streamed back to the caller as one chunk.
    yield json.dumps(response.model_dump())
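The snippet assumes ChatCompletionResponse is a Pydantic model (the model_dump() call implies Pydantic v2). A minimal sketch of such a model, with fields taken from the constructor call above, could look like this; the class in the full example may define richer types:

from pydantic import BaseModel

class ChatCompletionResponse(BaseModel):
    # Illustrative sketch only; fields mirror the constructor call above.
    id: str
    object: str
    created: int
    model: str
    choices: list[dict]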
Once deployed, point the OpenAI client's base URL at the target function and use your Cerebrium JWT (available in the dashboard) as the API key. Client code:
import os
from openai import OpenAI

client = OpenAI(
    # Point the client at the deployed Cerebrium function; the path ends
    # with the name of the function you are calling (here, run).
    base_url="https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxxx/1-openai-compatible-endpoint/run",
    api_key="<CEREBRIUM_JWT_TOKEN>",
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "What is a mistral?"},
        {"role": "assistant", "content": "A mistral is a strong, cold, dry wind that blows from the north through the Rhône valley of southern France toward the Mediterranean. It is known for its strong and steady gusts, sometimes reaching up to 60 miles per hour."},
        {"role": "user", "content": "How does the mistral wind form?"}
    ],
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    stream=True
)
print("Starting to receive chunks...")
for chunk in chat_completion:
    print(chunk)
print("Finished receiving chunks.")
The output then looks like this:
Starting to receive chunks...
ChatCompletionChunk(id='412f0e25-61c4-93b8-a00f-09a5076cd9fa', choices=[Choice(delta=None, finish_reason='stop', index=0, logprobs=None, text=' The')], created=1724166657, model='gpt-3.5-turbo', object='chat.completion', service_tier=None, system_fingerprint=None, usage=None)
ChatCompletionChunk(id='412f0e25-61c4-93b8-a00f-09a5076cd9fa', choices=[Choice(delta=None, finish_reason='stop', index=0, logprobs=None, text=' formation')], created=1724166657, model='gpt-3.5-turbo', object='chat.completion', service_tier=None, system_fingerprint=None, usage=None)
ChatCompletionChunk(id='412f0e25-61c4-93b8-a00f-09a5076cd9fa', choices=[Choice(delta=None, finish_reason='stop', index=0, logprobs=None, text=' of')], created=1724166657, model='gpt-3.5-turbo', object='chat.completion', service_tier=None, system_fingerprint=None, usage=None)
ChatCompletionChunk(id='412f0e25-61c4-93b8-a00f-09a5076cd9fa', choices=[Choice(delta=None, finish_reason='stop', index=0, logprobs=None, text=' the')], created=1724166657, model='gpt-3.5-turbo', object='chat.completion', service_tier=None, system_fingerprint=None, usage=None)
ChatCompletionChunk(id='412f0e25-61c4-93b8-a00f-09a5076cd9fa', choices=[Choice(delta=None, finish_reason='stop', index=0, logprobs=None, text=' mist')], created=1724166657, model='gpt-3.5-turbo', object='chat.completion', service_tier=None, system_fingerprint=None, usage=None)
...
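Because this endpoint puts each delta in a non-standard text field on the choice (rather than the usual delta.content), reassembling the full reply differs slightly from stock OpenAI streaming. A minimal sketch, assuming the SDK preserves the extra text field shown in the chunk reprs above:

full_reply = ""
for chunk in chat_completion:
    # getattr guards against chunks where the extra field is absent.
    full_reply += getattr(chunk.choices[0], "text", "") or ""
print(full_reply)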