Streaming LLM Output
Stream outputs live from Falcon 7B using SSE
This example requires Cerebrium CLI v1.20 or later. If you are using an older version of the CLI, run pip install --upgrade cerebrium to upgrade to the latest version.
In this tutorial, we’ll show you how to implement streaming with Server-Sent Events (SSE) to return results to your users as quickly as possible.
To see the final implementation, you can view it here
Basic Setup
Developing models with Cerebrium is similar to developing on a virtual machine or Google Colab, making conversion straightforward. Make sure you have the Cerebrium package installed and are logged in. If not, check our docs here.
First, create your project:
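For example, assuming a project name of falcon-streaming (the name here is just an illustration):

```bash
# Initialise a new Cerebrium project; the project name is an example
cerebrium init falcon-streaming
```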
Add the following packages to the [cerebrium.dependencies.pip] section of your cerebrium.toml file:
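The implementation below assumes roughly the following packages; the version specifiers are illustrative, so pin versions as needed:

```toml
[cerebrium.dependencies.pip]
torch = "latest"
transformers = "latest"
accelerate = "latest"
pydantic = "latest"
```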
Create a main.py file for our Python code. This simple implementation can be done in a single file. First, let’s define our request object:
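A minimal sketch of such a request object; the optional parameter names and defaults (max_new_tokens, temperature) are illustrative choices rather than fixed by the framework:

```python
from typing import Optional

from pydantic import BaseModel


class Item(BaseModel):
    # prompt is required -- omitting it triggers Pydantic's automatic validation error
    prompt: str
    # Optional generation parameters with example default values
    max_new_tokens: Optional[int] = 256
    temperature: Optional[float] = 0.7
```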
We use Pydantic for data validation. The prompt parameter is required, while the others are optional with default values. If prompt is missing from the request, users receive an automatic error message.
Falcon Implementation
Model Setup
We import the required packages and instantiate the tokenizer and model outside the predict function. This ensures the model weights load only once at startup, not with every request.
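A sketch of that setup, assuming the tiiuae/falcon-7b-instruct checkpoint from Hugging Face (any Falcon variant works the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loaded at module level so the weights are initialised once at startup,
# not on every request.
model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # half precision keeps the 7B model within GPU memory
    device_map="auto",
)
```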
Streaming Implementation
Below, we define our stream function to handle streaming results from our endpoint:
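A sketch of such a function, assuming the Item request model and the model/tokenizer defined above (the exact signature Cerebrium expects may differ slightly between versions). Generation runs in a background thread while the streamer yields decoded text as it is produced:

```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream(item: Item):
    inputs = tokenizer(item.prompt, return_tensors="pt").to(model.device)

    # skip_prompt avoids echoing the input prompt back to the caller
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=item.max_new_tokens,
        temperature=item.temperature,
        do_sample=True,
    )

    # Run generation in a background thread so we can yield tokens as they arrive
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    for new_text in streamer:
        # Each chunk is returned as soon as it is generated and forwarded
        # to the client as an SSE event
        yield new_text
```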
The function receives inputs from our request object and uses TextIteratorStreamer to stream model output. The yield keyword returns output as it’s generated.
Deploy
Configure your compute and environment settings in cerebrium.toml:
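For example (the section and key names assume the standard cerebrium.toml layout; the GPU, CPU, and memory values are illustrative and should match your plan and the model's requirements):

```toml
[cerebrium.deployment]
name = "falcon-streaming"     # example app name
python_version = "3.10"

[cerebrium.hardware]
gpu = "AMPERE_A10"            # example GPU identifier; check the available options
cpu = 2
memory = 16.0
```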
Deploy the model using this command:
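From the project directory:

```bash
cerebrium deploy
```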
After deployment, make this request:
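A sketch using curl, with placeholders for the endpoint URL and API key that deployment prints out; the -N flag disables buffering so the SSE chunks print as they arrive:

```bash
curl -N -X POST "https://<your-endpoint-url>/stream" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me a short story about a dragon"}'
```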
The endpoint path should include stream since that’s our function name.
The model returns its output as Server-Sent Events (SSE). Here’s an example from Postman: