Streaming allows users to stream live output from their models using server-sent event (SSE) streams. This works for any Python object that implements the iterator or generator protocol. Your generator or iterator must yield string data, since it is sent downstream with the text/event-stream Content-Type. You also need to inform the API that your application accepts text/event-stream content by setting the Accept header. If you do not set this header, streaming will not work; instead, the response is delivered as if the content were application/json, with the data concatenated together.

Let us implement a simple example:

import time

def run(upper_range: int):
    # Yield one string per second; each yield is sent downstream as an SSE event
    for i in range(upper_range):
        yield f"Number {i} "
        time.sleep(1)
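Before deploying, you can sanity-check the generator locally by collecting its output (a quick sketch; `run` is the function defined above):

```python
import time

def run(upper_range: int):
    for i in range(upper_range):
        yield f"Number {i} "
        time.sleep(1)

# Consume the generator locally: each item is one chunk the
# platform would send as an SSE event.
chunks = list(run(3))
print(chunks)  # ['Number 0 ', 'Number 1 ', 'Number 2 ']
```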

Once you deploy this code snippet and hit the stream endpoint, you will see the SSE events appear progressively, one per second.

You can do this as follows:

curl -X POST https://api.cortex.cerebrium.ai/v4/<YOUR-PROJECT-ID>/stream-example/run \
       -H 'Content-Type: application/json' \
       -H 'Accept: text/event-stream' \
       -H 'Authorization: Bearer <YOUR-JWT-TOKEN>' \
       --data '{"upper_range": 3}'

This should output:

HTTP/1.1 200 OK
cache-control: no-cache
content-encoding: gzip
content-type: text/event-stream; charset=utf-8
date: Tue, 28 May 2024 21:12:46 GMT
server: envoy
transfer-encoding: chunked
vary: Accept-Encoding
x-envoy-upstream-service-time: 198995
x-request-id: e6b55132-32af-96d7-a064-8915c4a42452

data: Number 0
...

The remaining events will then stream in, one per second:

...
data: Number 1

data: Number 2

event: end
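Each SSE frame is a newline-delimited `field: value` line, with a blank line terminating an event. If you are consuming the stream yourself, a simplified parser sketch might look like this (the `parse_sse_line` helper is illustrative, not part of any API):

```python
def parse_sse_line(line: str):
    """Split one SSE line into (field, value); return None for blank lines."""
    if not line.strip():
        return None  # a blank line terminates the current event
    field, _, value = line.partition(":")
    # The SSE spec strips a single leading space from the value;
    # lstrip() is a simplification that handles the common case.
    return field.strip(), value.lstrip()

# Parse the sample lines from the stream above
for raw in ["data: Number 0", "data: Number 1", "event: end"]:
    print(parse_sse_line(raw))
```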

Recent versions of Postman can also render SSE streams, which is a convenient way to inspect the output.


If you want to see an example of implementing this with Falcon-7b, please check out the example here.