Streaming allows users to stream live output from their models using server-sent event (SSE) streams. This works for any Python object that implements the iterator or generator protocol. Your generator or iterator must yield string data, since it is sent downstream with the text/event-stream Content-Type. You also need to inform the API that your application accepts text/event-stream content by setting the Accept header. If you do not set this header, streaming will not work; instead, the response is delivered as if the content were application/json, with the data concatenated together.

Let us implement a simple example:

import time

def run(upper_range: int):
    # Yield one string per second; each yield is sent downstream as an SSE event
    for i in range(upper_range):
        yield f"Number {i} "
        time.sleep(1)
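Before deploying, you can sanity-check the generator locally by collecting its output (a quick sketch; `run` is the function defined above):

```python
import time

def run(upper_range: int):
    for i in range(upper_range):
        yield f"Number {i} "
        time.sleep(1)

# Consume the generator locally: each item is one chunk the
# platform would send as an SSE event.
chunks = list(run(3))
print(chunks)  # ['Number 0 ', 'Number 1 ', 'Number 2 ']
```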

Once you deploy this code snippet and hit the stream endpoint, you will see the SSE events appear progressively, one per second.

You can do this as follows:

curl -X POST https://api.cortex.cerebrium.ai/v4/<YOUR-PROJECT-ID>/stream-example/run \
       -H 'Content-Type: application/json' \
       -H 'Accept: text/event-stream' \
       -H 'Authorization: Bearer <YOUR-JWT-TOKEN>' \
       --data '{"upper_range": 3}'

This should output:

HTTP/1.1 200 OK
cache-control: no-cache
content-encoding: gzip
content-type: text/event-stream; charset=utf-8
date: Tue, 28 May 2024 21:12:46 GMT
server: envoy
transfer-encoding: chunked
vary: Accept-Encoding
x-envoy-upstream-service-time: 198995
x-request-id: e6b55132-32af-96d7-a064-8915c4a42452

data: Number 0
...

The remaining events will then stream in, one per second:

...
data: Number 1

data: Number 2

event: end
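Each SSE frame is a newline-delimited `field: value` line, with a blank line terminating an event. If you are consuming the stream yourself, a simplified parser sketch might look like this (the `parse_sse_line` helper is illustrative, not part of any API):

```python
def parse_sse_line(line: str):
    """Split one SSE line into (field, value); return None for blank lines."""
    if not line.strip():
        return None  # a blank line terminates the current event
    field, _, value = line.partition(":")
    # The SSE spec strips a single leading space from the value;
    # lstrip() is a simplification that handles the common case.
    return field.strip(), value.lstrip()

# Parse the sample lines from the stream above
for raw in ["data: Number 0", "data: Number 1", "event: end"]:
    print(parse_sse_line(raw))
```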

Recent versions of Postman can also render SSE streams, which is a convenient way to inspect the output.


If you want to see an example of implementing this with Falcon-7b, please check out the example here.