In this tutorial, we will transcribe an hour-long audio file using Distil Whisper, an optimised version of Whisper-large-v2 that is 60% faster and within 1% of its error rate. We will accept either a base64-encoded string of the audio file or a URL from which we can download it.

To see the final implementation, you can view it here

Basic Setup

Developing models with Cerebrium should feel identical to developing on a virtual machine or Google Colab, so converting existing code should be very easy! Please make sure you have the Cerebrium package installed and have logged in. If not, please take a look at our docs here

First we create our project:

cerebrium init distil-whisper

Let us add the following packages to the [cerebrium.dependencies.pip] section of our cerebrium.toml file:

accelerate = "latest"
transformers = ">=4.35.0"
openai-whisper = "latest"
pydantic = "latest"

To start, let us create a file, util.py, for our utility functions: downloading a file from a URL and converting a base64 string to a file. It would look something like this:

import base64
import os
import uuid

import requests

DOWNLOAD_ROOT = "/tmp/"  # Change this to /persistent-storage/ if you want to save files to the persistent storage

# Downloads a file from a public URL and saves it under the given filename
def download_file_from_url(url: str, filename: str):
    print("Downloading file...")

    response = requests.get(url)
    if response.status_code == 200:
        print("Download was successful")

        # Write the raw response bytes to disk
        with open(filename, "wb") as f:
            f.write(response.content)

        return filename
    else:
        raise Exception("Download failed")

# Saves a base64 encoded file string to a local file
def save_base64_string_to_file(audio: str):
    print("Converting file...")

    decoded_data = base64.b64decode(audio)

    # Use a random name so concurrent requests do not overwrite each other
    filename = os.path.join(DOWNLOAD_ROOT, str(uuid.uuid4()))

    with open(filename, "wb") as file:
        file.write(decoded_data)

    print("Decoding base64 to file was successful")
    return filename
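
For callers who want to send the audio inline rather than by URL, the file has to be base64-encoded first. The sketch below shows one way a client might do that with the standard library. The helper name `encode_audio_file` and the placeholder bytes are hypothetical and introduced only for illustration; the round trip at the end simply checks that decoding recovers the original bytes.

```python
import base64
import os
import tempfile


def encode_audio_file(path: str) -> str:
    """Read a file and return its contents as a base64 string (hypothetical helper)."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Write a few placeholder bytes to a temporary file to stand in for real audio.
sample_bytes = b"\x00\x01fake-audio\xff"
with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as tmp:
    tmp.write(sample_bytes)
    sample_path = tmp.name

encoded = encode_audio_file(sample_path)

# Decoding the string gives back the original bytes, which is the step
# save_base64_string_to_file performs on the server side.
round_trip_ok = base64.b64decode(encoded) == sample_bytes
os.remove(sample_path)
```

The resulting string is what would go in the "audio" field of the request body.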

Now that our utility functions are complete, go to the file that will contain our main Python code. We would like the user to send us either a base64-encoded string of the file or a public URL from which we can download the file. We then pass this file to our model and return the output to the user. So let us define our request object.

from typing import Optional
from pydantic import BaseModel, HttpUrl

class Item(BaseModel):
    # Default each field to None so it can be omitted from the request
    audio: Optional[str] = None
    file_url: Optional[HttpUrl] = None
    webhook_endpoint: Optional[HttpUrl] = None

Above, we use Pydantic as our data validation library. Because of the way we have defined the model, “audio” and “file_url” are optional parameters, so we must check that we are given one or the other. The webhook_endpoint parameter is something Cerebrium automatically includes in every request and can be used for long-running requests. Currently, Cerebrium has a maximum timeout of 3 minutes for each inference request. For long audio files (2 hours), which take a couple of minutes to process, it is best to use a webhook_endpoint: a URL to which we will make a POST request with the results of your function.

Setup Model and Inference

Below, we import the required packages and load our Whisper model. The model is downloaded during your deployment; on subsequent deploys and inference requests it is cached in your persistent storage and reused. You can read more about persistent storage here. We do this outside our predict function since we only want this code to run on a cold start (i.e. on startup). If the container is already warm, we just want it to do inference, so it executes only the predict function.

from huggingface_hub import hf_hub_download
from whisper import load_model, transcribe
from util import download_file_from_url, save_base64_string_to_file

# Runs once on cold start: fetch the checkpoint and load it into memory
distil_large_v2 = hf_hub_download(repo_id="distil-whisper/distil-large-v2", filename="original-model.bin")
model = load_model(distil_large_v2)

def predict(run_id, audio=None, file_url=None, webhook_endpoint=None):
    item = Item(audio=audio, file_url=file_url, webhook_endpoint=webhook_endpoint)
    input_filename = f"{run_id}.mp3"

    if item.audio is None and item.file_url is None:
        raise ValueError("Either audio or file_url must be provided")

    if item.audio is not None:
        file = save_base64_string_to_file(item.audio)
    elif item.file_url is not None:
        # HttpUrl may not be a plain string, so convert it before downloading
        file = download_file_from_url(str(item.file_url), input_filename)

    print("Transcribing file...")

    result = transcribe(model, audio=file)
    return result

In our predict function, which only runs on inference requests, we simply create an audio file from the download URL or base64 string given to us in the request. We then transcribe the file and return the output to the user.
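
To make the difference between cold-start code and per-request code concrete, here is a toy sketch that uses no real model. `load_fake_model` and `fake_predict` are hypothetical stand-ins for `load_model` and `predict`; the counter only exists to show that module-level code runs once however many requests follow.

```python
LOAD_CALLS = 0  # counts how many times the (pretend) model is loaded


def load_fake_model():
    """Stand-in for an expensive model load (hypothetical, for illustration)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return lambda path: f"transcript of {path}"


# Module level: executed once, when the module is first imported (cold start).
fake_model = load_fake_model()


def fake_predict(path: str) -> str:
    # Function level: executed on every request, reusing the loaded model.
    return fake_model(path)


outputs = [fake_predict("a.mp3"), fake_predict("b.mp3")]
```

After two calls, `LOAD_CALLS` is still 1, which is the behaviour we want from a warm container.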


Your cerebrium.toml file is where you set your compute and environment. It should look like this:

[cerebrium.build]
predict_data = "{\"prompt\": \"Here is some example predict data for your config.yaml which will be used to test your predict function on build.\"}"
hide_public_endpoint = false
disable_animation = false
disable_build_logs = false
disable_syntax_check = false
disable_predict = true
log_level = "INFO"
disable_confirmation = false

[cerebrium.deployment]
name = "distil-whisper"
python_version = "3.11"
include = "[./*,, cerebrium.toml]"
exclude = "[./example_exclude]"
cuda_version = "12"

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
gpu = "AMPERE_A10"
cpu = 3
memory = 12.0
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 60

[cerebrium.dependencies.pip]
accelerate = "latest"
transformers = ">=4.35.0"
openai-whisper = "latest"
pydantic = "latest"

[cerebrium.dependencies.apt]
"ffmpeg" = "latest"

To deploy the model, use the following command:

cerebrium deploy distil-whisper

Once deployed, we can make the following request:

curl --location '<YOUR PROJECT ID>/distil-whisper/predict' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <YOUR TOKEN HERE>' \
--data '{"file_url": ""}'

You will notice that you get an immediate response with a 202 status code and a run_id. This run_id is a unique identifier for you to be able to correlate the result to the initial workload.
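
The same request can also be prepared from Python. The following is a minimal sketch using only the standard library; `ENDPOINT_URL`, the audio URL, and the helper `build_predict_request` are hypothetical placeholders, so substitute the endpoint and token for your own project. The request is built but not sent, so the snippet runs without network access.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with your project's endpoint and token.
ENDPOINT_URL = "https://example.invalid/predict"
API_TOKEN = "<YOUR TOKEN HERE>"


def build_predict_request(file_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a file_url payload."""
    body = json.dumps({"file_url": file_url}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )


request = build_predict_request("https://example.invalid/audio.mp3")
# To actually send it: urllib.request.urlopen(request)
```

To send base64 audio instead, the payload would carry an "audio" key in place of "file_url".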

Our endpoint will then receive the following result:

{
  "run_id": "2R5PnHprwNqiS5tcFMor-4c6rSrxuzrVtBU1JfjT5iWFG6s4pHo1Ug==",
  "message": "Finished inference request with run_id: `2R5PnHprwNqiS5tcFMor-4c6rSrxuzrVtBU1JfjT5iWFG6s4pHo1Ug==`",
  "result": {
    "text": " Testing, one, two, three, testing.",
    "segments": [
      {
        "id": 0,
        "seek": 0,
        "start": 0,
        "end": 4,
        "text": " Testing, one, two, three, testing.",
        "tokens": [
          50364, 45517, 11, 472, 11, 220, 20534, 11, 220, 27583, 11, 220, 83,
          8714, 13, 50564
        ],
        "temperature": 0,
        "avg_logprob": -0.3824356023003073,
        "compression_ratio": 1,
        "no_speech_prob": 0.019467202946543694
      }
    ],
    "language": "en"
  },
  "status_code": 200,
  "run_time_ms": 2053.8525581359863
}