In this tutorial, we’ll transcribe an hour-long audio file using Distil-Whisper - a distilled version of Whisper large-v3 that runs roughly 6x faster while maintaining accuracy within 1% WER of the original. We’ll accept either a base64-encoded string of the audio file or a URL to download the audio file.

To see the final implementation, you can view it here.

Basic Setup

Developing models with Cerebrium is similar to developing on a virtual machine or Google Colab, making conversion straightforward. Make sure you have the Cerebrium package installed and are logged in. If not, check our docs here.
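If you still need to do this, installing the CLI and authenticating typically looks like the following (see the docs linked above for details):

pip install cerebrium
cerebrium login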

First, create your project:

cerebrium init 1-whisper-transcription

Add the following packages to the [cerebrium.dependencies.pip] section of your cerebrium.toml file:

[cerebrium.dependencies.pip]
accelerate = "latest"
transformers = ">=4.35.0"
openai-whisper = "latest"
pydantic = "latest"
requests = "latest"

Let’s create a util.py file for our utility functions - downloading a file from a URL or converting a base64 string to a file:

import base64
import uuid

import requests

DOWNLOAD_ROOT = "/tmp"

def download_file_from_url(url: str, filename: str):
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, "wb") as f:
            f.write(response.content)
        return filename
    else:
        raise Exception(f"Download failed with status code {response.status_code}")


def save_base64_string_to_file(audio: str):
    decoded_data = base64.b64decode(audio)
    filename = f"{DOWNLOAD_ROOT}/{uuid.uuid4()}"
    with open(filename, "wb") as file:
        file.write(decoded_data)
    return filename
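As a quick local sanity check, you can exercise both helpers end-to-end; the URL and filenames below are placeholders:

path = download_file_from_url("https://example.com/test.mp3", "/tmp/test.mp3")

with open(path, "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# Round-trip: decode the base64 string back into a temporary file
decoded_path = save_base64_string_to_file(encoded)
print(path, decoded_path)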

Now that our utility functions are complete, let’s update main.py with our main Python code. Users can send either a base64-encoded string or a public URL of the audio file. We’ll pass this file to our model and return the output. First, let’s define our request object:

from typing import Optional
from pydantic import BaseModel, HttpUrl

class Item(BaseModel):
    audio: Optional[str] = None
    file_url: Optional[HttpUrl] = None
    webhook_endpoint: Optional[HttpUrl] = None

We use Pydantic for data validation. While audio and file_url are optional parameters, we ensure at least one is provided. The webhook_endpoint parameter, automatically included by Cerebrium in every request, is useful for long-running requests.
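Note that the fields need explicit None defaults to be truly optional in Pydantic v2, and Pydantic won’t enforce an “at least one of” rule on its own. If you’d rather fail fast at validation time instead of inside predict, you could add a model-level validator. A minimal sketch, assuming Pydantic v2’s model_validator:

from typing import Optional
from pydantic import BaseModel, HttpUrl, model_validator

class Item(BaseModel):
    audio: Optional[str] = None
    file_url: Optional[HttpUrl] = None
    webhook_endpoint: Optional[HttpUrl] = None

    @model_validator(mode="after")
    def check_audio_source(self):
        # Reject requests that supply neither an inline audio string nor a URL.
        if self.audio is None and self.file_url is None:
            raise ValueError("Either audio or file_url must be provided")
        return self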

Note: Cerebrium has a 3-minute timeout for each inference request. For long audio files (2+ hours) that take several minutes to process, use a webhook_endpoint - a URL where we’ll send a POST request with your function’s results.
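If you use a webhook, any endpoint that accepts a JSON POST works. A minimal receiver, sketched here with FastAPI (not part of this project - the route name is arbitrary):

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/whisper-result")
async def whisper_result(request: Request):
    payload = await request.json()
    # Use run_id to correlate the callback with the original request.
    print(payload.get("run_id"), payload.get("result", {}).get("text"))
    return {"ok": True}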

Model Setup and Inference

Below, we import the required packages and load our Whisper model. While the model downloads during initial deployment, it’s automatically cached in persistent storage for subsequent use. We load the model outside our predict function since this code should only run on cold start (startup). For warm containers, only the predict function executes for inference.

from huggingface_hub import hf_hub_download
from whisper import load_model, transcribe
from util import download_file_from_url, save_base64_string_to_file

distil_large_v3 = hf_hub_download(repo_id="distil-whisper/distil-large-v3", filename="original-model.bin")
model = load_model(distil_large_v3)

def predict(run_id, audio=None, file_url=None, webhook_endpoint=None):
    item = Item(audio=audio, file_url=file_url, webhook_endpoint=webhook_endpoint)
    input_filename = f"{run_id}.mp3"

    if item.audio is None and item.file_url is None:
        raise ValueError("Either audio or file_url must be provided")

    if item.audio is not None:
        file = save_base64_string_to_file(item.audio)
    else:
        # HttpUrl is not a plain string in Pydantic v2, so convert it
        file = download_file_from_url(str(item.file_url), input_filename)
    print("Transcribing file...")

    result = transcribe(model, audio=file)
    return result

The predict function, which runs only on inference requests, creates an audio file from either the download URL or base64 string, transcribes it, and returns the output.
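Before deploying, you can smoke-test the function locally, assuming the pip dependencies and ffmpeg are installed on your machine (the URL is a placeholder):

if __name__ == "__main__":
    result = predict(run_id="local-test", file_url="https://example.com/test.mp3")
    print(result["text"])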

Deploy

Configure your compute and environment settings in cerebrium.toml:

[cerebrium.deployment]
name = "1-whisper-transcription"
python_version = "3.11"
include = ["./*", "main.py", "cerebrium.toml"]
exclude = ["./example_exclude"]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "AMPERE_A10"
cpu = 3
memory = 12.0
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 60

[cerebrium.dependencies.pip]
accelerate = "latest"
transformers = ">=4.35.0"
openai-whisper = "latest"
pydantic = "latest"
requests = "latest"

[cerebrium.dependencies.conda]

[cerebrium.dependencies.apt]
"ffmpeg" = "latest"

Deploy the app using this command:

cerebrium deploy

After deployment, make this request:

curl --location 'https://api.cortex.cerebrium.ai/v4/p-<YOUR PROJECT ID>/1-whisper-transcription/predict' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <YOUR TOKEN HERE>' \
--data '{"file_url": "https://your-public-url.com/test.mp3"}'

You’ll receive an immediate response with a 202 status code and a run_id - a unique identifier to correlate the result with the initial workload.
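The curl example exercises the file_url path. To send a local file instead, base64-encode it and pass it in the audio field - a sketch using the requests library (project ID and token are placeholders, as above):

import base64
import requests

with open("test.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "https://api.cortex.cerebrium.ai/v4/p-<YOUR PROJECT ID>/1-whisper-transcription/predict",
    headers={"Authorization": "Bearer <YOUR TOKEN HERE>"},
    json={"audio": audio_b64},
)
print(response.json())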

The endpoint returns results in this format:

{
  "run_id": "2R5PnHprwNqiS5tcFMor-4c6rSrxuzrVtBU1JfjT5iWFG6s4pHo1Ug==",
  "message": "Finished inference request with run_id: `2R5PnHprwNqiS5tcFMor-4c6rSrxuzrVtBU1JfjT5iWFG6s4pHo1Ug==`",
  "result": {
    "text": " Testing, one, two, three, testing.",
    "segments": [
      {
        "id": 0,
        "seek": 0,
        "start": 0,
        "end": 4,
        "text": " Testing, one, two, three, testing.",
        "tokens": [
          50364, 45517, 11, 472, 11, 220, 20534, 11, 220, 27583, 11, 220, 83,
          8714, 13, 50564
        ],
        "temperature": 0,
        "avg_logprob": -0.3824356023003073,
        "compression_ratio": 1,
        "no_speech_prob": 0.019467202946543694
      }
    ],
    "language": "en"
  },
  "status_code": 200,
  "run_time_ms": 2053.8525581359863
}