Transcribe a 1-hour podcast
Using Distil-Whisper to transcribe an audio file
In this tutorial, we’ll transcribe an hour-long audio file using Distil-Whisper - a distilled version of Whisper-large-v2 that’s 60% faster while maintaining accuracy within 1% of the original. We’ll accept either a base64-encoded string of the audio file or a URL from which to download it.
To see the final implementation, you can view it here
Basic Setup
Developing models with Cerebrium is similar to developing on a virtual machine or Google Colab, making conversion straightforward. Make sure you have the Cerebrium package installed and are logged in. If not, check our docs here.
First, create your project:
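A minimal sketch of this step, assuming the standard Cerebrium CLI init command (the project name distil-whisper is just an example):

```bash
cerebrium init distil-whisper
cd distil-whisper
```

This should scaffold a main.py and a cerebrium.toml in the project folder.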
Add the following packages to the [cerebrium.dependencies.pip] section of your cerebrium.toml file:
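One plausible dependency set for this tutorial (the package pins are illustrative, not from the original post; a system ffmpeg is also typically needed to decode mp3 input):

```toml
[cerebrium.dependencies.pip]
torch = ">=2.0.0"
transformers = ">=4.35.0"
accelerate = ">=0.24.0"
pydantic = ">=2.0.0"
requests = ">=2.31.0"
```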
Let’s create a util.py file for our utility functions - downloading a file from a URL or converting a base64 string to a file:
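A minimal sketch of util.py - the helper names (download_file_from_url, save_base64_string_to_file) are assumptions for illustration, not the original implementation:

```python
# util.py - helpers for turning the request payload into a local audio file
import base64
import uuid

import requests


def download_file_from_url(url: str, filename: str) -> str:
    """Download an audio file from a public URL and save it locally."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename


def save_base64_string_to_file(audio_b64: str) -> str:
    """Decode a base64-encoded audio string and write it to a local file."""
    filename = f"{uuid.uuid4()}.mp3"
    with open(filename, "wb") as f:
        f.write(base64.b64decode(audio_b64))
    return filename
```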
Now that our utility functions are complete, let’s update main.py with our main Python code. Users can send either a base64-encoded string or a public URL of the audio file. We’ll pass this file to our model and return the output. First, let’s define our request object:
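A sketch of the request schema, assuming Pydantic v2 (the class name Item and the validator are illustrative):

```python
# main.py - request schema
from typing import Optional

from pydantic import BaseModel, model_validator


class Item(BaseModel):
    audio: Optional[str] = None             # base64-encoded audio file
    file_url: Optional[str] = None          # public URL to download the audio from
    webhook_endpoint: Optional[str] = None  # added by Cerebrium to every request

    @model_validator(mode="after")
    def check_audio_source(self):
        # Require at least one of the two audio inputs
        if not self.audio and not self.file_url:
            raise ValueError("Provide either 'audio' (base64) or 'file_url'.")
        return self
```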
We use Pydantic for data validation. While audio and file_url are optional parameters, we ensure at least one is provided. The webhook_endpoint parameter, automatically included by Cerebrium in every request, is useful for long-running requests.
Note: Cerebrium has a 3-minute timeout for each inference request. For long audio files (2+ hours) that take several minutes to process, use a webhook_endpoint - a URL to which we’ll send a POST request with your function’s results.
Set Up the Model and Inference
Below, we import the required packages and load our Whisper model. While the model downloads during initial deployment, it’s automatically cached in persistent storage for subsequent use. We load the model outside our predict function since this code should only run on cold start (startup). For warm containers, only the predict function executes for inference.
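A sketch of the cold-start section, using the Hugging Face transformers ASR pipeline with the distil-whisper/distil-large-v2 checkpoint (the chunking and batch values are illustrative and should be tuned for your GPU):

```python
# main.py - imports and model load; runs once per cold start, not per request
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

transcriber = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=30,  # split long audio into 30-second chunks
    batch_size=16,      # transcribe chunks in batches; tune to your GPU memory
)
```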
The predict function, which runs only on inference requests, creates an audio file from either the download URL or the base64 string, transcribes it, and returns the output.
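A sketch of the predict function, assuming the Cerebrium convention of passing the request body, a run_id, and a logger (the helpers come from the hypothetical util.py above):

```python
# main.py - runs on every inference request
from util import download_file_from_url, save_base64_string_to_file


def predict(item, run_id, logger):
    item = Item(**item)

    # Turn the request into a local audio file
    if item.file_url:
        audio_path = download_file_from_url(item.file_url, f"{run_id}.mp3")
    else:
        audio_path = save_base64_string_to_file(item.audio)

    logger.info(f"Transcribing {audio_path}")
    result = transcriber(audio_path)

    return {"transcription": result["text"]}
```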
Deploy
Configure your compute and environment settings in cerebrium.toml:
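An illustrative configuration - the section and key names are assumptions that may differ across Cerebrium versions, so check the current cerebrium.toml reference:

```toml
[cerebrium.deployment]
name = "distil-whisper"
python_version = "3.10"

[cerebrium.hardware]
gpu = "AMPERE_A10"  # a single A10 is assumed to be enough for distil-large-v2
cpu = 2
memory = 16.0
```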
Deploy the app using this command:
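Assuming the standard CLI command:

```bash
cerebrium deploy
```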
After deployment, make this request:
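An example request - replace the placeholder endpoint and API key with the values shown after deployment; the payload uses the file_url variant of the request schema, and the audio URL is a stand-in:

```bash
curl -X POST <YOUR-ENDPOINT-URL>/predict \
  -H "Authorization: <YOUR-API-KEY>" \
  -H "Content-Type: application/json" \
  -d '{"file_url": "https://example.com/podcast-episode.mp3"}'
```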
You’ll receive an immediate response with a 202 status code and a run_id - a unique identifier for correlating the result with the initial workload.
The endpoint returns results in this format:
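An illustrative shape of the response (field names beyond run_id are assumptions):

```json
{
  "run_id": "<unique-run-id>",
  "message": "Finished inference request",
  "result": {
    "transcription": "..."
  }
}
```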