Cerebrium’s partnership with Deepgram lets you deploy speech-to-text (STT) services with simplified configuration and independent scaling.

Using Deepgram services requires an Enterprise Deepgram account and API key for self-hosted models. Contact Deepgram support to access this feature.

Links to the Deepgram model files referenced below (file extension .dg) should be obtained from your Deepgram Account Representative.

Please speak with your Deepgram representative about how to achieve parity with the Deepgram API.

The Deepgram Partner Service is available from CLI version 1.39.0 and later.
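
To check which version of the CLI you have installed and upgrade it if needed, you can use pip (a minimal sketch, assuming the CLI was installed with pip):

pip show cerebrium
pip install --upgrade cerebrium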

Setup

  1. Create a simple cerebrium app with the CLI:
cerebrium init deepgram
  2. Create a self-hosted API key from the Deepgram dashboard. Navigate to the Secrets tab in the Cerebrium dashboard and add the API key with the name DEEPGRAM_API_KEY. This secret automatically becomes available as an environment variable in the deployment.

  3. Download the model files from the self-hosted section of the Deepgram dashboard, following the guide in the Deepgram documentation. Select the ‘license proxy’ deployment type. Upload the downloaded model files (with the .dg extension, obtained via the links provided by your Account Representative) to persistent storage in the /deepgram-models folder. This folder automatically attaches to the engine container. Use this command to upload the files:

cerebrium cp <model>.dg deepgram-models/<model>.dg

#example
cerebrium cp nova-3-general.en.streaming.123456.dg deepgram-models/nova-3-general.en.streaming.123456.dg
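
To confirm the upload, you can list the contents of the folder. The ‘cerebrium ls’ command below is an assumption about the CLI; if it is not available in your version, this check can be skipped.

cerebrium ls deepgram-models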
  4. Create a file named engine.toml with the following content and upload it to your /persistent-storage/ root directory; an example upload command is shown after the config. These are the default settings, but feel free to update them.

### Keep in mind that all paths are in-container paths and do not need to exist
### on the host machine.

### Limit the number of active requests handled by a single Engine container.
### Engine will reject additional requests from API beyond this limit, and the
### API container will continue with the retry logic configured in `api.toml`.
###
### The default is no limit.
# max_active_requests =


### Configure license validation by passing in a DEEPGRAM_API_KEY environment variable
### See https://developers.deepgram.com/docs/deploy-deepgram-services#credentials
[license]
server_url = ["https://license.deepgram.com"]


### Configure the server to listen for requests from the API.
[server]
### The IP address to listen on. Since this is likely running in a Docker
### container, you will probably want to listen on all interfaces.
host = "0.0.0.0"
### The port to listen on
port = 8055


### To support metrics we need to expose an Engine endpoint.
### See https://developers.deepgram.com/docs/metrics-guide#deepgram-engine
[metrics_server]
host = "0.0.0.0"
port = 9991


[model_manager]
### The number of models to have concurrently loaded in system memory.
### If managing a deployment with dozens of models this setting will
### help prevent instances where models consume too much memory and
### offload the models to disk as needed on a least-recently-used basis.
###
### The default is no limit.
# max_concurrently_loaded_models = 20

### Inference models. You can place these in one or multiple directories.
search_paths = ["/models"]


### Enable ancillary features
[features]
### Allow multichannel requests by setting this to true, set to false to disable
multichannel = true # or false
### Enables language detection *if* a valid language detection model is available
language_detection = true # or false
### Enables streaming entity formatting *if* a valid NER model is available
streaming_ner = false # or true

### Size of audio chunks to process in seconds.
[chunking.batch]
# min_duration =
# max_duration =
[chunking.streaming]
# min_duration =
# max_duration =

### How often to return interim results, in seconds. Default is 1.0s.
###
### This value may be lowered to increase the frequency of interim results.
### However, this may cause a significant decrease in the number of concurrent
### streams supported by a single GPU. Please contact your Deepgram Account
### representative for more details.
step = 0.2


### Engine will automatically enable half precision operations if your GPU supports
### them. You can explicitly enable or disable this behavior with the state parameter
### which supports enabled, disabled, and auto (the default).
[half_precision]
    state = "auto"
# state = "disabled" # or "enabled" or "auto"
  5. Create a file named api.toml with the following content and upload it to your /persistent-storage/ root directory; an example upload command is shown after the config. These are the default settings, but feel free to update them.

### Keep in mind that all paths are in-container paths, and do not need to exist
### on the host machine.


### Configure license validation by passing in a DEEPGRAM_API_KEY environment variable
### See https://developers.deepgram.com/docs/deploy-deepgram-services#credentials
[license]
server_url = ["https://license.deepgram.com"]


### Configure how the API will listen for your requests
[server]
### The base URL (prefix) for requests to the API.
base_url = "/v1"
### The IP address to listen on. Since this is likely running in a Docker
### container, you will probably want to listen on all interfaces.
host = "0.0.0.0"
### The port to listen on
port = 8082

### How long to wait for a connection to a callback URL.
callback_conn_timeout = "1s"
### How long to wait for a response to a callback URL.
callback_timeout = "10s"

### How long to wait for a connection to a fetch URL.
fetch_conn_timeout = "1s"
### How long to wait for a response to a fetch URL.
fetch_timeout = "60s"


### By default, the API listens over HTTP. By passing both a certificate
### and a key file, the API will instead listen over HTTPS.
###
### This performs TLS termination only, and does not provide any
### additional authentication.
[server.https]
# cert_file = "/path/to/cert.pem"
# key_file = "/path/to/key.pem"


### Specify custom DNS resolution options.
[resolver]
### Specify custom domain name server(s).
### Format is "{IP} {PORT} {PROTOCOL (tcp or udp)}"
# nameservers = ["127.0.0.1 53 udp"]

### If specifying a custom DNS nameserver, set the DNS TTL value.
# max_ttl = 10


### Limit the number of active requests handled by a single API container.
### If additional requests beyond the limit are sent, API will return
### a 429 HTTP status code. Default is no limit.
[concurrency_limit]
# active_requests =


### Enable ancillary features
[features]
### Enables topic detection *if* a valid topic detection model is available
topic_detection = true # or false

### Enables summarization *if* a valid summarization model is available
summarization = true # or false

### Enables pre-recorded entity detection *if* a valid entity detection model is available
entity_detection = false # or true

### Enables pre-recorded entity-based redaction *if* a valid entity detection model is available
entity_redaction = false # or true

### Enables pre-recorded entity formatting *if* a valid NER model is available
format_entity_tags = false # or true

### If API is receiving requests faster than Engine can process them, a request
### queue will form. By default, this queue is stored in memory. Under high load,
### the queue may grow too large and cause Out-Of-Memory errors. To avoid this,
### set a disk_buffer_path to buffer the overflow on the request queue to disk.
###
### WARN: This is only to temporarily buffer requests during high load.
### If there is not enough Engine capacity to process the queued requests over time,
### the queue (and response time) will grow indefinitely.
# disk_buffer_path = "/path/to/disk/buffer/directory"

### Enables streaming TTS *if* a valid Aura TTS model is available
speak_streaming = true # or false

### Configure the backend pool of speech engines (generically referred to as
### "drivers" here). The API will load-balance among drivers in the standard
### pool; if one standard driver fails, the next one will be tried.
###
### Each driver URL will have its hostname resolved to an IP address. If a domain
### name resolves to multiple IP addresses, the API will load-balance across each
### IP address.
###
### This behavior is provided for convenience, and in a production environment
### other tools can be used, such as HAProxy.
###
### Below is a new Speech Engine ("driver") in the "standard" pool.
[[driver_pool.standard]]
### Host to connect to. If you are using a different method of orchestrating,
### then adjust the IP address accordingly.
###
### WARN: This must be HTTPS.
###
### Docker Compose and Podman Compose create a dedicated network that allows inter-container communication by app name.
### See [Networking in Compose](https://docs.docker.com/compose/networking/) for details.
url = "https://0.0.0.0:8055/v2"
### Factor to increase the timeout by for each additional retry (for
### exponential backoff).
timeout_backoff = 1.2

### Before attempting a retry, sleep for this long (in seconds)
retry_sleep = "2s"
### Factor to increase the retry sleep by for each additional retry (for
### exponential backoff).
retry_backoff = 1.6

### Maximum response to deserialize from Driver (in bytes)
max_response_size = 1073741824 # 1GB
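
api.toml can be uploaded to the persistent-storage root in the same way (again a sketch assuming the file is in your current directory):

cerebrium cp api.toml api.toml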
  6. Update the cerebrium.toml file with the following configuration to set hardware requirements, scaling parameters, region, and other settings:
[cerebrium.deployment]
name = "deepgram"
# Set to false in production environments to require authentication
disable_auth = true

[cerebrium.runtime.deepgram]

[cerebrium.hardware]
cpu = 4
region = "us-east-1"
memory = 32
compute = "AMPERE_A10"
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
cooldown = 120
replica_concurrency = 150
  7. Run ‘cerebrium deploy’ to deploy the app. After deployment, an endpoint for the Deepgram service is provided in the terminal output. (The URL for this endpoint can also be found on the app’s overview page in the dashboard.)

  8. Download an example audio file to use with the Deepgram service:

wget https://dpgr.am/bueller.wav
  9. Access the Deepgram service by calling the endpoint with appropriate parameters, for example:
curl -X POST --data-binary @bueller.wav "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/deepgram/v1/listen?model=nova-3&smart_format=true"

Parameters accepted by the Deepgram service can be found in the speech-to-text API reference.
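
Additional Deepgram query parameters can be passed in the same way; note that features such as diarization only take effect if the corresponding model files have been uploaded. For example (a sketch, assuming the matching models are present in /deepgram-models):

curl -X POST --data-binary @bueller.wav "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/deepgram/v1/listen?model=nova-3&smart_format=true&diarize=true"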

If ‘disable_auth’ in your cerebrium.toml is set to false, use the inference token in your Authorization header to authenticate with your Cerebrium service. Don’t worry about the Deepgram API key; it is pulled automatically from your secrets.
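
An authenticated request might look like the following (a sketch assuming a standard Bearer token in the Authorization header; replace <INFERENCE_TOKEN> with the token from your dashboard):

curl -X POST -H "Authorization: Bearer <INFERENCE_TOKEN>" --data-binary @bueller.wav "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/deepgram/v1/listen?model=nova-3&smart_format=true"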

API Key Configuration

To use Deepgram services:

  1. Sign up at deepgram.com
  2. Create an API key in the Deepgram dashboard
  3. Add the API key to Cerebrium:
  • Navigate to Secrets tab in the Cerebrium dashboard
  • Add the Deepgram API key as an app-specific or project-wide secret named DEEPGRAM_API_KEY
  • This secret automatically becomes available as an environment variable in the deployment

Scaling and Concurrency

Deepgram services support independent scaling configurations:

  • min_replicas: Minimum number of instances to maintain (0 for scale-to-zero)
  • max_replicas: Maximum number of instances that can be created during high load
  • replica_concurrency: Number of concurrent requests each instance can handle
  • cooldown: Time in seconds that an instance remains active after processing its last request

Adjust these parameters based on expected traffic patterns and latency requirements.
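
For example, a latency-sensitive deployment might keep one warm replica instead of scaling to zero (an illustrative sketch; tune the values to your own traffic):

[cerebrium.scaling]
min_replicas = 1
max_replicas = 5
cooldown = 120
replica_concurrency = 150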

Usage Examples

Cerebrium runs both Deepgram STT models and applications on the same network alongside LiveKit workers, reducing latency by approximately 400ms—a significant advantage for voice agent applications.

For a complete implementation reference, see the LiveKit Outbound Agent example.