In this tutorial, we’ll create a real-time voice agent that responds to queries via speech in ~500ms. This flexible implementation lets you swap in any Large Language Model (LLM) or Text-to-Speech (TTS) model. It’s ideal for voice-based use cases like customer support bots and receptionists.
To create this app, we use Pipecat, a framework that handles component integration, user interruptions, and audio processing. We'll demonstrate it by having our voice agent join a meeting room using Daily (the creators of Pipecat), and we'll deploy the app on Cerebrium for seamless deployment and scaling.
Essentially, our application has three parts:
- Your Pipecat agent, which acts as the orchestrator
- Your Deepgram STT/TTS service (requires a Deepgram Enterprise account)
- A self-hosted LLM using the vLLM framework
We achieve such low latency because every service is hosted within Cerebrium, so requests never leave the cluster: communication across containers takes less than 10ms.
You can find the final version of the code here
If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get set up.
Deepgram deployment
For conciseness, see our Partner Services page for instructions on deploying a Deepgram service on Cerebrium. The link is here.
You need a Deepgram Enterprise license to deploy Deepgram on Cerebrium; otherwise, you must use their hosted API endpoint.
LLM Deployment
For our LLM, we deploy an OpenAI-compatible Llama 3.1 endpoint using the vLLM framework. To keep time-to-first-token (TTFT) low, we use a quantized version (RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8).
Run cerebrium init llama-llm and add the following to your cerebrium.toml:
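A minimal sketch of such a config (section and key names follow Cerebrium's cerebrium.toml format, but check the current docs for exact keys; the hardware and scaling values here are illustrative assumptions you should tune):

```toml
[cerebrium.deployment]
name = "llama-llm"
python_version = "3.11"

[cerebrium.hardware]
compute = "AMPERE_A10"  # assumption: any GPU with enough VRAM for an 8B w8a8 model
cpu = 4
memory = 32.0

[cerebrium.scaling]
min_replicas = 1
max_replicas = 5
replica_concurrency = 1  # raise once you know what your GPU can handle

[cerebrium.dependencies.pip]
vllm = "latest"
```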
Then add the following code to your main.py - this uses the vLLM framework and exposes an OpenAI-compatible endpoint:
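As a rough sketch (assuming you serve the model through vLLM's bundled OpenAI-compatible server rather than writing custom handlers), main.py can simply launch that entrypoint; vLLM picks up HF_TOKEN from the environment when downloading the weights:

```python
# Sketch: launch vLLM's built-in OpenAI-compatible API server for the
# quantized model. The port and flags are illustrative; wire the port up
# to whatever your Cerebrium runtime config expects.
import subprocess

MODEL = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8"

subprocess.run(
    [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL,
        "--port", "8000",
        "--max-model-len", "8192",  # assumption: cap context length to fit VRAM
    ],
    check=True,
)
```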
Make sure to add your Hugging Face token to your Secrets on Cerebrium as HF_TOKEN.
Then run cerebrium deploy to make it live - you should see it in your Cerebrium dashboard. We will use your deployment URL in the next step.
Based on your GPU hardware and the replica-concurrency setting in your cerebrium.toml, you can control how many concurrent calls the LLM handles.
Pipecat setup
In your IDE, run the following command to create the Pipecat agent: cerebrium init pipecat-agent. We will use the Pipecat framework to orchestrate our services into a voice agent.
Add the following pip packages to your cerebrium.toml to create your deployment environment:
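Based on the imports in the main.py below, the dependency section needs roughly the following (the pipecat-ai extras names are assumptions; match them to your Pipecat release):

```toml
[cerebrium.dependencies.pip]
"pipecat-ai[daily,openai,deepgram,cartesia,silero]" = "latest"
aiohttp = "latest"
requests = "latest"
loguru = "latest"
python-dotenv = "latest"
```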
You can then add the following code to your main.py:
```python
import asyncio
import os
import sys
import time

import aiohttp
import requests
from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import EndFrame, LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import (
    LLMAssistantResponseAggregator,
    LLMUserResponseAggregator,
)
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from deepgram import LiveOptions

from helpers import CustomDeepgramSTTService

load_dotenv()

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

# Only needed if you swap Cartesia out for Deepgram's TTS service.
deepgram_voice: str = "aura-asteria-en"


async def main(room_url: str, token: str):
    async with aiohttp.ClientSession() as session:
        transport = DailyTransport(
            room_url,
            token,
            "Respond bot",
            DailyParams(
                audio_out_enabled=True,
                audio_in_enabled=True,
                transcription_enabled=False,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.15)),
                vad_audio_passthrough=True,
            ),
        )

        # STT runs against our Deepgram deployment inside the Cerebrium cluster.
        stt = CustomDeepgramSTTService(
            api_key=os.environ.get("DEEPGRAM_API_KEY"),
            websocket_url="ws://p-xxxxxx-deepgram.tenant-cerebrium-prod.svc.cluster.local/v1/listen",
            live_options=LiveOptions(
                model="nova-2-general",
                language="en-US",
                smart_format=True,
                vad_events=True,
            ),
        )

        tts = CartesiaTTSService(
            api_key=os.environ.get("CARTESIA_API_KEY"),
            voice_id="97f4b8fb-f2fe-444b-bb9a-c109783a857a",
        )

        # The LLM is our vLLM deployment, which exposes an OpenAI-compatible API.
        llm = OpenAILLMService(
            name="LLM",
            model="RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
            base_url="http://p-xxxxxx-llama-llm.tenant-cerebrium-prod.svc.cluster.local/run",
        )

        messages = [
            {
                "role": "system",
                "content": "You are a fast, low-latency chatbot. Your goal is to demonstrate voice-driven AI capabilities at human-like speeds. The technology powering you is Daily for transport, Cerebrium for serverless infrastructure, Llama 3 (8-B version) LLM, and Deepgram for speech-to-text and text-to-speech. You are hosted on the east coast of the United States. Respond to what the user said in a creative and helpful way, but keep responses short and legible. Ensure responses contain only words. Check again that you have not included special characters other than '?' or '!'.",
            },
        ]

        # Aggregators collect streamed tokens into user/assistant messages.
        tma_in = LLMUserResponseAggregator(messages)
        tma_out = LLMAssistantResponseAggregator(messages)

        pipeline = Pipeline(
            [
                transport.input(),
                stt,
                tma_in,
                llm,
                tts,
                transport.output(),
                tma_out,
            ]
        )

        task = PipelineTask(
            pipeline,
            params=PipelineParams(
                allow_interruptions=True,
                enable_metrics=True,
            ),
        )

        @transport.event_handler("on_first_participant_joined")
        async def on_first_participant_joined(transport, participant):
            # Give the transport a moment to settle, then have the bot introduce itself.
            await asyncio.sleep(1.5)
            messages.append(
                {
                    "role": "system",
                    "content": "Introduce yourself by saying 'hello, I'm FastBot, how can I help you today?'",
                }
            )
            await task.queue_frame(LLMMessagesFrame(messages))

        @transport.event_handler("on_participant_left")
        async def on_participant_left(transport, participant, reason):
            await task.queue_frame(EndFrame())

        @transport.event_handler("on_call_state_updated")
        async def on_call_state_updated(transport, state):
            if state == "left":
                await task.queue_frame(EndFrame())

        runner = PipelineRunner()
        await runner.run(task)
        await session.close()


async def start_bot(room_url: str, token: str = None):
    try:
        await main(room_url, token)
    except Exception as e:
        logger.error(f"Exception in main: {e}")
        sys.exit(1)

    return {"message": "session finished"}


def create_room():
    url = "https://api.daily.co/v1/rooms/"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('DAILY_TOKEN')}",
    }
    data = {
        "properties": {
            "exp": int(time.time()) + 60 * 5,  # room expires in 5 minutes
            "eject_at_room_exp": True,
        }
    }

    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        room_info = response.json()
        token = create_token(room_info["name"])
        if token and "token" in token:
            room_info["token"] = token["token"]
        else:
            logger.error("Failed to create token")
            return {
                "message": "There was an error creating your room",
                "status_code": 500,
            }
        return room_info
    else:
        data = response.json()
        if data.get("error") == "invalid-request-error" and "rooms reached" in data.get(
            "info", ""
        ):
            logger.error(
                "We are currently at capacity for this demo. Please try again later."
            )
            return {
                "message": "We are currently at capacity for this demo. Please try again later.",
                "status_code": 429,
            }
        logger.error(f"Failed to create room: {response.status_code}")
        return {"message": "There was an error creating your room", "status_code": 500}


def create_token(room_name: str):
    url = "https://api.daily.co/v1/meeting-tokens"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('DAILY_TOKEN')}",
    }
    data = {"properties": {"room_name": room_name, "is_owner": True}}

    response = requests.post(url, headers=headers, json=data)
    if response.status_code == 200:
        token_info = response.json()
        return token_info
    else:
        logger.error(f"Failed to create token: {response.status_code}")
        return None
```
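For the local test mentioned later, a minimal entrypoint at the bottom of main.py could look like the sketch below, kept commented out so it doesn't run on Cerebrium. It assumes create_room() succeeds and returns the Daily room URL plus token:

```python
# Uncomment to test locally with `python main.py`:
# if __name__ == "__main__":
#     room = create_room()
#     if "token" in room:
#         asyncio.run(start_bot(room["url"], room["token"]))
```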
There is a lot happening above, so here is a summary:
- We use Daily's WebRTC functionality to create our room (you could swap in Twilio or Telnyx). Two functions, create_room() and create_token(), create and authenticate the meeting room.
- For the Deepgram and LLM services, we use local URLs to connect within the Cerebrium cluster. We are working on making this smoother; for now, edit the URLs with your project key.
- For TTS, we use the Cartesia service to showcase how versatile Pipecat is, but you could use Deepgram's TTS service too!
The Daily Python SDK provides event callbacks that trigger functionality when users join or leave a call. The event handlers registered in main() above cover these cases:
- First participant joins: Bot introduces itself via a conversation message
- Additional participants join: Bot listens and responds to all participants
- Participant leaves or call ends: Bot terminates itself
Based on your CPU hardware and the replica-concurrency setting in your cerebrium.toml, you can control how many concurrent calls this Pipecat agent handles.
Create a .env file within your pipecat-agent folder with the following values set:
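main.py reads three secrets via python-dotenv:

```
DEEPGRAM_API_KEY=<your Deepgram API key>
CARTESIA_API_KEY=<your Cartesia API key>
DAILY_TOKEN=<your Daily developer token>
```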
Get your Daily developer token from your profile. If you don’t have an account, sign up here (they offer a generous free tier). Navigate to the “developers” tab to get your API key and add it to your Cerebrium Secrets.
To test your voice bot locally, uncomment the entrypoint block at the bottom of main.py and run python main.py. Your code should then work.
That’s it! You now have a fully functioning AI bot that interacts with users through speech in ~500ms. Imagine the possibilities!
Now, let’s create a user interface for the bot.
Deploy to Cerebrium
Deploy the app to Cerebrium by running this command in your terminal: cerebrium deploy
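Once deployed, each top-level function in main.py becomes an HTTP endpoint. As a sketch (the URL shape follows Cerebrium's REST pattern; copy the exact endpoint and API key from your dashboard), a client can request a room like this:

```python
import os
import requests

# Hypothetical endpoint URL - copy the real one from your Cerebrium dashboard.
resp = requests.post(
    "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/pipecat-agent/create_room",
    headers={"Authorization": f"Bearer {os.environ['CEREBRIUM_API_KEY']}"},
)
print(resp.json())  # contains the Daily room URL and token
```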
We’ll add these endpoints to our frontend interface.
Connect frontend
We created a public fork of the Pipecat frontend to provide a nice demo of this application. You can clone the repo here.
Follow the instructions in the README.md, then populate the following variables in your .env.development.local:
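As an illustration (the variable names here are hypothetical; use whatever keys the fork's README specifies):

```
VITE_SERVER_URL=<your pipecat-agent deployment URL>
VITE_SERVER_AUTH=<your Cerebrium API key>
```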
You can now run yarn dev and visit http://localhost:5173/ to test your application!
Conclusion
This tutorial provides a foundation for implementing voice in your app and extending into image and vision capabilities. Pipecat is an extensible, open-source framework for building voice-enabled apps, while Cerebrium provides seamless deployment and autoscaling with pay-as-you-go compute.
Tag us as @cerebriumai to showcase your work and join our Slack or Discord communities for questions and feedback.