This tutorial demonstrates creating a real-time voice agent that responds to phone calls via Twilio. The flexible implementation supports any LLM or Text-to-Speech (TTS) model, making it ideal for voice applications like customer support bots and receptionists.
We’ll use PipeCat to handle component integration, user interruptions, and audio data processing.
You can find the final version of the code here.
Cerebrium setup
Set up Cerebrium:
- Sign up here
- Follow the installation docs here
- Create a starter project:
cerebrium init 4-twilio-voice-agent
This creates:
- main.py: Entrypoint file
- cerebrium.toml: Build and environment configuration
Add these pip packages to your cerebrium.toml:
[cerebrium.dependencies.pip]
torch = ">=2.0.0"
"pipecat-ai[silero, daily, openai, deepgram, cartesia, twilio]" = "0.0.47"
aiohttp = ">=3.9.4"
torchaudio = ">=2.3.0"
channels = ">=4.0.0"
requests = "==2.32.2"
twilio = "latest"
fastapi = "latest"
uvicorn = "latest"
python-dotenv = "latest"
loguru = "latest"
Set up a FastAPI server to handle Twilio calls and upgrade to WebSocket connections for real-time communication. Add this code to main.py:
import json

from fastapi import FastAPI, WebSocket
from fastapi.middleware.cors import CORSMiddleware
from starlette.responses import HTMLResponse

from bot import main

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Allow all origins for testing
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.post("/")
async def start_call():
    print("POST TwiML")
    return HTMLResponse(content=open("templates/streams.xml").read(), media_type="application/xml")


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    start_data = websocket.iter_text()
    # Twilio sends a "connected" event first; skip it.
    await start_data.__anext__()
    # The second message is the "start" event, which carries the stream SID.
    call_data = json.loads(await start_data.__anext__())
    print(call_data, flush=True)
    stream_sid = call_data["start"]["streamSid"]
    print("WebSocket connection accepted")
    await main(websocket, stream_sid)
Create a templates folder with streams.xml inside. This XML response tells Twilio to upgrade to a WebSocket connection:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Connect>
        <!--Update with your project ID below-->
        <Stream url="wss://api.cortex.cerebrium.ai/v4/p-xxxxxxx/4-twilio-agent/ws"></Stream>
    </Connect>
    <Pause length="40"/>
</Response>
Replace the stream URL with your deployment's base endpoint, using your project ID.
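If you'd rather build this TwiML in code than serve a static file, the Twilio Python helper library can generate the same response. This is an optional sketch; the static streams.xml is all the tutorial needs:
from twilio.twiml.voice_response import Connect, VoiceResponse

response = VoiceResponse()
connect = Connect()
# Same WebSocket URL as in streams.xml; replace p-xxxxxxx with your project ID.
connect.stream(url="wss://api.cortex.cerebrium.ai/v4/p-xxxxxxx/4-twilio-agent/ws")
response.append(connect)
response.pause(length=40)
print(str(response))  # emits the XML shown above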
Configure Cerebrium to run the FastAPI server by adding this to cerebrium.toml:
[cerebrium.runtime.custom]
port = 8765
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8765"]
healthcheck_endpoint = "/health"
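Note that healthcheck_endpoint expects main.py to expose a /health route, which the code above doesn't yet define. A minimal sketch of one (the response body is an assumption; any 200 response should satisfy the probe):
@app.get("/health")
async def health():
    # Lightweight liveness probe for Cerebrium's healthcheck
    return {"status": "ok"}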
You can read more about running custom web servers here.
Twilio setup
Twilio provides cloud communications APIs for messaging, voice, video, and authentication. While we use Twilio for this demo, other providers work too. Sign up for a free account here.
Purchase a local number (not toll-free) from the phone numbers page. Then point the number's voice webhook at your deployment's base URL (the POST / endpoint above) so incoming calls receive the TwiML; you can do this in the Twilio console, or programmatically, as sketched below.
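A sketch of the programmatic route using the Twilio helper library (the phone-number SID is a placeholder, and the base URL, i.e. the stream URL minus /ws and switched to https, is an assumption about your deployment):
import os

from twilio.rest import Client

client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])

# Point the number's voice webhook at the POST / endpoint that returns the TwiML.
client.incoming_phone_numbers("PNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX").update(
    voice_url="https://api.cortex.cerebrium.ai/v4/p-xxxxxxx/4-twilio-agent/",
    voice_method="POST",
)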
Save these changes, then move on to setting up the AI agent.
AI Agent Setup
Create bot.py to set up the AI agent using PipeCat for component integration, interruption handling, and audio processing:
import os
import sys

from loguru import logger
from pipecat.frames.frames import LLMMessagesFrame, EndFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.services.openai import OpenAILLMService
from pipecat.processors.aggregators.openai_llm_context import (
    OpenAILLMContext,
)
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.vad.silero import SileroVADAnalyzer
from twilio.rest import Client
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketTransport,
    FastAPIWebsocketParams,
)
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.services.cartesia import CartesiaTTSService

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

twilio = Client(
    os.environ.get("TWILIO_ACCOUNT_SID"), os.environ.get("TWILIO_AUTH_TOKEN")
)


async def main(websocket_client, stream_sid):
    transport = FastAPIWebsocketTransport(
        websocket=websocket_client,
        params=FastAPIWebsocketParams(
            audio_out_enabled=True,
            add_wav_header=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
            serializer=TwilioFrameSerializer(stream_sid),
        ),
    )

    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

    llm = OpenAILLMService(
        name="LLM",
        api_key=os.environ.get("OPENAI_API_KEY"),
        model="gpt-4",
    )

    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
    )

    messages = [
        {
            "role": "system",
            "content": "You are a helpful LLM in an audio call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
        },
    ]

    context = OpenAILLMContext(messages=messages)
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),  # Websocket input from client
            stt,  # Speech-To-Text
            context_aggregator.user(),
            llm,  # LLM
            tts,  # Text-To-Speech
            transport.output(),  # Websocket output to client
            context_aggregator.assistant(),
        ]
    )

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        # Kick off the conversation.
        messages.append({"role": "system", "content": "Please introduce yourself to the user."})
        await task.queue_frames([LLMMessagesFrame(messages)])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        await task.queue_frames([EndFrame()])

    runner = PipelineRunner(handle_sigint=False)
    await runner.run(task)
The code:
- Connects to the WebSocket transport for audio I/O
- Sets up the STT, LLM, and TTS services, using Secrets for authentication
- Creates a customizable PipelineTask supporting:
  - Image and Vision use cases (docs)
  - Built-in interruption handling
  - Easy model swapping (see the sketch below)
- Handles client connect/disconnect events via transport event handlers
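For example, swapping the LLM is a one-line change in bot.py; a sketch (gpt-4o is an assumed model name, use any chat model your key can access):
llm = OpenAILLMService(
    name="LLM",
    api_key=os.environ.get("OPENAI_API_KEY"),
    model="gpt-4o",  # assumed model name; any OpenAI chat model works here
)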
For lower latency (~500ms end-to-end), run parts or all of the pipeline locally. Learn more in our voice agents guide and RAG voice agent blog post.
Deploy to Cerebrium
To deploy this app to Cerebrium, run cerebrium deploy in your terminal. Once the deployment succeeds, test the app by calling your Twilio number; the agent will answer and respond automatically.
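You can also sanity-check the TwiML endpoint with requests (already in the dependency list); a sketch, assuming your base URL is the stream URL minus /ws:
import requests

# POST to the deployment's base endpoint, exactly as Twilio does on an incoming call.
resp = requests.post("https://api.cortex.cerebrium.ai/v4/p-xxxxxxx/4-twilio-agent/")
print(resp.status_code)  # expect 200
print(resp.text)  # expect the TwiML from streams.xml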
Scaling PipeCat
For scaling PipeCat on Cerebrium:
- Use large CPU instances (10 CPUs, 8 GB memory) to meet Twilio's sub-1s response requirement
- Run concurrent PipeCat processes:
  - Each process uses ~0.5 CPUs
  - A 10-CPU instance handles 20 concurrent calls
  - Adjust based on traffic needs
For scaling criteria, use Cerebrium's replica_concurrency setting to spawn new containers based on utilization, preventing cold starts for subsequent calls.
To make both of these updates, extend your cerebrium.toml with the following:
[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "CPU"
cpu = 10
memory = 8.0
[cerebrium.scaling]
min_replicas = 1
max_replicas = 3
cooldown = 30
replica_concurrency = 20
scaling_metric = "concurrency_utilization"
scaling_target = 80
Conclusion
This tutorial provides a foundation for implementing voice features and expanding into image and vision capabilities. PipeCat offers an extensible, open-source framework for building voice-enabled apps, while Cerebrium provides seamless deployment and autoscaling with pay-as-you-go compute.
Tag us as @cerebriumai to showcase your work and join our Slack or Discord communities for questions and feedback.