This tutorial demonstrates creating a real-time voice agent that responds to phone calls via Twilio. The flexible implementation supports any LLM or Text-to-Speech (TTS) model, making it ideal for voice applications like customer support bots and receptionists.

We’ll use Pipecat to handle component integration, user interruptions, and audio data processing.

You can find the final version of the code here.

Cerebrium setup

Set up Cerebrium:

  1. Sign up here
  2. Follow installation docs here
  3. Create a starter project:
    cerebrium init 4-twilio-voice-agent
    
    This creates:
    • main.py: Entrypoint file
    • cerebrium.toml: Build and environment configuration

Add these pip packages to your cerebrium.toml:

[cerebrium.dependencies.pip]
torch = ">=2.0.0"
"pipecat-ai[silero, daily, openai, deepgram, cartesia, twilio]" = "0.0.47"
aiohttp = ">=3.9.4"
torchaudio = ">=2.3.0"
channels = ">=4.0.0"
requests = "==2.32.2"
twilio = "latest"
fastapi = "latest"
uvicorn = "latest"
python-dotenv = "latest"
loguru = "latest"

Set up a FastAPI server to handle Twilio calls and upgrade to WebSocket connections for real-time communication. Add this code to main.py:

import json

from fastapi import FastAPI, WebSocket
from fastapi.middleware.cors import CORSMiddleware
from starlette.responses import HTMLResponse

from bot import main

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Allow all origins for testing
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.post("/")
async def start_call():
    print("POST TwiML")
    return HTMLResponse(content=open("templates/streams.xml").read(), media_type="application/xml")


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    start_data = websocket.iter_text()
    await start_data.__anext__()
    call_data = json.loads(await start_data.__anext__())
    print(call_data, flush=True)
    stream_sid = call_data["start"]["streamSid"]
    print("WebSocket connection accepted")
    await main(websocket, stream_sid)
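
The two __anext__() calls follow Twilio’s Media Streams handshake: the first WebSocket message is a “connected” event, and the second is a “start” event carrying the streamSid that the serializer needs. Abbreviated, the two messages look roughly like this:

{"event": "connected", "protocol": "Call", "version": "1.0.0"}
{"event": "start", "start": {"streamSid": "MZ...", "callSid": "CA...", "tracks": ["inbound"]}, "streamSid": "MZ..."}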

Create a templates folder with streams.xml inside. This TwiML response tells Twilio to open a bidirectional media stream over a WebSocket connection:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <!--Update with your project ID below-->
    <Stream url="wss://api.cortex.cerebrium.ai/v4/p-xxxxxxx/4-twilio-agent/ws"></Stream>
  </Connect>
  <Pause length="40"/>
</Response>

In the Stream url, replace p-xxxxxxx with your own project ID; the URL is your deployment’s base endpoint with /ws appended.

Configure Cerebrium to run the FastAPI server by adding this to cerebrium.toml:

[cerebrium.runtime.custom]
port = 8765
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8765"]
healthcheck_endpoint = "/health"

You can read more about running custom web servers here.
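
Before deploying, you can sanity-check that the server boots by running the same entrypoint locally and confirming that http://localhost:8765/health returns OK:

uvicorn main:app --host 0.0.0.0 --port 8765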

Twilio setup

Twilio provides cloud communications APIs for messaging, voice, video, and authentication. While we use Twilio for this demo, other providers work too. Sign up for a free account here.

Purchase a local number (not toll-free) from the phone numbers page. Then, under the number’s Voice configuration, point the “A call comes in” webhook at your deployment’s base endpoint (the POST / route above) so Twilio fetches the TwiML when a call arrives.
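
You can also set the webhook programmatically with the twilio Python client. A minimal sketch, assuming your Twilio credentials are in the environment and “PNxxxxxxx” stands in for your phone number’s SID:

import os

from twilio.rest import Client

client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])

# "PNxxxxxxx" is a placeholder for your phone number SID; the voice_url
# is your deployment's base endpoint (the POST / route above)
client.incoming_phone_numbers("PNxxxxxxx").update(
    voice_url="https://api.cortex.cerebrium.ai/v4/p-xxxxxxx/4-twilio-voice-agent/",
    voice_method="POST",
)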

Save these changes, then move on to setting up the AI agent.

AI Agent Setup

Create bot.py to set up the AI agent, using Pipecat for component integration, interruption handling, and audio processing:

import os
import sys

from loguru import logger
from pipecat.frames.frames import LLMMessagesFrame, EndFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask

from pipecat.services.openai import OpenAILLMService
from pipecat.processors.aggregators.openai_llm_context import (
    OpenAILLMContext,
)
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketTransport,
    FastAPIWebsocketParams,
)
from pipecat.serializers.twilio import TwilioFrameSerializer

from pipecat.services.cartesia import CartesiaTTSService

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


async def main(websocket_client, stream_sid):
    transport = FastAPIWebsocketTransport(
        websocket=websocket_client,
        params=FastAPIWebsocketParams(
            audio_out_enabled=True,
            add_wav_header=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
            serializer=TwilioFrameSerializer(stream_sid),
        ),
    )

    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
    llm = OpenAILLMService(
        name="LLM",
        api_key=os.environ.get("OPENAI_API_KEY"),
        model="gpt-4",
    )


    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
    )

    messages = [
        {
            "role": "system",
            "content": "You are a helpful LLM in an audio call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
        },
    ]
    context = OpenAILLMContext(messages=messages)
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),  # Websocket input from client
            stt,  # Speech-To-Text
            context_aggregator.user(),
            llm,  # LLM
            tts,  # Text-To-Speech
            transport.output(),  # Websocket output to client
            context_aggregator.assistant(),
        ]
    )

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        # Kick off the conversation.
        messages.append({"role": "system", "content": "Please introduce yourself to the user."})
        await task.queue_frames([LLMMessagesFrame(messages)])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        await task.queue_frames([EndFrame()])

    runner = PipelineRunner(handle_sigint=False)

    await runner.run(task)

The code:

  1. Connects a WebSocket transport for audio input and output
  2. Sets up services: Deepgram for Speech-to-Text, OpenAI for the LLM, and Cartesia for Text-to-Speech
  3. Uses Secrets for authentication
  4. Creates a customizable PipelineTask supporting:
    • Image and Vision use cases (docs)
    • Built-in interruption handling
    • Easy model swapping (see the sketch below)
  5. Handles call connect and disconnect events via transport event handlers
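
For example, swapping the TTS service is a one-line change in bot.py. Here is a minimal sketch using Pipecat’s ElevenLabs service, assuming an ELEVENLABS_API_KEY secret and a placeholder voice ID:

from pipecat.services.elevenlabs import ElevenLabsTTSService

# Drop-in replacement for the CartesiaTTSService used in main()
tts = ElevenLabsTTSService(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    voice_id="your-voice-id",  # placeholder voice ID
)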

For lower latency (~500ms end-to-end), run some or all of the pipeline’s models locally instead of calling hosted APIs. Learn more in our voice agents guide and RAG voice agent blog post.

Deploy to Cerebrium

To deploy this app to Cerebrium, run cerebrium deploy in your terminal.

Once the deployment succeeds, the CLI prints a confirmation along with your app’s base endpoint.

Test the app by calling your Twilio number; the agent should answer and introduce itself.
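
To verify the HTTP side separately, a quick sketch with requests (already in your dependencies) should return the TwiML from streams.xml; substitute your own project ID in the URL:

import requests

# Placeholder URL: substitute your project ID
url = "https://api.cortex.cerebrium.ai/v4/p-xxxxxxx/4-twilio-voice-agent/"
resp = requests.post(url)
print(resp.status_code)  # expect 200
print(resp.text)         # expect the <Response> TwiML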

Scaling Pipecat

For scaling Pipecat on Cerebrium:

  1. Use large CPU instances (10 CPUs, 8 GB memory) to meet Twilio’s sub-1-second response requirement
  2. Run concurrent Pipecat processes:
    • Each process uses ~0.5 CPUs
    • A 10 CPU instance can therefore handle 20 concurrent calls
    • Adjust based on traffic needs

For scaling, use Cerebrium’s replica_concurrency setting together with a utilization-based scaling metric so that new containers spawn before existing ones saturate, preventing cold starts for subsequent calls.

To apply both changes, update your cerebrium.toml to contain the following:

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "CPU"
cpu = 10
memory = 8.0

[cerebrium.scaling]
min_replicas = 1
max_replicas = 3
cooldown = 30
replica_concurrency = 20
scaling_metric = "concurrency_utilization"
scaling_target = 80
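
With these values, each replica accepts up to 20 concurrent calls and autoscaling tops out at 3 replicas, so the deployment can serve roughly 60 concurrent calls before requests queue; raise max_replicas if you expect more traffic.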

Conclusion

This tutorial provides a foundation for implementing voice features and expanding into image and vision capabilities. Pipecat offers an extensible, open-source framework for building voice-enabled apps, while Cerebrium provides seamless deployment and autoscaling with pay-as-you-go compute.

Tag us as @cerebriumai to showcase your work and join our Slack or Discord communities for questions and feedback.