Real-time Voice AI Agent
Deploy a real-time voice AI agent
In this tutorial, we’ll create a real-time voice AI agent that responds to queries via speech in ~500ms. This flexible implementation lets you swap in any Large Language Model, Text-to-Speech (TTS) model, and Speech-to-Text (STT) model. It’s ideal for voice-based use cases like customer service bots and receptionists.
To create this app, we use PipeCat, an open-source framework for voice and multimodal conversational AI that handles user interruptions, audio data, and other essential functions. We communicate with our voice AI agent via WebRTC transport using Daily (PipeCat’s creators) and deploy the app on Cerebrium for seamless deployment and scaling.
You can find the final version of the code here.
Cerebrium setup
If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get set up.
In your IDE, run the following command to create our Cerebrium starter project: cerebrium init 2-realtime-voice-agent. This creates two files:
- main.py: Our entrypoint file where our code lives.
- cerebrium.toml: A configuration file that contains all our build and environment settings.
Add the following pip packages and hardware requirements near the bottom of your cerebrium.toml to create your deployment environment:
We specify a Docker base image that contains local Deepgram Speech-to-Text (STT) and Text-to-Speech (TTS) models, provided by Daily. Running everything locally instead of over the network helps achieve low latency.
Custom Dockerfiles are not supported yet, but support is in the works and will be released soon. This is just a very early preview of how it would work.
Pipecat setup
In this example, we will be using Llama 3 8B as our LLM and serving it via vLLM. To use Llama 3, we need to be authenticated via Hugging Face.
To authenticate, go to Hugging Face and accept the model permissions for Llama 3 8B if you haven’t already. Approval usually takes 30 minutes or less.
In your Cerebrium dashboard, you can add your Hugging Face token as a secret by navigating to “Secrets” in the sidebar. For the sake of this tutorial, I called mine “HF_TOKEN”. We can now access this value at runtime without exposing it in our code.
You can then add the following code to your main.py:
Since PipeCat requires models to follow the OpenAI-compatible format and doesn’t support local instantiation, we run the vLLM server locally in a background process. We monitor the process for successful launch due to a known issue with rapidly starting multiple vLLM instances, retrying after 5 seconds if needed. We set the OUTLINES_CACHE_DIR environment variable to address a disk I/O bug in outlines used by vLLM (see GitHub issue here).
Note that we run the vLLM server on port 5000 (Cerebrium automatically uses port 8000), and we set the model’s download directory so that subsequent cold starts are much quicker.
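As a rough illustration of the launch-and-retry idea described above, here is a minimal sketch using only the Python standard library. It is not the tutorial’s original code: the helper names (build_vllm_command, start_with_retry, launch_vllm), the model ID, the /persistent-storage download path, and the OUTLINES_CACHE_DIR value are all assumptions made for the example.

```python
import os
import subprocess
import sys
import time
from typing import Dict, List, Optional


def build_vllm_command(model: str, port: int, download_dir: str) -> List[str]:
    """Build the command line for vLLM's OpenAI-compatible API server."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--download-dir", download_dir,
    ]


def start_with_retry(
    cmd: List[str],
    env: Optional[Dict[str, str]] = None,
    retries: int = 3,
    retry_delay: float = 5.0,
    settle_time: float = 1.0,
) -> Optional[subprocess.Popen]:
    """Start cmd in the background; retry if it exits within settle_time."""
    for attempt in range(1, retries + 1):
        process = subprocess.Popen(cmd, env=env)
        try:
            # If the process exits this quickly, treat the launch as failed.
            process.wait(timeout=settle_time)
        except subprocess.TimeoutExpired:
            return process  # still running after settle_time
        print(f"Launch attempt {attempt} exited with code {process.returncode}")
        time.sleep(retry_delay)
    return None


def launch_vllm() -> Optional[subprocess.Popen]:
    """Hypothetical entry point; model ID and paths are assumptions."""
    env = dict(os.environ, OUTLINES_CACHE_DIR="/tmp/.outlines")
    cmd = build_vllm_command(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model ID
        port=5000,                                    # port named in the text
        download_dir="/persistent-storage/models",    # assumed cache location
    )
    return start_with_retry(cmd, env=env, retry_delay=5.0)
```

A process that survives the settle window is only a weak signal that the server started; the server may still be loading weights, so a readiness check over HTTP is still needed before sending requests.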
Now we implement the Pipecat framework by instantiating the various components. Create a function called main with the following code:
In our main function, we initialize the Daily transport layer to handle audio/video data from the Daily room. We pass the room URL and authentication token for programmatic joining. We set the voice activity detection (VAD) stop time to 200 milliseconds, which defines how long the user must pause before the bot responds.
Next, we connect to our locally running Deepgram models from our Docker base image on port 8082. The PipeCat framework handles audio-to-text and text-to-audio conversion. We then connect to our local LLM model from the vLLM server using the same pattern.
Finally, we combine everything into a PipelineTask, which PipeCat executes. Tasks are fully customizable and support Image and Vision use cases (learn more here). Pipeline tasks include parameters for handling interruptions, swapping models, and other features with minimal code changes.
The code above imports some helper functions at the top of the file. You can copy the file from the GitHub repository here. Make sure to name the file helpers.py.
Daily Event Webhooks
The Daily Python SDK comes with a lot of event webhooks where you can trigger functionality based on events occurring within your Daily room. We would like to handle events such as a user leaving/joining a call. Continue to add the following code to the main() function.
Above we handle the following events:
- When the first participant joins, we get the bot to introduce itself to the user. We do this by adding a message to the conversation.
- We add support for multiple participants to join and listen/respond to the bot.
- When a participant leaves or the call is ended, we get the bot to terminate itself.
From the code above, you will see that the events are attached to “Transport”, which is the method of communication, in this case the meeting room. We then pass our Pipeline task to the pipeline runner, which executes indefinitely until we signal it to exit. In this case, that happens when a call ends. If you want to read further about the PipeCat infrastructure, you can read more here.
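To make the pattern of attaching handlers to a transport more concrete, here is a small self-contained sketch. FakeTransport is a hypothetical stand-in written for this example, not PipeCat’s Daily transport, and the event names only mirror the ones we understand PipeCat to use; check the PipeCat documentation for the real class and signatures.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, DefaultDict, List


class FakeTransport:
    """Hypothetical stand-in for a transport that exposes named events."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[..., Awaitable[None]]]] = defaultdict(list)

    def event_handler(self, event_name: str):
        """Decorator that registers an async handler for event_name."""
        def decorator(func):
            self._handlers[event_name].append(func)
            return func
        return decorator

    async def emit(self, event_name: str, *args) -> None:
        """Call every handler registered for event_name, in order."""
        for handler in self._handlers[event_name]:
            await handler(self, *args)


async def run_demo() -> List[str]:
    transport = FakeTransport()
    log: List[str] = []
    call_ended = asyncio.Event()

    @transport.event_handler("on_first_participant_joined")
    async def on_first_participant_joined(transport, participant):
        # In the real bot, this is where an introduction message is queued.
        log.append(f"greet {participant['id']}")

    @transport.event_handler("on_participant_left")
    async def on_participant_left(transport, participant, reason):
        # In the real bot, this is where the pipeline is told to stop.
        log.append(f"stop after {participant['id']} left ({reason})")
        call_ended.set()

    # Simulate a short call: one participant joins, then leaves.
    await transport.emit("on_first_participant_joined", {"id": "user-1"})
    await transport.emit("on_participant_left", {"id": "user-1"}, "leftCall")
    await call_ended.wait()  # the runner would block here until the call ends
    return log
```

Running asyncio.run(run_demo()) walks through one simulated call. The call_ended event plays the role of the exit signal that stops the pipeline runner.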
Starting Bot
We set min_replicas in our cerebrium.toml to ensure optimal user experience while supporting autoscaling. Before the bot joins a meeting, we verify the vLLM server is running with a local GET request. Note that these models take about 40 seconds to load into VRAM from disk.
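The readiness check can be sketched as a simple polling loop. This is an illustrative version using only the standard library, not the tutorial’s original code; the function name wait_for_server is hypothetical, and the /health path on port 5000 in the usage note is an assumption about how the local vLLM server is exposed.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll url with GET requests until it returns HTTP 200 or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error: not ready yet.
            pass
        time.sleep(interval)
    return False
```

For example, wait_for_server("http://127.0.0.1:5000/health", timeout=90) would give the model time to load before the bot joins the room. Pick a timeout comfortably above the roughly 40-second load time.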
We run the code in a separate execution environment to prevent multiple PipeCat instances. This background process serves as the entry point for our REST API endpoint to start the PipeCat bot. When the call ends and the bot returns, we send a response to our API endpoint. We therefore create the following function:
That’s it! You now have a fully functioning AI bot that can interact with a user through speech in ~500ms. Imagine the possibilities!
Let us now create a user-facing UI so that you can interact with this bot.
Creating Meeting Room
Cerebrium can run any Python code, not just AI workloads. For our demo, we define two functions that use the Daily REST API to create a room and temporary token, both valid for 5 minutes.
Get your Daily developer token from your profile. If you don’t have an account, sign up here (they offer a generous free tier). Navigate to the “developers” tab to get your API key and add it to your Cerebrium Secrets.
Below we create a room that only lasts 5 minutes and a temporary token to access it.
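As a hedged sketch of what these two calls could look like, the example below builds the request bodies for Daily’s rooms and meeting-tokens REST endpoints and posts them with the standard library. The function names and the eject_at_room_exp setting are choices made for this example, not necessarily what the original code uses, and the API key is expected to come from your Cerebrium secrets.

```python
import json
import time
import urllib.request
from typing import Any, Dict

DAILY_API_BASE = "https://api.daily.co/v1"
ROOM_LIFETIME_SECONDS = 5 * 60  # room and token both expire after 5 minutes


def room_payload(now: float) -> Dict[str, Any]:
    """Request body for a room that expires 5 minutes after now."""
    return {
        "properties": {
            "exp": int(now) + ROOM_LIFETIME_SECONDS,
            "eject_at_room_exp": True,  # remove participants when the room expires
        }
    }


def token_payload(room_name: str, now: float) -> Dict[str, Any]:
    """Request body for a meeting token tied to room_name with the same expiry."""
    return {
        "properties": {
            "room_name": room_name,
            "exp": int(now) + ROOM_LIFETIME_SECONDS,
        }
    }


def daily_post(path: str, payload: Dict[str, Any], api_key: str) -> Dict[str, Any]:
    """POST a JSON payload to the Daily REST API and return the parsed response."""
    request = urllib.request.Request(
        f"{DAILY_API_BASE}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def create_room_and_token(api_key: str) -> Dict[str, str]:
    """Create a short-lived room, then a token that grants access to it."""
    now = time.time()
    room = daily_post("rooms", room_payload(now), api_key)
    token = daily_post("meeting-tokens", token_payload(room["name"], now), api_key)
    return {"room_url": room["url"], "token": token["token"]}
```

Keeping the payload builders separate from the network call makes the expiry logic easy to check without a Daily account.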
Deploy to Cerebrium
To deploy this application to Cerebrium you can simply run the command: cerebrium deploy in your terminal.
If it deployed successfully, you should see something like this:
We will add these endpoints to our frontend interface.
Connect frontend
We created a public fork of the PipeCat frontend to show you a nice demo of this application. You can clone the repo here.
Follow the instructions in the README.md and then populate the following variables in your .env.development.local:
You can now run yarn dev and go to the URL: http://localhost:5173/ to test your application!
Conclusion
This tutorial provides a foundation for implementing voice AI agents in your app and extending into image and vision capabilities. PipeCat is an extensible, open-source framework for building generative AI apps, while Cerebrium provides seamless deployment and autoscaling with pay-as-you-go compute.
Tag us as @cerebriumai to showcase your work and join our Slack or Discord communities for questions and feedback.