Real-time Voice AI Agent
Deploy a real-time voice AI agent
In this tutorial, I am going to create a real-time voice AI agent that can respond to any query in speech, in ~500ms. This is an extremely flexible implementation where you can swap in any large language model (LLM), text-to-speech (TTS) model, and speech-to-text (STT) model of your liking. This is extremely useful for voice-based use cases such as customer service bots, receptionists, and many more.
To create this application, we use Pipecat, an open-source framework for voice and multimodal conversational AI that handles much of the functionality we need, such as managing user interruptions and processing audio data. We will speak with our voice AI agent via a WebRTC transport, using Daily (the creators of Pipecat), and will deploy this application on Cerebrium to show how it handles deploying and scaling our application seamlessly.
You can find the final version of the code here.
Cerebrium setup
If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get set up.
In your IDE, run the following command to create our Cerebrium starter project: cerebrium init 2-realtime-voice-agent. This creates two files:
- main.py - Our entrypoint file where our code lives
- cerebrium.toml - A configuration file that contains all our build and environment settings

Add the following pip packages and hardware requirements near the bottom of your cerebrium.toml. These will be used to create our deployment environment.
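As a rough sketch of what those sections might look like (the image name, hardware values, and package set below are assumptions; check the repository linked above for the exact file):

```toml
[cerebrium.deployment]
# Assumption: illustrative image name; use the Deepgram-enabled image from the example repo.
docker_base_image_url = "daily-co/pipecat-deepgram-base:latest"

[cerebrium.hardware]
# A single GPU with enough VRAM for Llama 3 8B plus the local Deepgram models.
gpu = "AMPERE_A10"
cpu = 4
memory = 18.0
gpu_count = 1

[cerebrium.dependencies.pip]
# A plausible package set for this tutorial; pin versions for reproducible builds.
"pipecat-ai[daily,openai,silero,deepgram]" = "latest"
aiohttp = "latest"
requests = "latest"
vllm = "latest"
huggingface_hub = "latest"
```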
You will also see that we specify a Docker base image above. The reason for this is that Daily has supplied a Docker image containing local Deepgram Speech-to-Text (STT) and Text-to-Speech (TTS) models. This helps us achieve our low latency, since everything runs locally rather than over the network.
Custom Dockerfiles are not supported yet but are in the works and will be released soon; this is just a very early preview of how they will work.
Pipecat setup
In this example, we will use Llama 3 8B as our LLM and serve it via vLLM. To use Llama 3, we need to be authenticated with Hugging Face.
To authenticate ourselves, we need to go to Hugging Face and accept the model permissions for Llama 3 8B if we haven’t already. It usually takes 30 minutes or less for the request to be approved.
In your Cerebrium dashboard, you can add your Hugging Face token as a secret by navigating to “Secrets” in the sidebar. For the sake of this tutorial, I called mine “HF_TOKEN”. We can now access this value at runtime without exposing it in our code.
You can then add the following code to your main.py:
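Here is a sketch of that code. The model name and download directory are assumptions (aside from port 5000, which the tutorial uses), so check the linked repository for the exact version:

```python
import os
import subprocess
import time

# Work around a disk I/O bug in outlines (a vLLM dependency); see the GitHub issue linked below.
os.environ["OUTLINES_CACHE_DIR"] = "/tmp/.outlines"

# Assumption: illustrative model name and cache path.
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
DOWNLOAD_DIR = "/persistent-storage/model-cache"

def start_vllm_server() -> subprocess.Popen:
    """Launch an OpenAI-compatible vLLM server as a background process, retrying on failure."""
    while True:
        process = subprocess.Popen(
            [
                "python", "-m", "vllm.entrypoints.openai.api_server",
                "--model", MODEL_NAME,
                "--port", "5000",                # 8000 is taken by Cerebrium
                "--download-dir", DOWNLOAD_DIR,  # cached weights make later cold starts quicker
            ]
        )
        time.sleep(5)  # give the server a moment to come up
        if process.poll() is None:
            return process  # still running, so the launch succeeded
        # The launch failed (a known issue when starting vLLM instances in quick
        # succession); wait five seconds and try again.
        time.sleep(5)

vllm_process = start_vllm_server()
```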
Pipecat currently doesn’t support locally instantiated models and requires them to follow the OpenAI-compatible API format. We therefore run the vLLM server locally on our instance as a background process. We monitor the background process to make sure it launched successfully, since there seems to be a bug when rapidly starting multiple vLLM instances; if it doesn’t launch correctly, we wait 5 seconds before trying again. We also set the OUTLINES_CACHE_DIR environment variable to work around a disk I/O bug in outlines, a library that vLLM uses; the GitHub issue is here.
Note that we run the vLLM server on port 5000 (port 8000 is automatically used by Cerebrium) and set the model download directory so that subsequent cold starts are much quicker.
Now we implement the Pipecat framework by instantiating the various components. Create a function called main with the following code:
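Here is a minimal sketch of that function. Import paths and service arguments follow Pipecat’s API at the time of writing and may differ in your version; the local service URLs, voice, and system prompt are assumptions:

```python
import aiohttp

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import (
    LLMAssistantResponseAggregator,
    LLMUserResponseAggregator,
)
from pipecat.services.deepgram import DeepgramSTTService, DeepgramTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.vad.vad_analyzer import VADParams

async def main(room_url: str, token: str):
    async with aiohttp.ClientSession() as session:
        # Transport layer: receives and sends audio from the Daily room.
        transport = DailyTransport(
            room_url,
            token,
            "Voice AI Bot",
            DailyParams(
                audio_out_enabled=True,
                vad_enabled=True,
                # Respond after 200 ms of silence from the user.
                vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
            ),
        )

        # Local Deepgram STT/TTS from the Docker base image, served on port 8082.
        stt = DeepgramSTTService(api_key="", url="ws://127.0.0.1:8082/v1/listen")
        tts = DeepgramTTSService(
            aiohttp_session=session,
            api_key="",
            voice="aura-asteria-en",
            base_url="http://127.0.0.1:8082/v1/speak",
        )

        # The local vLLM server exposes an OpenAI-compatible API on port 5000.
        llm = OpenAILLMService(
            api_key="none",
            model="meta-llama/Meta-Llama-3-8B-Instruct",
            base_url="http://127.0.0.1:5000/v1",
        )

        messages = [{
            "role": "system",
            "content": "You are a helpful voice assistant. Keep your answers brief.",
        }]

        pipeline = Pipeline([
            transport.input(),                        # audio in from Daily
            stt,                                      # speech -> text
            LLMUserResponseAggregator(messages),      # accumulate user turns into context
            llm,                                      # context -> response text
            tts,                                      # text -> speech
            transport.output(),                       # audio out to Daily
            LLMAssistantResponseAggregator(messages), # record bot turns into context
        ])

        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
```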
First, in our main function, we initialize the Daily transport layer to receive and send the audio/video data from the Daily room we will connect to. You can see we pass the room_url we would like to join as well as a token to authenticate us joining programmatically. We also set our VAD stop seconds, which is the length of pause we wait for before our bot responds; in this example, we set it to 200 milliseconds.
Next, we connect to the locally running Deepgram models that are part of the Docker base image we specified in our cerebrium.toml; these run on port 8082. This is where the Pipecat framework helps convert audio data to text and vice versa. We then follow the same pattern to connect to the locally running LLM served by our vLLM server.
Lastly, we put this all together as a PipelineTask, which is what Pipecat runs. The makeup of a task is completely customizable, with support for image and vision use cases; you can read more here. Pipeline tasks come with parameters that make it easy to handle interruptions, swap models to our preference, and much more, with only a few lines of code changed.
In the code above, we import some helper functions at the top of our file to help with our implementation. You can copy the file from the GitHub repository here. Make sure to name the file helpers.py.
Daily Event Webhooks
The Daily Python SDK comes with a number of event webhooks that let you trigger functionality based on events occurring within your Daily room. We would like to handle events such as a user joining or leaving a call. Continue by adding the following code to the main() function:
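A sketch of those handlers, continuing inside main() after the task is created (the event names come from Daily’s Python SDK; the greeting prompt is illustrative):

```python
# Add these to the imports at the top of main.py.
from pipecat.frames.frames import EndFrame, LLMMessagesFrame
from pipecat.pipeline.runner import PipelineRunner

# Inside main(), after creating `task`:
@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
    # Kick off the conversation by asking the LLM to introduce itself.
    messages.append({"role": "system", "content": "Introduce yourself to the user."})
    await task.queue_frame(LLMMessagesFrame(messages))

@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
    # End the session when a participant leaves.
    await task.queue_frame(EndFrame())

@transport.event_handler("on_call_state_updated")
async def on_call_state_updated(transport, state):
    if state == "left":  # the call itself has ended
        await task.queue_frame(EndFrame())

# Run the pipeline until an EndFrame signals it to exit.
runner = PipelineRunner()
await runner.run(task)
```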
Above we handle the following events:
- When the first participant joins, we get the bot to introduce itself to the user. We do this by adding a message to the conversation.
- We add support for multiple participants to join and listen/respond to the bot.
- When a participant leaves or the call is ended, we get the bot to terminate itself.
From the code above, you will see the events are attached to the transport, which is the method of communication, in this case the meeting room. We then pass our defined pipeline task to our pipeline runner, which executes indefinitely until we signal it to exit, which in this case happens when the call ends. If you want to read further about the Pipecat architecture, you can do so here.
Starting the bot
We can run our application with a minimum number of instances by setting min_replicas in our cerebrium.toml for the optimal user experience; however, we also want to handle autoscaling. We want to make sure the vLLM server is live before the bot joins the meeting, so we make a local GET request to check. These models take about 40 seconds to load into VRAM from disk.
Additionally, we need to run the above code in a separate execution environment so that Pipecat does not instantiate multiple instances. To do this, we run it as a background process. This will be the entry point of our REST API endpoint that starts the Pipecat bot. Once the Pipecat bot has returned (i.e. the call has ended), we return a response to our API endpoint. We therefore create the following function:
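A sketch of that entry point; the function names, retry values, and response payloads are assumptions:

```python
import asyncio
import time
from multiprocessing import Process

import requests

def check_vllm_server(retries: int = 10, delay: float = 10.0) -> bool:
    """Poll the local vLLM server until it responds; the model takes ~40s to load into VRAM."""
    for _ in range(retries):
        try:
            if requests.get("http://127.0.0.1:5000/v1/models").status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(delay)
    return False

def start_bot(room_url: str, token: str = None):
    """REST entry point: run the Pipecat bot in a separate process and block until the call ends."""
    if not check_vllm_server():
        return {"error": "vLLM server did not become ready in time"}

    def target():
        asyncio.run(main(room_url, token))

    process = Process(target=target, daemon=True)
    process.start()
    process.join()  # returns once the call has ended
    return {"message": "Call ended"}
```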
That’s it! You now have a fully functioning AI bot that can interact with a user through speech in ~500ms. Imagine the possibilities!
Let us now create a user-facing UI so that you can interact with this bot.
Creating Meeting Room
Cerebrium doesn’t only have to be used to run AI-heavy workloads; it can run any Python code. We therefore define two functions for our demo: one that programmatically creates a room to join, and one that creates a temporary token, both of which will only be usable for 5 minutes. To implement this, we use the Daily REST API.
We need to get our Daily developer token from our profile. If you don’t have an account, you can sign up for one here (they have a generous free tier). You can then go to the “developers” tab to fetch your API key and add it to your Cerebrium secrets.
Below, we create a room that only lasts 5 minutes and a temporary token to access it:
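A sketch using Daily’s REST API (the secret name DAILY_TOKEN is an assumption; use whatever name you chose in your Cerebrium secrets):

```python
import os
import time

import requests

DAILY_API_URL = "https://api.daily.co/v1"
HEADERS = {
    "Authorization": f"Bearer {os.environ['DAILY_TOKEN']}",
    "Content-Type": "application/json",
}

def create_room() -> dict:
    """Create a Daily room that expires five minutes from now."""
    res = requests.post(
        f"{DAILY_API_URL}/rooms",
        headers=HEADERS,
        json={"properties": {"exp": time.time() + 300, "eject_at_room_exp": True}},
    )
    res.raise_for_status()
    room = res.json()
    return {"url": room["url"], "name": room["name"]}

def create_token(room_name: str) -> str:
    """Create a meeting token for the room, also valid for five minutes."""
    res = requests.post(
        f"{DAILY_API_URL}/meeting-tokens",
        headers=HEADERS,
        json={"properties": {"room_name": room_name, "exp": time.time() + 300}},
    )
    res.raise_for_status()
    return res.json()["token"]
```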
Deploy to Cerebrium
To deploy this application to Cerebrium, simply run cerebrium deploy in your terminal.
If it deployed successfully, you should see something like this:
We will add these endpoints to our frontend interface.
Connect frontend
We created a public fork of the Pipecat frontend to give you a nice demo of this application. You can clone the repo here.
Follow the instructions in the README.md and then populate the following variables in your .env.development.local file.
You can now run yarn dev and go to the URL: http://localhost:5173/ to test your application!
Conclusion
This tutorial is a good starting point for implementing voice AI agents in your application, as well as extending them with image and vision capabilities. Pipecat is an extensible, open-source framework that makes it easy to build applications using generative AI, and Cerebrium makes it seamless to deploy and autoscale them while you pay only for the compute you need.
Tag us as @cerebriumai so we can see what you build, and please feel free to ask questions or send feedback in our Slack or Discord communities.