OpenAI compatible vLLM endpoint
Create a OpenAI compatible endpoint using the vLLM framework
In this tutorial, we will create a OpenAI compatible endpoint that can be used with any open-source mode. This allows you to use the same code as your OpenAI commands but swap in Cerebrium serverless functions with a 2 line code change.
To see the final code implementation, you can view it here
Cerebrium setup
If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get setup
In your IDE, run the following command to create our Cerebrium starter project: cerebrium init 1-openai-compatible-endpoint
. This creates two files:
- Main.py - Our entrypoint file where our code lives
- cerebrium.toml - A configuration file that contains all our build and environment settings Add the following pip packages and hardware requirements near the bottom of your cerebrium.toml. This will be used in creating our deployment environment.
To start, let us define our imports and initialize our model. In this tutorial, we will use the Llama 3.1 model by Meta which requires authorization on Hugging Face. Add your HF token to your secrets section in the Cerebrium dashboard. Add the following to your main.py
We now define the require output format OpenAI endpoints expect using Pydantic and specify our endpoint
Above the following is happening:
- We specify all the parameters we send in our function signature. You can set optional or default values. The run_id parameter we automatically add to your function with a unique identifier for every request.
- We put the entire prompt through the model and loop through the generated results.
- If stream=True, we yield a result. Since we are using a async function and yield, this is how we achieve streaming functionality on Cerebrium else we return the entire result at the end.
Deploy & Inference
To deploy the model use the following command:
Once deployed, you will see we generate a curl for this application that looks something like:
In Cerebrium, every function name is now and endpoint so to call this endpoint we would end the URL with /run. However, OpenAI compatible endpoints need to end with /chat/completions. We have made all endpoints OpenAI compatible so to call the endpoint you can do the following in another file:
Above we set our base url to the one returned by our deploy command - it ends in /run since that’s the function we are calling. Lastly, we use our JWT token, which is returned in the CURL command when you deploy or can be found in your Cerebrium dashboard under the section API Keys.
Voilà! You now have a OpenAI compatible endpoint that you can customize to your liking!