Mistral 7B with vLLM
Deploy Mistral 7B with vLLM
This example is only compatible with CLI v1.20 and later. If you are using an older version of the CLI, run pip install --upgrade cerebrium to upgrade to the latest version.
In this tutorial, we’ll show you how to deploy Mistral 7B using the popular vLLM inference framework.
To see the final implementation, you can view it here.
Basic Setup
Developing models with Cerebrium is similar to developing on a virtual machine or Google Colab, making conversion straightforward. Make sure you have the Cerebrium package installed and are logged in. If not, check our docs here.
First, create your project:
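A minimal sketch of the init step, assuming you name the project mistral-vllm (any name works):

```bash
# Scaffold a new Cerebrium project; "mistral-vllm" is an example name
cerebrium init mistral-vllm
```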
Add these Python packages to the [cerebrium.dependencies.pip] section in your cerebrium.toml file:
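For this example the only runtime dependencies are vLLM and Pydantic; the version pins below are illustrative rather than taken from the original project:

```toml
[cerebrium.dependencies.pip]
vllm = "latest"
pydantic = "latest"
```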
Create a main.py file for our Python code. This simple implementation can be done in a single file. First, let's define our request object:
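Something along these lines works; the class name Item and the optional sampling parameters other than prompt are assumptions for illustration:

```python
from typing import Optional
from pydantic import BaseModel


class Item(BaseModel):
    # prompt is required; requests without it fail Pydantic validation
    prompt: str
    # the remaining sampling parameters are optional and fall back to defaults
    temperature: Optional[float] = 0.8
    top_p: Optional[float] = 0.75
    max_tokens: Optional[int] = 256
```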
We use Pydantic for data validation. The prompt parameter is required, while the others are optional with default values. If prompt is missing from the request, users receive an automatic error message.
vLLM Implementation
Model Setup
We load the model outside the predict function since it only needs to be loaded once at startup, not with every request. The predict function simply passes input parameters from the request to the model and returns the generated outputs.
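A minimal sketch of that structure using vLLM; the model id, default values, and the exact handler signature Cerebrium expects are assumptions and may differ from the original example:

```python
from vllm import LLM, SamplingParams

# Loaded once at startup, outside the request handler
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1", dtype="bfloat16")


def predict(prompt, temperature=0.8, top_p=0.75, max_tokens=256):
    # Build sampling parameters from the request values
    sampling_params = SamplingParams(
        temperature=temperature, top_p=top_p, max_tokens=max_tokens
    )

    # Run generation and return the text of each completion
    outputs = llm.generate([prompt], sampling_params)
    return {"result": [output.outputs[0].text for output in outputs]}
```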
Deploy
Configure your compute and environment settings in cerebrium.toml:
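A configuration along these lines should work; the section names follow the usual cerebrium.toml layout, but the GPU type and resource values below are assumptions you should adjust for your own account:

```toml
[cerebrium.deployment]
name = "mistral-vllm"
python_version = "3.10"

[cerebrium.hardware]
gpu = "AMPERE_A10"
gpu_count = 1
cpu = 4
memory = 16.0
```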
Deploy the model using this command:
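Run the deploy from the project directory:

```bash
cerebrium deploy
```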
After deployment, make this request:
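A sample request with curl; the endpoint URL below is a placeholder, so use the URL and API key printed when the deploy completes:

```bash
curl -X POST <your-endpoint-url>/predict \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital city of France?"}'
```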
The endpoint returns results in this format:
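The exact fields depend on your Cerebrium version, but the response generally looks something like this (values shown are illustrative):

```json
{
  "run_id": "<generated-run-id>",
  "result": ["Paris is the capital city of France."],
  "run_time_ms": 1234
}
```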