Intro

Cerebrium supports HuggingFace Transformers through the use of pipeline. Using pipeline, the tokenizer and the model are loaded together within a single object.
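
For reference, this is roughly what pipeline usage looks like in plain Transformers, a minimal sketch using the same EleutherAI/gpt-neo-125M model deployed later in this guide:

from transformers import pipeline

# pipeline bundles the tokenizer and model for a task in a single callable object
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
print(generator("this is a test", max_new_tokens=20)[0]["generated_text"])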

Using the Transformers library, you have instant access to pre-trained models for NLP tasks. If you would like to deploy HuggingFace models that aren’t supported by pipeline, we suggest you look at using our custom Python functionality.

By the end of this guide, you’ll have an API endpoint that can handle any scale of traffic by running inference on serverless CPUs/GPUs.

Project setup

Before building, you need to set up a Cerebrium account. This is as simple as starting a new Project in Cerebrium and copying the API key, which will be used to authenticate all calls for this project.

Create a project

  1. Go to dashboard.cerebrium.ai
  2. Sign up or log in
  3. Navigate to the API Keys page
  4. You will need your private API key for deployments. Click the copy button to copy it to your clipboard.

API Key

Develop model

Now navigate to where your model code is stored. This could be in a notebook or in a plain .py file.

To start, install the Cerebrium framework by running the following command in your notebook or terminal:

pip install --upgrade cerebrium

Using a HuggingFace model is as simple as importing the HUGGINGFACE_PIPELINE model type and passing in the task and model ID. In this case, we will use the EleutherAI/gpt-neo-125M model to generate text. The call signature follows exactly the same format as the pipeline call, so you should specify at least one of task or model in your model_initialization parameter. You can also pass in any other parameters that you would normally pass to pipeline() for the specific model you have chosen.

from cerebrium import Conduit, model_type, hardware

# Create a conduit
c = Conduit(
    name='my-hf-gpt-model',
    api_key='<API_KEY>',
    flow=[
        (model_type.HUGGINGFACE_PIPELINE, {"task": "text-generation", "model": "EleutherAI/gpt-neo-125M", "max_new_tokens": 100}),
    ]
)

Then, simply call the deploy method to deploy your model. HuggingFace models will typically take a few minutes to deploy.

c.deploy()

Deployed Model

Your model is now deployed and ready for inference! Navigate to the dashboard and on the Models page you will see your model.

You can run inference using curl.

curl --location --request POST '<ENDPOINT>' \
--header 'Authorization: <API_KEY>' \
--header 'Content-Type: application/json' \
--data-raw '[<INPUT_DATA>]'

Your input data should have the same shape and types that the HF pipeline would accept for the particular model you have chosen. In this case, we should pass in a string. For input data of ["this is a test"], the response will be:

ThunderClient Response
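
If you prefer Python over curl, the same request can be made with the requests library (a sketch; <ENDPOINT> and <API_KEY> are the same placeholders as above):

import requests

# POST the same JSON list the curl example sends
response = requests.post(
    "<ENDPOINT>",
    headers={"Authorization": "<API_KEY>"},
    json=["this is a test"],
)
print(response.json())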

Navigate back to the dashboard and click on the name of the model you just deployed. You will see that an API call was made, along with its inference time. From your dashboard you can monitor your model, roll back to previous versions and see traffic.

With one line of code, your model was deployed with automatic versioning, monitoring and the ability to scale based on traffic spikes. Try deploying your own model now or check out our other frameworks.

Adding additional parameters to your pipeline

In some cases, you may need to pass additional parameters to your pipeline. Say your model is too large, so you need to offload parts of it and load it in half precision. Additionally, access to the model requires your authorization token. Normally, when using the HuggingFace libraries, you would do this as follows:

from transformers import pipeline
import torch

hf_auth_token = "<YOUR_HF_AUTH_TOKEN>"
model_kwargs = {
    "torch_dtype": torch.float16,
    "offload_folder": "./offload",
    "offload_state_dict": True,
    "device_map": "auto",
}

pipe = pipeline(
    task="text-generation",
    model="mosaicml/mpt-7b",
    model_kwargs=model_kwargs,
    use_auth_token=hf_auth_token,
)

With Cerebrium’s conduit, we’ve kept things simple so that you can easily pass any parameters you would typically place in your pipeline(). When deploying with the conduit, the implementation of the example above is as easy as:

model_kwargs = {
    "torch_dtype": "torch.float16",
    "offload_folder": "./offload",
    "offload_state_dict": True,
    "device_map": "auto",
}
hf_auth_token = "<YOUR_HF_AUTH_TOKEN>"

c = Conduit(
    "your-name-for-your-model",
    "<API_KEY>",
    [
        (
            model_type.HUGGINGFACE_PIPELINE,
            {
                "task": "text-generation",
                "model": "mosaicml/mpt-7b",
                "model_kwargs": model_kwargs,
                "use_auth_token": hf_auth_token,
            },
        ),
    ],
)
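
As before, deploying this conduit is a single call:

c.deploy()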

Changing runtime parameters

Sometimes, you may want to change the runtime parameters of your model from one request to the next. For example, you may want to change the number of tokens generated by the GPT-Neo model. To do this, change the input to your deployed model from a plain list to the following shape:

{
  "data": "this is a test",
  "parameters": {
    "max_new_tokens": 100
  }
}

Therefore, your new curl command would be:

curl --location --request POST '<ENDPOINT>' \
--header 'Authorization: <API_KEY>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "data": "this is a test",
    "parameters": {
        "max_new_tokens": 100
    }
}'

Note: if you would like to include parameters derived from the result of a processing function, follow the dictionary structure above and place them under the parameters key, rather than sending the raw input data list, as in the sketch below.
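
For example, here is a sketch of building such a payload in Python, where choose_max_tokens is a hypothetical processing function and requests is used for the call:

import requests

def choose_max_tokens(prompt: str) -> int:
    # hypothetical processing step: give longer prompts more room to generate
    return 50 if len(prompt) < 20 else 100

prompt = "this is a test"
payload = {
    "data": prompt,
    "parameters": {"max_new_tokens": choose_max_tokens(prompt)},
}

response = requests.post(
    "<ENDPOINT>",
    headers={"Authorization": "<API_KEY>"},
    json=payload,
)
print(response.json())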