This example is only compatible with CLI v1.20 and later. Should you be making use of an older version of the CLI, please run pip install --upgrade cerebrium to upgrade it to the latest version.

In this tutorial blog post, we will guide you through the process of implementing the TensorRT-LLM framework to serve Llama 3 8B model on the Cerebrium platform. TensorRT-LLM is a powerful framework that can be used to optimise machine learning models for inference. It can lead to significant improvements in performance, especially in terms of inference speed and throughput.

We will achieve ~1700 output tokens per second on a single Nvidia A10 instance however you can go up to ~4500 output tokens per second on a single Nvidia A100 40GB instance or even ~19,000 tokens on a H100. For further improvements, you can use speculative sampling or FP8 quantisation to increase latency and throughput. You can view the official benchmarks across different GPU types, model sizes and input/output token lengths here.

Overview

TensorRT-LLM is a specialised library within NVIDIA’s TensorRT, a high-performance deep learning inference platform. It is designed to accelerate large language models (LLMs) using NVIDIA GPUs. It can significantly improve the performance of your machine learning models however it comes at the expense of a very complicated setup process.

You are required to convert and build the model using very specific arguments that replicate your workloads as closely as possible. If you don’t configure these steps properly, you might witness subpar performance and subsequently it will become very complicated to deploy. We will cover these concepts in depth throughout the tutorial.

Cerebrium Setup

If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation here to get setup

In your IDE, run the following command to create our Cerebrium starter project: cerebrium init llama-3b-tensorrt. This creates two files:

  • main.py - Our entrypoint file where our code lives
  • cerebrium.toml - A configuration file that contains all our build and environment settings

TensorRT-LLM has a demo implementation of Llama on its Github repo which you can look at here. The first thing you will notice is that TensorRT-LLM requires Python 3.10. Subsequently, the code that converts the model weights to the TensorRT format requires a lot of memory and so we need to set this in our configuration file. Please change your cerebrium.toml file to reflect the below:

[cerebrium.deployment]
name = "llama-3b-tensorrt"
python_version = "3.10"
include = "[./*, main.py, cerebrium.toml]"
exclude = "[./example_exclude]"
cuda_version = "12"

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
gpu = "AMPERE_A10"
cpu = 3
memory = 40.0
gpu_count = 1

The most important decision to make is to decide what GPU chip you would like to run on. Larger models, longer sequence lengths and bigger batches all require more GPU memory and so if throughput is your desired metric, we recommend using a A100/H100. In this example we went with a A10 which gives a good cost/performance trade-off. Also, there is no capacity shortages and so its more stable for low-latency enterprise workloads. However, if this is a requirement for you - please reach out.

Let us then install the required pip and apt requirements. You can add the following to your cerebrium.toml

[cerebrium.dependencies.pip]
transformers = "latest"
torch = ">=2.0.0"
pydantic = "latest"
huggingface-hub = "latest"
flax = "latest"
h5py = "latest"
sentencepiece = "latest"
easydict = "latest"
mpmath = "==1.3.0"

[cerebrium.dependencies.apt]
software-properties-common = "latest"
gcc = "latest"
"g++" = "latest"
aria2 = "latest"
git = "latest"
git-lfs = "latest"
wget = "latest"
openmpi-bin = "latest"
libopenmpi-dev = "latest"

We want to install the tensorrt_llm package after the above installs and want to grab it from the Nvidia PyPI index url. To do this, we use shell commands which allows you to run command line arguments during the build process - this happens as the last step of the build process (ie: post pip, apt and conda installs).

Add the following under [cerebrium.build] in your cerebrium.toml:

shell_commands = [ "pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com https://pypi.nvidia.com/"]

We then need to write an initial code in our main.py that will:

  • Download Llama 3 7B from HuggingFace
  • Convert the model checkpoints
  • Build the TensorTRT-LLM inference engine.

At the moment, Cerebrium does not have a way to run code only during the build process (work in progress) however, one easy way for us to side step this is to check if the file output from the trtllm-build step already exists meaning its been converted.

To start we need to go to HuggingFace and accept the model permissions for Lllama 7B if we haven’t already. It takes about 30 minutes or less for them to accept your request. Since HuggingFace requires you to be authenticated to download the model weights, we need to authenticate ourselves in Cerebrium before downloading the model.

In your Cerebrium dashboard you can add your HuggingFace token as a secret by navigating to “Secrets” in the sidebar. For the sake of this tutorial I called mine “HF_AUTH_TOKEN”. We can now access these values in our code at runtime without exposing them in our code.

You can then add the following code to your main.py to download the model:

from huggingface_hub import snapshot_download
from cerebrium import get_secret

huggingface_hub.login(token=get_secret("HF_AUTH_TOKEN"))

MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
MODEL_DIR="/persistent-storage/model_input"
ENGINE_DIR="/persistent-storage/model_output"

if not os.path.exists(ENGINE_DIR):
	snapshot_download(
	      MODEL_ID,
	      local_dir=MODEL_DIR,
	      ignore_patterns=["*.pt", "*.bin"],  # using safetensors
	  )

In the above code, we login to HuggingFace using our HF_AUTH_TOKEN and download the Llama 3 7B model. We check if the ENGINE_DIR exists as a way to prevent running this code on cold start but rather only running this if it the final TensorRT-LLM engine files don’t exist.

Setup TensorRT-LLM

Next, we need to convert the downloaded model. We can use the script that exists in the (tensorRT-LLM)[https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama] repo. To download this script to your Cerebrium instance put the following code in your shell commands.

["wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/71d8d4d3dc655671f32535d6d2b60cab87f36e87/examples/llama/convert_checkpoint.py -O ./convert_checkpoint.py"]

This downloads the specific script file to your instance. Shell commands works as an array of strings so you can just add it to the existing shell command already there.

Converting and Compiling TensorRT-LLM engine

Below we convert the model into float16 format, the reason being that it results in marginally higher performance over float32. You can go even further and use the quantised model (FP8) which will give you the lowest latency.

What allows TensorRT-LLM to achieve its high throughput is that it is compiled in advance to predefined settings which you set based on your expected workloads. This therefore makes concrete choices of the CUDA kernels to execute for each operation which are then optimized for specific types and shapes of tensors for the specific hardware it runs on.

So we need to specify the maximum input and output lengths as well as the typical batch size. The closer these values are to production, the higher our throughput will be. There are many different options you can pass to the command trtllm-build to tune the engine for your specific workload, we selected just two plugins that accelerate two core components. You can read more about the plugin options here.

We then need to run the convert_checkpoint script and then run the trtllm-build script in order to build the TensorTRT-LLM model. You can add the following code to your main.py:

import tensorrt_llm
import subprocess
import torch

MAX_INPUT_LEN, MAX_OUTPUT_LEN = 256, 256
MAX_BATCH_SIZE = (
    128
)

if not os.path.exists(ENGINE_DIR):
 #snapshot_download(
	      MODEL_ID,
 				...
  )

 # Build the TRT engine
	convert_checkpoint = f"""
	python convert_checkpoint.py --model_dir {MODEL_DIR} \\
	                            --output_dir ./model_ckpt \\
	                            --dtype float16
	"""

	SIZE_ARGS = f"--max_batch_size={MAX_BATCH_SIZE} --max_input_len={MAX_INPUT_LEN} --max_output_len={MAX_OUTPUT_LEN}"
	build_engine = f"""
	trtllm-build --checkpoint_dir ./model_ckpt --output_dir {ENGINE_DIR} \\
	          --tp_size=1 --workers=1 \\
	          --max_batch_size={MAX_BATCH_SIZE} --max_input_len={MAX_INPUT_LEN} --max_output_len={MAX_OUTPUT_LEN} \\
	          --gemm_plugin=float16 --gpt_attention_plugin=float16
	"""

	print("Building engine...")
	subprocess.run(convert_checkpoint, shell=True, check=True)
	subprocess.run(build_engine, shell=True, check=True)
	print("\\nEngine built successfully! You can find it at: ", ENGINE_DIR)
else:
    print("Engine already exists at: ", ENGINE_DIR)

You will see we run these command line arguments as a subprocess. The reason I did it like this and not as shell commands is:

Currently Cerebrium doesn’t support Secrets in shell commands and I need the model to be downloaded before I can continue with the other model conversion steps. It seems much cleaner to reuse variables and use subprocesses than squash everything in the cerebrium.toml file. Model Instantiation Now that our model is converted with our specifications, let us initialise the model and set it up based on our requirements. This code will run on every cold start and takes roughly ~10-15s to load the model into GPU memory. If the container is warm, it will run your predict function immediately which we talk about in the next section.

Above your predict function, add the following code.

from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# LLaMA models do not have a padding token, so we use the EOS token
tokenizer.add_special_tokens(
    {"pad_token": tokenizer.eos_token}
)
# and then we add it from the left, to minimize impact on the output
tokenizer.padding_side = "left"
pad_id = tokenizer.pad_token_id
end_id = tokenizer.eos_token_id

runner_kwargs = dict(
    engine_dir=f"{ENGINE_DIR}",
    lora_dir=None,
    rank=tensorrt_llm.mpi_rank(),
)

model = ModelRunner.from_dir(**runner_kwargs)

We need to pass the input that a user sends into our prompt template and cater for special tokens. Add the the following function that will handle this for us

def parse_input(
        tokenizer,
        input_text,
        prompt_template=None,
        add_special_tokens=True,
        max_input_length=923
    ):

    # Apply prompt template if provided
    if prompt_template is not None:
        input_text = prompt_template.format(input_text=input_text)

    # Encode the text to input IDs
    input_ids = tokenizer.encode(
        input_text,
        add_special_tokens=add_special_tokens,
        truncation=True,
        max_length=max_input_length,
    )

    # Convert to tensor
    input_ids_tensor = torch.tensor([input_ids], dtype=torch.int32)  # Add batch dimension

    return input_ids_tensor

Before we get to our predict function that runs at runtime we need to define our Pyandtic object that will make sure user requests conform to this standard as well as have default values.

from typing import Optional
from pydantic import BaseModel

class Item(BaseModel):
    prompt: str
    temperature: Optional[float] = 0.95
    top_k: Optional[int] = 100
    top_p: Optional[float] = 1.0
    repetition_penalty: Optional[float] = 1.05
    num_tokens: Optional[int] = 250
    prompt_template: Optional[str] = "user\\n{input_text}\\nmodel\\n"

Inference function

Lastly, let us bring this all together with our predict function

def predict(prompt, temperature, top_k, top_p, repetition_penalty, num_tokens, prompt_template, run_id, logger):
    item = Item(
        prompt=prompt,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        num_tokens=num_tokens,
        prompt_template=prompt_template
    )

    stop_words_list = None
    bad_words_list = None

    batch_input_ids = parse_input(
        tokenizer=tokenizer,
        input_text=item.prompt,
        prompt_template=item.prompt_template
    )
    input_length = batch_input_ids[0].size(0)

    time_begin = time.time()
    with torch.no_grad():
        outputs = model.generate(
            batch_input_ids,
            max_new_tokens=item.num_tokens,
            max_attention_window_size=None,
            sink_token_length=None,
            end_id=end_id,
            pad_id=pad_id,
            temperature=item.temperature,
            top_k=item.top_k,
            top_p=item.top_p,
            num_beams=1,
            repetition_penalty=item.repetition_penalty,
            stop_words_list=stop_words_list,
            bad_words_list=bad_words_list,
            output_sequence_lengths=True,
            return_dict=True,
        )
        torch.cuda.synchronize()

    time_total = time.time() - time_begin

    output_ids = outputs["output_ids"]
    sequence_lengths = outputs["sequence_lengths"]

    # Decode the output
    output_begin = input_length
    output_end = sequence_lengths
    output_text = tokenizer.decode(output_ids[0][0][output_begin:output_end].tolist())

    return {
        "response_txt": output_text,
        "latency_s": time_total,
    }

Now that our code is deployed, we can deploy the application with the command: cerebrium deploy.

On initial deploy, it will take about 15-20 minutes since besides installing all your packages and dependancies, it will download the model and convert it to the TensorRT-LLM format. Once completed, it should output a curl which you can copy and paste to test your inference endpoint.

curl --location 'https://api.cortex.cerebrium.ai/v4/p-xxxxxx/predict' \\
--header 'Content-Type: application/json' \\
--header 'Authorization: Bearer <YOUR TOKEN HERE>' \\
--data '{"prompt": "Tell me about yourself?"}'

TensorRT-LLM is one of the top performing inference frameworks on the market and especially if you know details about your expected future workloads. You should now have a low latency endpoint with high throughput that can auto-scale to tens of thousands of inferences all while only paying for the compute you use.

To view the final version of the code, you can look here.