This example is only compatible with CLI v1.20 and later. If you are using an older version of the CLI, run pip install --upgrade cerebrium to update it to the latest version.

In this tutorial, we will guide you through implementing the TensorRT-LLM framework to serve the Llama 3 8B model on the Cerebrium platform. TensorRT-LLM is a powerful framework for optimizing machine learning models for inference, leading to significant improvements in performance, especially in inference speed and throughput.

We will achieve approximately 1,700 output tokens per second on a single NVIDIA A10 instance. You can achieve up to 4,500 output tokens per second on a single NVIDIA A100 40GB instance, or even 19,000 tokens per second on an H100. For further improvements, you can use speculative sampling or FP8 quantization to reduce latency and increase throughput. View the official benchmarks across different GPU types, model sizes, and input/output token lengths here.

Overview

TensorRT-LLM is a specialized library within NVIDIA’s TensorRT, a high-performance deep learning inference platform. It is designed to accelerate large language models (LLMs) using NVIDIA GPUs. While it can significantly improve the performance of your machine learning models, it requires a complex setup process.

You must convert and build the model using specific arguments that replicate your workloads as closely as possible. Improper configuration can lead to subpar performance and deployment complications. We will cover these concepts in depth throughout this tutorial.

Cerebrium Setup

If you don’t have a Cerebrium account, you can create one by signing up here and following the documentation to get set up.

In your IDE, run the following command to create our Cerebrium starter project: cerebrium init llama-3b-tensorrt. This creates two files:

  • main.py: Our entrypoint file where our code lives.
  • cerebrium.toml: A configuration file that contains all our build and environment settings.

TensorRT-LLM has a demo implementation of Llama on its GitHub repository. TensorRT-LLM requires Python 3.10, and the code that converts model weights to the TensorRT format requires significant memory. Update your cerebrium.toml file with the following configuration:

[cerebrium.deployment]
name = "llama-3b-tensorrt"
python_version = "3.10"
include = ["./*", "main.py", "cerebrium.toml"]
exclude = ["./example_exclude"]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "AMPERE_A10"
cpu = 3
memory = 40.0
gpu_count = 1

The most important decision is choosing your GPU chip. Larger models, longer sequence lengths, and bigger batches all require more GPU memory. If throughput is your primary metric, we recommend using an A100/H100. In this example, we use an A10, which offers a good cost-to-performance ratio and stable availability for low-latency enterprise workloads. If you have specific requirements, please reach out.
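For example, if throughput is your primary concern, you could point the same hardware block at an A100 instead. The compute identifier below is an assumption, so confirm the exact value against Cerebrium's hardware documentation:

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "AMPERE_A100"  # assumed identifier; check Cerebrium's hardware docs
cpu = 3
memory = 40.0
gpu_count = 1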

Next, install the required pip and apt requirements by adding the following to your cerebrium.toml:

[cerebrium.dependencies.pip]
transformers = "latest"
torch = ">=2.0.0"
pydantic = "latest"
huggingface-hub = "latest"
flax = "latest"
h5py = "latest"
sentencepiece = "latest"
easydict = "latest"
mpmath = "==1.3.0"

[cerebrium.dependencies.apt]
software-properties-common = "latest"
gcc = "latest"
"g++" = "latest"
aria2 = "latest"
git = "latest"
git-lfs = "latest"
wget = "latest"
openmpi-bin = "latest"
libopenmpi-dev = "latest"

We need to install the tensorrt_llm package from NVIDIA's PyPI index after the installations above. We can use shell commands to run command-line instructions during the build process; these execute after the pip, apt, and conda installations.

Add the following under [cerebrium.build] in your cerebrium.toml:

shell_commands = [ "pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com https://pypi.nvidia.com/"]

We need to write initial code in our main.py that will:

  • Download Llama 3 8B from Hugging Face.
  • Convert the model checkpoints.
  • Build the TensorRT-LLM inference engine.

Currently, Cerebrium cannot run code exclusively during the build process (this is a work in progress). However, we can check whether the output files from the trtllm-build step exist to determine whether the model has already been converted.
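A minimal sketch of that check, assuming the engine is written to /persistent-storage/model_output and that trtllm-build leaves a config.json in that directory (file names can differ between TensorRT-LLM versions):

import os

ENGINE_DIR = "/persistent-storage/model_output"

# Treat the engine as already built if trtllm-build's output directory
# contains a config.json (assumed file name; adjust for your version).
def engine_exists(engine_dir: str = ENGINE_DIR) -> bool:
    return os.path.exists(os.path.join(engine_dir, "config.json"))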

First, go to Hugging Face and accept the model permissions for Llama 3 8B if you haven’t already. Approval typically takes 30 minutes or less. Since Hugging Face requires authentication to download the model weights, we need to authenticate in Cerebrium before downloading the model.

In your Cerebrium dashboard, add your Hugging Face token as a secret by navigating to “Secrets” in the sidebar. For this tutorial, we’ll name it “HF_AUTH_TOKEN”. This allows us to access the token at runtime without exposing it in our code.

You can then add the following code to your main.py to download the model:

import os
from huggingface_hub import login, snapshot_download

login(token=os.environ.get("HF_AUTH_TOKEN"))

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
MODEL_DIR = "/persistent-storage/model_input"
ENGINE_DIR = "/persistent-storage/model_output"

if not os.path.exists(ENGINE_DIR):
    snapshot_download(
        MODEL_ID,
        local_dir=MODEL_DIR,
        ignore_patterns=["*.pt", "*.bin"],  # we use the safetensors weights
    )

In the above code, we log in to Hugging Face using our HF_AUTH_TOKEN secret and download the Llama 3 8B model. We check whether ENGINE_DIR exists so that this code doesn't run on every cold start, but only when the final TensorRT-LLM engine files don't exist yet.

Setup TensorRT-LLM

Next, we need to convert the downloaded model. We can use the convert_checkpoint.py script from the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama) repo. To download this script to your Cerebrium instance, add the following to your shell commands:

["wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/71d8d4d3dc655671f32535d6d2b60cab87f36e87/examples/llama/convert_checkpoint.py -O ./convert_checkpoint.py"]

This downloads the specific script file to your instance. shell_commands is an array of strings, so you can simply append this command to the entry we added earlier, as shown below.
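Putting the two together, the build section of your cerebrium.toml would look roughly like this:

[cerebrium.build]
shell_commands = [
    "pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com",
    "wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/71d8d4d3dc655671f32535d6d2b60cab87f36e87/examples/llama/convert_checkpoint.py -O ./convert_checkpoint.py",
]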

Converting and Compiling TensorRT-LLM engine

Below, we convert the model into float16 format, which gives higher performance than float32 with marginal quality loss. For even lower latency, you can use a quantized (FP8) model.

TensorRT-LLM achieves high throughput by compiling in advance with predefined settings based on your expected workloads. This enables concrete choices of CUDA kernels for each operation, optimized for specific tensor types and shapes on your target hardware.

We need to specify the maximum input and output lengths and the typical batch size. The closer these values match production conditions, the higher our throughput will be. The trtllm-build command offers many options to tune the engine for your specific workload. Here, we’ve selected two plugins that accelerate core components. Learn more about plugin options here.

We then run the convert_checkpoint script followed by the trtllm-build command to build the TensorRT-LLM engine. Add the following code to your main.py:

import subprocess
import time  # used later to measure generation latency

import tensorrt_llm
import torch

MAX_INPUT_LEN, MAX_OUTPUT_LEN = 256, 256
MAX_BATCH_SIZE = 128

if not os.path.exists(ENGINE_DIR):
    snapshot_download(
        MODEL_ID,
        local_dir=MODEL_DIR,
        ignore_patterns=["*.pt", "*.bin"],  # same download call as above
    )

    # Convert the Hugging Face checkpoint to the TensorRT-LLM checkpoint format
    convert_checkpoint = f"""
    python convert_checkpoint.py --model_dir {MODEL_DIR} \\
                                 --output_dir ./model_ckpt \\
                                 --dtype float16
    """

    # Build the engine with the batch size and sequence lengths we expect in production
    SIZE_ARGS = f"--max_batch_size={MAX_BATCH_SIZE} --max_input_len={MAX_INPUT_LEN} --max_output_len={MAX_OUTPUT_LEN}"
    build_engine = f"""
    trtllm-build --checkpoint_dir ./model_ckpt --output_dir {ENGINE_DIR} \\
                 --tp_size=1 --workers=1 \\
                 {SIZE_ARGS} \\
                 --gemm_plugin=float16 --gpt_attention_plugin=float16
    """

    print("Building engine...")
    subprocess.run(convert_checkpoint, shell=True, check=True)
    subprocess.run(build_engine, shell=True, check=True)
    print("\nEngine built successfully! You can find it at: ", ENGINE_DIR)
else:
    print("Engine already exists at: ", ENGINE_DIR)

You will see we run these commands as subprocesses rather than as shell commands. There are two reasons for this: Cerebrium doesn't currently support secrets in shell commands, and we need the model downloaded before the conversion steps can run. Using subprocesses also allows cleaner variable reuse than putting everything in the cerebrium.toml file.

Model Instantiation

Now that our model is converted with our specifications, let us initialise it and set it up based on our requirements. This code runs on every cold start and takes roughly 10-15 seconds to load the model into GPU memory. If the container is warm, it runs your predict function immediately, which we cover in the next section.

Above your predict function, add the following code.

from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# LLaMA models do not have a padding token, so we use the EOS token
tokenizer.add_special_tokens(
    {"pad_token": tokenizer.eos_token}
)
# and then we add it from the left, to minimize impact on the output
tokenizer.padding_side = "left"
pad_id = tokenizer.pad_token_id
end_id = tokenizer.eos_token_id

runner_kwargs = dict(
    engine_dir=f"{ENGINE_DIR}",
    lora_dir=None,
    rank=tensorrt_llm.mpi_rank(),
)

model = ModelRunner.from_dir(**runner_kwargs)

We need to pass the input that a user sends into our prompt template and cater for special tokens. Add the following function, which handles this for us:

def parse_input(
        tokenizer,
        input_text,
        prompt_template=None,
        add_special_tokens=True,
        max_input_length=923
    ):

    # Apply prompt template if provided
    if prompt_template is not None:
        input_text = prompt_template.format(input_text=input_text)

    # Encode the text to input IDs
    input_ids = tokenizer.encode(
        input_text,
        add_special_tokens=add_special_tokens,
        truncation=True,
        max_length=max_input_length,
    )

    # Convert to tensor
    input_ids_tensor = torch.tensor([input_ids], dtype=torch.int32)  # Add batch dimension

    return input_ids_tensor
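
For example, calling it with the default prompt template from the Item model defined below produces a batched tensor of token IDs ready for the runner (a minimal sketch):

# Hypothetical usage; prompt_template matches the default defined below
batch_input_ids = parse_input(
    tokenizer=tokenizer,
    input_text="Tell me about yourself?",
    prompt_template="user\n{input_text}\nmodel\n",
)
print(batch_input_ids.shape)  # torch.Size([1, <sequence_length>])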

Before we get to our predict function that runs at request time, we need to define a Pydantic model. This ensures user requests conform to the expected schema and provides default values.

from typing import Optional
from pydantic import BaseModel

class Item(BaseModel):
    prompt: str
    temperature: Optional[float] = 0.95
    top_k: Optional[int] = 100
    top_p: Optional[float] = 1.0
    repetition_penalty: Optional[float] = 1.05
    num_tokens: Optional[int] = 250
    prompt_template: Optional[str] = "user\n{input_text}\nmodel\n"
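
As a quick sanity check, instantiating the model with only a prompt fills in the defaults above (a minimal sketch):

item = Item(prompt="Tell me about yourself?")
print(item.temperature, item.top_k, item.num_tokens)  # 0.95 100 250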

Inference function

Lastly, let us bring this all together with our predict function:

def predict(prompt, temperature, top_k, top_p, repetition_penalty, num_tokens, prompt_template, run_id, logger):
    item = Item(
        prompt=prompt,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        num_tokens=num_tokens,
        prompt_template=prompt_template
    )

    stop_words_list = None
    bad_words_list = None

    batch_input_ids = parse_input(
        tokenizer=tokenizer,
        input_text=item.prompt,
        prompt_template=item.prompt_template
    )
    input_length = batch_input_ids[0].size(0)

    time_begin = time.time()
    with torch.no_grad():
        outputs = model.generate(
            batch_input_ids,
            max_new_tokens=item.num_tokens,
            max_attention_window_size=None,
            sink_token_length=None,
            end_id=end_id,
            pad_id=pad_id,
            temperature=item.temperature,
            top_k=item.top_k,
            top_p=item.top_p,
            num_beams=1,
            repetition_penalty=item.repetition_penalty,
            stop_words_list=stop_words_list,
            bad_words_list=bad_words_list,
            output_sequence_lengths=True,
            return_dict=True,
        )
        torch.cuda.synchronize()

    time_total = time.time() - time_begin

    output_ids = outputs["output_ids"]
    sequence_lengths = outputs["sequence_lengths"]

    # Decode only the newly generated tokens (skip the prompt)
    output_begin = input_length
    output_end = sequence_lengths[0][0]
    output_text = tokenizer.decode(output_ids[0][0][output_begin:output_end].tolist())

    return {
        "response_txt": output_text,
        "latency_s": time_total,
    }

Now that our code is ready, deploy the app with the command: cerebrium deploy.

Initial deployment takes about 15-20 minutes to install packages, download the model, and convert it to the TensorRT-LLM format. Once completed, it outputs a curl command you can use to test your inference endpoint.

curl --location 'https://api.cortex.cerebrium.ai/v4/p-xxxxxx/predict' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <YOUR TOKEN HERE>' \
--data '{"prompt": "Tell me about yourself?"}'

TensorRT-LLM is one of the top-performing inference frameworks, especially when you know your expected workload details. You now have a low-latency endpoint with high throughput that can auto-scale to tens of thousands of inferences while only paying for the compute you use.

To view the final version of the code, you can look here.