In this tutorial, we’ll show you how to deploy Llama 3.2 3B using TensorRT-LLM’s PyTorch backend served through NVIDIA Triton Inference Server. The TensorRT + Triton setup delivers 15x higher throughput with 100% reliability compared to the baseline (vanilla deployment), while reducing latency by 7-9x across all percentiles. See the Performance Analysis section for detailed test methodology and results. You can view the final implementation here.

Why TensorRT + Triton?

Why TensorRT?

NVIDIA TensorRT is a software development kit for high-performance deep learning inference. It compiles model weights into optimized engines that run more efficiently on specific GPU hardware through CUDA-level optimizations, custom kernels, and optional quantization. TensorRT requires you to specify optimization parameters upfront - GPU architecture, batch size, precision (FP8, INT8, etc.), and input/output shapes. This specialization allows TensorRT to generate highly optimized inference engines that maximize GPU utilization, reduce latency, and lower inference costs compared to serving raw model weights.
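As a quick illustration of what “specifying parameters upfront” looks like with TensorRT-LLM (the same API we use later in this tutorial), the build limits are declared before the optimized engine is compiled. The model path and values below are placeholders for illustration, not the settings used in this deployment:
# Minimal sketch: TensorRT-LLM build limits are declared up front.
# The model path and limits below are placeholders, not this tutorial's settings.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig(
    max_input_len=2048,   # longest prompt the engine will accept
    max_batch_size=32,    # largest batch the engine is optimized for
)

# Compiling/loading the optimized engine happens when the LLM is constructed.
llm = LLM(model="/path/to/Llama-3.2-3B-Instruct", build_config=build_config)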

Why Triton?

NVIDIA Triton Inference Server streamlines production AI deployment by handling operational concerns that are critical for serving models at scale. It provides automatic request batching, health checks, metrics collection, and standardized HTTP/gRPC APIs out of the box. Triton supports multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX, etc.), offers built-in Prometheus metrics for observability, and integrates seamlessly with Kubernetes for auto-scaling. It also supports model versioning and A/B testing, and can chain multiple models into pipelines. Here is how Triton and TensorRT-LLM work together when handling requests:
  1. Client sends text via HTTP/gRPC to Triton
  2. Triton queues the request in the scheduler
  3. Triton batches incoming requests (waits for more or timeout)
  4. When batch is ready, Triton calls your Python backend
  5. TensorRT-LLM generates tokens for the entire batch in parallel on GPU
  6. Triton returns responses to clients
This setup allows multiple concurrent requests to be processed together on the GPU for maximum throughput. Now let’s combine Triton and TensorRT-LLM and see how the pieces fit together.

Basic Setup

Install the Cerebrium CLI:
pip install cerebrium
cerebrium login
Create your project:
cerebrium init tensorrt-triton-demo
cd tensorrt-triton-demo
To download the model to Cerebrium, you need to be granted access to the Llama 3.2 model on HuggingFace. Then add your HuggingFace token to your Cerebrium project secrets as HF_AUTH_TOKEN through the dashboard so we can authenticate while downloading.

Implementation

All files should be placed in the same project directory.

Triton Model Configuration

Create config.pbtxt to define Triton’s model interface (you can view a more comprehensive list of the available options here):
name: "llama3_2"
backend: "python"
max_batch_size: 128

dynamic_batching {
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "max_tokens"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  }
]

output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
This configuration tells Triton:
  • Use Python backend (runs our model.py)
  • Automatically batch up to 128 requests together for efficient GPU utilization
  • Use dynamic batching with a 100 microsecond queue delay to maximize batch sizes
  • Accept text input with optional sampling parameters
  • Run on a single GPU instance
  • Return generated text as output
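Once the server is running (after the deploy step later in this tutorial, or locally while iterating), you can sanity-check that Triton parsed this configuration using its standard model metadata and configuration endpoints. The base URL below is an assumption; replace it with your own server address (for a Cerebrium deployment, the base endpoint shown in the Test section):
# Sanity check: ask Triton for the model's metadata and parsed configuration.
# BASE_URL is an assumption (a local server); substitute your own address.
import requests

BASE_URL = "http://localhost:8000"

metadata = requests.get(f"{BASE_URL}/v2/models/llama3_2").json()
config = requests.get(f"{BASE_URL}/v2/models/llama3_2/config").json()

print(metadata["inputs"])        # should list text_input, max_tokens, temperature, top_p
print(config["max_batch_size"])  # should be 128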

Python Backend Implementation

Triton’s Python backend requires implementing a TritonPythonModel class with three key methods:
  • initialize(args): Called once when Triton loads the model. This is where you load the tokenizer and initialize TensorRT-LLM with your build configuration.
  • execute(requests): Called every time Triton has a batch ready. Triton automatically batches incoming requests (up to your configured max_batch_size) and passes them here. This method extracts prompts from each request, runs batch inference with TensorRT-LLM, and returns responses.
  • finalize(): Called when the model is being unloaded. Use this to clean up GPU memory and shut down the TensorRT-LLM engine.
Create model.py implementing Triton’s Python backend interface:
"""
Triton Python Backend for TensorRT-LLM.
"""

import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from tensorrt_llm import LLM, SamplingParams, BuildConfig
from tensorrt_llm.plugin.plugin import PluginConfig
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = f"/persistent-storage/models/{MODEL_ID}"


class TritonPythonModel:
    def initialize(self, args):
        """Initialize TensorRT-LLM with PyTorch backend."""
        print("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
        
        print("Initializing TensorRT-LLM...")
        plugin_config = PluginConfig.from_dict({
            "paged_kv_cache": True,
        })
        
        build_config = BuildConfig(
            plugin_config=plugin_config,
            max_input_len=4096,
            max_batch_size=128,  # Matches Triton max_batch_size in config.pbtxt
        )
        
        self.llm = LLM(
            model=MODEL_DIR,
            build_config=build_config,
            tensor_parallel_size=torch.cuda.device_count(),
        )
        print("✓ Model ready")
    
    def execute(self, requests):
        """
        Execute inference on batched requests.
        
        Triton automatically batches requests (up to max_batch_size: 128).
        This function processes the batch that Triton provides.
        """
        try:
            prompts = []
            sampling_params_list = []
            original_prompts = []
            
            # Extract data from each request in the batch. We need to look through requests: https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#execute
            for request in requests:
                try:
                    # Get input text - handle batched tensor structures
                    input_tensor = pb_utils.get_input_tensor_by_name(request, "text_input")
                    text_array = input_tensor.as_numpy()
                    
                    # Extract the text value, handling scalar and array layouts
                    if text_array.ndim == 0:
                        text = text_array.item()
                    else:
                        text = text_array.flat[0] if text_array.size > 0 else text_array.item()
                    
                    # Decode if bytes
                    if isinstance(text, bytes):
                        text = text.decode('utf-8')
                    elif isinstance(text, np.str_):
                        text = str(text)
                    
                    # Get optional parameters with defaults
                    max_tokens = 1024
                    if pb_utils.get_input_tensor_by_name(request, "max_tokens") is not None:
                        max_tokens_array = pb_utils.get_input_tensor_by_name(request, "max_tokens").as_numpy()
                        max_tokens = int(max_tokens_array.item() if max_tokens_array.ndim == 0 else max_tokens_array.flat[0])
                    
                    temperature = 0.8
                    if pb_utils.get_input_tensor_by_name(request, "temperature") is not None:
                        temp_array = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()
                        temperature = float(temp_array.item() if temp_array.ndim == 0 else temp_array.flat[0])
                    
                    top_p = 0.95
                    if pb_utils.get_input_tensor_by_name(request, "top_p") is not None:
                        top_p_array = pb_utils.get_input_tensor_by_name(request, "top_p").as_numpy()
                        top_p = float(top_p_array.item() if top_p_array.ndim == 0 else top_p_array.flat[0])
                    
                    # Format prompt using chat template
                    prompt = self.tokenizer.apply_chat_template(
                        [{"role": "user", "content": text}],
                        tokenize=False,
                        add_generation_prompt=True
                    )
                    
                    prompts.append(prompt)
                    original_prompts.append(prompt)
                    sampling_params_list.append(SamplingParams(
                        temperature=temperature,
                        top_p=top_p,
                        max_tokens=max_tokens,
                    ))
                except Exception as e:
                    print(f"Error processing request: {e}", flush=True)
                    prompts.append("")
                    original_prompts.append("")
                    sampling_params_list.append(SamplingParams(max_tokens=1024))
            
            # Batch inference
            if not prompts:
                return []
            
            outputs = self.llm.generate(prompts, sampling_params_list)

            # Create responses
            responses = []
            for i, output in enumerate(outputs):
                try:
                    generated_text = output.outputs[0].text
                    
                    # Strip prompt from output if included
                    if original_prompts[i] and original_prompts[i] in generated_text:
                        generated_text = generated_text.replace(original_prompts[i], "").strip()
                    
                    responses.append(pb_utils.InferenceResponse(
                        output_tensors=[pb_utils.Tensor(
                            "text_output",
                            np.array([generated_text.encode('utf-8')], dtype=object)
                        )]
                    ))
                except Exception as e:
                    print(f"Error creating response {i}: {e}", flush=True)
                    responses.append(pb_utils.InferenceResponse(
                        output_tensors=[pb_utils.Tensor(
                            "text_output",
                            np.array([f"Error: {str(e)}".encode('utf-8')], dtype=object)
                        )]
                    ))
            
            return responses
            
        except Exception as e:
            print(f"Error in execute: {e}", flush=True)
            return [
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor(
                        "text_output",
                        np.array([f"Batch error: {str(e)}".encode('utf-8')], dtype=object)
                    )]
                )
                for _ in requests
            ]
    
    def finalize(self):
        """Cleanup on shutdown."""
        if hasattr(self, 'llm'):
            self.llm.shutdown()
            torch.cuda.empty_cache()

Model Download Script

To download our model, create download_model.py:
#!/usr/bin/env python3
"""Download HuggingFace model to persistent storage."""

import os
from pathlib import Path
from huggingface_hub import snapshot_download, login

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_DIR = Path("/persistent-storage/models") / MODEL_ID


def download_model():
    """Download model if not already present."""
    hf_token = os.environ.get("HF_AUTH_TOKEN")
    
    if not hf_token:
        print("WARNING: HF_AUTH_TOKEN not set")
        return
    
    if MODEL_DIR.exists() and any(MODEL_DIR.iterdir()):
        print("✓ Model already exists")
        return
    
    print("Downloading model...")
    login(token=hf_token)
    snapshot_download(
        MODEL_ID,
        local_dir=str(MODEL_DIR),
        token=hf_token
    )
    print("✓ Model downloaded")


if __name__ == "__main__":
    download_model()
This script checks if the model exists in persistent storage before downloading to avoid redundant downloads on subsequent deployments.

Container Setup

Create a Dockerfile extending NVIDIA’s Triton container:
FROM nvcr.io/nvidia/tritonserver:25.10-trtllm-python-py3

ENV PYTHONPATH=/usr/local/lib/python3.12/dist-packages:$PYTHONPATH
ENV PYTHONDONTWRITEBYTECODE=1
ENV DEBIAN_FRONTEND=noninteractive
ENV HF_HOME=/persistent-storage/models
ENV TORCH_CUDA_ARCH_LIST=8.6

# Install dependencies
RUN apt-get update && apt-get install -y \
    git \
    git-lfs \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# huggingface_hub and transformers are used by download_model.py and model.py;
# `|| true` lets the image build continue even if this pip install fails
RUN pip install --break-system-packages \
    huggingface_hub \
    transformers \
    || true

# Create directories
RUN mkdir -p \
    /app/model_repository/llama3_2/1 \
    /persistent-storage/models \
    /persistent-storage/engines

# Copy files
COPY model.py /app/model_repository/llama3_2/1/
COPY config.pbtxt /app/model_repository/llama3_2/

EXPOSE 8000 8001 8002

CMD ["tritonserver", "--model-repository=/app/model_repository", "--http-port=8000", "--grpc-port=8001", "--metrics-port=8002"]
The Dockerfile uses NVIDIA’s official Triton container with TensorRT-LLM pre-installed, creates the model repository structure that Triton expects, and copies our application files to the correct locations.
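For reference, after the COPY steps the model repository inside the container has the layout Triton expects (one directory per model, with the config at the top level and the code inside a numbered version directory):
/app/model_repository/
  llama3_2/
    config.pbtxt      # Triton model configuration
    1/
      model.py        # Python backend implementation (version 1)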

Deployment Configuration

We now configure our container and autoscaling environment in cerebrium.toml. This file defines the hardware resources and scaling behavior:
[cerebrium.deployment]
name = "tensorrt-triton-demo"
python_version = "3.12"
disable_auth = true
include = ['./*', 'cerebrium.toml']
exclude = ['.*']
deployment_initialization_timeout = 830

[cerebrium.hardware]
cpu = 4.0
memory = 40.0
compute = "AMPERE_A10"
gpu_count = 1
provider = "aws"
region = "us-east-1"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 300
replica_concurrency = 128
scaling_metric = "concurrency_utilization"

[cerebrium.runtime.custom]
port = 8000
healthcheck_endpoint = "/v2/health/live"
readycheck_endpoint = "/v2/health/ready"
dockerfile_path = "./Dockerfile"
Key configuration details:
  • replica_concurrency = 128: Each replica can handle up to 128 concurrent requests, matching our Triton batch size
  • max_replicas = 5: Scale up to 5 replicas for peak load, giving roughly 5 × 128 = 640 concurrent requests across the deployment

Deploy

Download Model to Persistent Storage

Before deploying, we need to download the model to Cerebrium’s persistent storage. This ensures the model is available across all deployments and avoids redundant downloads during container startup. The cerebrium run command executes a Python script in a temporary container with the same environment and hardware configuration as your deployment. It has access to persistent storage at /persistent-storage, so any files written there will be available to your deployed containers. Run the download script on Cerebrium:
cerebrium run download_model.py
You should see in the logs that the model either exists or has been downloaded successfully!

Deploy the Model

Deploy the model:
cerebrium deploy
Once your application has been deployed successfully, you should see the base endpoint URL that we can use to call it; we will use this in the next section.

Test

Send a request to your deployed endpoint:
curl -X POST https://api.aws.us-east-1.cerebrium.ai/v4/<project-id>/<name>/v2/models/llama3_2/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "text_input",
        "shape": [1, 1],
        "datatype": "BYTES",
        "data": ["What is machine learning?"]
      }
    ],
    "outputs": [{"name": "text_output"}]
  }'
The endpoint returns results in this format:
{
  "outputs": [
    {
      "name": "text_output",
      "datatype": "BYTES",
      "shape": [1],
      "data": ["Machine learning is a subset of artificial intelligence (AI) that involves training algorithms..."]
    }
  ]
}
The response follows Triton’s standard inference protocol format with the generated text in the data field of the output tensor.
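If you would rather call the endpoint from Python than curl, a minimal client sending the same KServe v2 payload could look like this (the URL placeholders mirror the curl example above):
# Minimal Python client for the deployed endpoint; <project-id> and <name>
# are placeholders, exactly as in the curl example above.
import requests

URL = "https://api.aws.us-east-1.cerebrium.ai/v4/<project-id>/<name>/v2/models/llama3_2/infer"

payload = {
    "inputs": [
        {
            "name": "text_input",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["What is machine learning?"],
        }
    ],
    "outputs": [{"name": "text_output"}],
}

response = requests.post(URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["outputs"][0]["data"][0])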

Performance Analysis

Test Setup

To validate the performance improvements of TensorRT + Triton, we compared it against a vanilla HuggingFace baseline serving the same Llama 3.2 3B Instruct model. Both deployments used identical hardware (an NVIDIA A10 GPU) and were tested under the same load conditions.
Vanilla Baseline Setup:
  • Model served directly using HuggingFace Transformers with PyTorch
  • Single request processing (no batching)
  • Standard FastAPI endpoint
  • Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)
TensorRT + Triton Setup:
  • TensorRT-LLM with PyTorch backend
  • Triton Inference Server with dynamic batching (max batch size: 128)
  • Automatic request queuing and batching
  • Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)
Both deployments were tested with the same load testing parameters to ensure fair comparison.
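The exact load-testing harness isn’t included in this tutorial. As a rough illustration of the approach (many concurrent requests against the same endpoint, recording latency and success rate), a minimal script could look like the sketch below. It is hypothetical, not the tool that produced the numbers in the next section, and the URL, concurrency, and request count are placeholders:
# Hypothetical load-test sketch: fire N concurrent requests and record
# per-request latency and success. Not the harness used for the results below.
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://api.aws.us-east-1.cerebrium.ai/v4/<project-id>/<name>/v2/models/llama3_2/infer"
PAYLOAD = {
    "inputs": [{"name": "text_input", "shape": [1, 1], "datatype": "BYTES",
                "data": ["What is machine learning?"]}],
    "outputs": [{"name": "text_output"}],
}

def one_request(_):
    start = time.time()
    try:
        r = requests.post(URL, json=PAYLOAD, timeout=600)
        return time.time() - start, r.ok
    except requests.RequestException:
        return time.time() - start, False

with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(one_request, range(256)))

latencies = sorted(t for t, _ in results)
successes = sum(ok for _, ok in results)
print(f"success rate: {successes / len(results):.1%}")
print(f"p50 latency:  {statistics.median(latencies):.1f}s")
print(f"p99 latency:  {latencies[int(0.99 * len(latencies))]:.1f}s")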

Results

Metric | Vanilla Baseline | TensorRT + Triton | Improvement
Requests Per Second (RPS) | 0.83 | 12.46 | 15x faster
Success Rate | 61.6% | 100.0% | +38.4 percentage points
P50 Latency | 297.7s | 41.7s | 7.1x faster
P99 Latency | 593.2s | 79.3s | 7.5x faster
Average Latency | 376.2s | 42.4s | 8.9x faster
The TensorRT + Triton setup delivers 15x higher throughput with 100% reliability compared to the baseline, while reducing latency by 7-9x across all percentiles. The baseline’s 61.6% success rate and high latency come from processing requests sequentially without batching, leading to GPU underutilization and request timeouts. TensorRT + Triton eliminates these issues by keeping the GPU fully utilized with batched, optimized inference, resulting in 100% success rate and consistent, predictable latency. These results demonstrate that TensorRT + Triton is not just faster, but also more reliable and cost-effective for production LLM serving at scale.

Get Started

The complete implementation, including all configuration files and deployment scripts, is available in our GitHub repository. Clone the repository and follow this tutorial to deploy Llama 3.2 3B (or adapt it for your own models) with TensorRT-LLM and Triton Inference Server. You’ll have a production-ready, high-performance LLM serving endpoint in minutes.