Achieve high throughput with the TensorRT-LLM framework
Run pip install --upgrade cerebrium to upgrade the Cerebrium CLI to the latest version, then initialize a new project with cerebrium init llama-3b-tensorrt. This creates two files:
- main.py: Our entrypoint file where our code lives.
- cerebrium.toml: A configuration file that contains all our build and environment settings.

Update your cerebrium.toml file with the following configuration:
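The original configuration block isn't reproduced here, so the following is only a rough sketch of what a cerebrium.toml for this app might look like. The section names, keys, and values are assumptions based on typical Cerebrium projects; check the current Cerebrium docs for the exact schema and pick hardware suited to your workload.

```toml
# Rough sketch only -- keys and values are assumptions, not the exact tutorial config.
[cerebrium.deployment]
name = "llama-3b-tensorrt"
python_version = "3.10"

[cerebrium.hardware]
gpu = "AMPERE_A10"   # any GPU with enough memory for the built engine
cpu = 4
memory = 32.0

[cerebrium.dependencies.pip]
transformers = "latest"
huggingface_hub = "latest"
```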
Next, we need to install the tensorrt_llm package from the NVIDIA PyPI index URL after the above installations. We'll use shell commands to run command-line arguments during the build process (they execute after the pip, apt, and conda installations).
Add the following under [cerebrium.build] in your cerebrium.toml:
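The exact snippet from the original isn't shown here; a sketch along these lines conveys the idea, assuming a shell_commands key is supported under [cerebrium.build] in your CLI version (the key name and the extra repository clone are assumptions worth verifying against the Cerebrium docs):

```toml
[cerebrium.build]
# Assumed key name for build-time shell commands; confirm against the Cerebrium docs.
shell_commands = [
    # Install TensorRT-LLM from NVIDIA's PyPI index.
    "pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com",
    # Clone the repo so the examples/llama/convert_checkpoint.py script is available later.
    "git clone https://github.com/NVIDIA/TensorRT-LLM.git /TensorRT-LLM",
]
```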
Next, let's write the code in main.py that will:
- download the model weights from Hugging Face,
- convert the checkpoint to the TensorRT-LLM format, and
- build the engine with trtllm-build.
A check on the output of the trtllm-build step exists to determine whether the model has already been converted, so the conversion only happens once.
First, go to Hugging Face and accept the model permissions for Llama 8B if you haven’t already. Approval typically takes 30 minutes or less. Since Hugging Face requires authentication to download model weights, we need to authenticate in Cerebrium before downloading the model.
In your Cerebrium dashboard, add your Hugging Face token as a secret by navigating to “Secrets” in the sidebar. For this tutorial, we’ll name it “HF_AUTH_TOKEN”. This allows us to access the token at runtime without exposing it in our code.
You can then add the following code to your main.py to download the model:
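As a rough sketch of that download step (the paths and helper name are hypothetical, and it assumes Cerebrium exposes the secret as an HF_AUTH_TOKEN environment variable at runtime):

```python
import os

from huggingface_hub import snapshot_download

# Hypothetical locations; adjust to your app's layout.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed Hugging Face repo for the model
HF_DIR = "/persistent-storage/llama-hf"


def download_model() -> str:
    """Download the Hugging Face weights once, skipping the download if they already exist."""
    if not os.path.exists(os.path.join(HF_DIR, "config.json")):
        snapshot_download(
            repo_id=MODEL_ID,
            local_dir=HF_DIR,
            token=os.environ["HF_AUTH_TOKEN"],  # secret configured in the Cerebrium dashboard
        )
    return HF_DIR
```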
We then need to run the convert_checkpoint script followed by the trtllm-build command to build the TensorRT-LLM engine. The trtllm-build command offers many options to tune the engine for your specific workload; here, we've selected two plugins that accelerate core components (learn more about plugin options here). You can add the following code to your main.py:
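A minimal sketch of that conversion and build step is shown below, assuming the TensorRT-LLM repo was cloned to /TensorRT-LLM during the build and reusing the hypothetical paths from above; the plugin flags and their values may differ between TensorRT-LLM versions:

```python
import os
import subprocess

# Hypothetical paths; keep them consistent with the download step above.
HF_DIR = "/persistent-storage/llama-hf"
CKPT_DIR = "/persistent-storage/llama-ckpt"
ENGINE_DIR = "/persistent-storage/llama-engine"
CONVERT_SCRIPT = "/TensorRT-LLM/examples/llama/convert_checkpoint.py"  # assumed clone location


def build_engine() -> str:
    """Convert the HF checkpoint and build the TensorRT-LLM engine, skipping work already done."""
    # Single-GPU builds typically produce a rank0.engine file; use it as the "already built" marker.
    if os.path.exists(os.path.join(ENGINE_DIR, "rank0.engine")):
        return ENGINE_DIR

    # Convert the Hugging Face checkpoint into the TensorRT-LLM checkpoint format.
    subprocess.run(
        [
            "python", CONVERT_SCRIPT,
            "--model_dir", HF_DIR,
            "--output_dir", CKPT_DIR,
            "--dtype", "float16",
        ],
        check=True,
    )

    # Build the engine, enabling the attention and GEMM plugins mentioned above.
    subprocess.run(
        [
            "trtllm-build",
            "--checkpoint_dir", CKPT_DIR,
            "--output_dir", ENGINE_DIR,
            "--gpt_attention_plugin", "float16",
            "--gemm_plugin", "float16",
        ],
        check=True,
    )
    return ENGINE_DIR
```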
The trtllm-build CLI itself is available because of the tensorrt_llm installation we configured in our cerebrium.toml file.
Model Instantiation
Now that our model is converted with our specifications, let us initialise the model and set it up based on our requirements. This code runs on every cold start and takes roughly 10-15 seconds to load the model into GPU memory. If the container is warm, your predict function runs immediately, which we talk about in the next section.
Above your predict function, add the following code.
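As a sketch of what that initialisation might look like, assuming the ModelRunner API from the TensorRT-LLM Python runtime and the hypothetical paths used earlier:

```python
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner  # assumed runtime API; check your TensorRT-LLM version

HF_DIR = "/persistent-storage/llama-hf"          # tokenizer files downloaded earlier
ENGINE_DIR = "/persistent-storage/llama-engine"  # engine built by build_engine()

# Runs once per cold start: load the tokenizer and the pre-built engine into GPU memory.
tokenizer = AutoTokenizer.from_pretrained(HF_DIR)
runner = ModelRunner.from_dir(engine_dir=ENGINE_DIR)
```

Your predict function can then tokenize the incoming prompt and pass the token IDs to the runner's generate method.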
Once everything is in place, deploy the app by running cerebrium deploy.
Initial deployment takes about 15-20 minutes to install packages, download the model, and convert it to the TensorRT-LLM format. Once completed, it outputs a curl command you can use to test your inference endpoint.