OpenAI recently released GPT-OSS (gpt-oss-20b and gpt-oss-120b), two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. Available under the flexible Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, demonstrate strong tool-use capabilities, and are optimized for efficient deployment on consumer hardware.
GPT-OSS introduces some unique capabilities that set it apart from other open source LLMs:
- Mixture of Experts (MoE) Architecture: The model comes in 20B and 120B parameter variants, but uses MoE to keep active parameters low while maintaining strong capabilities.
- MXFP4 Quantization: A novel 4-bit floating point format specifically designed for MoE layers, enabling efficient serving.
- Attention Sinks: A special attention mechanism that allows for longer context lengths without degrading output quality.
- Harmony Response Format: Built-in support for structured outputs like chain-of-thought reasoning and tool use. You can see some examples from OpenAI here.
Please note that, as of 6th August 2025, vLLM can only run GPT-OSS on NVIDIA H100, H200, and B200 GPUs, as well as AMD MI300X, MI325X, MI355X, and Radeon AI PRO R9700.
In this tutorial, we will show the simplest way to deploy this model using vllm serve. If you would like more control, please look at our OpenAI compatible endpoint with vLLM guide.
We set the docker_base_image_url to the cuda:12.8.1-devel image, which is quite large but necessary to provide all the required packages and libraries.
We use pre-build commands to install uv (a faster Python package installer) and the required vLLM packages. These commands execute at the start of the build process, before dependency installation begins; this early execution makes them essential for setting up the build environment. You can read more here.
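As a rough illustration, here is a minimal sketch of how these two settings might look in your cerebrium.toml. The section and key names (for example pre_build_commands), the exact image tag, and the install commands are assumptions; check the Cerebrium and vLLM GPT-OSS docs for the exact values.

```toml
[cerebrium.deployment]
name = "gpt-oss-vllm"  # hypothetical app name, for illustration only
# Large image, but it ships the CUDA 12.8 toolchain that vLLM needs
docker_base_image_url = "nvidia/cuda:12.8.1-devel-ubuntu22.04"
# Assumed key name for the pre-build commands described above
pre_build_commands = [
    "curl -LsSf https://astral.sh/uv/install.sh | sh",  # install uv, a faster Python package installer
    "uv pip install --system vllm",                     # install vLLM (pin whatever version/wheel the GPT-OSS guide recommends)
]
```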
We set our hardware to use an H100 and run in the us-east-1 region.
We set the replica concurrency to 32, meaning each container on an H100 can handle 32 concurrent requests.
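A matching sketch for the hardware and scaling settings above (again, the section and key names such as compute and replica_concurrency are assumptions; consult the Cerebrium docs for the exact schema):

```toml
[cerebrium.hardware]
compute = "H100"      # one NVIDIA H100 per replica
region = "us-east-1"

[cerebrium.scaling]
replica_concurrency = 32  # each replica accepts up to 32 simultaneous requests
```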
Lastly, we use vllm serve as the container's entrypoint, which automatically turns it into an OpenAI-compatible server. This server runs on port 8000, so we just need to expose that port.
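A sketch of what that entrypoint might look like in the config (key names like entrypoint, port, and healthcheck_endpoint are assumptions; the model ID is the Hugging Face repo openai/gpt-oss-120b):

```toml
[cerebrium.runtime.custom]
# vllm serve turns the container into an OpenAI-compatible server
entrypoint = ["vllm", "serve", "openai/gpt-oss-120b"]
port = 8000                       # vLLM's default port, which we expose
healthcheck_endpoint = "/health"  # vLLM's built-in health route
```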
To deploy the above, simply run the command cerebrium deploy. You should see it create your environment and download the model. To test your endpoint, you can make the following request:
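Here is a minimal Python sketch using the openai client; the base URL and API key are placeholders, and the exact endpoint URL comes from the output of cerebrium deploy.

```python
from openai import OpenAI

# Placeholder values: substitute the base URL printed by `cerebrium deploy`
# and your Cerebrium API key.
client = OpenAI(
    base_url="https://<your-cerebrium-endpoint>/v1",
    api_key="<CEREBRIUM_API_KEY>",
)

# Stream a chat completion from the OpenAI-compatible vLLM server.
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain MXFP4 quantization in two sentences."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Because vLLM exposes the standard OpenAI API, any OpenAI-compatible client (or a plain curl request to /v1/chat/completions) will work here.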
When you make the request, a container should be spun up, the model loaded, and the output streamed back to you. As of 6th August 2025, this generates roughly 30 tokens per second.