Hyperparameter Sweep training Llama 3.2 with WandB

When training machine learning models, finding the perfect combination of hyper parameters can feel overwhelming however if you do it well, it can turn a good model into a great one! Hyper parameter sweeps help you find the best performing model for the least amount of compute or time spent training - think of them as your systematic approach to testing every variation to uncover the best result. In this tutorial, we’ll walk through training Llama 3.2, using Wandb (Weights and Biases) to run hyper parameter sweeps to optimize its performance and we’ll leverage Cerebrium to scale our experiments across serverless GPUs, allowing us to find the best-performing model faster than ever. If you would like to see the final version of this tutorial, you can view it on Github here. Read this section if you’re unfamiliar with sweeps.

Analogy: Pizza Topping Sweep

Forget about ML for a second. Imagine you’re making pizzas, and you want to discover the most delicious combination of toppings. You can change three things about your pizza: • Type of Cheese (mozzarella, cheddar, parmesan) • Type of Sauce (tomato, pesto) • Extra Topping (pepperoni, mushrooms, olives) There are 12 possible combinations of pizzas you can make. One of them will taste the best! To find out which pizza is the tastiest, you need to try all the combinations and rate them. This process is called a hyperparameter sweep. Your three hyperparameters are the cheese, sauce, and extra topping. If you do it one pizza at a time, it could take hours. But if you had 12 ovens, you could bake all the pizzas at once and find the best one in just a few minutes! If a kitchen is a GPU, then you need 12 GPUs to run each experiment to see which cookie is the best. The power of Cerebrium is the ability to run sweeps like this on 12 different GPUs (or 1,000 GPUs if you’d like) to get you the best version of a model fast.

Setup Cerebrium

If you don’t have a Cerebrium account, you can run the following in your cli:

pip install cerebrium --upgrade
cerebrium login
cerebrium init wandb-sweep

This creates a folder with two files:

main.py - Our entrypoint file where our code lives.
cerebrium.toml - A configuration file that contains all our build and environment settings Add the following pip packages near the bottom of your cerebrium.toml. This will be used in creating our deployment environment.

We will return back to these files later but for now, we will continue the rest of this tutorial in this folder.

Setup Wandb

Weights & Biases (Wandb) is a powerful tool for tracking, visualizing, and managing machine learning experiments in real-time. It helps you log hyperparameters, metrics, and results, making it easy to compare models, optimize performance, and collaborate effectively with your team.

pip install wandb
wandb login

You should see a link printed in your terminal - click it and copy the API from the webpage back into your terminal. Add your W&B API key to Cerebrium secrets for use in your code. You can go to your Cerebrium Dashboard and navigate to the “secrets” tab in the left side bar. Add the following:

Key: WANDB_API_KEY
Value: The value you copied from the Wandb website.

Click the “Save All Changes” button to save the changes! Cerebrium Secrets

You should then be authenticated with Wandb and ready to go!

Training Script

To train with Llama 3.2, you’ll need:

Model access permission:
- Visit the Llama 3.2 model page on Hugging Face
- Accept all permissions
Hugging Face token:
- Click your profile image (top right)
- Select “Access token”
- Create a new token if needed
- Add to Cerebrium Secrets:
  - Key: HF_TOKEN
  - Value: Your Hugging Face token
- Click “Save All Changes”

Our training script adapts this Kaggle notebook. Create a requirements.txt file with these dependencies:

transformers
datasets
accelerate
peft
trl
bitsandbytes
wandb

These packages are required both locally and on Cerebrium. Update your cerebrium.toml to include:

The requirements.txt path
Hardware requirements for training
A 1-hour max timeout using response_grace_period

Add this configuration:

#existing configuration

[cerebrium.hardware]
cpu = 6
memory = 30.0
compute = "ADA_L40"

[cerebrium.scaling]
#existing configuration
response_grace_period = 3600

[cerebrium.dependencies.paths]
pip = "requirements.txt"

Install the dependencies locally:

pip install -r requirements.txt

Add this code to main.py:

import os
from typing import Dict
import wandb
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format
import torch
import bitsandbytes as bnb
from huggingface_hub import login

login(token=os.getenv("HF_TOKEN"))

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

def train_model(params: Dict):
    """
    Training function that receives parameters from the Cerebrium endpoint
    """
    # Initialize wandb
    wandb.login(key=os.getenv("WANDB_API_KEY"))
    wandb.init(
        project=params.get("wandb_project", "Llama-3.2-Customer-Support"),
        name=params.get("run_name", None),
        config=params,
    )

    # Model configuration
    base_model = params.get("base_model", "meta-llama/Llama-3.2-3B-Instruct")
    new_model = params.get("output_model_name", f"/persistent-storage/llama-3.2-3b-it-Customer-Support-{wandb.run.id}")

    # Set torch dtype and attention implementation
    torch_dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
    attn_implementation = "eager"

    # QLoRA config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        bnb_4bit_use_double_quant=True,
    )

    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        device_map="auto",
        attn_implementation=attn_implementation
    )
    tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

    # Find linear modules
    modules = find_all_linear_names(model)

    # LoRA config
    peft_config = LoraConfig(
        r=params.get("lora_r", 16),
        lora_alpha=params.get("lora_alpha", 32),
        lora_dropout=params.get("lora_dropout", 0.05),
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=modules
    )

    # Setup model
    tokenizer.chat_template = None  # Clear existing chat template
    model, tokenizer = setup_chat_format(model, tokenizer)
    model = get_peft_model(model, peft_config)

    # Load and prepare dataset
    dataset = load_dataset(
        params.get("dataset_name", "bitext/Bitext-customer-support-llm-chatbot-training-dataset"),
        split="train"
    )
    dataset = dataset.shuffle(seed=params.get("seed", 65))
    if params.get("max_samples"):
        dataset = dataset.select(range(params.get("max_samples")))

    instruction = params.get("instruction",
        """You are a top-rated customer service agent named John.
        Be polite to customers and answer all their questions."""
    )

    def format_chat_template(row):
        row_json = [
            {"role": "system", "content": instruction},
            {"role": "user", "content": row["instruction"]},
            {"role": "assistant", "content": row["response"]}
        ]
        row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
        return row

    dataset = dataset.map(format_chat_template, num_proc=4)
    dataset = dataset.train_test_split(test_size=params.get("test_size", 0.1))

    # Training arguments
    training_arguments = TrainingArguments(
        output_dir=new_model,
        per_device_train_batch_size=params.get("batch_size", 1),
        per_device_eval_batch_size=params.get("batch_size", 1),
        gradient_accumulation_steps=params.get("gradient_accumulation_steps", 2),
        optim="paged_adamw_32bit",
        num_train_epochs=params.get("epochs", 1),
        eval_strategy="steps",
        eval_steps=params.get("eval_steps", 0.2),
        logging_steps=params.get("logging_steps", 1),
        warmup_steps=params.get("warmup_steps", 10),
        learning_rate=params.get("learning_rate", 2e-4),
        fp16=params.get("fp16", False),
        bf16=params.get("bf16", False),
        group_by_length=params.get("group_by_length", True),
        report_to="wandb"
    )

    # Initialize trainer
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        peft_config=peft_config,
        args=training_arguments,
    )

    # Train and save
    model.config.use_cache = False
    trainer.train()
    model.config.use_cache = True

    # Save model
    trainer.model.save_pretrained(new_model)

    wandb.finish()
    return {"status": "success", "model_path": new_model}

You can read a deeper explanation of the training script here but here’s a high-level explanation of the code in bullet points:

This code sets up a fine-tuning pipeline for a Large Language Model (specifically Llama 3.2) using several modern training techniques:
The function takes a dictionary of parameters for flexibility in training configurations - this is our hyper parameter sweep.
We load a customer support dataset from Hugging Face and format the data into a chat template format
We implement QLoRA (Quantized Low-Rank Adaptation) for efficient fine-tuning.
We use Weights & Biases (wandb) for experiment tracking logging results to our Wandb dashboard as they are available.
At the end, we saves the final model to our Cerebrium volume and return a “success” message to show that the training was successful.

Deploy the training endpoint:

cerebrium deploy

This command:

Sets up the environment with required packages
Deploys the training script as an endpoint
Returns a POST URL (save this for later)

Cerebrium requires no special decorators or syntax - just wrap your training code in a function. The endpoint automatically scales based on request volume, making it perfect for hyperparameter sweeps.