Introduction

Meta developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Additionally, Meta has released the models under the Llama 2 Community License, which permits commercial use by organizations with fewer than 700 million monthly active users (MAU).

The Llama 2 family of models is available in three sizes:

  • Llama 2 7B - 7 billion parameters
  • Llama 2 13B - 13 billion parameters
  • Llama 2 70B - 70 billion parameters

each of which has two variants on Hugging Face:

  • The raw, pre-trained models Llama-2-7b-hf / Llama-2-13b-hf / Llama-2-70b-hf
  • The chat dialogue-tuned models Llama-2-7b-chat-hf / Llama-2-13b-chat-hf / Llama-2-70b-chat-hf

We provide full support for fine-tuning the Llama 2 family of models.
Fine-tuning of the Llama 2 70B model is currently in testing and will be released to the public soon. Contact us for early access.

Getting Started

Creating a Project

To create a project, you will need to initialise a training configuration file (below) and curate a dataset.
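
As a minimal illustration, a dataset might be a JSON file where each record supplies an instruction, an optional context, and a target completion. The exact structure (a JSON array here) and the column names (prompt, context, completion) are assumptions taken from the dataset_args defaults in the example config at the end of this page; adjust them to match your own data.

[
  {
    "prompt": "Summarise the following support ticket in one sentence.",
    "context": "Customer reports that the mobile app crashes whenever they open the billing page.",
    "completion": "The mobile app crashes when the customer opens the billing page."
  },
  {
    "prompt": "What is the capital of France?",
    "context": "",
    "completion": "The capital of France is Paris."
  }
]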

Adjusting your config file

Some important parameters to take note of in your config file are the following:

Parameter       Description                                                                      Required Value
auth_token      Your Hugging Face authentication token, needed to download the model weights    true
load_in_8bit    Whether to load the model in 8-bit. Set to True for the Llama 2 7B model.       true
load_in_4bit    Whether to load the model in 4-bit. Set to True for the Llama 2 13B model.      true
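
For reference, the sketch below shows where these keys could sit in the config, assuming the same layout as the example config at the end of this page. Placing load_in_4bit alongside load_in_8bit under base_model_args is an assumption; only one quantisation mode should be enabled at a time.

auth_token: YOUR_HUGGINGFACE_AUTH_TOKEN # token with access to the Llama 2 weights

base_model_args:
  load_in_8bit: True # 8-bit loading, e.g. for the 7B model
  # load_in_4bit: True # 4-bit loading, e.g. for the 13B model (assumed placement)
  device_map: "auto"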

Additionally, you can adjust the following parameters based on your needs to optimise your training:

  • num_train_epochs
  • learning_rate
  • lr_scheduler_type
  • prompt_template (see here for more information on prompt templates)
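
As a rough sketch, these parameters would typically live in the training_args and dataset_args sections of your config, following the layout of the example config at the end of this page. The values below are illustrative only, and lr_scheduler_type is assumed to follow the Hugging Face TrainingArguments naming:

training_args:
  num_train_epochs: 30 # total passes over your training set
  learning_rate: 0.0001 # peak learning rate
  lr_scheduler_type: "cosine" # assumed to accept any Hugging Face TrainingArguments scheduler name

dataset_args:
  prompt_template: "short" # "short", "long", or a custom template (see the templating docs)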

Config File

The config file for Llama2 is similar to the config file for Llama. The only difference is that you need to specify a Hugging Face auth token that has access to the weights. This lets Cerebrium download the weights from Hugging Face and run your fine-tuning job.

You can find an example config file here.

Inferencing with Llama2

Inferencing with Llama2 is as simple as inferencing with any other model on Cerebrium.
Once you have your adapter files downloaded, place them in your Cortex deployment directory and add the following to your main.py file:

For the model setup:


import torch  # needed for torch.float16 below
from transformers import logging, AutoTokenizer, GenerationConfig, AutoModelForCausalLM
from peft import PeftModel, PeftConfig  # Add the peft libraries we need for the adapter
# Loading in base model and tokenizer
base_model_name = "meta-llama/Llama-2-7b-hf"  # or meta-llama/Llama-2-7b-chat-hf
auth_token = "YOUR HUGGINGFACE AUTH TOKEN"

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    use_auth_token=auth_token,
    torch_dtype=torch.float16, # Load the model in 16-bit so it will fit on the A6000 GPU.
    # load_in_8bit=True, # Alternatively, Load the model in 8bit and use much larger batch sizes for significantly faster training
    device_map="auto",
)

peft_model_id = "./training-output" # where your adapter_model.bin and adapter_config.json are stored
config = PeftConfig.from_pretrained(peft_model_id)
model = PeftModel.from_pretrained(model, peft_model_id)  # Add the adapter to the model
tokenizer = AutoTokenizer.from_pretrained(base_model_name, use_auth_token=auth_token)

You can then add the following code to your predict function to run inference on your model:

def predict(item, run_id, logger):
    item = Item(**item)  # Item is the request schema (a pydantic model) defined in your Cortex main.py
    # Replace this with your template used in training
    template = "### Instruction:\n{instruction}\n\n### Response:\n"

    prompt = item.prompt
    question = template.format(instruction=prompt)
    inputs = tokenizer(question, return_tensors="pt")

    generation_config = GenerationConfig(
        top_p=item.top_p,
        top_k=item.top_k,
        num_beams=item.num_beams,
        max_new_tokens=item.max_new_tokens,
    )

    outputs = model.generate(
        input_ids=inputs["input_ids"].to("cuda"),
        generation_config=generation_config,
    )
    result = tokenizer.batch_decode(
        outputs.detach().cpu().numpy(), skip_special_tokens=True
    )[0]

    return {"Prediction": result}

And you should have a working Llama2 model on Cerebrium!
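
As a quick local sanity check before deploying, you can call predict directly with a sample payload. The Item model below is a hypothetical stand-in for the request schema defined in your own Cortex main.py; adjust the field names and defaults to match yours.

from pydantic import BaseModel

class Item(BaseModel):
    prompt: str
    top_p: float = 0.9
    top_k: int = 40
    num_beams: int = 1
    max_new_tokens: int = 256

sample = {"prompt": "Explain what LoRA fine-tuning does in one sentence."}
print(predict(sample, run_id="local-test", logger=None))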

Example Config File

%YAML 1.2
---
training_type: "transformer" # Type of training to run. Either "diffuser" or "transformer".

name: llama2 # Name of the experiment.
api_key: Your Cerebrium API key.
auth_token: YOUR HUGGINGFACE API TOKEN THAT HAS ACCESS TO THE WEIGHTS

# Model params:
hf_model_path: "meta-llama/Llama-2-7b-hf"
model_type: "AutoModelForCausalLM"
dataset_path: /path/to/your/dataset.json # path to your local JSON dataset.
custom_tokenizer: "" # custom tokenizer from AutoTokenizer if required.
seed: 42 # random seed for reproducibility.
log_level: "INFO" # logging level.

# Training params:
training_args:
  logging_steps: 10
  per_device_train_batch_size: 15
  per_device_eval_batch_size: 15
  warmup_steps: 0
  gradient_accumulation_steps: 4
  num_train_epochs: 30
  learning_rate: 0.0001
  group_by_length: False

base_model_args: # args for loading in the base model.
  load_in_8bit: True
  device_map: "auto"

peft_lora_args: # peft lora args.
  r: 8
  lora_alpha: 32
  lora_dropout: 0.05
  target_modules: ["q_proj", "v_proj"]
  bias: "none"
  task_type: "CAUSAL_LM"

dataset_args:
  prompt_template: "short" # Prompt template to use. Either "short" or "long". Otherwise look at our docs on templating
  instruction_column: "prompt"
  label_column: "completion"
  context_column: "context"
  cutoff_len: 512
  train_val_ratio: 0.9