Introduction

Falcon is a family of models released by the Technology Innovation Institute (TII) of the UAE. At the time of release, it topped the Hugging Face leaderboards while also being one of the fastest LLMs available. Additionally, TII has released the models under an Apache 2.0 license, making them available for commercial use.

The Falcon family of models is available in two sizes:

  • Falcon 7B - 7 billion parameters
  • Falcon 40B - 40 billion parameters

each of which comes in two variants:

  • The raw, pre-trained models: Falcon-7B/40B
  • The instruct models: Falcon-7B/40B-instruct,
    which have been further trained for instruct/chat capabilities.

We provide full support for the finetuning of the Falcon 7B model.
The finetuning of the Falcon 40B model is currently in testing and will be released to the public soon.

Getting Started

Creating a Project

To create a project, you will need to initialise a training configuration file and curate a dataset.
We provide an example configuration file for the Falcon 7B model here, which can be used as a starting point for your own.

Adjusting your config file

Some important parameters to take note of in your config file are the following:

Parameter         | Description                                                                                 | Required Value
------------------|---------------------------------------------------------------------------------------------|--------------------
target_modules    | The modules to apply PEFT to. For Falcon, this parameter is different to Llama, GPT2, etc.   | ["query_key_value"]
trust_remote_code | Whether to trust the remote code. Required to set up the Falcon model.                       | true
load_in_8bit      | Whether to load the model in 8-bit. Set to True for Falcon-7B.                               | true
load_in_4bit      | Whether to load the model in 4-bit. Set to True for Falcon-40B.                              | true

Additionally, you can adjust the following parameters based on your needs to optimise your training:

  • num_train_epochs
  • learning_rate
  • lr_scheduler_type
  • prompt_template (see here for more information on prompt templates)

Train your model

Training your model follows the same process as training any other model on Cerebrium: you run the same cerebrium train command.

cerebrium train --config-file <<YOUR CONFIG FILE>>

See the training page for more information on training your model.

Evaluate your model

The output of training a Falcon model on Cerebrium is an adapter file. Once you have downloaded your adapter file (see the instructions here), you can use it to load the fine-tuned model into your own code and run inference.

Deploying your model on Cortex

To load the model into your Cortex deployment, you can use the following code snippet:

# Your normal imports in Cortex:
import base64
from typing import Optional
from pydantic import BaseModel

# ADD THE FOLLOWING TO YOUR IMPORTS
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
import torch

# ADD THE FOLLOWING TO YOUR MODEL SETUP IN YOUR MAIN.PY
peft_model_id = "results/" # Path to your results/ folder which contains "adapter_model.bin" and "adapter_config.json"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             low_cpu_mem_usage=True, # You may require this flag to avoid memory issues
                                             load_in_8bit=True,
                                             trust_remote_code=True,
                                             device_map="auto")
model = PeftModel.from_pretrained(model, peft_model_id, trust_remote_code=True) # Add the adapter to the model
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Note: device_map="auto" already places the model on the GPU, so no .to("cuda") call is needed
# (calling .to("cuda") on an 8-bit model is not supported by transformers).

Your model is now ready to run inference.

If you would like control over your generation parameters, you can add them to your Item class in your main.py file. Below is an example adding max_new_tokens, which controls the maximum number of tokens to generate.

class Item(BaseModel):
    prompt: str
    max_new_tokens: Optional[int] = 250 # An example optional parameter
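
For example, if you also wanted to expose the sampling temperature (a hypothetical extra parameter, not something the example config sets), you could extend the class and forward the value to model.generate:

class Item(BaseModel):
    prompt: str
    max_new_tokens: Optional[int] = 250
    temperature: Optional[float] = 0.7 # hypothetical extra sampling parameter

# ...then, inside predict(), pass it through to generate:
# outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"),
#                          max_new_tokens=item.max_new_tokens,
#                          do_sample=True,
#                          temperature=item.temperature)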

You can then add the following code to your predict function to run inference on your model:

# ADD THE FOLLOWING TO YOUR PREDICT FUNCTION IN YOUR MAIN.PY
def predict(item, run_id, logger):
    item = Item(**item)
    # REPLACE THIS WITH YOUR TEMPLATE USED FOR TRAINING
    template = "### Instruction:\n{instruction}\n\n### Response:\n"

    question = template.format(instruction=item.prompt) # Place the prompt into the template
    inputs = tokenizer(question, return_tensors="pt")

    with torch.no_grad(): # Run inference on the model
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=item.max_new_tokens)
        result = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]
    return result
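
As a quick local sanity check, you can call predict directly with a dictionary that mirrors the request body (the prompt below is just an illustrative value):

example_item = {"prompt": "Write a short poem about the desert.", "max_new_tokens": 64}
print(predict(example_item, run_id="local-test", logger=None))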

To run Falcon, you will need the following in your requirements.txt

torch
git+https://github.com/huggingface/transformers.git
git+https://github.com/huggingface/peft.git
bitsandbytes
trl
einops

Your model can now be deployed using Cerebrium’s Cortex. Just ensure that your adapter files are in the same directory as your main.py file and run cerebrium deploy as you would for any other model.
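
Once deployed, your endpoint accepts a JSON body matching the Item class. Below is a minimal sketch using requests; the URL and authorization value are placeholders, so use the endpoint and API key shown in your cerebrium deploy output (and check the Cerebrium docs for the exact auth scheme):

import requests

url = "<YOUR-DEPLOYMENT-ENDPOINT>" # Placeholder: use the endpoint printed by cerebrium deploy
headers = {"Authorization": "<YOUR-API-KEY>", "Content-Type": "application/json"}
payload = {"prompt": "Explain LoRA adapters in one sentence.", "max_new_tokens": 100}

response = requests.post(url, headers=headers, json=payload)
print(response.json())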

Falcon 7B vs Falcon 40B

Training the Falcon 7B and Falcon 40B models on Cerebrium uses almost identical config files. The main differences are the following:

Parameter                   | Falcon 7B        | Falcon 40B
----------------------------|------------------|------------------
hf_model_path               | tiiuae/falcon-7b | tiiuae/falcon-40b
per_device_train_batch_size | 10               | 2
per_device_eval_batch_size  | 10               | 2
load_in_8bit                | True             | False
load_in_4bit                | False            | True

For the Falcon 40B model, use load_in_4bit if your GPU has less than 40GB of memory; if your GPU has more than 40GB of memory, you can use load_in_8bit instead.
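
This rule of thumb follows from the weight memory alone. A rough back-of-the-envelope estimate (weights only; activations, optimiser state and the LoRA adapter add further overhead):

# Approximate weight memory for Falcon-40B at different precisions
params = 40e9
print(f"8-bit: ~{params * 1.0 / 1e9:.0f} GB") # ~40 GB -> needs a GPU with more than 40GB of memory
print(f"4-bit: ~{params * 0.5 / 1e9:.0f} GB") # ~20 GB -> fits on a GPU with less than 40GB of memory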

Example config file

%YAML 1.2
---
training_type: "transformer" # Type of training to run. Either "diffuser" or "transformer".

name: your-falcon-7b-name # Name of the experiment.
api_key: Your API KEY HERE # Your Cerebrium API key.

# Model params:
hf_model_path: "tiiuae/falcon-7b"
model_type: "AutoModelForCausalLM"
dataset_path: /path/to/your/dataset.json # path to your local JSON dataset.
custom_tokenizer: "" # custom tokenizer from AutoTokenizer if required.
seed: 42 # random seed for reproducibility.
log_level: "WARNING" # log level for logging.

# Training params:
training_args:
  logging_steps: 100
  per_device_train_batch_size: 10
  per_device_eval_batch_size: 10
  warmup_steps: 0
  gradient_accumulation_steps: 4
  num_train_epochs: 30
  learning_rate: 2.0e-4
  group_by_length: False
  fp16: True
  max_grad_norm: 0.3
  # max_steps: 1000 # optional: use steps instead of epochs.
  lr_scheduler_type: "constant"

base_model_args: # args for loading in the base model.
  load_in_8bit: True
  device_map: "auto"
  trust_remote_code: True

peft_lora_args: # peft lora args.
  r: 32
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules: ["query_key_value"] # This has to be query_key_value for falcon
  bias: "none"
  task_type: "CAUSAL_LM"

dataset_args:
  # prompt_template: "short"
  # alternatively, a custom prompt template can be specified here as below:
  prompt_template:
    description: "A shorter template to experiment with."
    prompt_input: "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    prompt_no_input: "### Instruction:\n{instruction}\n\n### Response:\n"
    response_split: "### Response:"
  instruction_column: "prompt"
  label_column: "completion"
  context_column: "context"
  cutoff_len: 512
  train_val_ratio: 0.9