Training Falcon Models
Introduction
Falcon is a family of models released by the Technology Innovation Institute (TII) of the UAE. At the time of release, it topped the Hugging Face leaderboards while also being one of the fastest LLMs available. Additionally, TII has released the models under an Apache 2.0 license, making them available for commercial use.
The Falcon family of models is available in two sizes:
- Falcon 7B - 7 billion parameters
- Falcon 40B - 40 billion parameters

Each size comes in two variants:
- The raw, pre-trained models (Falcon-7B/40B)
- The instruct-trained models (Falcon-7B/40B-instruct), which have been further trained for instruct/chat capabilities.
We provide full support for the finetuning of the Falcon 7B model.
The finetuning of the Falcon 40B model is currently in testing and will be released to the public soon.
Getting Started
Creating a Project
To create a project, you will need to initialise a training configuration file and curate a dataset.
We provide an example configuration file for the Falcon 7B model here, which can be used as a starting point for your own configuration file.
Adjusting your config file
Some important parameters to take note of in your config file are the following:
Parameter | Description | Required Value |
---|---|---|
target_modules | The modules to apply PEFT to. For Falcon, this parameter differs from that of Llama, GPT-2, etc. | ["query_key_value"] |
trust_remote_code | Whether to trust remote code. Required to set up the Falcon model. | true |
load_in_8bit | Whether to load the model in 8-bit precision. Set to true when training Falcon-7B. | true (Falcon-7B) |
load_in_4bit | Whether to load the model in 4-bit precision. Set to true when training Falcon-40B. | true (Falcon-40B) |
Additionally, you can adjust the following parameters based on your needs to optimise your training:
- num_train_epochs
- learning_rate
- lr_scheduler_type
- prompt_template (see here for more information on prompt templates)
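As a quick reference, the snippet below sketches how these values might appear in your config file. The values shown are illustrative and match the full example config at the end of this page:

# Illustrative excerpt - see the full example config at the end of this page
training_args:
  num_train_epochs: 30
  learning_rate: 2.0e-4
  lr_scheduler_type: "constant"
base_model_args:
  load_in_8bit: True            # required for Falcon-7B
  trust_remote_code: True       # required to set up the Falcon model
peft_lora_args:
  target_modules: ["query_key_value"]  # must be query_key_value for Falcon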
Train your model
Training your model follows the same process as training any other model on Cerebrium: you’ll need to run the same cerebrium train command.
cerebrium train --config-file <<YOUR CONFIG FILE>>
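For example, if your config file were saved as config.yaml (the filename here is just an example), you would run:

cerebrium train --config-file config.yaml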
See the training page for more information on training your model.
Evaluate your model
The output of training a Falcon model on Cerebrium is an adapter file. Once you have downloaded your adapter file (see the instructions here), you can use it together with the base model to load your fine-tuned model in your own code and run inference on it.
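After downloading, your results folder should contain the adapter weights and config, roughly like the following (exact filenames may vary with library versions):

results/
├── adapter_config.json
└── adapter_model.bin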
Deploying your model on Cortex
To load the model into your Cortex deployment, you can use the following code snippet:
# Your normal imports in Cortex:
import base64
from typing import Optional
from pydantic import BaseModel  # Needed for the Item class shown further below
# ADD THE FOLLOWING TO YOUR IMPORTS
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
import torch
# ADD THE FOLLOWING TO YOUR MODEL SETUP IN YOUR MAIN.PY
peft_model_id = "results/" # Path to your results/ folder which contains "adapter_model.bin" and "adapter_config.json"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    low_cpu_mem_usage=True,  # You may need this flag to avoid memory issues
    load_in_8bit=True,
    trust_remote_code=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, peft_model_id, trust_remote_code=True) # Add the adapter to the model
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = model.to("cuda")
Your model is now ready to run inference.
If you would like control over your generation parameters, you can add them to your Item
class in your main.py
file. Below is an example adding max_new_tokens
which controls the maximum number of tokens to generate.
class Item(BaseModel):
    prompt: str
    max_new_tokens: Optional[int] = 250  # An example optional parameter
You can then add the following code to your predict
function to run inference on your model:
# ADD THE FOLLOWING TO YOUR PREDICT FUNCTION IN YOUR MAIN.PY
def predict(item, run_id, logger):
    item = Item(**item)

    # REPLACE THIS WITH YOUR TEMPLATE USED FOR TRAINING
    template = "### Instruction:\n{instruction}\n\n### Response:\n"
    question = template.format(instruction=item.prompt)  # Place the prompt into the template

    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():  # Run inference on the model
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=item.max_new_tokens)
    result = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]
    return result
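If you want to expose additional generation parameters, you can follow the same pattern. The sketch below is an illustrative extension (temperature and top_p are not part of the original example), showing how extra fields on Item could be passed through to model.generate:

# Hypothetical extension: extra generation parameters on the Item class
class Item(BaseModel):
    prompt: str
    max_new_tokens: Optional[int] = 250
    temperature: Optional[float] = 0.7  # sampling temperature
    top_p: Optional[float] = 0.9        # nucleus sampling threshold

# ...and inside predict(), pass them through to generate:
outputs = model.generate(
    input_ids=inputs["input_ids"].to("cuda"),
    max_new_tokens=item.max_new_tokens,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=item.temperature,
    top_p=item.top_p,
)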
To run Falcon, you will need the following in your requirements.txt
torch
git+https://github.com/huggingface/transformers.git
git+https://github.com/huggingface/peft.git
bitsandbytes
trl
einops
Your model can now be deployed using Cerebrium’s Cortex. Just ensure that your adapter files are in the same directory as your main.py
file and run cerebrium deploy
as you would for any other model.
Falcon 7B vs Falcon 40B
Training the Falcon 7B and Falcon 40B models on Cerebrium uses almost the same config file. The main differences are the following:
Parameter | Falcon 7B | Falcon 40B |
---|---|---|
hf_model_path | tiiuae/falcon-7b | tiiuae/falcon-40b |
per_device_train_batch_size | 10 | 2 |
per_device_eval_batch_size | 10 | 2 |
load_in_8bit | True | False |
load_in_4bit | False | True |
For the Falcon 40B model, use load_in_4bit if your GPU has less than 40GB of memory; if your GPU has more than 40GB of memory, load_in_8bit can be used instead.
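For example, based on the table above, a Falcon-40B config would differ from the 7B example below roughly as follows (a sketch only; the remaining settings stay the same):

hf_model_path: "tiiuae/falcon-40b"
training_args:
  per_device_train_batch_size: 2
  per_device_eval_batch_size: 2
base_model_args:
  load_in_8bit: False
  load_in_4bit: True          # 4-bit loading for the larger 40B model
  device_map: "auto"
  trust_remote_code: True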
Example config file
%YAML 1.2
---
training_type: "transformer" # Type of training to run. Either "diffuser" or "transformer".
name: your-falcon-7b-name # Name of the experiment.
api_key: Your API KEY HERE # Your Cerebrium API key.
# Model params:
hf_model_path: "tiiuae/falcon-7b"
model_type: "AutoModelForCausalLM"
dataset_path: /path/to/your/dataset.json # path to your local JSON dataset.
custom_tokenizer: "" # custom tokenizer from AutoTokenizer if required.
seed: 42 # random seed for reproducibility.
log_level: "WARNING" # Log level for logging.
# Training params:
training_args:
  logging_steps: 100
  per_device_train_batch_size: 10
  per_device_eval_batch_size: 10
  warmup_steps: 0
  gradient_accumulation_steps: 4
  num_train_epochs: 30
  learning_rate: 2.0e-4
  group_by_length: False
  fp16: True
  max_grad_norm: 0.3
  # max_steps: 1000 # Optional: use steps instead of epochs.
  lr_scheduler_type: "constant"
base_model_args: # Args for loading in the base model.
  load_in_8bit: True
  device_map: "auto"
  trust_remote_code: True
peft_lora_args: # PEFT LoRA args.
  r: 32
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules: ["query_key_value"] # This has to be query_key_value for Falcon.
  bias: "none"
  task_type: "CAUSAL_LM"
dataset_args:
  # prompt_template: "short"
  # If you would like a custom prompt template, you can specify it here as below:
  prompt_template:
    description: "A shorter template to experiment with."
    prompt_input: "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    prompt_no_input: "### Instruction:\n{instruction}\n\n### Response:\n"
    response_split: "### Response:"
  instruction_column: "prompt"
  label_column: "completion"
  context_column: "context"
  cutoff_len: 512
  train_val_ratio: 0.9