Config Files
Cerebrium’s fine-tuning functionality is in public beta and we are adding more features each week! If you run into any issues or have an urgent requirement, please reach out to support.
In order to simplify the training experience and make deployments easier, we use YAML config files. This lets you keep track of all your training parameters in one place, giving you the flexibility to train your model with the parameters you need while still keeping the deployment process streamlined.
In this section, we introduce the parameters you may want to adjust for your training. However, if you’d like to leave any of them out, we’ve set the defaults to values we’ve found work well.
If you would like to override parameters in your config file during a deployment, the --config-string option on the command line accepts a JSON string. Any parameters provided in the JSON string override the values they were assigned in your config file.
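For example, a run that keeps the config file as its base but overrides a couple of top-level values might look like the sketch below. The --config-string option comes from this page; the --config-file flag used to point at the YAML file is an assumption, so confirm the exact option name with cerebrium train --help.

```bash
# A minimal sketch, not a verbatim command: override top-level config values at run time.
# --config-string is documented above; --config-file is assumed here -- confirm the exact
# flag for pointing at your YAML file with `cerebrium train --help`.
cerebrium train \
  --config-file ./finetune-config.yaml \
  --config-string '{"name": "llama-run-2", "seed": 123}'
```

The keys in the JSON string use the same names as the YAML parameters documented in the tables below.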
For your convenience, an example of the config file is available here.
Setting up a config file
Your YAML config file can be placed anywhere on your system; just point the trainer to the file (an example command is shown at the end of this section).
In your YAML file, you need to include the following required parameters, or pass them to the cerebrium train command:
Required parameters
| Parameter | Suggested Value | Description |
|---|---|---|
| training_type | transformer | Type of training to run. Either diffuser or transformer. In this case, transformer. |
| name | | Your name for the fine-tuning run. |
| api_key | | Your Cerebrium private API key. |
| hf_model_path | decapoda-research/llama-7b-hf | Path of the base model on Hugging Face. |
| model_type | AutoModelForCausalLM | The transformers class used to load the base model. |
| dataset_path | path/to/your/dataset.json | Path to your local JSON dataset. |
| log_level | INFO | Log level for logging. |
Optional parameters
| Parameter | Suggested Value | Description |
|---|---|---|
| custom_tokenizer | ~ | Custom tokenizer from AutoTokenizer, if required. |
| seed | 42 | Random seed for reproducibility. |
| Parameter | Sub parameter | Suggested Value | Description |
|---|---|---|---|
| training_args | | | |
| | logging_steps | 10 | Number of steps between logging. |
| | per_device_train_batch_size | 15 | Batch size per GPU for training. |
| | per_device_eval_batch_size | 15 | Batch size per GPU for evaluation. |
| | warmup_steps | 0 | Number of warmup steps for the learning rate scheduler. |
| | gradient_accumulation_steps | 4 | Number of gradient accumulation steps. |
| | num_train_epochs | 50 | Number of training epochs. |
| | learning_rate | 1.0e-4 | Learning rate for training. |
| | group_by_length | False | Whether to group batches by length. |
| base_model_args | | | The kwargs for loading the base model with AutoModelForCausalLM(). |
| | load_in_8bit | True | Whether to load the model in 8-bit. |
| | device_map | "auto" | Device map for loading the model. |
| peft_lora_args | | | The PEFT LoRA kwargs for use by PeftConfig(). |
| | r | 8 | The r value for LoRA. |
| | lora_alpha | 32 | The lora_alpha value for LoRA. |
| | lora_dropout | 0.05 | The lora_dropout value for LoRA. |
| | target_modules | ["q_proj", "v_proj"] | The target_modules for LoRA. These are the suggested values for Llama. |
| | bias | "none" | Bias for LoRA. |
| | task_type | "CAUSAL_LM" | The task_type for LoRA. |
| dataset_args | | | Custom args for your dataset. |
| | prompt_template | "short" | Prompt template to use. Either "short" or "long". |
| | instruction_column | "prompt" | Column name of your prompt in the dataset.json. |
| | label_column | "completion" | Column name of your label/completion in the dataset.json. |
| | context_column | "context" | Optional column name of your context in the dataset.json. |
| | cutoff_len | 512 | Cutoff length for the prompt. |
| | train_val_ratio | 0.9 | Ratio of training to validation data in the dataset split. |
Example YAML config file
```yaml
%YAML 1.2
---
training_type: "transformer" # Type of training to run. Either "diffuser" or "transformer". In this case, "transformer".
name: test-config-file # Your name for the fine-tuning run.
api_key: <<<Your Cerebrium private API key>>>
auth_token: YOUR HUGGINGFACE API TOKEN # Optional. You will need this if you are fine-tuning Llama 2.
# Model params:
hf_model_path: "decapoda-research/llama-7b-hf"
model_type: "AutoModelForCausalLM"
dataset_path: path/to/your/dataset.json # path to your local JSON dataset.
custom_tokenizer: "" # custom tokenizer from AutoTokenizer if required.
seed: 42 # random seed for reproducibility.
log_level: "INFO" # log level for logging.
# Training params:
training_args:
  logging_steps: 10
  per_device_train_batch_size: 15
  per_device_eval_batch_size: 15
  warmup_steps: 0
  gradient_accumulation_steps: 4
  num_train_epochs: 50
  learning_rate: 1.0E-4
  group_by_length: False
base_model_args: # args for loading in the base model with AutoModelForCausalLM
  load_in_8bit: True
  device_map: "auto"
peft_lora_args: # peft lora args.
  r: 32
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules: ["q_proj", "v_proj"]
  bias: "none"
  task_type: "CAUSAL_LM"
dataset_args:
  prompt_template: "short" # Prompt template to use. Either "short" or "long".
  instruction_column: "prompt" # column name of your prompt in the dataset.json
  label_column: "completion" # column name of your label/completion in the dataset.json
  context_column: "context" # optional column name of your context in the dataset.json
  cutoff_len: 512 # cutoff length for the prompt.
  train_val_ratio: 0.9 # ratio of training to validation data in the dataset split.
```
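With the example above saved as, say, config.yaml, starting a fine-tuning run is then a single CLI call. The exact flag for pointing the trainer at the file isn’t shown on this page, so treat --config-file below as an assumption and check cerebrium train --help for the option your CLI version expects.

```bash
# Hypothetical invocation: start a fine-tuning run from the config file above.
# The --config-file flag name is an assumption; verify with `cerebrium train --help`.
cerebrium train --config-file ./config.yaml
```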