# LoRA Fine-tuning Configuration Guide

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that adds trainable low-rank matrices to a pre-trained model while keeping the original weights frozen. This document explains how to configure and use LoRA fine-tuning in the ROLL framework.
## LoRA Introduction
LoRA achieves parameter-efficient fine-tuning through the following mechanisms:

- Low-Rank Matrix Decomposition: decomposes the weight update matrix into the product of two low-rank matrices (see the formula after this list)
- Parameter Efficiency: trains only a small number of additional parameters instead of all model parameters
- Easy Deployment: the fine-tuned low-rank matrices can be merged back into the original model weights
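
In the standard LoRA formulation, a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ stays frozen while the update is learned as the product of two low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$:

$$
h = W_0 x + \frac{\alpha}{r} B A x
$$

Here $r$ corresponds to `lora_rank` and $\alpha$ to `lora_alpha` below; only $A$ and $B$ receive gradient updates, which is why the number of trainable parameters drops so sharply.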
## Configuring LoRA Fine-tuning
In the ROLL framework, LoRA fine-tuning can be configured by setting relevant parameters in the YAML configuration file.
### Configuration Example
The following is a typical LoRA configuration example (from `examples/qwen2.5-7B-rlvr_megatron/rlvl_lora_zero3.yaml`):
```yaml
# LoRA global configuration
lora_target: o_proj,q_proj,k_proj,v_proj
lora_rank: 32
lora_alpha: 32

actor_train:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    lora_target: ${lora_target}
    lora_rank: ${lora_rank}
    lora_alpha: ${lora_alpha}
    model_type: ~
  training_args:
    learning_rate: 1.0e-5
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 32
    warmup_steps: 20
    num_train_epochs: 50
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
  device_mapping: list(range(0,16))
  infer_batch_size: 4

actor_infer:
  model_args:
    attn_implementation: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    lora_target: ${lora_target}
    lora_rank: ${lora_rank}
    lora_alpha: ${lora_alpha}
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.6
      enforce_eager: false
      block_size: 16
      max_model_len: 8000
  device_mapping: list(range(0,12))
  infer_batch_size: 1
```
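
Note that values such as `${lora_target}` are variable references resolved from the global keys at the top of the file, so the LoRA settings are defined once and shared by `actor_train` and `actor_infer`. Assuming the usual OmegaConf-style interpolation that this `${...}` syntax suggests, the resolution works like this:

```yaml
lora_rank: 32               # global definition

actor_train:
  model_args:
    lora_rank: ${lora_rank} # resolves to 32 when the config is loaded
```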
### Configuration Parameter Details
**`lora_target`**: specifies the model layers to apply LoRA to

- For example, `o_proj,q_proj,k_proj,v_proj` applies LoRA to the output, query, key, and value projection layers in the attention mechanism
- Can be adjusted according to the specific model structure
**`lora_rank`**: rank of the LoRA matrices

- Controls the size of the low-rank matrices
- Smaller ranks reduce the number of trainable parameters but may affect performance
- Usually set to 8, 16, 32, or 64
**`lora_alpha`**: LoRA scaling factor

- Controls the magnitude of the LoRA updates (see the combinations after this list)
- Usually set equal to `lora_rank` or a multiple of it
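
Since the standard LoRA formulation scales the update by $\alpha / r$ (see the formula in the introduction), the ratio between `lora_alpha` and `lora_rank` determines the effective update strength. Illustrative combinations, not ROLL-specific recommendations:

```yaml
# Effective update scale = lora_alpha / lora_rank
lora_rank: 32
lora_alpha: 32    # scale 1.0 - the setting used in the example above

# Halving the rank while keeping alpha doubles the scale:
# lora_rank: 16
# lora_alpha: 32  # scale 2.0 - stronger updates from a smaller adapter
```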
LoRA parameters in `model_args`:

- `lora_target`: specifies the layers to apply LoRA to
- `lora_rank`: rank of the LoRA matrices
- `lora_alpha`: LoRA scaling factor
## LoRA Compatibility with Training Backends
Currently, LoRA fine-tuning only supports the DeepSpeed training backend:
```yaml
actor_train:
  strategy_args:
    strategy_name: deepspeed_train  # LoRA only supports deepspeed_train
```
This is because DeepSpeed provides optimization features that integrate well with LoRA.
## Performance Optimization Recommendations
**Selecting Appropriate LoRA Layers**:

- Applying LoRA to attention-related layers usually works well
- The best combination of LoRA layers can be determined through experimentation (see the sketch after this list)
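
For instance, on a Qwen2.5/LLaMA-style decoder (the module names below are architecture-specific assumptions; check your model's actual layer names), two common candidates to compare are attention-only versus attention-plus-MLP targets:

```yaml
# Attention projections only (the combination used in the example above)
lora_target: o_proj,q_proj,k_proj,v_proj

# Attention + MLP projections: more trainable parameters, often better quality
# lora_target: o_proj,q_proj,k_proj,v_proj,gate_proj,up_proj,down_proj
```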
**Adjusting LoRA Parameters**:

- `lora_rank`: adjust according to model size and task complexity
- `lora_alpha`: usually set to `lora_rank` or a multiple of it
**Learning Rate Setting**:

- LoRA fine-tuning usually benefits from a higher learning rate than full-parameter fine-tuning (see the snippet after this list)
- Set to `1.0e-5` in the example above
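
If training is stable but converges slowly, one common adjustment (an illustrative starting point, not a ROLL-specific recommendation) is to raise the learning rate toward the range often used for LoRA:

```yaml
actor_train:
  training_args:
    learning_rate: 1.0e-4  # hypothetical value, 10x the example; validate on your task
```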
## Notes
- LoRA fine-tuning currently only supports the DeepSpeed training backend
- Ensure the model supports LoRA fine-tuning
- Pay attention to compatibility between gradient checkpointing and LoRA (see the snippet after this list)
- LoRA results may differ from full-parameter fine-tuning and should be evaluated on the specific task
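
On the gradient checkpointing note, the example configuration sidesteps potential issues by disabling it; whether the two features interact badly depends on the underlying training libraries rather than on ROLL itself, so verify trainability if you turn checkpointing back on:

```yaml
actor_train:
  model_args:
    disable_gradient_checkpointing: true  # the setting used in the example configuration
```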
By properly configuring LoRA fine-tuning, you can significantly reduce the number of trainable parameters and the computational resources consumed while maintaining model performance.