Reward Feedback Learning (Reward FL)
Introduction
Reward Feedback Learning (Reward FL) is a reinforcement learning algorithm that optimizes diffusion models against a scorer (reward model). Reward FL works as follows (a minimal code sketch follows the list):
- Sampling: For a given prompt and first frame latent, the model generates a corresponding video.
- Reward Assignment: Each video is evaluated and assigned a reward based on its facial information.
- Model Update: The model updates its parameters based on reward signals from the generated videos, reinforcing strategies that obtain higher rewards.
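Below is a minimal sketch of this sample → score → update loop. The callables `generate_video` and `reward_fn` are hypothetical placeholders for ROLL's sampler and scorer, not the actual API; the point is only the structure of one update step.

```python
import torch

def reward_fl_step(generate_video, reward_fn, optimizer):
    """One Reward FL update. `generate_video` and `reward_fn` are
    hypothetical callables standing in for ROLL's sampler and scorer."""
    # 1. Sampling: produce a video (differentiably) for the current prompt.
    video = generate_video()
    # 2. Reward assignment: score the video on its facial information.
    reward = reward_fn(video)  # scalar tensor, differentiable w.r.t. the model
    # 3. Model update: gradient *ascent* on the reward, implemented as
    #    descent on its negative, reinforcing high-reward generations.
    loss = -reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(reward.detach())
```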
Reward FL Configuration Parameters
In ROLL, the Reward FL algorithm-specific configuration parameters are as follows (roll.pipeline.diffusion.reward_fl.reward_fl_config.RewardFLConfig):
# reward fl
learning_rate: 2.5e-6
lr_scheduler_type: constant
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
warmup_steps: 10
num_train_epochs: 1
model_name: "wan2_2"
# wan2_2 related
model_paths: ./examples/wan2.2-14B-reward_fl_ds/wan22_paths.json
reward_model_path: /data/models/antelopev2/
tokenizer_path: /data/models/Wan-AI/Wan2.1-T2V-1.3B/google/umt5-xxl/
model_id_with_origin_paths: null
trainable_models: dit2
use_gradient_checkpointing_offload: true
extra_inputs: input_image
max_timestep_boundary: 1.0
min_timestep_boundary: 0.9
num_inference_steps: 8
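As an illustration of the two timestep boundaries at the bottom of this config, the sketch below shows one common way such boundaries are applied: the training timestep is drawn only from the configured fraction of the noise schedule. This mirrors the usual two-DiT Wan2.2 setup, but the function name and sampling details are assumptions, not ROLL's exact internals.

```python
import random

def sample_training_timestep(num_train_timesteps: int = 1000,
                             min_boundary: float = 0.9,
                             max_boundary: float = 1.0) -> int:
    """Pick a training timestep inside the configured boundary interval.

    With min/max of 0.9/1.0 as in the YAML above, only the noisiest ~10%
    of the schedule is trained, i.e. the slice assigned to one expert in
    a two-DiT Wan2.2-style mixture.
    """
    lo = int(num_train_timesteps * min_boundary)
    hi = int(num_train_timesteps * max_boundary)
    return random.randrange(lo, max(hi, lo + 1))
```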
Core Parameter Descriptions
- `num_train_epochs`: Number of optimization rounds per batch of samples.
- `train_batch_size`: Batch size for one training step. In DeepSpeed training, the global train batch size is `per_device_train_batch_size` * `gradient_accumulation_steps` * `world_size` (see the worked example after this list).
- `learning_rate`: Learning rate.
- `per_device_train_batch_size`: Training batch size per device.
- `gradient_accumulation_steps`: Gradient accumulation steps.
- `weight_decay`: Weight decay coefficient.
- `warmup_steps`: Learning rate warmup steps.
- `lr_scheduler_type`: Learning rate scheduler type.
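As a worked example of the DeepSpeed formula above (the world size of 8 is an assumption; the other two values come from the YAML config):

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
world_size = 8  # assumption: 8 training processes (GPUs)

global_train_batch_size = (per_device_train_batch_size
                           * gradient_accumulation_steps
                           * world_size)
print(global_train_batch_size)  # 1 * 1 * 8 = 8
```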
Wan2_2 Related Parameters
The Wan2_2-related parameters are as follows:
- `model_paths`: Path to a JSON file of model paths, e.g., `wan22_paths.json`, including high_noise_model, low_noise_model, text_encoder, and vae (an illustrative sketch follows this list).
- `tokenizer_path`: Tokenizer path. Leave empty to auto-download.
- `reward_model_path`: Reward model path, e.g., the face model.
- `max_timestep_boundary`: Maximum value of the timestep interval, ranging from 0 to 1. Default is 1. This needs to be set manually only when training mixed models with multiple DiTs, for example, Wan-AI/Wan2.2-I2V-A14B.
- `min_timestep_boundary`: Minimum value of the timestep interval, ranging from 0 to 1. Default is 0. This needs to be set manually only when training mixed models with multiple DiTs, for example, Wan-AI/Wan2.2-I2V-A14B.
- `model_id_with_origin_paths`: Model ID with origin paths, e.g., Wan-AI/Wan2.1-T2V-1.3B:diffusion_pytorch_model*.safetensors. Comma-separated.
- `trainable_models`: Models to train, e.g., dit, vae, text_encoder.
- `extra_inputs`: Additional model inputs, comma-separated.
- `use_gradient_checkpointing_offload`: Whether to offload gradient checkpointing to CPU memory.
- `num_inference_steps`: Number of inference steps; defaults to 8 for the distilled Wan2_2 model.
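For illustration, the snippet below writes a `wan22_paths.json` with the four entries named under `model_paths`. All paths are placeholders, and the exact schema is defined by the example file shipped with ROLL, so treat this only as the general shape:

```python
import json

# Placeholder paths for illustration only; substitute your local checkpoints
# and consult the wan22_paths.json shipped with the example for the exact schema.
wan22_paths = {
    "high_noise_model": "/data/models/Wan2.2/high_noise_model.safetensors",
    "low_noise_model": "/data/models/Wan2.2/low_noise_model.safetensors",
    "text_encoder": "/data/models/Wan2.2/models_t5_umt5-xxl-enc-bf16.pth",
    "vae": "/data/models/Wan2.2/Wan2.1_VAE.pth",
}

with open("wan22_paths.json", "w") as f:
    json.dump(wan22_paths, f, indent=2)
```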
Note
- The reward model is built from facial information. Please ensure that the first frame of the video contains a human face (see the sketch after this list).
- Download the reward model (antelopev2.zip) and unzip the ONNX files into the `reward_model_path` directory.
- Download the official Wan2.2 pipeline and the distilled Wan2.2 DiT safetensors, then record their locations in the JSON file referenced by `model_paths`, e.g., `wan22_paths.json`.
- Adapt your video dataset to the format of the data/example_video_dataset/metadata.csv file.
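Because the reward is derived from facial information, frames without a detectable face yield no useful signal. The sketch below shows one plausible reward built on the antelopev2 models via the public insightface API (identity similarity between the first frame and a generated frame); the reward formula and paths are assumptions, not ROLL's exact implementation.

```python
import numpy as np
from insightface.app import FaceAnalysis

# insightface looks for models under <root>/models/<name>; adjust root so the
# unzipped antelopev2 ONNX files are found (paths here are illustrative).
app = FaceAnalysis(name="antelopev2", root="/data/models/antelopev2/")
app.prepare(ctx_id=0, det_size=(640, 640))

def face_similarity_reward(first_frame: np.ndarray, generated_frame: np.ndarray) -> float:
    """Assumed reward: cosine similarity of face embeddings (BGR uint8 frames)."""
    ref_faces = app.get(first_frame)
    gen_faces = app.get(generated_frame)
    if not ref_faces or not gen_faces:
        return 0.0  # no detectable face, no reward signal
    return float(np.dot(ref_faces[0].normed_embedding,
                        gen_faces[0].normed_embedding))
```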
Reference Models
- Official Wan2.2 pipeline: Wan-AI/Wan2.2-I2V-A14B
- Distilled Wan2.2 DiT safetensors: lightx2v/Wan2.2-Lightning
- Reward model: deepinsight/insightface
Reference Example
You can refer to the following configuration file to set up Reward FL training:
./examples/docs_examples/example_reward_fl.yaml