Experiment Data Analysis and Visualization
Experiment Data Tracking
The pipeline's configuration file supports two methods for data tracking:
- TensorBoard
- Weights & Biases (wandb)
```yaml
# wandb (Weights & Biases) offers more advanced cloud-based experiment management and collaboration features.
#track_with: wandb
#tracker_kwargs:
#  api_key:
#  project: roll-agentic
#  name: ${exp_name}_frozen_lake
#  notes: "agentic_pipeline"
#  tags:
#    - agentic
#    - roll
#    - baseline
track_with: tensorboard
tracker_kwargs:
  # log_dir is the root directory for TensorBoard log files. Each experiment run will create a timestamped subdirectory here.
  log_dir: /data/oss_bucket_0/yali/llm/tensorboard/roll_exp/agentic_sokoban
```
Experiment Data Visualization
The following section uses TensorBoard as an example to illustrate how to visualize experiment data.
- Ensure TensorBoard is installed:

```shell
pip install tensorboard
```

- Launch TensorBoard. After the pipeline run completes, locate the timestamped directories for your experiment runs under the log_dir specified in the configuration (e.g., /data/oss_bucket_0/yali/llm/tensorboard/roll_exp/agentic_sokoban). Launch TensorBoard with the following command; it will scan the timestamped directory for your run logs:

```shell
tensorboard --logdir /data/oss_bucket_0/yali/llm/tensorboard/roll_exp/agentic_sokoban/{latest_date}
```
In the terminal, you will see a startup message similar to the one below (the exact version and port may differ):
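```
TensorBoard 2.x.x at http://localhost:6006/ (Press CTRL+C to quit)
```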
Open localhost:6006 in your browser to view the TensorBoard interface. If TensorBoard is running on a remote machine, make sure port forwarding is configured correctly, for example with an SSH tunnel as shown below.
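A common way to set up port forwarding is an SSH tunnel; the user and host below are placeholders for your own environment:

```shell
# Forward local port 6006 to port 6006 on the remote machine running TensorBoard
ssh -L 6006:localhost:6006 <user>@<remote-host>
```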
Algorithm Performance Metrics
Validation Phase
- val/score/mean: The average score per episode during the validation phase. Reflects the model's average performance on unseen environments.
- val/score/max / val/score/min: The maximum / minimum score per episode during the validation phase.
Value-related
- critic/lr: The learning rate of the value function (Critic). The learning rate determines the step size the optimizer uses when updating model parameters.
- critic/loss: The loss between the value network's predicted value and the true return.
- critic/value: The mean predicted value for the states in the batch, produced by the value network of the old (behavior) policy at the beginning of the current PPO iteration, i.e., during data collection. These values typically serve as the baseline when computing the advantage function.
- critic/vpred: The mean predicted value for the states in the batch, produced by the value network currently being optimized. This value is updated at every training iteration.
- critic/clipfrac: The fraction of value predictions clipped by value_clip in the value-function update (see the sketch after this list).
- critic/error: The mean squared error between the value network's predicted value and the true return.
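A minimal sketch of how these value metrics are typically computed in a PPO-style clipped value update is shown below; the tensor names and the value_clip threshold are illustrative assumptions, not the pipeline's actual implementation:

```python
import torch

def value_loss_metrics(vpred, old_values, returns, value_clip=0.2):
    """Illustrative clipped value loss for a PPO-style critic (assumed, not the pipeline's exact code)."""
    # Keep the new prediction within value_clip of the old (behavior-policy) prediction.
    vpred_clipped = old_values + (vpred - old_values).clamp(-value_clip, value_clip)
    loss_unclipped = (vpred - returns) ** 2
    loss_clipped = (vpred_clipped - returns) ** 2
    # critic/loss: mean of the element-wise maximum of the two squared errors.
    loss = 0.5 * torch.maximum(loss_unclipped, loss_clipped).mean()
    # critic/clipfrac: fraction of predictions where the clipped branch was active.
    clipfrac = (loss_clipped > loss_unclipped).float().mean()
    # critic/error: plain mean squared error between predictions and returns.
    error = loss_unclipped.mean()
    return loss, clipfrac, error
```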
Reward-related
- critic/score/mean: The mean of the raw environmental rewards.
- critic/score/max / critic/score/min: The maximum / minimum of the raw environmental rewards.
- critic/rewards/mean: The mean of normalized/clipped rewards.
- critic/rewards/max / critic/rewards/min: The maximum / minimum of normalized/clipped rewards.
- critic/advantages/mean: The mean of the advantages. Reflects how much additional reward an action taken in a given state can yield compared to the average (see the GAE sketch after this list).
- critic/advantages/max / critic/advantages/min: The maximum / minimum of the advantages.
- critic/returns/mean: The mean of the returns, i.e., the expected cumulative reward.
- critic/returns/max / critic/returns/min: The maximum / minimum of the returns.
- critic/values/mean: The mean of Value Function estimates. Reflects the model's estimation of the total future reward for a given state.
- critic/values/max / critic/values/min: The maximum / minimum of Value Function estimates.
- tokens/response_length/mean: The average length of generated responses.
- tokens/response_length/max / tokens/response_length/min: The maximum / minimum length of generated responses.
- tokens/prompt_length/mean: The average length of prompts.
- tokens/prompt_length/max / tokens/prompt_length/min: The maximum / minimum length of prompts.
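The advantage and return metrics above are commonly derived from per-step rewards and value estimates via Generalized Advantage Estimation (GAE). The sketch below assumes a single trajectory and standard gamma/lambda hyperparameters; it is an illustration, not the pipeline's exact implementation:

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Illustrative GAE for one trajectory; rewards and values are 1-D tensors of the same length."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # bootstrap with 0 at episode end
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    # critic/returns/*: advantages plus the value baseline.
    returns = advantages + values
    return advantages, returns
```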
Policy-related
- actor/lr: The learning rate of the current policy network (Actor). The learning rate determines the step size the optimizer uses when updating model parameters.
- actor/ppo_ratio_high_clipfrac: The fraction of probability ratios clipped at the upper bound during PPO policy optimization.
- actor/ppo_ratio_low_clipfrac: The fraction of probability ratios clipped at the lower bound during PPO policy optimization.
- actor/ppo_ratio_clipfrac: The overall fraction of probability ratios clipped during PPO policy optimization.
- actor/ratio_mean: The mean importance-sampling ratio of the policy network (Actor), i.e., the exponential of the difference between the new and old policy log probabilities (see the sketch after this list).
- actor/ratio_max / actor/ratio_min: The maximum / minimum of the policy network (Actor)'s ratio.
- actor/clipfrac: The clipping fraction of the policy network (Actor).
- actor/kl_loss: The KL divergence penalty term between the current policy and the reference policy. Used to prevent the policy from deviating too far from the original model.
- actor/total_loss: The weighted sum of policy gradient loss, KL divergence loss, and entropy loss (if present). This is the loss actually used for backpropagation of the model.
- actor/approxkl: The approximate KL divergence between the current policy and the old policy. Measures the step size of each policy update.
- actor/policykl: The exact KL divergence between the current policy and the old policy.
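A minimal sketch of how the ratio, clip-fraction, and approximate-KL metrics relate to the PPO policy loss; the variable names and the clip range are illustrative assumptions, not the pipeline's actual code:

```python
import torch

def ppo_policy_metrics(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    """Illustrative PPO clipped surrogate loss and the associated ratio/clip/KL metrics."""
    # actor/ratio_*: importance-sampling ratio, exp(new log prob - old log prob).
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate_unclipped = -advantages * ratio
    surrogate_clipped = -advantages * ratio.clamp(1.0 - clip_range, 1.0 + clip_range)
    # actor/pg_loss: mean of the element-wise maximum of the two surrogate terms.
    pg_loss = torch.maximum(surrogate_unclipped, surrogate_clipped).mean()
    # actor/clipfrac: fraction of samples whose ratio was clipped.
    clipfrac = (surrogate_clipped > surrogate_unclipped).float().mean()
    # actor/approxkl: a cheap estimate of the KL divergence from the log-prob difference.
    approxkl = 0.5 * ((new_log_probs - old_log_probs) ** 2).mean()
    return pg_loss, ratio.mean(), clipfrac, approxkl
```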
Evaluation Metrics
- critic/ref_log_prob/mean: The average log probability output by the reference model. Used to measure the performance baseline of the old or reference policy.
- critic/old_log_prob/mean: The average log probability output by the old policy (Actor before training). Primarily used in the PPO algorithm to measure the difference between new and old policies.
- critic/entropy/mean: The average entropy of the policy. Entropy measures the randomness or explorativeness of the policy; higher entropy indicates stronger exploration (see the sketch after this list).
- critic/reward_clip_frac: The fraction of rewards that were clipped. Indicates how many reward values were clipped; if too high, it might require adjusting the reward range or clipping threshold.
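A rough illustration of how these two quantities can be computed from token logits and raw rewards; the tensor shapes and the clipping bound are assumptions for illustration only:

```python
import torch

def entropy_and_reward_clip(logits, rewards, reward_clip=10.0):
    """Illustrative per-token entropy and reward clip fraction (assumed clip bound)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    # critic/entropy/mean: average entropy of the token distribution.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    # critic/reward_clip_frac: fraction of raw rewards that exceed the clipping bound.
    clip_frac = (rewards.abs() > reward_clip).float().mean()
    return entropy, clip_frac
```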
PPO Loss Metrics
- actor/pg_loss: The policy gradient loss in the PPO algorithm. The goal is to minimize this loss to improve the policy.
- actor/weighted_pg_loss: The weighted value of the policy gradient loss (see the sketch after this list).
- actor/valid_samples: The number of valid samples in the current batch.
- actor/total_samples: The total number of samples in the current batch (i.e., batch size).
- actor/valid_sample_ratio: The ratio of valid samples in the current batch.
- actor/sample_weights_mean: The average value of all sample weights in the batch.
- actor/sample_weights_min / actor/sample_weights_max: The minimum / maximum value of all sample weights in the batch.
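One way to picture how these sample-level metrics fit together; the masking and weighting scheme below is an assumption for illustration, not the pipeline's actual code:

```python
import torch

def weighted_pg_metrics(pg_loss_per_sample, sample_weights, valid_mask):
    """Illustrative weighted policy-gradient loss; valid_mask is a 0/1 float tensor."""
    # actor/valid_samples and actor/total_samples
    valid_samples = valid_mask.sum()
    total_samples = valid_mask.numel()
    # actor/valid_sample_ratio
    valid_sample_ratio = valid_samples / total_samples
    # actor/weighted_pg_loss: per-sample loss scaled by its weight, averaged over valid samples.
    weighted_pg_loss = (pg_loss_per_sample * sample_weights * valid_mask).sum() / valid_samples.clamp(min=1)
    return weighted_pg_loss, valid_sample_ratio, sample_weights.mean()
```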
SFT Loss Metrics
- actor/sft_loss: Supervised Fine-Tuning loss.
- actor/positive_sft_loss: Positive sample Supervised Fine-Tuning loss.
- actor/negative_sft_loss: Negative sample Supervised Fine-Tuning loss.
Framework Performance Metrics
Global System Metrics
- system/tps: Tokens Per Second. A key metric for measuring overall system throughput (see the sketch below).
- system/samples: The total number of samples processed.
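As a rough illustration (the exact token accounting is an assumption, not necessarily the pipeline's definition), tps can be thought of as:

```python
def tokens_per_second(prompt_tokens: int, response_tokens: int, step_duration_s: float) -> float:
    """Illustrative system/tps: total tokens processed divided by the wall-clock time of the step."""
    return (prompt_tokens + response_tokens) / step_duration_s
```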
Phase Latency Metrics
- time/rollout: Latency of the Data Collection (Rollout) phase.
- time/ref_log_probs_values_reward: Latency for calculating reference model log probabilities and values.
- time/old_log_probs_values: Latency for calculating old policy log probabilities and values.
- time/adv: Latency of the Advantages Calculation phase.
Execution Phases
In the following time and memory metrics, {metric_infix} will be replaced by a specific execution phase identifier, for example:
- train_step: Training phase
- generate: Text generation/inference phase
- model_update: Model parameter update/synchronization phase
- compute_log_probs: Log probabilities computation phase
- do_checkpoint: Model saving/checkpointing phase
- compute_values: Values computation phase
- compute_rewards: Rewards computation phase
Time Metrics
- time/{metric_infix}/total: Total execution time of the entire operation (from entering state_offload_manager to exiting); see the sketch after this list.
- time/{metric_infix}/execute: Execution time of the actual business logic (i.e., the yield part, such as model training, generation, etc.).
- time/{metric_infix}/onload: Time taken to load model states (strategy.load_states()) to GPU or memory.
- time/{metric_infix}/offload: Time taken to offload model states (strategy.offload_states()) from GPU or memory.
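A minimal sketch of how these four timings relate to each other (total ≈ onload + execute + offload). The structure mirrors the description above, but the names and bookkeeping of the real state_offload_manager may differ:

```python
import time
from contextlib import contextmanager

@contextmanager
def state_offload_manager(strategy, metrics, prefix):
    """Illustrative timing breakdown around state load, business logic, and state offload."""
    t_total = time.time()
    t = time.time()
    strategy.load_states()                        # move model states onto the GPU / into memory
    metrics[f"time/{prefix}/onload"] = time.time() - t
    t = time.time()
    yield                                         # the actual business logic (train, generate, ...)
    metrics[f"time/{prefix}/execute"] = time.time() - t
    t = time.time()
    strategy.offload_states()                     # move model states back off the GPU
    metrics[f"time/{prefix}/offload"] = time.time() - t
    metrics[f"time/{prefix}/total"] = time.time() - t_total
```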
GPU Memory Metrics
- Memory snapshot at start (after model state offload) (start/offload)
  - memory/{metric_infix}/start/offload/allocated/{device_id}: Current allocated GPU memory on a specific device_id.
  - memory/{metric_infix}/start/offload/reserved/{device_id}: Current reserved GPU memory on a specific device_id.
  - memory/{metric_infix}/start/offload/max_allocated/{device_id}: Peak allocated GPU memory on a specific device_id from the start of this operation to the current moment.
  - memory/{metric_infix}/start/offload/max_reserved/{device_id}: Peak reserved GPU memory on a specific device_id from the start of this operation to the current moment.
- Memory snapshot after model state load (before business logic execution) (start/onload)
  - memory/{metric_infix}/start/onload/allocated/{device_id}: Current allocated GPU memory on a specific device_id.
  - memory/{metric_infix}/start/onload/reserved/{device_id}: Current reserved GPU memory on a specific device_id.
  - memory/{metric_infix}/start/onload/max_allocated/{device_id}: Peak allocated GPU memory on a specific device_id from the start of this operation to the current moment.
  - memory/{metric_infix}/start/onload/max_reserved/{device_id}: Peak reserved GPU memory on a specific device_id from the start of this operation to the current moment.
- Memory snapshot after business logic execution (before model state offload) (end/onload)
  - memory/{metric_infix}/end/onload/allocated/{device_id}: Current allocated GPU memory on a specific device_id.
  - memory/{metric_infix}/end/onload/reserved/{device_id}: Current reserved GPU memory on a specific device_id.
  - memory/{metric_infix}/end/onload/max_allocated/{device_id}: Peak allocated GPU memory on a specific device_id from the start of this operation to the current moment.
  - memory/{metric_infix}/end/onload/max_reserved/{device_id}: Peak reserved GPU memory on a specific device_id from the start of this operation to the current moment.
  - memory/{metric_infix}/end/onload/max_allocated_frac/{device_id}: Peak allocated GPU memory as a fraction of total GPU memory on a specific device_id (see the sketch after this list).
  - memory/{metric_infix}/end/onload/max_reserved_frac/{device_id}: Peak reserved GPU memory as a fraction of total GPU memory on a specific device_id.
- Memory snapshot after model state offload (after operation completion) (end/offload)
  - memory/{metric_infix}/end/offload/allocated/{device_id}: Current allocated GPU memory on a specific device_id.
  - memory/{metric_infix}/end/offload/reserved/{device_id}: Current reserved GPU memory on a specific device_id.
  - memory/{metric_infix}/end/offload/max_allocated/{device_id}: Peak allocated GPU memory on a specific device_id from the start of this operation to the current moment.
  - memory/{metric_infix}/end/offload/max_reserved/{device_id}: Peak reserved GPU memory on a specific device_id from the start of this operation to the current moment.
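These per-device values correspond to what PyTorch's CUDA memory APIs report. A minimal sketch of collecting one snapshot (the helper name and dictionary layout are illustrative assumptions, not the pipeline's actual code):

```python
import torch

def gpu_memory_snapshot(metric_infix, stage):
    """Illustrative per-device GPU memory snapshot using PyTorch's CUDA memory APIs."""
    snapshot = {}
    for device_id in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(device_id).total_memory
        prefix = f"memory/{metric_infix}/{stage}"
        snapshot[f"{prefix}/allocated/{device_id}"] = torch.cuda.memory_allocated(device_id)
        snapshot[f"{prefix}/reserved/{device_id}"] = torch.cuda.memory_reserved(device_id)
        snapshot[f"{prefix}/max_allocated/{device_id}"] = torch.cuda.max_memory_allocated(device_id)
        snapshot[f"{prefix}/max_reserved/{device_id}"] = torch.cuda.max_memory_reserved(device_id)
        # *_frac metrics: peak usage as a fraction of the device's total memory.
        snapshot[f"{prefix}/max_allocated_frac/{device_id}"] = torch.cuda.max_memory_allocated(device_id) / total
        snapshot[f"{prefix}/max_reserved_frac/{device_id}"] = torch.cuda.max_memory_reserved(device_id) / total
    return snapshot
```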
CPU Memory Metrics
- memory/cpu/{metric_infix}/start/rss: Resident Set Size (actual physical memory used by the process) at the start of the operation (see the sketch after this list).
- memory/cpu/{metric_infix}/start/vms: Virtual Memory Size (virtual memory used by the process) at the start of the operation.
- memory/cpu/{metric_infix}/end/rss: Resident Set Size (actual physical memory used by the process) at the end of the operation.
- memory/cpu/{metric_infix}/end/vms: Virtual Memory Size (virtual memory used by the process) at the end of the operation.
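RSS and VMS are the standard process-memory figures reported by the operating system. A minimal sketch using psutil (the helper name and dictionary layout are illustrative assumptions):

```python
import psutil

def cpu_memory_snapshot(metric_infix, stage):
    """Illustrative RSS/VMS snapshot of the current process via psutil."""
    mem = psutil.Process().memory_info()
    return {
        f"memory/cpu/{metric_infix}/{stage}/rss": mem.rss,  # resident set size (physical memory)
        f"memory/cpu/{metric_infix}/{stage}/vms": mem.vms,  # virtual memory size
    }
```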