
ROLL: Reinforcement Learning Optimization for Large-Scale Learning

🚀 An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models 🚀


ROLL is an efficient and user-friendly RL library for Large Language Models (LLMs) that makes use of large-scale GPU resources. It significantly enhances LLM performance in key areas such as human preference alignment, complex reasoning, and multi-turn agentic interaction.

ROLL uses a multi-role distributed architecture built on Ray for flexible resource allocation and heterogeneous task scheduling, and it integrates Megatron-Core, SGLang, and vLLM to accelerate model training and inference.

[08/11/2025] 🎉 Our paper has been released; see Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning.
[06/09/2025] 🎉 The ROLL tech report is now available! Access the report here.


🚀 Get Started

Documents

Installation
Quick Start: Single-Node Deployment Guide
Quick Start: Multi-Node Deployment Guide
Debugging Guide
Frequently Asked Questions

User Guides

Configuration

Config System Explanation
Configuration Guide
Resource Configuration
Off-Policy Algorithms Configuration Guide
vLLM Inference Backend Configuration Guide
SGLang Inference Backend Configuration Guide
Megatron Inference and Training Backend Configuration Guide
LoRA Fine-tuning Configuration Guide
FP8 Quantization Configuration Guide
DeepSpeed Training Backend Configuration Guide

Pipeline

RLVR Pipeline for VLM
RLVR Pipeline
DPO Pipeline
Distill Pipeline
Agentic Pipeline
Comprehensive Guide: Using the Agentic Part of ROLL

Algorithms

TOPR (Tapered Off-Policy REINFORCE)
Reward Feedback Learning (Reward FL)
Reinforce++
RAFT++ (Reward rAnked Fine-Tuning)
Proximal Policy Optimization (PPO)
Lite PPO
Group Sequence Policy Optimization (GSPO)
Group Relative Policy Optimization (GRPO)

Agentic

Agentic Engineering Practice Documentation
TrajWiseLearning: StarPO (State-Thinking-Actions-Reward Policy Optimization)
StepWiseLearning: GiGPO (Group-in-Group Policy Optimization)
Tool Use Guide

Advanced Features

Agentic Asynchronous Parallel Rollout
ROLL Asynchronous Training User Guide
Checkpoint Saving and Resuming Guide
Converting MCoreAdapter Models to Hugging Face Format
GPU Time-Division Multiplexing Control Guide

Tracker & Metrics

Trackers and Metrics

Hardware Support

ROLL x Ascend

Development

Architecture

AgenticPipeline
RLVRPipeline

Developer Guide

How to Add Support for a New Model
Customer Env
Prompt Generation Guide


We welcome contributions from the community! 🤝