
ROLL: Reinforcement Learning Optimization for Large-Scale Learning

🚀 An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models 🚀


ROLL is an efficient and user-friendly RL library designed for Large Language Models (LLMs) utilizing large-scale GPU resources. It significantly enhances LLM performance in key areas such as human preference alignment, complex reasoning, and multi-turn agentic interaction.

Leveraging a multi-role distributed architecture with Ray for flexible resource allocation and heterogeneous task scheduling, ROLL integrates cutting-edge technologies like Megatron-Core, SGLang, and vLLM to accelerate model training and inference.
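To make the multi-role architecture concrete, the sketch below shows the general pattern in plain Ray: each role (rollout generation, policy training) runs as its own Ray actor with its own resources. The `RolloutWorker` and `TrainerWorker` names and their methods are hypothetical illustrations of the pattern, not ROLL's actual classes or API.

```python
import ray

ray.init()  # starts or connects to a Ray cluster

# In a real deployment each role would request GPUs, e.g. @ray.remote(num_gpus=1);
# resource requests are omitted here so the sketch runs on any machine.
@ray.remote
class RolloutWorker:
    """Generates responses; a real worker would call an inference backend
    such as vLLM or SGLang here."""

    def generate(self, prompts):
        return [{"prompt": p, "response": "<generated text>"} for p in prompts]


@ray.remote
class TrainerWorker:
    """Updates the policy; a real worker would run a training backend
    such as Megatron-Core here."""

    def train_step(self, rollouts):
        return {"loss": 0.0, "num_samples": len(rollouts)}


# Each role is an independent Ray actor, so rollout and training can be
# placed on different hardware and scaled independently.
rollout = RolloutWorker.remote()
trainer = TrainerWorker.remote()

batch = ray.get(rollout.generate.remote(["What is reinforcement learning?"]))
print(ray.get(trainer.train_step.remote(batch)))
```

Because the roles are independent actors, a scheduler can colocate or separate them freely, which is what makes flexible resource allocation and heterogeneous task scheduling possible.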

[08/11/2025] 🎉 Our paper is released; see Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning.

[06/09/2025] 🎉 The ROLL tech report is now available! Access the report here.


🚀 Quick Start

Documentation

Getting Started

Installation Guide
Quick Start: Single-Node Deployment Guide
Quick Start: Multi-Node Deployment Guide
ROLL Debugging Guide
Frequently Asked Questions (Q&A)

User Guide

Configuration

ROLL Configuration System Explained
ROLL Configuration Guide
ROLL Resource Configuration
Off-Policy Algorithm Configuration Guide
vLLM Inference Backend Configuration Guide
SGLang Inference Backend Configuration Guide
Megatron Inference and Training Backend Configuration Guide
LoRA Fine-Tuning Configuration Guide
FP8 Quantization Configuration Guide
DeepSpeed Training Backend Configuration Guide

Pipelines

VLM RLVR Pipeline
RLVR Pipeline
DPO Pipeline
Distill Pipeline
Agentic Pipeline
Comprehensive Guide: Using the Agentic Part of ROLL

Algorithms

TOPR (Tapered Off-Policy REINFORCE)
Reward Feedback Learning (Reward FL)
Reinforce++
RAFT++ (Reward rAnked Fine-Tuning)
Proximal Policy Optimization (PPO)
Lite PPO
Group Sequence Policy Optimization (GSPO)
Group Relative Policy Optimization (GRPO) (see the sketch below)
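For orientation, GRPO replaces a learned value baseline with a group-relative one: sample a group of responses per prompt and normalize each response's reward by the group's mean and standard deviation. Below is a minimal sketch of that advantage computation under the standard formulation; the function name and `eps` parameter are illustrative, not ROLL's API.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar reward per sampled response.

    Each response's advantage is its reward normalized by the mean and
    standard deviation of its own group, so no value network is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt with a group of four sampled responses:
print(group_relative_advantages(torch.tensor([[1.0, 0.0, 0.5, 1.0]])))
```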

Agentic

Agentic Engineering Practice Guide
TrajWiseLearning: StarPO (State-Thinking-Actions-Reward Policy Optimization)
StepWiseLearning: GiGPO (Group-in-Group Policy Optimization)
Tool Use Guide

Advanced Features

Agentic Asynchronous Parallel Rollout
ROLL Asynchronous Training Guide
Checkpoint Saving and Restoring Guide
Converting MCoreAdapter Models to Hugging Face Format
GPU Time-Division Multiplexing Control Guide

Trackers and Metrics

Trackers and Metrics

Hardware Support

ROLL x Ascend

Development

Architecture

AgenticPipeline
RLVRPipeline

Developer Guide

How to Support New Models
Custom Environments (Env) (see the sketch after this list)
Prompt Generation Guide
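As a companion to the Custom Environments entry above, here is a minimal, gym-style skeleton of the interaction loop a text environment typically implements: `reset` returns the initial observation and `step` consumes the agent's text action and returns observation, reward, done, and info. This is a generic sketch of the pattern only; ROLL's actual Env base class and signatures are documented in the developer guide.

```python
class EchoEnv:
    """Toy single-turn text environment: rewards the agent for echoing a keyword.
    EchoEnv is a hypothetical name used only for illustration."""

    def reset(self) -> str:
        # Return the initial observation (the prompt the agent sees).
        self.prompt = "Repeat after me: hello"
        return self.prompt

    def step(self, action: str):
        # Score the agent's text action and end the (single-turn) episode.
        reward = 1.0 if "hello" in action.lower() else 0.0
        observation, done, info = "", True, {}
        return observation, reward, done, info


env = EchoEnv()
obs = env.reset()
obs, reward, done, info = env.step("hello")
print(reward, done)  # 1.0 True
```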


We welcome contributions from the community! 🤝