Quick Start: Multi-Node Deployment Guide
Environment Preparation
- Provision multiple GPU machines and install the GPU driver on each of them; designate one machine as the master node and the rest as worker nodes (the example below uses 2 machines with 2 GPUs each)
- Connect to the GPU instances remotely and enter the machine terminal
- Run the following command on each machine to install Docker and the NVIDIA Container Toolkit
# Fetch the raw script (not the GitHub HTML page) and run it
curl -fsSL https://raw.githubusercontent.com/alibaba/ROLL/main/scripts/install_docker_nvidia_container_toolkit.sh | sudo bash
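To confirm that Docker and the NVIDIA Container Toolkit work before going further, you can launch a throwaway CUDA container (a minimal check; it assumes the machine can pull the public nvidia/cuda image, and any recent CUDA base tag works):
# Should print the same GPU table as running nvidia-smi directly on the host
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi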
Environment Configuration
Choose the Docker image you want from the list of image addresses. The following example uses torch 2.6.0 + vLLM 0.8.4.
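Optionally, pre-pull the image on every machine so the docker run step below does not have to download it first (this uses the same tag as the example; docker run will also pull it automatically if it is missing):
sudo docker pull roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch260-vllm084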
# 1. Start a Docker container with GPU support, expose container ports, and keep the container running
sudo docker run -dit \
--gpus all \
-p 9001:22 \
--ipc=host \
--shm-size=10gb \
roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-24.05-py3-torch260-vllm084 \
/bin/bash
# 2. Enter the Docker container
# You can use the `sudo docker ps` command to find the running container ID or name.
sudo docker exec -it <container_id> /bin/bash
# 3. Verify that GPUs are visible
nvidia-smi
# 4. Clone the project code
git clone https://github.com/alibaba/ROLL.git
# 5. Install project dependencies (choose the requirements file corresponding to your image)
cd ROLL
pip install -r requirements_torch260_vllm.txt -i https://mirrors.aliyun.com/pypi/simple/
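After the install finishes, a quick sanity check (a minimal sketch; it only assumes that torch and vllm are available in the environment, as implied by the image tag) confirms the core dependencies import and that PyTorch sees the GPUs:
# Should print the torch/vllm versions and the number of visible GPUs (2 per machine in this example)
python -c "import torch, vllm; print(torch.__version__, vllm.__version__, torch.cuda.device_count())"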
Pipeline Execution
- Configure environment variables on the master node:
export MASTER_ADDR="ip of master node"
export MASTER_PORT="port of master node" # Default: 6379
export WORLD_SIZE=2
export RANK=0
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
Notes:
- `MASTER_ADDR` and `MASTER_PORT` define the communication endpoint of the distributed cluster.
- `WORLD_SIZE` specifies the total number of nodes in the cluster (e.g., 2 nodes).
- `RANK` identifies the role of each node (0 is the master node; 1, 2, 3, etc. are worker nodes).
- `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` specify the network interface used for GPU/cluster communication (usually eth0).
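If you are not sure which interface name to use for `NCCL_SOCKET_IFNAME`/`GLOO_SOCKET_IFNAME` (on some machines it is not eth0 but, e.g., an ens or bond device), list the interfaces and pick the one that carries the node's private IP:
# Use the interface whose IPv4 address matches MASTER_ADDR / the node's private IP
ip -o -4 addr show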
- Run the pipeline on the master node
bash examples/agentic_demo/run_agentic_pipeline_frozen_lake_multi_nodes_demo.sh
After the Ray cluster starts, the master node prints Ray startup logs showing the cluster address and available resources; you can verify the head node locally as shown below before configuring the worker nodes.
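A minimal check from the master machine, assuming the default MASTER_PORT of 6379:
# A LISTEN entry on port 6379 means the Ray head (GCS) is up and reachable for the workers
ss -ltn | grep 6379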
- Configure environment variables on the worker node:
export MASTER_ADDR="ip of master node"
export MASTER_PORT="port of master node" # Default: 6379
export WORLD_SIZE=2
export RANK=1
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
- On the worker node, connect to the Ray cluster started by the master node:
ray start --address='ip of master node:port of master node' --num-gpus=2
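Once the worker has joined, you can run `ray status` on the master node to confirm that the cluster sees both machines (with this 2×2-GPU example it should report 2 active nodes and 4 GPUs in the resource summary):
# Run on the master node to inspect the cluster's nodes and resources
ray status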
Reference: Multi-GPU V100 Memory Configuration Key Points
# Reduce the expected number of GPUs per node from 8 to the 2 V100s actually available on each machine
num_gpus_per_node: 2
# Training processes now map to GPUs 0-3 (all 4 GPUs across the 2 nodes)
actor_train.device_mapping: list(range(0,4))
# Inference processes now map to GPUs 0-3
actor_infer.device_mapping: list(range(0,4))
# Reference model processes now map to GPUs 0-3
reference.device_mapping: list(range(0,4))
# Significantly reduce the batch size during Rollout/Validation stages to prevent out-of-memory errors when a single GPU processes large batches
rollout_batch_size: 64
val_batch_size: 16
# V100 has better native support for FP16 than BF16 (unlike A100/H100). Switching to FP16 can improve compatibility and stability while saving GPU memory.
actor_train.model_args.dtype: fp16
actor_infer.model_args.dtype: fp16
reference.model_args.dtype: fp16
# Switch the large-model training framework from DeepSpeed to Megatron-LM, which can transfer parameters in batches for faster execution
strategy_name: megatron_train
strategy_config:
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
expert_model_parallel_size: 1
use_distributed_optimizer: true
recompute_granularity: full
# In Megatron training, the global training batch size is per_device_train_batch_size * gradient_accumulation_steps * world_size, where world_size = 4; a quick sanity check follows this config block
actor_train.training_args.per_device_train_batch_size: 1
actor_train.training_args.gradient_accumulation_steps: 16
# Reduce the maximum number of actions per trajectory to make each Rollout trajectory shorter, reducing the length of LLM-generated content
max_actions_per_traj: 10
# Reduce the number of parallel training and validation environment groups to fit the limited GPU resources
train_env_manager.env_groups: 1
train_env_manager.n_groups: 1
val_env_manager.env_groups: 2
val_env_manager.n_groups: [1, 1]
val_env_manager.tags: [SimpleSokoban, FrozenLake]
# Reduce the total number of training steps to run a complete training process faster for quick debugging
max_steps: 100
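As a quick sanity check on the global batch size formula noted above (a minimal sketch; the numbers simply mirror the example settings, with world_size = 2 nodes × 2 GPUs):
# global batch = per_device_train_batch_size * gradient_accumulation_steps * world_size = 1 * 16 * 4 = 64,
# which matches the rollout_batch_size of 64 used above
per_device_train_batch_size=1
gradient_accumulation_steps=16
world_size=4
echo $((per_device_train_batch_size * gradient_accumulation_steps * world_size))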