Skip to main content
Version: 1.7.x

Scheduler

The ROCK scheduler is a periodic task framework embedded in the admin service. It dispatches background maintenance tasks (image cleanup, file cleanup, container cleanup, image pre-pull, custom tasks, ...) to every alive Ray worker on a configurable interval, so worker nodes stay healthy without manual intervention.

This guide covers how to enable the scheduler, configure built-in tasks, write your own task, and inspect runtime status.

1. How It Works

  • The scheduler runs inside the admin process as a dedicated daemon thread (SchedulerThread) with its own asyncio event loop.
  • Tasks are scheduled by APScheduler using fixed intervals (interval_seconds).
  • For each tick, the scheduler resolves the list of alive Ray workers (cached via worker_cache_ttl seconds) and dispatches the task to every worker concurrently (default concurrency: 50).
  • Dispatch is done over HTTP through the worker's rocklet service: the admin builds a RemoteSandboxRuntime(host=worker_ip, port=Port.PROXY) (see rock.deployments.constants.Port) and calls runtime.execute / read_file / write_file against it. Every worker must therefore have the rocklet server running and reachable on Port.PROXY — otherwise the scheduler cannot push commands or read/write status files on that worker.
  • Each task subclasses rock.admin.scheduler.task_base.BaseTask and must implement run_action(runtime: RemoteSandboxRuntime) — the work performed on a single worker.
  • Per-worker execution status is persisted to the worker filesystem under ROCK_SCHEDULER_STATUS_DIR (default /data/scheduler_status), and an aggregated execution report is written to <status_dir>/<task_type>_run_report.json after every run.
  • If a Nacos config provider is enabled, the scheduler subscribes to config changes and applies a diff: only tasks whose hash changed are re-installed; removed tasks are cleaned up from all workers.

Prerequisites

Before enabling the scheduler, make sure each Ray worker meets the following requirements:

RequirementWhy
rocklet process is running on the workerThe scheduler dispatches every task through the rocklet HTTP API; without it, runtime.execute calls time out.
Rocklet's listening port is reachable from the adminThe scheduler uses Port.PROXY (defined in rock.deployments.constants.Port) as the dispatch target. Make sure no firewall / security group blocks it.
ROCK_SCHEDULER_STATUS_DIR is writable inside the workerTasks read and write <task>_status.json here for idempotency / PID tracking.
Tools required by the task are available on the workere.g. docker for the cleanup / pull tasks, curl and outbound network for ImageCleanupTask to install docuum on first run.

The rocklet server is started automatically by the standard worker bootstrap scripts (docker_run.sh, docker_run_with_uv.sh, docker_run_with_pip.sh) — typically rocklet --port <Port.PROXY>. If you bring up workers with a custom entrypoint, ensure the equivalent command is invoked. See the Configuration guide for the runtime-environment options that govern how rocklet is started.

Idempotency

Each task declares an idempotency mode that affects how it is re-run:

ModeBehavior
IDEMPOTENTAlways run on every tick. Safe to repeat (e.g. docker pull, find -exec rm).
NON_IDEMPOTENTSpawns a background daemon (e.g. docuum). The scheduler reads the previous status file, checks whether the recorded PID is still alive, and skips re-launch if the daemon is still running. On task removal the daemon is killed via pkill.

2. Enabling the Scheduler

The scheduler is configured under the top-level scheduler: key of the ROCK admin YAML (e.g. rock-conf/rock-local.yml, rock-conf/rock-dev.yml).

scheduler:
enabled: true # Master switch
worker_cache_ttl: 43200 # Worker IP cache TTL in seconds
tasks:
# ... task list, see below
FieldTypeDefaultDescription
enabledboolfalseMaster switch. When false, all tasks are removed and no new ticks fire.
worker_cache_ttlint3600Seconds the alive-worker IP list is cached before refreshing from ray.nodes().
taskslist[]List of TaskConfig entries (see Section 4).
VariableDefaultPurpose
ROCK_SCHEDULER_STATUS_DIR/data/scheduler_statusDirectory on workers where per-task status JSON and run reports are written.
ROCK_LOGGING_PATH(unset)When set, scheduler-spawned daemons (docuum, container_cleanup, image_pull) redirect their stdout/stderr to <ROCK_LOGGING_PATH>/<task>.log.
ROCK_DOCUUM_INSTALL_URLhttps://raw.githubusercontent.com/stepchowfun/docuum/main/install.shInstall script URL for docuum, fetched on demand by ImageCleanupTask.

3. Built-in Tasks

ROCK ships with four built-in tasks under rock.admin.scheduler.tasks. Each task is registered by setting task_class to its fully qualified class path.

3.1 ImageCleanupTask

Runs docuum on every worker to evict the least-recently-used Docker images once disk usage crosses a threshold. Non-idempotentdocuum runs as a long-lived daemon; the scheduler tracks its PID and skips re-launch while the daemon is alive.

- task_class: rock.admin.scheduler.tasks.image_cleanup_task.ImageCleanupTask
enabled: true
interval_seconds: 43200 # Re-check daemon every 12 hours
params:
disk_threshold: "70%" # Trigger eviction when disk usage exceeds 70%
image_whitelist: # Glob patterns matching repository:tag — never evicted
- "python:3.11"
- "my-registry.example.com/base/*"
ParamTypeDefaultDescription
disk_thresholdstr"1T"Disk usage threshold passed to docuum --threshold. Accepts size (100G, 1T) or percentage (70%).
image_whitelistlist[str][]Glob patterns forwarded to docuum --keep.

3.2 FileCleanupTask

Walks each configured directory and removes files that are either older than max_age_mins or larger than max_file_size, then prunes empty subdirectories. Idempotent.

- task_class: rock.admin.scheduler.tasks.file_cleanup_task.FileCleanupTask
enabled: true
interval_seconds: 86400 # Run daily
params:
target_dirs:
# Plain string form — no exclusions
- "/data/service_status"
# Object form — per-directory exclusions
- path: "/data/logs"
exclude_files: # Plain name | relative path | absolute path
- "docuum.log"
- "./rocklet.log"
- "./access.log"
exclude_dirs:
- ".cache"
max_age_mins: 10080 # 7 days; older files are removed
max_file_size: "1G" # Files larger than this are removed (supports K/M/G/T)
ParamTypeDefaultDescription
target_dirslist[]Each entry is either a string (path only) or {path, exclude_files, exclude_dirs}.
max_age_minsint10080Files whose mtime is older than this many minutes are deleted.
max_file_sizestr"1G"Files larger than this are deleted. Suffixes K/M/G/T accepted.

The deletion condition is (-mmin +max_age_mins) OR (-size +max_file_size). After file removal, a second find -depth -type d -empty -delete pass removes empty directories left behind (also honoring exclude_dirs).

3.3 ContainerCleanupTask

Removes stopped Docker containers older than a configurable age. Helps prevent the worker's container list from growing unbounded between sandbox runs. Idempotent.

- task_class: rock.admin.scheduler.tasks.container_cleanup_task.ContainerCleanupTask
enabled: true
interval_seconds: 86400
params:
max_age_hours: 72 # Remove exited containers older than 72 hours
ParamTypeDefaultDescription
max_age_hoursint24Maximum age (hours since FinishedAt) for kept exited containers. Older ones are docker rm'd.

The task also removes any container in the created state (never started) on every run.

3.4 ImagePullTask

Pre-pulls a list of Docker images on every worker, optionally logging in to private registries first. Reduces sandbox cold-start latency. Idempotent (docker pull is a no-op when the image is already up-to-date).

- task_class: rock.admin.scheduler.tasks.image_pull_task.ImagePullTask
enabled: true
interval_seconds: 21600 # Refresh every 6 hours
params:
images:
# Plain string form — public image, no auth
- "python:3.11"
# Object form — private image with registry login
- image: "my-registry.example.com/chatos/python:313"
registry_username: "myuser"
registry_password: "bXlwYXNzd29yZA==" # base64-encoded
ParamTypeDefaultDescription
imageslist[]Each entry is either an image string or {image, registry_username, registry_password}.

registry_password must be base64-encoded; the worker decodes it and pipes it to docker login --password-stdin. The registry host is parsed from the image name, so each image can target a different registry.

4. Task Config Schema

Every entry under scheduler.tasks is loaded as a rock.config.TaskConfig:

FieldTypeDefaultDescription
task_classstr""Fully qualified Python class path. Required.
enabledbooltrueDisabled tasks are skipped at install time and torn down on reload.
interval_secondsint3600APScheduler interval in seconds.
paramsdict{}Task-specific kwargs forwarded to from_config().

A change in any field of an existing task entry causes the scheduler to uninstall the old task (cleaning up its worker-side state when non-idempotent) and install the new one — without restarting the admin process.

5. Writing a Custom Task

Any class under your Python path that subclasses BaseTask can be registered. The minimum contract is:

# my_pkg/my_tasks/disk_report_task.py
from rock.admin.proto.request import SandboxCommand as Command
from rock.admin.scheduler.task_base import BaseTask, IdempotencyType, TaskStatusEnum
from rock.sandbox.remote_sandbox import RemoteSandboxRuntime


class DiskReportTask(BaseTask):
"""Log `df -h` output from every worker."""

def __init__(self, interval_seconds: int = 3600, mount_point: str = "/"):
super().__init__(
type="disk_report", # Used as job id and status filename prefix
interval_seconds=interval_seconds,
idempotency=IdempotencyType.IDEMPOTENT,
)
self.mount_point = mount_point

@classmethod
def from_config(cls, task_config) -> "DiskReportTask":
return cls(
interval_seconds=task_config.interval_seconds,
mount_point=task_config.params.get("mount_point", "/"),
)

async def run_action(self, runtime: RemoteSandboxRuntime) -> dict:
result = await runtime.execute(
Command(command=f"df -h {self.mount_point}", shell=True),
)
return {
"status": TaskStatusEnum.SUCCESS,
"exit_code": result.exit_code,
"stdout": result.stdout,
}

Then register it from YAML:

scheduler:
enabled: true
tasks:
- task_class: my_pkg.my_tasks.disk_report_task.DiskReportTask
enabled: true
interval_seconds: 600
params:
mount_point: "/data"

Authoring Checklist

  • Always set a unique type string in super().__init__(). It is used as the APScheduler job id, the status filename (<type>_status.json), and the run-report filename (<type>_run_report.json). Two tasks must not share a type.
  • Pick the right IdempotencyType:
    • Use IDEMPOTENT when run_action finishes synchronously and is safe to re-execute.
    • Use NON_IDEMPOTENT when you nohup a long-running daemon and return its PID. The scheduler will then track the PID, skip re-launch while it is alive, and pkill it on uninstall.
  • Return a dict from run_action. Recommended keys:
    • status — a TaskStatusEnum value, persisted into the status file.
    • pid — required for NON_IDEMPOTENT daemons (use rock.utils.system.extract_nohup_pid on the nohup ... & echo PID_PREFIX${{!}}PID_SUFFIX output).
    • Any other diagnostic fields are written to the status file's extra section.
  • Override from_config(cls, task_config) to translate task_config.params into your __init__ kwargs.
  • Use runtime.execute / read_file / write_file rather than running shell commands locally — the scheduler is dispatching to remote workers via RemoteSandboxRuntime.

6. Observability

For each task, the scheduler writes two artifacts on every worker:

PathWritten ByContents
<ROCK_SCHEDULER_STATUS_DIR>/<task_type>_status.jsonBaseTask.save_task_statusLatest per-worker status: task_name, worker_ip, pid, status (pending/running/success/failed), last_run, error, plus task-specific extra fields.
<ROCK_SCHEDULER_STATUS_DIR>/<task_type>_run_report.jsonBaseTask.run (admin side, after the tick completes)Aggregated report: total/success/failed counts, list of success_ips, and failed_details (ip + traceback).

Scheduler-internal logs are written under the standard ROCK admin log path, with name="scheduler", name="task_base", name="image_clean", etc. Scheduler-spawned daemons additionally write their own logs to <ROCK_LOGGING_PATH>/<task>.log when ROCK_LOGGING_PATH is set (e.g. docuum.log, container_cleanup.log, image_pull.log).

7. Dynamic Reload via Nacos (Optional)

When the admin service is configured with a Nacos provider, the scheduler installs a YAML listener and reacts to config pushes:

  • Only the scheduler: section is inspected; other sections are ignored.
  • The new section is hashed and compared against the previous one — duplicate notifications are skipped.
  • A diff between old and new task lists determines which tasks to install, uninstall, or reinstall (changed params / interval_seconds / enabled).
  • Non-idempotent tasks that are removed or re-installed are first cleaned up: their daemon PID is killed and the status file is removed.

This means task interval changes, parameter tweaks, and adding/removing tasks can be applied without restarting the admin process.

8. Troubleshooting

SymptomLikely CauseWhat to Check
Scheduler disabled, all tasks removed in admin logscheduler.enabled is falseSet enabled: true in YAML.
No alive workers found for task '<type>'Ray cluster has no live worker nodesVerify ray.nodes() reports alive CPU workers; consider lowering worker_cache_ttl if workers were just added.
Task ticks fire but every worker shows up in failed_details with connection errorsrocklet is not running on the workers, or Port.PROXY is blockedOn the worker host, hit the rocklet liveness endpoint GET /is_alive on Port.PROXY (e.g. curl http://<worker_ip>:<Port.PROXY>/is_alive); if it does not respond, restart rocklet (rocklet --port <Port.PROXY>) or open the port in the firewall.
Task runs but never repeats on a non-idempotent workerRecorded PID still aliveInspect <status_dir>/<task>_status.json; if status: running and the PID is alive, should_run returns False.
Failed to create task '<class>'task_class import failedEnsure the module is importable inside the admin process (installed in the same venv, on PYTHONPATH).
docker login failing in ImagePullTaskregistry_password not base64-encoded, or wrong registry parsed from imageRe-encode the password with echo -n '<pwd>' | base64; double-check the image's registry host.