Recursive RLM delegation in fleet-rlm - Qredence Documentation

The ReAct chat agent does not directly run long-context work. When a task exceeds what a single context window can hold, the agent delegates through a specific ReAct tool — delegate_to_rlm — that spawns an isolated child Daytona sandbox running a bounded dspy.RLM. This is fleet-rlm’s implementation of Algorithm 1 from arXiv 2512.24601v2: inputs stored as REPL variables, sub-queries dispatched recursively, bounded by a shared call budget.

Delegation flow

User prompt
   ↓
FleetAgent  (dspy.ReAct, host LLM)
   │   decides the task exceeds one context and picks the tool:
   ↓
delegate_to_rlm(query, context="", document_url="")
   │   — src/fleet_rlm/runtime/tools/rlm_delegate.py
   │   — reads the active Daytona interpreter from a ContextVar
   │   — checks remaining LLM-call budget; returns error if exhausted
   │   — interpreter.build_delegate_child()   ← isolated child Daytona sandbox
   │   — optionally fetches document_url into the child's context
   ↓
build_recursive_subquery_rlm(
    interpreter=child,
    max_iterations=min(child.rlm_max_iterations, remaining_budget),
    max_llm_calls=remaining_budget,
)
   │   constructs the dspy.RLM bound to the child sandbox
   ↓
rlm(prompt=query, context=...)
   │   child RLM runs REPL-variable mode: may call llm_query(),
   │   sub_rlm(), sub_rlm_batched() to recurse further inside its sandbox
   ↓
{"status": "ok", "answer": "..."}        ← bubbles back into the ReAct trace

The delegation tool reads the active DaytonaInterpreter from a ContextVar, checks the remaining LLM-call budget, builds a child sandbox via interpreter.build_delegate_child(), and constructs a fresh dspy.RLM bound to that child.
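The control flow of delegate_to_rlm() can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code in rlm_delegate.py. FakeInterpreter, ACTIVE_INTERPRETER, remaining_budget(), and the run_child_rlm callable are hypothetical stand-ins. The real tool constructs a dspy.RLM through build_recursive_subquery_rlm() and can fetch document_url, and both steps are left out here. Only build_delegate_child(), the max_iterations/max_llm_calls arithmetic, and the two result envelopes come from this page.

```python
from __future__ import annotations

from contextvars import ContextVar
from typing import Any, Callable, Optional


class FakeInterpreter:
    """Hypothetical stand-in for DaytonaInterpreter (not the real class)."""

    def __init__(self, remaining: int, rlm_max_iterations: int = 10) -> None:
        self._remaining = remaining
        self.rlm_max_iterations = rlm_max_iterations

    def remaining_budget(self) -> int:
        # Hypothetical accessor for the tree-wide LLM-call budget.
        return self._remaining

    def build_delegate_child(self) -> FakeInterpreter:
        # The real method creates an isolated child Daytona sandbox.
        return FakeInterpreter(self._remaining, self.rlm_max_iterations)


# Hypothetical ContextVar that holds the interpreter for the current turn.
ACTIVE_INTERPRETER: ContextVar[Optional[FakeInterpreter]] = ContextVar(
    "active_interpreter", default=None
)

# Hypothetical runner: (child, query, context, max_iterations, max_llm_calls) -> answer.
ChildRunner = Callable[[FakeInterpreter, str, str, int, int], str]


def delegate_to_rlm(query: str, context: str = "", *, run_child_rlm: ChildRunner) -> dict[str, Any]:
    interpreter = ACTIVE_INTERPRETER.get()
    if interpreter is None:
        raise RuntimeError("no active interpreter for this turn")
    remaining = interpreter.remaining_budget()
    if remaining <= 0:
        # Refuse to start a child; this envelope goes back into the ReAct trace.
        return {"status": "error", "error": "llm_call_budget_exhausted", "remaining": 0}
    child = interpreter.build_delegate_child()
    answer = run_child_rlm(
        child,
        query,
        context,
        min(child.rlm_max_iterations, remaining),  # max_iterations
        remaining,  # max_llm_calls
    )
    return {"status": "ok", "answer": answer}
```

A caller sets ACTIVE_INTERPRETER at the start of a turn and restores it afterwards with the token returned by ContextVar.set(), which keeps the interpreter scoped to the current execution context.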

REPL variable mode

Inside the child sandbox, the RLM does not see the entire context as raw tokens. Following Algorithm 1, the inputs are stored as REPL variables in the child’s persistent Python execution context (sandbox.code_interpreter.create_context(...)). The RLM works through them programmatically:

Long documents are bound to named variables in the REPL.
The RLM emits Python that inspects, slices, and reduces those variables.
Sub-queries that need an LLM call are dispatched via the host-callback bridge, not by inflating the prompt.

This is what makes recursive long-context tasks fit: the model never has to see the whole input at once. It writes code that processes the input incrementally, and only the summarized, decision-relevant fragments cross back into the LLM context window.
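To make REPL-variable mode concrete, here is the kind of Python a child RLM might emit. It is an illustration under assumptions. document stands for an input already bound as a REPL variable. llm_query is replaced by a local stub so the sketch runs outside a sandbox. line_chunks, the 100-line chunk size, and the keyword filter are hypothetical choices rather than fleet-rlm behavior.

```python
from __future__ import annotations


def llm_query(prompt: str) -> str:
    # Local stub for the sandbox callback; the real call is a host-side LLM request.
    return f"summary of {len(prompt)} chars"


# Assume the long input was bound to this REPL variable before the RLM started.
document = "\n".join(
    f"line {i}: {'ERROR disk full' if i % 250 == 0 else 'ok'}" for i in range(1, 1001)
)


def line_chunks(text: str, lines_per_chunk: int) -> list[str]:
    """Slice a variable into groups of lines instead of prompting with all of it."""
    lines = text.splitlines()
    return [
        "\n".join(lines[i : i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


chunk_list = line_chunks(document, 100)
# Filter in code first: only chunks that mention the keyword survive.
relevant = [c for c in chunk_list if "ERROR" in c]
# Only the surviving fragments cost an llm_query call.
notes = [llm_query(f"Summarize the failure in:\n{c}") for c in relevant]
```

Four of the ten chunks mention ERROR, so the sketch makes four semantic calls instead of placing all 1,000 lines in one prompt.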

Since fleet-rlm 0.5.50, large-input injection rides on upstream DSPy’s native dspy.RLM SandboxSerializable contract (DSPy ≥ 3.3.0b1). Document and workspace context models implement SandboxSerializable, so REPL-variable mode is handled by DSPy directly rather than by Fleet-maintained variable-mode wrappers. The Fleet runtime keeps the private RLM hooks isolated in runtime/models/builders.py.

Two entry points, one budget

Recursive RLM work has two entry points, and they share one budget:

delegate_to_rlm() — from the host FleetAgent’s ReAct tool registry. The chat agent calls this when it decides a sub-task exceeds its context.
sub_rlm() / sub_rlm_batched() — from Python code already running inside a dspy.RLM sandbox, reaching back out through the Daytona bridge to spawn a further child.

Both go through DaytonaInterpreter.build_delegate_child() so child creation follows one backend-owned policy. rlm_max_llm_calls is a single shared semantic-call budget across the entire recursive tree — not per frame, not per RLM. sub_rlm_batched() caps sibling parallelism at 4 while sharing that same budget across siblings.
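A tree-wide budget with a sibling cap can be sketched with a lock-protected counter and a four-worker thread pool. SharedBudget, BudgetExhausted, and the sub_rlm / sub_rlm_batched bodies below are hypothetical stand-ins; the real callbacks go through DaytonaInterpreter and spawn child sandboxes. The toy recursion nests two levels, so each top-level call charges the budget three times.

```python
from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor

MAX_SIBLINGS = 4  # sibling cap for batched recursion


class BudgetExhausted(RuntimeError):
    """Hypothetical error raised when the shared budget runs out."""


class SharedBudget:
    """One counter for the whole recursive tree (hypothetical helper)."""

    def __init__(self, max_llm_calls: int) -> None:
        self._remaining = max_llm_calls
        self._lock = threading.Lock()

    def charge(self, n: int = 1) -> None:
        with self._lock:
            if self._remaining < n:
                raise BudgetExhausted("llm_call_budget_exhausted")
            self._remaining -= n

    @property
    def remaining(self) -> int:
        with self._lock:
            return self._remaining


def sub_rlm(budget: SharedBudget, query: str, depth: int = 0) -> str:
    budget.charge()  # every frame, at any depth, draws from the same counter
    if depth < 2:  # toy recursion: nest two levels below the caller
        return sub_rlm(budget, query, depth + 1)
    return f"answer({query})"


def sub_rlm_batched(budget: SharedBudget, queries: list[str]) -> list[str]:
    # At most MAX_SIBLINGS children run at once, and all of them share `budget`.
    with ThreadPoolExecutor(max_workers=MAX_SIBLINGS) as pool:
        return list(pool.map(lambda q: sub_rlm(budget, q), queries))
```

With a budget of 20, five batched queries spend 15 calls and leave 5, and no more than four siblings run at once. A thread pool is only one way to enforce the cap; it is used here because max_workers makes the limit explicit.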

The budget is global rather than per-frame because the sandbox callbacks (llm_query, sub_rlm, and the rest) dispatch through fleet-rlm's interpreter methods over the Daytona bridge, not through DSPy's per-forward injected counters. Budget enforcement therefore happens at the bridge boundary, which sees every recursive call regardless of depth.

Sandbox callbacks

Code running inside a child RLM sandbox can call these primitives via the Daytona bridge:

Callback	Purpose
`llm_query(prompt, ...)`	Single semantic LLM call against the host’s planner LM
`llm_query_batched(prompts, ...)`	Batched semantic LLM calls (parallel)
`sub_rlm(query, context="")`	Recurse further — spawn a new child RLM
`sub_rlm_batched(queries, contexts)`	Batched recursion, capped at 4 sibling children
`SUBMIT(answer)`	Final-artifact capture for the parent

These are semantic primitives: they let sandbox-side Python think with the LLM without leaving the sandbox. Each call is dispatched to an interpreter method on the host, which is how the global budget is enforced and how MLflow traces stay connected across the tree. rlm_query and rlm_query_batched are agent-level recursive entry points, not sandbox callbacks; sandbox-authored code should use llm_query and sub_rlm instead.
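On the host side, the bridge can be pictured as a name-to-handler table that charges the budget before dispatching. CallbackBroker, the per-callback cost rules, and the echo handlers are hypothetical, and sub_rlm / sub_rlm_batched are omitted for brevity. The sketch shows why enforcement at this boundary sees every call, and why agent-level names such as rlm_query are simply not reachable from sandbox code.

```python
from __future__ import annotations

from typing import Any, Callable


class CallbackBroker:
    """Hypothetical host-side broker: every sandbox callback passes through dispatch()."""

    def __init__(self, max_llm_calls: int) -> None:
        self.remaining = max_llm_calls
        self.submitted: list[Any] = []
        # name -> (cost function, handler); costs and handlers are illustrative only.
        self._table: dict[str, tuple[Callable[..., int], Callable[..., Any]]] = {
            "llm_query": (lambda prompt: 1, lambda prompt: f"llm:{prompt}"),
            "llm_query_batched": (
                lambda prompts: len(prompts),
                lambda prompts: [f"llm:{p}" for p in prompts],
            ),
            "SUBMIT": (lambda answer: 0, self._submit),
        }

    def _submit(self, answer: Any) -> None:
        # Final-artifact capture for the parent.
        self.submitted.append(answer)

    def dispatch(self, name: str, *args: Any) -> Any:
        if name not in self._table:
            # rlm_query / rlm_query_batched are agent-level entry points, not callbacks.
            raise LookupError(f"{name} is not a sandbox callback")
        cost_fn, handler = self._table[name]
        cost = cost_fn(*args)
        if cost > self.remaining:
            return {"status": "error", "error": "llm_call_budget_exhausted", "remaining": self.remaining}
        self.remaining -= cost
        return handler(*args)
```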

Child isolation policy

The default is RLM_CHILD_ISOLATION_MODE=auto:

Condition	Behavior
Parent has no durable mounted volume	Fork the parent Daytona sandbox into a child sandbox
Durable volume is mounted	Create a clean child Daytona sandbox with the same `repo_url`, `repo_ref`, and `context_paths`, plus a child-specific `volume_subpath`
Fork fails and `RLM_CHILD_FORK_FALLBACK=clean`	Retry with a clean child sandbox
After child run completes	Child sandbox is deleted

Child outputs return through the RLM answer. Child files and artifacts are not promoted to the parent automatically. If you want durable output, write to the mounted volume’s memory/ or artifacts/ paths from inside the child.

RLM_CHILD_ISOLATION_MODE=context is retained only as a backend/local debugging opt-out. It preserves the earlier behavior of running the child as a fresh context inside the same sandbox, and it should not be treated as the production isolation contract.
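The isolation rules above reduce to a small decision function. This sketches the policy as described on this page, not the code in isolation.py. choose_child_sandbox, ForkFailed, and the fork / create_clean callables are hypothetical. Treating an explicit clean mode as "always build a clean child" is an assumption, since the table only covers auto.

```python
from __future__ import annotations

from typing import Callable


class ForkFailed(RuntimeError):
    """Hypothetical error raised when forking the parent sandbox fails."""


def choose_child_sandbox(
    mode: str,
    has_durable_volume: bool,
    fork: Callable[[], str],
    create_clean: Callable[[], str],
    fork_fallback: str = "clean",
) -> str:
    """Return an identifier for the sandbox the delegated RLM should run in."""
    if mode not in {"auto", "clean", "context"}:
        raise ValueError(f"unknown isolation mode: {mode!r}")
    if mode == "context":
        # Debug-only opt-out: fresh context inside the parent sandbox.
        return "parent-sandbox"
    if mode == "clean" or has_durable_volume:
        # Assumed meaning of `clean`; with a durable volume, `auto` also builds a clean
        # child (same repo_url, repo_ref, context_paths, plus its own volume_subpath).
        return create_clean()
    try:
        return fork()  # `auto` without a durable volume: fork the parent
    except ForkFailed:
        if fork_fallback == "clean":
            return create_clean()
        raise
```

Deleting the child after the run, and passing repo_url, repo_ref, context_paths, and volume_subpath into create_clean, are left to the caller in this sketch.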

Local workspace snapshot fallback

When the parent turn is analyzing a local host checkout and no repo_url is available to recreate that checkout in a clean child sandbox, delegate_to_rlm() falls back to a workspace snapshot:

Writes a bounded text snapshot of relevant local repository files into the child sandbox at artifacts/rlm-inputs/local_workspace_snapshot.txt.
Adds that path to the child’s context.

This preserves child-sandbox isolation while giving the child enough explicit evidence to inspect local code. There is no implicit host filesystem mount; the snapshot is the child's only view of the local checkout.
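A bounded snapshot of this kind could be built as follows. Only the target path comes from this page. The suffix allowlist, the 200,000-byte cap, skipping hidden paths, and the per-file header format are assumptions, because the page does not say how fleet-rlm selects "relevant" files or where it truncates.

```python
from __future__ import annotations

from pathlib import Path

# Target path inside the child sandbox, as documented above.
SNAPSHOT_PATH = "artifacts/rlm-inputs/local_workspace_snapshot.txt"
# Assumed notion of "relevant" files; the real selection rule is not documented here.
TEXT_SUFFIXES = {".py", ".md", ".toml", ".txt"}


def build_snapshot(root: Path, max_bytes: int = 200_000) -> str:
    """Concatenate relevant text files under root, stopping before max_bytes is exceeded."""
    parts: list[str] = []
    used = 0
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if any(part.startswith(".") for part in rel.parts):
            continue  # skip hidden files and directories such as .git
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        body = path.read_text(encoding="utf-8", errors="replace")
        block = f"===== {rel.as_posix()} =====\n{body}\n"
        size = len(block.encode("utf-8"))
        if used + size > max_bytes:
            break  # bounded: stop at the first file that would overflow the cap
        parts.append(block)
        used += size
    return "".join(parts)
```

Uploading the returned text to SNAPSHOT_PATH in the child sandbox and appending that path to the child's context depend on the Daytona SDK and are not shown.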

Budget exhaustion

When the shared rlm_max_llm_calls budget is exhausted, the delegation tool returns an error envelope instead of starting a new child:

{"status": "error", "error": "llm_call_budget_exhausted", "remaining": 0}

This bubbles back into the ReAct trace; the host agent can decide to summarize what it has, ask the user to raise the budget, or stop. Inside a child RLM, an exhausted budget cuts off further sub_rlm() / llm_query() calls the same way.

Automatic routing to native RLM

When a turn runs with execution_mode=auto, fleet-rlm previews the attached context (document path, repository, additional context paths, and inline content) and estimates its size in characters. If the estimate exceeds FLEET_RLM_LARGE_CONTEXT_THRESHOLD (default 32000), fleet-rlm routes the turn to the native RLM path instead of the ReAct chat agent. A turn that is clearly a URL-document task is routed the same way. Large inputs then use REPL-variable mode from the start instead of being inlined into the planner prompt. Set the threshold with an environment variable:

.env

FLEET_RLM_LARGE_CONTEXT_THRESHOLD=32000

Lower the value to delegate sooner; raise it to keep more turns on the ReAct path. Explicit modes (execution_mode=rlm or execution_mode=rlm_only) bypass the threshold and always route to native RLM.
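The auto routing rule can be sketched as a pure function. route_turn, its return labels, and summing len() over the attached parts are assumptions; the page states only that fleet-rlm estimates the size in characters and compares it with the threshold. The URL-document signal is a plain boolean here because the page does not say how a clear URL-document task is detected.

```python
from __future__ import annotations

EXPLICIT_RLM_MODES = {"rlm", "rlm_only"}


def route_turn(
    execution_mode: str,
    context_parts: list[str],
    *,
    threshold: int = 32000,
    is_url_document_task: bool = False,
) -> str:
    """Return "native_rlm" or "react" for a turn (hypothetical labels)."""
    if execution_mode in EXPLICIT_RLM_MODES:
        return "native_rlm"  # explicit modes bypass the threshold
    if execution_mode != "auto":
        return "react"  # assumption: other modes are not routed by size
    # Estimate size as the total character count of everything attached to the turn.
    estimated_chars = sum(len(part) for part in context_parts)
    if estimated_chars > threshold or is_url_document_task:
        return "native_rlm"
    return "react"
```

Because the comparison is strictly greater-than, a turn with exactly 32000 characters of attached context stays on the ReAct path under this sketch.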

Configuration

Variable	Default	Purpose
`RLM_CHILD_ISOLATION_MODE`	`auto`	`auto`, `clean`, or `context` (debug only)
`RLM_CHILD_FORK_FALLBACK`	`clean`	Behavior when fork creation fails
`rlm_max_iterations`	runtime default	Per-RLM iteration cap
`rlm_max_llm_calls`	runtime default	Tree-wide semantic call budget
`FLEET_RLM_LARGE_CONTEXT_THRESHOLD`	`32000`	Character threshold for `auto`-mode native-RLM routing

See the configuration reference for the full environment-variable list.
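Reading and validating the environment-variable settings from the table might look like this. RlmSettings and load_settings are hypothetical names. The defaults and the allowed RLM_CHILD_ISOLATION_MODE values come from the table, while raising ValueError on bad input is an assumption. RLM_CHILD_FORK_FALLBACK is passed through unvalidated because the page documents only the clean value, and rlm_max_iterations / rlm_max_llm_calls are left out because their defaults are runtime-defined.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional

ISOLATION_MODES = {"auto", "clean", "context"}


@dataclass(frozen=True)
class RlmSettings:
    child_isolation_mode: str = "auto"
    child_fork_fallback: str = "clean"
    large_context_threshold: int = 32000


def load_settings(env: Optional[Mapping[str, str]] = None) -> RlmSettings:
    """Build settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    mode = env.get("RLM_CHILD_ISOLATION_MODE", "auto")
    if mode not in ISOLATION_MODES:
        raise ValueError(
            f"RLM_CHILD_ISOLATION_MODE must be one of {sorted(ISOLATION_MODES)}, got {mode!r}"
        )
    threshold = int(env.get("FLEET_RLM_LARGE_CONTEXT_THRESHOLD", "32000"))
    if threshold <= 0:
        raise ValueError("FLEET_RLM_LARGE_CONTEXT_THRESHOLD must be positive")
    return RlmSettings(
        child_isolation_mode=mode,
        child_fork_fallback=env.get("RLM_CHILD_FORK_FALLBACK", "clean"),
        large_context_threshold=threshold,
    )
```

Passing an explicit mapping instead of relying on os.environ keeps the loader easy to test.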

Implementation pointers

When you need to read the code:

src/fleet_rlm/runtime/tools/rlm_delegate.py — the delegation tool itself.
src/fleet_rlm/runtime/models/builders.py — build_recursive_subquery_rlm().
src/fleet_rlm/integrations/daytona/isolation.py — child sandbox creation hooks and isolation-mode policy (auto / clean / context).
src/fleet_rlm/integrations/daytona/bridge.py — host-callback broker that dispatches sandbox callbacks.
