# Recursive RLM delegation in fleet-rlm

> How FleetAgent delegates to dspy.RLM through delegate_to_rlm, using REPL-variable inputs, isolated child Daytona sandboxes, and a shared LLM-call budget.

The ReAct chat agent does **not** directly run long-context work. When a task exceeds what a single context window can hold, the agent delegates through a specific ReAct tool — `delegate_to_rlm` — that spawns an isolated child Daytona sandbox running a bounded `dspy.RLM`.

This is fleet-rlm's implementation of **Algorithm 1** from [arXiv 2512.24601v2](https://arxiv.org/abs/2512.24601): inputs stored as REPL variables, sub-queries dispatched recursively, bounded by a shared call budget.

## Delegation flow

```text
User prompt
   ↓
FleetAgent  (dspy.ReAct, host LLM)
   │   decides the task exceeds one context and picks the tool:
   ↓
delegate_to_rlm(query, context="", document_url="")
   │   — src/fleet_rlm/runtime/tools/rlm_delegate.py
   │   — reads the active Daytona interpreter from a ContextVar
   │   — checks remaining LLM-call budget; returns error if exhausted
   │   — interpreter.build_delegate_child()   ← isolated child Daytona sandbox
   │   — optionally fetches document_url into the child's context
   ↓
build_recursive_subquery_rlm(
    interpreter=child,
    max_iterations=min(child.rlm_max_iterations, remaining_budget),
    max_llm_calls=remaining_budget,
)
   │   constructs the dspy.RLM bound to the child sandbox
   ↓
rlm(prompt=query, context=...)
   │   child RLM runs in REPL-variable mode: may call llm_query(),
   │   sub_rlm(), sub_rlm_batched() to recurse further inside its sandbox
   ↓
{"status": "ok", "answer": "..."}        ← bubbles back into the ReAct trace
```

The delegation tool reads the active `DaytonaInterpreter` from a `ContextVar`, checks the remaining LLM-call budget, builds a child sandbox via `interpreter.build_delegate_child()`, and constructs a fresh `dspy.RLM` bound to that child.
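The steps in that flow can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `rlm_delegate.py`: `ACTIVE_INTERPRETER`, `FakeInterpreter`, `remaining_llm_calls`, and `run_child_rlm` are hypothetical stand-ins, and the child RLM call is replaced by a placeholder.

```python
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeInterpreter:
    """Hypothetical stand-in for DaytonaInterpreter (not the real class)."""

    remaining_llm_calls: int
    rlm_max_iterations: int = 20

    def build_delegate_child(self) -> "FakeInterpreter":
        # The real method creates an isolated child Daytona sandbox.
        # Here we only copy the settings so the sketch stays runnable.
        return FakeInterpreter(self.remaining_llm_calls, self.rlm_max_iterations)


# Hypothetical variable name; the page only states that a ContextVar is used.
ACTIVE_INTERPRETER: ContextVar[Optional[FakeInterpreter]] = ContextVar(
    "active_interpreter", default=None
)


def run_child_rlm(child, query, context, max_iterations, max_llm_calls) -> str:
    """Placeholder for building and calling the child dspy.RLM."""
    return f"answered {query!r} within {max_iterations} iterations"


def delegate_to_rlm(query: str, context: str = "") -> dict:
    interpreter = ACTIVE_INTERPRETER.get()
    if interpreter is None:
        raise RuntimeError("no active interpreter bound to this context")

    remaining = interpreter.remaining_llm_calls
    if remaining <= 0:
        # Refuse to start a child once the shared budget is spent.
        return {"status": "error", "error": "llm_call_budget_exhausted", "remaining": 0}

    child = interpreter.build_delegate_child()
    answer = run_child_rlm(
        child,
        query,
        context,
        max_iterations=min(child.rlm_max_iterations, remaining),
        max_llm_calls=remaining,
    )
    return {"status": "ok", "answer": answer}
```

Reading the interpreter from a `ContextVar` keeps the tool signature free of runtime plumbing, so the ReAct agent only sees `query`, `context`, and `document_url`.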

## REPL variable mode

Inside the child sandbox, the RLM does not see the entire context as raw tokens. Following Algorithm 1, the **inputs are stored as REPL variables** in the child's persistent Python execution context (`sandbox.code_interpreter.create_context(...)`). The RLM works through them programmatically:

* Long documents are bound to named variables in the REPL.
* The RLM emits Python that inspects, slices, and reduces those variables.
* Sub-queries that need an LLM call are dispatched via the host-callback bridge, not by inflating the prompt.

This is what makes recursive long-context tasks fit: the model never has to see the whole input at once. It writes code that processes the input incrementally, and only the **summarized, decision-relevant** fragments cross back into the LLM context window.
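As a rough illustration of the kind of code a child RLM might emit in this mode, the sketch below slices a long input held in a variable and sends only one fragment at a time to the model. The variable name `document`, the chunk size, and the local `llm_query` stub are assumptions made so the example runs on its own; in a real sandbox `llm_query` is dispatched over the host-callback bridge.

```python
seen_prompt_sizes = []


def llm_query(prompt: str) -> str:
    """Local stub; records prompt sizes instead of calling a model."""
    seen_prompt_sizes.append(len(prompt))
    return f"stub summary of {len(prompt)} prompt chars"


# Assumed: the runtime has bound the long input to a REPL variable.
document = "\n".join(f"section {i}: " + "x" * 200 for i in range(50))

CHUNK_CHARS = 2_000  # illustrative value, not a fleet-rlm setting

# Slice the variable programmatically; the full text never enters a prompt.
chunks = [document[i:i + CHUNK_CHARS] for i in range(0, len(document), CHUNK_CHARS)]

# Each sub-query sees one fragment; only the short notes are kept.
notes = [llm_query("Summarize this fragment:\n" + chunk) for chunk in chunks]
```

Every prompt here is bounded by `CHUNK_CHARS` plus a short instruction, regardless of how large `document` grows.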

<Note>
  Since fleet-rlm `0.5.50`, large-input injection rides on upstream DSPy's native `dspy.RLM` `SandboxSerializable` contract (DSPy ≥ 3.3.0b1). Document and workspace context models implement `SandboxSerializable`, so REPL-variable mode is handled by DSPy directly rather than by Fleet-maintained variable-mode wrappers. The Fleet runtime keeps the private RLM hooks isolated in `runtime/models/builders.py`.
</Note>

## Two entry points, one budget

Recursive RLM work has **two** entry points, and they share **one** budget:

1. **`delegate_to_rlm()`** — from the host `FleetAgent`'s ReAct tool registry. The chat agent calls this when it decides a sub-task exceeds its context.
2. **`sub_rlm()` / `sub_rlm_batched()`** — from Python code already running *inside* a `dspy.RLM` sandbox, reaching back out through the Daytona bridge to spawn a further child.

Both go through `DaytonaInterpreter.build_delegate_child()` so child creation follows one backend-owned policy.

`rlm_max_llm_calls` is a **single shared semantic-call budget** across the entire recursive tree — not per frame, not per RLM. `sub_rlm_batched()` caps sibling parallelism at 4 while sharing that same budget across siblings.

<Note>
  The reason the budget is global rather than per-frame is that the sandbox callbacks (`llm_query`, `sub_rlm`, etc.) dispatch through fleet-rlm's interpreter methods over the Daytona bridge — **not** through DSPy's per-forward injected counters. Budget enforcement happens at the bridge boundary, which sees every recursive call regardless of depth.
</Note>
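One way to picture a tree-wide budget with capped sibling parallelism is a single lock-protected counter that every frame shares. The sketch below is an assumption-laden model, not fleet-rlm's bridge code: `SharedCallBudget`, `try_spend`, and the stub `sub_rlm` are hypothetical, and each stub child spends exactly one call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_SIBLINGS = 4  # sibling cap described above


class SharedCallBudget:
    """Hypothetical tree-wide counter; all frames hold the same instance."""

    def __init__(self, max_llm_calls: int) -> None:
        self._remaining = max_llm_calls
        self._lock = threading.Lock()

    def try_spend(self) -> bool:
        # Check and decrement atomically so siblings cannot overspend.
        with self._lock:
            if self._remaining <= 0:
                return False
            self._remaining -= 1
            return True

    @property
    def remaining(self) -> int:
        with self._lock:
            return self._remaining


def sub_rlm(query: str, budget: SharedCallBudget) -> dict:
    # Stub child: one semantic call charged to the shared budget.
    if not budget.try_spend():
        return {"status": "error", "error": "llm_call_budget_exhausted", "remaining": 0}
    return {"status": "ok", "answer": f"answer to {query}"}


def sub_rlm_batched(queries: list, budget: SharedCallBudget) -> list:
    # At most MAX_SIBLINGS children run at once; all draw from one budget.
    with ThreadPoolExecutor(max_workers=MAX_SIBLINGS) as pool:
        return list(pool.map(lambda q: sub_rlm(q, budget), queries))
```

With a budget of 3 and five queries, exactly three children succeed and two receive the exhaustion envelope, whichever threads happen to run first.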

## Sandbox callbacks

Code running inside a child RLM sandbox can call these primitives via the Daytona bridge:

| Callback                             | Purpose                                                |
| ------------------------------------ | ------------------------------------------------------ |
| `llm_query(prompt, ...)`             | Single semantic LLM call against the host's planner LM |
| `llm_query_batched(prompts, ...)`    | Batched semantic LLM calls (parallel)                  |
| `sub_rlm(query, context="")`         | Recurse further — spawn a new child RLM                |
| `sub_rlm_batched(queries, contexts)` | Batched recursion, capped at 4 sibling children        |
| `SUBMIT(answer)`                     | Final-artifact capture for the parent                  |

These are **semantic** primitives — they exist so the sandbox-side Python can think with the LLM without leaving the sandbox. They are dispatched to interpreter methods on the host, which is how the global budget is enforced and how MLflow traces stay connected across the tree.

`rlm_query` and `rlm_query_batched` are agent-level recursive entrypoints; they are **not** sandbox callbacks. Sandbox-authored code should use `llm_query` and `sub_rlm` instead.
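The sketch below shows how sandbox-authored code might combine these primitives in a small map-then-merge pattern. The three callbacks are replaced by local stubs so the example is runnable outside a sandbox; their real signatures accept further arguments that this page elides, and the `sections` data is invented for illustration.

```python
submitted = []
calls = {"count": 0}


def llm_query(prompt: str) -> str:
    """Stub for the bridge callback; counts calls instead of using a model."""
    calls["count"] += 1
    return f"stub note ({len(prompt)} prompt chars)"


def llm_query_batched(prompts: list) -> list:
    """Stub: the real callback runs the prompts in parallel on the host."""
    return [llm_query(p) for p in prompts]


def SUBMIT(answer: str) -> None:
    """Stub: the real callback captures the final artifact for the parent."""
    submitted.append(answer)


# Illustrative input that would normally come from REPL variables.
sections = {
    "setup": "Install the package and set the API key.",
    "usage": "Run the agent with a prompt and optional context.",
    "limits": "Budgets are shared across the recursive tree.",
}

# Map: one semantic call per section.
prompts = [f"List the key claims in section {name!r}:\n{text}" for name, text in sections.items()]
partial_notes = llm_query_batched(prompts)

# Reduce: merge the short notes, then hand the result to the parent.
final = llm_query("Merge these notes into one answer:\n" + "\n".join(partial_notes))
SUBMIT(final)
```

This run spends four semantic calls in total (three batched plus one merge), all of which would count against the shared budget.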

## Child isolation policy

The default is `RLM_CHILD_ISOLATION_MODE=auto`:

| Condition                                      | Behavior                                                                                                                               |
| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| Parent has **no durable mounted volume**       | Fork the parent Daytona sandbox into a child sandbox                                                                                   |
| Durable volume **is mounted**                  | Create a clean child Daytona sandbox with the same `repo_url`, `repo_ref`, and `context_paths`, plus a child-specific `volume_subpath` |
| Fork fails and `RLM_CHILD_FORK_FALLBACK=clean` | Retry with a clean child sandbox                                                                                                       |
| After child run completes                      | Child sandbox is deleted                                                                                                               |

Child outputs return through the RLM answer. Child files and artifacts are **not** promoted to the parent automatically. If you want durable output, write to the mounted volume's `memory/` or `artifacts/` paths from inside the child.
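The policy table can be read as a small decision function. The sketch below encodes it under assumptions: the function names and return labels are hypothetical, `clean` mode is assumed to always create a clean child, and the behavior for fallback values other than `clean` is not documented here, so the sketch simply raises.

```python
def choose_child_strategy(mode: str, has_durable_volume: bool) -> str:
    """Return "fork", "clean", or "context" for a requested isolation mode."""
    if mode == "context":
        return "context"  # debug-only opt-out
    if mode == "clean":
        return "clean"  # assumption: always a clean child sandbox
    if mode == "auto":
        # Durable volume mounted: clean child with its own volume_subpath.
        # No durable volume: fork the parent sandbox.
        return "clean" if has_durable_volume else "fork"
    raise ValueError(f"unknown isolation mode: {mode!r}")


def resolve_after_fork_failure(fork_fallback: str) -> str:
    """Decide what to do when forking the parent sandbox fails."""
    if fork_fallback == "clean":
        return "clean"  # retry with a clean child sandbox
    # Assumption: any other value means the failure is surfaced.
    raise RuntimeError("fork failed and no clean fallback is configured")
```

For example, `choose_child_strategy("auto", has_durable_volume=False)` selects a fork, while the same call with a mounted volume selects a clean child.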

<Warning>
  `RLM_CHILD_ISOLATION_MODE=context` is retained only as a backend/local debugging opt-out. It preserves the earlier behavior, in which the child runs in a fresh context inside the parent's own sandbox, and **should not** be treated as the production isolation contract.
</Warning>

## Local workspace snapshot fallback

When the parent turn is analyzing a local host checkout and **no `repo_url` is available** to recreate that checkout in a clean child sandbox, `delegate_to_rlm()` falls back:

1. Writes a bounded text snapshot of relevant local repository files into the child sandbox at `artifacts/rlm-inputs/local_workspace_snapshot.txt`.
2. Adds that path to the child's context.

This preserves child-sandbox isolation while giving the child enough explicit evidence to inspect local code. There is no implicit host filesystem mount — the snapshot is the only bridge.
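A bounded snapshot of this kind could be assembled roughly as follows. This is a sketch, not fleet-rlm's implementation: `build_snapshot`, the suffix filter, the header format, and the character limit are all assumptions, and the page does not say how "relevant" files are selected.

```python
from pathlib import Path


def build_snapshot(root: Path, suffixes=(".py", ".md"), max_chars: int = 20_000) -> str:
    """Concatenate matching files under `root` into one bounded text blob."""
    parts = []
    used = 0
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        header = f"=== {path.relative_to(root).as_posix()} ===\n"
        body = path.read_text(encoding="utf-8", errors="replace")
        entry = header + body + "\n"
        if used + len(entry) > max_chars:
            # Truncate the last entry so the snapshot never exceeds the bound.
            entry = entry[: max_chars - used]
        parts.append(entry)
        used += len(entry)
        if used >= max_chars:
            break
    return "".join(parts)
```

The resulting string would then be written into the child sandbox at the snapshot path, which keeps the child's view of the host limited to explicit, size-capped text.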

## Budget exhaustion

When the shared `rlm_max_llm_calls` budget is exhausted, the delegation tool returns an error envelope instead of starting a new child:

```json
{"status": "error", "error": "llm_call_budget_exhausted", "remaining": 0}
```

This bubbles back into the ReAct trace; the host agent can decide to summarize what it has, ask the user to raise the budget, or stop. Inside a child RLM, an exhausted budget cuts off further `sub_rlm()` / `llm_query()` calls the same way.
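For callers that inspect the tool result programmatically, the envelope can be handled with a few lines of standard-library code. The helper name and the returned messages below are hypothetical; only the `status`, `error`, `answer`, and `remaining` fields come from this page.

```python
import json


def interpret_delegate_result(raw: str) -> str:
    """Turn a delegate_to_rlm JSON envelope into a short note for the caller."""
    envelope = json.loads(raw)
    if envelope.get("status") == "ok":
        return envelope["answer"]
    if envelope.get("error") == "llm_call_budget_exhausted":
        return (
            "Budget exhausted: summarize partial findings or ask the user "
            "to raise rlm_max_llm_calls."
        )
    return f"Delegation failed: {envelope.get('error', 'unknown error')}"
```

Checking the `error` code rather than matching on free text keeps the handling stable if the human-readable wording changes.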

## Automatic routing to native RLM

When a turn runs with `execution_mode=auto`, fleet-rlm previews the attached context (document path, repository, additional context paths, and inline content) and estimates its size in characters. If the estimate exceeds `FLEET_RLM_LARGE_CONTEXT_THRESHOLD` (default `32000`), fleet-rlm routes the turn to the native RLM path instead of the ReAct chat agent. A turn that is clearly a URL-document task is also routed to native RLM.

Large inputs then ride on REPL-variable mode from the start instead of being inlined into the planner prompt. Set the threshold via environment variable:

```bash .env
FLEET_RLM_LARGE_CONTEXT_THRESHOLD=32000
```

Lower the value to delegate sooner; raise it to keep more turns on the ReAct path. Explicit modes (`execution_mode=rlm` or `rlm_only`) bypass the threshold and always route to native RLM.
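The routing rule can be summarized as a small function. This is a sketch of the documented behavior only: `choose_route` and the route labels `"native_rlm"` and `"react"` are hypothetical, the size estimate is taken as an input rather than computed, and the URL-document shortcut is left out.

```python
import os

DEFAULT_THRESHOLD = 32_000  # documented default, in characters


def choose_route(execution_mode: str, estimated_context_chars: int, env=None) -> str:
    """Pick a path for a turn based on mode and estimated context size."""
    env = os.environ if env is None else env

    # Explicit modes bypass the threshold entirely.
    if execution_mode in ("rlm", "rlm_only"):
        return "native_rlm"

    threshold = int(env.get("FLEET_RLM_LARGE_CONTEXT_THRESHOLD", DEFAULT_THRESHOLD))
    # Only an estimate strictly above the threshold triggers rerouting.
    if execution_mode == "auto" and estimated_context_chars > threshold:
        return "native_rlm"
    return "react"
```

Passing `env` explicitly makes the rule easy to exercise without touching the process environment, for example `choose_route("auto", 5000, {"FLEET_RLM_LARGE_CONTEXT_THRESHOLD": "4000"})`.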

## Configuration

| Variable                            | Default         | Purpose                                                |
| ----------------------------------- | --------------- | ------------------------------------------------------ |
| `RLM_CHILD_ISOLATION_MODE`          | `auto`          | `auto`, `clean`, or `context` (debug only)             |
| `RLM_CHILD_FORK_FALLBACK`           | `clean`         | Behavior when fork creation fails                      |
| `rlm_max_iterations`                | runtime default | Per-RLM iteration cap                                  |
| `rlm_max_llm_calls`                 | runtime default | Tree-wide semantic call budget                         |
| `FLEET_RLM_LARGE_CONTEXT_THRESHOLD` | `32000`         | Character threshold for `auto`-mode native-RLM routing |

See the [configuration reference](/fleet-rlm/reference/configuration) for the full environment-variable list.

## Implementation pointers

When you need to read the code:

* `src/fleet_rlm/runtime/tools/rlm_delegate.py` — the delegation tool itself.
* `src/fleet_rlm/runtime/models/builders.py` — `build_recursive_subquery_rlm()`.
* `src/fleet_rlm/integrations/daytona/isolation.py` — child sandbox creation hooks and isolation-mode policy (auto / clean / context).
* `src/fleet_rlm/integrations/daytona/bridge.py` — host-callback broker that dispatches sandbox callbacks.

## See also

* [Architecture](/fleet-rlm/concepts/architecture) — where the delegation tool fits in the layering.
* [Daytona runtime](/fleet-rlm/concepts/daytona-runtime) — sandbox lifecycle, volumes, and the bridge.
* [Configuration](/fleet-rlm/reference/configuration) — environment variables.