fleet-rlm exposes four overlapping observability surfaces. Pick the one matched to what you need to see.
| Surface | What it shows | When to use |
|---|---|---|
| MLflow tracing | DSPy program traces — spans, prompts, model calls | Debugging program logic, GEPA evaluation, prompt regressions |
| WebSocket execution events | Live tool calls, sandbox steps, recursive delegation | Live UI, real-time UX |
| Runtime status / diagnostics | LM + Daytona connectivity, readiness | Probing health, smoke validation |
| PostHog (optional) | Product analytics | Product usage telemetry |
MLflow tracing
When MLFLOW_ENABLED=true (default), every DSPy call emits a trace to MLFLOW_TRACKING_URI under the MLFLOW_EXPERIMENT experiment (default: fleet-rlm).
Local auto-start
In APP_ENV=local, the API server auto-starts a localhost MLflow target on port 5001 unless MLFLOW_AUTO_START=false. To run it explicitly, use the make mlflow-server target.
Auto-start uses the same address, which keeps MLFLOW_TRACKING_URI=http://127.0.0.1:5001 aligned with the make mlflow-server target.
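The settings described above can be sketched as a small resolver. This is an illustrative sketch, not fleet-rlm's actual configuration code: the `MlflowSettings` class, the `resolve_mlflow_settings` helper, the accepted truthy spellings, the fallback tracking URI outside local mode, and the rule that auto-start also requires tracing to be enabled are all assumptions.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MlflowSettings:
    """Hypothetical container for the MLflow-related environment settings."""
    enabled: bool
    tracking_uri: str
    experiment: str
    auto_start: bool


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean-like environment value, falling back to the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def resolve_mlflow_settings(env: Mapping[str, str]) -> MlflowSettings:
    """Apply the documented defaults to a mapping of environment variables."""
    enabled = _flag(env, "MLFLOW_ENABLED", True)  # documented default: true
    is_local = env.get("APP_ENV", "") == "local"
    return MlflowSettings(
        enabled=enabled,
        # Assumption: fall back to the local address when no URI is set.
        tracking_uri=env.get("MLFLOW_TRACKING_URI", "http://127.0.0.1:5001"),
        experiment=env.get("MLFLOW_EXPERIMENT", "fleet-rlm"),
        # Assumption: auto-start only matters when tracing itself is enabled.
        auto_start=enabled and is_local and _flag(env, "MLFLOW_AUTO_START", True),
    )
```

Passing `os.environ` to `resolve_mlflow_settings` gives a quick way to see which values a local run would likely pick up.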
What ends up in a trace
Each user turn produces one MLflow trace covering:
- The top-level FleetAgent.respond(...) call.
- All ReAct iterations (thought, action, observation).
- Every tool invocation, including delegate_to_rlm.
- Recursive child RLM runs as nested spans.
- All llm_query / sub_rlm callbacks from inside child sandboxes.
Because the host-callback bridge is the only path from sandbox to host, trace continuity is preserved across recursion — a child RLM’s calls show up as nested spans under their parent, not as orphan traces.
Wiring location
| File | Role |
|---|---|
| integrations/observability/callbacks.py | Centralized DSPy callback registry (MLflow + PostHog) |
| integrations/observability/mlflow_runtime.py | MLflow client + tracing setup |
| integrations/observability/mlflow_traces.py | Trace context propagation |
| integrations/observability/trace_context.py | Cross-thread/async trace context |
| runtime/quality/mlflow_evaluation.py | DSPy evaluation under MLflow |
| runtime/quality/mlflow_optimization.py | GEPA / MIPROv2 optimization runs under MLflow |
Since 0.5.50, MLflow and PostHog callbacks are installed through a single deduplicating registry. Observability startup remains lazy, and callbacks stay visible to worker-thread DSPy contexts when the global dspy.configure refuses non-owner threads or async tasks.
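The idea of a deduplicating registry can be sketched in a few lines. This is an assumption-laden illustration, not the code in callbacks.py: the `CallbackRegistry` class, its one-instance-per-class rule, and the two placeholder callback classes are hypothetical.

```python
from typing import Any, List


class CallbackRegistry:
    """Keeps at most one callback instance per callback class."""

    def __init__(self) -> None:
        self._callbacks: List[Any] = []

    def register(self, callback: Any) -> bool:
        """Add the callback unless one of the same class is already installed."""
        if any(type(existing) is type(callback) for existing in self._callbacks):
            return False
        self._callbacks.append(callback)
        return True

    def snapshot(self) -> List[Any]:
        """Return a copy that a worker thread could attach to its own context."""
        return list(self._callbacks)


class HypotheticalMlflowCallback:
    """Placeholder standing in for an MLflow tracing callback."""


class HypotheticalPosthogCallback:
    """Placeholder standing in for a PostHog analytics callback."""


registry = CallbackRegistry()
registry.register(HypotheticalMlflowCallback())
registry.register(HypotheticalPosthogCallback())
registry.register(HypotheticalMlflowCallback())  # ignored: already installed
```

Deduplicating at registration time means repeated startup attempts, such as a lazy initializer racing a per-turn initializer, do not double-report every DSPy call.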
MLflow startup status
/api/v1/runtime/status reports an mlflow.startup_status field separate from the top-level ready flag (which also requires LM and Daytona checks). Use it to distinguish “MLflow is broken” from “everything else is broken”:
| startup_status | Meaning |
|---|---|
| ready | MLflow client initialized and MLFLOW_TRACKING_URI is reachable. |
| degraded | MLflow is enabled but startup failed: version mismatch, unreachable URI, or init error. Check mlflow.startup_error for the remediation hint. |
| disabled | MLFLOW_ENABLED=false. |
| pending | Background MLflow warmup has not finished yet. |
If the local UI reports Cannot query field 'effectiveTraceArchivalRetention', an older mlflow server process is still listening on the configured port. Stop the stale server, run make mlflow-upgrade, then restart with make mlflow-server. Chat turns call initialize_mlflow() on entry so tracing can recover on the first turn after the server is restarted, even if background startup is still racing.
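A client-side check of the startup status might look like the sketch below. It reads only the `mlflow` fields shown in the runtime status payload; the `describe_mlflow_status` helper and its message wording are hypothetical.

```python
from typing import Any, Mapping


def describe_mlflow_status(status: Mapping[str, Any]) -> str:
    """Turn the mlflow block of a runtime status payload into a one-line hint."""
    mlflow = status.get("mlflow") or {}
    state = mlflow.get("startup_status")
    if state == "ready":
        return "MLflow is ready; traces should be recorded."
    if state == "degraded":
        detail = mlflow.get("startup_error") or "no startup_error reported"
        return f"MLflow startup failed: {detail}"
    if state == "disabled":
        return "MLflow is disabled (MLFLOW_ENABLED=false)."
    if state == "pending":
        return "MLflow warmup is still running; retry shortly."
    return f"Unrecognized startup_status: {state!r}"
```

Checking this field separately from the top-level ready flag keeps an MLflow outage from being mistaken for an LM or Daytona failure.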
Session trace listing
GET /api/v1/sessions/{session_id}/traces returns paginated MLflow traces linked to a session, including child delegations spawned by recursive RLM turns. The Workbench uses this to render the trace strip alongside the conversation so reviewers can jump straight from a message to its full DSPy trace tree.
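Walking a paginated trace listing could be sketched as follows. The response shape is not documented here, so the `traces` and `next_page_token` keys are assumptions, as are the `iter_session_traces` helper and the injected `fetch_page` callable (which would wrap the real HTTP request). The fake fetcher exists only to make the sketch runnable.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# fetch_page(session_id, page_token) -> one decoded page of the listing.
FetchPage = Callable[[str, Optional[str]], Dict[str, Any]]


def iter_session_traces(fetch_page: FetchPage, session_id: str) -> Iterator[Dict[str, Any]]:
    """Yield every trace for a session, following page tokens until exhausted."""
    token: Optional[str] = None
    while True:
        page = fetch_page(session_id, token)
        yield from page.get("traces", [])  # assumed key
        token = page.get("next_page_token")  # assumed key
        if not token:
            break


def fake_fetch_page(session_id: str, token: Optional[str]) -> Dict[str, Any]:
    """In-memory stand-in for the HTTP call, returning two pages."""
    pages: Dict[Optional[str], Dict[str, Any]] = {
        None: {"traces": [{"id": "t1"}, {"id": "t2"}], "next_page_token": "p2"},
        "p2": {"traces": [{"id": "t3"}]},
    }
    return pages[token]
```

Injecting the fetcher keeps the pagination loop independent of the HTTP client and of whatever query parameters the endpoint actually accepts.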
Trace feedback
The Workbench supports thumbs-up / thumbs-down feedback per turn. Feedback is submitted via:
POST /api/v1/traces/feedback
The feedback is attached to the corresponding MLflow trace, so GEPA optimization runs can use it as a label signal downstream.
WebSocket execution events
Two WebSocket endpoints power the live UI:
| Endpoint | Stream |
|---|---|
| /api/v1/ws/execution | Chat stream events: user-facing turn-taking |
| /api/v1/ws/execution/events | Execution graph events: tool calls, sandbox steps, recursive delegation |
Both require the same auth as HTTP endpoints when AUTH_REQUIRED=true.
Event shaping
Live UI events are shaped by src/fleet_rlm/api/events/events.py and src/fleet_rlm/runtime/execution/streaming_events.py. Since 0.5.50, direct, tool-using, and RLM turns share a single dspy.streamify replay path, so every turn flows through one websocket event pipeline regardless of routing. Each event is a JSON envelope:
- event: the streamed event frame.
- command_result: tool/command outcomes.
- error: error envelopes.
- Execution stream frames carry sandbox steps and recursive-delegation events.
The two endpoints serve different concerns:
- Chat stream is bidirectional — clients send user messages and receive assistant frames.
- Execution stream is read-only — clients subscribe to receive artifact and execution events as they arrive from the chat runtime. Useful for opening a second UI surface (e.g., a workspace canvas) on the same session.
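On the client side, handling the envelope kinds listed above might be sketched like this. The discriminator key is not specified in this section, so the `type` key is an assumption; `route_frame` and the handler table are hypothetical names, and the returned strings stand in for real UI actions.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], str]

# One handler per documented envelope kind.
HANDLERS: Dict[str, Handler] = {
    "event": lambda frame: "render streamed frame",
    "command_result": lambda frame: "show tool/command outcome",
    "error": lambda frame: "surface error",
}


def route_frame(raw: str) -> str:
    """Decode one WebSocket text frame and dispatch it by envelope kind."""
    frame = json.loads(raw)
    if not isinstance(frame, dict):
        return "ignored non-object frame"
    kind = frame.get("type")  # assumption: the envelope names its kind here
    handler = HANDLERS.get(kind)
    if handler is None:
        return f"ignored unknown kind: {kind!r}"
    return handler(frame)
```

Ignoring unknown kinds rather than raising lets a client tolerate envelope kinds it does not recognize.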
Decoupled delivery
Turn execution runs in a background task with its own agent context, independent of the WebSocket that submitted the message. All events flow through a shared ExecutionEventEmitter, which fans out to every subscriber registered for the same (workspace_id, user_id, session_id) tuple.
Two consequences for clients:
- Reconnect-safe turns. If your WebSocket drops mid-turn, the turn keeps running. Re-subscribe with the same identity tuple to resume receiving frames.
- Multiple viewers per session. A second client (for example, a workspace canvas alongside the chat panel) can attach to /api/v1/ws/execution/events and observe the same stream without affecting the primary chat connection.
Frame shapes are unchanged — existing clients require no updates.
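The fan-out model can be sketched with one `asyncio.Queue` per subscriber, keyed by the identity tuple. This illustrates the pattern only and is not the real ExecutionEventEmitter: `SketchEventEmitter`, its method names, and the frame contents are hypothetical, and the sketch does not replay frames emitted while a client was disconnected (whether fleet-rlm does is not stated here).

```python
import asyncio
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Set, Tuple

SessionKey = Tuple[str, str, str]  # (workspace_id, user_id, session_id)


class SketchEventEmitter:
    """Delivers each frame to every queue subscribed under the same key."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[SessionKey, Set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, key: SessionKey) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[key].add(queue)
        return queue

    def unsubscribe(self, key: SessionKey, queue: asyncio.Queue) -> None:
        self._subscribers[key].discard(queue)

    def emit(self, key: SessionKey, frame: Dict[str, Any]) -> int:
        """Push the frame to all current subscribers; return how many received it."""
        targets = list(self._subscribers.get(key, ()))
        for queue in targets:
            queue.put_nowait(frame)
        return len(targets)


async def demo() -> List[Dict[str, Any]]:
    emitter = SketchEventEmitter()
    key = ("workspace-1", "user-1", "session-1")
    chat = emitter.subscribe(key)
    emitter.emit(key, {"step": 1})
    emitter.unsubscribe(key, chat)    # the socket drops mid-turn
    emitter.emit(key, {"step": 2})    # the turn keeps running with no listeners
    canvas = emitter.subscribe(key)   # a client re-subscribes with the same tuple
    emitter.emit(key, {"step": 3})
    return [chat.get_nowait(), canvas.get_nowait()]
```

Running `asyncio.run(demo())` returns the first frame from the original subscriber and the third from the re-subscribed one. Because the emitter holds no reference to any WebSocket, the producing turn is unaffected by subscribers coming and going.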
Reverse-proxy requirements
Behind a reverse proxy:
- Upgrade HTTP/1.1 connections.
- Disable response buffering.
- Forward the auth bearer token through.
Misconfigured proxies are the most common reason for “connected but no events” in production.
Runtime status and diagnostics
The runtime exposes structured health and connectivity probes.
Health
| Endpoint | Auth | Purpose |
|---|---|---|
| GET /health | None | Liveness: {"ok": true, "version": "..."} |
| GET /ready | None | Readiness: composite component status |
Sample readiness response:
{
  "ready": true,
  "planner_configured": true,
  "planner": "ready",
  "database": "ready",
  "database_required": true,
  "sandbox_provider": "daytona"
}
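A deploy script could interpret the readiness payload roughly as below. The field names come from the sample response; the rule for combining them, in particular that a non-ready database only blocks when database_required is true, is an assumption, and `blocking_components` is a hypothetical helper.

```python
from typing import Any, List, Mapping


def blocking_components(ready_payload: Mapping[str, Any]) -> List[str]:
    """List the components that appear to be holding readiness back."""
    blockers: List[str] = []
    if not ready_payload.get("planner_configured", False):
        blockers.append("planner (not configured)")
    elif ready_payload.get("planner") != "ready":
        blockers.append("planner")
    # Assumption: an optional database does not block readiness.
    database_required = ready_payload.get("database_required", True)
    if database_required and ready_payload.get("database") != "ready":
        blockers.append("database")
    return blockers
```

An empty list for a payload whose ready flag is false would suggest that some component outside this sketch is responsible.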
Runtime status
GET /api/v1/runtime/status returns a composite snapshot:
{
  "app_env": "local",
  "write_enabled": true,
  "ready": true,
  "sandbox_provider": "daytona",
  "active_models": {
    "planner": "openai/gpt-4o",
    "delegate": "openai/gpt-4o-mini",
    "delegate_small": ""
  },
  "llm": { "model_set": true, "api_key_set": true, "planner_configured": true },
  "mlflow": { "enabled": true, "startup_status": "ready", "startup_error": null },
  "daytona": { "api_key_set": true, "api_url_set": true, "target_set": true },
  "tests": {
    "lm": { "ok": true, "latency_ms": 850 },
    "daytona": { "ok": true, "latency_ms": 640 }
  },
  "guidance": []
}
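The cached probe results in the `tests` block lend themselves to a short summary. The sketch below reads only the `ok` and `latency_ms` fields shown in the sample; the `probe_report` helper, its output format, and the 1000 ms slow threshold are assumptions.

```python
from typing import Any, List, Mapping


def probe_report(status: Mapping[str, Any], slow_ms: int = 1000) -> List[str]:
    """Summarize each cached connectivity probe as 'name: state (latency)'."""
    lines: List[str] = []
    for name, result in sorted((status.get("tests") or {}).items()):
        latency = result.get("latency_ms")
        if not result.get("ok", False):
            state = "failed"
        elif latency is not None and latency > slow_ms:
            state = "slow"
        else:
            state = "ok"
        suffix = f" ({latency} ms)" if latency is not None else ""
        lines.append(f"{name}: {state}{suffix}")
    return lines
```

For the sample payload this yields one line each for daytona and lm, both marked ok at the default threshold.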
Connectivity probes
| Endpoint | Tests |
|---|---|
| POST /api/v1/runtime/tests/lm | Planner LM round-trip |
| POST /api/v1/runtime/tests/daytona | Daytona credentials and lifecycle |
Both return the latency and a preflight envelope, and they cache results into the runtime status payload.
Daytona smoke
For sandbox-level validation without invoking an LM:
uv run fleet-rlm daytona-smoke \
--repo https://github.com/Qredence/fleet-rlm.git \
--ref main
This exercises credentials, network, sandbox lifecycle, repo clone, volume mount, and basic execution end-to-end.
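To run the smoke check from a script or CI job, the command can be wrapped with `subprocess`. The flags are the ones shown above; the `build_smoke_command` and `run_smoke` helpers are hypothetical, and treating exit code 0 as success is an assumption about the CLI.

```python
import subprocess
from typing import List


def build_smoke_command(repo: str, ref: str) -> List[str]:
    """Assemble the documented daytona-smoke invocation as an argument list."""
    return ["uv", "run", "fleet-rlm", "daytona-smoke", "--repo", repo, "--ref", ref]


def run_smoke(repo: str, ref: str) -> bool:
    """Run the smoke check; assumes a zero exit code means it passed."""
    completed = subprocess.run(build_smoke_command(repo, ref), check=False)
    return completed.returncode == 0
```

Passing an argument list instead of a shell string avoids quoting problems with unusual repository URLs or refs.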
PostHog (optional)
| Variable | Default | Description |
|---|---|---|
| POSTHOG_ENABLED | false | Enable PostHog analytics |
| POSTHOG_HOST | https://eu.i.posthog.com | PostHog host |
When enabled, the runtime emits product-analytics events through integrations/observability/posthog_callback.py. This is not a tracing replacement — it is for product usage signals.
Reading order
When you need to debug a misbehaving turn:
- Open the MLflow trace for the turn — it has the full DSPy program execution.
- If sandbox steps look wrong, check the execution-stream WebSocket frames for the same turn.
- If startup looks wrong, hit /ready and /api/v1/runtime/status.
- If Daytona is suspect, run uv run fleet-rlm daytona-smoke.
See also