fleet-rlm exposes four overlapping observability surfaces. Pick the one that matches what you need to see.
| Surface | What it shows | When to use |
| --- | --- | --- |
| MLflow tracing | DSPy program traces: spans, prompts, model calls | Debugging program logic, GEPA evaluation, prompt regressions |
| WebSocket execution events | Live tool calls, sandbox steps, recursive delegation | Live UI, real-time UX |
| Runtime status / diagnostics | LM + Daytona connectivity, readiness | Probing health, smoke validation |
| PostHog (optional) | Product analytics | Product usage telemetry |

MLflow tracing

When MLFLOW_ENABLED=true (default), every DSPy call emits a trace to MLFLOW_TRACKING_URI under the MLFLOW_EXPERIMENT experiment (default: fleet-rlm).

Local auto-start

In APP_ENV=local, the API server auto-starts a localhost MLflow target on port 5001 unless MLFLOW_AUTO_START=false. To run it explicitly:
make mlflow-server
This keeps MLFLOW_TRACKING_URI=http://127.0.0.1:5001 aligned with the make mlflow-server target.
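
The defaults described above can be summarized in a short sketch. This is an illustration only, not the fleet-rlm implementation: the `MlflowSettings` class, the `resolve_mlflow_settings` helper, and the set of accepted truthy spellings are assumptions made for the example. Only the variable names and default values come from this page.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MlflowSettings:
    """Hypothetical container for the MLflow-related settings on this page."""
    enabled: bool
    tracking_uri: Optional[str]
    experiment: str
    auto_start: bool


def _as_bool(value: Optional[str], default: bool) -> bool:
    # Assumption: these spellings count as true; the real parser may differ.
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def resolve_mlflow_settings(env: Mapping[str, str] = os.environ) -> MlflowSettings:
    """Resolve MLflow settings from environment-style key/value pairs."""
    is_local = env.get("APP_ENV") == "local"
    enabled = _as_bool(env.get("MLFLOW_ENABLED"), default=True)
    # Auto-start only applies in APP_ENV=local and can be switched off.
    auto_start = (
        enabled and is_local and _as_bool(env.get("MLFLOW_AUTO_START"), default=True)
    )
    # In local mode, fall back to the URI used by `make mlflow-server`.
    tracking_uri = env.get("MLFLOW_TRACKING_URI") or (
        "http://127.0.0.1:5001" if is_local else None
    )
    experiment = env.get("MLFLOW_EXPERIMENT") or "fleet-rlm"
    return MlflowSettings(enabled, tracking_uri, experiment, auto_start)
```

Calling `resolve_mlflow_settings({"APP_ENV": "local"})` yields tracing enabled, the localhost URI on port 5001, and the `fleet-rlm` experiment.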

What ends up in a trace

Each user turn produces one MLflow trace covering:
  • The top-level FleetAgent.respond(...) call.
  • All ReAct iterations — thought, action, observation.
  • Every tool invocation, including delegate_to_rlm.
  • Recursive child RLM runs as nested spans.
  • All llm_query / sub_rlm callbacks from inside child sandboxes.
Because the host-callback bridge is the only path from sandbox to host, trace continuity is preserved across recursion — a child RLM’s calls show up as nested spans under their parent, not as orphan traces.

Wiring location

| File | Role |
| --- | --- |
| integrations/observability/callbacks.py | Centralized DSPy callback registry (MLflow + PostHog) |
| integrations/observability/mlflow_runtime.py | MLflow client + tracing setup |
| integrations/observability/mlflow_traces.py | Trace context propagation |
| integrations/observability/trace_context.py | Cross-thread/async trace context |
| runtime/quality/mlflow_evaluation.py | DSPy evaluation under MLflow |
| runtime/quality/mlflow_optimization.py | GEPA / MIPROv2 optimization runs under MLflow |
Since 0.5.50, MLflow and PostHog callbacks are installed through a single deduplicating registry. Observability startup remains lazy, and callbacks stay visible to worker-thread DSPy contexts when the global dspy.configure refuses non-owner threads or async tasks.

MLflow startup status

/api/v1/runtime/status reports an mlflow.startup_status field separate from the top-level ready flag (which also requires LM and Daytona checks). Use it to distinguish “MLflow is broken” from “everything else is broken”:
| startup_status | Meaning |
| --- | --- |
| ready | MLflow client initialized and MLFLOW_TRACKING_URI is reachable. |
| degraded | MLflow is enabled but startup failed: version mismatch, unreachable URI, or init error. Check mlflow.startup_error for the remediation hint. |
| disabled | MLFLOW_ENABLED=false. |
| pending | Background MLflow warmup has not finished yet. |
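
A client can turn these four states into an operator-facing message. The sketch below assumes the payload nests the fields as `mlflow.startup_status` and `mlflow.startup_error`, as in the runtime status sample on this page. The function name `describe_mlflow_status` is hypothetical.

```python
from typing import Any, Mapping


def describe_mlflow_status(status: Mapping[str, Any]) -> str:
    """Map the mlflow section of a runtime status payload to a short message."""
    mlflow = status.get("mlflow") or {}
    state = mlflow.get("startup_status")
    if state == "ready":
        return "MLflow is ready"
    if state == "degraded":
        # startup_error carries the remediation hint when startup failed.
        error = mlflow.get("startup_error") or "no startup_error reported"
        return f"MLflow startup failed: {error}"
    if state == "disabled":
        return "MLflow is disabled (MLFLOW_ENABLED=false)"
    if state == "pending":
        return "MLflow warmup is still running; retry shortly"
    return f"Unrecognized startup_status: {state!r}"
```

Because the top-level ready flag also depends on the LM and Daytona checks, checking this field separately avoids blaming MLflow for an unrelated failure.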
If the local UI reports Cannot query field 'effectiveTraceArchivalRetention', an older mlflow server process is still listening on the configured port. Stop the stale server, run make mlflow-upgrade, then restart with make mlflow-server. Chat turns call initialize_mlflow() on entry so tracing can recover on the first turn after the server is restarted, even if background startup is still racing.

Session trace listing

GET /api/v1/sessions/{session_id}/traces returns paginated MLflow traces linked to a session, including child delegations spawned by recursive RLM turns. The Workbench uses this to render the trace strip alongside the conversation so reviewers can jump straight from a message to its full DSPy trace tree.
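
A minimal client sketch for this endpoint, using only the Python standard library. The helper names and the base URL in the usage note are assumptions. The pagination parameters are not named on this page, so the sketch omits them, and the bearer header is only relevant when AUTH_REQUIRED=true.

```python
import json
from typing import Any, Optional
from urllib.parse import quote
from urllib.request import Request, urlopen


def session_traces_url(base_url: str, session_id: str) -> str:
    """Build the trace-listing URL for one session (session_id is URL-escaped)."""
    return f"{base_url.rstrip('/')}/api/v1/sessions/{quote(session_id, safe='')}/traces"


def fetch_session_traces(
    base_url: str,
    session_id: str,
    token: Optional[str] = None,
    timeout: float = 10.0,
) -> Any:
    """GET the traces linked to a session and return the decoded JSON body."""
    request = Request(
        session_traces_url(base_url, session_id),
        headers={"Accept": "application/json"},
    )
    if token:
        # Assumption: the API accepts a standard bearer token header.
        request.add_header("Authorization", f"Bearer {token}")
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_session_traces("http://localhost:8000", "my-session")` would query a locally running API server, assuming it listens on that address.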

Trace feedback

The Workbench supports thumbs-up / thumbs-down feedback per turn. Feedback is submitted via:
POST /api/v1/traces/feedback
The feedback is attached to the corresponding MLflow trace, so GEPA optimization runs can use it as a label signal downstream.

WebSocket execution events

Two WebSocket endpoints power the live UI:
| Endpoint | Stream |
| --- | --- |
| /api/v1/ws/execution | Chat stream events: user-facing turn-taking |
| /api/v1/ws/execution/events | Execution graph events: tool calls, sandbox steps, recursive delegation |
Both require the same auth as HTTP endpoints when AUTH_REQUIRED=true.

Event shaping

Live UI events are shaped by src/fleet_rlm/api/events/events.py and src/fleet_rlm/runtime/execution/streaming_events.py. Since 0.5.50, direct, tool-using, and RLM turns share a single dspy.streamify replay path, so every turn flows through one websocket event pipeline regardless of routing. Each event is a JSON envelope:
  • event — the streamed event frame.
  • command_result — tool/command outcomes.
  • error — error envelopes.
  • Execution stream frames carry sandbox steps and recursive-delegation events.
The two endpoints serve different concerns:
  • Chat stream is bidirectional — clients send user messages and receive assistant frames.
  • Execution stream is read-only — clients subscribe to receive artifact and execution events as they arrive from the chat runtime. Useful for opening a second UI surface (e.g., a workspace canvas) on the same session.
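
On the client side, incoming frames are typically routed by envelope kind. The sketch below assumes each JSON envelope names its kind in a `type` field. That field name is a guess, since this page lists the kinds (event, command_result, error) but not the key that carries them. `FrameRouter` is a hypothetical helper, not part of fleet-rlm.

```python
import json
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class FrameRouter:
    """Route decoded WebSocket frames to handlers registered per envelope kind."""

    def __init__(self, kind_field: str = "type") -> None:
        # Assumption: the envelope kind lives under this key.
        self._kind_field = kind_field
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, kind: str, handler: Handler) -> None:
        """Register a handler for one envelope kind, e.g. 'command_result'."""
        self._handlers[kind].append(handler)

    def dispatch(self, raw: str) -> bool:
        """Decode one text frame and call matching handlers.

        Returns True if at least one handler ran, False otherwise.
        """
        frame = json.loads(raw)
        if not isinstance(frame, dict):
            return False
        handlers = self._handlers.get(frame.get(self._kind_field), [])
        for handler in handlers:
            handler(frame)
        return bool(handlers)
```

The same router can serve both endpoints: the chat stream and the execution stream differ in direction and content, not in the need to branch on envelope kind.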

Decoupled delivery

Turn execution runs in a background task with its own agent context, independent of the WebSocket that submitted the message. All events flow through a shared ExecutionEventEmitter, which fans out to every subscriber registered for the same (workspace_id, user_id, session_id) tuple. Two consequences for clients:
  • Reconnect-safe turns. If your WebSocket drops mid-turn, the turn keeps running. Re-subscribe with the same identity tuple to resume receiving frames.
  • Multiple viewers per session. A second client (for example, a workspace canvas alongside the chat panel) can attach to /api/v1/ws/execution/events and observe the same stream without affecting the primary chat connection.
Frame shapes are unchanged — existing clients require no updates.
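
The fan-out behavior can be pictured with a small in-memory sketch. This is not the ExecutionEventEmitter implementation. The `FanOutEmitter` class below is a simplified stand-in built on asyncio queues, and it makes no attempt to replay frames that a subscriber missed while disconnected.

```python
import asyncio
from collections import defaultdict
from typing import Any, DefaultDict, Dict, Set, Tuple

# (workspace_id, user_id, session_id)
SessionKey = Tuple[str, str, str]


class FanOutEmitter:
    """Deliver every emitted frame to all subscribers of the same identity tuple."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[SessionKey, Set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, key: SessionKey) -> asyncio.Queue:
        """Attach a new viewer; frames emitted from now on land in its queue."""
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[key].add(queue)
        return queue

    def unsubscribe(self, key: SessionKey, queue: asyncio.Queue) -> None:
        """Detach a viewer, for example when its WebSocket drops."""
        self._subscribers[key].discard(queue)

    def emit(self, key: SessionKey, frame: Dict[str, Any]) -> int:
        """Push one frame to every current subscriber; return how many received it."""
        subscribers = self._subscribers.get(key, set())
        for queue in subscribers:
            queue.put_nowait(frame)
        return len(subscribers)
```

Because the turn emits into the shared emitter rather than into a specific socket, dropping one subscriber does not stop the producer, and a client that subscribes again with the same tuple starts receiving subsequent frames.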

Reverse-proxy requirements

Behind a reverse proxy:
  • Upgrade HTTP/1.1 connections.
  • Disable response buffering.
  • Forward the auth bearer token through.
Misconfigured proxies are the most common reason for “connected but no events” in production.

Runtime status and diagnostics

The runtime exposes structured health and connectivity probes.

Health

| Endpoint | Auth | Purpose |
| --- | --- | --- |
| GET /health | None | Liveness: {"ok": true, "version": "..."} |
| GET /ready | None | Readiness: composite component status |
Sample readiness response:
{
  "ready": true,
  "planner_configured": true,
  "planner": "ready",
  "database": "ready",
  "database_required": true,
  "sandbox_provider": "daytona"
}
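
A deployment script might poll this endpoint until the ready flag turns true. The sketch below is one way to do that with the standard library. The helper names, retry counts, and the choice to treat connection errors as "not ready yet" are assumptions for illustration.

```python
import json
import time
from typing import Any, Callable, Dict
from urllib.request import urlopen


def fetch_ready(base_url: str, timeout: float = 5.0) -> Dict[str, Any]:
    """GET /ready (no auth required) and return the decoded JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/ready", timeout=timeout) as response:
        return json.load(response)


def wait_until_ready(
    fetch: Callable[[], Dict[str, Any]],
    attempts: int = 10,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until it reports ready=true, or raise TimeoutError."""
    last: Dict[str, Any] = {}
    for attempt in range(attempts):
        try:
            last = fetch()
        except OSError:
            # Connection refused, HTTP errors, and timeouts all mean "not ready yet".
            last = {}
        if last.get("ready") is True:
            return last
        if attempt < attempts - 1:
            sleep(delay_s)
    raise TimeoutError(f"not ready after {attempts} attempts; last response: {last}")
```

Usage would look like `wait_until_ready(lambda: fetch_ready("http://localhost:8000"))`, where the base URL is an assumed local address.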

Runtime status

GET /api/v1/runtime/status returns a composite snapshot:
{
  "app_env": "local",
  "write_enabled": true,
  "ready": true,
  "sandbox_provider": "daytona",
  "active_models": {
    "planner": "openai/gpt-4o",
    "delegate": "openai/gpt-4o-mini",
    "delegate_small": ""
  },
  "llm": { "model_set": true, "api_key_set": true, "planner_configured": true },
  "mlflow": { "enabled": true, "startup_status": "ready", "startup_error": null },
  "daytona": { "api_key_set": true, "api_url_set": true, "target_set": true },
  "tests": {
    "lm": { "ok": true, "latency_ms": 850 },
    "daytona": { "ok": true, "latency_ms": 640 }
  },
  "guidance": []
}
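
To turn this snapshot into a checklist, a client can scan the configuration flags and cached probe results. The sketch below assumes every entry in the llm and daytona sections is a boolean flag, as in the sample above. The function name `runtime_problems` is hypothetical.

```python
from typing import Any, List, Mapping


def runtime_problems(status: Mapping[str, Any]) -> List[str]:
    """List configuration flags that are unset and connectivity probes that failed."""
    problems: List[str] = []
    # Assumption: each key in these sections is a true/false configuration flag.
    for section in ("llm", "daytona"):
        for flag, value in (status.get(section) or {}).items():
            if value is not True:
                problems.append(f"{section}.{flag} is not set")
    # Cached results of the LM and Daytona connectivity probes.
    for probe, result in (status.get("tests") or {}).items():
        if not (result or {}).get("ok"):
            problems.append(f"tests.{probe} probe failed or has not run")
    return problems
```

An empty list means the flags and probes in the snapshot look healthy; it says nothing about MLflow, which is reported separately under mlflow.startup_status.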

Connectivity probes

| Endpoint | Tests |
| --- | --- |
| POST /api/v1/runtime/tests/lm | Planner LM round-trip |
| POST /api/v1/runtime/tests/daytona | Daytona credentials and lifecycle |
Both return the latency and a preflight envelope, and they cache results into the runtime status payload.

Daytona smoke

For sandbox-level validation without invoking an LM:
uv run fleet-rlm daytona-smoke \
  --repo https://github.com/Qredence/fleet-rlm.git \
  --ref main
This exercises credentials, network, sandbox lifecycle, repo clone, volume mount, and basic execution end-to-end.

PostHog (optional)

| Variable | Default | Description |
| --- | --- | --- |
| POSTHOG_ENABLED | false | Enable PostHog analytics |
| POSTHOG_HOST | https://eu.i.posthog.com | PostHog host |
When enabled, the runtime emits product-analytics events through integrations/observability/posthog_callback.py. This is not a tracing replacement — it is for product usage signals.

Reading order

When you need to debug a misbehaving turn:
  1. Open the MLflow trace for the turn — it has the full DSPy program execution.
  2. If sandbox steps look wrong, check the execution-stream WebSocket frames for the same turn.
  3. If startup looks wrong, hit /ready and /api/v1/runtime/status.
  4. If Daytona is suspect, run uv run fleet-rlm daytona-smoke.
