This page is operational reference material. New users should start with the Introduction and Quickstart.

Incident response

IR-1: Bedrock API outage (circuit breaker open)

This runbook applies when Amazon Bedrock is the configured provider. Fleet Pi defaults to Google Gemini; if you’re seeing chat failures with the default, treat them as generic provider errors and check your GEMINI_API_KEY and provider status first.

Trigger: Users report chat returning “Bedrock API is temporarily unavailable” or all /api/chat requests fail with 500 errors.
  1. Verify the circuit breaker state
    • Check application logs for bedrock-api circuit breaker events
    • Look for open state transitions in logs with requestId correlation
    • Run curl -sf http://localhost:3000/api/health to confirm the web server is still healthy
  2. Check Bedrock service status
    • Verify AWS credentials are valid: aws sts get-caller-identity
    • Check Bedrock model access in the AWS Console for the configured region (default us-east-1)
    • Review AWS Service Health Dashboard for regional outages
  3. Inspect recent error patterns
    • Search logs for the last 30 minutes: grep "bedrock-api" or grep "circuit breaker"
    • Identify if errors are throttling (429), auth (403), or model-level (400)
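    • Note the errorThresholdPercentage (50%) and volumeThreshold (5): the breaker can open only after at least 5 calls have been recorded, and it does so once failures exceed 50% of them (for example, 3 failures within 5 calls)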
  4. Wait for automatic recovery or force reset
    • The circuit breaker resetTimeout is 30 seconds; it will attempt a half-open call after that period
    • If Bedrock is confirmed restored but the breaker is still open, restart the dev server to reset the breaker state
  5. Communicate
    • Post in the incident channel: “Bedrock circuit breaker open — root cause under investigation”
    • If AWS is at fault, set status page to “degraded” and estimate recovery based on AWS status updates
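
The threshold arithmetic from step 3 can be checked by hand. The sketch below is not the opossum implementation and is not taken from the Fleet Pi codebase; wouldTrip and BreakerThresholds are hypothetical names. It assumes the breaker trips only when the failure rate is strictly greater than errorThresholdPercentage and at least volumeThreshold calls have been recorded in the current window.

```typescript
// Hypothetical helper for reasoning about the thresholds; not an opossum API.
interface BreakerThresholds {
  errorThresholdPercentage: number; // 50 in this runbook
  volumeThreshold: number; // 5 in this runbook
}

// Returns true when a window of `fires` calls containing `failures` failures
// would trip the breaker. Assumes a strict "greater than" comparison.
function wouldTrip(
  fires: number,
  failures: number,
  thresholds: BreakerThresholds,
): boolean {
  if (fires < thresholds.volumeThreshold) return false; // too few calls to judge
  const errorRate = (failures / fires) * 100;
  return errorRate > thresholds.errorThresholdPercentage;
}

const runbookThresholds: BreakerThresholds = {
  errorThresholdPercentage: 50,
  volumeThreshold: 5,
};

console.log(wouldTrip(5, 3, runbookThresholds)); // true: 60% of 5 calls failed
console.log(wouldTrip(4, 4, runbookThresholds)); // false: below volumeThreshold
console.log(wouldTrip(6, 3, runbookThresholds)); // false: exactly 50% does not exceed the threshold
```

If the logs show the breaker opening with fewer failures than this predicts, the configured values probably differ from the ones documented here.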

IR-2: Chat session corruption or data loss

Trigger: Users refresh the page and see an empty transcript, or the chat UI shows “Session reset” repeatedly.
  1. Identify the affected session
    • Extract sessionId from browser localStorage or from the start event in recent /api/chat request logs
    • Locate the Pi session file path under .fleet/sessions/ inside the repo root
  2. Check session file validity
    • Verify the session JSONL file exists and is readable
    • Ensure the file is inside the repo-scoped session directory (outside files are rejected by isUsableSessionFile)
    • Look for truncated or malformed JSONL lines at the end of the file
  3. Validate localStorage metadata
    • If localStorage contains an invalid sessionFile (e.g. pointing to /etc/hosts or a non-existent path), the app silently starts a fresh repo-scoped session — this is expected behavior
    • Instruct the user to clear localStorage for the site if the stored metadata is corrupt
  4. Attempt manual hydration
    • Call POST /api/chat/session with the sessionId to trigger hydrateChatSession
    • If the session file cannot be opened, the server returns an empty message list with sessionReset: true
  5. Recover or recreate
    • If the file is corrupt beyond repair, archive it and let the user start a new session
    • If the issue is widespread, check disk space and file system permissions on .fleet/sessions/
  6. Follow up
    • Document the root cause (disk full, permission issue, or Pi SDK bug)
    • Monitor SessionManager.open error rates for 24 hours
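
For step 2, locating malformed lines can be scripted once the file contents are loaded. This is a minimal sketch, assuming each non-blank line of the session file is one JSON value; findMalformedJsonlLines is a hypothetical helper, and the sample entries are illustrative rather than the real Pi entry shape.

```typescript
// Hypothetical helper: returns the 1-based numbers of lines that do not parse as JSON.
function findMalformedJsonlLines(text: string): number[] {
  const malformed: number[] = [];
  text.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines, such as the one after a final newline
    try {
      JSON.parse(line);
    } catch {
      malformed.push(index + 1);
    }
  });
  return malformed;
}

// Illustrative content: two complete entries followed by a truncated one.
const truncatedSample =
  '{"type":"message","id":1}\n{"type":"message","id":2}\n{"type":"mess';

console.log(findMalformedJsonlLines(truncatedSample)); // reports line 3 only
```

A result that contains only the last line number suggests a truncated write; malformed lines in the middle of the file point to a different kind of corruption.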

Troubleshooting

Bedrock errors

Symptoms: Chat streams terminate with error events, model picker shows unavailable models, or diagnostics contain model registry errors.
  • ThrottlingException (429) — Bedrock is rate-limiting requests.
    • Check the requestId in logs to confirm it is the same across retries
    • The Pi SDK auto-retries with exponential backoff; do not manually retry
    • If sustained, enable request batching or switch to a lower-traffic model variant
  • AccessDeniedException (403) — IAM role or profile lacks bedrock:* permissions.
    • Verify AWS_PROFILE and AWS_BEARER_TOKEN_BEDROCK environment variables
    • Ensure the IAM policy includes bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream
  • ValidationException (400) — The requested model ID is invalid.
    • Check modelSelection in the request body against the registry
    • Model IDs use region prefixes (e.g. us.anthropic.claude-sonnet-4-6); the backend normalizes candidates but a completely unknown ID will fail
  • ModelNotReadyException — The model is not enabled in the AWS account.
    • Visit the Bedrock Console > Model access and enable the model for the current region
  • Network / timeout errors — The circuit breaker timeout is 30 seconds.
    • If Bedrock does not respond within 30 seconds, the breaker counts it as a failure
    • Check VPC endpoints or corporate proxy settings if running in a restricted network
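
The list above can be condensed into a lookup for on-call tooling. The table below is a sketch built only from this section; triageBedrockError and TriageHint are hypothetical names, not part of the AWS SDK or the Pi SDK.

```typescript
type BedrockErrorKind =
  | "throttling"
  | "auth"
  | "validation"
  | "modelNotReady"
  | "unknown";

interface TriageHint {
  kind: BedrockErrorKind;
  action: string;
}

// Hypothetical triage table derived from the troubleshooting list.
const TRIAGE_HINTS = new Map<string, TriageHint>([
  ["ThrottlingException", { kind: "throttling", action: "Let the SDK back off; do not retry manually." }],
  ["AccessDeniedException", { kind: "auth", action: "Check AWS_PROFILE, AWS_BEARER_TOKEN_BEDROCK, and the IAM policy." }],
  ["ValidationException", { kind: "validation", action: "Compare modelSelection against the registry." }],
  ["ModelNotReadyException", { kind: "modelNotReady", action: "Enable the model under Bedrock Console > Model access." }],
]);

// Maps an exception name from the logs to a first remediation step.
function triageBedrockError(errorName: string): TriageHint {
  return (
    TRIAGE_HINTS.get(errorName) ?? {
      kind: "unknown",
      action: "Check network, proxy, and breaker timeout settings.",
    }
  );
}

console.log(triageBedrockError("ThrottlingException").action);
console.log(triageBedrockError("SomeOtherError").kind); // unknown
```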

Session hydration failures

Symptoms: After refreshing the browser, prior messages are gone; the UI shows a blank chat; sessionReset: true appears in /api/chat responses.
  • Invalid sessionFile in localStorage — The browser stores only Pi session metadata (sessionFile and sessionId). If sessionFile points outside the repo session directory, isUsableSessionFile returns false and a fresh repo-scoped session is created silently.
    • Remediation: Clear site localStorage and start a new chat
  • Missing or moved session file — The session JSONL was deleted or moved after the metadata was stored.
    • Remediation: Check .fleet/sessions/ for the file; if missing, the session is unrecoverable
  • Corrupt session JSONL — A malformed line causes SessionManager.open to throw.
    • Remediation: Inspect the file with head -n 20 and tail -n 5; remove trailing partial lines if safe, otherwise archive and start fresh
  • Race condition during streaming — If a page refresh happens while the session is being compacted, the file may be in an inconsistent state.
    • Remediation: Wait 5 seconds and retry hydration; the compaction lock should release
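
The "remove trailing partial lines if safe" remediation can be made conservative with a small check. The sketch below assumes that trimming is safe only when every line except the last parses as JSON; dropTrailingPartialLine is a hypothetical helper, and it should be run against a copy of the file contents, never the original.

```typescript
function isJsonLine(line: string): boolean {
  try {
    JSON.parse(line);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical repair helper. Returns the repaired text, the original text when
// nothing needs repair, or null when the damage is not limited to the last line.
// Blank lines are ignored and are not preserved in a repaired result.
function dropTrailingPartialLine(text: string): string | null {
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  const last = lines.pop();
  if (last === undefined) return null; // empty file: nothing to keep
  if (!lines.every(isJsonLine)) return null; // corruption before the tail
  if (isJsonLine(last)) return text; // already valid, return unchanged
  return lines.map((line) => line + "\n").join("");
}

const repaired = dropTrailingPartialLine('{"a":1}\n{"b":2}\n{"c":');
console.log(repaired === '{"a":1}\n{"b":2}\n'); // true
```

A null result is the signal to archive the file and start fresh instead of editing it.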

Circuit breaker states

The Bedrock API call is wrapped by opossum with the following configuration:

| Option | Value | Meaning |
| --- | --- | --- |
| errorThresholdPercentage | 50% | Open after half of sampled calls fail |
| resetTimeout | 30,000 ms | Wait 30 s before trying half-open |
| volumeThreshold | 5 | Minimum 5 calls before breaker can open |
| timeout | 30,000 ms | Each call must complete within 30 s |

  • Closed (normal) — Requests flow to Bedrock. Failures are counted.
  • Open — All calls are rejected immediately with the fallback error: "Bedrock API is temporarily unavailable due to repeated failures. Please try again later."
  • Half-open — The next call is allowed through as a probe. If the probe succeeds, the breaker closes. If it fails, the breaker opens again for another resetTimeout.
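
The three states above can be modeled as a small state machine. This is an illustrative model only, with an injected clock so the transitions are easy to follow; BreakerModel is a hypothetical class, and opossum's real implementation differs in detail.

```typescript
type BreakerState = "closed" | "open" | "halfOpen";

// Illustrative model of the transitions described above; not opossum itself.
class BreakerModel {
  private state: BreakerState = "closed";
  private openedAt = 0;
  private readonly resetTimeoutMs: number;

  constructor(resetTimeoutMs = 30_000) {
    this.resetTimeoutMs = resetTimeoutMs;
  }

  getState(): BreakerState {
    return this.state;
  }

  // Called when the failure thresholds have been exceeded.
  trip(nowMs: number): void {
    this.state = "open";
    this.openedAt = nowMs;
  }

  // Decides whether a call may go through at time `nowMs`.
  allowRequest(nowMs: number): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open" && nowMs - this.openedAt >= this.resetTimeoutMs) {
      this.state = "halfOpen";
      return true; // this call is the probe
    }
    return false; // still open, or a probe is already in flight
  }

  // Records the outcome of the probe made while half-open.
  recordProbe(succeeded: boolean, nowMs: number): void {
    if (this.state !== "halfOpen") return;
    if (succeeded) this.state = "closed";
    else this.trip(nowMs); // reopen for another resetTimeout
  }
}

const breaker = new BreakerModel();
breaker.trip(0);
console.log(breaker.allowRequest(10_000)); // false: still open
console.log(breaker.allowRequest(30_000)); // true: this call is the probe
breaker.recordProbe(true, 30_200);
console.log(breaker.getState()); // closed
```

Passing the time in as an argument keeps the model deterministic, which is convenient when replaying timestamps from an incident log.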

Chat session mirror (Neon Postgres)

Pi session JSONL files under .fleet/sessions/ remain authoritative. When FLEET_PI_CHAT_DATABASE_URL is set, Fleet Pi mirrors full session entries and run provenance into Neon Postgres so you can query conversations with SQL, power cross-surface history, run analytics, and debug long-running runs. Mirror writes happen on session create, hydrate, and list paths. Failures are caught and logged with the matching requestId; they never interrupt chat streaming.

When to enable it
  • You need SQL search or analytics across Pi sessions.
  • You run Fleet Pi across multiple surfaces and want a single source for chat history.
  • You want durable provenance for tool executions and file mutations beyond what local SQLite captures.

Roles
Provision two Neon roles and keep them separate:

| Role | Privileges | Used by |
| --- | --- | --- |
| neondb_owner | Full DDL + DML (CREATE, ALTER, DROP, etc.) | Migration CLI only |
| fleet_pi_app | SELECT, INSERT, UPDATE, DELETE on pi_* tables | Running application |

Configure
Set both connection strings in .env:
# Runtime mirror (pooled app-role connection)
FLEET_PI_CHAT_DATABASE_URL=postgres://fleet_pi_app:...@ep-xxxx-pooler.neon.tech/neondb?sslmode=require

# Migration-only (direct owner connection)
FLEET_PI_CHAT_MIGRATION_DATABASE_URL=postgres://neondb_owner:...@ep-xxxx.neon.tech/neondb?sslmode=require
Leave FLEET_PI_CHAT_DATABASE_URL unset to keep Pi conversations in JSONL and local SQLite only.

Run migrations
pnpm --filter web chat:migrate
Re-run after pulling changes that update the schema. The script is idempotent and records applied migrations in fleet_pi_chat_migrations.

Tables

| Table | Contents |
| --- | --- |
| public.pi_sessions | Pi session headers and current session metadata |
| public.pi_session_entries | Full raw Pi entries plus normalized search fields |
| public.pi_runs | Assistant turn/run summaries |
| public.pi_run_events | Ordered streamed chat events |
| public.pi_tool_executions | Tool call inputs, outputs, and claimed paths |
| public.pi_file_mutations | File mutation summaries attributed to runs and tools |

Triage
  • Mirror disabled unexpectedly: confirm FLEET_PI_CHAT_DATABASE_URL is loaded in the running process (check /api/health host env, not just the .env file).
  • Rows missing for a recent session: grep logs for the session’s requestId and look for mirror sync warnings; the JSONL file is still authoritative and you can re-trigger sync by hydrating the session.
  • Migration fails with permission errors: verify FLEET_PI_CHAT_MIGRATION_DATABASE_URL uses neondb_owner, not the app role.
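
The first and third triage items amount to checking which role each connection string uses. A preflight check along these lines could catch swapped URLs before the app or the migration runs. This is a sketch under stated assumptions: checkMirrorEnv and roleOf are hypothetical helpers, the role names come from the Roles table, and the connection strings are parsed with the standard URL class.

```typescript
interface MirrorEnv {
  FLEET_PI_CHAT_DATABASE_URL?: string;
  FLEET_PI_CHAT_MIGRATION_DATABASE_URL?: string;
}

// Extracts the role (username) from a connection string, or null if unparseable.
function roleOf(connectionString: string): string | null {
  try {
    return new URL(connectionString).username || null;
  } catch {
    return null; // not a parseable URL
  }
}

// Hypothetical preflight check; returns a list of problems, empty when none are found.
function checkMirrorEnv(env: MirrorEnv): string[] {
  const problems: string[] = [];
  const runtimeUrl = env.FLEET_PI_CHAT_DATABASE_URL;
  const migrationUrl = env.FLEET_PI_CHAT_MIGRATION_DATABASE_URL;

  if (!runtimeUrl) {
    problems.push("FLEET_PI_CHAT_DATABASE_URL is unset, so the mirror is disabled");
  } else if (roleOf(runtimeUrl) !== "fleet_pi_app") {
    problems.push("FLEET_PI_CHAT_DATABASE_URL should connect as fleet_pi_app");
  }
  if (migrationUrl && roleOf(migrationUrl) !== "neondb_owner") {
    problems.push("FLEET_PI_CHAT_MIGRATION_DATABASE_URL should connect as neondb_owner");
  }
  return problems;
}

// PASSWORD and ep-xxxx are placeholders, as in the .env example.
console.log(
  checkMirrorEnv({
    FLEET_PI_CHAT_DATABASE_URL:
      "postgres://neondb_owner:PASSWORD@ep-xxxx.neon.tech/neondb?sslmode=require",
  }),
); // one problem: the runtime URL uses the owner role
```

The check inspects only the values passed to it, so feed it the environment of the running process, not the contents of the .env file.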

Quick reference

| Command | Purpose |
| --- | --- |
| curl -sf http://localhost:3000/api/health | Verify web server health |
| aws sts get-caller-identity | Verify AWS credentials |
| pnpm --filter web test | Run unit tests (including circuit breaker tests) |
| pnpm lint | Check code quality |
| pnpm knip | Detect unused code |
| pnpm --filter web chat:migrate | Apply Neon chat mirror schema migrations |

Runtime cache pressure

Symptoms: Memory grows over long sessions; Bedrock calls feel slower than expected after long idle periods.
  • Pi runtimes are cached per session with a TTL controlled by FLEET_PI_RUNTIME_TTL_MS (defaults to 10 minutes).
  • Lower the TTL if you want runtimes evicted sooner; raise it to keep them warmer between turns.
  • Restart pnpm dev to forcibly drop all warm runtimes if you suspect leaked Pi state.
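
To reason about when a cached runtime is evicted, the TTL logic can be sketched as follows. How Fleet Pi actually parses FLEET_PI_RUNTIME_TTL_MS is not documented here, so resolveRuntimeTtlMs and isStale are hypothetical helpers, and the idle-based expiry (measured from last use) is an assumption.

```typescript
const DEFAULT_RUNTIME_TTL_MS = 10 * 60 * 1000; // 10 minutes, the documented default

// Hypothetical parser: falls back to the default for missing or invalid values.
function resolveRuntimeTtlMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_RUNTIME_TTL_MS;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_RUNTIME_TTL_MS;
  return parsed;
}

// Assumption: a runtime is stale once it has been idle for at least the TTL.
function isStale(lastUsedMs: number, nowMs: number, ttlMs: number): boolean {
  return nowMs - lastUsedMs >= ttlMs;
}

const ttlMs = resolveRuntimeTtlMs("120000"); // a 2-minute TTL
console.log(isStale(0, 60_000, ttlMs)); // false: idle for 1 minute
console.log(isStale(0, 120_000, ttlMs)); // true: idle for the full TTL
```

Under these assumptions, a lower TTL trades memory for colder starts on the next turn, which matches the symptoms listed above.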

Workspace contract drift

Symptoms: GET /api/workspace/health returns missing canonical paths or an unexpected manifest version.
  • The contract version is pinned at WORKSPACE_CONTRACT_VERSION = 1 in workspace-contract.ts.
  • Missing canonical directories should be re-created by workspace bootstrap on the next run; user-authored content is never overwritten.
  • Use POST /api/workspace/reindex to rebuild the projection index without modifying canonical files.

Related source files

  • apps/web/src/lib/pi/circuit-breaker.ts — Breaker configuration and factory.
  • apps/web/src/lib/pi/server.ts — Bedrock invocation and Pi runtime cache.
  • apps/web/src/lib/pi/server-runtime.ts — Runtime TTL and PI_AGENT_DIR resolution.
  • apps/web/src/lib/pi/plan-mode.ts — Mode allowlists and plan-mode extension.
  • apps/web/src/lib/pi/run-provenance.ts — Provenance recording around session and tool events, including tool executions and file mutations mirrored to Neon.
  • apps/web/src/lib/db/pi-session-mirror.ts — Safe sync helpers that mirror Pi sessions into Neon Postgres when FLEET_PI_CHAT_DATABASE_URL is set.
  • apps/web/src/lib/pii/sanitizer.ts — Input redaction before logging.
  • apps/web/src/lib/logger.ts — Pino logger with redaction and requestId correlation.
  • apps/web/src/lib/workspace/workspace-contract.ts — Workspace contract version and section kinds.