Incident response
IR-1: Bedrock API outage (circuit breaker open)
This runbook applies when Amazon Bedrock is the configured provider. Fleet Pi defaults to Google Gemini; if you’re seeing chat failures with the default, treat them as generic provider errors and check your `GEMINI_API_KEY` and provider status first.
Trigger: Users report chat returning “Bedrock API is temporarily unavailable” or all /api/chat requests fail with 500 errors.
- Verify the circuit breaker state
  - Check application logs for `bedrock-api` circuit breaker events
  - Look for `open` state transitions in logs with `requestId` correlation
  - Run `curl -sf http://localhost:3000/api/health` to confirm the web server is still healthy
- Check Bedrock service status
  - Verify AWS credentials are valid: `aws sts get-caller-identity`
  - Check Bedrock model access in the AWS Console for the configured region (default `us-east-1`)
  - Review AWS Service Health Dashboard for regional outages
- Inspect recent error patterns
  - Search logs for the last 30 minutes: `grep "bedrock-api"` or `grep "circuit breaker"`
  - Identify if errors are throttling (429), auth (403), or model-level (400)
  - Note the `errorThresholdPercentage` (50%) and `volumeThreshold` (5): with these values the breaker opens after 3 failures within 5 calls
- Wait for automatic recovery or force reset
  - The circuit breaker `resetTimeout` is 30 seconds; it will attempt a half-open call after that period
  - If Bedrock is confirmed restored but the breaker is still open, restart the dev server to reset the breaker state
- Communicate
  - Post in the incident channel: “Bedrock circuit breaker open, root cause under investigation”
  - If AWS is at fault, set status page to “degraded” and estimate recovery based on AWS status updates
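The "inspect recent error patterns" step can be scripted once logs are exported. The sketch below is a minimal illustration, assuming one Pino-style JSON object per log line with a numeric `time` field (epoch milliseconds) and a hypothetical `statusCode` field; the function names are invented for this example and the field names should be adjusted to whatever the logger actually emits.

```typescript
// Hypothetical log shape: one Pino-style JSON object per line.
// `statusCode` is an assumed field name, not confirmed by this runbook.
interface LogEntry {
  time: number; // epoch milliseconds
  msg: string;
  statusCode?: number;
}

type ErrorKind = "throttling" | "auth" | "model" | "other";

// Map the HTTP status codes named in the runbook to an error category.
function classify(statusCode: number | undefined): ErrorKind {
  if (statusCode === 429) return "throttling";
  if (statusCode === 403) return "auth";
  if (statusCode === 400) return "model";
  return "other";
}

// Count breaker-related log entries by category within a window ending at `now`.
function summarizeErrors(
  lines: string[],
  now: number,
  windowMs: number = 30 * 60 * 1000,
): Record<ErrorKind, number> {
  const counts: Record<ErrorKind, number> = { throttling: 0, auth: 0, model: 0, other: 0 };
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as Partial<LogEntry>;
    // Skip entries without a timestamp or older than the window.
    if (typeof entry.time !== "number" || now - entry.time > windowMs) continue;
    // Same match terms as the grep commands in the runbook.
    if (typeof entry.msg !== "string" || !/bedrock-api|circuit breaker/.test(entry.msg)) continue;
    counts[classify(entry.statusCode)] += 1;
  }
  return counts;
}
```

A dominant `throttling` count points at rate limits, while `auth` or `model` counts point at credentials or model access rather than an AWS outage.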
IR-2: Chat session corruption or data loss
Trigger: Users refresh the page and see an empty transcript, or the chat UI shows “Session reset” repeatedly.
- Identify the affected session
  - Extract `sessionId` from browser `localStorage` or from the `start` event in recent `/api/chat` request logs
  - Locate the Pi session file path under `.fleet/sessions/` inside the repo root
- Check session file validity
  - Verify the session JSONL file exists and is readable
  - Ensure the file is inside the repo-scoped session directory (outside files are rejected by `isUsableSessionFile`)
  - Look for truncated or malformed JSONL lines at the end of the file
- Validate localStorage metadata
  - If `localStorage` contains an invalid `sessionFile` (e.g. pointing to `/etc/hosts` or a non-existent path), the app silently starts a fresh repo-scoped session; this is expected behavior
  - Instruct the user to clear `localStorage` for the site if the stored metadata is corrupt
- Attempt manual hydration
  - Call `POST /api/chat/session` with the `sessionId` to trigger `hydrateChatSession`
  - If the session file cannot be opened, the server returns an empty message list with `sessionReset: true`
- Recover or recreate
  - If the file is corrupt beyond repair, archive it and let the user start a new session
  - If the issue is widespread, check disk space and file system permissions on `.fleet/sessions/`
- Follow up
  - Document the root cause (disk full, permission issue, or Pi SDK bug)
  - Monitor `SessionManager.open` error rates for 24 hours
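For the "check session file validity" step, a quick scan can tell a truncated tail apart from damage in the middle of the file. The following is a minimal sketch that operates on the file's text content; `scanJsonl` and its result shape are hypothetical names for this example and are not part of Fleet Pi or the Pi SDK.

```typescript
// Result of scanning the text content of a JSONL file.
interface JsonlScan {
  validLines: number;
  firstBadLine: number | null; // 1-based line number, or null if every line parses
  onlyTrailingDamage: boolean; // true when the single bad line is the last non-empty line
}

function scanJsonl(content: string): JsonlScan {
  const lines = content.split("\n");
  // Drop trailing blank lines left behind by the final newline.
  while (lines.length > 0 && (lines[lines.length - 1] ?? "").trim() === "") {
    lines.pop();
  }

  let validLines = 0;
  const badLines: number[] = [];
  lines.forEach((line, index) => {
    try {
      JSON.parse(line);
      validLines += 1;
    } catch {
      badLines.push(index + 1);
    }
  });

  return {
    validLines,
    firstBadLine: badLines[0] ?? null,
    onlyTrailingDamage: badLines.length === 1 && badLines[0] === lines.length,
  };
}
```

When `onlyTrailingDamage` is true, removing the last line is the likely repair; any other failure suggests archiving the file instead.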
Troubleshooting
Bedrock errors
Symptoms: Chat streams terminate with `error` events, model picker shows unavailable models, or diagnostics contain model registry errors.
- ThrottlingException (429): Bedrock is rate-limiting requests.
  - Check the `requestId` in logs to confirm it is the same across retries
  - The Pi SDK auto-retries with exponential backoff; do not manually retry
  - If sustained, enable request batching or switch to a lower-traffic model variant
- AccessDeniedException (403): IAM role or profile lacks `bedrock:*` permissions.
  - Verify `AWS_PROFILE` and `AWS_BEARER_TOKEN_BEDROCK` environment variables
  - Ensure the IAM policy includes `bedrock:InvokeModel` and `bedrock:InvokeModelWithResponseStream`
- ValidationException (400): The requested model ID is invalid.
  - Check `modelSelection` in the request body against the registry
  - Model IDs use region prefixes (e.g. `us.anthropic.claude-sonnet-4-6`); the backend normalizes candidates but a completely unknown ID will fail
- ModelNotReadyException: The model is not enabled in the AWS account.
  - Visit the Bedrock Console > Model access and enable the model for the current region
- Network / timeout errors: The circuit breaker `timeout` is 30 seconds.
  - If Bedrock does not respond within 30 seconds, the breaker counts it as a failure
  - Check VPC endpoints or corporate proxy settings if running in a restricted network
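To make the "do not manually retry" advice for throttling concrete, the sketch below shows how an exponential backoff schedule grows. It is only an illustration: the base delay, cap, and function names are assumptions made for this example and do not describe the Pi SDK's actual retry settings.

```typescript
// Delay before retry number `attempt` (0-based): doubles each time, up to a cap.
// baseMs and maxMs are illustrative values, not the Pi SDK's configuration.
function backoffDelayMs(attempt: number, baseMs: number = 500, maxMs: number = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Full schedule of delays for a given number of retries.
function backoffSchedule(retries: number, baseMs: number = 500, maxMs: number = 8000): number[] {
  return Array.from({ length: retries }, (_, attempt) => backoffDelayMs(attempt, baseMs, maxMs));
}
```

Because the delays lengthen on their own, a manual retry issued in between adds load exactly when Bedrock is asking clients to slow down.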
Session hydration failures
Symptoms: After refreshing the browser, prior messages are gone; the UI shows a blank chat; `sessionReset: true` appears in `/api/chat` responses.
- Invalid `sessionFile` in localStorage: The browser stores only Pi session metadata (`sessionFile` and `sessionId`). If `sessionFile` points outside the repo session directory, `isUsableSessionFile` returns `false` and a fresh repo-scoped session is created silently.
  - Remediation: Clear site `localStorage` and start a new chat
- Missing or moved session file: The session JSONL was deleted or moved after the metadata was stored.
  - Remediation: Check `.fleet/sessions/` for the file; if missing, the session is unrecoverable
- Corrupt session JSONL: A malformed line causes `SessionManager.open` to throw.
  - Remediation: Inspect the file with `head -n 20` and `tail -n 5`; remove trailing partial lines if safe, otherwise archive and start fresh
- Race condition during streaming: If a page refresh happens while the session is being compacted, the file may be in an inconsistent state.
  - Remediation: Wait 5 seconds and retry hydration; the compaction lock should release
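The "outside the repo session directory" rule can be pictured as a path containment check. The sketch below is a hypothetical stand-in written for this runbook, not the actual `isUsableSessionFile` implementation; the `.jsonl` extension requirement and the function name are assumptions, and symlinks are not resolved.

```typescript
import * as path from "node:path";

// Hypothetical containment check: accept only .jsonl paths that resolve to a
// location inside <repoRoot>/.fleet/sessions/. Symlinks are not followed.
function isInsideSessionDir(repoRoot: string, sessionFile: string): boolean {
  const sessionDir = path.resolve(repoRoot, ".fleet", "sessions");
  const resolved = path.resolve(repoRoot, sessionFile);
  const relative = path.relative(sessionDir, resolved);
  const escapesDir =
    relative === "" || // the directory itself, not a file in it
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // different drive on Windows
  return !escapesDir && resolved.endsWith(".jsonl");
}
```

Resolving first and comparing afterwards is what defeats inputs such as `.fleet/sessions/../../secrets.jsonl`, which look local until they are normalized.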
Circuit breaker states
The Bedrock API call is wrapped by `opossum` with the following configuration:
| Option | Value | Meaning |
|---|---|---|
| `errorThresholdPercentage` | 50% | Open after half of sampled calls fail |
| `resetTimeout` | 30,000 ms | Wait 30 s before trying half-open |
| `volumeThreshold` | 5 | Minimum 5 calls before breaker can open |
| `timeout` | 30,000 ms | Each call must complete within 30 s |
- Closed (normal): Requests flow to Bedrock. Failures are counted.
- Open: All calls are rejected immediately with the fallback error: `"Bedrock API is temporarily unavailable due to repeated failures. Please try again later."`
- Half-open: The next call is allowed through as a probe. If the probe succeeds, the breaker closes. If it fails, the breaker opens again for another `resetTimeout`.
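The state transitions above can be traced with a small model. This is a simplified sketch for reasoning about the configured values, not opossum's implementation: it counts all calls since the last close instead of using a rolling time window, it treats the threshold as "error rate strictly above 50%" (an assumption that matches the 3-of-5 example), and it does not limit half-open to a single in-flight probe. The class and method names are invented for this example.

```typescript
type BreakerState = "closed" | "open" | "halfOpen";

interface BreakerOptions {
  errorThresholdPercentage: number;
  resetTimeout: number; // milliseconds
  volumeThreshold: number;
}

class BreakerModel {
  state: BreakerState = "closed";
  private readonly options: BreakerOptions;
  private fires = 0;
  private failures = 0;
  private openedAt = 0;

  constructor(options: BreakerOptions) {
    this.options = options;
  }

  // Returns true if a call would be let through at time `now` (ms).
  allow(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.options.resetTimeout) {
      this.state = "halfOpen"; // the next call acts as the probe
    }
    return this.state !== "open";
  }

  // Record the outcome of a call that was let through.
  record(success: boolean, now: number): void {
    if (this.state === "halfOpen") {
      if (success) this.close();
      else this.open(now);
      return;
    }
    this.fires += 1;
    if (!success) this.failures += 1;
    const errorRate = (this.failures / this.fires) * 100;
    const enoughVolume = this.fires >= this.options.volumeThreshold;
    if (enoughVolume && errorRate > this.options.errorThresholdPercentage) {
      this.open(now);
    }
  }

  private open(now: number): void {
    this.state = "open";
    this.openedAt = now;
  }

  private close(): void {
    this.state = "closed";
    this.fires = 0;
    this.failures = 0;
  }
}
```

With the values from the table, two successes followed by three failures open the model's breaker on the fifth call, and the first call at least 30,000 ms later is treated as the probe.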
Chat session mirror (Neon Postgres)
Pi session JSONL files under `.fleet/sessions/` remain authoritative. When `FLEET_PI_CHAT_DATABASE_URL` is set, Fleet Pi mirrors full session entries and run provenance into Neon Postgres so you can query conversations with SQL, power cross-surface history, run analytics, and debug long-running runs.
Mirror writes happen on session create, hydrate, and list paths. Failures are caught and logged with the matching requestId — they never interrupt chat streaming.
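The "caught and logged, never interrupting" behavior can be pictured as a wrapper around each mirror write. The sketch below is an illustration under assumptions: `mirrorSafely`, the `WarnLogger` interface, and the log message are hypothetical names for this example, and the real helpers in `pi-session-mirror.ts` may be structured differently.

```typescript
// Minimal logger shape for this sketch; Pino loggers accept warn(object, message).
interface WarnLogger {
  warn(fields: Record<string, unknown>, message: string): void;
}

// Run a mirror write without letting its failure reach the chat request.
// Resolves to true when the write succeeded, false when the error was swallowed.
async function mirrorSafely(
  requestId: string,
  write: () => Promise<void>,
  logger: WarnLogger,
): Promise<boolean> {
  try {
    await write();
    return true;
  } catch (error) {
    // Log with the requestId so the failure can be correlated with the chat request.
    logger.warn({ requestId, err: error }, "pi session mirror sync failed");
    return false;
  }
}
```

Since the wrapper never rejects, the caller can await it or fire and forget it without risking the chat stream.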
When to enable it
- You need SQL search or analytics across Pi sessions.
- You run Fleet Pi across multiple surfaces and want a single source for chat history.
- You want durable provenance for tool executions and file mutations beyond what local SQLite captures.
| Role | Privileges | Used by |
|---|---|---|
| `neondb_owner` | Full DDL + DML (`CREATE`, `ALTER`, `DROP`, etc.) | Migration CLI only |
| `fleet_pi_app` | `SELECT`, `INSERT`, `UPDATE`, `DELETE` on `pi_*` tables | Running application |
.env:
Leave `FLEET_PI_CHAT_DATABASE_URL` unset to keep Pi conversations in JSONL and local SQLite only.
Run migrations
fleet_pi_chat_migrations.
Tables
| Table | Contents |
|---|---|
| `public.pi_sessions` | Pi session headers and current session metadata |
| `public.pi_session_entries` | Full raw Pi entries plus normalized search fields |
| `public.pi_runs` | Assistant turn/run summaries |
| `public.pi_run_events` | Ordered streamed chat events |
| `public.pi_tool_executions` | Tool call inputs, outputs, and claimed paths |
| `public.pi_file_mutations` | File mutation summaries attributed to runs and tools |
- Mirror disabled unexpectedly: confirm `FLEET_PI_CHAT_DATABASE_URL` is loaded in the running process (check `/api/health` host env, not just the `.env` file).
- Rows missing for a recent session: grep logs for the session’s `requestId` and look for mirror sync warnings; the JSONL file is still authoritative and you can re-trigger sync by hydrating the session.
- Migration fails with permission errors: verify `FLEET_PI_CHAT_MIGRATION_DATABASE_URL` uses `neondb_owner`, not the app role.
Quick reference
| Command | Purpose |
|---|---|
| `curl -sf http://localhost:3000/api/health` | Verify web server health |
| `aws sts get-caller-identity` | Verify AWS credentials |
| `pnpm --filter web test` | Run unit tests (including circuit breaker tests) |
| `pnpm lint` | Check code quality |
| `pnpm knip` | Detect unused code |
| `pnpm --filter web chat:migrate` | Apply Neon chat mirror schema migrations |
Runtime cache pressure
Symptoms: Memory grows over long sessions; Bedrock calls feel slower than expected after long idle periods.
- Pi runtimes are cached per session with a TTL controlled by `FLEET_PI_RUNTIME_TTL_MS` (defaults to 10 minutes).
- Lower the TTL if you want runtimes evicted sooner; raise it to keep them warmer between turns.
- Restart `pnpm dev` to forcibly drop all warm runtimes if you suspect leaked Pi state.
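The trade-off between memory and warm runtimes is easier to reason about with a small TTL cache in hand. The sketch below is an assumption-laden illustration, not the code in `server-runtime.ts`: it uses a sliding TTL (each use refreshes the entry), which may or may not match the real cache, and `resolveTtlMs` and `TtlCache` are hypothetical names.

```typescript
const DEFAULT_TTL_MS = 10 * 60 * 1000; // the documented 10-minute default

// Parse an env-style TTL string, falling back to the default when unset or invalid.
function resolveTtlMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_TTL_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TTL_MS;
}

// Per-key cache with a sliding TTL: an entry expires ttlMs after its last use.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; lastUsed: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  set(key: string, value: V, now: number): void {
    this.entries.set(key, { value, lastUsed: now });
  }

  get(key: string, now: number): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now - entry.lastUsed > this.ttlMs) {
      this.entries.delete(key); // expired: evict on access
      return undefined;
    }
    entry.lastUsed = now; // each use keeps the entry warm
    return entry.value;
  }

  // Evict every expired entry and return how many were removed.
  sweep(now: number): number {
    let evicted = 0;
    for (const [key, entry] of this.entries) {
      if (now - entry.lastUsed > this.ttlMs) {
        this.entries.delete(key);
        evicted += 1;
      }
    }
    return evicted;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

In this model a shorter TTL bounds memory at the cost of more cold starts after idle periods, which matches the symptoms listed above.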
Workspace contract drift
Symptoms: `GET /api/workspace/health` returns missing canonical paths or an unexpected manifest version.
- The contract version is pinned at `WORKSPACE_CONTRACT_VERSION = 1` in `workspace-contract.ts`.
- Missing canonical directories should be re-created by workspace bootstrap on the next run; user-authored content is never overwritten.
- Use `POST /api/workspace/reindex` to rebuild the projection index without modifying canonical files.
Related files
- `apps/web/src/lib/pi/circuit-breaker.ts`: Breaker configuration and factory.
- `apps/web/src/lib/pi/server.ts`: Bedrock invocation and Pi runtime cache.
- `apps/web/src/lib/pi/server-runtime.ts`: Runtime TTL and `PI_AGENT_DIR` resolution.
- `apps/web/src/lib/pi/plan-mode.ts`: Mode allowlists and plan-mode extension.
- `apps/web/src/lib/pi/run-provenance.ts`: Provenance recording around session and tool events, including tool executions and file mutations mirrored to Neon.
- `apps/web/src/lib/db/pi-session-mirror.ts`: Safe sync helpers that mirror Pi sessions into Neon Postgres when `FLEET_PI_CHAT_DATABASE_URL` is set.
- `apps/web/src/lib/pii/sanitizer.ts`: Input redaction before logging.
- `apps/web/src/lib/logger.ts`: Pino logger with redaction and `requestId` correlation.
- `apps/web/src/lib/workspace/workspace-contract.ts`: Workspace contract version and section kinds.