# Fleet Pi operational runbooks

> Incident response and troubleshooting runbooks for Fleet Pi — LLM provider circuit-breaker recovery, chat-stream triage, and Neon Postgres mirror operations.

This page is operational reference material. New users should start with the [Introduction](/fleet-pi/introduction) and [Quickstart](/fleet-pi/quickstart).

## Incident response

### IR-1: Bedrock API outage (circuit breaker open)

This runbook applies when Amazon Bedrock is the configured provider. Fleet Pi defaults to Google Gemini; if you're seeing chat failures with the default, treat them as generic provider errors and check your `GEMINI_API_KEY` and provider status first.

**Trigger:** Users report chat returning "Bedrock API is temporarily unavailable" or all `/api/chat` requests fail with 500 errors.

1. **Verify the circuit breaker state**
   * Check application logs for `bedrock-api` circuit breaker events
   * Look for `open` state transitions in logs with `requestId` correlation
   * Run `curl -sf http://localhost:3000/api/health` to confirm the web server is still healthy
2. **Check Bedrock service status**
   * Verify AWS credentials are valid: `aws sts get-caller-identity`
   * Check Bedrock model access in the AWS Console for the configured region (default `us-east-1`)
   * Review AWS Service Health Dashboard for regional outages
3. **Inspect recent error patterns**
   * Search the last 30 minutes of logs with `grep "bedrock-api"` or `grep "circuit breaker"`
   * Identify whether the errors are throttling (429), auth (403), or model-level (400)
   * Note the `errorThresholdPercentage` (50%) and `volumeThreshold` (5): once at least 5 calls have been recorded, the breaker opens when more than half of them have failed (for example, 3 failures in 5 calls)
4. **Wait for automatic recovery or force reset**
   * The circuit breaker `resetTimeout` is 30 seconds; it will attempt a half-open call after that period
   * If Bedrock is confirmed restored but the breaker is still open, restart the dev server to reset the breaker state
5. **Communicate**
   * Post in the incident channel: "Bedrock circuit breaker open — root cause under investigation"
   * If AWS is at fault, set status page to "degraded" and estimate recovery based on AWS status updates
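
Step 3 asks for errors from the last 30 minutes, which plain `grep` does not restrict by time. The following is a minimal sketch of a time-bounded filter, assuming the logs are newline-delimited JSON with Pino's default `time` (epoch milliseconds) and `msg` fields plus the `requestId` field mentioned above. The function name `recentBreakerEvents` is hypothetical, and the actual log shape in your deployment may differ.

```typescript
// Sketch only: assumes one JSON object per log line with Pino's default
// `time` (epoch ms) and `msg` fields, plus a `requestId` field.
interface BreakerLogHit {
  time: number;
  requestId?: string;
  msg?: string;
}

function recentBreakerEvents(
  logText: string,
  nowMs: number,
  windowMs: number = 30 * 60 * 1000,
): BreakerLogHit[] {
  const hits: BreakerLogHit[] = [];
  for (const line of logText.split("\n")) {
    // Same text match as the grep commands in step 3.
    if (!line.includes("bedrock-api") && !line.includes("circuit breaker")) continue;
    let entry: { time?: unknown; requestId?: unknown; msg?: unknown };
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip non-JSON lines such as stack traces
    }
    // Drop entries without a numeric timestamp or older than the window.
    if (typeof entry.time !== "number" || entry.time < nowMs - windowMs) continue;
    hits.push({
      time: entry.time,
      requestId: typeof entry.requestId === "string" ? entry.requestId : undefined,
      msg: typeof entry.msg === "string" ? entry.msg : undefined,
    });
  }
  return hits;
}
```

Pass the log file contents and `Date.now()`; the returned `requestId` values can then be used to correlate individual failing requests.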

### IR-2: Chat session corruption or data loss

**Trigger:** Users refresh the page and see an empty transcript, or the chat UI shows "Session reset" repeatedly.

1. **Identify the affected session**
   * Extract `sessionId` from browser `localStorage` or from the `start` event in recent `/api/chat` request logs
   * Locate the Pi session file path under `.fleet/sessions/` inside the repo root
2. **Check session file validity**
   * Verify the session JSONL file exists and is readable
   * Ensure the file is inside the repo-scoped session directory (outside files are rejected by `isUsableSessionFile`)
   * Look for truncated or malformed JSONL lines at the end of the file
3. **Validate localStorage metadata**
   * If `localStorage` contains an invalid `sessionFile` (e.g. pointing to `/etc/hosts` or a non-existent path), the app silently starts a fresh repo-scoped session — this is expected behavior
   * Instruct the user to clear `localStorage` for the site if the stored metadata is corrupt
4. **Attempt manual hydration**
   * Call `POST /api/chat/session` with the `sessionId` to trigger `hydrateChatSession`
   * If the session file cannot be opened, the server returns an empty message list with `sessionReset: true`
5. **Recover or recreate**
   * If the file is corrupt beyond repair, archive it and let the user start a new session
   * If the issue is widespread, check disk space and file system permissions on `.fleet/sessions/`
6. **Follow up**
   * Document the root cause (disk full, permission issue, or Pi SDK bug)
   * Monitor `SessionManager.open` error rates for 24 hours
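
The file validity check in step 2 can be sketched as a line-by-line parse. This is an illustrative helper, not part of Fleet Pi: `checkJsonl` and its report shape are hypothetical, and it only assumes the JSONL convention of one JSON value per line.

```typescript
interface JsonlReport {
  totalLines: number;
  badLines: number[]; // 1-based line numbers that fail JSON.parse
  truncatedTail: boolean; // true when the final line is the only malformed one
}

function checkJsonl(text: string): JsonlReport {
  const lines = text.split("\n");
  // A file that ends with a newline leaves one empty trailing element.
  if (lines[lines.length - 1] === "") lines.pop();
  const badLines: number[] = [];
  lines.forEach((line, index) => {
    try {
      JSON.parse(line);
    } catch {
      badLines.push(index + 1);
    }
  });
  const truncatedTail = badLines.length === 1 && badLines[0] === lines.length;
  return { totalLines: lines.length, badLines, truncatedTail };
}
```

Read the session file as UTF-8 text and pass it in. A report with `truncatedTail: true` points to a partial final write; malformed lines elsewhere in the file suggest deeper corruption, in which case archiving the file is the safer option.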

## Troubleshooting

### Bedrock errors

Symptoms: Chat streams terminate with `error` events, model picker shows unavailable models, or diagnostics contain model registry errors.

* **ThrottlingException (429)** — Bedrock is rate-limiting requests.
  * Check the `requestId` in logs to confirm it is the same across retries
  * The Pi SDK auto-retries with exponential backoff; do not manually retry
  * If sustained, enable request batching or switch to a lower-traffic model variant
* **AccessDeniedException (403)** — IAM role or profile lacks `bedrock:*` permissions.
  * Verify `AWS_PROFILE` and `AWS_BEARER_TOKEN_BEDROCK` environment variables
  * Ensure the IAM policy includes `bedrock:InvokeModel` and `bedrock:InvokeModelWithResponseStream`
* **ValidationException (400)** — The requested model ID is invalid.
  * Check `modelSelection` in the request body against the registry
  * Model IDs use region prefixes (e.g. `us.anthropic.claude-sonnet-4-6`); the backend normalizes candidates but a completely unknown ID will fail
* **ModelNotReadyException** — The model is not enabled in the AWS account.
  * Visit the Bedrock Console > Model access and enable the model for the current region
* **Network / timeout errors** — The circuit breaker `timeout` is 30 seconds.
  * If Bedrock does not respond within 30 seconds, the breaker counts it as a failure
  * Check VPC endpoints or corporate proxy settings if running in a restricted network
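
When many requests fail at once, tallying failures by status code shows which of the cases above dominates. A small sketch, assuming you have already extracted HTTP status codes from the logs; the names `classifyBedrockStatus` and `tallyByKind` are hypothetical, and classifying by status alone is coarse.

```typescript
type BedrockErrorKind = "throttling" | "auth" | "validation" | "other";

// Maps the status codes listed above to a coarse category.
function classifyBedrockStatus(status: number): BedrockErrorKind {
  switch (status) {
    case 429:
      return "throttling";
    case 403:
      return "auth";
    case 400:
      return "validation";
    default:
      return "other";
  }
}

// Counts how many failures fall into each category.
function tallyByKind(statuses: number[]): Record<BedrockErrorKind, number> {
  const tally: Record<BedrockErrorKind, number> = {
    throttling: 0,
    auth: 0,
    validation: 0,
    other: 0,
  };
  for (const status of statuses) {
    tally[classifyBedrockStatus(status)] += 1;
  }
  return tally;
}
```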

### Session hydration failures

Symptoms: After refreshing the browser, prior messages are gone; the UI shows a blank chat; `sessionReset: true` appears in `/api/chat` responses.

* **Invalid `sessionFile` in localStorage** — The browser stores only Pi session metadata (`sessionFile` and `sessionId`). If `sessionFile` points outside the repo session directory, `isUsableSessionFile` returns `false` and a fresh repo-scoped session is created silently.
  * Remediation: Clear site `localStorage` and start a new chat
* **Missing or moved session file** — The session JSONL was deleted or moved after the metadata was stored.
  * Remediation: Check `.fleet/sessions/` for the file; if missing, the session is unrecoverable
* **Corrupt session JSONL** — A malformed line causes `SessionManager.open` to throw.
  * Remediation: Inspect the file with `head -n 20` and `tail -n 5`; remove trailing partial lines if safe, otherwise archive and start fresh
* **Race condition during streaming** — If a page refresh happens while the session is being compacted, the file may be in an inconsistent state.
  * Remediation: Wait 5 seconds and retry hydration; the compaction lock should release
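
To see why a stored `sessionFile` such as `/etc/hosts` is rejected, the containment rule can be sketched as below. This is a hypothetical stand-in, not the actual `isUsableSessionFile` implementation, which may apply additional checks (for example, that the file exists). The sketch does not resolve symlinks.

```typescript
import * as path from "node:path";

// Hypothetical containment check: true only when sessionFile resolves to a
// path strictly inside <repoRoot>/.fleet/sessions.
function isInsideSessionDir(repoRoot: string, sessionFile: string): boolean {
  const sessionDir = path.resolve(repoRoot, ".fleet", "sessions");
  const candidate = path.resolve(repoRoot, sessionFile);
  const rel = path.relative(sessionDir, candidate);
  // Reject the directory itself, anything that climbs out with "..",
  // and paths on a different root (possible on Windows).
  if (rel === "" || rel === "..") return false;
  if (rel.startsWith(".." + path.sep)) return false;
  return !path.isAbsolute(rel);
}
```

Comparing the relative path, rather than testing for a string prefix, avoids accepting sibling directories whose names merely start with the session directory name.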

### Circuit breaker states

The Bedrock API call is wrapped by `opossum` with the following configuration:

| Option                     | Value     | Meaning                                 |
| -------------------------- | --------- | --------------------------------------- |
| `errorThresholdPercentage` | 50%       | Open when more than half of calls fail  |
| `resetTimeout`             | 30,000 ms | Wait 30 s before trying half-open       |
| `volumeThreshold`          | 5         | Minimum 5 calls before breaker can open |
| `timeout`                  | 30,000 ms | Each call must complete within 30 s     |

* **Closed (normal)** — Requests flow to Bedrock. Failures are counted.
* **Open** — All calls are rejected immediately with the fallback error: `"Bedrock API is temporarily unavailable due to repeated failures. Please try again later."`
* **Half-open** — The next call is allowed through as a probe. If the probe succeeds, the breaker closes. If it fails, the breaker opens again for another `resetTimeout`.
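
The transitions can be modeled in a few lines. This is a simplified sketch for reasoning about timing, not the `opossum` implementation: it counts all calls since the last close instead of using a rolling window, treats any call made while half-open as the probe, and assumes the threshold comparison is strictly greater than 50%. `BreakerModel` is a hypothetical name.

```typescript
type BreakerState = "closed" | "open" | "halfOpen";

// Values from the configuration table above.
const BREAKER_CONFIG = {
  errorThresholdPercentage: 50,
  resetTimeout: 30_000,
  volumeThreshold: 5,
};

class BreakerModel {
  state: BreakerState = "closed";
  private fires = 0;
  private failures = 0;
  private openedAt = 0;

  // Returns false when the call would be rejected with the fallback error.
  allow(nowMs: number): boolean {
    if (this.state === "open" && nowMs - this.openedAt >= BREAKER_CONFIG.resetTimeout) {
      this.state = "halfOpen"; // let a probe through
    }
    return this.state !== "open";
  }

  // Records the outcome of a call that was allowed through.
  record(success: boolean, nowMs: number): void {
    if (this.state === "halfOpen") {
      if (success) this.close();
      else this.open(nowMs);
      return;
    }
    this.fires += 1;
    if (success) return;
    this.failures += 1;
    // The breaker cannot open before the minimum call volume is reached.
    if (this.fires < BREAKER_CONFIG.volumeThreshold) return;
    const errorRate = (this.failures / this.fires) * 100;
    if (errorRate > BREAKER_CONFIG.errorThresholdPercentage) this.open(nowMs);
  }

  private open(nowMs: number): void {
    this.state = "open";
    this.openedAt = nowMs;
  }

  private close(): void {
    this.state = "closed";
    this.fires = 0;
    this.failures = 0;
  }
}
```

Under this model, a failed probe restarts the 30-second wait, so a provider that stays down keeps the breaker open with one probe per `resetTimeout`.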

### Chat session mirror (Neon Postgres)

Pi session JSONL files under `.fleet/sessions/` remain authoritative. When `FLEET_PI_CHAT_DATABASE_URL` is set, Fleet Pi mirrors full session entries and run provenance into Neon Postgres so you can query conversations with SQL, power cross-surface history, run analytics, and debug long-running runs.

Mirror writes happen on session create, hydrate, and list paths. Failures are caught and logged with the matching `requestId` — they never interrupt chat streaming.
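
The "catch, log, and continue" behavior described above follows a common best-effort pattern. A minimal sketch, assuming a logger callback that accepts a fields object and a message; `mirrorBestEffort` and the log message are hypothetical and do not reflect the actual helpers in `pi-session-mirror.ts`.

```typescript
// Hypothetical best-effort wrapper: a mirror failure is logged with the
// requestId and swallowed so the caller (the chat stream) is never interrupted.
async function mirrorBestEffort(
  requestId: string,
  write: () => Promise<void>,
  warn: (fields: { requestId: string; err: string }, msg: string) => void,
): Promise<boolean> {
  try {
    await write();
    return true;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    warn({ requestId, err: message }, "chat mirror sync failed");
    return false; // the JSONL file remains authoritative
  }
}
```

Because the wrapper resolves to `false` instead of rejecting, callers can await it in the request path without adding their own error handling.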

**When to enable it**

* You need SQL search or analytics across Pi sessions.
* You run Fleet Pi across multiple surfaces and want a single source for chat history.
* You want durable provenance for tool executions and file mutations beyond what local SQLite captures.

**Roles**

Provision two Neon roles and keep them separate:

| Role           | Privileges                                      | Used by             |
| -------------- | ----------------------------------------------- | ------------------- |
| `neondb_owner` | Full DDL + DML (CREATE, ALTER, DROP, etc.)      | Migration CLI only  |
| `fleet_pi_app` | SELECT, INSERT, UPDATE, DELETE on `pi_*` tables | Running application |

**Configure**

Set both connection strings in `.env`:

```bash
# Runtime mirror (pooled app-role connection)
FLEET_PI_CHAT_DATABASE_URL=postgres://fleet_pi_app:...@ep-xxxx-pooler.neon.tech/neondb?sslmode=require

# Migration-only (direct owner connection)
FLEET_PI_CHAT_MIGRATION_DATABASE_URL=postgres://neondb_owner:...@ep-xxxx.neon.tech/neondb?sslmode=require
```

Leave `FLEET_PI_CHAT_DATABASE_URL` unset to keep Pi conversations in JSONL and local SQLite only.

**Run migrations**

```bash
pnpm --filter web chat:migrate
```

Re-run after pulling changes that update the schema. The script is idempotent and records applied migrations in `fleet_pi_chat_migrations`.
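
Idempotency of this kind usually comes from skipping anything already recorded as applied. The sketch below models that idea only; it is an assumption about how such a runner typically works, not a description of the actual `chat:migrate` script, and the migration file names are hypothetical.

```typescript
// Hypothetical model: given all known migration names and the names already
// recorded in the tracking table, return the ones still to apply, in order.
function pendingMigrations(
  available: string[],
  applied: ReadonlySet<string>,
): string[] {
  // Lexicographic order assumes zero-padded numeric prefixes.
  return [...available].sort().filter((name) => !applied.has(name));
}
```

Running it a second time with every name in `applied` returns an empty list, which is why re-running the command after a successful migration is a no-op.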

**Tables**

| Table                       | Contents                                             |
| --------------------------- | ---------------------------------------------------- |
| `public.pi_sessions`        | Pi session headers and current session metadata      |
| `public.pi_session_entries` | Full raw Pi entries plus normalized search fields    |
| `public.pi_runs`            | Assistant turn/run summaries                         |
| `public.pi_run_events`      | Ordered streamed chat events                         |
| `public.pi_tool_executions` | Tool call inputs, outputs, and claimed paths         |
| `public.pi_file_mutations`  | File mutation summaries attributed to runs and tools |

**Triage**

* Mirror disabled unexpectedly: confirm `FLEET_PI_CHAT_DATABASE_URL` is loaded in the running process (check `/api/health` host env, not just the `.env` file).
* Rows missing for a recent session: grep logs for the session's `requestId` and look for mirror sync warnings; the JSONL file is still authoritative and you can re-trigger sync by hydrating the session.
* Migration fails with permission errors: verify `FLEET_PI_CHAT_MIGRATION_DATABASE_URL` uses `neondb_owner`, not the app role.
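
A quick way to rule out swapped connection strings is to parse both URLs and compare roles and hosts against the configuration above. This is a sketch with hypothetical helper names (`inspectNeonUrl`, `checkMirrorRoles`); it assumes pooled Neon hosts contain `-pooler`, as in the example `.env`, and it never prints the password.

```typescript
interface MirrorUrlInfo {
  role: string;
  pooled: boolean;
  sslRequired: boolean;
}

function inspectNeonUrl(connectionString: string): MirrorUrlInfo {
  const url = new URL(connectionString);
  return {
    role: decodeURIComponent(url.username),
    pooled: url.hostname.includes("-pooler"),
    sslRequired: url.searchParams.get("sslmode") === "require",
  };
}

// Returns a list of human-readable problems; an empty list means both URLs
// match the documented role split.
function checkMirrorRoles(runtimeUrl: string, migrationUrl: string): string[] {
  const problems: string[] = [];
  const runtime = inspectNeonUrl(runtimeUrl);
  const migration = inspectNeonUrl(migrationUrl);
  if (runtime.role !== "fleet_pi_app") {
    problems.push(`runtime URL uses role "${runtime.role}", expected fleet_pi_app`);
  }
  if (migration.role !== "neondb_owner") {
    problems.push(`migration URL uses role "${migration.role}", expected neondb_owner`);
  }
  if (!runtime.pooled) problems.push("runtime URL is not a pooled connection");
  if (migration.pooled) problems.push("migration URL should be a direct connection");
  if (!runtime.sslRequired || !migration.sslRequired) {
    problems.push("sslmode=require is missing");
  }
  return problems;
}
```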

## Quick reference

| Command                                     | Purpose                                          |
| ------------------------------------------- | ------------------------------------------------ |
| `curl -sf http://localhost:3000/api/health` | Verify web server health                         |
| `aws sts get-caller-identity`               | Verify AWS credentials                           |
| `pnpm --filter web test`                    | Run unit tests (including circuit breaker tests) |
| `pnpm lint`                                 | Check code quality                               |
| `pnpm knip`                                 | Detect unused code                               |
| `pnpm --filter web chat:migrate`            | Apply Neon chat mirror schema migrations         |

### Runtime cache pressure

Symptoms: Memory grows over long sessions; Bedrock calls feel slower than expected after long idle periods.

* Pi runtimes are cached per session with a TTL controlled by `FLEET_PI_RUNTIME_TTL_MS` (defaults to 10 minutes).
* Lower the TTL if you want runtimes evicted sooner; raise it to keep them warmer between turns.
* Restart `pnpm dev` to forcibly drop all warm runtimes if you suspect leaked Pi state.
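
The effect of the TTL can be illustrated with a small cache model. This is a sketch under stated assumptions, not the code in `server-runtime.ts`: it assumes the TTL is measured from the last use of a runtime (idle-based) and that invalid values fall back to the 10-minute default. `resolveTtlMs` and `RuntimeCache` are hypothetical names.

```typescript
const DEFAULT_RUNTIME_TTL_MS = 10 * 60 * 1000;

// Parses the environment value, falling back to the default when it is
// unset, non-numeric, or not positive.
function resolveTtlMs(raw: string | undefined): number {
  const parsed = Number(raw);
  const valid = raw !== undefined && Number.isFinite(parsed) && parsed > 0;
  return valid ? parsed : DEFAULT_RUNTIME_TTL_MS;
}

class RuntimeCache<T> {
  private readonly ttlMs: number;
  private readonly entries = new Map<string, { value: T; lastUsedMs: number }>();

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  set(sessionId: string, value: T, nowMs: number): void {
    this.entries.set(sessionId, { value, lastUsedMs: nowMs });
  }

  // Returns the cached runtime and refreshes its idle timer, or evicts it
  // and returns undefined once it has been idle for ttlMs or longer.
  get(sessionId: string, nowMs: number): T | undefined {
    const entry = this.entries.get(sessionId);
    if (!entry) return undefined;
    if (nowMs - entry.lastUsedMs >= this.ttlMs) {
      this.entries.delete(sessionId);
      return undefined;
    }
    entry.lastUsedMs = nowMs;
    return entry.value;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

In this model a lower TTL trades memory for cold starts: sessions that pause longer than the TTL pay the runtime creation cost again on their next turn.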

### Workspace contract drift

Symptoms: `GET /api/workspace/health` returns missing canonical paths or an unexpected manifest version.

* The contract version is pinned at `WORKSPACE_CONTRACT_VERSION = 1` in `workspace-contract.ts`.
* Missing canonical directories should be re-created by workspace bootstrap on the next run; user-authored content is never overwritten.
* Use `POST /api/workspace/reindex` to rebuild the projection index without modifying canonical files.

## Related files

* `apps/web/src/lib/pi/circuit-breaker.ts` — Breaker configuration and factory.
* `apps/web/src/lib/pi/server.ts` — Bedrock invocation and Pi runtime cache.
* `apps/web/src/lib/pi/server-runtime.ts` — Runtime TTL and `PI_AGENT_DIR` resolution.
* `apps/web/src/lib/pi/plan-mode.ts` — Mode allowlists and plan-mode extension.
* `apps/web/src/lib/pi/run-provenance.ts` — Provenance recording around session and tool events, including tool executions and file mutations mirrored to Neon.
* `apps/web/src/lib/db/pi-session-mirror.ts` — Safe sync helpers that mirror Pi sessions into Neon Postgres when `FLEET_PI_CHAT_DATABASE_URL` is set.
* `apps/web/src/lib/pii/sanitizer.ts` — Input redaction before logging.
* `apps/web/src/lib/logger.ts` — Pino logger with redaction and `requestId` correlation.
* `apps/web/src/lib/workspace/workspace-contract.ts` — Workspace contract version and section kinds.
