Error Detection

Recording an agent is step one. Retrace also detects what went wrong — automatically, on every trace — and persists each finding so you get history, alerting, and a triage queue instead of a wall of logs. Detection runs identically whether spans arrive over WebSocket, HTTP, or OpenTelemetry.

Everything below is live today. Browse findings at /detections, or inline on any trace's timeline.

What gets detected

Detector	Failure mode	How it works	Cost
Hallucination (tiered)	`hallucination`	Grounding + mutual-information scoring of model output	hot-path + Tier-3 worker
Tool-output hallucination	`hallucination`	Compares the model's claim about a tool result to the verbatim recorded result	sampled LLM judge
Schema validation	`schema_violation`	Validates tool-call args against the declared tool schema, and structured outputs against the learned schema	deterministic, every trace
Loop / non-progress	`loop`	Identical tool-call hash repeated N times, step-count outliers, reasoning stalls	deterministic, every trace
Goal drift	`goal_drift`	LLM judge over the full conversation: did the agent stay on the original objective?	sampled LLM judge
Context loss	`context_loss`	Detects constraints stated in turn 1 that the agent drops later in a long conversation	sampled LLM judge
Replay divergence	`divergence`	Re-executes the run and structurally diffs (tool graph, retrieved docs, sampling config)	on demand
Regression (golden)	`regression`	Replays a golden trace against a new prompt/model and flags structural regressions	on demand / CI
Root-cause chain	`root_cause`	Walks the span dependency graph backward to the earliest corrupted step	on demand
Distribution drift	`drift`	Scheduled maximum mean discrepancy (MMD) check against a rolling baseline; automatically triggers a re-cluster	scheduled worker
Probabilistic anomaly	`anomaly`	Heavy-hitter tool loops, cardinality drift, duplicate spans	hot-path
Guardrail violation	`guardrail_violation`	Live policy breaches (cost / loop / context / latency / error-rate)	hot-path
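
As a rough illustration of the "identical tool-call hash repeated N times" check in the table above, here is a minimal sketch. The function names, the SHA-256 hashing scheme, and the default threshold are assumptions for illustration, not Retrace's actual implementation. The sketch also counts repeats across the whole trace, where the real detector may only count consecutive ones.

```python
import hashlib
import json
from collections import Counter


def tool_call_hash(name: str, args: dict) -> str:
    """Stable digest of a tool call: same name and same args give the same hash."""
    # sort_keys makes the hash independent of argument ordering.
    payload = json.dumps({"name": name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_loops(calls: list[tuple[str, dict]], threshold: int = 3) -> dict[str, int]:
    """Return {hash: count} for every tool call seen at least `threshold` times."""
    counts = Counter(tool_call_hash(name, args) for name, args in calls)
    return {digest: n for digest, n in counts.items() if n >= threshold}


# Hypothetical trace: the same search is issued three times.
calls = [
    ("search", {"q": "refund policy"}),
    ("search", {"q": "refund policy"}),
    ("fetch", {"url": "https://example.com"}),
    ("search", {"q": "refund policy"}),
]
loops = find_loops(calls, threshold=3)
```

Hashing a canonical JSON form means two calls that differ only in argument order are still treated as the same call.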

Severity & dedup

Every detector funnels through one durable write path. One logical failure = one detection row — a 2,000-iteration loop is a single loop detection whose count updates in place, not 2,000 rows.
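
The "one logical failure = one detection row" rule can be pictured as an upsert. The sketch below is an in-memory stand-in, not the real write path: the `DetectionStore` class and the `(trace_id, detector)` dedup key are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    trace_id: str
    detector: str
    severity: str
    count: int = 1


class DetectionStore:
    """In-memory stand-in for a durable detections table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], Detection] = {}

    def record(self, trace_id: str, detector: str, severity: str) -> Detection:
        # Assumed dedup key: one row per (trace, detector).
        key = (trace_id, detector)
        row = self._rows.get(key)
        if row is None:
            row = Detection(trace_id, detector, severity)
            self._rows[key] = row
        else:
            row.count += 1  # update in place instead of inserting a new row
        return row

    def __len__(self) -> int:
        return len(self._rows)


store = DetectionStore()
for _ in range(2000):
    store.record("trace-1", "loop", "high")
```

After the loop the store holds a single row whose `count` is 2000.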

Severity scale: critical › high › medium › low › info. New high/critical detections email the account owner (deduped per trace/detector per month).

The detections feed

# List (tenant-scoped) — filter by trace, detector, failure_mode, severity, status, date
curl "https://api.retraceai.tech/api/v1/detections?failure_mode=loop&severity=high" \
  -H "x-retrace-key: rt_live_..."

# Aggregate counts for dashboards
curl https://api.retraceai.tech/api/v1/detections/summary -H "x-retrace-key: rt_live_..."

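# Triage: update a detection's status (JSON body, so set Content-Type)
curl -X PATCH "https://api.retraceai.tech/api/v1/detections/<id>" \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{"status":"resolved"}'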
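
If you call the feed from Python rather than curl, the filter query string can be built with the standard library. This is a small sketch: `detections_url` is a hypothetical helper, not part of the Retrace SDK, and it only uses filter names shown in the list example above.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.retraceai.tech/api/v1/detections"


def detections_url(**filters: Optional[str]) -> str:
    """Build the list URL, dropping any filter whose value is None."""
    params = {name: value for name, value in filters.items() if value is not None}
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL


url = detections_url(failure_mode="loop", severity="high", status=None)
print(url)
```

Pass the resulting URL to your HTTP client of choice together with the `x-retrace-key` header.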

Verify replay & divergence (2A)

Re-execute a trace and structurally diff it against the recording. Divergence is ranked by the first divergent step, not raw character diff.

curl -X POST https://api.retraceai.tech/api/v1/traces/<id>/verify-replay \
  -H "x-retrace-key: rt_live_..."

In the web app, hit Verify Replay on any trace to see the divergence score and jump to the first divergent step.
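
To make "ranked by the first divergent step" concrete, here is a minimal sketch that compares two runs step by step. Modelling a step as a `(tool name, arguments)` pair is an assumption for illustration. The actual diff also covers the tool graph, retrieved docs, and sampling config.

```python
from typing import Optional

Step = tuple[str, dict]  # (tool name, arguments): a simplified, assumed shape


def first_divergent_step(recorded: list[Step], replayed: list[Step]) -> Optional[int]:
    """Index of the first step where the runs differ, or None if they match."""
    for index, (original, replay) in enumerate(zip(recorded, replayed)):
        if original != replay:
            return index
    # One run is a strict prefix of the other: they diverge where it ends.
    if len(recorded) != len(replayed):
        return min(len(recorded), len(replayed))
    return None


recorded = [("search", {"q": "x"}), ("fetch", {"id": 1}), ("summarize", {})]
replayed = [("search", {"q": "x"}), ("fetch", {"id": 2}), ("summarize", {})]
divergence = first_divergent_step(recorded, replayed)
```

Comparing structure rather than raw text means a reworded but equivalent model output does not count as divergence, while a changed tool argument does.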

Root-cause chain (2B)

curl -X POST https://api.retraceai.tech/api/v1/traces/<id>/root-cause-chain \
  -H "x-retrace-key: rt_live_..."

Returns the causal chain from the failure back to the earliest corrupted step (empty output, error, or corrupted tool result) — distinct from the LLM "explain failure" summary.
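
The backward walk can be sketched as follows. The `Span` shape, the `corrupted` flag, and the rule of following the first corrupted dependency are simplifying assumptions for illustration, not the endpoint's real data model or algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    span_id: str
    corrupted: bool = False  # empty output, error, or corrupted tool result
    depends_on: list[str] = field(default_factory=list)


def root_cause_chain(spans: dict[str, Span], failed_id: str) -> list[str]:
    """Walk corrupted dependencies backward from the failed span.

    Returns span ids ordered from the failure to the earliest corrupted step.
    """
    chain = [failed_id]
    seen = {failed_id}
    current = spans[failed_id]
    while True:
        upstream = [
            dep for dep in current.depends_on
            if dep in spans and spans[dep].corrupted and dep not in seen
        ]
        if not upstream:
            return chain
        current = spans[upstream[0]]  # sketch: follow the first corrupted parent
        chain.append(current.span_id)
        seen.add(current.span_id)


# Hypothetical trace: a bad lookup corrupts the draft, which corrupts the answer.
spans = {
    "plan": Span("plan"),
    "lookup": Span("lookup", corrupted=True, depends_on=["plan"]),
    "draft": Span("draft", corrupted=True, depends_on=["lookup"]),
    "answer": Span("answer", corrupted=True, depends_on=["draft", "plan"]),
}
chain = root_cause_chain(spans, "answer")
```

Here the chain stops at `lookup`, because its only dependency, `plan`, is not corrupted.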

Regression replay & golden traces (2E)

Mark known-good traces golden, then replay them against a new prompt or model and assert structural equivalence. Regressions fail the GitHub Action gate.

import retrace
retrace.mark_golden(trace_id)          # Python

import { markGolden } from "retrace-sdk";
await markGolden(traceId);              // TypeScript

# Replay the whole golden set against current code (CI)
curl -X POST https://api.retraceai.tech/api/v1/golden-set/regression-replay \
  -H "x-retrace-key: rt_live_..."

From a fired detection, one click publishes an unlisted link (/t/<slug>) that surfaces the detection inline on the timeline — drop it into a GitHub issue, Discord, or Slack to get help. See Sharing & forking.
