Reliability & observability for AI agents

Retrace records every LLM call, tool call, and decision your agent makes — so you can replay any run, fork from the exact step it broke, and verify the fix before you ship. Plus guardrails that stop runaway agents in production.

No credit card required
Works with OpenAI · Anthropic · Gemini · any LLM, any framework.

// the core loop

Record what your agent did. Replay it, fork it, fix it — and prove the fix.

From a single decorator to a verdict — the reliability loop Retrace owns, end to end.

01

Record

One decorator captures every LLM call, tool call, and decision your agent makes.

@retrace.record()
02

Replay & fork

Re-run any recorded run, or fork from the exact step it broke and watch the agent diverge.

generate_response · error ↳ fork · ok
03

Detect

Automatic failure detection — groundedness, drift, and failure clustering — with runs auto-classified by failure type (MAST).

groundedness · drift · MAST
04

Enforce

Guardrails and circuit breakers halt runaway loops and budget blow-outs in production — before damage cascades.

budget $1.00 → HALT
05

Prove the fix

Re-run a change against the failed run and get a verdict — did it actually fix it?

re-ran → verdict: fixed ✓

every run feeds the next — record, fix, repeat.
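One way to picture the verdict in step 05: score the failed run and the re-run with the same evaluator, then compare. This is a minimal sketch, assuming a numeric score where higher is better. The function name, the `tolerance` parameter, and the example scores are hypothetical and are not Retrace's API.

```python
def verdict(score_before: float, score_after: float, tolerance: float = 0.02) -> str:
    """Compare one evaluator score before and after a change (hypothetical helper)."""
    delta = score_after - score_before
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "regressed"
    return "unchanged"


# Illustrative scores only: a failed run re-scored after a fix.
print(verdict(0.41, 0.86))
```

The tolerance band keeps small evaluator noise from being reported as a real change.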

// see it for real

A real recording. Scrub it yourself.

The actual trace UI — a recorded run that failed on an ungrounded answer, then forked from that exact step and passed. Scrub the timeline, open any span, watch the fork diverge.

Recorded agent run: 8 spans, 1 error at step 5 (generate_response), then forked from that step into a path that succeeds. Prove-the-fix verdict: fix verified.
support-agent · run 0x7f2a
failed · $0.0050
forked at step 5 → re-ran failed run → verdict: fix verified
  • classify_intent · 412ms · $0.0008
  • fetch_user_context · 1.84s
  • retrieve_docs · 690ms
  • plan_reply · 320ms · $0.0003
  • generate_response · 2.36s · $0.0021 · err
  • generate_response · 1.98s · $0.0012 · ok
  • validate_output · 96ms · ok
  • format_reply · 540ms · $0.0006 · ok

8 spans · 1 error at step 5 · forked at step 5 · fix verified
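The summary line can be recomputed from the span list itself. The sketch below copies the span values from the recording above into a made-up `Span` dataclass (the field names are assumptions, not Retrace's span schema) and derives the span count, the first failing step, and the total cost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    duration_s: float
    cost_usd: float = 0.0
    status: Optional[str] = None  # "err", "ok", or None when no status is shown


# Values copied from the recorded run shown above.
SPANS = [
    Span("classify_intent", 0.412, 0.0008),
    Span("fetch_user_context", 1.84),
    Span("retrieve_docs", 0.690),
    Span("plan_reply", 0.320, 0.0003),
    Span("generate_response", 2.36, 0.0021, "err"),
    Span("generate_response", 1.98, 0.0012, "ok"),
    Span("validate_output", 0.096, status="ok"),
    Span("format_reply", 0.540, 0.0006, "ok"),
]


def summarize(spans):
    """Return span count, error count, 1-based step of the first error, and total cost."""
    first_error = next(
        (i for i, s in enumerate(spans, start=1) if s.status == "err"), None
    )
    return {
        "spans": len(spans),
        "errors": sum(1 for s in spans if s.status == "err"),
        "first_error_step": first_error,
        "cost_usd": round(sum(s.cost_usd for s in spans), 4),
    }


print(summarize(SPANS))
```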

// how it's different

Observability shows you what broke. Retrace re-runs it, forks the failed step, and proves the fix.

01

Observability: Aggregate metrics & dashboards
Retrace: Replay the exact run, fork from the failed step, and cascade-re-execute

02

Observability: Search through raw JSON spans
Retrace: Semantic search, describe the bug in natural language (pgvector)

03

Observability: No way to test a fix without re-running everything
Retrace: Prove-the-fix, re-run a change against the failed run and get a verdict

04

Observability: Static alert thresholds
Retrace: Adaptive guardrails + circuit breakers that halt runaway loops & budget

05

Observability: Post-mortem analysis only
Retrace: Runtime enforcement, stop agents in production, not after

06

Observability: Tied to one framework
Retrace: One decorator, any Python/TS agent (LangChain · CrewAI · LlamaIndex)

// the platform

A full reliability platform — not a dashboard.

Detect failures, enforce limits, evaluate quality, and understand multi-agent systems — on every recorded run.

Detect

Automatic failure detection — groundedness via cosine similarity + an LLM faithfulness judge (tiered cheap→deep), statistical drift, failure clustering, and MAST classification of failure types.

groundedness · drift · failure clustering · MAST
⚠ ungrounded claim · faithfulness 0.41 → flagged
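To make "tiered cheap→deep" concrete, here is a minimal sketch of the idea: a cosine-similarity check decides the clear cases, and only ambiguous answers are escalated to a judge. The thresholds, the function names, and the stub judge are assumptions for illustration. Real embeddings and a real LLM judge would replace the toy inputs.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_groundedness(
    answer_vec: Sequence[float],
    source_vecs: Sequence[Sequence[float]],
    judge: Callable[[], float],
    accept_at: float = 0.85,  # hypothetical thresholds
    reject_at: float = 0.30,
    judge_pass: float = 0.50,
) -> str:
    """Cheap tier: best cosine match. Deep tier: call the judge only when unsure."""
    best = max((cosine_similarity(answer_vec, s) for s in source_vecs), default=0.0)
    if best >= accept_at:
        return "grounded"
    if best <= reject_at:
        return "flagged"
    return "grounded" if judge() >= judge_pass else "flagged"


# Toy vectors; the stub judge returns the 0.41 faithfulness score from the example above.
print(check_groundedness([1.0, 1.0], [[1.0, 0.0]], judge=lambda: 0.41))
```

Passing the judge as a callable means the expensive call happens only on the ambiguous middle band.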

Enforce

Stop runaway agents in production — guardrails and circuit breakers on budget, loops, and steps, fronted by a pre-call enforcement gateway with hold-for-approval.

guardrails · circuit breakers · gateway · hold-for-approval
loop ×12 detected → HALT
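A pre-call check on budget, loops, and steps can be sketched in a few lines. The class below is a hypothetical stand-in, not Retrace's gateway: `RunGuard`, `Halt`, and the default limits are invented for the example, and "loop" is simplified to mean the same tool called with the same arguments.

```python
from collections import Counter


class Halt(Exception):
    """Raised when a limit is exceeded (stands in for the HALT command)."""


class RunGuard:
    def __init__(self, budget_usd: float = 1.00, max_repeats: int = 12, max_steps: int = 50):
        self.budget_usd = budget_usd
        self.max_repeats = max_repeats
        self.max_steps = max_steps
        self.spent = 0.0
        self.steps = 0
        self.calls = Counter()

    def check(self, tool: str, args: str, cost_usd: float) -> None:
        """Call before each step; raises Halt instead of letting the step run."""
        self.steps += 1
        self.calls[(tool, args)] += 1
        if self.spent + cost_usd > self.budget_usd:
            raise Halt(f"budget ${self.budget_usd:.2f} would be exceeded")
        if self.calls[(tool, args)] >= self.max_repeats:
            raise Halt(f"loop x{self.calls[(tool, args)]} detected on {tool}")
        if self.steps > self.max_steps:
            raise Halt(f"step limit {self.max_steps} exceeded")
        self.spent += cost_usd


guard = RunGuard()
try:
    for _ in range(20):
        guard.check("search", "same query", cost_usd=0.001)
except Halt as reason:
    print("HALT:", reason)
```

Checking before the call, rather than after, is what keeps the over-budget or looping step from running at all.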

Evaluate

Quality you can gate on — evaluations, auto eval-rules, CI gates that block bad deploys, datasets, and prove-the-fix verdicts.

evaluations · eval rules · CI gates · datasets · prove-the-fix
eval gate · 0.86 ≥ 0.80 · PASS

Understand

See the whole system — multi-agent sessions and agent topology, agent memory, semantic search, prompt versioning, and shareable tapes.

sessions · agent topology · memory · semantic search · prompts · tapes
session · planner → researcher → writer

// how it works

From zero to your first trace in ~2 minutes.

Sign in with GitHub, copy ~3 lines, run your agent — the first trace streams in live. No infrastructure to manage.

  1. step 01

    Instrument

    Install the SDK and add one decorator. Calls to OpenAI, Anthropic and Gemini are captured automatically.

    Framework-agnostic — works with LangChain, CrewAI, and LlamaIndex.

    agent.py
    import retrace

    retrace.configure(api_key="rt_live_...")

    @retrace.record(name="my-agent")
    def run_agent(prompt):
        # `agent` is your existing agent object (LangChain, CrewAI, ...)
        return agent.invoke(prompt)

    SDK connected
    classify_intent · recorded
  2. step 02

    Observe

    Run your agent. Every LLM call, tool call, cost and error streams onto the timeline as it happens.

    Or watch it replay step-by-step in the dashboard.

    terminal
    retrace traces tail
    REC · live
    classify_intent · 0.4s
    fetch_context · 1.8s
    generate_response · 2.4s
  3. step 03

    Debug & prove

    Fork from the exact step that broke, change the input, re-run — then prove the fix actually worked.

    verify-fix returns a verdict — improved, regressed, or unchanged.

    terminal
    retrace forks create --trace <id> --span <id> --input "grounded prompt"
    retrace forks replay <id> --wait
    retrace traces verify-fix <id>
    generate_response · error
    ↳ fork · ok
    verify-fix → fix verified
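For intuition about what the decorator in step 01 does, here is a rough sketch of a recording decorator in plain Python. It is not the Retrace SDK: `record`, `RECORDED_SPANS`, and the span fields are hypothetical, and the spans go to an in-memory list instead of a trace backend.

```python
import functools
import time

RECORDED_SPANS = []  # in-memory stand-in for a trace backend


def record(name=None):
    """Hypothetical decorator: capture inputs, output or error, and duration per call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"name": name or fn.__name__, "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                span["output"] = fn(*args, **kwargs)
                span["status"] = "ok"
                return span["output"]
            except Exception as exc:
                span["status"] = "error"
                span["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed calls are recorded too.
                span["duration_s"] = time.perf_counter() - start
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


@record(name="classify_intent")
def classify_intent(prompt):
    return "billing" if "invoice" in prompt else "other"


classify_intent("Where is my invoice?")
print(RECORDED_SPANS[0]["name"], RECORDED_SPANS[0]["status"])
```

Storing the inputs alongside the output is what makes a later replay or fork possible: the same call can be re-issued with the recorded or a modified input.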

// pricing

Start free. Scale when ready.

No credit card required. Upgrade when you need more traces or AI requests.

Free
$0

For experimenting

  • 1,000 traces/mo
  • 7-day retention
  • Fork & replay: $5/mo add-on
  • 1 user
Start free
Starter
$29/mo

For solo builders

  • 10,000 traces/mo
  • 30-day retention
  • 100 fork replays/mo
  • Cassette VCR replay
  • 25 prove-the-fix runs/mo
  • 1 user
Get started
Pro
$99/mo

For shipping

  • 50,000 traces/mo
  • 90-day retention
  • Unlimited fork replays
  • Cassette VCR replay
  • 200 prove-the-fix runs/mo
  • 1 user
  • CI regression gates
  • Multi-agent detectors
Get started
Teams
$399/mo

For teams

  • 500,000 traces/mo
  • 365-day retention
  • Unlimited fork replays
  • Cassette VCR replay
  • 1,000 prove-the-fix runs/mo
  • Up to 10 users
  • Team traces & collaboration
  • CI regression gates
  • Multi-agent detectors
Get started
Enterprise
Contact us

For scale

  • Unlimited traces/mo
  • Custom retention
  • Unlimited fork replays
  • Cassette VCR replay
  • Unlimited prove-the-fix runs/mo
  • Unlimited users
  • Team traces & collaboration
  • CI regression gates
  • Multi-agent detectors
Talk to us

// questions

Frequently asked

How long does setup take?

Under 2 minutes. Install the SDK, add one decorator, and traces stream immediately. No infrastructure to manage.

What languages and providers are supported?

Python and TypeScript SDKs with auto-instrumentation for OpenAI, Anthropic, and Google Gemini. Works with any agent framework — LangChain, CrewAI, Vercel AI SDK, AutoGen, LlamaIndex.

How does fork & replay work?

Select any span in a trace, modify its input, and Retrace cascade-replays from that point forward. Context from the fork flows into subsequent LLM calls. You get a side-by-side diff with cost and latency deltas.
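The cascade idea can be sketched with a toy pipeline: keep the recorded spans before the fork point, re-execute everything from the fork point with the new input, and compare outputs. The `run`, `fork`, and `diff` helpers below are hypothetical and assume each step is a plain function of the previous step's output. Cost and latency deltas are left out.

```python
from typing import Any, Callable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run(steps: List[Step], first_input: Any) -> List[dict]:
    """Execute steps in order, feeding each output into the next step."""
    trace, value = [], first_input
    for name, fn in steps:
        out = fn(value)
        trace.append({"name": name, "input": value, "output": out})
        value = out
    return trace


def fork(steps: List[Step], trace: List[dict], at: int, new_input: Any) -> List[dict]:
    """Keep recorded spans before `at`, then re-execute from `at` with a new input."""
    return trace[:at] + run(steps[at:], new_input)


def diff(old: List[dict], new: List[dict]) -> List[str]:
    """Names of steps whose output changed between the two runs."""
    return [o["name"] for o, n in zip(old, new) if o["output"] != n["output"]]


# Toy pipeline: each step is a plain function.
steps = [
    ("retrieve_docs", lambda q: q + " | docs"),
    ("generate_response", lambda ctx: "answer(" + ctx + ")"),
    ("format_reply", lambda text: text.upper()),
]
original = run(steps, "refund policy")
forked = fork(steps, original, at=1, new_input="refund policy | grounded docs")
print(diff(original, forked))
```

Steps before the fork point are reused from the recording, so only the forked step and everything downstream of it are executed again.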

What are guardrails?

Runtime policies that monitor your agent in real time. Set cost budgets, loop detection, context overflow limits, or latency caps. When a policy is violated, the agent receives a HALT command, stopping it before damage cascades.

Is my data secure?

TLS in transit, encrypted at rest. API keys are SHA-256 hashed. PII auto-redaction runs on every plan as a security baseline. Tenant isolation is enforced at the application layer — every query is scoped per user and backed by a guardrail regression test.
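Storing only a SHA-256 digest of an API key is a common pattern, and it can be sketched with the Python standard library. The helper names and the `rt_test_` prefix below are made up for the example. How Retrace generates, stores, or looks up keys is not described here, so treat this as an illustration of the technique only.

```python
import hashlib
import hmac
import secrets


def new_api_key(prefix: str = "rt_test_") -> str:
    """Generate a random key; the prefix here is made up for the example."""
    return prefix + secrets.token_urlsafe(32)


def hash_key(api_key: str) -> str:
    """Only this digest is stored; the plaintext key is shown to the user once."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_key(presented: str, stored_digest: str) -> bool:
    """Hash the presented key and compare in constant time."""
    return hmac.compare_digest(hash_key(presented), stored_digest)


key = new_api_key()
stored = hash_key(key)
print(verify_key(key, stored), verify_key(key + "x", stored))
```

A fast hash like SHA-256 is reasonable here because the key is long and random. It would not be a good choice for human-chosen passwords.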

Can I use this in CI/CD?

Yes. The eval gate endpoint (POST /evaluations/:id/gate) returns pass/fail against a threshold. The CLI command `retrace eval gate` exits with code 1 on failure — perfect for GitHub Actions.
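The gate logic itself is a threshold comparison mapped to an exit code. The sketch below shows that mapping in isolation. It does not call the Retrace endpoint or CLI, and the function names and the 0.80 default are assumptions taken from the example values on this page.

```python
def gate(score: float, threshold: float) -> bool:
    """Pass when the evaluation score meets or exceeds the threshold."""
    return score >= threshold


def main(score: float, threshold: float = 0.80) -> int:
    """Return a process exit code: 0 on pass, 1 on failure."""
    passed = gate(score, threshold)
    print(f"eval gate: {score:.2f} >= {threshold:.2f} -> {'PASS' if passed else 'FAIL'}")
    return 0 if passed else 1


# The score would come from an evaluation run; 0.86 is the example value used on this page.
exit_code = main(0.86)  # in a CI script, finish with sys.exit(exit_code)
```

A non-zero exit code is what makes a CI job fail, so a failing gate blocks the deploy without any extra wiring.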

How is this different from LangSmith?

LangSmith focuses on tracing and observability. Retrace adds interactive fork & cascade-replay from any step, runtime guardrails that halt runaway agents, groundedness detection, and prove-the-fix verification.

Does it work with multi-agent systems?

Yes. Each span carries an agent id, sessions group multi-turn conversations, and an agent topology graph shows cross-agent ordering and inter-agent failure modes.
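An agent topology can be derived from nothing more than ordered spans that carry an agent id. A minimal sketch, assuming a flat, time-ordered list of spans. The field names (`session`, `agent_id`, `name`) and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical span records: only the fields needed for the example.
spans = [
    {"session": "s1", "agent_id": "planner", "name": "plan"},
    {"session": "s1", "agent_id": "planner", "name": "delegate"},
    {"session": "s1", "agent_id": "researcher", "name": "search"},
    {"session": "s1", "agent_id": "researcher", "name": "summarize"},
    {"session": "s1", "agent_id": "writer", "name": "draft"},
]


def topology(spans):
    """Count hand-offs between consecutive spans that belong to different agents."""
    edges = Counter()
    for prev, cur in zip(spans, spans[1:]):
        if prev["session"] == cur["session"] and prev["agent_id"] != cur["agent_id"]:
            edges[(prev["agent_id"], cur["agent_id"])] += 1
    return edges


for (src, dst), count in topology(spans).items():
    print(f"{src} -> {dst} ({count})")
```

The resulting edges give the planner → researcher → writer ordering. Attaching error counts to each edge would be one way to surface inter-agent failure modes.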

Stop guessing.
Start replaying.

Your agent failed 3 steps before the error surfaced. Fork from the real cause — not the symptom.

No credit card · 2-min setup