Retrace, The Execution Replay Engine for AI Agents

CI for AI agent behavior

Retrace records every model and tool call your agent makes, so you can re-run a failure, branch from the exact step it went wrong, and check your fix before you ship.

Start free · See a live trace

No credit card required

Works with OpenAI · Anthropic · Gemini · any LLM, any framework.

// the core loop

Record what your agent did. Replay it, fork from where it broke, and prove the fix.

From a single decorator to a verdict. This is the reliability loop Retrace owns, end to end.

Record

One decorator captures every LLM call, tool call, and error. Now a production failure becomes a permanent regression test you can re-run.

@retrace.record()
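
As a rough illustration of what a recording decorator does, the sketch below wraps a function and stores its inputs, output, error, and duration as a span. It is not Retrace's implementation: `record`, `RECORDED_SPANS`, and the body of `classify_intent` are hypothetical stand-ins, and a real SDK would stream spans to a backend rather than keep them in a list.

```python
import functools
import time

# Hypothetical in-memory store; a real SDK would send spans to a backend.
RECORDED_SPANS = []


def record(name=None):
    """Illustrative decorator: capture inputs, output, error, and duration per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = {"name": name or func.__name__, "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                span["output"] = func(*args, **kwargs)
                span["status"] = "ok"
                return span["output"]
            except Exception as exc:
                span["status"] = "error"
                span["error"] = repr(exc)
                raise  # re-raise so the caller still sees the failure
            finally:
                # Runs on both success and failure, so every call leaves a span.
                span["duration_s"] = time.perf_counter() - start
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


@record(name="classify_intent")
def classify_intent(text):
    if not text:
        raise ValueError("empty input")
    return "question" if text.endswith("?") else "statement"
```

Because the failing call is stored with its exact arguments, it can be re-run later as a regression case.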

Fork & Replay

Re-run any recorded run, or fork from the exact step it broke and trace what actually caused it.

generate_response · error ↳ fork · ok

Fix

Change the step that broke (the prompt, a tool input, or the model) and re-run the fork to see the corrected path before you ship.

input edited → re-run

Prove the Fix

Re-run a change against the failed run and get a verdict before you ship.

re-ran → verdict: fixed ✓

↻ every run feeds the next. record, fix, repeat.

// see it for real

A real recording. Scrub it yourself.

The actual trace UI shows a recorded incident-RCA run that stopped at the OOM symptom and was marked failed. The run was then forked from that step, and the fork traced the real cause: a deploy that raised the batch size 50×. Scrub the timeline, open any span, watch the fork diverge.

Explore your own. Start free

incident-rca · search-indexer · INC-4835

failed · $0.0010

forked at step 4 → re-ran failed run → verdict: fix verified

retrace.ai.generate · 6.18s · $0.0001
search_loki_logs · 0ms
query_prometheus · 0ms
retrace.ai.generate · 5.83s · $0.0005 · err
↳ retrace.ai.generate · 2.79s · $0.0003 · ok
↳ retrace.ai.generate · 2.01s · $0.0001 · ok

6 spans · 1 error at step 4 · forked at step 4 · fix verified

Interactive. A real trace exported from the product (incident-rca · INC-4835). Spans, costs and timings are the recorded values.
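
The summary line can be checked by hand from the six spans listed above. The sketch below does that arithmetic; the span values are copied from the trace, while the tuple layout, the `summarize` helper, the zero cost for the two tool calls, and the "ok" status for unmarked spans are assumptions made for illustration.

```python
from decimal import Decimal

# (name, duration, cost in USD, status), copied from the recorded trace.
# Assumptions: tool spans with no listed cost are $0, unmarked spans are "ok".
SPANS = [
    ("retrace.ai.generate", "6.18s", Decimal("0.0001"), "ok"),
    ("search_loki_logs", "0ms", Decimal("0"), "ok"),
    ("query_prometheus", "0ms", Decimal("0"), "ok"),
    ("retrace.ai.generate", "5.83s", Decimal("0.0005"), "err"),
    ("retrace.ai.generate", "2.79s", Decimal("0.0003"), "ok"),  # fork
    ("retrace.ai.generate", "2.01s", Decimal("0.0001"), "ok"),  # fork
]


def summarize(spans):
    """Count spans and errors, find the first failing step, and total the cost."""
    total = sum((cost for _, _, cost, _ in spans), Decimal("0"))
    error_steps = [
        step for step, (_, _, _, status) in enumerate(spans, start=1)
        if status == "err"
    ]
    return {
        "spans": len(spans),
        "errors": len(error_steps),
        "first_error_step": error_steps[0] if error_steps else None,
        "total_cost_usd": total,
    }
```

Under those assumptions, `summarize(SPANS)` reports 6 spans, 1 error at step 4, and a total of $0.0010, which matches the figures shown for the run.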

// the platform

A full reliability platform, not just a dashboard.

Detect failures, enforce limits, evaluate quality, and understand multi-agent systems, on every recorded run.

Detect

Flag groundedness gaps, statistical drift, failure clusters, and MAST failure types automatically, so you learn why a run failed, not just that it did, before a user does.

groundedness · drift · failure clustering · MAST

⚠ ungrounded claim · faithfulness 0.41 → flagged

Enforce

Halt a run at a budget, loop, or step limit, and block a bad action before it runs via a pre-call gateway with hold-for-approval — so a bug can't quietly cascade into a cloud bill.

guardrails · circuit breakers · gateway · hold-for-approval

loop ×12 detected → HALT

Evaluate

Quality you can gate on — evaluations, auto eval-rules, datasets, and prove-the-fix verdicts, plus a CI gate so a regression fails the build instead of reaching users.

evaluations · eval rules · CI gates · datasets · prove-the-fix

eval gate · 0.86 ≥ 0.80 · PASS

Understand

See the whole system — sessions, agent topology, agent memory, semantic search, prompt versioning, and shareable tapes — so a multi-agent failure isn't a black box.

sessions · agent topology · memory · semantic search · prompts · tapes

session · planner → researcher → writer

// how it's different

Observability shows you what broke. Retrace re-runs it, forks the failed step, and proves the fix.

The tool you have → With Retrace

Aggregate metrics & dashboards → Replay the exact run. Fork from the failed step and cascade-re-execute it to find what actually caused the failure.

Search through raw JSON spans → Semantic search: describe the bug in plain language and jump to the failing run in seconds.

No way to test a fix without re-running everything → Prove-the-fix: re-run a change against the failed run and get a verdict before you ship.

Static alert thresholds → Guardrails and circuit breakers that halt a run when it blows a budget or loops, so a bug can't quietly run up your bill.

Post-mortem analysis only → Runtime enforcement: block a bad action before it runs instead of diagnosing it afterward.

Tied to one framework → One decorator works with any Python or TS agent (LangChain, CrewAI, LlamaIndex). No lock-in.

// how it works

From zero to your first trace in ~2 minutes.

step 01
Instrument
Install the SDK and add one decorator. Calls to OpenAI, Anthropic and Gemini are captured automatically.
Framework-agnostic. Works with LangChain, CrewAI, and LlamaIndex.
agent.py
```
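import retrace

# Authenticate the SDK with your Retrace API key.
retrace.configure(api_key="rt_...")

# Record every LLM call, tool call, and error raised inside this function.
@retrace.record(name="my-agent")
def run_agent(prompt):
    # `agent` is your existing agent object (LangChain, CrewAI, LlamaIndex, or similar).
    return agent.invoke(prompt)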
```
SDK connected
classify_intent · recorded
step 02
Observe
Run your agent. Every LLM call, tool call, cost and error streams onto the timeline as it happens.
Or watch it replay step-by-step in the dashboard.
terminal
```
retrace traces tail
```
REC · live
classify_intent · 0.4s
fetch_context · 1.8s
generate_response · 2.4s
step 03
Debug & prove
Fork from the exact step that broke, change the input, re-run, then prove the fix actually worked.
verify-fix returns a verdict: improved, regressed, or unchanged.
terminal
```
retrace forks create --trace <id> --span <id> --input "grounded prompt"
retrace forks replay <id> --wait
retrace traces verify-fix <id>
```
generate_response · error
↳ fork · ok
verify-fix → fix verified

// ci for ai agent behavior

Ship AI agents with real regression tests.

A prompt edit, tool change, or model upgrade can quietly break multi-step behavior. Retrace turns each production failure into a regression test and runs it as an eval gate on every PR, so the build fails when behavior breaks. It diffs each run against a golden baseline, so you catch “relevance dropped vs last release,” not just a red or green dot.

.github/workflows/eval-gate.yml

- name: Eval Gate
  run: retrace eval gate --evaluation $EVAL_ID --trace $TRACE_ID --threshold 0.8
  env:
    RETRACE_API_KEY: ${{ secrets.RETRACE_API_KEY }}

Checks

✗ retrace / eval-gate · behavior regressed, build failed

✓ retrace / eval-gate · passed 5/5 runs, merge unblocked

How the eval gate works

the closed loop

01 · production trace · a real run fails
02 · captured failure · becomes a regression test
03 · eval gate on the PR · re-runs that test
04 · green check · behavior holds, merge

what it gates on (that unit tests can't)

trajectory / looping · tool-call correctness · silent error-invention · multi-agent hand-offs · prompt drift

The build fails when behavior breaks, so that exact failure is harder to ship again. (retrace eval gate exits 1.)
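
The gate's contract with CI is just an exit code: 0 lets the merge through, 1 fails the build. The sketch below shows that logic under a stated assumption, namely that every run must score at or above the threshold. The page does not say how Retrace aggregates scores across runs, and `gate_exit_code` is a hypothetical helper, not part of the CLI.

```python
def gate_exit_code(scores, threshold=0.8):
    """Return 0 if every run meets the threshold, otherwise 1.

    Assumption: all runs must pass. An empty score list fails the gate,
    so a misconfigured job cannot pass by evaluating nothing.
    """
    passed = sum(score >= threshold for score in scores)
    print(f"passed {passed}/{len(scores)} runs")
    return 0 if scores and passed == len(scores) else 1
```

In a CI script this would end with `sys.exit(gate_exit_code(scores))`, so a single run that falls below 0.8 turns the check red.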

// your model account

Eval gates and replays run on your model account.

Add your Google (Gemini) API key in Settings and the eval-gate judge and every server-side replay (fork, cascade, and prove-the-fix) call the model through your key, so the tokens are billed to your own provider account. The key is validated on save, encrypted at rest (AES-256-GCM), shown only as its last four characters, and never returned again. Remove it any time and Retrace falls back to the platform key.

Google · key ••••a1b2 · validated ✓

what your key powers

eval-gate judge · scores every PR run
fork & cascade replay · re-executes the agent
prove-the-fix · computes the verdict

Add your key in Settings

how the key is handled

Validated against the provider on save. A bad key is rejected, never stored.
Encrypted at rest with AES-256-GCM and a per-secret derived key.
Surfaced only as its last 4 characters, never logged, never returned.
Remove it and replays fall back to the platform key with no downtime.
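
The "last 4 characters" rule is simple to picture. The sketch below is one way to produce the display form shown above (`••••a1b2`); `mask_key` is a hypothetical helper, the fixed run of four bullets is an assumption, and it covers only the display step, not validation or encryption.

```python
def mask_key(api_key, visible=4, bullet="•"):
    """Return a display form that exposes only the last `visible` characters."""
    if len(api_key) <= visible:
        # Never reveal a very short key in full.
        return bullet * len(api_key)
    return bullet * 4 + api_key[-visible:]
```

The full key never needs to leave the server once it is saved; only the masked form is sent to the UI.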

provider support

Google / Gemini · powers eval gates + replays today

OpenAI · Anthropic · validated + stored; native replay coming

// pricing

Start free. Scale when ready.

No credit card required. Upgrade when you need more traces or AI requests.

Free

For experimenting

+1,000 traces/mo
+7-day retention
+10 fork replays/mo
+1 user

Start free

Starter

$29/mo

For solo builders

+10,000 traces/mo
+30-day retention
+100 fork replays/mo
+Cassette VCR replay
+25 prove-the-fix runs/mo
+1 user

Get started

Pro

$99/mo

For shipping

+50,000 traces/mo
+90-day retention
+Unlimited fork replays
+Cassette VCR replay
+200 prove-the-fix runs/mo
+1 user
+CI regression gates
+Multi-agent detectors
+Sandbox env: 5,000 traces/mo

Get started

Teams

$399/mo

For teams

+500,000 traces/mo
+365-day retention
+Unlimited fork replays
+Cassette VCR replay
+1,000 prove-the-fix runs/mo
+Up to 10 users
+Team traces & collaboration
+CI regression gates
+Multi-agent detectors
+Sandbox env: 50,000 traces/mo

Get started

Enterprise

For scale

+Unlimited traces/mo
+Custom retention
+Unlimited fork replays
+Cassette VCR replay
+Unlimited prove-the-fix runs/mo
+Unlimited users
+Team traces & collaboration
+CI regression gates
+Multi-agent detectors
+Sandbox env: Unlimited traces/mo

Talk to us

// questions

Frequently asked

How long does setup take?

Under 2 minutes. Install the SDK, add one decorator, and traces stream immediately. No infrastructure to manage.

What languages and providers are supported?

Python and TypeScript SDKs with auto-instrumentation for OpenAI, Anthropic, and Google Gemini. Works with any agent framework: LangChain, CrewAI, Vercel AI SDK, AutoGen, LlamaIndex.

How does fork & replay work?

Select any span in a trace, modify its input, and Retrace cascade-replays from that point forward. Context from the fork flows into subsequent LLM calls. You get a side-by-side diff with cost and latency deltas.
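
A toy version of cascade replay makes the idea concrete. The sketch below assumes a linear pipeline where each step's output is the next step's input; the step functions and `cascade_replay` are hypothetical, and real agent traces are usually richer than a straight chain.

```python
def cascade_replay(steps, fork_index, new_input):
    """Re-run steps from fork_index onward, feeding each output into the next step.

    Steps before fork_index are not executed again; the edited input replaces
    whatever the forked step originally received.
    """
    value = new_input
    replayed = []
    for step in steps[fork_index:]:
        value = step(value)
        replayed.append((step.__name__, value))
    return replayed


# Hypothetical three-step agent used only to exercise the sketch.
def classify_intent(text):
    return "question"


def fetch_context(query):
    return f"context for {query!r}"


def generate_response(context):
    return f"answer grounded in {context}"


PIPELINE = [classify_intent, fetch_context, generate_response]
```

Calling `cascade_replay(PIPELINE, 1, "grounded prompt")` re-runs only `fetch_context` and `generate_response`, with the edited input flowing into both.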

What are guardrails?

Runtime policies that monitor your agent in real-time. Set cost budgets, loop detection, context overflow limits, or latency caps. When a limit is crossed, the agent receives a HALT command, so a runaway loop or budget blow-out stops at the limit instead of running up your bill.
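
As a rough picture of how such a policy behaves, the sketch below checks a cost budget and a repeat limit before each step and raises instead of letting the step run. `Guardrail`, `Halt`, and the idea of comparing call signatures to spot a loop are assumptions for illustration, not Retrace's API or detection logic.

```python
class Halt(Exception):
    """Raised when a guardrail limit is crossed."""


class Guardrail:
    def __init__(self, max_cost_usd, max_repeats):
        self.max_cost_usd = max_cost_usd
        self.max_repeats = max_repeats
        self.spent = 0.0
        self.last_call = None
        self.repeats = 0

    def check(self, call_signature, cost_usd):
        """Call before each step; raises Halt so the step never executes."""
        # Count consecutive identical calls as a simple loop signal.
        if call_signature == self.last_call:
            self.repeats += 1
        else:
            self.last_call, self.repeats = call_signature, 1
        if self.repeats > self.max_repeats:
            raise Halt(f"loop x{self.repeats} detected")
        if self.spent + cost_usd > self.max_cost_usd:
            raise Halt("cost budget exceeded")
        self.spent += cost_usd
```

The check runs before the spend, so the run stops at the limit rather than one call past it.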

Is my data secure?

TLS in transit, encrypted at rest. API keys are SHA-256 hashed. PII auto-redaction runs on every plan as a security baseline. Tenant isolation is enforced at the application layer. Every query is scoped per user and backed by a guardrail regression test.
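
Hashing an API key means the service stores only a digest and compares digests at request time. The sketch below shows that pattern with Python's standard library; the helper names are hypothetical, and details such as salting or key prefixes are not described on this page, so none are assumed.

```python
import hashlib
import hmac


def hash_api_key(api_key):
    """Return the SHA-256 hex digest that would be stored instead of the key."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_api_key(presented_key, stored_hash):
    """Compare digests in constant time to avoid leaking timing information."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)
```

With this scheme a leaked database row does not reveal the original key.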

Can I use this in CI/CD?

Yes. The eval gate endpoint (POST /evaluations/:id/gate) returns pass/fail against a threshold. The CLI command `retrace eval gate` exits with code 1 on failure, perfect for GitHub Actions.

How is this different from LangSmith?

LangSmith focuses on tracing and observability. Retrace adds interactive fork & cascade-replay from any step, runtime guardrails that halt runaway agents, groundedness detection, and prove-the-fix verification.

Does it work with multi-agent systems?

Yes. Each span carries an agent id, sessions group multi-turn conversations, and an agent topology graph shows cross-agent ordering and inter-agent failure modes.
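
To show how an agent id on each span is enough to recover cross-agent ordering, the sketch below derives hand-off edges from a list of spans. The field names `agent_id` and `start` and the `topology_edges` helper are hypothetical; the real span schema may differ.

```python
def topology_edges(spans):
    """Derive ordered agent-to-agent hand-offs from spans sorted by start time."""
    edges = []
    previous = None
    for span in sorted(spans, key=lambda s: s["start"]):
        agent = span["agent_id"]
        # A change of agent between consecutive spans counts as a hand-off.
        if previous is not None and agent != previous and (previous, agent) not in edges:
            edges.append((previous, agent))
        previous = agent
    return edges
```

For spans from a planner, then a researcher, then a writer, this yields the chain planner → researcher → writer.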

See what your agent
actually did.

Your agent failed 3 steps before the error surfaced. Fork from the real cause instead of the symptom.

Start recording for free · Documentation →

No credit card · 2-min setup
