// changelog

What's new.

1.24.1

Enforcement & gateway hardening

+Composite AND policies now short-circuit on tool-pattern scope, an out-of-scope call no longer depletes the policy's token/USD/step budget, so a later in-scope call trips exactly when its own usage warrants
+USD budget counters pipeline their INCRBYFLOAT + EXPIRE atomically (parity with the integer counters), no window where a USD key could be left without a TTL
+The enforcement gateway returns a generic 502 on an unreachable provider (the upstream error detail is logged server-side, not echoed to the caller)
+Docs: clarified that gateway streaming calls enforce token budgets pre-call (via max_tokens) only, and that gateway trace_created is metered after the provider responds

1.24.0

Onboarding & growth

+Guided quickstart, the empty dashboard now walks you from API key → install → env var → first trace, with copy-paste Python/TypeScript snippets and a live "waiting for your first trace…" state that flips the moment one arrives
+One-click demo workspace, load sample traces, a detection, and a tape into your own project to explore every feature before shipping a span; clearly badged, one-click removable, and never counted toward your usage
+Trending tapes, a public /tapes/trending gallery (cached, public-only) plus "forked N times" social proof on shared tapes; fork any public tape to replay it in your own account
+Clearer empty states across the app explaining what each feature does and the one action to populate it
+Activation funnel, first-party milestone tracking (signup → SDK installed → first trace → detection → replay → tape shared) with an internal admin view; no third-party analytics

1.23.0

Starter plan + pricing refresh

+New Starter plan ($29/mo · $290/yr), 10,000 traces/mo, 30-day retention, 100 fork replays, 25 prove-the-fix runs, and Cassette VCR deterministic replay. The cheapest way to get the replay workflow
+Enterprise is now a self-serve annual plan ($24,000/yr · billed annually), no monthly Enterprise product
+Pricing, landing, and billing pages now render every plan from a single shared source of truth, so quotas shown always match what the API enforces
+Existing Pro/Teams subscribers are grandfathered, invoices always reflect the amount actually charged, regardless of list-price changes for new customers
+Billing fix: every Dodo product now maps to the correct plan (a Starter purchase grants Starter, never Pro); unmapped products alert the team instead of silently defaulting

1.22.0

Reliability Platform

+Enforcement (circuit breakers), budget ceilings, loop breakers, debounce windows, and tool-pattern action policies that block or hold an agent action BEFORE it runs (not just flag it after). Server-side decision engine on atomic, fail-closed Valkey counters, a low-latency /enforcement/check pre-call gate, policy versioning + a hold-for-approval queue, and a new Enforcement page
+SDK pre-call gate (Python + TypeScript), local step/token/USD-per-run ceilings enforced offline with zero network, optional server-policy consult, and a typed RetraceEnforcementError that stops the run instead of silently skipping the call
+MAST failure taxonomy, failed traces are auto-classified into the 14 MAST failure modes (an LLM judge metered separately from your AI quota, idempotent per trace), surfaced as detections and an Insights → Failure Taxonomy breakdown
+Prove the fix, re-run a trace (fork → replay → first-divergence diff → judge verdict) to confirm a fix improved/regressed/unchanged; hypothesis testing via typed substitutions (swap model/prompt/tool-output), a dashboard Verify-fix button, a `retrace traces verify-fix` CLI gate, and an AI-widget tool
+Multi-agent traces, record which agent produced each span (Python `retrace.agent(...)`, TypeScript `withAgent`, OTel gen_ai semconv), see the agent topology graph on multi-agent traces, and catch agent ping-pong loops (all plans) plus reasoning-action mismatch (FM-2.6) and task derailment (FM-2.3) detectors (Pro+)

1.0.0

Advanced AI Agent Analytics Engine

+Deep Fork Replay, context injection flows fork output into subsequent prompts, tool output mocking, batch sweeps with multiple variants
+Runtime Guardrails, cost budgets, loop detection, context overflow, latency budgets with halt/alert/throttle actions enforced in real-time via WebSocket
+Sessions, group multi-turn conversations by session_id, execution DAG graph endpoints with causal ordering
+Eval CI/CD Gate, POST /evaluations/:id/gate returns pass/fail with regression detection against baselines
+Multi-Agent Tracing, agent_id on spans, vector clocks for causal ordering, cross-agent interaction graphs
+Hallucination Detection, tiered pipeline using KL divergence, grounding scores, and entropy analysis on every LLM call
+Critical Path Analysis, Brandes betweenness centrality identifies decision bottlenecks in trace DAGs
+Adaptive Guardrails, LinUCB contextual bandit learns optimal thresholds per trace-type from trigger history
+Trace Similarity, Sinkhorn optimal transport (Wasserstein distance) finds similar failures across all traces
+Embedding-Influence Scores, per-span heuristic ranking from embedding similarity to the final span (experimental; a similarity proxy, not causal inference)
+LTL Safety Verification, formal property checking (always/never/eventually/precedes) on traces
+Spectral Gap Heuristic, Laplacian eigenvalue gap of the agent interaction graph compared to a fixed baseline (experimental; a threshold heuristic, not learned anomaly detection)
+Fork Preferences & RLHF Export, record preferred fork outcomes, export as DPO training pairs
+Fork-Position Baseline, estimates fork success/cost from fork POSITION via inverse-distance weighting over past forks (experimental; a position baseline — not counterfactual prediction, and not nearest-k)
+Delta Compression, structural hashing + diff encoding for efficient trace storage
+Probabilistic Detection, Count-Min Sketch, HyperLogLog, Bloom filters in Valkey for real-time anomaly detection
+Fork State Checkpointing, progress persisted in Valkey, resumes from last completed span on retry
+28 Prometheus metrics + 16 alerts + 20-panel Grafana dashboard
+GitHub App integration, eval gate results posted on PRs, connect from Settings
+CLI v1.1.0, retrace sessions, retrace guardrails commands
+SDKs v0.3.1, session_id, agent_id, halt handler

0.6.0

Git Branching for AI Agent Execution

+Full Cascade Replay, fork at any span, SDK re-executes entire agent with modified input (not just one call)
+SDK Resume Protocol, dedicated listener thread receives 'resume' commands via WebSocket in real-time
+Ingestion Queue, BullMQ buffered writes with backpressure, 3 retries, exponential backoff
+Event Bus Architecture, trace.started, span.created, trace.ended events decouple ingestion from processing
+AI Gateway, single entry point for all AI calls with usage enforcement and cost tracking
+PII Redaction Layer, auto-detects emails, phones, SSNs, credit cards, API keys, JWTs in span data
+Audit Logging, typed actions (trace.created, fork.replayed, api_key.revoked) with IP tracking
+Rule Engine, extracted eval rule processing into standalone service with deduplication
+Trace Replay Player, play/pause, 0.5x-4x speed controls, auto-advance through spans like a video
+Feedback Export, GET /api/v1/feedback/export for RLHF training data (input/output/score)
+Hallucination Scoring, real vocabulary grounding check replaces placeholder
+Per-Route Rate Limiting, factory function for AI (20/min), search (50/min), export (5/min)
+SDK Offline Buffer, stores up to 1000 messages when disconnected, flushes on reconnect
+HTTP Retry, 3 attempts with exponential backoff on fallback transport
+Composite Indexes, (traceId, startedAt) and (evaluationId, traceId) for query performance
+Eval Rules UI, full CRUD page with threshold, webhook, email notification configuration
+Governance, toxic output detection, real hallucination rate calculation
+GitHub Actions upgraded to upload-artifact@v7 / download-artifact@v8 (Node 24 compatible)

0.5.0

AI Quality Intelligence, Self-Hosted Infra & SDK Sampling

+Prompt Versioning, registry with version history, rollback, and diff (POST/GET /api/v1/prompts)
+Evaluation Datasets, curate golden traces into benchmark suites for regression testing
+Drift Detection, statistical comparison of recent vs baseline metrics (cost, tokens, duration, error rate)
+Failure Clustering, automatic grouping of failed traces by normalized error patterns
+Human Feedback Loop, thumbs up/down on traces for training data curation
+Trace Sampling, configurable sample_rate (0.0-1.0) in both Python and TypeScript SDKs
+Eval Rules Automation UI, create/manage threshold alerts with webhook and email notifications
+Eval Rules Executor, filter by project/model, threshold comparison, webhook POST, email alerts
+Plan-aware eval judging, enterprise gets gemini-2.5-pro, pro/teams gets gemini-2.5-flash
+Billing usage dashboard, progress bars showing traces, tapes, forks, AI requests with limits
+Featured tapes section on marketing homepage
+Distributed event pipeline, Kafka producer/consumer for async span ingestion
+Idempotent processing, duplicate spans from SDK retries safely ignored

0.3.0

Multi-Provider Auto-Instrumentation & Team Collaboration

+OpenAI auto-instrumentation, GPT-5.5, GPT-5.4, GPT-5, GPT-4.1 series captured automatically
+Anthropic auto-instrumentation, Claude Opus 4.7, Sonnet 4.6, Haiku 4.5 captured automatically
+Gemini auto-instrumentation, Gemini 3 Pro, 3 Flash, 2.5 Pro, 2.5 Flash captured automatically
+Public tape gallery, explore trending, newest, and featured tapes at /explore
+Team collaboration, comment on spans, assign traces to teammates, notifications
+Continuous evaluations, auto-run evals on matching traces with webhook alerts
+Resumable forks, mark traces as resumable for full agent re-execution from any checkpoint
+Free tier expanded from 500 to 5,000 traces/month
+MIT license, core platform is now open source
+Security hardening, 35 audit findings resolved across all severity levels

0.2.0

Fork Engine, OpenTelemetry & Evaluations

+Fork execution engine, replay LLM calls with modified inputs and side-by-side diff
+OpenTelemetry OTLP/JSON ingestion endpoint for framework-agnostic tracing
+LLM-as-judge evaluation system with custom criteria and batch runs
+Real-time WebSocket streaming for live trace updates in the dashboard
+Semantic search across all traces and spans with vector similarity
+Migrated to Dodo Payments for subscription billing
+Trace deletion with full cascade cleanup
+Fork comparison UI with divergence scoring

0.1.0

Public Beta Launch

+Core trace recording with Python and TypeScript SDKs
+Interactive tape player for step-by-step execution replay
+Fork from any span and rerun with modified inputs
+Shareable tape links with OG images, no login required to view
+Persistent agent memory with semantic search and auto-extraction
+Skills system for reusable, composable agent capabilities
+MCP server integration for Claude Code and Cursor
+WebSocket streaming for real-time span ingestion from SDKs

// changelog

What's new.

1.24.1

Enforcement & gateway hardening

+Composite AND policies now short-circuit on tool-pattern scope, an out-of-scope call no longer depletes the policy's token/USD/step budget, so a later in-scope call trips exactly when its own usage warrants
+USD budget counters pipeline their INCRBYFLOAT + EXPIRE atomically (parity with the integer counters), no window where a USD key could be left without a TTL
+The enforcement gateway returns a generic 502 on an unreachable provider (the upstream error detail is logged server-side, not echoed to the caller)
+Docs: clarified that gateway streaming calls enforce token budgets pre-call (via max_tokens) only, and that gateway trace_created is metered after the provider responds

1.24.0

Onboarding & growth

+Guided quickstart, the empty dashboard now walks you from API key → install → env var → first trace, with copy-paste Python/TypeScript snippets and a live "waiting for your first trace…" state that flips the moment one arrives
+One-click demo workspace, load sample traces, a detection, and a tape into your own project to explore every feature before shipping a span; clearly badged, one-click removable, and never counted toward your usage
+Trending tapes, a public /tapes/trending gallery (cached, public-only) plus "forked N times" social proof on shared tapes; fork any public tape to replay it in your own account
+Clearer empty states across the app explaining what each feature does and the one action to populate it
+Activation funnel, first-party milestone tracking (signup → SDK installed → first trace → detection → replay → tape shared) with an internal admin view; no third-party analytics

1.23.0

Starter plan + pricing refresh

+New Starter plan ($29/mo · $290/yr), 10,000 traces/mo, 30-day retention, 100 fork replays, 25 prove-the-fix runs, and Cassette VCR deterministic replay. The cheapest way to get the replay workflow
+Enterprise is now a self-serve annual plan ($24,000/yr · billed annually), no monthly Enterprise product
+Pricing, landing, and billing pages now render every plan from a single shared source of truth, so quotas shown always match what the API enforces
+Existing Pro/Teams subscribers are grandfathered, invoices always reflect the amount actually charged, regardless of list-price changes for new customers
+Billing fix: every Dodo product now maps to the correct plan (a Starter purchase grants Starter, never Pro); unmapped products alert the team instead of silently defaulting

1.22.0

Reliability Platform

+Enforcement (circuit breakers), budget ceilings, loop breakers, debounce windows, and tool-pattern action policies that block or hold an agent action BEFORE it runs (not just flag it after). Server-side decision engine on atomic, fail-closed Valkey counters, a low-latency /enforcement/check pre-call gate, policy versioning + a hold-for-approval queue, and a new Enforcement page
+SDK pre-call gate (Python + TypeScript), local step/token/USD-per-run ceilings enforced offline with zero network, optional server-policy consult, and a typed RetraceEnforcementError that stops the run instead of silently skipping the call
+MAST failure taxonomy, failed traces are auto-classified into the 14 MAST failure modes (an LLM judge metered separately from your AI quota, idempotent per trace), surfaced as detections and an Insights → Failure Taxonomy breakdown
+Prove the fix, re-run a trace (fork → replay → first-divergence diff → judge verdict) to confirm a fix improved/regressed/unchanged; hypothesis testing via typed substitutions (swap model/prompt/tool-output), a dashboard Verify-fix button, a `retrace traces verify-fix` CLI gate, and an AI-widget tool
+Multi-agent traces, record which agent produced each span (Python `retrace.agent(...)`, TypeScript `withAgent`, OTel gen_ai semconv), see the agent topology graph on multi-agent traces, and catch agent ping-pong loops (all plans) plus reasoning-action mismatch (FM-2.6) and task derailment (FM-2.3) detectors (Pro+)

1.0.0

Advanced AI Agent Analytics Engine

+Deep Fork Replay, context injection flows fork output into subsequent prompts, tool output mocking, batch sweeps with multiple variants
+Runtime Guardrails, cost budgets, loop detection, context overflow, latency budgets with halt/alert/throttle actions enforced in real-time via WebSocket
+Sessions, group multi-turn conversations by session_id, execution DAG graph endpoints with causal ordering
+Eval CI/CD Gate, POST /evaluations/:id/gate returns pass/fail with regression detection against baselines
+Multi-Agent Tracing, agent_id on spans, vector clocks for causal ordering, cross-agent interaction graphs
+Hallucination Detection, tiered pipeline using KL divergence, grounding scores, and entropy analysis on every LLM call
+Critical Path Analysis, Brandes betweenness centrality identifies decision bottlenecks in trace DAGs
+Adaptive Guardrails, LinUCB contextual bandit learns optimal thresholds per trace-type from trigger history
+Trace Similarity, Sinkhorn optimal transport (Wasserstein distance) finds similar failures across all traces
+Embedding-Influence Scores, per-span heuristic ranking from embedding similarity to the final span (experimental; a similarity proxy, not causal inference)
+LTL Safety Verification, formal property checking (always/never/eventually/precedes) on traces
+Spectral Gap Heuristic, Laplacian eigenvalue gap of the agent interaction graph compared to a fixed baseline (experimental; a threshold heuristic, not learned anomaly detection)
+Fork Preferences & RLHF Export, record preferred fork outcomes, export as DPO training pairs
+Fork-Position Baseline, estimates fork success/cost from fork POSITION via inverse-distance weighting over past forks (experimental; a position baseline — not counterfactual prediction, and not nearest-k)
+Delta Compression, structural hashing + diff encoding for efficient trace storage
+Probabilistic Detection, Count-Min Sketch, HyperLogLog, Bloom filters in Valkey for real-time anomaly detection
+Fork State Checkpointing, progress persisted in Valkey, resumes from last completed span on retry
+28 Prometheus metrics + 16 alerts + 20-panel Grafana dashboard
+GitHub App integration, eval gate results posted on PRs, connect from Settings
+CLI v1.1.0, retrace sessions, retrace guardrails commands
+SDKs v0.3.1, session_id, agent_id, halt handler

0.6.0

Git Branching for AI Agent Execution

+Full Cascade Replay, fork at any span, SDK re-executes entire agent with modified input (not just one call)
+SDK Resume Protocol, dedicated listener thread receives 'resume' commands via WebSocket in real-time
+Ingestion Queue, BullMQ buffered writes with backpressure, 3 retries, exponential backoff
+Event Bus Architecture, trace.started, span.created, trace.ended events decouple ingestion from processing
+AI Gateway, single entry point for all AI calls with usage enforcement and cost tracking
+PII Redaction Layer, auto-detects emails, phones, SSNs, credit cards, API keys, JWTs in span data
+Audit Logging, typed actions (trace.created, fork.replayed, api_key.revoked) with IP tracking
+Rule Engine, extracted eval rule processing into standalone service with deduplication
+Trace Replay Player, play/pause, 0.5x-4x speed controls, auto-advance through spans like a video
+Feedback Export, GET /api/v1/feedback/export for RLHF training data (input/output/score)
+Hallucination Scoring, real vocabulary grounding check replaces placeholder
+Per-Route Rate Limiting, factory function for AI (20/min), search (50/min), export (5/min)
+SDK Offline Buffer, stores up to 1000 messages when disconnected, flushes on reconnect
+HTTP Retry, 3 attempts with exponential backoff on fallback transport
+Composite Indexes, (traceId, startedAt) and (evaluationId, traceId) for query performance
+Eval Rules UI, full CRUD page with threshold, webhook, email notification configuration
+Governance, toxic output detection, real hallucination rate calculation
+GitHub Actions upgraded to upload-artifact@v7 / download-artifact@v8 (Node 24 compatible)

0.5.0

AI Quality Intelligence, Self-Hosted Infra & SDK Sampling

+Prompt Versioning, registry with version history, rollback, and diff (POST/GET /api/v1/prompts)
+Evaluation Datasets, curate golden traces into benchmark suites for regression testing
+Drift Detection, statistical comparison of recent vs baseline metrics (cost, tokens, duration, error rate)
+Failure Clustering, automatic grouping of failed traces by normalized error patterns
+Human Feedback Loop, thumbs up/down on traces for training data curation
+Trace Sampling, configurable sample_rate (0.0-1.0) in both Python and TypeScript SDKs
+Eval Rules Automation UI, create/manage threshold alerts with webhook and email notifications
+Eval Rules Executor, filter by project/model, threshold comparison, webhook POST, email alerts
+Plan-aware eval judging, enterprise gets gemini-2.5-pro, pro/teams gets gemini-2.5-flash
+Billing usage dashboard, progress bars showing traces, tapes, forks, AI requests with limits
+Featured tapes section on marketing homepage
+Distributed event pipeline, Kafka producer/consumer for async span ingestion
+Idempotent processing, duplicate spans from SDK retries safely ignored

0.3.0

Multi-Provider Auto-Instrumentation & Team Collaboration

+OpenAI auto-instrumentation, GPT-5.5, GPT-5.4, GPT-5, GPT-4.1 series captured automatically
+Anthropic auto-instrumentation, Claude Opus 4.7, Sonnet 4.6, Haiku 4.5 captured automatically
+Gemini auto-instrumentation, Gemini 3 Pro, 3 Flash, 2.5 Pro, 2.5 Flash captured automatically
+Public tape gallery, explore trending, newest, and featured tapes at /explore
+Team collaboration, comment on spans, assign traces to teammates, notifications
+Continuous evaluations, auto-run evals on matching traces with webhook alerts
+Resumable forks, mark traces as resumable for full agent re-execution from any checkpoint
+Free tier expanded from 500 to 5,000 traces/month
+MIT license, core platform is now open source
+Security hardening, 35 audit findings resolved across all severity levels

0.2.0

Fork Engine, OpenTelemetry & Evaluations

+Fork execution engine, replay LLM calls with modified inputs and side-by-side diff
+OpenTelemetry OTLP/JSON ingestion endpoint for framework-agnostic tracing
+LLM-as-judge evaluation system with custom criteria and batch runs
+Real-time WebSocket streaming for live trace updates in the dashboard
+Semantic search across all traces and spans with vector similarity
+Migrated to Dodo Payments for subscription billing
+Trace deletion with full cascade cleanup
+Fork comparison UI with divergence scoring

0.1.0

Public Beta Launch

+Core trace recording with Python and TypeScript SDKs
+Interactive tape player for step-by-step execution replay
+Fork from any span and rerun with modified inputs
+Shareable tape links with OG images, no login required to view
+Persistent agent memory with semantic search and auto-extraction
+Skills system for reusable, composable agent capabilities
+MCP server integration for Claude Code and Cursor
+WebSocket streaming for real-time span ingestion from SDKs