Replay Optimization Engine

When you fork a trace, Retrace applies multiple optimization layers to minimize token cost and latency. Combined, these reduce replay cost by an estimated 70-90% compared to naive re-execution.

[!NOTE] The savings percentages on this page (including the 70-90% replay cost and 60-80% storage figures) are estimates from Retrace's cost model, not measured per-account guarantees — actual savings depend on your workload. Provider-side KV-cache configs are computed and surfaced in estimates but are not yet sent on the live generation call, so KV-cache savings are projected, not realized.

Differential Replay

For spans before the fork point, Retrace injects the recorded outputs directly, so those steps need no LLM calls.

Original: [Step 1] → [Step 2] → [Step 3] → [Step 4] → [Step 5]
                                    ↑ Fork here
Replay:   [Inject]   [Inject]   [Re-execute] → [Optimize] → [Optimize]
           $0          $0         Full cost      Cached/Compressed
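The diagram above can be sketched in a few lines of Python. This is an illustrative model only, not Retrace's implementation: the `Span` type, the `replay` function, and the `execute` callback are hypothetical names introduced here, and the cache and compression layers that apply after the fork are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Span:
    step: int             # 1-based position in the original trace
    input: str            # prompt or tool input recorded for this step
    recorded_output: str  # output captured during the original run


def replay(
    spans: List[Span], fork_step: int, execute: Callable[[Span], str]
) -> Tuple[List[str], int]:
    """Replay a trace, re-executing only from fork_step onward.

    Returns the outputs for every step and the number of live calls made.
    """
    outputs: List[str] = []
    live_calls = 0
    for span in spans:
        if span.step < fork_step:
            # Pre-fork span: inject the recorded output, no LLM call.
            outputs.append(span.recorded_output)
        else:
            # Fork point and everything after it runs again.
            outputs.append(execute(span))
            live_calls += 1
    return outputs, live_calls
```

With five spans and a fork at step 3, only three calls reach `execute`; the first two outputs come straight from the recording.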

Semantic Caching

Before every LLM call during replay, Retrace checks a two-tier cache:

Exact match (Redis) — O(1) lookup by content hash of input messages
Semantic match (pgvector) — Cosine similarity ≥ 0.92 threshold

Cache entries track causal dependencies. When a fork modifies a step, only downstream dependent entries are invalidated.
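To make the lookup order concrete, here is a minimal in-memory sketch of the two tiers. It rests on several assumptions: a plain dict stands in for Redis, a list scan stands in for pgvector, embeddings are supplied by the caller, and the `TwoTierCache` class and its methods are hypothetical names. Only the 0.92 threshold and the exact-then-semantic order come from the description above. Dependency-aware invalidation is not modelled.

```python
import hashlib
import json
import math

SIMILARITY_THRESHOLD = 0.92  # cosine similarity cutoff quoted above


def content_hash(messages):
    """Stable hash of the input messages (stand-in for the Redis key)."""
    payload = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self):
        self.exact = {}     # content hash -> response
        self.semantic = []  # list of (embedding, response)

    def put(self, messages, embedding, response):
        self.exact[content_hash(messages)] = response
        self.semantic.append((embedding, response))

    def get(self, messages, embedding):
        # Tier 1: exact match by content hash.
        hit = self.exact.get(content_hash(messages))
        if hit is not None:
            return "exact", hit
        # Tier 2: nearest stored embedding, accepted only above the threshold.
        best = max(self.semantic, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= SIMILARITY_THRESHOLD:
            return "semantic", best[1]
        return "miss", None
```

The exact tier is tried first because it is cheaper and cannot return a response for a merely similar prompt; the semantic tier only runs on an exact miss.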

Context Compression

Analyzes message history to remove low-relevance segments:

Recency bias — Recent messages always preserved
Reference detection — Messages referenced in output score higher
Decision influence — Instructions/constraints score higher
Strategies — aggressive (40-50% reduction), moderate (20-30%), conservative (10-15%)
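The signals above could combine roughly as in the sketch below. This is a toy heuristic, not Retrace's scoring model: the weights, the word-overlap test for "referenced in output", the whitespace token count, and the names `score` and `compress` are all assumptions made for illustration. The per-strategy targets are simply the midpoints of the ranges listed above.

```python
# Midpoints of the documented reduction ranges (assumed targets).
TARGET_REDUCTION = {"aggressive": 0.45, "moderate": 0.25, "conservative": 0.125}


def token_count(message):
    # Crude stand-in for a tokenizer: whitespace-separated words.
    return len(message["content"].split())


def score(message, output):
    s = 0.0
    if message["role"] == "system":
        s += 1.0  # decision influence: instructions and constraints
    words = [w.lower() for w in message["content"].split() if len(w) > 4]
    if any(w in output.lower() for w in words):
        s += 0.5  # reference detection: content echoed in the output
    return s


def compress(messages, output, strategy="moderate", keep_recent=2):
    """Drop the lowest-scoring older messages within the strategy's budget."""
    budget = TARGET_REDUCTION[strategy] * sum(token_count(m) for m in messages)
    # Recency bias: the last keep_recent messages are never candidates.
    candidates = sorted(
        range(max(len(messages) - keep_recent, 0)),
        key=lambda i: score(messages[i], output),
    )
    dropped, removed = set(), 0
    for i in candidates:
        size = token_count(messages[i])
        if removed + size <= budget:
            dropped.add(i)
            removed += size
    return [m for i, m in enumerate(messages) if i not in dropped]
```

Because removal stops at the budget, a conservative strategy may drop nothing at all on a short history.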

Structured Generation

Detects JSON/schema outputs and pre-populates structural tokens:

// Original output: {"name": "Alice", "age": 30, "role": "engineer"}
// Skeleton:        {"name": "__VAR_0__", "age": __VAR_1__, "role": "__VAR_2__"}
// LLM generates:   Alice|||SEPARATOR|||30|||SEPARATOR|||engineer
// Savings:         ~35% fewer output tokens
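The reassembly step implied by the example can be sketched as follows. The placeholder and separator strings are taken from the example above; the `fill_skeleton` function and the regular expression are hypothetical, and the sketch assumes values never contain the separator.

```python
import json
import re

SEPARATOR = "|||SEPARATOR|||"
# Quoted placeholders hold strings; bare placeholders hold raw JSON literals.
PLACEHOLDER = re.compile(r'"__VAR_(\d+)__"|__VAR_(\d+)__')


def fill_skeleton(skeleton: str, generated: str) -> dict:
    """Merge the model's separator-joined values back into the skeleton."""
    values = generated.split(SEPARATOR)

    def substitute(match):
        if match.group(1) is not None:
            # String slot: re-quote and escape the value as JSON.
            return json.dumps(values[int(match.group(1))])
        # Bare slot (number, boolean, null): insert the literal as-is.
        return values[int(match.group(2))]

    return json.loads(PLACEHOLDER.sub(substitute, skeleton))


skeleton = '{"name": "__VAR_0__", "age": __VAR_1__, "role": "__VAR_2__"}'
generated = "Alice|||SEPARATOR|||30|||SEPARATOR|||engineer"
result = fill_skeleton(skeleton, generated)
```

Parsing the filled string with `json.loads` doubles as a validity check: a malformed value raises instead of silently producing a broken object.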

Cost Estimator

Before executing a fork, get a cost prediction:

curl https://api.retraceai.tech/api/v1/forks/:id/estimate \
  -H "x-retrace-key: rt_live_..."

Response:

{
  "naive_cost": 0.0045,
  "optimized_cost": 0.0012,
  "savings_percent": 73,
  "human_readable": "This fork will cost ~$0.0012 (vs $0.0045 without optimization) — 73% savings",
  "breakdown": {
    "pre_fork_spans": { "count": 3, "cost": 0, "savings": "differential replay" },
    "optimizations": { "cache_hits": 1, "compression_savings": 0.0008 }
  }
}
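A client might use the estimate as a budget gate before running the fork. The sketch below reuses the endpoint, header, and field names from the example above; the helper names (`fetch_estimate`, `savings_percent`, `within_budget`) are hypothetical, and the rounding rule is an assumption that happens to reproduce the `savings_percent` value in the sample response.

```python
import json
import urllib.request

BASE_URL = "https://api.retraceai.tech/api/v1"


def fetch_estimate(fork_id: str, api_key: str) -> dict:
    """GET the cost estimate for a fork (same request as the curl example)."""
    request = urllib.request.Request(
        f"{BASE_URL}/forks/{fork_id}/estimate",
        headers={"x-retrace-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def savings_percent(naive_cost: float, optimized_cost: float) -> int:
    """Relative saving, rounded to a whole percent."""
    if naive_cost <= 0:
        return 0
    return round((naive_cost - optimized_cost) / naive_cost * 100)


def within_budget(estimate: dict, max_cost: float) -> bool:
    """True if the optimized cost does not exceed the caller's ceiling."""
    return estimate["optimized_cost"] <= max_cost
```

For the sample response, `savings_percent(0.0045, 0.0012)` gives 73, matching the reported field.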

Feature Flags

All optimizations are enabled by default. Disable individually via environment variables:

| Variable | Default | Effect |
| --- | --- | --- |
| `RETRACE_SEMANTIC_CACHE` | `true` | Two-tier response caching |
| `RETRACE_STRUCTURED_GEN` | `true` | Schema detection + variable-only generation |
| `RETRACE_CONTEXT_COMPRESSION` | `true` | Attention-based message pruning |
| `RETRACE_CONTENT_ADDRESSING` | `true` | SHA-256 dedup on ingestion |

Content-Addressable Store (CAES)

Every span's input/output is content-addressed on ingestion. Identical prompts across thousands of traces are stored once:

Storage reduction: 60-80% for workloads with shared system prompts
Automatic dedup — no configuration needed
Reference counting prevents premature cleanup
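The three properties above fit a small in-memory model. This is a sketch of the general technique under assumptions, not the store's actual code: the `ContentStore` class and its methods are hypothetical, and SHA-256 is used because the feature-flag table names it for ingestion dedup.

```python
import hashlib


class ContentStore:
    """Content-addressed blob store with reference counting."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> content
        self.refs = {}   # sha256 hex digest -> number of spans referencing it

    def put(self, content: str) -> str:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = content  # first time seen: store once
        self.refs[key] = self.refs.get(key, 0) + 1
        return key

    def get(self, key: str) -> str:
        return self.blobs[key]

    def release(self, key: str) -> None:
        # Delete the blob only when the last reference goes away.
        self.refs[key] -= 1
        if self.refs[key] == 0:
            del self.refs[key]
            del self.blobs[key]
```

Two spans that share a system prompt get the same key back from `put`, so the prompt is stored once and survives until both spans have released it.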

Prompt Optimization Suggestions

Analyze any trace for prompt improvement opportunities:

curl https://api.retraceai.tech/api/v1/traces/:id/prompt-suggestions \
  -H "x-retrace-key: rt_live_..."

Returns suggestions like:

"System prompt is 2000 tokens — could be compressed to 800 tokens"
"3 repeated message segments — remove duplicates to save 150 tokens"
"5 low-relevance messages (score < 0.2) — removing saves 400 tokens"
