Replay Optimizations
How Retrace reduces fork replay cost by 70-90% using semantic caching, differential replay, structured generation, and context compression.
Replay Optimization Engine
When you fork a trace, Retrace applies multiple optimization layers to minimize token cost and latency. Combined, these reduce replay cost by an estimated 70-90% compared to naive re-execution.
[!NOTE] The savings percentages on this page (including the 70-90% replay cost and 60-80% storage figures) are estimates from Retrace's cost model, not measured per-account guarantees — actual savings depend on your workload. Provider-side KV-cache configs are computed and surfaced in estimates but are not yet sent on the live generation call, so KV-cache savings are projected, not realized.
Differential Replay
Pre-fork spans inject their recorded outputs directly — no LLM calls needed for steps before the fork point.
Original: [Step 1] → [Step 2] → [Step 3] → [Step 4] → [Step 5]
↑ Fork here
Replay: [Inject] [Inject] [Re-execute] → [Optimize] → [Optimize]
$0 $0 Full cost Cached/CompressedSemantic Caching
Two-tier cache checks before every LLM call during replay:
- Exact match (Redis) — O(1) lookup by content hash of input messages
- Semantic match (pgvector) — Cosine similarity ≥ 0.92 threshold
Cache entries track causal dependencies. When a fork modifies a step, only downstream dependent entries are invalidated.
Context Compression
Analyzes message history to remove low-relevance segments:
- Recency bias — Recent messages always preserved
- Reference detection — Messages referenced in output score higher
- Decision influence — Instructions/constraints score higher
- Strategies —
aggressive(40-50% reduction),moderate(20-30%),conservative(10-15%)
Structured Generation
Detects JSON/schema outputs and pre-populates structural tokens:
// Original output: {"name": "Alice", "age": 30, "role": "engineer"}
// Skeleton: {"name": "__VAR_0__", "age": __VAR_1__, "role": "__VAR_2__"}
// LLM generates: Alice|||SEPARATOR|||30|||SEPARATOR|||engineer
// Savings: ~35% fewer output tokensCost Estimator
Before executing a fork, get a cost prediction:
curl https://api.retraceai.tech/api/v1/forks/:id/estimate \
-H "x-retrace-key: rt_live_..."Response:
{
"naive_cost": 0.0045,
"optimized_cost": 0.0012,
"savings_percent": 73,
"human_readable": "This fork will cost ~$0.0012 (vs $0.0045 without optimization) — 73% savings",
"breakdown": {
"pre_fork_spans": { "count": 3, "cost": 0, "savings": "differential replay" },
"optimizations": { "cache_hits": 1, "compression_savings": 0.0008 }
}
}Feature Flags
All optimizations are enabled by default. Disable individually via environment variables:
| Variable | Default | Effect |
|---|---|---|
RETRACE_SEMANTIC_CACHE | true | Two-tier response caching |
RETRACE_STRUCTURED_GEN | true | Schema detection + variable-only generation |
RETRACE_CONTEXT_COMPRESSION | true | Attention-based message pruning |
RETRACE_CONTENT_ADDRESSING | true | SHA-256 dedup on ingestion |
Content-Addressable Store (CAES)
Every span's input/output is content-addressed on ingestion. Identical prompts across thousands of traces are stored once:
- Storage reduction: 60-80% for workloads with shared system prompts
- Automatic dedup — no configuration needed
- Reference counting prevents premature cleanup
Prompt Optimization Suggestions
Analyze any trace for prompt improvement opportunities:
curl https://api.retraceai.tech/api/v1/traces/:id/prompt-suggestions \
-H "x-retrace-key: rt_live_..."Returns suggestions like:
- "System prompt is 2000 tokens — could be compressed to 800 tokens"
- "3 repeated message segments — remove duplicates to save 150 tokens"
- "5 low-relevance messages (score < 0.2) — removing saves 400 tokens"