Evaluations
Run LLM-as-judge evaluations on your agent traces.
Evaluations
Define scoring criteria and automatically judge your agent traces using an LLM.
Creating an Evaluation
curl -X POST https://api.retraceai.tech/api/v1/evaluations \
-H "x-retrace-key: rt_live_..." \
-H "Content-Type: application/json" \
-d '{
"name": "Agent Quality",
"criteria": [
{"name": "accuracy", "description": "Factually correct?", "weight": 1.0},
{"name": "helpfulness", "description": "Addresses the question?", "weight": 0.8}
],
"judge_model": "gemini-2.5-flash"
}'Running Evaluations
curl -X POST https://api.retraceai.tech/api/v1/evaluations/{id}/run \
-H "x-retrace-key: rt_live_..." \
-H "Content-Type: application/json" \
-d '{"trace_ids": ["trace-1", "trace-2"]}'How It Works
- Retrace summarizes your trace (spans, inputs, outputs, errors)
- The judge LLM scores each criterion from 0.0 to 1.0
- A weighted average produces the overall score
- The judge provides textual feedback
[!TIP] Use gemini-2.5-flash for fast, cheap evaluations. Use gemini-2.5-pro for high-stakes production evals. Enterprise plans automatically use gemini-2.5-pro.
Automation Rules
Set up rules to auto-evaluate traces and alert when quality drops:
curl -X POST https://api.retraceai.tech/api/v1/eval-rules \
-H "x-retrace-key: rt_live_..." \
-H "Content-Type: application/json" \
-d '{
"evaluation_id": "eval-uuid",
"name": "Quality gate",
"threshold": 0.7,
"webhook_url": "https://hooks.slack.com/...",
"notify_email": true,
"filter": {"projectId": "proj-uuid"}
}'When a trace completes:
- Rules matching the trace's project/model are triggered
- The evaluation runs automatically
- If the score falls below the threshold, the webhook fires and email is sent
lastTriggeredAtis updated on the rule
Manage rules from the UI: Evaluations → [Evaluation] → Automation Rules.
Auto-Generated Eval Gates
Retrace analyzes your failure patterns and proposes eval gates automatically:
curl -X POST https://api.retraceai.tech/api/v1/evaluations/auto-generate \
-H "x-retrace-key: rt_live_..." \
-H "Content-Type: application/json" \
-d '{"max_proposals": 5, "auto_create": false}'Returns proposals like:
- "Null Output Guard: web_search" — Tool returns null/empty result
- "Timeout Guard: llm_call" — Step exceeds 30s threshold
- "Schema Guard: format_output" — Output fails JSON schema validation
Set auto_create: true to automatically create the proposed evaluations.
Batched Evaluation
Run evaluations across many traces in parallel with shared context:
curl -X POST https://api.retraceai.tech/api/v1/evaluations/:id/batch-run \
-H "x-retrace-key: rt_live_..." \
-H "Content-Type: application/json" \
-d '{"trace_ids": ["t1", "t2", "t3", ...], "concurrency": 5}'Batched runs share the system prompt prefix across parallel evaluations, amortizing prefill cost. Maximum 100 traces per batch, concurrency capped at 10.