Evaluations

Define scoring criteria and automatically judge your agent traces using an LLM.

Creating an Evaluation

curl -X POST https://api.retraceai.tech/api/v1/evaluations \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Agent Quality",
    "criteria": [
      {"name": "accuracy", "description": "Factually correct?", "weight": 1.0},
      {"name": "helpfulness", "description": "Addresses the question?", "weight": 0.8}
    ],
    "judge_model": "gemini-2.5-flash"
  }'
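The same request can be built from a script. The sketch below uses only the Python standard library and mirrors the curl call above; the `RETRACE_API_KEY` environment variable name is an assumption made for illustration, and the response format is not covered here, so the send step is left commented out.

```python
import json
import os
import urllib.request

API_URL = "https://api.retraceai.tech/api/v1/evaluations"

# RETRACE_API_KEY is a hypothetical variable name; the fallback is the
# placeholder key used in the curl example, not a working credential.
api_key = os.environ.get("RETRACE_API_KEY", "rt_live_...")

# Same body as the curl example above.
payload = {
    "name": "Agent Quality",
    "criteria": [
        {"name": "accuracy", "description": "Factually correct?", "weight": 1.0},
        {"name": "helpfulness", "description": "Addresses the question?", "weight": 0.8},
    ],
    "judge_model": "gemini-2.5-flash",
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"x-retrace-key": api_key, "Content-Type": "application/json"},
    method="POST",
)

# Uncomment to send the request with a real key.
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```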

Running Evaluations

curl -X POST https://api.retraceai.tech/api/v1/evaluations/{id}/run \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{"trace_ids": ["trace-1", "trace-2"]}'

How It Works

1. Retrace summarizes your trace (spans, inputs, outputs, errors).
2. The judge LLM scores each criterion from 0.0 to 1.0.
3. A weighted average produces the overall score.
4. The judge provides textual feedback.
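To make the weighted-average step concrete, here is a minimal sketch. It assumes the overall score is the sum of each criterion score times its weight, divided by the sum of the weights; the exact normalization Retrace applies is not stated here. The weights come from the "Agent Quality" example, and the per-criterion scores are made up for illustration.

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, each expected in [0.0, 1.0]."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    # Assumed normalization: divide by the total weight so the result stays in [0, 1].
    return weighted_sum / total_weight


# Weights from the "Agent Quality" example; scores are illustrative only.
weights = {"accuracy": 1.0, "helpfulness": 0.8}
scores = {"accuracy": 0.9, "helpfulness": 0.5}

# (0.9 * 1.0 + 0.5 * 0.8) / (1.0 + 0.8) = 1.3 / 1.8
print(round(overall_score(scores, weights), 3))  # 0.722
```

Under this assumption, a lower weight on helpfulness means a weak helpfulness score pulls the overall score down less than an equally weak accuracy score would.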

Tip: Use gemini-2.5-flash for fast, cheap evaluations. Use gemini-2.5-pro for high-stakes production evals. Enterprise plans automatically use gemini-2.5-pro.

Automation Rules

Set up rules to auto-evaluate traces and alert when quality drops:

curl -X POST https://api.retraceai.tech/api/v1/eval-rules \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "evaluation_id": "eval-uuid",
    "name": "Quality gate",
    "threshold": 0.7,
    "webhook_url": "https://hooks.slack.com/...",
    "notify_email": true,
    "filter": {"projectId": "proj-uuid"}
  }'

When a trace completes:

1. Rules matching the trace's project/model are triggered.
2. The evaluation runs automatically.
3. If the score falls below the threshold, the webhook fires and an email is sent.
4. lastTriggeredAt is updated on the rule.
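The threshold check in the steps above can be sketched as follows. This is an illustration of the described behavior, not Retrace's implementation: the `Rule` class and `actions_for` function are hypothetical names, and the sketch reads "falls below" as a strict comparison, so a score equal to the threshold does not alert.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical local model of a rule, using fields from the request above."""
    threshold: float
    webhook_url: Optional[str] = None
    notify_email: bool = False


def actions_for(rule: Rule, score: float) -> list[str]:
    """Return the notifications a rule would send for a given overall score."""
    if score >= rule.threshold:
        return []  # score did not fall below the threshold: nothing to send
    actions = []
    if rule.webhook_url:
        actions.append("webhook")
    if rule.notify_email:
        actions.append("email")
    return actions


rule = Rule(threshold=0.7, webhook_url="https://hooks.slack.com/...", notify_email=True)
print(actions_for(rule, 0.65))  # ['webhook', 'email']
print(actions_for(rule, 0.70))  # []
```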

Manage rules from the UI: Evaluations → [Evaluation] → Automation Rules.

Auto-Generated Eval Gates

Retrace analyzes your failure patterns and proposes eval gates automatically:

curl -X POST https://api.retraceai.tech/api/v1/evaluations/auto-generate \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{"max_proposals": 5, "auto_create": false}'

Returns proposals like:

"Null Output Guard: web_search" — Tool returns null/empty result
"Timeout Guard: llm_call" — Step exceeds 30s threshold
"Schema Guard: format_output" — Output fails JSON schema validation

Set auto_create: true to automatically create the proposed evaluations.

Batched Evaluation

Run evaluations across many traces in parallel with shared context:

curl -X POST https://api.retraceai.tech/api/v1/evaluations/{id}/batch-run \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{"trace_ids": ["t1", "t2", "t3", ...], "concurrency": 5}'

Batched runs share the system prompt prefix across parallel evaluations, amortizing prefill cost. Each batch accepts at most 100 traces, and concurrency is capped at 10.
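When you have more trace IDs than one batch allows, the client has to split them. The sketch below shows one way to do that, using the limits stated above (100 traces per batch, concurrency of at most 10); the `batch_payloads` helper is a hypothetical name, and it only builds request bodies without sending them.

```python
from typing import Iterator

MAX_BATCH_SIZE = 100   # documented maximum traces per batch
MAX_CONCURRENCY = 10   # documented concurrency cap


def batch_payloads(trace_ids: list[str], concurrency: int = 5) -> Iterator[dict]:
    """Yield one batch-run request body per chunk of at most 100 trace IDs."""
    # Clamp the requested concurrency into the range 1..10.
    concurrency = max(1, min(concurrency, MAX_CONCURRENCY))
    for start in range(0, len(trace_ids), MAX_BATCH_SIZE):
        yield {
            "trace_ids": trace_ids[start:start + MAX_BATCH_SIZE],
            "concurrency": concurrency,
        }


# 250 illustrative trace IDs with an over-limit concurrency request.
payloads = list(batch_payloads([f"t{i}" for i in range(250)], concurrency=16))
print([len(p["trace_ids"]) for p in payloads])  # [100, 100, 50]
print(payloads[0]["concurrency"])               # 10
```

Each yielded dictionary can be serialized as the body of a separate batch-run request.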
