Evals score your real traffic automatically. An eval is a rule that samples matching traces or spans after ingest and scores them, either with a fast, deterministic code check or with an LLM judge. Scores show up on the trace they came from, roll into the Overview pass-rate KPI, and can drive alerts.
Creating an eval
Evals are created in a short wizard:

- Target: what to run on. Choose the level: `trace` (score the whole run) or `span` (score individual steps). Optionally filter by agent, trace name, and, for span-level evals, span type and model.
- Check: what to verify. Pick a preset (below).
- Score: how to score. For LLM judges, pick a judge model; for parameterized checks, set the parameter (a substring, pattern, or max length). Set a sample rate (1%–100%) to control how much matching traffic is scored.
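
To make the three wizard steps concrete, here is a minimal sketch of how a rule with a target, a check, and a sample rate might be modeled. It is not the product's API: `EvalRule`, `matches`, `should_score`, the field names, and the dictionary shape of an ingested item are all hypothetical, and the random sampling strategy is an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalRule:
    """Hypothetical model of what the wizard collects."""
    level: str                        # "trace" or "span"
    preset: str                       # check to run, e.g. "valid_json" (made-up id)
    sample_rate: float                # 0.01 to 1.0, i.e. 1% to 100%
    agent: Optional[str] = None       # optional filters
    trace_name: Optional[str] = None
    span_type: Optional[str] = None   # span-level only
    model: Optional[str] = None       # span-level only

    def __post_init__(self) -> None:
        if self.level not in ("trace", "span"):
            raise ValueError("level must be 'trace' or 'span'")
        if not 0.01 <= self.sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0.01 and 1.0")
        if self.level == "trace" and (self.span_type or self.model):
            raise ValueError("span_type and model filters are span-level only")


def matches(rule: EvalRule, item: dict) -> bool:
    """True if the item is at the rule's level and passes every filter that is set."""
    if item.get("level") != rule.level:
        return False
    for name in ("agent", "trace_name", "span_type", "model"):
        wanted = getattr(rule, name)
        if wanted is not None and item.get(name) != wanted:
            return False
    return True


def should_score(rule: EvalRule, item: dict,
                 rng: Callable[[], float] = random.random) -> bool:
    """Sample matching traffic: score roughly sample_rate of the matching items."""
    return matches(rule, item) and rng() < rule.sample_rate
```

With a sample rate of `1.0`, every matching item is scored, because `random.random()` always returns a value below 1. Lower rates trade coverage for cost, which matters most for LLM judges.
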
Code checks
Code checks are deterministic and free, and they run with no external calls:

| Preset | Checks |
|---|---|
| No PII | Output is free of emails, phone numbers, SSNs, cards, IPs. |
| No secret leak | Output contains no API-key / token / private-key shapes. |
| Valid JSON | Output parses as JSON. |
| No refusal | Output isn’t a refusal. |
| Non-empty | Output isn’t empty. |
| Max length | Output is within a character budget. |
| Contains / Excludes text | Output does (or doesn’t) contain a substring. |
| Regex match | Output matches a pattern. |
| Tool args valid | (span-only) A tool call’s input is a valid JSON object. |
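
As an illustration of why these checks are cheap and deterministic, several of the presets in the table can be approximated with a few lines of standard-library Python. This is a sketch, not the product's implementation: the function names are hypothetical, and details such as treating whitespace-only output as empty or using a search rather than a full match for the regex are assumptions.

```python
import json
import re


def non_empty(output: str) -> bool:
    # Assumption: whitespace-only output counts as empty.
    return bool(output.strip())


def valid_json(output: str) -> bool:
    """Output parses as JSON (any JSON value, not only objects)."""
    try:
        json.loads(output)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return False
    return True


def max_length(output: str, limit: int) -> bool:
    """Output is within a character budget."""
    return len(output) <= limit


def contains_text(output: str, needle: str) -> bool:
    return needle in output


def excludes_text(output: str, needle: str) -> bool:
    return needle not in output


def regex_match(output: str, pattern: str) -> bool:
    # Assumption: a match anywhere in the output is enough.
    return re.search(pattern, output) is not None


def tool_args_valid(tool_input: str) -> bool:
    """A tool call's input parses as JSON and is an object, not a list or scalar."""
    try:
        parsed = json.loads(tool_input)
    except (TypeError, ValueError):
        return False
    return isinstance(parsed, dict)
```

Note the difference between the two JSON checks: `valid_json('[1, 2]')` passes because an array is valid JSON, while `tool_args_valid('[1, 2]')` fails because the input is not a JSON object.
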
LLM judges
Judges send the input/output to a model that returns a 1–5 score or a pass/fail verdict with a reason. Presets cover relevance, helpfulness, coherence, conciseness, instruction-following, completeness, toxicity/safety, tool selection, and RAG-oriented checks (faithfulness, context relevance, correctness vs. a reference).

LLM judges are bring-your-own-key. Add a provider key (below) before creating one. An eval with no usable key shows the status `needs key` and doesn’t score until a key is added. The set of available judge models is defined by the deployment.
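
The two judge output shapes and the key requirement described above could be handled along these lines. This is a rough sketch under stated assumptions: `JudgeResult`, `normalize`, and `eval_status` are hypothetical names, the pass threshold of 4 for 1–5 scores is an assumed default rather than documented behavior, and `active` is a made-up label for any status other than `needs key`.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class JudgeResult:
    passed: bool
    score: Optional[int]  # 1-5 for scored presets, None for pass/fail presets
    reason: str


def normalize(verdict: Union[bool, int], reason: str,
              pass_threshold: int = 4) -> JudgeResult:
    """Fold a pass/fail verdict or a 1-5 score into one result shape."""
    # bool is a subclass of int in Python, so test for it first.
    if isinstance(verdict, bool):
        return JudgeResult(passed=verdict, score=None, reason=reason)
    if not 1 <= verdict <= 5:
        raise ValueError("score must be between 1 and 5")
    # Assumption: scores at or above the threshold count as a pass.
    return JudgeResult(passed=verdict >= pass_threshold, score=verdict, reason=reason)


def eval_status(provider_keys: dict, provider: str) -> str:
    """An eval without a usable key for its provider reports 'needs key'."""
    return "active" if provider_keys.get(provider) else "needs key"
```

Keeping the reason alongside the score is what makes a judge result reviewable on the trace it came from.
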

