# Assertions
Set pass/fail criteria with threshold checks, baseline regression detection, and custom assertion logic.
## Overview
Assertions run after all metrics are aggregated and produce pass/fail results. They're the quality gates that determine whether an eval run succeeded. Use them in CI to block deployments when quality drops.
```ts
interface AssertionResult {
  name: string;
  passed: boolean;
  message: string;
  actual?: number;
  expected?: number;
}
```

Assertions receive aggregated metric data for each metric (mean, median, p95, min, max, stdDev), plus overall suite stats (total cases, duration, cost).
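Concretely, the inputs an assertion inspects can be sketched like this. The field lists mirror the stats named above; the type names `AggregatedMetric` and `SuiteStats` are assumptions for illustration, not the library's exports:

```typescript
// Sketch of the data an assertion inspects (assumed type names).
interface AggregatedMetric {
  mean: number;
  median: number;
  p95: number;
  min: number;
  max: number;
  stdDev: number;
}

interface SuiteStats {
  total: number;    // number of eval cases
  duration: number; // total run time in ms
  cost: number;     // total cost in dollars
}

const aggregated: Record<string, AggregatedMetric> = {
  exactMatch: { mean: 0.92, median: 1, p95: 1, min: 0, max: 1, stdDev: 0.27 },
};
const stats: SuiteStats = { total: 50, duration: 42_000, cost: 0.35 };

// A threshold-style check over this data:
const passed = aggregated.exactMatch.mean >= 0.9 && stats.cost < 1.0;
```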
## threshold
The most common assertion. Checks that a metric's mean score meets a minimum value. For latency and cost metrics (detected by name, see Direction Detection below), the check is inverted: the actual value must be at or below the threshold.
```ts
import { threshold } from '@cogitator-ai/evals';

const suite = new EvalSuite({
  dataset,
  target: { fn },
  metrics: [exactMatch(), contains()],
  assertions: [
    threshold('exactMatch', 0.9),   // mean exactMatch >= 0.9
    threshold('contains', 0.95),    // mean contains >= 0.95
    threshold('latency.p95', 5000), // p95 latency <= 5000ms
    threshold('cost.mean', 0.01),   // mean cost <= $0.01
  ],
});
```

### Dot-path Access
You can access nested aggregation fields with dot notation:
| Path | Resolves to |
|---|---|
| `exactMatch` | `aggregated.exactMatch.mean` |
| `exactMatch.p95` | `aggregated.exactMatch.p95` |
| `latency.p99` | `aggregated.latency.p99` |
| `cost.median` | `aggregated.cost.median` |
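A minimal sketch of how this lookup could work, assuming a bare metric name defaults to its mean and a dotted path selects a specific field (`resolvePath` is a hypothetical helper, not part of the library's API):

```typescript
// Hypothetical dot-path lookup: 'exactMatch' -> exactMatch.mean,
// 'latency.p99' -> latency.p99.
type AggregatedMetric = Record<string, number>; // mean, median, p95, ...

function resolvePath(
  aggregated: Record<string, AggregatedMetric>,
  path: string
): number | undefined {
  const [metric, field = 'mean'] = path.split('.');
  return aggregated[metric]?.[field];
}

const aggregated = {
  exactMatch: { mean: 0.92, p95: 1.0 },
  latency: { mean: 1200, p99: 4800 },
};

resolvePath(aggregated, 'exactMatch');  // → 0.92 (defaults to mean)
resolvePath(aggregated, 'latency.p99'); // → 4800
```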
### Direction Detection
Metrics whose names start with `latency` or `cost`, or end with `Duration` or `Latency`, are treated as "lower is better": the assertion passes when the actual value is at or below the threshold. All other metrics are "higher is better".
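The heuristic above can be sketched as a small function (a sketch of the documented rule, not the library's internal code):

```typescript
// "Lower is better" if the name starts with latency/cost
// or ends with Duration/Latency, per the rule above.
function lowerIsBetter(name: string): boolean {
  return (
    name.startsWith('latency') ||
    name.startsWith('cost') ||
    name.endsWith('Duration') ||
    name.endsWith('Latency')
  );
}

function thresholdPasses(name: string, actual: number, expected: number): boolean {
  return lowerIsBetter(name) ? actual <= expected : actual >= expected;
}

thresholdPasses('exactMatch', 0.92, 0.9);   // true: higher is better, 0.92 >= 0.9
thresholdPasses('latency.p95', 4200, 5000); // true: lower is better, 4200 <= 5000
```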
## noRegression
Compares current results against a saved baseline file. Fails if any metric regresses beyond a tolerance percentage.
### Step 1: Save a baseline
Run your eval suite and save the baseline:
```ts
const result = await suite.run();
result.saveBaseline('./baseline.json');
```

This writes a JSON file mapping metric names to their mean scores:

```json
{
  "exactMatch": 0.92,
  "contains": 0.97
}
```

### Step 2: Assert no regression
In subsequent runs, compare against the baseline:
```ts
import { noRegression } from '@cogitator-ai/evals';

const suite = new EvalSuite({
  dataset,
  target: { fn },
  metrics: [exactMatch(), contains()],
  assertions: [
    noRegression('./baseline.json'),
    noRegression('./baseline.json', { tolerance: 0.1 }), // 10% tolerance
  ],
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `tolerance` | `number` | `0.05` | Allowed regression as a fraction (0.05 = 5%) |
For "higher is better" metrics, the assertion fails if `actual < baseline * (1 - tolerance)`. For "lower is better" metrics (latency, cost), it fails if `actual > baseline * (1 + tolerance)`.
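The tolerance math above can be sketched as follows (a sketch of the documented formulas, not the library's implementation):

```typescript
// A metric regresses when it moves past the baseline by more than
// the tolerance, in the direction that matters for that metric.
function regressed(
  actual: number,
  baseline: number,
  lowerIsBetter: boolean,
  tolerance = 0.05
): boolean {
  return lowerIsBetter
    ? actual > baseline * (1 + tolerance)
    : actual < baseline * (1 - tolerance);
}

regressed(0.90, 0.92, false);     // false: within 5% of the 0.92 baseline
regressed(0.85, 0.92, false);     // true:  more than 5% below baseline
regressed(1300, 1200, true, 0.1); // false: latency within the 10% tolerance
```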
### Baseline Workflow
A typical CI workflow:
1. Establish a baseline on `main`: run evals, `saveBaseline('./baseline.json')`, commit the file
2. On each PR: run evals with `noRegression('./baseline.json')`
3. If the PR improves scores: update the baseline file
4. If the PR regresses: fix the issue or adjust the tolerance
## Custom Assertions
Build arbitrary pass/fail logic with the `assertion()` factory:
```ts
import { assertion } from '@cogitator-ai/evals';

const budgetCheck = assertion({
  name: 'budgetCheck',
  check: (_aggregated, stats) => stats.cost < 1.0,
  message: 'Total eval cost exceeded $1.00 budget',
});

const speedCheck = assertion({
  name: 'speedCheck',
  check: (aggregated) => {
    const latency = aggregated.latency;
    return latency ? latency.p95 < 3000 : true;
  },
  message: 'P95 latency exceeded 3 seconds',
});

const suite = new EvalSuite({
  dataset,
  target: { fn },
  metrics: [exactMatch()],
  statisticalMetrics: [latency()],
  assertions: [budgetCheck, speedCheck],
});
```

The `check` function receives:

- `aggregated`: `Record<string, AggregatedMetric>` with per-metric stats
- `stats`: `{ total: number, duration: number, cost: number }` for the full run

Return `true` for pass, `false` for fail. The optional `message` is shown on failure.
## Combining Assertions
Pass multiple assertions — all must pass for the suite to be considered successful:
```ts
assertions: [
  threshold('exactMatch', 0.9),
  threshold('relevance', 0.8),
  noRegression('./baseline.json'),
  budgetCheck,
]
```

The `report('ci')` reporter exits with code 1 if any assertion fails, making it straightforward to integrate with CI pipelines.