Evaluation Framework
Systematically evaluate LLM agents with @cogitator-ai/evals — dataset-driven testing, deterministic and LLM-as-judge metrics, assertions, A/B comparison, and CI-ready reporters.
Why Evaluate?
LLM outputs are non-deterministic. A prompt change that improves one response might degrade ten others. Without systematic evaluation, you're flying blind. @cogitator-ai/evals gives you a structured way to measure agent quality, catch regressions, and compare model configurations — all with statistical rigor.
Installation
pnpm add @cogitator-ai/evals
Optional peer dependency for CSV dataset loading:
pnpm add papaparse
Quick Start
import { Dataset, EvalSuite, exactMatch, contains, threshold } from '@cogitator-ai/evals';
const dataset = Dataset.from([
{ input: 'What is 2+2?', expected: '4' },
{ input: 'Capital of France?', expected: 'Paris' },
{ input: 'Largest planet?', expected: 'Jupiter' },
]);
const suite = new EvalSuite({
dataset,
target: {
fn: async (input) => {
// your agent or LLM call here
return 'some response';
},
},
metrics: [exactMatch(), contains()],
assertions: [threshold('exactMatch', 0.8)],
});
const result = await suite.run();
result.report('console');
Architecture
┌──────────┐ ┌───────────┐ ┌─────────┐ ┌────────────┐
│ Dataset │───▶│ EvalSuite │───▶│ Metrics │───▶│ Assertions │
└──────────┘ └───────────┘ └─────────┘ └────────────┘
│ │
▼ ▼
┌───────────┐ ┌────────────┐
│ Target │ │ Reporters │
└───────────┘    └────────────┘
Eval flow: Load test cases from a Dataset (JSONL, CSV, or programmatic) → EvalSuite runs each case against a Target (agent or function) with concurrency control → Metrics score each response → Scores are aggregated (mean, median, p95, etc.) → Assertions check thresholds and regressions → Reporters output results to console, JSON, CSV, or CI.
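The aggregation step in the flow above (mean, median, p95 over per-case scores) can be pictured with a small self-contained sketch. This is illustrative TypeScript, not the library's internal implementation, and the nearest-rank percentile method is an assumption:

```typescript
// Illustrative aggregation over per-case metric scores (0..1),
// mirroring the mean/median/p95 summary an eval suite reports.
function aggregate(scores: number[]): { mean: number; median: number; p95: number } {
  const sorted = [...scores].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, s) => sum + s, 0) / sorted.length;
  // nearest-rank percentile: the value at index ceil(p * n) - 1
  const rank = (p: number) =>
    sorted[Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1)];
  return { mean, median: rank(0.5), p95: rank(0.95) };
}

const summary = aggregate([1, 0, 1, 1, 0.5]);
console.log(summary); // { mean: 0.7, median: 1, p95: 1 }
```

In practice the suite computes these summaries for every metric, which is what makes threshold assertions like threshold('exactMatch', 0.8) possible.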
EvalSuite
The core class that orchestrates evaluation runs.
const suite = new EvalSuite({
dataset, // required — Dataset instance
target: { fn }, // required — what to evaluate
metrics: [...], // optional — per-case scoring functions
statisticalMetrics: [], // optional — aggregate metrics (latency, cost)
judge: { model: '...' }, // required for LLM-as-judge metrics
assertions: [...], // optional — pass/fail checks
concurrency: 5, // optional — parallel case execution
timeout: 30000, // optional — per-case timeout in ms
retries: 0, // optional — retry failed cases
onProgress: (p) => {}, // optional — progress callback
});
Target
The target defines what you're evaluating. Use fn for a simple function, or agent + cogitator for a full Cogitator agent:
// function target
const target = {
fn: async (input: string) => {
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: input }],
});
return response.choices[0].message.content!;
},
};
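Since exactMatch presumably compares the returned string strictly against expected (an assumption here), raw completions with stray whitespace, trailing punctuation, or different casing can score 0 even when the answer is right. A hedged sketch of a normalizing wrapper around a target function — normalize and withNormalization are illustrative helpers, not part of the package:

```typescript
// Illustrative normalization wrapper for a target fn, so cosmetic
// differences (case, whitespace, a trailing period) don't fail a
// strict exact-match comparison. `callModel` stands in for any LLM call.
const normalize = (s: string) => s.trim().replace(/\.$/, '').toLowerCase();

function withNormalization(callModel: (input: string) => Promise<string>) {
  return async (input: string) => normalize(await callModel(input));
}

// usage sketch with a stubbed model call
const fn = withNormalization(async () => '  Paris. ');
fn('Capital of France?').then((out) => console.log(out)); // "paris"
```

Remember to normalize the expected values in your dataset the same way, or the comparison will fail from the other side.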
// agent target
const target = {
agent: myAgent,
cogitator: runtime,
};
Results
suite.run() returns EvalSuiteResult with everything you need:
const result = await suite.run();
result.results; // per-case outputs and scores
result.aggregated; // { exactMatch: { mean, median, p95, ... }, ... }
result.assertions; // [{ name, passed, message }, ...]
result.stats; // { total, duration, cost }
result.report('console');
result.report(['json', 'csv'], { path: './eval-report' });
result.saveBaseline('./baseline.json');
EvalBuilder
Fluent builder API as an alternative to direct EvalSuite construction:
import { EvalBuilder, Dataset, exactMatch, contains, latency, threshold } from '@cogitator-ai/evals';
const suite = new EvalBuilder()
.withDataset(dataset)
.withTarget({ fn: myFn })
.withMetrics([exactMatch(), contains()])
.withStatisticalMetrics([latency()])
.withAssertions([threshold('exactMatch', 0.9)])
.withConcurrency(10)
.withTimeout(15000)
.build();
const result = await suite.run();
Next Steps
- Datasets — load test cases from JSONL, CSV, or build programmatically
- Metrics — deterministic, LLM-as-judge, statistical, and custom metrics
- Assertions — threshold checks, regression detection, custom assertions
- A/B Comparison — compare two targets with statistical significance testing
- Reporters — console, JSON, CSV, and CI output formats
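To make the baseline and regression-detection ideas above concrete, here is an illustrative comparison of aggregated metric means against a previously saved baseline. The shape of the data and the tolerance logic are assumptions for illustration, not the package's actual implementation:

```typescript
// Illustrative regression check: flag any metric whose current mean
// dropped below the baseline mean by more than a tolerance.
// This mirrors the idea behind saveBaseline + regression assertions.
type Aggregates = Record<string, { mean: number }>;

function detectRegressions(
  baseline: Aggregates,
  current: Aggregates,
  tolerance = 0.02,
): string[] {
  const regressions: string[] = [];
  for (const [metric, { mean }] of Object.entries(baseline)) {
    const now = current[metric]?.mean;
    if (now !== undefined && now < mean - tolerance) {
      regressions.push(`${metric}: ${mean} -> ${now}`);
    }
  }
  return regressions;
}

const flagged = detectRegressions(
  { exactMatch: { mean: 0.9 } },
  { exactMatch: { mean: 0.8 } },
);
console.log(flagged); // [ 'exactMatch: 0.9 -> 0.8' ]
```

A check like this fits naturally in CI: fail the build when the list is non-empty, and refresh the baseline only on deliberate, reviewed changes.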