Datasets
Load evaluation test cases from JSONL files, CSV files, or build them programmatically with filtering, sampling, and shuffling.
Overview
A Dataset is an immutable collection of EvalCase objects. Each case has an input string and optional expected, context, and metadata fields. Datasets are validated with Zod on construction — invalid cases throw immediately.
interface EvalCase {
  input: string;
  expected?: string;
  context?: Record<string, unknown>;
  metadata?: Record<string, unknown>;
}

Programmatic Construction
Build a dataset from an array of cases:
import { Dataset } from '@cogitator-ai/evals';
const dataset = Dataset.from([
  { input: 'What is 2+2?', expected: '4' },
  { input: 'Capital of France?', expected: 'Paris' },
  {
    input: 'Summarize this document',
    context: { document: 'Long text here...' },
    metadata: { category: 'summarization' },
  },
]);

console.log(dataset.length); // 3

The expected field is optional — some metrics (like LLM-as-judge) don't need it.
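The immediate-validation behavior described above can be sketched as a plain function. `validateEvalCase` is a hypothetical helper for illustration — the library itself uses a Zod schema, not this code:

```typescript
// Sketch of the shape checks a Dataset constructor might run per case.
// validateEvalCase is a hypothetical helper, not part of @cogitator-ai/evals.
interface EvalCase {
  input: string;
  expected?: string;
  context?: Record<string, unknown>;
  metadata?: Record<string, unknown>;
}

function validateEvalCase(raw: unknown, index: number): EvalCase {
  if (typeof raw !== 'object' || raw === null) {
    throw new Error(`case ${index}: expected an object`);
  }
  const c = raw as Record<string, unknown>;
  if (typeof c.input !== 'string') {
    throw new Error(`case ${index}: "input" must be a string`);
  }
  if (c.expected !== undefined && typeof c.expected !== 'string') {
    throw new Error(`case ${index}: "expected" must be a string`);
  }
  return c as unknown as EvalCase;
}
```

Failing fast like this means a bad case is reported at construction time, with its index, rather than surfacing mid-run.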
JSONL Loading
Load cases from a JSONL file where each line is a JSON object:
const dataset = await Dataset.fromJsonl('./data/eval-cases.jsonl');

Example JSONL file:
{"input": "What is 2+2?", "expected": "4"}
{"input": "Capital of France?", "expected": "Paris"}
{"input": "Translate hello to Spanish", "expected": "hola"}

Each line is validated against the EvalCase schema. Invalid JSON or a missing input field throws an error that includes the offending line number.
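The per-line validation described above can be sketched as follows. `parseJsonlSketch` is a hypothetical helper operating on file contents (file reading omitted), not the library's implementation:

```typescript
// Sketch of line-by-line JSONL validation with 1-based line numbers
// in error messages, in the spirit of Dataset.fromJsonl.
// parseJsonlSketch is a hypothetical helper, not the library's API.
function parseJsonlSketch(text: string): { input: string; expected?: string }[] {
  const cases: { input: string; expected?: string }[] = [];
  text.split('\n').forEach((raw, i) => {
    const line = raw.trim();
    if (line === '') return; // tolerate blank lines
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`line ${i + 1}: invalid JSON`);
    }
    const obj = parsed as Record<string, unknown>;
    if (typeof obj.input !== 'string') {
      throw new Error(`line ${i + 1}: missing "input" field`);
    }
    cases.push(obj as { input: string; expected?: string });
  });
  return cases;
}
```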
You can also use the standalone loader:
import { loadJsonl } from '@cogitator-ai/evals';
const cases = await loadJsonl('./data/eval-cases.jsonl');
const dataset = Dataset.from(cases);

CSV Loading
Load cases from a CSV file. Requires papaparse as a peer dependency.
pnpm add papaparse

const dataset = await Dataset.fromCsv('./data/eval-cases.csv');

Example CSV file:
input,expected,metadata.category,context.topic
What is 2+2?,4,math,arithmetic
Capital of France?,Paris,geography,europe

The CSV must have an input column. The expected column is optional. Columns prefixed with metadata. are extracted into the metadata object, and columns prefixed with context. go into the context object.
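The prefix-based column mapping can be sketched as a pure function over a parsed row. `rowToCase` is a hypothetical helper for illustration, not the library's code:

```typescript
// Sketch of mapping one parsed CSV row to an eval case:
// plain columns stay top-level, while `metadata.` and `context.`
// prefixes nest into their respective objects.
// rowToCase is a hypothetical helper, not part of @cogitator-ai/evals.
type Case = {
  input: string;
  expected?: string;
  context?: Record<string, unknown>;
  metadata?: Record<string, unknown>;
};

function rowToCase(row: Record<string, string>): Case {
  const c: Case = { input: row.input };
  if (row.expected !== undefined) c.expected = row.expected;
  for (const [key, value] of Object.entries(row)) {
    if (key.startsWith('metadata.')) {
      (c.metadata ??= {})[key.slice('metadata.'.length)] = value;
    } else if (key.startsWith('context.')) {
      (c.context ??= {})[key.slice('context.'.length)] = value;
    }
  }
  return c;
}
```

With the example CSV above, the `metadata.category` column becomes `metadata: { category: ... }` and `context.topic` becomes `context: { topic: ... }`.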
Standalone loader:
import { loadCsv } from '@cogitator-ai/evals';
const cases = await loadCsv('./data/eval-cases.csv');

Filtering
Create a subset of cases matching a predicate. Returns a new Dataset — the original is unchanged.
const mathOnly = dataset.filter((c) => c.metadata?.category === 'math');
const withExpected = dataset.filter((c) => c.expected !== undefined);

Sampling
Randomly sample n cases from the dataset. Useful for quick smoke tests on large datasets.
const sample = dataset.sample(10);
console.log(sample.length); // 10 (or fewer if the dataset is smaller)

If n exceeds the dataset size, all cases are returned (shuffled).
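One way to get this behavior is to shuffle a copy with a Fisher–Yates pass and take the first n, which naturally caps at the dataset size. `sampleCases` is a hypothetical standalone sketch, not the library's implementation:

```typescript
// Sketch of sample(n): Fisher-Yates shuffle a copy of the cases,
// then slice the first n. slice() caps at the array length, so
// n >= length returns every case in shuffled order.
// sampleCases is a hypothetical helper, not part of @cogitator-ai/evals.
function sampleCases<T>(cases: readonly T[], n: number): T[] {
  const copy = [...cases]; // leave the original untouched
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, n);
}
```

Copying before shuffling matches the immutability contract: the source dataset's order is never mutated.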
Shuffling
Randomize the order of cases. Returns a new Dataset.
const shuffled = dataset.shuffle();

Iteration
Datasets are iterable and expose a cases property:
for (const evalCase of dataset) {
  console.log(evalCase.input);
}

const allCases = dataset.cases; // readonly EvalCase[]

Chaining
Filter, sample, and shuffle are chainable:
const subset = dataset
  .filter((c) => c.metadata?.difficulty === 'hard')
  .shuffle()
  .sample(50);

Evaluation Framework
Systematically evaluate LLM agents with @cogitator-ai/evals — dataset-driven testing, deterministic and LLM-as-judge metrics, assertions, A/B comparison, and CI-ready reporters.
Metrics
Score agent responses with deterministic checks, LLM-as-judge evaluation, statistical aggregation, and custom metric functions.