
Document Loaders

Load documents from text files, markdown, JSON, CSV, HTML, PDFs, and web URLs into the RAG pipeline.

Overview

Document loaders convert raw sources into RAGDocument objects that the pipeline can chunk and embed. Every loader implements the DocumentLoader interface:

interface DocumentLoader {
  load(source: string): Promise<RAGDocument[]>;
  readonly supportedTypes: string[];
}

The source parameter is a file path for file-based loaders, or a URL for WebLoader. TextLoader and MarkdownLoader also accept directory paths, in which case they load every matching file in the directory.
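Every loader returns an array of RAGDocument. The actual type is exported from @cogitator-ai/types; the sketch below is a plausible shape inferred from the loader outputs shown on this page (id, content, source, sourceType, plus optional metadata), not the library's definition:

```typescript
// Hypothetical shape of RAGDocument, inferred from the examples on this page
interface RAGDocument {
  id: string;
  content: string;
  source: string;
  sourceType: string;
  metadata?: Record<string, unknown>;
}

// A document as a TextLoader might produce it
const doc: RAGDocument = {
  id: 'doc-1',
  content: 'Some plain text.',
  source: './data/notes.txt',
  sourceType: 'text',
};
console.log(doc.sourceType); // 'text'
```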

TextLoader

Loads plain text files (.txt, .text). Accepts a file path or directory.

import { TextLoader } from '@cogitator-ai/rag';

const loader = new TextLoader();

const docs = await loader.load('./data/notes.txt');
const allDocs = await loader.load('./data/text-files/');

No options. Documents get sourceType: 'text'.

MarkdownLoader

Loads .md and .mdx files. Optionally strips YAML frontmatter and extracts it as metadata.

import { MarkdownLoader } from '@cogitator-ai/rag';

const loader = new MarkdownLoader({ stripFrontmatter: true });
const docs = await loader.load('./docs/');

console.log(docs[0].metadata);
// { title: 'Getting Started', description: 'Quick start guide' }

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| stripFrontmatter | boolean | false | Remove YAML frontmatter and parse it as metadata |
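To illustrate what stripFrontmatter does, here is a simplified sketch in plain TypeScript. It is not the loader's implementation (which likely uses a real YAML parser) and only handles flat `key: value` pairs:

```typescript
// Simplified sketch of frontmatter stripping; handles only flat `key: value` pairs
function stripFrontmatter(md: string): { content: string; metadata: Record<string, string> } {
  const match = /^---\n([\s\S]*?)\n---\n?/.exec(md);
  if (!match) return { content: md, metadata: {} };

  const metadata: Record<string, string> = {};
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) metadata[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { content: md.slice(match[0].length), metadata };
}

const md = '---\ntitle: Getting Started\n---\n# Welcome';
console.log(stripFrontmatter(md).metadata); // { title: 'Getting Started' }
```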

JSONLoader

Loads JSON files. Handles both single objects and arrays. Automatically detects content, text, or body fields, or you can specify a custom content field.

import { JSONLoader } from '@cogitator-ai/rag';

const loader = new JSONLoader({
  contentField: 'description',
  metadataFields: ['category', 'author'],
});

const docs = await loader.load('./data/articles.json');

If no content field is found, the entire object is serialized as the document content.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| contentField | string | auto | Field name to use as document content |
| metadataFields | string[] | (none) | Fields to extract as document metadata |

Auto-detected content fields (in order): content, text, body.
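The detection order can be sketched as a small helper. This illustrates the behavior described above; it is not the library's code:

```typescript
// Sketch of JSONLoader's content-field detection order (illustrative, not library code)
function pickContentField(
  obj: Record<string, unknown>,
  contentField?: string,
): string | undefined {
  const candidates = contentField ? [contentField] : ['content', 'text', 'body'];
  return candidates.find((field) => typeof obj[field] === 'string');
}

console.log(pickContentField({ text: 'hello', category: 'misc' })); // 'text'
console.log(pickContentField({ category: 'misc' })); // undefined (whole object would be serialized)
```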

CSVLoader

Loads CSV files. Each row becomes a separate document. Requires papaparse as a peer dependency.

pnpm add papaparse

import { CSVLoader } from '@cogitator-ai/rag';

const loader = new CSVLoader({
  contentColumn: 'description',
  metadataColumns: ['id', 'category'],
  delimiter: ',',
});

const docs = await loader.load('./data/products.csv');

If no contentColumn is specified, the first column is used.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| contentColumn | string | first column | Column to use as document content |
| metadataColumns | string[] | (none) | Columns to extract as metadata |
| delimiter | string | auto | CSV delimiter character |
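The row-to-document mapping can be sketched as follows; papaparse handles the actual parsing, so this hypothetical helper starts from already-split headers and rows:

```typescript
// Hypothetical sketch of CSVLoader's row-to-document mapping (parsing itself omitted)
function rowToDocument(
  headers: string[],
  row: string[],
  contentColumn?: string,
  metadataColumns: string[] = [],
): { content: string; metadata: Record<string, string> } {
  const column = contentColumn ?? headers[0]; // default: first column
  const valueOf = (name: string) => row[headers.indexOf(name)];

  const metadata: Record<string, string> = {};
  for (const name of metadataColumns) metadata[name] = valueOf(name);
  return { content: valueOf(column), metadata };
}

const headers = ['id', 'description', 'category'];
const row = ['42', 'A red widget', 'widgets'];
console.log(rowToDocument(headers, row, 'description', ['id', 'category']));
// { content: 'A red widget', metadata: { id: '42', category: 'widgets' } }
```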

HTMLLoader

Loads HTML files and extracts text content using CSS selectors. Requires cheerio as a peer dependency.

pnpm add cheerio

import { HTMLLoader } from '@cogitator-ai/rag';

const loader = new HTMLLoader({ selector: 'article' });
const docs = await loader.load('./pages/about.html');

console.log(docs[0].metadata);
// { title: 'About Us' }

The <title> tag is automatically extracted as metadata if present.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| selector | string | body | CSS selector for content extraction |
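As an illustration of the title extraction (the loader itself uses cheerio for this), here is a regex-based sketch; it only handles simple, well-formed title tags:

```typescript
// Regex sketch of <title> extraction; the real loader uses cheerio instead
function extractTitle(html: string): string | undefined {
  const match = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  return match?.[1].trim();
}

console.log(extractTitle('<html><head><title>About Us</title></head></html>')); // 'About Us'
console.log(extractTitle('<p>no title here</p>')); // undefined
```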

PDFLoader

Loads PDF files and extracts text. Optionally splits into one document per page. Requires pdf-parse as a peer dependency.

pnpm add pdf-parse

import { PDFLoader } from '@cogitator-ai/rag';

const loader = new PDFLoader();
const docs = await loader.load('./papers/research.pdf');
console.log(docs[0].metadata);
// { pages: 42 }

const perPage = new PDFLoader({ splitPages: true });
const pageDocs = await perPage.load('./papers/research.pdf');
console.log(pageDocs[0].metadata);
// { pageNumber: 1, totalPages: 42 }

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| splitPages | boolean | false | Create one document per page instead of a single document for the whole file |
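The splitPages behavior amounts to mapping extracted page texts to per-page documents; a hypothetical sketch:

```typescript
// Hypothetical sketch of splitPages: one document per page, with page metadata
function splitIntoPageDocs(pages: string[], source: string) {
  return pages.map((content, i) => ({
    content,
    source,
    sourceType: 'pdf',
    metadata: { pageNumber: i + 1, totalPages: pages.length },
  }));
}

const docs = splitIntoPageDocs(['page one text', 'page two text'], './papers/research.pdf');
console.log(docs[1].metadata); // { pageNumber: 2, totalPages: 2 }
```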

WebLoader

Fetches a URL and extracts text content. Uses HTMLLoader internally, so it requires cheerio.

pnpm add cheerio

import { WebLoader } from '@cogitator-ai/rag';

const loader = new WebLoader({
  selector: 'main',
  headers: { 'User-Agent': 'CogitatorBot/1.0' },
});

const docs = await loader.load('https://example.com/docs/getting-started');

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| selector | string | body | CSS selector for content extraction |
| headers | Record<string, string> | (none) | Custom HTTP headers for the request |
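Internally, WebLoader presumably fetches the page and hands the HTML to HTMLLoader. The fetch half of that flow can be sketched with the built-in fetch (Node 18+); this is an assumption about the flow, not the library source:

```typescript
// Sketch of the fetch step a WebLoader-like flow performs (hypothetical helper)
async function fetchPage(
  url: string,
  headers?: Record<string, string>,
): Promise<string> {
  const res = await fetch(url, { headers });
  if (!res.ok) throw new Error(`Failed to fetch ${url}: ${res.status}`);
  return res.text(); // raw HTML, ready to pass to a selector-based extractor
}
```

A loader built this way would then run the returned HTML through HTMLLoader-style extraction before chunking.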

Custom Loaders

Implement the DocumentLoader interface to load from any source:

import { nanoid } from 'nanoid';
import type { DocumentLoader, RAGDocument } from '@cogitator-ai/types';

class NotionLoader implements DocumentLoader {
  readonly supportedTypes = ['notion'];
  private apiKey: string;

  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }

  async load(source: string): Promise<RAGDocument[]> {
    const pageId = source;
    const content = await this.fetchNotionPage(pageId);

    return [{
      id: nanoid(),
      content,
      source: `notion://${pageId}`,
      sourceType: 'text',
    }];
  }

  private async fetchNotionPage(pageId: string): Promise<string> {
    // your Notion API logic here
    return '';
  }
}

Pass your custom loader to the pipeline builder:

const pipeline = new RAGPipelineBuilder()
  .withLoader(new NotionLoader(process.env.NOTION_API_KEY!))
  // ...
  .build();

await pipeline.ingest('page-id-here');
