
Vision & Audio

Multi-modal capabilities including image analysis, image generation, audio transcription, and text-to-speech.

Overview

Cogitator provides four multi-modal tool factories for working with images and audio. Each factory returns a standard tool that can be given to any agent.

import {
  createAnalyzeImageTool,
  createGenerateImageTool,
  createTranscribeAudioTool,
  createGenerateSpeechTool,
} from '@cogitator-ai/core';

Image Analysis

createAnalyzeImageTool creates a tool that sends images to a vision-capable model for analysis. It supports both URL and base64 image inputs.

import { Cogitator, Agent, createAnalyzeImageTool, OpenAIBackend } from '@cogitator-ai/core';

const openai = new OpenAIBackend({ apiKey: process.env.OPENAI_API_KEY! });

const analyzeImage = createAnalyzeImageTool({
  llm: openai,
  defaultModel: 'gpt-4o',
});

const agent = new Agent({
  name: 'vision-assistant',
  model: 'openai/gpt-4o',
  instructions: 'You can analyze images. Use the analyzeImage tool when given an image URL.',
  tools: [analyzeImage],
});

Parameters

image (string | { data, mimeType }): URL or base64-encoded image data
prompt (string): Question or instruction about the image
detail ('auto' | 'low' | 'high'): Analysis detail level (high uses more tokens)
model (string): Override the vision model

Inline Images via RunOptions

You can also pass images directly through RunOptions, without going through the analyzeImage tool:

const result = await cog.run(agent, {
  input: 'What do you see in this image?',
  images: [
    'https://example.com/photo.jpg',
    {
      data: base64EncodedImage,
      mimeType: 'image/png',
    },
  ],
});

The images array accepts URLs as strings or objects with data (base64) and mimeType fields. Supported MIME types: image/jpeg, image/png, image/gif, image/webp.
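
For local files, a small helper can wrap raw bytes in the { data, mimeType } shape the images array accepts. This is a sketch, not part of the Cogitator API; toBase64Image is a hypothetical name:

```typescript
type ImageMimeType = 'image/jpeg' | 'image/png' | 'image/gif' | 'image/webp';

// Hypothetical helper: wrap raw image bytes as the { data, mimeType }
// object accepted by the images array in RunOptions.
function toBase64Image(bytes: Buffer, mimeType: ImageMimeType) {
  return { data: bytes.toString('base64'), mimeType };
}

// Typical use in Node:
// import { readFileSync } from 'node:fs';
// const photo = toBase64Image(readFileSync('photo.png'), 'image/png');
// await cog.run(agent, { input: 'Describe this.', images: [photo] });
```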

Image Generation

createGenerateImageTool creates a tool that generates images using DALL-E 3 via the OpenAI API.

const generateImage = createGenerateImageTool({
  apiKey: process.env.OPENAI_API_KEY!,
});

const agent = new Agent({
  name: 'artist',
  model: 'openai/gpt-4o',
  instructions: 'You create images based on user descriptions. Use the generateImage tool.',
  tools: [generateImage],
});

Parameters

prompt (string): Detailed description of the image to generate
size ('1024x1024' | '1792x1024' | '1024x1792'): Image dimensions
quality ('standard' | 'hd'): HD produces finer textures and detail
style ('vivid' | 'natural'): Vivid for dramatic, natural for realistic

Result

{
  url: 'https://oaidalleapiprodscus.blob.core.windows.net/...',
  revisedPrompt: 'A detailed painting of...',
  size: '1024x1024',
  quality: 'standard',
  style: 'vivid',
}
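
Since only the three sizes listed above are supported, a helper can map a desired aspect ratio to the nearest option before invoking the tool. This is a hypothetical utility, not part of the Cogitator API, and the ratio cutoffs are arbitrary choices:

```typescript
type DalleSize = '1024x1024' | '1792x1024' | '1024x1792';

// Hypothetical helper: pick the closest supported DALL-E 3 size for a
// desired aspect ratio. Cutoffs (1.3 and 0.77) are illustrative only.
function pickSize(width: number, height: number): DalleSize {
  const ratio = width / height;
  if (ratio > 1.3) return '1792x1024'; // landscape
  if (ratio < 0.77) return '1024x1792'; // portrait
  return '1024x1024'; // roughly square
}
```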

Audio Transcription

createTranscribeAudioTool creates a tool that transcribes audio to text using OpenAI Whisper or the GPT-4o transcribe models.

const transcribe = createTranscribeAudioTool({
  apiKey: process.env.OPENAI_API_KEY!,
  defaultModel: 'whisper-1',
  defaultLanguage: 'en',
});

const agent = new Agent({
  name: 'transcriber',
  model: 'openai/gpt-4o',
  instructions: 'Transcribe audio files. Use the transcribeAudio tool.',
  tools: [transcribe],
});

Parameters

audio (string | { data, format }): URL or base64-encoded audio data
language (string): ISO-639-1 code ('en', 'es', 'fr', 'ja', etc.)
model ('whisper-1' | 'gpt-4o-transcribe' | 'gpt-4o-mini-transcribe'): Transcription model
timestamps (boolean): Include word-level timestamps (whisper-1 only)

Supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac (up to 25MB).

Result

{
  text: 'The transcribed content of the audio file...',
  language: 'en',
  duration: 42.5,
  words: [
    { word: 'The', start: 0.0, end: 0.1 },
    { word: 'transcribed', start: 0.1, end: 0.5 },
    // ...
  ],
}

Inline Audio via RunOptions

Audio files can be passed directly to the run:

const result = await cog.run(agent, {
  input: 'Summarize what was said in this recording.',
  audio: ['https://example.com/meeting.mp3', { data: base64Audio, format: 'wav' }],
});
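
Local audio can be wrapped in the { data, format } shape the same way as images. A sketch, assuming the formats and 25MB limit listed above; toInlineAudio is a hypothetical name, not a Cogitator export:

```typescript
type AudioFormat =
  | 'mp3' | 'mp4' | 'mpeg' | 'mpga' | 'm4a'
  | 'wav' | 'webm' | 'ogg' | 'flac';

// Hypothetical helper: wrap raw audio bytes as the { data, format }
// object accepted by the audio array in RunOptions, enforcing the
// 25MB limit noted above.
function toInlineAudio(bytes: Buffer, format: AudioFormat) {
  if (bytes.byteLength > 25 * 1024 * 1024) {
    throw new Error('audio exceeds the 25MB limit');
  }
  return { data: bytes.toString('base64'), format };
}

// Typical use in Node:
// import { readFileSync } from 'node:fs';
// const clip = toInlineAudio(readFileSync('meeting.wav'), 'wav');
// await cog.run(agent, { input: 'Summarize this.', audio: [clip] });
```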

Text-to-Speech

createGenerateSpeechTool creates a tool that converts text to natural-sounding speech using the OpenAI TTS API.

const speak = createGenerateSpeechTool({
  apiKey: process.env.OPENAI_API_KEY!,
  defaultModel: 'tts-1',
  defaultVoice: 'nova',
  defaultFormat: 'mp3',
});

const agent = new Agent({
  name: 'narrator',
  model: 'openai/gpt-4o',
  instructions: 'Read text aloud using the generateSpeech tool.',
  tools: [speak],
});

Parameters

text (string): Text to convert (max 4096 characters)
voice (string): Voice selection (see below)
model ('tts-1' | 'tts-1-hd' | 'gpt-4o-mini-tts'): TTS model
speed (number): Playback speed, 0.25 to 4.0 (default: 1.0)
format ('mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm'): Output format

Available Voices

alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse, marin, cedar

Result

{
  audioBase64: 'AAAA...',  // base64-encoded audio
  format: 'mp3',
  voice: 'nova',
  model: 'tts-1',
  textLength: 256,
}
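
Because the tool returns base64 rather than raw bytes, the result usually needs decoding before playback or storage. A minimal sketch based on the result shape above; decodeSpeech is a hypothetical name:

```typescript
// Hypothetical helper: decode the generateSpeech result (field names
// follow the result shape above) into raw bytes plus a filename that
// matches the output format.
function decodeSpeech(result: { audioBase64: string; format: string }) {
  return {
    bytes: Buffer.from(result.audioBase64, 'base64'),
    filename: `speech.${result.format}`,
  };
}

// Typical use in Node:
// import { writeFile } from 'node:fs/promises';
// const { bytes, filename } = decodeSpeech(toolResult);
// await writeFile(filename, bytes);
```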

Utilities

Cogitator exports helpers for working with image and audio data:

import {
  fetchImageAsBase64,
  fetchAudioAsBuffer,
  audioInputToBuffer,
  isValidAudioFormat,
  getAudioMimeType,
} from '@cogitator-ai/core';

const image = await fetchImageAsBase64('https://example.com/photo.jpg');
// { data: 'base64...', mimeType: 'image/jpeg' }

const audio = await fetchAudioAsBuffer('https://example.com/audio.mp3');
// { buffer: ArrayBuffer, filename: 'audio.mp3' }
