Home · Platform · Architecture

Multimodal architecture

Payloads, fusion, OCR pipelines and timeline indexing — how VisionaryAI Suite moves from raw media to searchable multimodal intelligence.

Multimodal payloads

VisionaryAI Suite builds OpenAI-compatible vision payloads that include actual image frames alongside speech transcripts, OCR snippets and contextual metadata. The model sees pixels — not only pre-existing tags.

Payload assembly respects token budgets, frame caps and model capabilities. When vision payloads are unavailable, the suite falls back to text-based analysis with clear session signals.

Multi-signal fusion

BLIP, CLIP, OCR, speech, metadata and Vision LLM output are combined into coherent timeline intelligence. Fusion runs in the desktop pipeline with live logging so technical users can follow each stage.

OCR pipeline

On-screen and printed text extracted per segment, aligned to timecodes.

Speech layer

Whisper transcription with Smart Whisper profiles in Trial 1.5.2.

Visual embeddings

CLIP/BLIP signals complement Vision LLM narratives.

Timeline grounding

Visual understanding connects to precise timeline events — searchable multimodal moments across your library. Events persist in .vtag sidecars and feed Semantic Memory.

Vision Intelligence overview How it works