ZIPsZoo Proposals
ZIP-0408

Unified Multimodal Architecture (Jin)

Final

Jin -- a unified architecture processing vision, language, audio, and 3D within a single transformer, enabling cross-modal reasoning and generation

Type
Standards Track
Category
AI
Author
Zoo Labs Foundation
Created
2023-03-01
multimodaljinvision-languageaudiounified-transformercross-modal

ZIP-0408: Unified Multimodal Architecture (Jin)

Abstract

This proposal specifies Jin, a unified multimodal AI architecture that processes vision, language, audio, and 3D data through a single transformer backbone. Unlike pipeline approaches that chain separate vision and language models, Jin uses modality-specific encoders feeding into a shared cross-attention transformer that reasons natively across all modalities simultaneously. Jin is the architectural foundation for the Zen multimodal model family (Zen-VL, Zen-Omni, Zen-Live) and the broader Hanzo AI infrastructure.

Motivation

ZIP-0406 (Multi-Modal Conservation AI) identified the need for cross-modal reasoning but relied on separate encoder pipelines with a late fusion layer. This approach has fundamental limitations:

  1. Information bottleneck: Each modality is compressed independently before fusion, losing cross-modal correlations
  2. Sequential processing: Vision must complete before language can reference visual features
  3. Scaling: Adding a new modality requires retraining the fusion layer
  4. Generation: The pipeline can classify but cannot generate across modalities (e.g., describe an image, generate an image from text)

Jin solves these by treating all modalities as token sequences processed by a single transformer:

  • Images become visual token sequences via a ViT encoder
  • Audio becomes acoustic token sequences via a Whisper-derived encoder
  • Text remains as standard token sequences
  • 3D scenes become spatial token sequences via a point cloud encoder
  • All token types share the same attention mechanism and positional encoding

Specification

Architecture

                    Shared Transformer Backbone
                    (N layers, cross-attention)
                           β”‚
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚              β”‚              β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”΄β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Vision Tokens β”‚ β”‚ Text  β”‚ β”‚ Audio Tokens  β”‚
    β”‚ (ViT encoder) β”‚ β”‚Tokens β”‚ β”‚ (Whisper enc.)β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚              β”‚              β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”΄β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Image/Video   β”‚ β”‚ Text  β”‚ β”‚ Audio/Speech  β”‚
    β”‚ Input         β”‚ β”‚ Input β”‚ β”‚ Input         β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Design Decisions

  1. Modality tokens share the same embedding space: Visual tokens, text tokens, and audio tokens are all projected into the same d-dimensional space before entering the transformer.

  2. Interleaved attention: Unlike approaches that process each modality separately then concatenate, Jin interleaves tokens from all modalities in a single sequence, allowing cross-modal attention from the first layer.

  3. Modality-specific heads: Output heads are modality-specific (text generation, image generation, audio synthesis) but share the same backbone representations.

  4. Dynamic resolution: Vision inputs can be any resolution (variable number of visual tokens). Audio inputs can be any duration. The transformer handles variable-length mixed-modality sequences natively.

Model Scale

VariantParametersContextModalities
Jin-Nano1.5B32KVision + Language
Jin-Base7B128KVision + Language + Audio
Jin-Pro72B256KAll (Vision + Language + Audio + 3D)
Jin-Max480B1MAll + Generation

Training

  1. Stage 1: Modality alignment -- train encoders to produce compatible token representations using paired data (image-text, audio-text)
  2. Stage 2: Joint pre-training -- train the full model on interleaved multimodal web data
  3. Stage 3: Instruction tuning -- fine-tune on multimodal instruction-following tasks
  4. Stage 4: Domain specialization -- conservation, medical, code, etc.

Research Papers

Implementation

  • hanzo/jin: Production Jin multimodal framework (Python, PyTorch)
  • hanzo/candle: Rust inference engine for Jin models
  • hanzo/llm: LLM Gateway serving Jin/Zen multimodal models

Timeline

  • Originated: March 2023 (Jin architecture design)
  • Research: zen-multimodal-architecture published 2024, zen-vl_whitepaper published 2024
  • Implementation: Jin framework deployed 2023, Zen-VL and Zen-Omni models 2024-2025