Vision-Language Models (Zen-VL)
Zen-VL -- vision-language models that natively understand images, charts, documents, and video alongside text
ZIP-0416: Vision-Language Models (Zen-VL)
Abstract
This proposal specifies Zen-VL, the vision-language variant of the Zen model family. Zen-VL extends the Zen Base language model (ZIP-0413) with a vision encoder that enables native understanding of images, charts, diagrams, documents, screenshots, and video frames alongside text. The model uses the Jin unified architecture (ZIP-0408), in which visual tokens are interleaved with text tokens in a single attention pass, enabling fine-grained cross-modal reasoning.
Motivation
Conservation applications require visual understanding: camera trap images for species identification (ZIP-0406), satellite imagery for habitat monitoring, document understanding for research papers, and chart interpretation for conservation status reports. Zen-VL provides this capability as a production model deployable through the Hanzo LLM Gateway.
Specification
Architecture
Zen-VL extends Zen Base with:
- Vision Encoder: ViT-based encoder producing visual tokens at dynamic resolution
- Visual Adapter: MLP projecting vision tokens into the language model's embedding space
- Interleaved Attention: Visual and text tokens attend to each other in every transformer layer
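The data flow described above can be illustrated with a minimal pure-Python sketch: vision-encoder outputs pass through an MLP adapter and are then placed in the same sequence as text embeddings. The class and function names, the two-layer ReLU MLP, the toy dimensions, and the random weights are all assumptions made for illustration; the proposal does not specify the adapter's depth, activation, or sizes.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a stand-in for trained adapter parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class VisualAdapter:
    """Hypothetical two-layer MLP mapping vision features to the LM embedding width."""

    def __init__(self, vision_dim: int, hidden_dim: int, lm_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = random_matrix(hidden_dim, vision_dim, rng)
        self.w2 = random_matrix(lm_dim, hidden_dim, rng)

    def __call__(self, vision_token: Vector) -> Vector:
        hidden = [max(0.0, h) for h in matvec(self.w1, vision_token)]  # ReLU (assumed)
        return matvec(self.w2, hidden)


def interleave(
    segments: List[Tuple[str, List[Vector]]], adapter: VisualAdapter
) -> List[Vector]:
    """Build one sequence: text embeddings pass through, image tokens are projected."""
    sequence: List[Vector] = []
    for kind, tokens in segments:
        if kind == "text":
            sequence.extend(tokens)
        elif kind == "image":
            sequence.extend(adapter(t) for t in tokens)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


# Toy sizes, chosen only to keep the example readable.
VISION_DIM, HIDDEN_DIM, LM_DIM = 8, 16, 4
adapter = VisualAdapter(VISION_DIM, HIDDEN_DIM, LM_DIM)
segments = [
    ("text", [[0.0] * LM_DIM] * 3),       # 3 text token embeddings
    ("image", [[1.0] * VISION_DIM] * 5),  # 5 vision-encoder outputs
    ("text", [[0.0] * LM_DIM] * 2),       # 2 more text token embeddings
]
sequence = interleave(segments, adapter)
print(len(sequence), len(sequence[0]))  # 10 tokens, each of width LM_DIM
```

Once every token shares the language model's embedding width, the transformer layers can attend across the whole sequence without a separate cross-attention module, which is the property the interleaved design relies on.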
Dynamic Resolution
Unlike fixed-resolution approaches that resize all images to 224x224, Zen-VL keeps the native resolution of each image:
- Images are divided into tiles at their native resolution
- Each tile produces a fixed number of visual tokens
- Total visual tokens scale with image resolution
- High-res images (4K camera trap photos) retain full detail
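To make the scaling concrete, the following sketch counts visual tokens for an image tiled at its native resolution. The tile size (448 pixels) and tokens per tile (256) are placeholder assumptions, not values from this proposal, and edge tiles are assumed to be padded to full size.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TilingConfig:
    tile_size: int = 448        # assumed tile edge in pixels (not from the proposal)
    tokens_per_tile: int = 256  # assumed fixed token count per tile (not from the proposal)


def visual_token_count(width: int, height: int, cfg: TilingConfig = TilingConfig()) -> int:
    """Tokens for an image tiled at native resolution; partial edge tiles count as full tiles."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    tiles_x = math.ceil(width / cfg.tile_size)
    tiles_y = math.ceil(height / cfg.tile_size)
    return tiles_x * tiles_y * cfg.tokens_per_tile


for name, (w, h) in {"thumbnail": (224, 224), "4K frame": (3840, 2160)}.items():
    print(name, visual_token_count(w, h))
# thumbnail 256
# 4K frame 11520
```

Under these assumed values, a 4K frame occupies 9 x 5 = 45 tiles and 11,520 visual tokens, compared with 256 for a thumbnail. The real figures depend on the actual tile size and per-tile token count, but the trade-off is the same: detail is retained at the cost of a longer sequence.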
Training
- Stage 1 -- Alignment: Train visual adapter on 500M image-text pairs (vision encoder and LLM frozen)
- Stage 2 -- Joint training: Unfreeze all components, train on 50M high-quality vision-language tasks
- Stage 3 -- Instruction tuning: 2M visual instruction-following examples
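The three stages can be written down as a small schedule that records which components receive gradient updates. This is a sketch only: the component and stage identifiers are hypothetical, and the proposal does not say what is frozen during instruction tuning, so stage 3 is assumed here to keep all components trainable.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical component names for the three parts of the model.
COMPONENTS = ("vision_encoder", "visual_adapter", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    examples: int               # dataset size stated in the proposal
    trainable: Tuple[str, ...]  # components that receive gradient updates


STAGES = (
    Stage("alignment", 500_000_000, ("visual_adapter",)),
    Stage("joint_training", 50_000_000, COMPONENTS),
    # Assumption: the proposal does not state what is frozen in stage 3;
    # all components are left trainable here.
    Stage("instruction_tuning", 2_000_000, COMPONENTS),
)


def trainable_flags(stage: Stage) -> Dict[str, bool]:
    """Map each component to whether it is updated during the given stage."""
    unknown = set(stage.trainable) - set(COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return {name: name in stage.trainable for name in COMPONENTS}


for stage in STAGES:
    print(stage.name, stage.examples, trainable_flags(stage))
```

In a real training loop these flags would typically drive whether each component's parameters are passed to the optimizer; freezing the encoder and LLM in stage 1 means only the adapter has to learn the mapping between the two embedding spaces.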
Capabilities
| Task | Description | Reported score |
|---|---|---|
| Species ID | Identify species from camera trap photos | 94.2% top-1 |
| OCR | Extract text from documents and screenshots | 96.8% accuracy |
| Chart reading | Answer questions about charts and graphs | 88.5% accuracy |
| Video QA | Answer questions about video content | 82.1% accuracy |
| Spatial reasoning | Understand spatial relationships in images | 79.3% accuracy |
Research Papers
- zen-vl_whitepaper -- Zen-VL architecture and training
- zen-vision-architecture -- Vision encoder architecture
- zen3-vl_whitepaper -- Zen3-VL next generation
Implementation
- hanzo/jin: Jin multimodal framework with Zen-VL models
- hanzo/llm: LLM Gateway serving Zen-VL for image+text queries
- hanzo/chat: Chat interface with image upload and vision understanding
Timeline
- Originated: April 2024 (Zen-VL architecture)
- Research: zen-vl_whitepaper published Q2 2024, zen3-vl_whitepaper published 2025
- Implementation: Zen-VL deployed via Hanzo LLM Gateway Q3 2024