ZIP-0278

1M Token Context Extension

Status: Final

YaRN-based context window extension enabling Zen models to process 1 million tokens in a single inference pass

Type: Standards Track
Category: AI
Author: Zoo Labs Foundation
Created: 2025-01-15
Tags: context-extension, long-context, yarn, million-tokens, rope-scaling

ZIP-0426: 1M Token Context Extension

Abstract

This proposal specifies how to extend Zen model context windows from 128K to 1 million tokens using YaRN (Yet another RoPE extensioN) scaling combined with attention optimization techniques. A 1M-token context allows entire codebases, complete research paper collections, full species databases, and extensive conversation histories to be processed in a single inference pass. This is critical for the Experience Ledger (ZIP-0401), which requires agents to reason over their complete memory.

Motivation

The Experience Ledger (ZIP-0401) accumulates knowledge over months and years of interaction. A conservation agent that has been active for one year may have:

  • 100K tokens of conversation history
  • 200K tokens of species knowledge
  • 300K tokens of field report summaries
  • 400K tokens of cross-referenced scientific literature

Together these sources total roughly 1M tokens. At 128K context, the agent must aggressively summarize and discard information. At 1M context, it can reason over its complete memory.

Specification

Extension Method

YaRN Scaling:

  1. Segment RoPE dimensions into low-frequency and high-frequency groups
  2. Low-frequency: linear interpolation (preserves long-range dependencies)
  3. High-frequency: no scaling (preserves local pattern matching)
  4. Boundary: learned via attention entropy analysis
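The frequency split described above can be sketched as follows. This is a minimal illustration of the ramp-based blending published in the YaRN paper, not the Zen implementation: the function name `yarn_inv_freq`, the RoPE base of 1,000,000, and the fixed boundary values `alpha=1` and `beta=32` are assumptions chosen for illustration. This proposal derives the boundary from attention entropy analysis, which is not shown here.

```python
import math

def yarn_inv_freq(head_dim, base=1_000_000.0, scale=8.0,
                  orig_ctx=131_072, alpha=1.0, beta=32.0):
    """Return (inverse frequencies, attention scale) for YaRN-style RoPE.

    All default values are illustrative assumptions, not Zen settings.
    scale=8 corresponds to 128K -> 1M when K = 1024.
    """
    new_inv_freq = []
    for i in range(0, head_dim, 2):
        # Standard RoPE inverse frequency for this pair of dimensions.
        inv_freq = base ** (-i / head_dim)
        # Full rotations this dimension completes over the original context.
        rotations = orig_ctx * inv_freq / (2 * math.pi)
        # ramp = 0: low frequency, fully interpolated (divided by scale).
        # ramp = 1: high frequency, left unscaled.
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        new_inv_freq.append((1.0 - ramp) * inv_freq / scale + ramp * inv_freq)
    # Attention temperature factor recommended in the YaRN paper.
    attn_scale = 0.1 * math.log(scale) + 1.0
    return new_inv_freq, attn_scale
```

Dimensions whose rotation count falls between `alpha` and `beta` receive a linear blend of the two treatments, so the transition between the groups is smooth rather than a hard cut.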

Training Pipeline

  1. Base model: Zen Base trained at 128K context (ZIP-0413)
  2. Progressive extension: 128K -> 256K -> 512K -> 1M in three stages
  3. Per-stage training: 1B tokens at each context length
  4. Long-document data: Books, codebases, paper collections, conversation logs
  5. Needle-in-haystack evaluation: Verify retrieval accuracy at all context positions

Attention Optimization

Standard self-attention costs O(n^2) in sequence length, so a 1M-token context requires the following attention optimizations:

Technique              Description                                            Memory Reduction
FlashAttention-3       Tiled attention with I/O optimization                  4x
Ring attention         Distributed attention across multiple GPUs             Linear in GPUs
Sliding window         Local attention for most layers, global for every 4th  8x
KV-cache quantization  4-bit KV cache compression                             4x
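A rough sizing exercise shows why these techniques matter at 1M tokens. The sketch below estimates KV-cache size for a hypothetical model (32 layers, 8 KV heads, head dimension 128; these numbers are illustrative assumptions, not the Zen architecture) and lists which layers would be global under one possible reading of "global for every 4th". It ignores the small overhead of quantization scales and zero points.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Two tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim.
    values = 2 * n_layers * seq_len * n_kv_heads * head_dim
    return values * bits_per_value // 8

def global_layers(n_layers, every=4):
    """0-indexed layers using global attention (assumed: 3, 7, 11, ...)."""
    return [i for i in range(n_layers) if (i + 1) % every == 0]

GIB = 1024 ** 3
# Hypothetical model shape, chosen only to make the arithmetic concrete.
cfg = dict(seq_len=1_048_576, n_layers=32, n_kv_heads=8, head_dim=128)
fp16 = kv_cache_bytes(bits_per_value=16, **cfg)
int4 = kv_cache_bytes(bits_per_value=4, **cfg)
print(fp16 / GIB, int4 / GIB, fp16 / int4)  # 128.0 32.0 4.0
print(global_layers(8))                     # [3, 7]
```

For this hypothetical shape, a 16-bit cache at 1M tokens would occupy 128 GiB, more than a single accelerator typically holds, while the 4-bit cache needs 32 GiB, matching the 4x reduction listed for KV-cache quantization.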

Needle-in-Haystack Results

Context Length  Retrieval Accuracy  Latency (first token)
128K            99.8%               0.5s
256K            99.6%               1.1s
512K            99.2%               2.3s
1M              98.5%               4.8s
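A needle-in-haystack check of this kind can be sketched as below. This is an illustrative harness only, not the evaluation code behind the figures above: `build_prompt`, `retrieval_accuracy`, the needle wording, and the `model` callable (any function from prompt string to response string) are hypothetical, and word count stands in for token count.

```python
def build_prompt(filler, needle, depth, n_words):
    """Place `needle` at relative position `depth` (0.0 = start, 1.0 = end)."""
    base = filler.split()
    # Repeat the filler text until it covers n_words, then trim.
    words = (base * (n_words // len(base) + 1))[:n_words]
    cut = round(depth * n_words)
    return " ".join(words[:cut] + [needle] + words[cut:])

def retrieval_accuracy(model, filler, n_words, depths, secret="7421"):
    """Fraction of needle positions at which the model returns the secret."""
    needle = f"The magic number is {secret}."
    question = "\n\nWhat is the magic number?"
    hits = 0
    for depth in depths:
        prompt = build_prompt(filler, needle, depth, n_words) + question
        # Count a hit when the secret appears anywhere in the response.
        hits += secret in model(prompt)
    return hits / len(depths)
```

Sweeping `depths` from 0.0 to 1.0 at each context length yields one accuracy value per length, the same shape as the table above.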

Implementation

  • hanzo/llm: LLM Gateway with 1M context model serving
  • hanzo/candle: Rust inference engine with ring attention support
  • hanzo/chat: Chat interface with extended context conversations

Timeline

  • Originated: January 2025 (1M context research)
  • Research: zen-context-extension published Q1 2025
  • Implementation: Zen models with 1M context deployed via Hanzo LLM Gateway Q2 2025