ZIP-0414: Mixture of Distilled Experts (MoDE)

Abstract

This proposal specifies the Zen MoDE (Mixture of Distilled Experts) architecture, a sparse mixture-of-experts approach in which each expert is a distilled specialist trained on a specific domain (code, math, language, vision, reasoning). Unlike standard MoE, which uses identical expert architectures with different weights, MoDE uses diverse expert architectures, each optimized for its domain, with a learned router that dynamically selects the most relevant experts for each input token. The target is near-dense quality (98.5% to 99.8% of the dense baseline, per the Efficiency Gains table) at roughly a quarter of the inference FLOPs.

Motivation

Dense models activate every parameter for every token, so adding parameters raises inference cost uniformly. This is wasteful: a model answering a coding question does not need its poetry knowledge active, and vice versa. Standard MoE addresses this with a router that selects K of N experts per token, but all experts share one architecture and differ only in their weights.

MoDE goes further: each expert has an architecture optimized for its domain:

Code experts use larger FFN layers (more memorization capacity for APIs and syntax)
Reasoning experts use deeper attention (more reasoning hops)
Language experts use wider vocabulary embeddings (better multilingual coverage)
Vision experts use spatial attention patterns (grid-structured features)

Specification

Architecture

Input Token
    │
    v
┌──────────────┐
│   Router     │ (learned, top-K selection)
│   Network    │
└──────┬───────┘
       │ selects K of N experts
       │
┌──────┴──────┬──────────────┬──────────────┬──────────────┐
│ Code Expert │ Math Expert  │ Lang Expert  │ Reason Expert│
│ (wide FFN)  │ (deep attn)  │ (wide vocab) │ (deep attn)  │
│ 14B params  │ 14B params   │ 14B params   │ 14B params   │
└──────┬──────┴──────┬───────┴──────┬───────┴──────┬───────┘
       │             │              │              │
       v             v              v              v
┌──────────────────────────────────────────────────────────┐
│                Weighted Combination                       │
│                (router weights)                           │
└──────────────────────────────────────────────────────────┘
       │
       v
   Output Token
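
The routing and weighted-combination steps in the diagram can be sketched as follows. This is a minimal illustration, not the Zen implementation: the function names, the toy experts, and the choice to renormalize the softmax weights over the selected experts are all assumptions. It also assumes every expert, whatever its internal architecture, accepts and returns vectors of the same size, which the weighted sum requires.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: weights are softmax probabilities renormalized over the
    selected experts so that they sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def mode_forward(token_vec, router_logits, experts, k=2):
    """Weighted combination of the outputs of the selected experts."""
    out = [0.0] * len(token_vec)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token_vec)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out

# Hypothetical toy experts standing in for the heterogeneous expert networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.0 for _ in x],
]
logits = [2.0, 1.0, -1.0, 0.0]
selected = route_top_k(logits, k=2)
output = mode_forward([1.0, 2.0], logits, experts, k=2)
```

Only the two selected experts are evaluated, which is where the compute saving over a dense layer comes from.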

Expert Distillation

Each expert is created through knowledge distillation from the dense Zen model:

Train a dense 72B model on all domains
Identify domain-specific attention patterns and parameter subsets
Distill each domain into a specialized expert architecture
Train the router to select experts based on input tokens
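
The proposal does not state which distillation objective is used. As an illustration only, the sketch below uses the common soft-target formulation: a KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names and the default temperature are assumptions.

```python
import math

def log_softmax_t(logits, temperature):
    # Log-softmax of logits / temperature, computed stably via log-sum-exp.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: teacher_logits come from the dense model and student_logits
    from one domain expert, evaluated on the same token.
    """
    log_p = log_softmax_t(teacher_logits, temperature)
    log_q = log_softmax_t(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2

matched = distillation_loss([3.0, 1.0, 0.2], [3.0, 1.0, 0.2])
mismatched = distillation_loss([3.0, 1.0, 0.2], [0.2, 1.0, 3.0])
```

In a per-domain setup, each expert would presumably be trained with such a loss on data from its own domain only, so that the student matches the dense teacher where it is expected to be routed.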

Routing Strategy

Top-2 routing: Each token activates 2 of N experts (typically N=8)
Load balancing loss: An auxiliary loss prevents expert collapse (all tokens routed to the same expert)
Expert capacity: Each expert processes at most C tokens per batch (overflow tokens are sent to a fallback expert)
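
The exact form of the auxiliary loss and the overflow policy are not given here, so the sketch below makes two assumptions. The loss follows the Switch Transformer-style formulation (number of experts times the sum over experts of dispatch fraction times mean router probability), and the fallback expert is modeled as a separate expert id with no capacity limit. All names are hypothetical.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: N * sum_i f_i * P_i.

    router_probs: per-token lists of router probabilities over experts.
    assignments:  per-token lists of selected expert indices (top-K).
    f_i is the fraction of dispatch slots sent to expert i; P_i is the mean
    router probability for expert i. Balanced routing gives 1.0; full
    collapse onto one expert gives num_experts.
    """
    num_tokens = len(router_probs)
    total_slots = sum(len(a) for a in assignments)
    frac = [0.0] * num_experts
    for chosen in assignments:
        for e in chosen:
            frac[e] += 1.0 / total_slots
    mean_prob = [
        sum(p[e] for p in router_probs) / num_tokens for e in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

def dispatch_with_capacity(assignments, num_experts, capacity, fallback_expert):
    """Enforce a per-expert capacity C, rerouting overflow to the fallback.

    Assumption: fallback_expert is an id outside 0..num_experts-1 and is
    not itself capacity-limited.
    """
    load = [0] * num_experts
    dispatched = []
    for chosen in assignments:
        kept = []
        for e in chosen:
            if load[e] < capacity:
                load[e] += 1
                kept.append(e)
            else:
                kept.append(fallback_expert)  # overflow token
        dispatched.append(kept)
    return dispatched, load

balanced = load_balancing_loss(
    [[0.25] * 4] * 4, [[0], [1], [2], [3]], num_experts=4
)
collapsed = load_balancing_loss(
    [[1.0, 0.0, 0.0, 0.0]] * 4, [[0]] * 4, num_experts=4
)
routed, load = dispatch_with_capacity(
    [[0, 1], [0, 1], [0, 2]], num_experts=3, capacity=2, fallback_expert=-1
)
```

The loss grows as routing concentrates on fewer experts, which is the pressure against collapse that the strategy above describes.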

Efficiency Gains

Model           Total Params   Active Params   FLOPs vs Dense   Quality vs Dense
Zen-Base MoDE   14B            3.5B            0.25x            98.5%
Zen-Pro MoDE    110B           28B             0.25x            99.2%
Zen-Max MoDE    480B           120B            0.25x            99.8%
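
One way to read the 0.25x column is as the active-parameter fraction, which matches top-2 routing over 8 experts (2/8 = 0.25). The short check below recomputes that fraction from the table. It is a back-of-the-envelope sketch: it assumes equal-sized experts and ignores shared parameters such as the router and embeddings, neither of which holds exactly for heterogeneous experts.

```python
# (total params, active params) in billions, copied from the table above.
MODELS = {
    "Zen-Base MoDE": (14.0, 3.5),
    "Zen-Pro MoDE": (110.0, 28.0),
    "Zen-Max MoDE": (480.0, 120.0),
}

def active_fraction(total_b, active_b):
    # Share of parameters used per token.
    return active_b / total_b

def top_k_fraction(k, n):
    # Assumption: N equal-sized experts and no shared parameters.
    return k / n

fractions = {name: active_fraction(t, a) for name, (t, a) in MODELS.items()}
routing_fraction = top_k_fraction(2, 8)
```

Zen-Base and Zen-Max come out at exactly 0.25, while Zen-Pro is slightly higher (28/110), so the 0.25x entry for that row appears to be rounded.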

Research Papers

zen-mixture-of-experts -- MoDE architecture specification
zen-knowledge-distillation -- Knowledge distillation pipeline for expert creation
zen-inference-optimization -- Inference optimization for MoDE models

Implementation

hanzo/llm: LLM Gateway with MoDE-optimized serving
hanzo/candle: Rust inference engine with expert routing
hanzo/jin: Jin multimodal models using MoDE backbone

Timeline

Originated: February 2024 (MoDE architecture design)
Research: zen-mixture-of-experts published Q2 2024
Implementation: Zen MoDE models deployed via Hanzo LLM Gateway Q3 2024