ZIP-0428

Knowledge Distillation Pipeline

Status: Final

Systematic pipeline for distilling large Zen models into smaller, deployment-efficient variants while preserving domain expertise

Type: Standards Track
Category: AI
Author: Zoo Labs Foundation
Created: 2025-03-01
Tags: distillation, knowledge-transfer, model-compression, small-models, edge-deployment

ZIP-0428: Knowledge Distillation Pipeline

Abstract

This proposal specifies the knowledge distillation pipeline used to create smaller Zen model variants (Nano, Mini) from larger teacher models (Pro, Max, Ultra). The pipeline preserves domain expertise through progressive distillation, where each stage transfers specific knowledge types (factual, reasoning, coding, conservation) with targeted loss functions. This enables deployment of high-quality Zen models on edge devices, mobile phones, and resource-constrained field stations.

Motivation

Conservation field stations, ranger smartphones, and camera trap edge processors cannot run 72B-parameter models, yet they need the intelligence of large models for species identification, threat detection, and conservation guidance. Knowledge distillation bridges this gap by compressing the knowledge of a 72B model into a 1.5B model that runs on a smartphone.

Specification

Distillation Pipeline

Teacher: Zen-Pro 72B (or Zen-Max 235B)
    │
    │ Stage 1: Logit Distillation
    │ (match teacher's output distribution)
    v
Intermediate: 14B → 7B
    │
    │ Stage 2: Feature Distillation
    │ (match teacher's hidden representations)
    v
Intermediate: 7B → 3B
    │
    │ Stage 3: Task-Specific Distillation
    │ (match teacher on domain-specific tasks)
    v
Student: Zen-Nano 1.5B (or Zen-Mini 600M)

Stage Details

Stage 1 -- Logit Distillation:

  • Temperature-scaled KL divergence between teacher and student logits
  • 100B tokens of diverse web data
  • Student architecture: same as target but with fewer layers
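
The Stage 1 objective can be illustrated with a minimal, dependency-free sketch for a single token position. The temperature value of 2.0 and the T^2 scaling factor are assumptions borrowed from common distillation practice; the proposal does not state either, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    # Assumed convention: scale by T^2 so gradient magnitudes stay
    # comparable when the temperature changes.
    return temperature ** 2 * kl

# Toy logits over a three-token vocabulary (illustrative values only).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = logit_distillation_loss(teacher, student)
same = logit_distillation_loss(teacher, teacher)  # identical logits give zero loss
```

In a real training run this quantity would be averaged over every token position in the batch and minimized with respect to the student's parameters.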

Stage 2 -- Feature Distillation:

  • Linear projection from student hidden states to teacher hidden states
  • Layer mapping: student layer i maps to teacher layer f(i) (learned mapping)
  • 50B tokens of high-quality data
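
As a rough sketch of the Stage 2 objective, the student's hidden state can be projected into the teacher's hidden dimension and compared with mean squared error. The MSE loss is an assumption, since the proposal names the projection but not the distance measure. The proposal also specifies a learned layer mapping f(i); the uniform mapping below is only a simple stand-in for it, and all function names are hypothetical.

```python
def project(weights, hidden):
    """Linear projection: weights is teacher_dim x student_dim, hidden has student_dim entries."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def feature_loss(student_hidden, teacher_hidden, weights):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(weights, student_hidden)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_hidden)) / len(teacher_hidden)

def uniform_layer_map(student_layer, num_student_layers, num_teacher_layers):
    """Stand-in for f(i): spread student layers evenly over the teacher's depth (1-indexed)."""
    return (student_layer * num_teacher_layers) // num_student_layers

# Toy example: a 2-dimensional student state projected to a 3-dimensional teacher state.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_hidden = [1.0, 2.0]
exact = feature_loss(student_hidden, [1.0, 2.0, 3.0], weights)  # projection matches exactly
off = feature_loss(student_hidden, [1.0, 2.0, 5.0], weights)    # last component differs by 2
teacher_layer = uniform_layer_map(1, 4, 12)                     # student layer 1 of 4, teacher has 12
```

In training, the projection weights would be learned jointly with the student, and the loss would be summed over every mapped layer pair.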

Stage 3 -- Task-Specific Distillation:

  • Domain-specific data (code, conservation, reasoning)
  • Teacher generates synthetic training data that captures its expertise
  • Student trains on this synthetic data with task-specific loss
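
The data-generation half of Stage 3 might look roughly like the sketch below. Everything here is hypothetical: `stub_teacher` stands in for real teacher inference, the prompts are invented for illustration, and the record layout is an assumption rather than the pipeline's actual format.

```python
def build_synthetic_dataset(teacher_generate, prompts_by_domain):
    """Collect teacher completions, tagged by domain, as student training records."""
    records = []
    for domain, prompts in prompts_by_domain.items():
        for prompt in prompts:
            records.append({
                "domain": domain,
                "prompt": prompt,
                "completion": teacher_generate(prompt),
            })
    return records

def stub_teacher(prompt):
    # Placeholder for a call to the teacher model (assumption: a real
    # pipeline would run teacher inference here).
    return f"[teacher answer to: {prompt}]"

# Illustrative prompts for the three domains named in the stage description.
prompts_by_domain = {
    "code": ["Write a function that reverses a list."],
    "conservation": ["List signs of snaring activity near a water source."],
    "reasoning": ["A camera trap fires every 10 minutes. How many images in 2 hours?"],
}
dataset = build_synthetic_dataset(stub_teacher, prompts_by_domain)
```

Tagging each record with its domain lets the student's training loop apply a different loss or weighting per task, which is one plausible reading of "task-specific loss".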

Quality Targets

Student          Teacher         Target Quality    Achieved
Zen-Nano 1.5B    Zen-Pro 72B     80% of teacher    82.1%
Zen-Mini 3B      Zen-Pro 72B     85% of teacher    87.3%
Zen-Base 7B      Zen-Max 235B    90% of teacher    91.8%
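
The margin by which each student clears its target follows directly from the figures above. The short check below copies those figures and assumes that both columns are percentages of the same teacher score; the proposal does not say which benchmark produces that score.

```python
# (student, teacher, target % of teacher, achieved % of teacher), copied from the table
QUALITY_TARGETS = [
    ("Zen-Nano 1.5B", "Zen-Pro 72B", 80.0, 82.1),
    ("Zen-Mini 3B", "Zen-Pro 72B", 85.0, 87.3),
    ("Zen-Base 7B", "Zen-Max 235B", 90.0, 91.8),
]

def margin_over_target(target_pct, achieved_pct):
    """Percentage points by which the achieved quality exceeds the target."""
    return round(achieved_pct - target_pct, 1)

margins = {
    student: margin_over_target(target, achieved)
    for student, _teacher, target, achieved in QUALITY_TARGETS
}
all_targets_met = all(m >= 0 for m in margins.values())
```

All three students clear their targets, by 2.1, 2.3, and 1.8 percentage points respectively.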

Edge Deployment Targets

Model            Device               Memory    Throughput    Battery Life
Zen-Nano 1.5B    Smartphone (8 GB)    1.2 GB    50 tok/s      4h continuous
Zen-Mini 3B      Tablet (16 GB)       2.4 GB    30 tok/s      3h continuous
Zen-Base 7B      Laptop (32 GB)       5.6 GB    20 tok/s      2h continuous
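
The memory figures above imply a storage budget per parameter, which can be worked out with a little arithmetic. This is a rough inference rather than something the proposal states: it assumes 1 GB = 10^9 bytes and that the whole footprint is model weights, ignoring activations and cache.

```python
# (model, parameter count, memory footprint in GB), copied from the table
EDGE_TARGETS = [
    ("Zen-Nano 1.5B", 1.5e9, 1.2),
    ("Zen-Mini 3B", 3.0e9, 2.4),
    ("Zen-Base 7B", 7.0e9, 5.6),
]

def bits_per_parameter(num_params, memory_gb):
    """Average bits of storage per parameter under the assumptions stated above."""
    return round(memory_gb * 1e9 * 8 / num_params, 2)

bits = {name: bits_per_parameter(params, gb) for name, params, gb in EDGE_TARGETS}

# For comparison: unquantized 16-bit weights for the 1.5B model.
fp16_gb = 1.5e9 * 2 / 1e9
```

Under these assumptions every row works out to about 6.4 bits per parameter, while 16-bit weights for the 1.5B model alone would need 3.0 GB. That suggests the deployed variants are quantized, although the proposal does not name a quantization scheme.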

Research Papers

Implementation

  • hanzo/candle: Rust inference engine optimized for small models
  • hanzo/llm: LLM Gateway serving distilled model variants
  • zoo/core: Mobile application with on-device inference

Timeline

  • Originated: March 2025 (distillation pipeline design)
  • Research: zen-knowledge-distillation published Q1 2025
  • Implementation: Zen-Nano and Zen-Mini deployed Q2 2025