Decentralized AI Training Architecture
Architecture for distributed AI model training across heterogeneous nodes; the precursor to the Zoo Gym protocol and the Decentralized Semantic Optimization (DSO) protocol
ZIP-0407: Decentralized AI Training Architecture
Abstract
This proposal defines the architecture for training AI models across a decentralized network of heterogeneous compute nodes. Rather than requiring a single datacenter with thousands of homogeneous GPUs, this system enables conservation organizations, universities, and individual contributors to pool their compute resources for collaborative model training. The architecture handles node heterogeneity (different GPU types, network speeds, availability patterns), provides Byzantine fault tolerance, and rewards contributors in proportion to their verified compute contributions. It is the precursor to the Zoo Gym protocol and, later, the Decentralized Semantic Optimization protocol (ZIP-0410).
Motivation
Training conservation-aware language models (ZIP-0405) and multimodal systems (ZIP-0406) requires significant compute resources. However:
- Conservation organizations have limited budgets for cloud GPU rental
- University research labs have GPUs that sit idle outside business hours
- Individual supporters with gaming GPUs are willing to contribute idle cycles
- No single entity should control the training process for a public-good AI
Decentralized training solves all four problems by creating a network where anyone can contribute compute and earn rewards, while the resulting model remains a public good.
Specification
Network Architecture
Coordinator (on-chain smart contract)
├── Task Registry: available training tasks with specs
├── Node Registry: registered compute nodes with capabilities
├── Assignment Engine: matches tasks to nodes
├── Verification: validates completed work
└── Reward Distribution: pays contributors
Compute Nodes (off-chain, heterogeneous)
├── GPU Worker: executes training steps
├── Prover: generates compute proofs
├── Reporter: submits results and proofs
└── Syncer: downloads/uploads model checkpoints
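The coordinator's registries can be pictured as plain records plus a matching rule. The following is a minimal in-memory sketch, not the on-chain contract: the class names, field names, and the VRAM-only matching rule are assumptions made for illustration and do not come from this proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One Task Registry entry (field names are hypothetical)."""
    task_id: str
    dataset_cid: str   # content identifier of the training dataset
    min_vram_gb: int   # smallest GPU memory able to run one shard


@dataclass
class Node:
    """One Node Registry entry (field names are hypothetical)."""
    node_id: str
    gpu_model: str
    vram_gb: int
    bandwidth_mbps: float


@dataclass
class Coordinator:
    """In-memory stand-in for the on-chain coordinator."""
    tasks: dict[str, Task] = field(default_factory=dict)
    nodes: dict[str, Node] = field(default_factory=dict)

    def post_task(self, task: Task) -> None:
        self.tasks[task.task_id] = task

    def register_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def eligible_nodes(self, task_id: str) -> list[str]:
        """Assignment Engine reduced to a VRAM check (an assumption)."""
        task = self.tasks[task_id]
        return sorted(
            n.node_id for n in self.nodes.values()
            if n.vram_gb >= task.min_vram_gb
        )


coordinator = Coordinator()
coordinator.post_task(Task("task-1", "example-dataset-cid", min_vram_gb=16))
coordinator.register_node(Node("node-a", "RTX 3060", 8, 100.0))
coordinator.register_node(Node("node-b", "A5000", 24, 500.0))
coordinator.register_node(Node("node-c", "A100", 80, 1000.0))
matched = coordinator.eligible_nodes("task-1")
```

Here `matched` contains only the two nodes with at least 16 GB of VRAM. A real assignment engine would presumably also weigh bandwidth and availability, which this sketch ignores.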
Training Protocol
- Task Creation: Training coordinator posts a task (model architecture, dataset CID, hyperparameters, required compute)
- Node Registration: Compute providers register their hardware capabilities (GPU model, VRAM, bandwidth)
- Assignment: Coordinator assigns data shards to nodes based on capability matching
- Execution: Nodes train on their assigned shard, producing gradient updates
- Verification: A subset of nodes re-execute random shards to verify correctness (ZIP-0419 PoAI)
- Aggregation: Verified gradients are aggregated using Byzantine-robust aggregation
- Checkpoint: Updated model checkpoint is stored on IPFS and CID recorded on-chain
- Reward: Contributors receive ZOO tokens proportional to verified compute (ZIP-0016)
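The verification and aggregation steps can be sketched as follows. The proposal does not say which Byzantine-robust aggregator is used, so this example assumes a coordinate-wise median, one well-known choice. The function names and tolerance values are also hypothetical. A tolerance is used in the comparison because floating-point results can differ slightly across GPU types.

```python
import math
from statistics import median


def gradients_match(submitted, recomputed, rel_tol=1e-3):
    """True when a verifier's re-executed gradient agrees with the submission."""
    if len(submitted) != len(recomputed):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-8)
        for a, b in zip(submitted, recomputed)
    )


def aggregate_median(gradients):
    """Coordinate-wise median of equally sized gradient vectors."""
    if not gradients:
        raise ValueError("need at least one gradient")
    dim = len(gradients[0])
    if any(len(g) != dim for g in gradients):
        raise ValueError("all gradients must have the same length")
    # zip(*gradients) groups the i-th coordinate from every node together.
    return [median(coord) for coord in zip(*gradients)]


honest_a = [1.0, -2.0]
honest_b = [2.0, -1.0]
poisoned = [1000.0, 1000.0]  # a Byzantine submission that slipped past spot checks
aggregated = aggregate_median([honest_a, honest_b, poisoned])
print(aggregated)  # [2.0, -1.0]
```

A plain mean of the same three vectors would be dominated by the poisoned entry. The per-coordinate median stays within the range of honest values as long as fewer than half of the inputs are malicious.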
Node Heterogeneity Handling
| GPU Class | Min VRAM | Role | Reward Multiplier |
|---|---|---|---|
| Consumer (RTX 3060-4090) | 8 GB | Data-parallel training on small batches | 1.0x |
| Professional (A4000-A6000) | 16 GB | Standard training shards | 1.5x |
| Datacenter (A100, H100) | 40-80 GB | Large batch training, model surgery | 3.0x |
| Apple Silicon (M1-M4) | Unified memory | Inference validation, lightweight fine-tuning | 0.8x |
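One way to read the reward multipliers is as weights applied to each node's verified work before the reward pool is split. The actual formula is defined elsewhere (ZIP-0016) and is not reproduced here, so the weighting rule, the notion of "work units", and all names below are assumptions for illustration only.

```python
# Multipliers copied from the table above; the class keys are hypothetical labels.
REWARD_MULTIPLIER = {
    "consumer": 1.0,
    "professional": 1.5,
    "datacenter": 3.0,
    "apple_silicon": 0.8,
}


def split_rewards(pool, contributions):
    """Split `pool` tokens across nodes.

    `contributions` maps node_id -> (gpu_class, verified_work_units).
    Each node's share is proportional to multiplier * verified units.
    """
    weighted = {
        node_id: REWARD_MULTIPLIER[gpu_class] * units
        for node_id, (gpu_class, units) in contributions.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return {node_id: 0.0 for node_id in weighted}
    return {node_id: pool * w / total for node_id, w in weighted.items()}


payouts = split_rewards(
    100.0,
    {"node-a": ("consumer", 10.0), "node-b": ("datacenter", 10.0)},
)
print(payouts)  # {'node-a': 25.0, 'node-b': 75.0}
```

With equal verified units, the datacenter node receives three times the consumer node's payout under this assumed rule, mirroring the 3.0x and 1.0x multipliers.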
Fault Tolerance
- Nodes can go offline at any time; their assigned shard is reassigned after timeout
- Byzantine nodes (submitting bad gradients) are detected by verification and slashed
- Network partitions are tolerated: each partition continues training independently, and the results are reconciled once connectivity is restored
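Timeout-based reassignment of shards can be sketched as a periodic sweep over open assignments. The record layout, the timeout value, and the first-idle-node policy below are assumptions. The proposal only states that a shard is reassigned after a timeout.

```python
from dataclasses import dataclass


@dataclass
class Assignment:
    shard_id: str
    node_id: str
    assigned_at: float  # seconds, on the same clock as `now`
    done: bool = False


def reassign_expired(assignments, idle_nodes, now, timeout_s):
    """Move each unfinished, timed-out shard to an idle node.

    Returns a list of (shard_id, old_node, new_node) moves. A shard stays
    with its current node when no idle node is left.
    """
    moves = []
    spare = list(idle_nodes)
    for a in assignments:
        expired = (not a.done) and (now - a.assigned_at > timeout_s)
        if expired and spare:
            new_node = spare.pop(0)
            moves.append((a.shard_id, a.node_id, new_node))
            a.node_id = new_node
            a.assigned_at = now  # restart the clock for the new owner
    return moves


assignments = [
    Assignment("shard-1", "node-a", assigned_at=0.0),             # stalled
    Assignment("shard-2", "node-b", assigned_at=50.0),            # still in time
    Assignment("shard-3", "node-c", assigned_at=0.0, done=True),  # finished
]
moves = reassign_expired(assignments, ["node-z"], now=100.0, timeout_s=60.0)
print(moves)  # [('shard-1', 'node-a', 'node-z')]
```

Only the stalled shard moves. The finished shard and the one still within its deadline are left alone.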
Research Papers
- zoo-gym-protocol -- Gym decentralized training protocol (2024)
- zoo-gym-compute-proof -- Compute proof protocol for verifiable training (2024)
- zoo-gym-orchestrator -- Training orchestration system (2024)
- zen-distributed-training -- Distributed training for Zen model family
Implementation
- hanzo/node: Blockchain/AI node with libp2p networking for decentralized training
- hanzo/candle: Rust ML framework used by training nodes
- zoo/core: Gym training platform interface
Timeline
- Originated: November 2022 (decentralized training research)
- Research: zoo-gym-protocol published 2024, zoo-gym-compute-proof published 2024
- Implementation: Zoo Gym training infrastructure deployed 2024