ZIP-0272

7680-Dimensional Embeddings (Zen-Reranker)

Final

High-dimensional embedding model and reranker optimized for semantic search, retrieval, and cross-modal similarity at 7680 dimensions

Type
Standards Track
Category
AI
Author
Zoo Labs Foundation
Created
2024-07-01
Tags
embeddings, reranker, semantic-search, vector-search, 7680-dim, retrieval

ZIP-0420: 7680-Dimensional Embeddings (Zen-Reranker)

Abstract

This proposal specifies the Zen-Reranker embedding model, a 7680-dimensional embedding system optimized for semantic search, document retrieval, and cross-modal similarity. The dimensionality is unusually high compared with standard 768- or 1536-dimensional embeddings, and it provides better separation of fine-grained semantic distinctions. That separation is critical for conservation applications, where the differences between closely related species, similar habitats, or related conservation threats must be captured precisely.

Motivation

Standard embedding models (768-1536 dimensions) collapse fine-grained distinctions that matter for conservation:

  • "African forest elephant" vs "African savanna elephant" (different species, different conservation status)
  • "Habitat fragmentation" vs "habitat degradation" (different threats, different interventions)
  • "Population declining" vs "population stable but range contracting" (different urgency levels)

At 7680 dimensions, the embedding space has enough capacity to maintain separation between these subtle but consequential distinctions while still enabling efficient approximate nearest-neighbor search.

Specification

Architecture

  • Base: Zen Base 7B encoder backbone
  • Embedding dimension: 7680
  • Pooling: Mean pooling over last hidden layer with learned projection
  • Normalization: L2-normalized output for cosine similarity
  • Matryoshka: Supports truncation to 1024, 2048, or 4096 dimensions with graceful degradation
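
The pooling, projection, and normalization steps listed above can be sketched in a few lines. This is a minimal illustration, not the Zen-Reranker implementation: the function names, the bias-free linear projection, and the choice to project after pooling are assumptions, and the toy sizes (2-dimensional hidden states projected to 3 dimensions) stand in for the real backbone width and the 7680-dimensional output.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention mask excludes every token")
    return [t / count for t in total]


def project(vec: Vector, weight: List[Vector]) -> Vector:
    """Linear projection; weight has shape (out_dim, in_dim). No bias (assumed)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def l2_normalize(vec: Vector) -> Vector:
    """Scale to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def embed(hidden_states: List[Vector], attention_mask: List[int],
          weight: List[Vector]) -> Vector:
    """Pool, project, then normalize (order of the first two is assumed)."""
    return l2_normalize(project(mean_pool(hidden_states, attention_mask), weight))


# Toy input: three tokens, the last one is padding and must be ignored.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 2 -> 3 projection
embedding = embed(hidden, mask, weight)
```

Because the output has unit length, cosine similarity between two embeddings reduces to a plain dot product, which is what most vector indexes compute.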

Training

  1. Contrastive pre-training: 1B text pairs with hard negatives
  2. Conservation domain tuning: 10M species/habitat/threat description pairs
  3. Cross-modal alignment: Image-text pairs for vision-language retrieval
  4. Reranker fine-tuning: Cross-encoder reranking on relevance-labeled data
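
The proposal does not state which loss is used for contrastive pre-training. A common choice for training on text pairs with hard negatives is the InfoNCE loss, sketched below under that assumption. The function names and the temperature value of 0.05 are illustrative, and all vectors are assumed to be L2-normalized already.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query: Vector, positive: Vector, negatives: List[Vector],
                  temperature: float = 0.05) -> float:
    """Negative log-probability of the positive among positive + negatives.

    Inputs are assumed to be unit vectors, so dot() is cosine similarity.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


query = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [0.0, 1.0]                      # orthogonal to the query
hard_negative = [0.9, math.sqrt(1.0 - 0.81)]    # close to the query

easy_loss = info_nce_loss(query, positive, [easy_negative])
hard_loss = info_nce_loss(query, positive, [hard_negative])
```

A hard negative sits close to the query, so it contributes a larger loss than an easy one. That larger gradient signal is the usual reason for mining hard negatives.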

Benchmarks

| Benchmark | Zen-Reranker (7680d) | Best 1536d | Improvement |
|---|---|---|---|
| MTEB (avg) | 72.3 | 68.1 | +4.2 |
| BEIR (avg) | 58.7 | 54.2 | +4.5 |
| Species retrieval | 96.2% | 89.1% | +7.1 |
| Conservation QA retrieval | 94.8% | 87.3% | +7.5 |

Matryoshka Dimensions

The model supports progressive dimension truncation:

| Dimensions | Quality (MTEB) | Storage | Use Case |
|---|---|---|---|
| 7680 | 72.3 | Full | Maximum quality retrieval |
| 4096 | 71.8 | 53% | High-quality with reduced storage |
| 2048 | 70.1 | 27% | Balanced quality/efficiency |
| 1024 | 67.9 | 13% | Mobile and edge deployment |
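
Matryoshka truncation is usually applied by keeping the leading dimensions of a full embedding and re-normalizing the result. The sketch below assumes that convention, along with 4-byte float32 storage. The helper names are hypothetical and not part of any published SDK.

```python
import math
from typing import List

FULL_DIM = 7680
SUPPORTED_DIMS = (1024, 2048, 4096, 7680)  # sizes listed in this proposal


def truncate_embedding(embedding: List[float], dim: int) -> List[float]:
    """Keep the first `dim` values and rescale them to unit length."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"unsupported dimension: {dim}")
    if len(embedding) != FULL_DIM:
        raise ValueError(f"expected {FULL_DIM} values, got {len(embedding)}")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]


def storage_fraction(dim: int) -> float:
    """Storage relative to the full 7680-dimensional vector."""
    return dim / FULL_DIM


def bytes_per_vector(dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of one vector; float32 (4 bytes per value) is an assumption."""
    return dim * bytes_per_value


# A placeholder unit vector standing in for a real model output.
full = [1.0 / math.sqrt(FULL_DIM)] * FULL_DIM
small = truncate_embedding(full, 1024)
```

Under these assumptions a full vector occupies 30,720 bytes and a 1024-dimension prefix occupies 4,096 bytes, which is consistent with the 13% storage figure in the table. Re-normalizing after truncation keeps dot products interpretable as cosine similarity.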

Implementation

  • hanzo/search: AI-powered search using 7680-dim embeddings
  • hanzo/llm: LLM Gateway serving embedding and reranking endpoints
  • hanzo/python-sdk: Python SDK with embedding and search functions
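
Embedding models and rerankers are commonly combined in a two-stage pipeline: a fast vector search selects candidates, and a more expensive scorer reorders them. The sketch below illustrates that pattern only. It does not use the hanzo/search, hanzo/llm, or hanzo/python-sdk APIs, and `token_overlap` is a toy stand-in for a cross-encoder reranker.

```python
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    query_text: str,
    doc_texts: List[str],
    cross_score: Callable[[str, str], float],
    top_k: int = 2,
) -> List[Tuple[int, str]]:
    """Stage 1: cosine search over unit vectors. Stage 2: rescore the top_k."""
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:top_k]
    reranked = sorted(
        candidates,
        key=lambda i: cross_score(query_text, doc_texts[i]),
        reverse=True,
    )
    return [(i, doc_texts[i]) for i in reranked]


def token_overlap(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens present in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


docs = [
    "African forest elephant population",
    "African savanna elephant range",
    "coral reef bleaching",
]
# Hand-made 2-dimensional unit vectors standing in for real embeddings.
doc_vecs = [[0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
results = retrieve_then_rerank(
    [0.6, 0.8], doc_vecs, "forest elephant", docs, token_overlap
)
```

In this toy run the vector stage ranks the savanna elephant document first, and the rescoring stage moves the forest elephant document to the top. A production reranker would replace `token_overlap` with a model that scores each query and document pair jointly.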

Timeline

  • Originated: July 2024 (7680-dim embedding architecture)
  • Research: zen-reranker published Q3 2024, embedding-7680 published 2024
  • Implementation: Zen-Reranker deployed via Hanzo LLM Gateway Q3 2024