ZIPsZoo Proposals
ZIP-0280

AI Safety Framework (Zen-Guard)

Final

Comprehensive AI safety framework including content filtering, guardrails, jailbreak prevention, and conservation-specific safety constraints

Type
Standards Track
Category
AI
Author
Zoo Labs Foundation
Created
2025-05-01
ai-safetyguardrailscontent-filterzen-guardjailbreak-prevention

ZIP-0430: AI Safety Framework (Zen-Guard)

Abstract

This proposal specifies Zen-Guard, a comprehensive AI safety framework providing content filtering, guardrails, jailbreak prevention, and domain-specific safety constraints for all Zen models. Zen-Guard operates as both a standalone classifier model and a set of runtime constraints integrated into the Hanzo LLM Gateway. For conservation applications, Zen-Guard includes species-specific safety rules (never reveal endangered species locations, never recommend actions that could harm wildlife).

Motivation

Conservation AI has unique safety requirements beyond standard content filtering:

  1. Location secrecy: Never reveal GPS coordinates of critically endangered species
  2. Intervention safety: Never recommend conservation actions without expert review
  3. Cultural sensitivity: Respect indigenous knowledge protocols and data sovereignty
  4. Emotional safety: Conservation conversations can involve distressing content (poaching, extinction); agents must handle this sensitively
  5. Factual safety: Conservation misinformation (e.g., incorrect species status) can lead to misallocated resources

Specification

Architecture

User Input
    │
    v
┌──────────────┐
│ Zen-Guard    │ ← Pre-filter (block harmful inputs)
│ Input Filter │
└──────┬───────┘
       │ (safe inputs pass through)
       v
┌──────────────┐
│ Zen Model    │ ← Main model generates response
│ (inference)  │
└──────┬───────┘
       │
       v
┌──────────────┐
│ Zen-Guard    │ ← Post-filter (block harmful outputs)
│ Output Filter│
└──────┬───────┘
       │
       v
┌──────────────┐
│ Zen-Guard    │ ← Streaming filter (real-time monitoring)
│ Stream Guard │
└──────┬───────┘
       │ (safe response delivered)
       v
User Response

Guard Models

ModelParametersLatencyPurpose
Zen-Guard1.5B10msGeneral content classification
Zen-Guard-Gen7B30msGeneration-specific safety
Zen-Guard-Stream600M5msReal-time streaming filter

Conservation Safety Rules

RuleSeverityAction
Endangered species locationCriticalBlock + alert
Poaching techniqueCriticalBlock + report
Incorrect conservation statusHighCorrect + cite source
Harmful intervention adviceHighBlock + suggest expert consultation
Cultural protocol violationHighBlock + explain protocol
Age-inappropriate contentMediumRedirect
Unverified conservation claimLowFlag + request citation

Evaluation

  • Red-team adversarial testing with conservation-specific attack vectors
  • Automated jailbreak detection (prompt injection, role-play attacks)
  • Human evaluation by conservation domain experts
  • Continuous monitoring of production conversations for safety violations

Research Papers

Implementation

  • hanzo/llm: LLM Gateway with Zen-Guard integration
  • hanzo/chat: Chat interface with safety guardrails
  • hanzo/agent: Agent SDK with safety constraints

Timeline

  • Originated: May 2025 (Zen-Guard design)
  • Research: zen-safety-evaluation published Q2 2025, guard model whitepapers Q3 2025
  • Implementation: Zen-Guard deployed in Hanzo LLM Gateway Q2 2025