ZIP-601: Open Biodiversity Database Standard

Abstract

This ZIP specifies an open-access biodiversity database built on IPFS with on-chain indexing via the Zoo L2. The database stores species observations, genetic sequences, habitat survey data, and ecological metadata in standardized, machine-readable formats. It provides interoperability with the Global Biodiversity Information Facility (GBIF) through bidirectional data adapters and implements fine-grained access control for sensitive records (e.g., endangered species locations) per ZIP-540. All data is content-addressed, cryptographically attested, and permanently available.

Motivation

Biodiversity data is fragmented across thousands of institutional databases, government agencies, and research projects. This fragmentation creates several problems:

Discoverability: Researchers cannot efficiently find existing observations for a given species or region. Duplicate data collection wastes limited conservation resources.
Interoperability: Different databases use incompatible schemas, taxonomies, and coordinate systems. Merging datasets requires extensive manual harmonization.
Permanence: Institutional databases go offline when funding ends. An estimated 20% of biodiversity datasets cited in published papers are no longer accessible at their original URLs.
Provenance: Data lineage is poorly tracked. When observations are copied between databases, attribution is lost and errors propagate without correction.
Access control: Sensitive data (endangered species locations, indigenous knowledge) requires granular permissions that most open databases cannot provide.

Zoo Labs Foundation operates at the intersection of AI and conservation. Our AI models for species identification, population modeling, and habitat assessment require large, high-quality, standardized datasets. This ZIP creates the data infrastructure layer that feeds both human researchers and machine learning pipelines.

Cross-Ecosystem Context

HIP-0074 (Hanzo SBOM): Software Bill of Materials patterns for tracking data provenance and component lineage.
LP-7102 (Lux Immutable Training Ledger): On-chain audit trail for datasets used in AI model training.
ZIP-560 (Evidence Locker Index): Existing evidence cataloging system that this database extends for biodiversity data.

Specification

1. Data Model

1.1 Core Entities

BiodiversityRecord
    |
    +-- Observation          (species sighting, count, behavior)
    +-- GeneticSample        (DNA/RNA sequence, markers)
    +-- HabitatSurvey        (vegetation, soil, water, climate)
    +-- AcousticRecord       (bioacoustic recording + classification)
    +-- CameraTrapCapture    (image/video + AI classification)

1.2 Observation Schema

{
  "recordId": "bytes32",
  "recordType": "enum(OBSERVATION, GENETIC, HABITAT, ACOUSTIC, CAMERA_TRAP)",
  "taxon": {
    "scientificName": "string",
    "taxonId": "uint64",
    "taxonomySource": "enum(GBIF, ITIS, NCBI, COL)",
    "vernacularName": "string",
    "taxonRank": "enum(KINGDOM, PHYLUM, CLASS, ORDER, FAMILY, GENUS, SPECIES, SUBSPECIES)"
  },
  "location": {
    "latitude": "int64",
    "longitude": "int64",
    "precision": "uint32",
    "elevation": "int32",
    "coordinateSystem": "string",
    "locality": "string",
    "countryCode": "string",
    "sensitivityLevel": "enum(OPEN, FUZZY_10KM, FUZZY_100KM, RESTRICTED)"
  },
  "temporal": {
    "eventDate": "uint64",
    "eventEndDate": "uint64",
    "timeZone": "string"
  },
  "observer": {
    "address": "address",
    "institution": "string",
    "protocol": "string"
  },
  "evidence": {
    "ipfsCids": ["string"],
    "mediaType": ["string"],
    "license": "string"
  },
  "quality": {
    "basisOfRecord": "enum(HUMAN_OBSERVATION, MACHINE_OBSERVATION, PRESERVED_SPECIMEN, MATERIAL_SAMPLE)",
    "verificationStatus": "enum(UNVERIFIED, COMMUNITY_VERIFIED, EXPERT_VERIFIED)",
    "qualityFlags": ["string"]
  },
  "attestation": {
    "signature": "bytes",
    "timestamp": "uint64",
    "deviceFingerprint": "bytes32"
  },
  "ipfsCid": "string",
  "onChainIndex": "bytes32"
}

1.3 Genetic Sample Schema

{
  "recordId": "bytes32",
  "parentObservationId": "bytes32",
  "sequenceType": "enum(DNA_BARCODE, WHOLE_GENOME, METABARCODE, EDNA, RNA)",
  "marker": "string",
  "sequenceCid": "string",
  "qualityMetrics": {
    "phredScore": "uint8",
    "coverage": "float",
    "length": "uint32"
  },
  "accession": {
    "genbank": "string",
    "bold": "string"
  },
  "samplingProtocol": "string",
  "preservationMethod": "string"
}

1.4 Habitat Survey Schema

{
  "recordId": "bytes32",
  "surveyType": "enum(VEGETATION, AQUATIC, SOIL, CLIMATE, MULTI)",
  "plotSize": "uint32",
  "plotShape": "string",
  "measurements": [
    {
      "parameter": "string",
      "value": "float",
      "unit": "string",
      "method": "string"
    }
  ],
  "speciesInventory": ["bytes32"],
  "landCoverClass": "string",
  "habitatType": "string",
  "disturbanceHistory": "string",
  "satelliteImageCid": "string"
}

2. On-Chain Index

The on-chain index stores minimal metadata for discovery and verification. Full records live on IPFS.

2.1 Index Entry

struct IndexEntry {
    bytes32 recordId;
    uint8 recordType;
    uint64 taxonId;
    int64 latitude;       // Degrees * 1e6 (microdegrees)
    int64 longitude;      // Degrees * 1e6 (microdegrees)
    uint64 eventDate;
    address contributor;
    bytes32 ipfsCid;
    uint8 sensitivityLevel;
    uint8 verificationStatus;
    uint64 indexedAt;
}

2.2 Index Contract

interface IBiodiversityIndex {
    function submitRecord(IndexEntry calldata entry, bytes calldata attestation) external;
    function batchSubmit(IndexEntry[] calldata entries, bytes[] calldata attestations) external;
    function updateVerification(bytes32 recordId, uint8 newStatus) external;
    function queryByTaxon(uint64 taxonId) external view returns (bytes32[] memory);
    function queryByBoundingBox(int64 minLat, int64 maxLat, int64 minLon, int64 maxLon) external view returns (bytes32[] memory);
    function queryByContributor(address contributor) external view returns (bytes32[] memory);
    function getRecord(bytes32 recordId) external view returns (IndexEntry memory);
}

Gas costs are minimized by storing only the index on-chain. Spatial queries use a geohash-based Merkle tree for efficient bounding-box lookups.

3. IPFS Storage Layer

3.1 Content Structure

/biodiversity/<recordId>/
    record.json          # Full record per schema above
    media/               # Photos, audio, video
        <hash>.jpg
        <hash>.wav
    provenance.json      # Data lineage chain
    attestation.json     # Cryptographic attestation

3.2 Pinning Requirements

Data Tier	Pinning Requirement	Redundancy
Core records	Zoo pinning cluster + 2 independent services	3x minimum
Media evidence	Zoo pinning cluster + 1 independent service	2x minimum
Genetic sequences	Zoo pinning cluster + GenBank mirror	2x + institutional
Sensitive records	Encrypted, Zoo pinning cluster only	1x + encrypted backup

3.3 Encryption for Sensitive Data

Records with sensitivityLevel > OPEN are encrypted before IPFS upload:

Record encrypted with AES-256-GCM using a per-record key.
Per-record key encrypted to authorized accessor public keys.
Key management via Zoo KMS (ZIP-014) integration.
Location data is fuzzed to the specified precision before indexing.

4. GBIF Interoperability

4.1 Data Mapping

Zoo Field	GBIF Darwin Core Term	Notes
`taxon.scientificName`	`scientificName`	Direct mapping
`taxon.taxonId` (GBIF source)	`taxonKey`	Direct mapping
`location.latitude`	`decimalLatitude`	Convert from microdegrees
`location.longitude`	`decimalLongitude`	Convert from microdegrees
`temporal.eventDate`	`eventDate`	Unix timestamp to ISO 8601
`quality.basisOfRecord`	`basisOfRecord`	Direct enum mapping
`observer.address`	`recordedBy`	Resolve to name via registry
`evidence.license`	`license`	Direct mapping

4.2 Bidirectional Sync

Zoo Database                        GBIF
     |                                |
     |-- Export adapter (DwC-A) ----->|   (Push new Zoo records to GBIF)
     |                                |
     |<--- Import adapter (API) -----|   (Pull GBIF records into Zoo index)
     |                                |
     |-- Deduplication engine ------->|   (Match records by coordinates + date + taxon)

Export: Zoo records with sensitivityLevel = OPEN and license compatible with GBIF are exported as Darwin Core Archive (DwC-A) files and published to GBIF via their IPT (Integrated Publishing Toolkit).
Import: GBIF records relevant to Zoo conservation focus areas are indexed (metadata only) to provide unified search.
Deduplication: A fuzzy matching engine prevents duplicate records when the same observation exists in both systems.

5. Data Quality Pipeline

Raw submission
    |
    v
Automated validation (schema, coordinates, taxonomy)
    |
    v
AI classification (species ID confidence, anomaly detection)
    |
    v
Community verification (citizen science consensus)
    |
    v
Expert verification (specialist review for flagged records)
    |
    v
Published record (on-chain index updated)

Quality Check	Automated	Description
Schema validation	Yes	All required fields present and valid
Coordinate validation	Yes	Coordinates within country bounds, not in ocean (for terrestrial)
Taxonomy validation	Yes	Scientific name resolves to accepted taxon
Temporal validation	Yes	Date not in future, reasonable for species/location
Duplicate detection	Yes	Fuzzy match against existing records
AI species verification	Yes	Image/audio classified by Zoo AI models
Range check	Yes	Species within known geographic range
Expert review	No	Required for new range records, rare species, genetic data

6. Access Control

Access control follows ZIP-540 data governance principles:

Level	Location Precision	Record Access	Who
OPEN	Exact coordinates	Full record	Anyone
FUZZY_10KM	10km grid cell	Full record minus exact location	Registered users
FUZZY_100KM	100km grid cell	Metadata only	Registered users
RESTRICTED	Country only	Metadata on request	Authorized researchers

Sensitivity levels are set by the contributor and reviewed by data stewards. Records involving CITES Appendix I species default to FUZZY_10KM minimum.

Rationale

IPFS over centralized storage: Content-addressed storage ensures data permanence regardless of any single institution's funding status. The on-chain index provides discovery without storing large payloads on the L2.
Darwin Core compatibility: GBIF's Darwin Core standard is the de facto standard for biodiversity data exchange. Native compatibility ensures Zoo data is immediately usable by the global biodiversity research community.
Layered access control: Conservation data requires nuanced access. Exact locations of critically endangered species are weaponized by poachers. The four-tier system balances openness with protection.
AI in the quality pipeline: Zoo's AI models for species identification can pre-screen submissions, reducing the burden on expert reviewers while catching common errors (misidentified species, GPS drift).
On-chain attestation: Cryptographic attestation of observations creates an immutable provenance chain, critical for scientific reproducibility and for training AI models with verified data (per LP-7102).

Security Considerations

Location data leakage: Even fuzzy coordinates can be de-anonymized through intersection attacks with other datasets. The protocol requires minimum fuzz levels per IUCN threat category and prohibits exact coordinates for Critically Endangered species.
Genetic data sensitivity: Genetic sequences for commercially valuable species (e.g., medicinal plants) could be exploited for biopiracy. Access controls and benefit-sharing agreements per the Nagoya Protocol are required for genetic records.
Data poisoning: Malicious actors could submit false observations to corrupt AI training data. The multi-stage quality pipeline (automated + community + expert) and contributor reputation (ZIP-602) mitigate this risk.
Attestation forgery: Device fingerprinting and GPS spoofing are possible. The protocol treats attestation as evidence of submission, not proof of truth. Verification status is the authoritative quality signal.
IPFS content removal: While IPFS is designed for permanence, unpinned content will eventually be garbage collected. The triple-redundancy pinning requirement and institutional mirrors (GenBank, GBIF) provide resilience.
Smart contract vulnerabilities: The index contract handles no funds but controls access to sensitive location data. Access control logic must be audited and tested against privilege escalation attacks.

References

ZIP-540: Research Ethics & Data Governance
ZIP-560: Evidence Locker Index
ZIP-600: DeSci Protocol Framework
ZIP-602: Citizen Science Contribution Protocol
HIP-0074: Hanzo Software Bill of Materials
LP-7102: Lux Immutable Training Ledger
GBIF. "Darwin Core Standard." https://dwc.tdwg.org/
Wilkinson, M. D. et al. "The FAIR Guiding Principles for scientific data management and stewardship." Scientific Data 3, 160018 (2016).
CITES. "Convention on International Trade in Endangered Species." https://cites.org/
CBD. "Nagoya Protocol on Access and Benefit-sharing." https://cbd.int/abs/

Changelog

Version	Date	Changes
1.0	2025-01-15	Initial draft

Open Biodiversity Database Standard