ICML 2026 AI for Science — Submitted

A World Model for Biological and Climate Discovery

One pretrained architecture designed to predict disease dynamics, drug-protein interactions, genomic regulation, and Earth-system dynamics under real-world partial observations. To be released as open source for every research institution, regardless of geography or budget.

Preliminary result. The pretrained model comes within 1–4% of specialist-model accuracy on out-of-domain behavioral dynamics datasets, without fine-tuning or domain adaptation. This supports the hypothesis that graph-structured latent world models capture domain-invariant state-transition structure.

Mission

Foundation-scale science, open to every institution.

AI for biology has bypassed most of the world. Existing tools are domain-locked, observation-hungry, and concentrated in a handful of regions. We are building a single open foundation model that learns the shared dynamical structure underlying proteins, diseases, genomic processes, and atmospheric systems, then releasing every weight, dataset, and recipe so any laboratory can adapt it to local questions.

3.7B people lack access to basic diagnostic services. (Source: Lancet Commission on Diagnostics, 2021)
1.7B people are affected by neglected tropical diseases, with almost zero dedicated AI tools. (Source: WHO; 10/90 R&D gap)
~85% of frontier AI publications concentrate in four regions, leaving most of the world structurally underserved. (Source: Stanford AI Index)

Architecture

One graph-native world model. Seven layers. Domain-agnostic by design.

Every domain is translated into a universal graph at the boundary by the encoder. Every downstream component (latent dynamics, belief inference, plausibility estimation, latent reasoning, retrieval, and decision) operates on the same latent representation, which is what makes the same pretrained model usable for proteins, climate fields, or clinical trajectories.

1. Graph State Encoder

Maps any physical, biological, or climate system into a universal typed-graph representation. This is the only domain-specific component.
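
As a rough illustration of what a typed-graph boundary could look like, the sketch below defines a minimal container in plain Python. The class names (`Node`, `Edge`, `TypedGraph`), the type vocabularies, and the drug-target fragment are all hypothetical; the project's actual encoder and schema are not described on this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str        # e.g. "protein", "compound", "grid_cell" (hypothetical vocabulary)
    features: tuple = ()  # numeric features; how they are computed is domain-specific


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str        # e.g. "binds", "adjacent_to" (hypothetical vocabulary)


@dataclass
class TypedGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        # Reject edges whose endpoints were never declared.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def neighbors(self, node_id: str, edge_type: Optional[str] = None) -> list:
        """Return targets of outgoing edges, optionally filtered by edge type."""
        return [
            e.target
            for e in self.edges
            if e.source == node_id and (edge_type is None or e.edge_type == edge_type)
        ]


# Illustrative drug-target fragment; identifiers and feature values are made up.
g = TypedGraph()
g.add_node(Node("compound:A", "compound", (0.3, 1.2)))
g.add_node(Node("protein:X", "protein", (0.9,)))
g.add_edge(Edge("compound:A", "protein:X", "binds"))
```

Under this assumption, a climate field would use the same container with different node and edge types, which is the sense in which only the boundary is domain-specific.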

2. Latent Dynamics Module

JEPA-style (joint-embedding predictive) forecasting in a compressed latent space, aimed at physically coherent long-horizon rollouts.
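
The idea of predicting in latent space rather than in observation space can be sketched as follows. This is a toy that assumes fixed linear maps for the encoder and predictor; the weights, dimensions, and function names are hypothetical placeholders, and a real JEPA-style setup would use learned networks and a separate target encoder.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical fixed weights: the encoder maps 3-d observations to 2-d latents,
# and the predictor advances a latent by one time step.
ENCODER = [[1.0, 0.0, 0.5],
           [0.0, 1.0, -0.5]]
PREDICTOR = [[0.9, 0.1],
             [-0.1, 0.9]]


def encode(observation):
    return matvec(ENCODER, observation)


def latent_prediction_loss(obs_t, obs_next):
    """Squared error between the predicted and the encoded next latent.

    The loss is measured in latent space; the observation itself is never
    reconstructed. In training, the target branch would typically be held
    fixed (stop-gradient), which this gradient-free toy does not model.
    """
    predicted = matvec(PREDICTOR, encode(obs_t))
    target = encode(obs_next)
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def rollout(obs_0, steps):
    """Long-horizon rollout: apply the predictor repeatedly in latent space."""
    z = encode(obs_0)
    trajectory = [z]
    for _ in range(steps):
        z = matvec(PREDICTOR, z)
        trajectory.append(z)
    return trajectory
```

Calling `rollout(obs_0, steps)` returns `steps + 1` latent states, starting with the encoding of `obs_0`; no decoding back to observation space happens along the way.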

3. Belief Inference

Bayesian uncertainty maintained over partial and noisy observations — built for the 30 to 50% completeness typical of real clinical data.
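
One simple way to picture Bayesian belief maintenance under missing data is a scalar Kalman-style filter in which a missing reading skips the update step. The sketch below assumes a random-walk state model with made-up noise variances and readings; it is a textbook stand-in, not the project's belief-inference module.

```python
def belief_update(mean, var, observation, process_var=0.1, obs_var=0.5):
    """One predict/update step of a scalar Gaussian belief (random-walk model).

    `observation` may be None to represent a missing measurement, in which
    case only the predict step runs and the uncertainty grows.
    """
    # Predict: the state may drift between steps, so the variance increases.
    var = var + process_var
    if observation is None:
        return mean, var
    # Update: blend prediction and observation by their relative precision.
    gain = var / (var + obs_var)
    mean = mean + gain * (observation - mean)
    var = (1.0 - gain) * var
    return mean, var


# Hypothetical series in which 2 of 5 readings (40%) are present.
readings = [None, 2.1, None, None, 1.9]
mean, var = 0.0, 1.0
history = []
for reading in readings:
    mean, var = belief_update(mean, var, reading)
    history.append((mean, var))
```

In `history`, the variance rises across each run of missing readings and drops whenever a reading arrives, so downstream components can see how much to trust the current state estimate.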

4. Plausibility Estimator

Energy-based contrastive filter that enforces physical and biological consistency on every rollout step.
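
A contrastive energy filter can be sketched with a hinge loss that pushes the energy of real transitions below that of corrupted ones. In the toy below the energy function is a hand-written conservation check (the sum of the state components should not change), which is purely an assumption for illustration; in a learned estimator the energy would be a trained network, and the margin and threshold values here are arbitrary.

```python
def energy(state, next_state):
    """Hypothetical energy: squared change in a conserved total (sum of components).

    Lower energy means the transition is more plausible under this toy constraint.
    """
    return (sum(next_state) - sum(state)) ** 2


def contrastive_margin_loss(state, real_next, corrupted_next, margin=1.0):
    """Hinge loss that is zero once real transitions sit `margin` below corrupted ones."""
    return max(0.0, margin + energy(state, real_next) - energy(state, corrupted_next))


def is_plausible(state, next_state, threshold=0.05):
    """Filter a rollout step by thresholding its energy."""
    return energy(state, next_state) <= threshold


state = [1.0, 2.0]
real_next = [1.5, 1.5]        # total is conserved
corrupted_next = [1.5, 2.0]   # total drifts by 0.5
```

With these values, `is_plausible(state, real_next)` passes and `is_plausible(state, corrupted_next)` fails, while the hinge loss stays positive because the corrupted sample is not yet a full margin above the real one.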

5. Continuous Latent Reasoning

COCONUT-style language model reasoning bidirectionally coupled to simulation dynamics. Hypotheses condition the next forward step.

6. Knowledge Graph Retrieval

GraphRAG over biological and climate knowledge graphs surfaces prior mechanisms and analogous cases at every iteration.
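
The retrieval step can be pictured, in a much simpler form than GraphRAG itself, as a k-hop neighborhood lookup around a seed entity. The triples, relation names, and the `k_hop_facts` helper below are hypothetical, and the sketch omits the summarization and ranking stages that a full graph-retrieval pipeline would include.

```python
from collections import deque

# Hypothetical knowledge-graph triples (subject, relation, object).
TRIPLES = [
    ("compound:A", "binds", "protein:X"),
    ("protein:X", "participates_in", "pathway:P"),
    ("pathway:P", "associated_with", "disease:D"),
    ("compound:B", "binds", "protein:Y"),
]


def k_hop_facts(seed, k, triples=TRIPLES):
    """Collect triples reachable from `seed` within `k` hops, ignoring edge direction."""
    adjacency = {}
    for triple in triples:
        subject, _, obj = triple
        adjacency.setdefault(subject, []).append((obj, triple))
        adjacency.setdefault(obj, []).append((subject, triple))

    seen_nodes = {seed}
    facts = []
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue  # do not expand beyond the hop limit
        for neighbor, triple in adjacency.get(node, []):
            if triple not in facts:
                facts.append(triple)
            if neighbor not in seen_nodes:
                seen_nodes.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts
```

For example, `k_hop_facts("compound:A", 2)` returns the binding fact and the pathway membership of the bound protein, and never touches the unrelated `compound:B` triple.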

7. Decision & Policy Head

Outputs a portfolio of recommended next experiments, scenarios, or interventions with probability and provenance.
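
A portfolio with probability and provenance could be represented as a ranked list of small records. The `Recommendation` fields, the `build_portfolio` helper, and the example candidates below are hypothetical; ranking by probability alone is an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    action: str          # proposed experiment, scenario, or intervention
    probability: float   # model-estimated chance of success, in [0, 1]
    provenance: tuple    # identifiers of the evidence behind the estimate


def build_portfolio(candidates, size):
    """Return the `size` highest-probability candidates, best first."""
    for c in candidates:
        if not 0.0 <= c.probability <= 1.0:
            raise ValueError(f"probability out of range for {c.action!r}")
    return sorted(candidates, key=lambda c: c.probability, reverse=True)[:size]


# Made-up candidates for illustration only.
candidates = [
    Recommendation("assay compound:A against protein:X", 0.62, ("kg:triple-1",)),
    Recommendation("assay compound:B against protein:Y", 0.35, ("kg:triple-4",)),
    Recommendation("rerun regional scenario S", 0.48, ("rollout:17", "kg:triple-3")),
]
portfolio = build_portfolio(candidates, size=2)
```

Keeping provenance on each record means a reviewer can trace every recommendation back to the rollouts or knowledge-graph facts that motivated it.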

Pretraining & Data

Built on The Well, extended to biology and climate.

Pretraining begins on a curated multi-terabyte subset of The Well (Polymathic AI, NeurIPS 2024). The full collection is 15 TB across 16 simulation families, including fluid dynamics, magnetohydrodynamics, acoustic scattering, and active matter. Biological and climate adaptation layers integrate established open data sources.

Physics Pretraining

The Well: 16 simulation families. Active matter dataset bridges to biological dynamics.

Biology

UniProt protein sequences and annotations. ChEMBL drug-target interactions. Open Targets target-disease associations. PDB protein structures.

Climate

CMIP6 climate model simulations. ERA5 atmospheric reanalysis. Partner-curated regional climate-health datasets.

Stack

PyTorch · PyG · HuggingFace Transformers · Google Cloud TPUs.

Open Science

Every artifact, every release, openly licensed.

Open release is part of the project's design from the start, not retrofitted. Weights, code, datasets, and biological knowledge graphs will all be openly licensed and mirrorable by any partner institution globally.

Model weights: OpenRAIL or Apache 2.0
Training & inference code: Apache 2.0
Datasets & knowledge graphs: CC-BY 4.0
Publications: Open Access

Roadmap

From pretraining today to a sustained scientific commons.

  1. Now: Pretraining in progress. Multi-domain pretraining on a curated subset of The Well across all 16 simulation families. Architecture under peer review at ICML 2026 AI for Science.
  2. Months 1–6: Biological domain integration. Biological knowledge graph covering 5M+ entity relationships. Graph constructors for UniProt, Open Targets, and ChEMBL. Initial zero-shot transfer evaluation. Scientific advisory board convened.
  3. Months 7–18: Belief inference and closed-loop coupling. Belief inference under realistic clinical observation rates. Bidirectional coupling between the language model and simulation dynamics. First peer-reviewed publication. Beta release to partner institutions.
  4. Months 19–24: Climate integration and public release. Climate-health system integration (CMIP6, ERA5). Full public open release under OpenRAIL, Apache 2.0, and CC-BY 4.0. Target: adoption by 100+ research groups within three months of release.
  5. Months 25–36: Sustained scientific commons. Independent validation of top-ranked drug-target hypotheses. Target: 200+ institutions in 40+ countries. Open governance steering committee. Follow-on funding sought from CZI Virtual Cell, Wellcome, NIH Bridge2AI, and NSF AI Institutes.

Get involved

Open to academic partners, computational biology and climate-health collaborators, and institutions interested in adapting the foundation model to local research questions. Outreach especially welcome from under-resourced regions.

Get in Touch View Repository