Alexander Tong

Frances H. Arnold

Cheng-Hao Liu

Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp

2026-04-05

arXiv (preprint)

MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data

Xingzhi Sun

João Felipe Rocha

Brett Phelan

Dhananjay Bhaskar

Guillaume Huguet

Yanlei Zhang

D. S. Magruder

Ke Xu

Oluwadamilola Fasina

Mark Gerstein

Guy Wolf

Natalia Ivanova

Christine L. Chaffer

Smita Krishnaswamy

Understanding cellular trajectories via time-resolved single-cell transcriptomics is vital for studying development, regeneration, and disease. A key challenge is inferring continuous trajectories from discrete snapshots. Biological complexity stems from stochastic cell fate decisions, temporal proliferation changes, and spatial environmental influences. Current methods often use deterministic interpolations treating cells in isolation, failing to capture the probabilistic branching, population shifts, and niche-dependent signaling driving real biological processes. We introduce Manifold Interpolating Optimal-Transport Flow (MIOFlow) 2.0. This framework learns biologically informed cellular trajectories by integrating manifold learning, optimal transport, and neural differential equations. It models three core processes: (1) stochasticity and branching via Neural Stochastic Differential Equations; (2) non-conservative population changes using a learned growth-rate model initialized with unbalanced optimal transport; and (3) environmental influence through a joint latent space unifying gene expression with spatial features like local cell type composition and signaling. By operating in a PHATE-distance matching autoencoder latent space, MIOFlow 2.0 ensures trajectories respect the data's intrinsic geometry. Empirical comparisons show expressive trajectory learning via neural differential equations outperforms existing generative models, including simulation-free flow matching. Validated on synthetic datasets, embryoid body differentiation, and spatially resolved axolotl brain regeneration, MIOFlow 2.0 improves trajectory accuracy and reveals hidden drivers of cellular transitions, like specific signaling niches. MIOFlow 2.0 thus bridges single-cell and spatial transcriptomics to uncover tissue-scale trajectories.

2026-03-22

arXiv (preprint)

Autoregressive Boltzmann Generators

Charlie B. Tan

Efficient sampling of molecular systems at thermodynamic equilibrium is a hallmark challenge in statistical physics. This challenge has driven the development of Boltzmann Generators (BGs), which allow rapid generation of uncorrelated equilibrium samples by combining a generative model with exact likelihoods and an importance sampling correction. However, modern BGs predominantly rely on normalizing flows (NFs), which either suffer from limited expressivity due to strict invertibility constraints (discrete time) or computationally expensive likelihoods (continuous time). In this paper, we propose Autoregressive Boltzmann Generators (ArBG), a novel autoregressive modelling framework that overcomes these limitations by departing from the flow-based BG paradigm. ArBG circumvents the topological constraints of flows and enables sequential inference-time interventions, while offering enhanced scalability by leveraging architectures effective in Large Language Models. We empirically demonstrate that ArBG leads to significant improvements over flow-based models across all benchmarks, but particularly in larger peptide systems such as the 10-residue Chignolin. Furthermore, we introduce Robin, a 132 million parameter transferable model trained with the ArBG framework which improves over the previous state-of-the-art, reducing the zero-shot energy error,

2026-03-01

GEM @ International Conference on Learning Representations (published)

openreview.net
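
The importance sampling correction mentioned in the abstract above can be illustrated with a minimal sketch. It is not the ArBG method: the double-well `energy`, the Gaussian proposal standing in for a generative model with exact likelihoods, and every function name here are assumptions chosen for illustration. Samples from the proposal are reweighted by the ratio of the Boltzmann factor to the proposal density, using self-normalized weights.

```python
import math
import random

def energy(x):
    # Toy double-well potential; a hypothetical stand-in for a molecular energy.
    return (x * x - 1.0) ** 2

def proposal_log_prob(x, sigma=1.5):
    # Exact log-density of a zero-mean Gaussian, playing the role of the
    # generative model with tractable likelihoods (an assumption of this sketch).
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def reweighted_mean(f, n=20000, sigma=1.5, kT=1.0, seed=0):
    """Self-normalized importance sampling estimate of E[f(x)] under exp(-U/kT)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma) for _ in range(n)]
    # Unnormalized log-weights: log Boltzmann factor minus proposal log-density.
    log_w = [-energy(x) / kT - proposal_log_prob(x, sigma) for x in xs]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    estimate = sum(wi * f(x) for wi, x in zip(w, xs)) / total
    # Effective sample size: a common diagnostic of proposal quality.
    ess = total ** 2 / sum(wi * wi for wi in w)
    return estimate, ess

if __name__ == "__main__":
    mean_x, ess = reweighted_mean(lambda x: x)
    print(f"E[x] estimate: {mean_x:.3f}, effective sample size: {ess:.0f}")
```

Because the toy potential is symmetric, the estimate of E[x] should be close to zero. The effective sample size shrinks as the proposal drifts away from the target, which is one reason a more expressive generative model matters in this setting.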

Hidden sampling biases inflate performance in gene regulatory network inference

Marco Stock

Florin Ratajczak

Paul Bertin

Eva Hoermanseder

Yoshua Bengio

Jason Hartford

Pascal Falter-Braun

Matthias Heinig

Antonio Scialdone

Accurate reconstruction of gene regulatory networks (GRNs) from single-cell transcriptomic data remains a major methodological challenge. Recent machine learning approaches, particularly graph neural networks and graph autoencoders, have reported improved performance, yet these gains do not consistently translate to realistic biological settings. Here, we show that a key reason for that is the way negative regulatory interactions are sampled for supervised training and evaluation. We find that widely used sampling strategies introduce node-degree biases that allow models to exploit trivial graph-structural cues rather than biological signals. Across multiple benchmarks, simple degree-based heuristics match or exceed state-of-the-art graph neural network models under these biased evaluation protocols. We further introduce a degree-aware sampling approach that eliminates these artifacts and provides more reliable assessments of GRN inference methods. Our results call for standardized, bias-aware benchmarking practices to ensure meaningful progress in supervised GRN inference from single-cell RNA-seq data.

2025-12-22

bioRxiv (preprint)
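
To make the idea of a degree-based heuristic concrete, here is a minimal sketch. The abstract above does not say which heuristic the authors use, so the scoring rule (out-degree of the regulator times in-degree of the target, counted on training edges only) and all names are assumptions for illustration. The scorer never looks at expression data, which is why strong performance from such a baseline points to bias in how negatives were sampled.

```python
from collections import Counter

def degree_scores(train_edges, candidate_pairs):
    """Score candidate (regulator, target) pairs from training-graph degrees alone.

    train_edges: iterable of (regulator, target) positive edges.
    candidate_pairs: iterable of (regulator, target) pairs to rank.
    """
    train_edges = list(train_edges)
    out_deg = Counter(reg for reg, _ in train_edges)   # targets per regulator
    in_deg = Counter(tgt for _, tgt in train_edges)    # regulators per target
    # Counter returns 0 for unseen nodes, so unknown genes score 0.
    return {(reg, tgt): out_deg[reg] * in_deg[tgt] for reg, tgt in candidate_pairs}

if __name__ == "__main__":
    edges = [("TF_A", "g1"), ("TF_A", "g2"), ("TF_B", "g1")]
    pairs = [("TF_A", "g1"), ("TF_B", "g2"), ("TF_C", "g1")]
    for pair, score in degree_scores(edges, pairs).items():
        print(pair, score)
```

If negatives are drawn uniformly from all non-edges, they tend to involve low-degree nodes, so a scorer like this can separate positives from negatives without using any biological signal.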

FALCON: Few-step Accurate Likelihoods for Continuous Flows

Artem Gazizov

2025-12-09

arXiv (preprint)

OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction

Emily Jin

Andrei Cristian Nica

Mikhail Galkin

Jarrid Rector-Brooks

Kin Long Kelvin Lee

Santiago Miret

Frances H. Arnold

Michael M. Bronstein

Avishek Bose

Cheng-Hao Liu

2025-12-06

arXiv (preprint)

Foundations of Diffusion Models in General State Spaces: A Self-Contained Introduction

Vincent Pauline

Kirill Neklyudov

2025-12-03

arXiv (preprint)

Curly Flow Matching for Learning Non-gradient Field Dynamics

Katarina Petrović

Lazar Atanackovic

Viggo Moro

Kacper Kapuśniak

İsmail İlkan Ceylan

Michael Bronstein

Avishek Joey Bose

Modeling the transport dynamics of natural processes from population-level observations is a ubiquitous problem in the natural sciences. Such models rely on key assumptions about the underlying process in order to enable faithful learning of governing dynamics that mimic the actual system behavior. The de facto assumption in current approaches relies on the principle of least action that results in gradient field dynamics and leads to trajectories minimizing an energy functional between two probability measures. However, many real-world systems, such as cell cycles in single-cell RNA, are known to exhibit non-gradient, periodic behavior, which fundamentally cannot be captured by current state-of-the-art methods such as flow and bridge matching. In this paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is capable of learning non-gradient field dynamics by designing and solving a Schrödinger bridge problem with a non-zero drift reference process (in stark contrast to typical zero-drift reference processes), which is constructed using inferred velocities in addition to population snapshot data. We showcase Curly-FM by solving the trajectory inference problems for single cells, computational fluid dynamics, and ocean currents with approximate velocities. We demonstrate that Curly-FM can learn trajectories that better match both the reference process and population marginals. Curly-FM expands flow matching models beyond the modeling of populations and towards the modeling of known periodic behavior in physical systems. Our code repository is accessible at: https://github.com/kpetrovicc/curly-flow-matching.git

2025-12-02

Conference on Neural Information Processing Systems (Accept (poster))

openreview.net

Curly Flow Matching for Learning Non-gradient Field Dynamics

Katarina Petrović

Lazar Atanackovic

Viggo Moro

Kacper Kapuśniak

İsmail İlkan Ceylan

Michael M. Bronstein

Avishek Bose

Modeling the transport dynamics of natural processes from population-level observations is a ubiquitous problem in the natural sciences. Such models rely on key assumptions about the underlying process in order to enable faithful learning of governing dynamics that mimic the actual system behavior. The de facto assumption in current approaches relies on the principle of least action that results in gradient field dynamics and leads to trajectories minimizing an energy functional between two probability measures. However, many real-world systems, such as cell cycles in single-cell RNA, are known to exhibit non-gradient, periodic behavior, which fundamentally cannot be captured by current state-of-the-art methods such as flow and bridge matching. In this paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is capable of learning non-gradient field dynamics by designing and solving a Schrödinger bridge problem with a non-zero drift reference process (in stark contrast to typical zero-drift reference processes), which is constructed using inferred velocities in addition to population snapshot data. We showcase Curly-FM by solving the trajectory inference problems for single cells, computational fluid dynamics, and ocean currents with approximate velocities. We demonstrate that Curly-FM can learn trajectories that better match both the reference process and population marginals. Curly-FM expands flow matching models beyond the modeling of populations and towards the modeling of known periodic behavior in physical systems. Our code repository is accessible at: https://github.com/kpetrovicc/curly-flow-matching.git

2025-10-29

arXiv (preprint)

Neural FIM: Bridging Statistical Manifolds and Generative Modeling through Fisher Geometry

Yanlei Zhang

Guillaume Huguet

Edward De Brouwer

Danqi Liao

Oluwadamilola Fasina

Ricky T. Q. Chen

Guy Wolf

Maximilian Nickel

Ian Adelstein

Smita Krishnaswamy

While data diffusion-based embeddings are widely used in unsupervised learning to reveal the intrinsic geometry of data, they are fundamentally constrained by their discrete nature and inability to generalize beyond training points. This limitation ob

2025-10-20

TechRxiv (accepted)

Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts

Marta Skreta

Tara Akhound-Sadegh

Viktor Ohanesian

Roberto Bondesan

Alán Aspuru-Guzik

Arnaud Doucet

Rob Brekelmans

Kirill Neklyudov

While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional 'corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Scalable Equilibrium Sampling with Sequential Boltzmann Generators

Charlie B. Tan

Avishek Joey Bose

Chen Lin

Leon Klein

Michael M. Bronstein

Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann generators tackle this problem by pairing normalizing flows with importance sampling to obtain uncorrelated samples under the target distribution. In this paper, we extend the Boltzmann generator framework with two key contributions, denoting our framework Sequential Boltzmann generators (SBG). The first is a highly efficient Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. In contrast to the equivariant continuous flows of prior methods, we leverage exactly invertible non-equivariant architectures which are highly efficient during both sample generation and likelihood evaluation. This efficiency unlocks more sophisticated inference strategies beyond standard importance sampling. In particular, we perform inference-time scaling of flow samples using a continuous-time variant of sequential Monte Carlo, in which flow samples are transported towards the target distribution with annealed Langevin dynamics. SBG achieves state-of-the-art performance w.r.t. all metrics on peptide systems, demonstrating the first equilibrium sampling in Cartesian coordinates of tri-, tetra- and hexa-peptides that were thus far intractable for prior Boltzmann generators.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)