Michał Koziarski

Aviv Regev

Man-Wah Tan

Tommaso Biancalani

2025-10-23

Nature Biotechnology (unknown)

Learning Decision Trees as Amortized Structure Inference

Mohammed Mahfoud

Ghait Boukachab

Stefan Bauer

Nikolay Malkin

2025-03-04

ICLR.cc/2025/Workshop/FPI (poster)

Action Abstractions for Amortized Sampling

Lena Nehale Ezzine

As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignm… (see more)ent and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which take many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and `chunking' them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency performance in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.

2025-01-21

International Conference on Learning Representations (poster)

RGFN: Synthesizable Molecular Generation Using GFlowNets

Andrei Rekesh

Dmytro Shevchuk

Almer Van Der Sloot

Piotr Gainski

Cheng-Hao Liu

Mike Tyers

Robert A. Batey

Generative models hold great promise for small molecule discovery, significantly increasing the size of search space compared to traditional… (see more) in silico screening libraries. However, most existing machine learning methods for small molecule generation suffer from poor synthesizability of candidate compounds, making experimental validation difficult. In this paper we propose Reaction-GFlowNet (RGFN), an extension of the GFlowNet framework that operates directly in the space of chemical reactions, thereby allowing out-of-the-box synthesizability while maintaining comparable quality of generated candidates. We demonstrate that with the proposed set of reactions and building blocks, it is possible to obtain a search space of molecules orders of magnitude larger than existing screening libraries coupled with low cost of synthesis. We also show that the approach scales to very large fragment libraries, further increasing the number of potential molecules. We demonstrate the effectiveness of the proposed approach across a range of oracle models, including pretrained proxy models and GPU-accelerated docking.

2024-09-24

NeurIPS.cc/2024/Conference (poster)

A high-throughput phenotypic screen combined with an ultra-large-scale deep learning-based virtual screening reveals novel scaffolds of antibacterial compounds

Gabriele Scalia

Steven T. Rutherford

Ziqing Lu

Kerry R. Buchholz

Nicholas Skelton

Kangway Chuang

Nathaniel Diamant

Jan-Christian Hütter

Jerome-Maxim Luescher

Anh Miu

Jeff Blaney

Leo Gendelev

Elizabeth Skippington

Greg Zynda

Nia Dickson

Aviv Regev

Man-Wah Tan

Tommaso Biancalani

The proliferation of multi-drug-resistant bacteria underscores an urgent need for novel antibiotics. Traditional discovery methods face chal… (see more)lenges due to limited chemical diversity, high costs, and difficulties in identifying structurally novel compounds. Here, we explore the integration of small molecule high-throughput screening with a deep learning-based virtual screening approach to uncover new antibacterial compounds. Leveraging a diverse library of nearly 2 million small molecules, we conducted comprehensive phenotypic screening against a sensitized Escherichia coli strain that, at a low hit rate, yielded thousands of hits. We trained a deep learning model, GNEprop, to predict antibacterial activity, ensuring robustness through out-of-distribution generalization techniques. Virtual screening of over 1.4 billion compounds identified potential candidates, of which 82 exhibited antibacterial activity, illustrating a 90X improved hit rate over the high-throughput screening experiment GNEprop was trained on. Importantly, a significant portion of these newly identified compounds exhibited high dissimilarity to known antibiotics, indicating promising avenues for further exploration in antibiotic discovery.

2024-09-13

bioRxiv (preprint)

Cell Morphology-Guided Small Molecule Generation with GFlowNets

Stephen Zhewen Lu

Ziqing Lu

Ehsan Hajiramezanali

Tommaso Biancalani

Gabriele Scalia

2024-06-15

ICML.cc/2024/Workshop/AI4Science (poster)

Towards DNA-Encoded Library Generation with GFlowNets

Mohammed Abukalam

Vedant Shah

Louis Vaillancourt

Doris Alexandra Schuetz

Moksh Jain

Almer Van Der Sloot

Mathieu Bourgey

Anne Marinier

2024-03-03

GEM @ International Conference on Learning Representations (poster)

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini

Shenyang Huang

Joao Alex Cunha

Zhiyi Li

Gabriela Moisescu-Pareja

Oleksandr Dymov

Samuel Maddrell-Mander

Callum McLean

Jama Hussein Mohamud

Michael Craig

Cristian Gabellini

Kerstin Klasers

Josef Dean

Cas Wognum … (see 15 more)

Maciej Sypetkowski

Ioannis Koutis

Hadrien Mary

Therence Bois

Andrew Fitzgibbon

Błażej Banaszewski

Chad Martin

Dominic Masters

Recently, pre-trained foundation models have shown significant advancements in multiple fields. However, the lack of datasets with labeled f… (see more)eatures and codebases has hindered the development of a supervised foundation model for molecular tasks. Here, we have carefully curated seven datasets specifically tailored for node- and graph-level prediction tasks to facilitate supervised learning on molecules. Moreover, to support the development of multi-task learning on our proposed datasets, we created the Graphium graph machine learning library. Our dataset collection encompasses two distinct categories. Firstly, the TOYMIX category modifies three small existing datasets with additional data for multi-task learning. Secondly, the LARGEMIX category includes four large-scale datasets with 344M graph-level data points and 409M node-level data points from ∼5M unique molecules. Finally, the ultra-large dataset contains 2,210M graph-level data points and 2,031M node-level data points coming from 86M molecules. Hence our datasets represent an order of magnitude increase in data volume compared to other 2D-GNN datasets. In addition, recognizing that molecule-related tasks often span multiple levels, we have designed our library to explicitly support multi-tasking, offering a diverse range of multi-level representations, i.e., representations at the graph, node, edge, and node-pair level. We equipped the library with an extensive collection of models and features to cover different levels of molecule analysis. By combining our curated datasets with this versatile library, we aim to accelerate the development of molecule foundation models. Datasets and code are available at https://github.com/datamol-io/graphium.

2024-01-15

ICLR.cc/2024/Conference (poster)

Crystal-GFN: sampling materials with desirable properties and constraints

Mistal

Alexandra Volokhova

Alexandre AGM Duval

2023-10-26

NeurIPS.cc/2023/Workshop/AI4Mat (spotlight)

Towards equilibrium molecular conformation generation with GFlowNets

Alexandra Volokhova

Cheng-Hao Liu

Santiago Miret

Pablo Lemos

Luca Thiede

Zichao Yan

Alán Aspuru-Guzik

Sampling diverse, thermodynamically feasible molecular conformations plays a crucial role in predicting properties of a molecule. In this pa… (see more)per we propose to use GFlowNet for sampling conformations of small molecules from the Boltzmann distribution, as determined by the molecule's energy. The proposed approach can be used in combination with energy estimation methods of different fidelity and discovers a diverse set of low-energy conformations for highly flexible drug-like molecules. We demonstrate that GFlowNet can reproduce molecular potential energy surfaces by sampling proportionally to the Boltzmann distribution.

2023-10-26

NeurIPS.cc/2023/Workshop/AI4Mat (poster)

Crystal-GFN: sampling crystals with desirable properties and constraints

Alexandre AGM Duval

Alexandra Volokhova

Accelerating material discovery holds the potential to greatly help mitigate the climate crisis. Discovering new solid-state materials such … (see more)as electrocatalysts, super-ionic conductors or photovoltaic materials can have a crucial impact, for instance, in improving the efficiency of renewable energy production and storage. In this paper, we introduce Crystal-GFN, a generative model of crystal structures that sequentially samples structural properties of crystalline materials, namely the space group, composition and lattice parameters. This domain-inspired approach enables the flexible incorporation of physical and structural hard constraints, as well as the use of any available predictive model of a desired physicochemical property as an objective function. To design stable materials, one must target the candidates with the lowest formation energy. Here, we use as objective the formation energy per atom of a crystal structure predicted by a new proxy machine learning model trained on MatBench. The results demonstrate that Crystal-GFN is able to sample highly diverse crystals with low (median -3.1 eV/atom) predicted formation energy.

2023-10-06

ArXiv (preprint)