Dominique Beaini

Graph Positional and Structural Encoder

Renming Liu

Semih Cantürk

Olivier Lapointe-Gagné

Vincent Létourneau

Guy Wolf

Ladislav Rampášek

Positional and structural encodings (PSE) enable better identifiability of nodes within a graph, as in general graphs lack a canonical node … (voir plus)ordering. This renders PSEs essential tools for empowering modern GNNs, and in particular graph Transformers. However, designing PSEs that work optimally for a variety of graph prediction tasks is a challenging and unsolved problem. Here, we present the graph positional and structural encoder (GPSE), a first-ever attempt to train a graph encoder that captures rich PSE representations for augmenting any GNN. GPSE can effectively learn a common latent representation for multiple PSEs, and is highly transferable. The encoder trained on a particular graph dataset can be used effectively on datasets drawn from significantly different distributions and even modalities. We show that across a wide range of benchmarks, GPSE-enhanced models can significantly improve the performance in certain tasks, while performing on par with those that employ explicitly computed PSEs in other cases. Our results pave the way for the development of large pre-trained models for extracting graph positional and structural information and highlight their potential as a viable alternative to explicitly computed PSEs as well as to existing self-supervised pre-training approaches.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (publié)

Equivariant Flow Matching for Molecular Conformer Generation

Majdi Hassan

Nikhil Shenoy

Jungyoon Lee

Hannes Stärk

Stephan Thaler

2024-06-17

ICML.cc/2024/Workshop/ML4LMS (poster)

Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology

Oren Kraus

Kian Kenyon-Dean

Saber Saberian

Maryam Fallah

Peter McLean

Jess Leung

Vasudev Sharma

Ayla Khan

Jia Balakrishnan

Safiye Celik

Maciej Sypetkowski

Chi Vicky Cheng

Kristen Morse

Maureen Makes

Ben Mabey

Berton Earnshaw

2024-06-16

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (publié)

On the Scalability of GNNs for Molecular Graphs

Maciej Sypetkowski

Frederik Wenkel

Farimah Poursafaei

Nia Dickson

Karush Suri

Philip Fradkin

Scaling deep learning models has been at the heart of recent revolutions in language modelling and image generation. Practitioners have obse… (voir plus)rved a strong relationship between model size, dataset size, and performance. However, structure-based architectures such as Graph Neural Networks (GNNs) are yet to show the benefits of scale mainly due to the lower efficiency of sparse operations, large data requirements, and lack of clarity about the effectiveness of various architectures. We address this drawback of GNNs by studying their scaling behavior. Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs. For the first time, we observe that GNNs benefit tremendously from the increasing scale of depth, width, number of molecules, number of labels, and the diversity in the pretraining datasets. We further demonstrate strong finetuning scaling behavior on 38 highly competitive downstream tasks, outclassing previous large models. This gives rise to MolGPS, a new graph foundation model that allows to navigate the chemical space, outperforming the previous state-of-the-arts on 26 out the 38 downstream tasks. We hope that our work paves the way for an era where foundational GNNs drive pharmaceutical drug discovery.

2024-04-17

ArXiv (prépublication)

Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology

Oren Kraus

Kian Kenyon-Dean

Saber Saberian

Maryam Fallah

Peter McLean

Jess Leung

Vasudev Sharma

Ayla Khan

Jia Balakrishnan

Safiye Celik

Maciej Sypetkowski

Chi Vicky Cheng

Kristen Morse

Maureen Makes

Ben Mabey

Berton Earnshaw

Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spannin… (voir plus)g millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as a 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond. Relevant code and select models released with this work can be found at: https://github.com/recursionpharma/maes_microscopy.

2024-04-16

ArXiv (prépublication)

Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology

Oren Kraus

Kian Kenyon-Dean

Saber Saberian

Maryam Fallah

Peter McLean

Jess Leung

Vasudev Sharma

Ayla Khan

Jia Balakrishnan

Safiye Celik

Maciej Sypetkowski

Chi Vicky Cheng

Kristen Morse

Maureen Makes

Ben Mabey

Berton Earnshaw

Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spannin… (voir plus)g millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as a 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond. Relevant code and select models released with this work can be found at: https://github.com/recursionpharma/maes_microscopy.

2024-04-16

ArXiv (prépublication)

Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology

Oren Kraus

Kian Kenyon-Dean

Saber Saberian

Maryam Fallah

Peter McLean

Jess Leung

Vasudev Sharma

Ayla Khan

Jia Balakrishnan

Safiye Celik

Maciej Sypetkowski

Chi Vicky Cheng

Kristen Morse

Maureen Makes

Ben Mabey

Berton Earnshaw

Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spannin… (voir plus)g millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as a 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond. Relevant code and select models released with this work can be found at: https://github.com/recursionpharma/maes_microscopy.

2024-04-16

ArXiv (prépublication)

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Shenyang Huang

Joao Alex Cunha

Zhiyi Li

Gabriela Moisescu-Pareja

Oleksandr Dymov

Samuel Maddrell-Mander

Callum McLean

Frederik Wenkel

Luis Müller

Jama Hussein Mohamud

Ali Parviz

Michael Craig

Michał Koziarski

Jiarui Lu

Zhaocheng Zhu

Cristian Gabellini

Kerstin Klaser

Josef Dean

Cas Wognum … (voir 15 de plus)

Maciej Sypetkowski

Guillaume Rabusseau

Reihaneh Rabbany

Jian Tang

Christopher Morris

Ioannis Koutis

Mirco Ravanelli

Guy Wolf

Prudencio Tossou

Hadrien Mary

Therence Bois

Andrew William Fitzgibbon

Blazej Banaszewski

Chad Martin

Dominic Masters

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, wh… (voir plus)ere datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on Github and the dataset links are available in Part 1 and Part 2.

2024-01-16

ICLR.cc/2024/Conference (poster)

Latent Space Simulator for Unveiling Molecular Free Energy Landscapes and Predicting Transition Dynamics

Simon Dobers

Hannes Stärk

Xiang Fu

Stephan Günnemann

Free Energy Surfaces (FES) and metastable transition rates are key elements in understanding the behavior of molecules within a system. Howe… (voir plus)ver, the typical approaches require computing force fields across billions of time steps in a molecular dynamics (MD) simulation, which is often considered intractable when dealing with large systems or databases. In this work, we propose LaMoDy, a latent-space MD simulator, to effectively tackle the intractability with around 20-fold speed improvements compared to classical MD. The model leverages a chirality-aware SE(3)-invariant encoder-decoder architecture to generate a latent space coupled with a recurrent neural network to run the time-wise dynamics. We show that LaMoDy effectively recovers realistic trajectories and FES more accurately and faster than existing methods while capturing their major dynamical and conformational properties. Furthermore, the proposed approach can generalize to molecules outside the training distribution.

2023-10-27

NeurIPS.cc/2023/Workshop/AI4Science (poster)

Role of Structural and Conformational Diversity for Machine Learning Potentials

Nikhil Shenoy

Prudencio Tossou

Emmanuel Noutahi

Hadrien Mary

Jiarui Ding

In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically … (voir plus)conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed budget one, where the dataset size remains constant, and a fixed molecular set one, which focuses on fixed structural diversity while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitation of the MLIP models at generalizing beyond their training distribution, emphasizing the importance of defining applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts.

2023-10-27

NeurIPS.cc/2023/Workshop/AI4Science (présentation orale)

GPS++: Reviving the Art of Message Passing for Molecular Property Prediction

Dominic Masters

Josef Dean

Kerstin Klaser

Zhiyi Li

Samuel Maddrell-Mander

Adam Sanders

Hatem Helal

Deniz Beker

Andrew William Fitzgibbon

Shenyang Huang

Ladislav Rampášek

2023-07-30

TMLR (accepté)