Portrait de Dominique Beaini n'est pas disponible

Dominique Beaini

Membre industriel associé
Professeur associé, Université de Montréal, Département d'informatique et de recherche opérationnelle
Chef de la recherche graphique, Valence Discovery
Sujets de recherche
Apprentissage multimodal
Apprentissage sur graphes
Modélisation moléculaire
Réseaux de neurones en graphes

Biographie

Je suis actuellement chef d’équipe de l’unité de recherche de Valence Discovery, l’une des principales entreprises dans le domaine de l’apprentissage automatique appliqué à la découverte de médicaments, et professeur associé au Département d’informatique et de recherche opérationnelle (DIRO) de l’Université de Montréal. Mon objectif est d’amener l’apprentissage automatique vers une meilleure compréhension des molécules et de leurs interactions avec la biologie humaine. Je suis titulaire d’un doctorat de Polytechnique Montréal; mes recherches antérieures portaient sur la robotique et la vision par ordinateur.

Mes intérêts de recherche sont les réseaux neuronaux de graphes, l’apprentissage autosupervisé, la mécanique quantique, la découverte de médicaments, la vision par ordinateur et la robotique.

Étudiants actuels

Maîtrise recherche - UdeM
Doctorat - UdeM

Publications

Amortized Sampling with Transferable Normalizing Flows
Charlie B. Tan
Leon Klein
Saifuddin Syed
Michael M. Bronstein
Kirill Neklyudov
Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Cla… (voir plus)ssical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest towards overcoming this limitation through learning sampling algorithms. Despite performing competitively with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We demonstrate that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 285 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve competitive performance to established methods such as sequential Monte Carlo. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.
Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities
Avishek Joey Bose
Valentin De Bortoli
Arnaud Doucet
Michael M. Bronstein
Kirill Neklyudov
Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact sc… (voir plus)ientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA), a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: I.) Annealing of the Boltzmann distribution and II.) Diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures by sequentially training each model at progressively higher temperatures, leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to procure training samples at a lower temperature for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically lower energy function evaluations. Code available at: https://github.com/taraak/pita
Self-Refining Training for Amortized Density Functional Theory
Cristian Gabellini
Hatem Helal
Kirill Neklyudov
Density Functional Theory (DFT) allows for predicting all the chemical and physical properties of molecular systems from first principles by… (voir plus) finding an approximate solution to the many-body Schrödinger equation. However, the cost of these predictions becomes infeasible when increasing the scale of the energy evaluations, e.g., when calculating the ground-state energy for simulating molecular dynamics. Recent works have demonstrated that, for substantially large datasets of molecular conformations, Deep Learning-based models can predict the outputs of the classical DFT solvers by amortizing the corresponding optimization problems. In this paper, we propose a novel method that reduces the dependency of amortized DFT solvers on large pre-collected datasets by introducing a self-refining training strategy. Namely, we propose an efficient method that simultaneously trains a deep-learning model to predict the DFT outputs and samples molecular conformations that are used as training data for the model. We derive our method as a minimization of the variational upper bound on the KL-divergence measuring the discrepancy between the generated samples and the target Boltzmann distribution defined by the ground state energy. To demonstrate the utility of the proposed scheme, we perform an extensive empirical study comparing it with the models trained on the pre-collected datasets. Finally, we open-source our implementation of the proposed algorithm, optimized with asynchronous training and sampling stages, which enables simultaneous sampling and training. Code is available at https://github.com/majhas/self-refining-dft.
Virtual Cells: Predict, Explain, Discover
Emmanuel Noutahi
Jason Hartford
Ali Denton
Kristina Ulicna
Michael Craig
Jonathan Hsu
Michael Cuccarese
Christopher Gibson
Daniel Cohen
Berton Earnshaw
Scaling Deep Learning Solutions for Transition Path Sampling
Michael Plainer
Yuanqi Du
Rob Brekelmans
Carla P Gomes
Kirill Neklyudov
Transition path sampling (TPS) is an important method for studying rare events, such as they happen in chemical reactions or protein folding… (voir plus). These events occur so infrequently that traditional simulations are often impractical, and even recent machine-learning approaches struggle to address this issue for larger systems. In this paper, we propose using modern deep learning techniques to improve the scalability of TPS methods significantly. We highlight the need for better evaluations in the existing literature and start by formulating TPS as a sampling problem over an unnormalized target density and introduce relevant evaluation metrics to assess the effectiveness of TPS solutions from this perspective. To develop a scalable approach, we explore several design choices, including a problem-informed neural network architecture, simulated annealing, the integration of prior knowledge into the sampling process, and attention mechanisms. Finally, we conduct a comprehensive empirical study and compare these design choices with other recently developed deep-learning methods for rare event sampling.
Molphenix: A Multimodal Foundation Model for PhenoMolecular Retrieval
Philip Fradkin
Puria Azadi Moghadam
Karush Suri
Maciej Sypetkowski
Predicting molecular impact on cellular function is a core challenge in therapeutic design. Phenomic experiments, designed to capture cellu… (voir plus)lar morphology, utilize microscopy based techniques and demonstrate a high throughput solution for uncovering molecular impact on the cell. In this work, we learn a joint latent space between molecular structures and microscopy phenomic experiments, aligning paired samples with contrastive learning. Specifically, we study the problem of Contrastive PhenoMolecular Retrieval, which consists of zero-shot molecular structure identification conditioned on phenomic experiments. We assess challenges in multi-modal learning of phenomics and molecular modalities such as experimental batch effect, inactive molecule perturbations, and encoding perturbation concentration. We demonstrate improved multi-modal learner retrieval through (1) a uni-modal pre-trained phenomics model, (2) a novel inter sample similarity aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1
ET-Flow: Equivariant Flow-Matching for Molecular Conformer Generation
Nikhil Shenoy
Hannes Stärk
Stephan Thaler
Predicting low-energy molecular conformations given a molecular graph is an important but challenging task in computational drug discovery.… (voir plus) Existing state- of-the-art approaches either resort to large scale transformer-based models that diffuse over conformer fields, or use computationally expensive methods to gen- erate initial structures and diffuse over torsion angles. In this work, we introduce Equivariant Transformer Flow (ET-Flow). We showcase that a well-designed flow matching approach with equivariance and harmonic prior alleviates the need for complex internal geometry calculations and large architectures, contrary to the prevailing methods in the field. Our approach results in a straightforward and scalable method that directly operates on all-atom coordinates with minimal assumptions. With the advantages of equivariance and flow matching, ET-Flow significantly increases the precision and physical validity of the generated con- formers, while being a lighter model and faster at inference. Code is available https://github.com/shenoynikhil/ETFlow.
How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval
Philip Fradkin
Puria Azadi Moghadam
Karush Suri
Ali Bashashati
Maciej Sypetkowski
Predicting molecular impact on cellular function is a core challenge in therapeutic design. Phenomic experiments, designed to capture cellul… (voir plus)ar morphology, utilize microscopy based techniques and demonstrate a high throughput solution for uncovering molecular impact on the cell. In this work, we learn a joint latent space between molecular structures and microscopy phenomic experiments, aligning paired samples with contrastive learning. Specifically, we study the problem ofContrastive PhenoMolecular Retrieval, which consists of zero-shot molecular structure identification conditioned on phenomic experiments. We assess challenges in multi-modal learning of phenomics and molecular modalities such as experimental batch effect, inactive molecule perturbations, and encoding perturbation concentration. We demonstrate improved multi-modal learner retrieval through (1) a uni-modal pre-trained phenomics model, (2) a novel inter sample similarity aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1x improvement in zero shot molecular retrieval of active molecules over the previous state-of-the-art, reaching 77.33% in top-1% accuracy. These results open the door for machine learning to be applied in virtual phenomics screening, which can significantly benefit drug discovery applications.
On the Scalability of GNNs for Molecular Graphs
Maciej Sypetkowski
Nia Dickson
Karush Suri
Philip Fradkin
Scaling deep learning models has been at the heart of recent revolutions in language modelling and image generation. Practitioners have obse… (voir plus)rved a strong relationship between model size, dataset size, and performance. However, structure-based architectures such as Graph Neural Networks (GNNs) are yet to show the benefits of scale mainly due to the lower efficiency of sparse operations, large data requirements, and lack of clarity about the effectiveness of various architectures. We address this drawback of GNNs by studying their scaling behavior. Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs. For the first time, we observe that GNNs benefit tremendously from the increasing scale of depth, width, number of molecules, number of labels, and the diversity in the pretraining datasets, resulting in a 30.25% improvement when scaling to 1 billion parameters and 28.98% improvement when increasing size of dataset to eightfold. We further demonstrate strong finetuning scaling behavior on 38 tasks, outclassing previous large models. We hope that our work paves the way for an era where foundational GNNs drive pharmaceutical drug discovery.
Graph Positional and Structural Encoder
Positional and structural encodings (PSE) enable better identifiability of nodes within a graph, rendering them essential tools for empoweri… (voir plus)ng modern GNNs, and in particular graph Transformers. However, designing PSEs that work optimally for all graph prediction tasks is a challenging and unsolved problem. Here, we present the Graph Positional and Structural Encoder (GPSE), the first-ever graph encoder designed to capture rich PSE representations for augmenting any GNN. GPSE learns an efficient common latent representation for multiple PSEs, and is highly transferable: The encoder trained on a particular graph dataset can be used effectively on datasets drawn from markedly different distributions and modalities. We show that across a wide range of benchmarks, GPSE-enhanced models can significantly outperform those that employ explicitly computed PSEs, and at least match their performance in others. Our results pave the way for the development of foundational pre-trained graph encoders for extracting positional and structural information, and highlight their potential as a more powerful and efficient alternative to explicitly computed PSEs and existing self-supervised pre-training approaches. Our framework and pre-trained models are publicly available at https://github.com/G-Taxonomy-Workgroup/GPSE. For convenience, GPSE has also been integrated into the PyG library to facilitate downstream applications.
Equivariant Flow Matching for Molecular Conformer Generation
Nikhil Shenoy
Hannes Stärk
Stephan Thaler
Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology
Oren Kraus
Saber Saberian
Maryam Fallah
Peter McLean
Jess Leung
Vasudev Sharma
Ayla Khan
Jia Balakrishnan
Safiye Celik
Maciej Sypetkowski
Chi Vicky Cheng
Kristen Morse
Maureen Makes
Ben Mabey
Berton Earnshaw