Jian Tang

chuanrui.wang@mila.quebec

Chuanru Wang

Master's Research - Université de Montréal

farzaneh.heidari@mila.quebec

Farzaneh Heidari

PhD - Université de Montréal

Principal supervisor :

Guillaume Rabusseau

Huiyu Cai

PhD - Université de Montréal

huiyu.cai@mila.quebec

Jianan Zhao

PhD - Université de Montréal

jianan.zhao@mila.quebec

Jerry Lu

PhD - Université de Montréal

jiarui.lu@mila.quebec

mikhail.galkin@mila.quebec

Meng Qu

PhD - Université de Montréal

qumeng@mila.quebec

Michael Galkin

Collaborating researcher

sophie.xhonneux@mila.quebec

Minghao Xu

Collaborating researcher

minghao.xu@mila.quebec

Research Intern - HEC Montréal

nicole.duane@mila.quebec

Sophie Xhonneux

PhD - Université de Montréal

Co-supervisor :

Gauthier Gidel

Xinyu Yuan

PhD - Université de Montréal

xinyu.yuan@mila.quebec

yangtian.zhang@mila.quebec

Yangtian Zhang

Collaborating researcher

Zewen Chi

Research Intern - Beijing Institute of Technology

zewen.chi@mila.quebec

Zhaocheng Zhu

PhD - Université de Montréal

Zhihao Zhan

PhD - Université de Montréal

zhihao.zhan@mila.quebec

Zuobai Zhang

PhD - Université de Montréal

zuobai.zhang@mila.quebec

Publications

DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing

Yang Zhang

Zuobai Zhang

Bozitao Zhong

Sanchit Misra

Proteins play a critical role in carrying out biological functions, and their 3D structures are essential in determining their functions. A… (see more)ccurately predicting the conformation of protein side-chains given their backbones is important for applications in protein structure prediction, design and protein-protein interactions. Traditional methods are computationally intensive and have limited accuracy, while existing machine learning methods treat the problem as a regression task and overlook the restrictions imposed by the constant covalent bond lengths and angles. In this work, we present DiffPack, a torsional diffusion model that learns the joint distribution of side-chain torsional angles, the only degrees of freedom in side-chain packing, by diffusing and denoising on the torsional space. To avoid issues arising from simultaneous perturbation of all four torsional angles, we propose autoregressively generating the four torsional angles from

GAUCHE: A Library for Gaussian Processes in Chemistry

Ryan-Rhys Griffiths

Leo Klarner

Henry Moss

Aditya Ravuri

Sang T. Truong

Yuanqi Du

Samuel Don Stanton

Gary Tom

Bojana Rankovic

Arian Rokkum Jamasb

Aryan Deshwal

Julius Schwartz

Austin Tripp

Gregory Kell

Simon Frieder

Anthony Bourached

Alex James Chan

Jacob Moss

Chengzhi Guo

Johannes P. Dürholt … (see 8 more)

Saudamini Chaurasia

Ji Won Park

Felix Strieth-Kalthoff

Alpha Lee

Bingqing Cheng

Alán Aspuru-Guzik

Philippe Schwaller

We introduce GAUCHE, a library for GAUssian processes in CHEmistry. Gaussian processes have long been a cornerstone of probabilistic machine… (see more) learning, affording particular advantages for uncertainty quantification and Bayesian optimisation. Extending Gaussian processes to chemical representations however is nontrivial, necessitating kernels defined over structured inputs such as graphs, strings and bit vectors. By defining such kernels in GAUCHE, we seek to open the door to powerful tools for uncertainty quantification and Bayesian optimisation in chemistry. Motivated by scenarios frequently encountered in experimental chemistry, we showcase applications for GAUCHE in molecular discovery and chemical reaction optimisation. The codebase is made available at https://github.com/leojklarner/gauche

Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction

Zuobai Zhang

Minghao Xu

Aurelie Lozano

Vijil Chenthamarakshan

Payel Das

Self-supervised pre-training methods on proteins have recently gained attention, with most approaches focusing on either protein sequences o… (see more)r structures, neglecting the exploration of their joint distribution, which is crucial for a comprehensive understanding of protein functions by integrating co-evolutionary information and structural characteristics. In this work, inspired by the success of denoising diffusion models in generative tasks, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure joint diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the joint diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. Our implementation is available at https://github.com/DeepGraphLearning/SiamDiff.

Scientific discovery in the age of artificial intelligence

Hanchen Wang

Tianfan Fu

Yuanqi Du

Wenhao Gao

Kexin Huang

Ziming Liu

Payal Chandak

Shengchao Liu

Peter Van Katwyk

Andreea Deac

Animashree Anandkumar

K. Bergen

Carla P. Gomes

Shirley Ho

Pushmeet Kohli

Joan Lasenby

Jure Leskovec

Tie-Yan Liu

A. Manrai

Debora Susan Marks … (see 10 more)

Bharath Ramsundar

Le Song

Jimeng Sun

Petar Veličković

Max Welling

Linfeng Zhang

Connor Wilson. Coley

Yoshua Bengio

Marinka Žitnik

2023-08-01

Nature (published)

FusionRetro: Molecule Representation Fusion via In-Context Learning for Retrosynthetic Planning

Songtao Liu

Zhengkai Tu

Minkai Xu

Zuobai Zhang

Lu Lin

Rex Ying

Zhitao Ying

Peilin Zhao

Dinghao Wu

Retrosynthetic planning aims to devise a complete multi-step synthetic route from starting materials to a target molecule. Current strategie… (see more)s use a decoupled approach of single-step retrosynthesis models and search algorithms, taking only the product as the input to predict the reactants for each planning step and ignoring valuable context information along the synthetic route. In this work, we propose a novel framework that utilizes context information for improved retrosynthetic planning. We view synthetic routes as reaction graphs and propose to incorporate context through three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. Our approach is the first attempt to utilize in-context learning for retrosynthesis prediction in retrosynthetic planning. The entire framework can be efficiently optimized in an end-to-end fashion and produce more practical and accurate predictions. Comprehensive experiments demonstrate that by fusing in the context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes. Code is available at https://github.com/SongtaoLiu0823/FusionRetro.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

proceedings.mlr.press

A Group Symmetric Stochastic Differential Equation Model for Molecule Multi-modal Pretraining

Shengchao Liu

weitao Du

Zhi-Ming Ma

Hongyu Guo

Molecule pretraining has quickly become the go-to schema to boost the performance of AI-based drug discovery. Naturally, molecules can be re… (see more)presented as 2D topological graphs or 3D geometric point clouds. Although most existing pertaining methods focus on merely the single modality, recent research has shown that maximizing the mutual information (MI) between such two modalities enhances the molecule representation ability. Meanwhile, existing molecule multi-modal pretraining approaches approximate MI based on the representation space encoded from the topology and geometry, thus resulting in the loss of critical structural information of molecules. To address this issue, we propose MoleculeSDE. MoleculeSDE leverages group symmetric (e.g., SE(3)-equivariant and reflection-antisymmetric) stochastic differential equation models to generate the 3D geometries from 2D topologies, and vice versa, directly in the input space. It not only obtains tighter MI bound but also enables prosperous downstream tasks than the previous work. By comparing with 17 pretraining baselines, we empirically verify that MoleculeSDE can learn an expressive representation with state-of-the-art performance on 26 out of 32 downstream tasks.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

Minghao Xu

Xinyu Yuan

Santiago Miret

Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary… (see more) information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM’s original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

Signed Laplacian Graph Neural Networks

Yu Li

Meng Qu

Yi Chang

2023-06-26

Proceedings of the AAAI Conference on Artificial Intelligence (published)

Score-based Enhanced Sampling for Protein Molecular Dynamics

Jiarui Lu

Bozitao Zhong

The dynamic nature of proteins is crucial for determining their biological functions and properties, and molecular dynamics (MD) simulations… (see more) stand as a predominant tool to study such phenomena. By utilizing empirically derived force fields, MD simulations explore the conformational space through numerically evolving the system along MD trajectories. However, the high-energy barrier of the force fields can hamper the exploration of MD, resulting in inadequately sampled ensemble. In this paper, we propose leveraging score-based generative models (SGMs) trained on large-scale general protein structures to perform protein con- formational sampling to complement traditional MD simulations. Experimental results demonstrate the effectiveness of our approach on several benchmark systems by comparing the results with long MD trajectories and state-of-the-art generative structure prediction models.

2023-06-19

ICML.cc/2023/Workshop/SPIGM (poster)

Evolving Computation Graphs

Andreea Deac

Graph neural networks (GNNs) have demonstrated success in modeling relational data, especially for data that exhibits homophily: when a conn… (see more)ection between nodes tends to imply that they belong to the same class. However, while this assumption is true in many relevant situations, there are important real-world scenarios that violate this assumption, and this has spurred research into improving GNNs for these cases. In this work, we propose Evolving Computation Graphs (ECGs), a novel method for enhancing GNNs on heterophilic datasets. Our approach builds on prior theoretical insights linking node degree, high homophily, and inter vs intra-class embedding similarity by rewiring the GNNs' computation graph towards adding edges that connect nodes that are likely to be in the same class. We utilise weaker classifiers to identify these edges, ultimately improving GNN performance on non-homophilic data as a result. We evaluate ECGs on a diverse set of recently-proposed heterophilous datasets and demonstrate improvements over the relevant baselines. ECG presents a simple, intuitive and elegant approach for improving GNN performance on heterophilic datasets without requiring prior domain knowledge.

2023-06-18

ICML.cc/2023/Workshop/TAGML (poster)

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

Minghao Xu

Xinyu Yuan

Santiago Miret

Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary… (see more) information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM's original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.

2023-04-24

ICML.cc/2023/Conference (published)

Biomedical discovery through the integrative biomedical knowledge hub (iBKH).

Chang Su

Yu Hou

Manqi Zhou

Suraj Rajendran

Jacqueline R.M. A. Maasch

Zehra Abedi

Haotan Zhang

Zilong Bai

Anthony Cuturrufo

Winston Guo

Fayzan F. Chaudhry

Gregory Ghahramani

Feixiong Cheng

Yue Li

Rui Zhang

Steven T. DeKosky

Jiang Bian

Fei Wang

2023-04-01

iScience (published)