
Zuobai Zhang

PhD - Université de Montréal
Supervisor
Jian Tang
Research Topics
Computational Biology
Graph Neural Networks
Molecular Modeling

Publications

PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design
Chuanrui Wang
Bozitao Zhong
Narendra Chaudhary
Sanchit Misra
Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since wet-lab validation can be overly time-consuming for the development of new algorithms.
DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing
Bozitao Zhong
Sanchit Misra
Proteins play a critical role in carrying out biological functions, and their 3D structures are essential in determining their functions. Accurately predicting the conformation of protein side-chains given their backbones is important for applications in protein structure prediction, design and protein-protein interactions. Traditional methods are computationally intensive and have limited accuracy, while existing machine learning methods treat the problem as a regression task and overlook the restrictions imposed by the constant covalent bond lengths and angles. In this work, we present DiffPack, a torsional diffusion model that learns the joint distribution of side-chain torsional angles, the only degrees of freedom in side-chain packing, by diffusing and denoising on the torsional space. To avoid issues arising from simultaneous perturbation of all four torsional angles, we propose autoregressively generating the four torsional angles from χ1 to χ4.
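To make the mechanism above concrete, below is a minimal sketch of diffusion on torsional space with autoregressive sampling of the four χ angles. Everything here is illustrative rather than DiffPack's actual code: the `denoiser` network, the linear noise schedule and the crude reverse update are all assumptions; the only essentials are that noise wraps on the circle and each χ is generated conditioned on the ones already fixed.

```python
import math
import torch

def perturb_torsions(chi, t, sigma_max=math.pi):
    """Training-time corruption: add wrapped Gaussian noise to torsion angles
    (radians) at diffusion time t in [0, 1]."""
    noisy = chi + sigma_max * t * torch.randn_like(chi)
    return (noisy + math.pi) % (2 * math.pi) - math.pi   # wrap back to [-pi, pi)

@torch.no_grad()
def sample_side_chains(denoiser, backbone_feats, n_residues, n_steps=20):
    """Autoregressive sampling: chi_1 first, then chi_2 conditioned on chi_1, etc.
    `denoiser(backbone_feats, fixed_chi, angle, k, t)` is a hypothetical network
    scoring the current noisy angle k given the backbone and the torsions
    already fixed in earlier rounds."""
    fixed_chi = torch.zeros(n_residues, 4)
    for k in range(4):
        angle = 2 * math.pi * torch.rand(n_residues) - math.pi   # start from noise
        for step in range(n_steps, 0, -1):
            t = step / n_steps
            angle = angle - denoiser(backbone_feats, fixed_chi, angle, k, t) / n_steps
            angle = (angle + math.pi) % (2 * math.pi) - math.pi  # stay on the circle
        fixed_chi[:, k] = angle
    return fixed_chi
```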
Pretrainable geometric graph neural network for antibody affinity maturation
Mingkai Wang
Bozitao Zhong
Yanling Wu
Tianlei Ying
Increasing the binding affinity of an antibody to its target antigen is a crucial task in antibody therapeutics development. This paper presents a pretrainable geometric graph neural network, GearBind, and explores its potential in in silico affinity maturation. Leveraging multi-relational graph construction, multi-level geometric message passing and contrastive pretraining on mass-scale, unlabeled protein structural data, GearBind outperforms previous state-of-the-art approaches on SKEMPI and an independent test set. A powerful ensemble model based on GearBind is then derived and used to successfully enhance the binding of two antibodies with distinct formats and target antigens. ELISA EC50 values of the designed antibody mutants are decreased by up to 17-fold, and KD values by up to 6.1-fold. These promising results underscore the utility of geometric deep learning and effective pretraining in macromolecule interaction modeling tasks.
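As a rough illustration of the multi-relational message passing mentioned above, the sketch below uses one weight matrix per edge type and sums messages per node. It is a generic relational GNN layer under assumed tensor layouts, not GearBind's implementation; the edge types (sequential, radius, k-NN) are only examples of what a multi-relational protein graph might contain.

```python
import torch
import torch.nn as nn

class RelationalGraphLayer(nn.Module):
    """Toy multi-relational message passing: one weight matrix per edge type
    (e.g. sequential, radius, k-NN edges), messages summed at each node."""

    def __init__(self, dim, num_relations):
        super().__init__()
        self.linears = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_relations))

    def forward(self, x, edge_index, edge_type):
        # x: [num_nodes, dim]; edge_index: [2, num_edges]; edge_type: [num_edges]
        src, dst = edge_index
        out = torch.zeros_like(x)
        for r, lin in enumerate(self.linears):
            mask = edge_type == r
            msg = lin(x[src[mask]])            # transform sources by relation r
            out.index_add_(0, dst[mask], msg)  # aggregate messages at destinations
        return torch.relu(x + out)             # residual update
```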
A Systematic Study of Joint Representation Learning on Protein Sequences and Structures
Chuanrui Wang
Vijil Chenthamarakshan
Aurelie Lozano
Payel Das
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods leverage 3D structural information with graph neural networks, and geometric pre-training methods show potential in function prediction tasks, but they still suffer from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting a new state of the art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet.
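One natural reading of the fusion strategies studied here is serial fusion, where per-residue PLM embeddings become the input node features of the structure encoder. The sketch below illustrates that wiring under stated assumptions: `plm` and `structure_encoder` are placeholders for, e.g., an ESM-2 wrapper and a GVP/GearNet/CDConv encoder, and their interfaces are invented for illustration.

```python
import torch
import torch.nn as nn

class SerialFusion(nn.Module):
    """Serial fusion sketch: per-residue PLM embeddings are projected and fed
    to a structure encoder as its input node features."""

    def __init__(self, plm, structure_encoder, plm_dim, node_dim):
        super().__init__()
        self.plm = plm
        self.proj = nn.Linear(plm_dim, node_dim)
        self.structure_encoder = structure_encoder

    def forward(self, tokens, graph):
        seq_repr = self.plm(tokens)      # [num_residues, plm_dim] sequence features
        node_feats = self.proj(seq_repr) # map into the structure encoder's space
        return self.structure_encoder(graph, node_feats)  # structure-aware output
```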
Enhancing Protein Language Model with Structure-based Encoder and Pre-training
Aurelie Lozano
Vijil Chenthamarakshan
Payel Das
Protein language models (PLMs) pre-trained on large-scale protein sequence corpora have achieved impressive performance on various downstream protein understanding tasks. Despite the ability to implicitly capture inter-residue contact information, transformer-based PLMs cannot encode protein structures explicitly for better structure-aware protein representations. Besides, the power of pre-training on available protein structures has not been explored for improving these PLMs, though structures are important in determining functions. To tackle these limitations, in this work, we enhance the PLM with a structure-based encoder and pre-training. We first explore feasible model architectures that combine the advantages of a state-of-the-art PLM (i.e., ESM-1b) and a state-of-the-art protein structure encoder (i.e., GearNet). We empirically identify ESM-GearNet, which connects the two encoders in series, as the most effective combination. To further improve the effectiveness of ESM-GearNet, we pre-train it on massive unlabeled protein structures with contrastive learning, which aligns representations of co-occurring subsequences so as to capture their biological correlation. Extensive experiments on EC and GO protein function prediction benchmarks demonstrate the superiority of ESM-GearNet over previous PLMs and structure encoders, and clear performance gains are further achieved by structure-based pre-training upon ESM-GearNet. The source code will be made public upon acceptance.
Protein Representation Learning by Geometric Structure Pretraining
Arian Jamasb
Vijil Chenthamarakshan
Aurelie Lozano
Payel Das
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.
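The multiview contrastive objective referenced above is, at its core, an InfoNCE loss between two augmented views of the same protein. A minimal sketch, assuming graph-level embeddings have already been computed for each view (this is standard InfoNCE, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(z1, z2, temperature=0.07):
    """InfoNCE over two augmented views (e.g. random substructures) of the same
    proteins. z1, z2: [batch, dim] graph-level embeddings; matching rows are
    positives, all other rows in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                 # [batch, batch] similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)            # diagonal entries are positives
```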
FusionRetro: Molecule Representation Fusion via Reaction Graph for Retrosynthetic Planning
Songtao Liu
Zhengkai Tu
Peilin Zhao
Rex Ying
Lu Lin
Dinghao Wu
Retrosynthetic planning is a fundamental problem in drug discovery and organic chemistry, which aims to find a complete multi-step synthetic route from a set of starting materials to the target molecule, determining crucial process flow in chemical production. Existing approaches combine single-step retrosynthesis models and search algorithms to find synthetic routes. However, these approaches generally consider the two pieces in a decoupled manner, taking only the product as the input to predict the reactants per planning step and largely ignoring the important context information from other intermediates along the synthetic route. In this work, we perform a series of experiments to identify the limitations of this decoupled view and propose a novel retrosynthesis framework that also exploits context information for retrosynthetic planning. We view synthetic routes as reaction graphs, and propose to incorporate the context by three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. The whole framework can be efficiently optimized in an end-to-end fashion. Comprehensive experiments show that by fusing in context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes.
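The encode/aggregate/readout pattern described above can be sketched as follows. The molecule encoder, reactant decoder and tensor shapes are placeholders invented for illustration; the point is only that each planning step attends over embeddings of the intermediates already on the route instead of seeing the product alone.

```python
import torch
import torch.nn as nn

class ContextualRetroStep(nn.Module):
    """Sketch of one context-aware planning step: encode the product and the
    route's intermediates, aggregate context with attention, then read out."""

    def __init__(self, mol_encoder, dim, reactant_decoder):
        super().__init__()
        self.mol_encoder = mol_encoder          # placeholder molecule GNN
        self.route_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.reactant_decoder = reactant_decoder  # placeholder readout module

    def forward(self, product, route_intermediates):
        h_prod = self.mol_encoder(product).unsqueeze(1)   # [B, 1, dim]
        h_ctx = self.mol_encoder(route_intermediates)     # [B, n_ctx, dim]
        fused, _ = self.route_attn(h_prod, h_ctx, h_ctx)  # attend over the route
        return self.reactant_decoder(fused.squeeze(1))    # predict the reactants
```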
Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction
Aurelie Lozano
Vijil Chenthamarakshan
Payel Das
Self-supervised pre-training methods on proteins have recently gained attention, with most approaches focusing on either protein sequences or structures, neglecting the exploration of their joint distribution, which is crucial for a comprehensive understanding of protein functions by integrating co-evolutionary information and structural characteristics. In this work, inspired by the success of denoising diffusion models in generative tasks, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure joint diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the joint diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. Our implementation is available at https://github.com/DeepGraphLearning/SiamDiff.
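A rough sketch of one joint sequence-structure denoising step under assumed interfaces is given below: residue types are randomly corrupted and coordinates perturbed with Gaussian noise, and the encoder is trained to recover both. The `encoder`, `seq_head` and `struct_head` modules are hypothetical, and the siamese mutual-information term between conformer trajectories is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def joint_diffusion_step(encoder, seq_head, struct_head, seq, coords, t):
    """One joint-diffusion pre-training step (sketch). seq: [N] residue-type
    labels; coords: [N, 3] native coordinates; t in (0, 1] is the noise level."""
    noisy_seq = torch.where(torch.rand_like(seq.float()) < t,
                            torch.randint_like(seq, 20), seq)  # corrupt residue types
    noisy_coords = coords + t * torch.randn_like(coords)       # perturb the structure
    h = encoder(noisy_seq, noisy_coords)                       # [N, dim] representations
    seq_loss = F.cross_entropy(seq_head(h), seq)               # recover native sequence
    struct_loss = F.mse_loss(struct_head(h), coords - noisy_coords)  # denoise coordinates
    return seq_loss + struct_loss
```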
Neural-Symbolic Models for Logical Queries on Knowledge Graphs
Answering complex first-order logic (FOL) queries on knowledge graphs is a fundamental task for multi-hop reasoning. Traditional symbolic methods traverse a complete knowledge graph to extract the answers, which provides good interpretation for each step. Recent neural methods learn geometric embeddings for complex queries. These methods can generalize to incomplete knowledge graphs, but their reasoning process is hard to interpret. In this paper, we propose Graph Neural Network Query Executor (GNN-QE), a neural-symbolic model that enjoys the advantages of both worlds. GNN-QE decomposes a complex FOL query into relation projections and logical operations over fuzzy sets, which provides interpretability for intermediate variables. To reason about the missing links, GNN-QE adapts a graph neural network from knowledge graph completion to execute the relation projections, and models the logical operations with product fuzzy logic. Experiments on 3 datasets show that GNN-QE significantly improves over previous state-of-the-art models in answering FOL queries. Meanwhile, GNN-QE can predict the number of answers without explicit supervision, and provide visualizations for intermediate variables.
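The product fuzzy logic mentioned above has a compact closed form over entity membership scores in [0, 1]; the snippet below shows the standard product t-norm operations (not GNN-QE's full executor, which also learns the relation projections with a GNN):

```python
import torch

def fuzzy_and(x, y):
    """Product fuzzy conjunction over entity membership scores in [0, 1]."""
    return x * y

def fuzzy_or(x, y):
    """Product fuzzy disjunction (probabilistic sum)."""
    return x + y - x * y

def fuzzy_not(x):
    """Fuzzy negation."""
    return 1 - x

# Toy example: score entities answering (A AND NOT B) over a 3-entity graph.
a = torch.tensor([0.9, 0.2, 0.7])   # fuzzy set from one relation projection
b = torch.tensor([0.1, 0.8, 0.3])   # fuzzy set from another branch
print(fuzzy_and(a, fuzzy_not(b)))   # tensor([0.8100, 0.0400, 0.4900])
```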
FusionRetro: Molecule Representation Fusion via In-Context Learning for Retrosynthetic Planning
Songtao Liu
Rex Ying
Peilin Zhao
Lu Lin
Dinghao Wu
Retrosynthetic planning aims to devise a complete multi-step synthetic route from starting materials to a target molecule. Current strategies use a decoupled approach of single-step retrosynthesis models and search algorithms, taking only the product as the input to predict the reactants for each planning step and ignoring valuable context information along the synthetic route. In this work, we propose a novel framework that utilizes context information for improved retrosynthetic planning. We view synthetic routes as reaction graphs and propose to incorporate context through three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. Our approach is the first attempt to utilize in-context learning for retrosynthesis prediction in retrosynthetic planning. The entire framework can be efficiently optimized in an end-to-end fashion and produce more practical and accurate predictions. Comprehensive experiments demonstrate that by fusing in the context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes. Code is available at https://github.com/SongtaoLiu0823/FusionRetro.
PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding
We are now witnessing significant progress of deep learning methods in a variety of tasks (or datasets) of proteins. However, there is a lack of a standard benchmark to evaluate the performance of different methods, which hinders the progress of deep learning in this field. In this paper, we propose such a benchmark called PEER, a comprehensive and multi-task benchmark for Protein sEquence undERstanding. PEER provides a set of diverse protein understanding tasks including protein function prediction, protein localization prediction, protein structure prediction, protein-protein interaction prediction, and protein-ligand interaction prediction. We evaluate different types of sequence-based methods for each task including traditional feature engineering approaches, different sequence encoding methods as well as large-scale pre-trained protein language models. In addition, we also investigate the performance of these methods under the multi-task learning setting. Experimental results show that large-scale pre-trained protein language models achieve the best performance for most individual tasks, and jointly training multiple tasks further boosts the performance. The datasets and source codes of this benchmark are all available at https://github.com/DeepGraphLearning/PEER_Benchmark
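As a sketch of the multi-task setting evaluated in the benchmark, the snippet below shares one sequence encoder across tasks and attaches a small head per task, with losses summed during joint training. Module names and dimensions are illustrative assumptions, not PEER's reference implementation.

```python
import torch
import torch.nn as nn

class MultiTaskProteinModel(nn.Module):
    """Multi-task sketch: one shared sequence encoder, one linear head per task.
    `encoder` is a placeholder (e.g. a CNN, LSTM, or pre-trained PLM)."""

    def __init__(self, encoder, dim, task_dims):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict(
            {name: nn.Linear(dim, d) for name, d in task_dims.items()})

    def forward(self, tokens, task):
        return self.heads[task](self.encoder(tokens))  # task-specific prediction

# Joint training sums per-task losses over a mixed batch, e.g.:
# loss = sum(criterion[t](model(x_t, t), y_t) for t, (x_t, y_t) in batch.items())
```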