
Jian Tang

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, HEC Montréal, Department of Decision Sciences
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder, BioGeometry
Research Topics
Computational Biology
Deep Learning
Generative Models
Graph Neural Networks
Molecular Modeling

Biography

Jian Tang is an associate professor in the Department of Decision Sciences at HEC Montréal. He is also an adjunct professor in the Department of Computer Science and Operations Research at Université de Montréal and a core academic member at Mila - Quebec AI Institute. He is a Canada CIFAR AI Chair and the founder of BioGeometry, an AI startup that focuses on generative AI for antibody discovery. Tang’s main research interests are deep generative models and graph machine learning, and their applications to drug discovery. He is an international leader in graph machine learning, and LINE, his node representation method, has been widely recognized and cited more than five thousand times. He has also done pioneering work on AI for drug discovery, such as developing the first open-source machine learning frameworks for drug discovery, TorchDrug and TorchProtein.

Current Students

Collaborating researcher
PhD - Université de Montréal
Principal supervisor :
Research Intern - McGill University
PhD - Université de Montréal
Collaborating researcher - Carnegie Mellon University
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal

Publications

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
Minghao Xu
Xinyu Yuan
Santiago Miret
Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM’s original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.
Signed Laplacian Graph Neural Networks
Yu Li
Meng Qu
Yi Chang
Score-based Enhanced Sampling for Protein Molecular Dynamics
Jiarui Lu
Bozitao Zhong
The dynamic nature of proteins is crucial for determining their biological functions and properties, and molecular dynamics (MD) simulations stand as a predominant tool to study such phenomena. By utilizing empirically derived force fields, MD simulations explore the conformational space through numerically evolving the system along MD trajectories. However, the high-energy barrier of the force fields can hamper the exploration of MD, resulting in an inadequately sampled ensemble. In this paper, we propose leveraging score-based generative models (SGMs) trained on large-scale general protein structures to perform protein conformational sampling to complement traditional MD simulations. Experimental results demonstrate the effectiveness of our approach on several benchmark systems by comparing the results with long MD trajectories and state-of-the-art generative structure prediction models.
Evolving Computation Graphs
Andreea Deac
Graph neural networks (GNNs) have demonstrated success in modeling relational data, especially for data that exhibits homophily: when a connection between nodes tends to imply that they belong to the same class. However, while this assumption is true in many relevant situations, there are important real-world scenarios that violate this assumption, and this has spurred research into improving GNNs for these cases. In this work, we propose Evolving Computation Graphs (ECGs), a novel method for enhancing GNNs on heterophilic datasets. Our approach builds on prior theoretical insights linking node degree, high homophily, and inter vs intra-class embedding similarity by rewiring the GNNs' computation graph towards adding edges that connect nodes that are likely to be in the same class. We utilise weaker classifiers to identify these edges, ultimately improving GNN performance on non-homophilic data as a result. We evaluate ECGs on a diverse set of recently-proposed heterophilous datasets and demonstrate improvements over the relevant baselines. ECG presents a simple, intuitive and elegant approach for improving GNN performance on heterophilic datasets without requiring prior domain knowledge.
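The rewiring idea in the abstract above (adding edges between nodes that a weaker classifier suggests are in the same class) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `add_same_class_edges`, the confidence `threshold`, and the rule "same predicted class and both predictions confident" are assumptions made purely for illustration.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def add_same_class_edges(
    edges: Set[Edge],
    weak_probs: List[List[float]],
    threshold: float = 0.9,
) -> Set[Edge]:
    """Return the undirected input edges plus extra edges between node pairs
    that a weak classifier confidently places in the same class.

    weak_probs[i] holds the (hypothetical) weak classifier's class
    probabilities for node i.
    """
    # Predicted class and confidence for every node.
    preds = [max(range(len(p)), key=p.__getitem__) for p in weak_probs]
    confs = [max(p) for p in weak_probs]

    # Store undirected edges in canonical (small, large) order.
    rewired = {tuple(sorted(e)) for e in edges}

    n = len(weak_probs)
    for i in range(n):
        for j in range(i + 1, n):
            same_class = preds[i] == preds[j]
            confident = min(confs[i], confs[j]) >= threshold
            if same_class and confident:
                rewired.add((i, j))
    return rewired


# Toy example: nodes 0 and 2 are both confidently predicted as class 0.
edges = {(0, 1), (2, 3)}
weak_probs = [[0.95, 0.05], [0.10, 0.90], [0.97, 0.03], [0.60, 0.40]]
rewired = add_same_class_edges(edges, weak_probs, threshold=0.9)
print(sorted(rewired))  # [(0, 1), (0, 2), (2, 3)]
```

A GNN would then run message passing over `rewired` instead of the original edge set. The pairwise loop is quadratic in the number of nodes, so a practical variant would likely restrict the candidate pairs.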
Biomedical discovery through the integrative biomedical knowledge hub (iBKH).
Chang Su
Yufang Hou
Manqi Zhou
Suraj Rajendran
Jacqueline R.M. A. Maasch
Zehra Abedi
Haotan Zhang
Zilong Bai
Anthony Cuturrufo
Winston Guo
Fayzan F. Chaudhry
Gregory Ghahramani
Feixiong Cheng
Rui Zhang
Steven T. DeKosky
Jiang Bian
Yi Wang
A Systematic Study of Joint Representation Learning on Protein Sequences and Structures
Zuobai Zhang
Chuanrui Wang
Minghao Xu
Vijil Chenthamarakshan
Aurelie Lozano
Payel Das
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods, which leverage 3D structural information with graph neural networks and geometric pre-training, show potential in function prediction tasks but still suffer from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting a new state of the art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet.
Enhancing Protein Language Model with Structure-based Encoder and Pre-training
Zuobai Zhang
Minghao Xu
Aurelie Lozano
Vijil Chenthamarakshan
Payel Das
Protein language models (PLMs) pre-trained on large-scale protein sequence corpora have achieved impressive performance on various downstream protein understanding tasks. Despite the ability to implicitly capture inter-residue contact information, transformer-based PLMs cannot encode protein structures explicitly for better structure-aware protein representations. Besides, the power of pre-training on available protein structures has not been explored for improving these PLMs, though structures are important to determine functions. To tackle these limitations, in this work, we enhance the PLM with a structure-based encoder and pre-training. We first explore feasible model architectures to combine the advantages of a state-of-the-art PLM (i.e., ESM-1b) and a state-of-the-art protein structure encoder (i.e., GearNet). We empirically verify that ESM-GearNet, which connects the two encoders in series, is the most effective combination model. To further improve the effectiveness of ESM-GearNet, we pre-train it on massive unlabeled protein structures with contrastive learning, which aligns representations of co-occurring subsequences so as to capture their biological correlation. Extensive experiments on EC and GO protein function prediction benchmarks demonstrate the superiority of ESM-GearNet over previous PLMs and structure encoders, and clear performance gains are further achieved by structure-based pre-training upon ESM-GearNet. The source code will be made public upon acceptance.
EurNet: Efficient Multi-Range Relational Modeling of Protein Structure
Minghao Xu
Yuanfan Guo
Yi Xu
Xinlei Chen
Yuandong Tian
Modeling the 3D structures of proteins is critical for obtaining effective protein structure representations, which further boosts protein function understanding. Existing protein structure encoders mainly focus on modeling short-range interactions within protein structures, while neglecting the interactions at multiple length scales that together make up the complete interaction patterns in protein structures. To attain complete interaction modeling with efficient computation, we introduce EurNet for Efficient multi-range relational modeling. In EurNet, we represent the protein structure as a multi-relational residue-level graph with different types of edges for modeling short-range, medium-range and long-range interactions. To efficiently process these different interactive relations, we propose a novel modeling layer, called Gated Relational Message Passing (GRMP), as the basic building block of EurNet. GRMP can capture multiple interactive relations in protein structures with little extra computational cost. We verify the state-of-the-art performance of EurNet on EC and GO protein function prediction benchmarks, and the proposed GRMP layer is shown to achieve a better efficiency-performance trade-off than the widely used relational graph convolution.
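To make the notion of a multi-relational residue-level graph more concrete, here is a small Python sketch that assigns each residue pair to a "short", "medium" or "long" edge type. The abstract does not say how EurNet defines these ranges, so everything here is assumed for illustration: the function name `build_multi_range_edges`, the use of Euclidean distance between one 3D coordinate per residue, and the cutoff values in ångströms.

```python
import math
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def build_multi_range_edges(
    coords: Sequence[Sequence[float]],
    short_cutoff: float = 5.0,    # hypothetical cutoffs, not taken from the paper
    medium_cutoff: float = 10.0,
    long_cutoff: float = 20.0,
) -> Dict[str, List[Edge]]:
    """Bin every residue pair into one edge type by Euclidean distance.

    coords[i] is one 3D coordinate per residue (for example its alpha carbon).
    Pairs farther apart than long_cutoff get no edge.
    """
    edges: Dict[str, List[Edge]] = {"short": [], "medium": [], "long": []}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d <= short_cutoff:
                edges["short"].append((i, j))
            elif d <= medium_cutoff:
                edges["medium"].append((i, j))
            elif d <= long_cutoff:
                edges["long"].append((i, j))
    return edges


# Toy example: four residues placed along the x axis.
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (12.0, 0.0, 0.0), (40.0, 0.0, 0.0)]
multi_range = build_multi_range_edges(coords)
print(multi_range)
# {'short': [(0, 1)], 'medium': [(1, 2)], 'long': [(0, 2)]}
```

A relational message passing layer would then treat each key of the returned dictionary as a separate relation, so that short-, medium- and long-range neighbours can be weighted differently.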
A Text-guided Protein Design Framework
Shengchao Liu
Yutao Zhu
Jiarui Lu
Zhao Xu
Weili Nie
Anthony James Gitter
Chaowei Xiao
Hongyu Guo
Animashree Anandkumar
E3Bind: An End-to-End Equivariant Network for Protein-Ligand Docking
Yang Zhang
Huiyu Cai
Chence Shi
Bozitao Zhong
In silico prediction of the ligand binding pose to a given protein target is a crucial but challenging task in drug discovery. This work focuses on blind flexible self-docking, where we aim to predict the positions, orientations and conformations of docked molecules. Traditional physics-based methods usually suffer from inaccurate scoring functions and high inference cost. Recently, data-driven methods based on deep learning techniques are attracting growing interest thanks to their efficiency during inference and promising performance. These methods usually either adopt a two-stage approach by first predicting the distances between proteins and ligands and then generating the final coordinates based on the predicted distances, or directly predicting the global roto-translation of ligands. In this paper, we take a different route. Inspired by the resounding success of AlphaFold2 for protein structure prediction, we propose E3Bind, an end-to-end equivariant network that iteratively updates the ligand pose. E3Bind models the protein-ligand interaction through careful consideration of the geometric constraints in docking and the local context of the binding site. Experiments on standard benchmark datasets demonstrate the superior performance of our end-to-end trainable model compared to traditional and recently-proposed deep learning methods.