Jian Tang

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, HEC Montréal, Department of Decision Sciences
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder, BioGeometry
Research Topics
Computational Biology
Deep Learning
Generative Models
Graph Neural Networks
Molecular Modeling

Biography

Jian Tang is an Associate Professor in the Department of Decision Sciences at HEC Montréal. He is also an Adjunct Professor in the Department of Computer Science and Operations Research at Université de Montréal and a Core Academic Member at Mila - Quebec AI Institute. He is a Canada CIFAR AI Chair and the Founder of BioGeometry, an AI startup that focuses on generative AI for antibody discovery. Tang’s main research interests are deep generative models and graph machine learning, and their applications to drug discovery. He is an international leader in graph machine learning, and LINE, his node representation method, has been widely recognized and cited more than five thousand times. He has also done pioneering work on AI for drug discovery, such as developing the first open-source machine learning frameworks for drug discovery, TorchDrug and TorchProtein.

Current Students

Collaborating researcher
PhD - Université de Montréal
Principal supervisor :
Research Intern - McGill University
PhD - Université de Montréal
PhD - Université de Montréal
Collaborating researcher
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal

Publications

GraphVF: Controllable Protein-Specific 3D Molecule Generation with Variational Flow
Fang Sun
Zhihao Zhan
Hongyu Guo
Ming Zhang
Designing molecules that bind to specific target proteins is a fundamental task in drug discovery. Recent generative models leveraging geometrical constraints imposed by proteins and molecules have shown great potential in generating protein-specific 3D molecules. Nevertheless, these existing methods fail to generate 3D molecules with 2D skeletal curtailments, which encode pharmacophoric patterns essential to drug potency. To cope with this challenge, we propose GraphVF, which seamlessly integrates geometrical and skeletal restraints into a variational flow framework, where the former is captured through a flow transformation and the latter is encoded by an amortized factorized Gaussian. We empirically verify that our method achieves state-of-the-art performance on protein-specific 3D molecule generation in terms of binding affinity and some other drug properties. In particular, it represents the first controllable geometry-aware, protein-specific molecule generation method, which enables creating 3D molecules with specified chemical sub-structures or drug properties.
Learning on Large-scale Text-attributed Graphs via Variational Inference
Jianan Zhao
Meng Qu
Chaozhuo Li
Hao Yan
Qian Liu
Rui Li
Xing Xie
This paper studies learning on text-attributed graphs (TAGs), where each node is associated with a text description. An ideal solution for such a problem would be integrating both the text and graph structure information with large language models and graph neural networks (GNNs). However, the problem becomes very challenging when graphs are large due to the high computational complexity brought by training large language models and GNNs together. In this paper, we propose an efficient and effective solution to learning on large text-attributed graphs by fusing graph structure and language learning with a variational Expectation-Maximization (EM) framework, called GLEM. Instead of simultaneously training large language models and GNNs on big graphs, GLEM proposes to alternately update the two modules in the E-step and M-step. Such a procedure allows training the two modules separately while still allowing them to interact and mutually enhance each other. Extensive experiments on multiple data sets demonstrate the efficiency and effectiveness of the proposed approach.
Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching
Shengchao Liu
Hongyu Guo
Molecular representation pretraining is critical in various applications for drug and material discovery due to the limited number of labeled molecules, and most existing work focuses on pretraining on 2D molecular graphs. However, the power of pretraining on 3D geometric structures has been less explored. This is owing to the difficulty of finding a sufficient proxy task that can empower the pretraining to effectively extract essential features from the geometric structures. Motivated by the dynamic nature of 3D molecules, where the continuous motion of a molecule in the 3D Euclidean space forms a smooth potential energy surface, we propose GeoSSL, a 3D coordinate denoising pretraining framework to model such an energy landscape. Further by leveraging an SE(3)-invariant score matching method, we propose GeoSSL-DDM in which the coordinate denoising proxy task is effectively boiled down to denoising the pairwise atomic distances in a molecule. Our comprehensive experiments confirm the effectiveness and robustness of our proposed method.
Pre-training Protein Structure Encoder via Siamese Diffusion Trajectory Prediction
Zuobai Zhang
Minghao Xu
Aurelie Lozano
Vijil Chenthamarakshan
Payel Das
Due to the determining role of protein structures on diverse protein functions, pre-training representations of proteins on massive unlabeled protein structures has attracted rising research interests. Among recent efforts on this direction, mutual information (MI) maximization based methods have gained the superiority on various downstream benchmark tasks. The core of these methods is to design correlated views that share common information about a protein. Previous view designs focus on capturing structural motif co-occurrence on the same protein structure, while they cannot capture detailed atom/residue interactions. To address this limitation, we propose the Siamese Diffusion Trajectory Prediction (SiamDiff) method. SiamDiff builds a view as the trajectory that gradually approaches protein native structure from scratch, which facilitates the modeling of atom/residue interactions underlying the protein structural dynamics. Specifically, we employ the multimodal diffusion process as a faithful simulation of the structure-sequence co-diffusion trajectory, where rich patterns of protein structural changes are embedded. On such basis, we design a principled theoretical framework to maximize the MI between correlated multimodal diffusion trajectories. We study the effectiveness of SiamDiff on both residue-level and atom-level structures. On the EC and ATOM3D benchmarks, we extensively compare our method with previous protein structure pre-training approaches. The experimental results verify the consistently superior or competitive performance of SiamDiff on all benchmark tasks compared to existing baselines. The source code will be made public upon acceptance.
Protein Representation Learning by Geometric Structure Pretraining
Zuobai Zhang
Minghao Xu
Arian Rokkum Jamasb
Vijil Chenthamarakshan
Aurelie Lozano
Payel Das
Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.
Protein Sequence and Structure Co-Design with Equivariant Translation
Chence Shi
Chuanrui Wang
Jiarui Lu
Bozitao Zhong
Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In this paper, we propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state from random initialization, based on context features given a priori. Our model consists of a trigonometry-aware encoder that reasons geometrical constraints and interactions from context features, and a roto-translation equivariant decoder that translates protein sequence and structure interdependently. Notably, all protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process. Experimental results across multiple tasks show that our model outperforms previous state-of-the-art baselines by a large margin, and is able to design proteins of high fidelity as regards both sequence and structure, with running time orders of magnitude less than sampling-based methods.
FusionRetro: Molecule Representation Fusion via Reaction Graph for Retrosynthetic Planning
Songtao Liu
Zhengkai Tu
Minkai Xu
Zuobai Zhang
Peilin Zhao
Rex Ying
Lu Lin
Dinghao Wu
Retrosynthetic planning is a fundamental problem in drug discovery and organic chemistry, which aims to find a complete multi-step synthetic route from a set of starting materials to the target molecule, determining crucial process flow in chemical production. Existing approaches combine single-step retrosynthesis models and search algorithms to find synthetic routes. However, these approaches generally consider the two pieces in a decoupled manner, taking only the product as the input to predict the reactants per planning step and largely ignoring the important context information from other intermediates along the synthetic route. In this work, we perform a series of experiments to identify the limitations of this decoupled view and propose a novel retrosynthesis framework that also exploits context information for retrosynthetic planning. We view synthetic routes as reaction graphs, and propose to incorporate the context by three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. The whole framework can be efficiently optimized in an end-to-end fashion. Comprehensive experiments show that by fusing in context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes.
FusionRetro: Molecule Representation Fusion via In-Context Learning for Retrosynthetic Planning
Songtao Liu
Zhengkai Tu
Minkai Xu
Zuobai Zhang
Lu Lin
Rex Ying
Zhitao Ying
Peilin Zhao
Dinghao Wu
A Group Symmetric Stochastic Differential Equation Model for Molecule Multi-modal Pretraining
Shengchao Liu
Weitao Du
Zhi-Ming Ma
Hongyu Guo
Molecule pretraining has quickly become the go-to schema to boost the performance of AI-based drug discovery. Naturally, molecules can be represented as 2D topological graphs or 3D geometric point clouds. Although most existing pretraining methods focus on merely the single modality, recent research has shown that maximizing the mutual information (MI) between such two modalities enhances the molecule representation ability. Meanwhile, existing molecule multi-modal pretraining approaches approximate MI based on the representation space encoded from the topology and geometry, thus resulting in the loss of critical structural information of molecules. To address this issue, we propose MoleculeSDE. MoleculeSDE leverages group symmetric (e.g., SE(3)-equivariant and reflection-antisymmetric) stochastic differential equation models to generate the 3D geometries from 2D topologies, and vice versa, directly in the input space. It not only obtains tighter MI bound but also enables prosperous downstream tasks than the previous work. By comparing with 17 pretraining baselines, we empirically verify that MoleculeSDE can learn an expressive representation with state-of-the-art performance on 26 out of 32 downstream tasks.
Physics-Inspired Protein Encoder Pre-Training via Siamese Sequence-Structure Diffusion Trajectory Prediction
Zuobai Zhang
Minghao Xu
Aurelie Lozano
Vijil Chenthamarakshan
Payel Das
Pre-training methods on proteins are recently gaining interest, leveraging either protein sequences or structures, while modeling their joint energy landscape is largely unexplored. In this work, inspired by the success of denoising diffusion models, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure multimodal diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the multimodal diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a physics-inspired method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. The source code will be released upon acceptance.
GraphCG: Unsupervised Discovery of Steerable Factors in Graphs
Shengchao Liu
Chengpeng Wang
Weili Nie
Hanchen Wang
Jiarui Lu
Bolei Zhou
Deep generative models have been extensively explored recently, especially for the graph data such as molecular graphs and point clouds. Yet, much less investigation has been carried out on understanding the learned latent space of deep graph generative models. Such understandings can open up a unified perspective and provide guidelines for essential tasks like controllable generation. In this paper, we first examine the representation space of the recent deep generative model trained for graph data, observing that the learned representation space is not perfectly disentangled. Based on this observation, we then propose an unsupervised method called GraphCG, which is model-agnostic and task-agnostic for discovering steerable factors in graph data. Specifically, GraphCG learns the semantic-rich directions via maximizing the corresponding mutual information, where the edited graph along the same direction will possess certain steerable factors. We conduct experiments on two types of graph data, molecular graphs and point clouds. Both the quantitative and qualitative results show the effectiveness of GraphCG for discovering steerable factors. The code will be public in the near future.
Flaky Performances when Pretraining on Relational Databases
Shengchao Liu
David Vazquez
Pierre-Andre Noel