Jian Tang

Biography

Jian Tang is an associate professor in the Department of Decision Sciences at HEC. He is also an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal and a core academic member at Mila – Quebec Artificial Intelligence Institute. He holds a Canada CIFAR AI Chair and is the founder of BioGeometry, a startup specializing in generative AI for antibody discovery. His main research areas are deep generative models, graph machine learning, and their applications to drug discovery. He is an international leader in graph machine learning, and his representative work on node representation learning, LINE, has been widely recognized and cited more than 5,000 times. He has also done extensive pioneering work on AI for drug discovery, including TorchDrug and TorchProtein, the first open-source machine learning framework for drug discovery.

Current Students

Huiyu Cai

PhD - UdeM

PhD - UdeM

Principal supervisor:

Xixian Liu

PhD - Université de Montréal

Jiarui Lu

PhD - UdeM

PhD - UdeM

Principal supervisor:

Gauthier Gidel

Xinyu Yuan

PhD - UdeM

Zhihao Zhan

PhD - UdeM

PhD - UdeM

PhD - HEC

Jianan Zhao

PhD - UdeM

Publications

Biomedical discovery through the integrative biomedical knowledge hub (iBKH)

Chang Su

Yu Hou

Manqi Zhou

Suraj Rajendran

Jacqueline R.M. A. Maasch

Zehra Abedi

Haotan Zhang

Zilong Bai

Anthony Cuturrufo

Winston Guo

Fayzan F. Chaudhry

Gregory Ghahramani

Feixiong Cheng

Yue Li

Rui Zhang

Steven T. DeKosky

Jiang Bian

Fei Wang

The massive and continuously increasing volume of biomedical knowledge derived from biological experiments or gained from healthcare practices has become an invaluable treasure for biomedicine. The emerging biomedical knowledge graphs (BKGs) provide an efficient and effective way to manage the abundant knowledge in biomedical and life science. In the present study, we harmonized and integrated data from diverse biomedical resources to curate a comprehensive BKG, named the integrative Biomedical Knowledge Hub (iBKH). To facilitate the usage of iBKH in biomedical research, we developed a web-based, easy-to-use, publicly available graphical portal that allows fast, interactive, and visualized knowledge retrieval in iBKH. Furthermore, an efficient and scalable graph learning pipeline was developed for novel knowledge discovery in iBKH. As a proof of concept, we applied our iBKH-based method to in silico drug repurposing for Alzheimer’s disease. The iBKH is publicly available at: http://ibkh.ai/.

2023-03-31

iScience (published)

A Systematic Study of Joint Representation Learning on Protein Sequences and Structures

Chuanrui Wang

Vijil Chenthamarakshan

Aurelie Lozano

Payel Das

Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein functions. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods leverage 3D structural information with graph neural networks, and geometric pre-training methods show potential in function prediction tasks, but both still suffer from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting a new state of the art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet.

2023-03-10

arXiv (preprint)

arxiv.org

Enhancing Protein Language Model with Structure-based Encoder and Pre-training

Aurelie Lozano

Vijil Chenthamarakshan

Payel Das

Protein language models (PLMs) pre-trained on large-scale protein sequence corpora have achieved impressive performance on various downstream protein understanding tasks. Despite the ability to implicitly capture inter-residue contact information, transformer-based PLMs cannot encode protein structures explicitly for better structure-aware protein representations. Besides, the power of pre-training on available protein structures has not been explored for improving these PLMs, though structures are important to determine functions. To tackle these limitations, in this work, we enhance the PLM with a structure-based encoder and pre-training. We first explore feasible model architectures to combine the advantages of a state-of-the-art PLM (i.e., ESM-1b) and a state-of-the-art protein structure encoder (i.e., GearNet). We empirically verify that ESM-GearNet, which connects the two encoders in series, is the most effective combination model. To further improve the effectiveness of ESM-GearNet, we pre-train it on massive unlabeled protein structures with contrastive learning, which aligns representations of co-occurring subsequences so as to capture their biological correlation. Extensive experiments on EC and GO protein function prediction benchmarks demonstrate the superiority of ESM-GearNet over previous PLMs and structure encoders, and clear performance gains are further achieved by structure-based pre-training upon ESM-GearNet. The source code will be made public upon acceptance.

2023-03-05

ICLR.cc/2023/Workshop/MLDD (poster)

EurNet: Efficient Multi-Range Relational Modeling of Protein Structure

Yuanfan Guo

Yi Xu

Xinlei Chen

Yuandong Tian

Modeling the 3D structures of proteins is critical for obtaining effective protein structure representations, which further boosts protein function understanding. Existing protein structure encoders mainly focus on modeling short-range interactions within protein structures, while they neglect modeling the interactions at multiple length scales that are actually complete interactive patterns in protein structures. To attain complete interaction modeling with efficient computation, we introduce the EurNet for Efficient multi-range relational modeling. In EurNet, we represent the protein structure as a multi-relational residue-level graph with different types of edges for modeling short-range, medium-range and long-range interactions. To efficiently process these different interactive relations, we propose a novel modeling layer, called Gated Relational Message Passing (GRMP), as the basic building block of EurNet. GRMP can capture multiple interactive relations in protein structures with little extra computational cost. We verify the state-of-the-art performance of EurNet on EC and GO protein function prediction benchmarks, and the proposed GRMP layer is shown to achieve a better efficiency-performance trade-off than the widely used relational graph convolution.

2023-03-05

ICLR.cc/2023/Workshop/MLDD (poster)

E3Bind: An End-to-End Equivariant Network for Protein-Ligand Docking

In silico prediction of the ligand binding pose to a given protein target is a crucial but challenging task in drug discovery. This work focuses on blind flexible self-docking, where we aim to predict the positions, orientations and conformations of docked molecules. Traditional physics-based methods usually suffer from inaccurate scoring functions and high inference cost. Recently, data-driven methods based on deep learning techniques are attracting growing interest thanks to their efficiency during inference and promising performance. These methods usually either adopt a two-stage approach by first predicting the distances between proteins and ligands and then generating the final coordinates based on the predicted distances, or directly predict the global roto-translation of ligands. In this paper, we take a different route. Inspired by the resounding success of AlphaFold2 for protein structure prediction, we propose E3Bind, an end-to-end equivariant network that iteratively updates the ligand pose. E3Bind models the protein-ligand interaction through careful consideration of the geometric constraints in docking and the local context of the binding site. Experiments on standard benchmark datasets demonstrate the superior performance of our end-to-end trainable model compared to traditional and recently proposed deep learning methods.

2023-01-31

ICLR.cc/2023/Conference (poster)

Learning on Large-Scale Text-Attributed Graphs via Variational Inference

Jianan Zhao

Meng Qu

Chaozhuo Li

Hao Yan

Qian Liu

Rui Li

Xing Xie

This paper studies learning on text-attributed graphs (TAGs), where each node is associated with a text description. An ideal solution for such a problem would be integrating both the text and graph structure information with large language models and graph neural networks (GNNs). However, the problem becomes very challenging when graphs are large due to the high computational complexity brought by training large language models and GNNs together. In this paper, we propose an efficient and effective solution to learning on large text-attributed graphs by fusing graph structure and language learning with a variational Expectation-Maximization (EM) framework, called GLEM. Instead of simultaneously training large language models and GNNs on big graphs, GLEM proposes to alternately update the two modules in the E-step and M-step. Such a procedure allows training the two modules separately while simultaneously allowing the two modules to interact and mutually enhance each other. Extensive experiments on multiple data sets demonstrate the efficiency and effectiveness of the proposed approach.

2023-01-31

ICLR.cc/2023/Conference (notable)

Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching

Shengchao Liu

Hongyu Guo

Molecular representation pretraining is critical in various applications for drug and material discovery due to the limited number of labeled molecules, and most existing work focuses on pretraining on 2D molecular graphs. However, the power of pretraining on 3D geometric structures has been less explored. This is owing to the difficulty of finding a sufficient proxy task that can empower the pretraining to effectively extract essential features from the geometric structures. Motivated by the dynamic nature of 3D molecules, where the continuous motion of a molecule in the 3D Euclidean space forms a smooth potential energy surface, we propose GeoSSL, a 3D coordinate denoising pretraining framework to model such an energy landscape. Further by leveraging an SE(3)-invariant score matching method, we propose GeoSSL-DDM in which the coordinate denoising proxy task is effectively boiled down to denoising the pairwise atomic distances in a molecule. Our comprehensive experiments confirm the effectiveness and robustness of our proposed method.

2023-01-31

ICLR.cc/2023/Conference (poster)

Protein Representation Learning by Geometric Structure Pretraining

Arian Jamasb

Vijil Chenthamarakshan

Aurelie Lozano

Payel Das

Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.

2023-01-31

ICLR.cc/2023/Conference (poster)

Protein Sequence and Structure Co-design with Equivariant Translation

Chence Shi

Chuanrui Wang

Jiarui Lu

Bozitao Zhong

Proteins are macromolecules that perform essential functions in all living organisms. Designing novel proteins with specific structures and desired functions has been a long-standing challenge in the field of bioengineering. Existing approaches generate both protein sequence and structure using either autoregressive models or diffusion models, both of which suffer from high inference costs. In this paper, we propose a new approach capable of protein sequence and structure co-design, which iteratively translates both protein sequence and structure into the desired state from random initialization, based on context features given a priori. Our model consists of a trigonometry-aware encoder that reasons about geometrical constraints and interactions from context features, and a roto-translation equivariant decoder that translates protein sequence and structure interdependently. Notably, all protein amino acids are updated in one shot in each translation step, which significantly accelerates the inference process. Experimental results across multiple tasks show that our model outperforms previous state-of-the-art baselines by a large margin, and is able to design proteins of high fidelity as regards both sequence and structure, with running time orders of magnitude less than sampling-based methods.

2023-01-31

ICLR.cc/2023/Conference (poster)

Design and Application of Adaptive Sparse Deep Echo State Network

Cuili Yang

Sheng Yang

Bing Li

The prediction of appliance energy consumption in buildings is a time series forecasting problem, which can be solved by an echo state network (ESN). However, due to the randomly initialized inputs and reservoir, some redundant or irrelevant components are inevitably generated in the original ESN. To solve this problem, the adaptive sparse deep echo state network (ASDESN) is proposed, in which the information is processed layer by layer. Firstly, a principal component analysis (PCA) layer is inserted to penalize the redundant projection transmitted between sub-reservoirs. Secondly, a coordinate descent based adaptive sparse learning method is proposed to generate the sparse output weights. In particular, the designed adaptive threshold strategy is able to increase the sparsity of the output weights as network depth increases. Moreover, the echo state property (ESP) of ASDESN is given to ensure its applicability. The experimental results on both simulated benchmark and real appliance energy datasets illustrate that the proposed ASDESN outperforms other ESNs with higher prediction accuracy and stability.

2022-12-31

IEEE Transactions on Consumer Electronics (published)

FusionRetro: Molecule Representation Fusion via Reaction Graph for Retrosynthetic Planning

Songtao Liu

Zhengkai Tu

Minkai Xu

Peilin Zhao

Rex Ying

Lu Lin

Dinghao Wu

Retrosynthetic planning is a fundamental problem in drug discovery and organic chemistry, which aims to find a complete multi-step synthetic route from a set of starting materials to the target molecule, determining crucial process flow in chemical production. Existing approaches combine single-step retrosynthesis models and search algorithms to find synthetic routes. However, these approaches generally consider the two pieces in a decoupled manner, taking only the product as the input to predict the reactants per planning step and largely ignoring the important context information from other intermediates along the synthetic route. In this work, we perform a series of experiments to identify the limitations of this decoupled view and propose a novel retrosynthesis framework that also exploits context information for retrosynthetic planning. We view synthetic routes as reaction graphs, and propose to incorporate the context by three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. The whole framework can be efficiently optimized in an end-to-end fashion. Comprehensive experiments show that by fusing in context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes.

2022-12-31

(published)

www.semanticscholar.org

Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction

Aurelie Lozano

Vijil Chenthamarakshan

Payel Das

Self-supervised pre-training methods on proteins have recently gained attention, with most approaches focusing on either protein sequences or structures, neglecting the exploration of their joint distribution, which is crucial for a comprehensive understanding of protein functions by integrating co-evolutionary information and structural characteristics. In this work, inspired by the success of denoising diffusion models in generative tasks, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure joint diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the joint diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. Our implementation is available at https://github.com/DeepGraphLearning/SiamDiff.

2022-12-31

arXiv.org (preprint)