Jian Tang

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, HEC Montréal, Department of Decision Sciences
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research (DIRO)
Founder, BioGeometry

Research Topics

Computational Biology
Large Language Models (LLM)
AI for Science
Generative Models
Molecular Modeling
Graph Neural Networks

Biography

Jian Tang is an associate professor in the Department of Decision Sciences at HEC Montréal. He is also an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal and a core academic member at Mila – Quebec Artificial Intelligence Institute. He holds a Canada CIFAR AI Chair and is the founder of BioGeometry, a startup specializing in generative AI for antibody discovery. His main research interests are deep generative models, graph machine learning, and their applications to drug discovery. He is an international leader in graph machine learning, and his representative work on node representation learning, LINE, has been widely recognized and cited more than 5,000 times. He has also done pioneering work on AI for drug discovery, including the first open-source machine learning frameworks for drug discovery, TorchDrug and TorchProtein.

Current Students

PhD - UdeM
Principal supervisor:
PhD - Université de Montréal
PhD - UdeM
Principal supervisor:
PhD - UdeM
PhD - UdeM

Publications

Enhancing link prediction in biomedical knowledge graphs with BioPathNet
Emy Yue Hu
Svitlana Oleshko
Samuele Firmani
Hui Cheng
Maria Ulmer
Matthias Arnold
Maria Colomé-Tatché
Annalisa Marsico
Understanding complex interactions in biomedical networks is crucial for advancements in biomedicine, but traditional link prediction (LP) methods are limited in capturing this complexity. We present BioPathNet, a graph neural network framework based on the neural Bellman–Ford network (NBFNet), addressing limitations of traditional representation-based learning methods through path-based reasoning for LP in biomedical knowledge graphs. Unlike node-embedding frameworks, BioPathNet learns representations between node pairs by considering all relations along paths, enhancing prediction accuracy and interpretability, and allowing visualization of influential paths and biological validation. BioPathNet leverages a background regulatory graph for enhanced message passing and uses stringent negative sampling to improve precision and scalability. BioPathNet outperforms or matches existing methods across diverse tasks including gene function annotation, drug–disease indication, synthetic lethality and lncRNA–target interaction prediction. Our study identifies promising additional drug indications for diseases such as acute lymphoblastic leukaemia and Alzheimer’s disease, validated by medical experts and clinical trials. In addition, we prioritize putative synthetic lethal gene pairs and regulatory lncRNA–target interactions. BioPathNet’s interpretability will enable researchers to trace prediction paths and gain molecular insights.
Fast Proteome-Scale Protein Interaction Retrieval via Residue-Level Factorization
Narendra Chaudhary
Qian Cong
Jian Zhou
Sanchit Misra
Protein-protein interactions (PPIs) are mediated at the residue level. Most sequence-based PPI models consider residue-residue interactions across two proteins, which can yield accurate interaction scores but are too slow to scale. At proteome scale, identifying candidate PPIs requires evaluating nearly *all possible protein pairs*. For
Property-Driven Protein Inverse Folding with Multi-Objective Preference Alignment
Junqi Liu
Xiaoyang Hou
Xin Liu
Zhi Yang
Protein sequence design must balance designability, defined as the ability to recover a target backbone, with multiple, often competing, developability properties such as solubility, thermostability, and expression. Existing approaches address these properties through post hoc mutation, inference-time biasing, or retraining on property-specific subsets, yet they are target dependent and demand substantial domain expertise or careful hyperparameter tuning. In this paper, we introduce ProtAlign, a multi-objective preference alignment framework that fine-tunes pretrained inverse folding models to satisfy diverse developability objectives while preserving structural fidelity. ProtAlign employs a semi-online Direct Preference Optimization strategy with a flexible preference margin to mitigate conflicts among competing objectives and constructs preference pairs using in silico property predictors. Applied to the widely used ProteinMPNN backbone, the resulting model MoMPNN enhances developability without compromising designability across tasks including sequence design for CATH 4.3 crystal structures, de novo generated backbones, and real-world binder design scenarios, making it an appealing framework for practical protein sequence design.
Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute
Kieran Didi
Guoqing Zhou
Danny Reidenbach
Zhonglin Cao
Sooyoung Cha
Tomas Geffner
Christian Dallago
Michael Bronstein
Martin Steinegger
Emine Kucukbenli
Arash Vahdat
Karsten Kreis
Protein interaction modeling is central to protein design, which has been transformed by machine learning with broad applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is most often cast as either conditional generative modeling or sequence optimization via structure predictors ("hallucination"). We argue that this is a false dichotomy and propose Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architecture and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We further demonstrate explicit interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.
Towards All-Atom Foundation Models for Biomolecular Binding Affinity Prediction
Liang Shi
Santiago Miret
Zhi Yang
Biomolecular interactions play a critical role in biological processes. While recent breakthroughs like AlphaFold 3 have enabled accurate modeling of biomolecular complex structures, predicting binding affinity remains challenging mainly due to limited high-quality data. Recent methods are often specialized for specific types of biomolecular interactions, limiting their generalizability. In this work, we repurpose AlphaFold 3 for representation learning to predict binding affinity, a non-trivial task that requires shifting from generative structure prediction to encoding observed geometry, simplifying the heavily conditioned trunk module, and designing a framework to jointly capture sequence and structural information. To address these challenges, we introduce the **Atom-level Diffusion Transformer (ADiT)**, which takes sequence and structure as inputs, employs a unified tokenization scheme, integrates diffusion transformers, and removes dependencies on multiple sequence alignments and templates. We pre-train three ADiT variants on the PDB dataset with a denoising objective and evaluate them across protein-ligand, drug-target, protein-protein, and antibody-antigen interactions. The model achieves state-of-the-art or competitive performance across benchmarks, scales effectively with model size, and successfully identifies wet-lab validated affinity-enhancing antibody mutations, establishing a generalizable framework for biomolecular interactions. We plan to release the code upon acceptance.
Efficient, Non‐Destructive Transfer of Wafer‐Scale Monolayer MoS₂ by Interface Engineering
Zheng Wei
Yongqing Cai
Jieying Liu
Liyan Zhang
Jiaojiao Zhao
Li Li
Qinqin Wang
Huimin Zhang
Zhihua Zhang
Dongxia Shi
Luojun Du
Controllable Generation of Drug-like Molecules with Multi-modal Variational Flow
Fang Sun
Hongyu Guo
Ming Zhang
Yizhou Sun
Designing drug molecules that bind effectively to target proteins while maintaining desired pharmacological properties remains a fundamental challenge in drug discovery. Current approaches struggle to simultaneously control molecular topology and 3D geometry, often requiring expensive retraining for new design objectives. We propose a multi-modal variational flow framework that addresses these limitations by integrating a 2D topology encoder with a 3D geometry generator. Our architecture encodes molecular graphs into a learned latent distribution via junction tree representations, then employs normalizing flows to autoregressively generate atoms in 3D space conditioned on the protein binding site. This design enables zero-shot controllability: by manipulating the latent prior distribution, we can generate molecules with specific substructures or optimized properties without model retraining. Experiments on the CrossDocked benchmark show that our model achieves 31.1% high-affinity rate, substantially outperforming existing methods, while maintaining superior drug-likeness and structural diversity. Our framework opens new possibilities for on-demand molecular design, allowing medicinal chemists to rapidly explore chemical space with precise control over both structural motifs and physicochemical properties.
A Hardware‐in‐Loop Digital Twin Approach for Intelligent Optimization of Municipal Solid Waste Incineration
Wen Yu
JunFei Qiao
Rapid De Novo Antibody Design with GeoFlow-V3
BioGeometry Team
Recent years have witnessed striking advances in miniprotein design, yet de novo antibody discovery remains challenging, marked by low binding rates and the need for extensive, labor-intensive experimental screening of millions of candidates. This technical report introduces GeoFlow-V3, a unified atomic generative model for structure prediction and protein design. GeoFlow-V3 delivers improved accuracy on antibody-antigen complex structure prediction relative to our previous version, and its performance is further enhanced when experimental constraints or prior knowledge are provided, enabling precise control over both folding and design. The model also demonstrates reliable ability to discriminate binders from non-binders based on its confidence scores. Leveraging this capability, we build a GeoFlow-V3 in silico pipeline to design no more than 50 nanobodies per therapeutically relevant target de novo, completing a single round of wet-lab characterization in under three weeks. GeoFlow-V3 identifies at least one binder for 8 tested epitopes and achieves an average hit rate of 15.5%, representing a two-orders-of-magnitude improvement over prior computational pipelines. These results position GeoFlow-V3 as an appealing platform for rapid, AI-driven therapeutic antibody discovery, significantly reducing experimental screening demands and offering a powerful avenue to tackle previously undruggable targets. A demo of GeoFlow-V3 can be accessed via prot.design for non-commercial use.
Aligning Protein Conformation Ensemble Generation with Physical Feedback
Stephen Z. Lu
Aurelie Lozano
Vijil Chenthamarakshan
Payel Das
Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.
Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design
Danny Reidenbach
Zhonglin Cao
Kieran Didi
Tomas Geffner
Guoqing Zhou
Christian Dallago
Arash Vahdat
Emine Kucukbenli
Karsten Kreis
High-quality training datasets are crucial for the development of effective protein design models, but existing synthetic datasets often include unfavorable sequence-structure pairs, impairing generative model performance. We leverage ProteinMPNN, whose sequences are experimentally favorable as well as amenable to folding, together with structure prediction models to align high-quality synthetic structures with recoverable synthetic sequences. In that way, we create a new dataset designed specifically for training expressive, fully atomistic protein generators. By retraining La-Proteína, which models discrete residue type and side chain structure in a continuous latent space, on this dataset, we achieve new state-of-the-art results, with improvements of +54% in structural diversity and +27% in co-designability. To validate the broad utility of our approach, we further introduce Proteína-Atomística, a unified flow-based framework that jointly learns the distribution of protein backbone structure, discrete sequences, and atomistic side chains without latent variables. We again find that training on our new sequence-structure data dramatically boosts benchmark performance, improving Proteína-Atomística’s structural diversity by +73% and co-designability by +5%. Our work highlights the critical importance of aligned sequence-structure data for training high-performance de novo protein design models. All data will be publicly released.