Jian Tang

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, HEC Montréal, Department of Decision Sciences
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research (DIRO)
Founder, BioGeometry
Research Topics
Computational Biology
Large Language Models (LLM)
AI for Science
Generative Models
Molecular Modeling
Graph Neural Networks

Biography

Jian Tang is an associate professor in the Department of Decision Sciences at HEC Montréal. He is also an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal and a core academic member at Mila – Quebec Artificial Intelligence Institute. He holds a Canada CIFAR AI Chair and is the founder of BioGeometry, a startup specializing in generative AI for antibody discovery. His main research interests are deep generative models, graph machine learning, and their applications to drug discovery. He is an international leader in graph machine learning, and his representative work on node representation learning, LINE, has been widely recognized and cited more than 5,000 times. He has also done pioneering work on AI for drug discovery, including the first open-source machine learning frameworks for drug discovery, TorchDrug and TorchProtein.

Current Students

PhD - UdeM
Principal supervisor:
PhD - Université de Montréal
PhD - UdeM
Principal supervisor:
PhD - UdeM
PhD - UdeM

Publications

Engineered Nonheme Iron Enzymes Enable Asymmetric Hydrogenation of Alkenes
Yunfei He
Shuang-Yu Dai
Mei‐Yan Xu
Baixu Ma
Lizhi Tao
Developing biocatalytic systems capable of reducing simple alkenes is highly desirable for synthetic chemistry and biosynthesis, yet existing enzymes remain largely restricted to their ability to convert polarized, electron-deficient substrates. Here, we present a nonheme iron metalloenzyme platform that enables hydrogenation of styrenes, conjugated nitriles and amides, and nonconjugated olefins through a putative iron–hydride mechanism. Starting from the Fe(II)/α-ketoglutarate-dependent dioxygenase GOX, iterative rounds of directed evolution produced an engineered “alkene hydrogenase” (AHase-6) containing 16 mutations and promoting NaBH4-driven reduction across diverse C═C bond motifs. Kinetic analysis indicates that this enzymatic hydrogenation process proceeds via formation of an enzyme–substrate ternary complex through a sequential mechanism. Mechanistic studies further reveal that alkene insertion occurs with regioselectivity governed primarily by substrate electronics and sterics. These findings establish nonheme iron enzymes as an unrecognized scaffold for metal–hydride-based hydrogenation and highlight their potential as sustainable, tunable alternatives to traditional catalytic systems.
Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning
Yashi Zhang
Hongyu Guo
Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expression responses for unobserved conditions. A promising recent direction leverages large language models (LLMs) as "virtual cell" simulators, using stepwise, knowledge-grounded mechanistic reasoning to infer differential expression, and points toward an interpretable, knowledge-driven paradigm that transcends purely data-driven approaches. However, we find that plausibility is not prediction: despite producing biologically plausible explanations, these methods fail to capture perturbation-specific effects, systematically overestimating differential expression, often underperforming a simple gene-frequency baseline in aggregate evaluations, and collapsing to chance-level performance at the per-gene level. This reveals a reliance on intrinsic gene response tendencies rather than true perturbation reasoning. We trace this failure to how evidence is presented: existing methods evaluate perturbation-gene pairs in isolation, without exposing how related perturbations differ in their effects on the same gene. To address this limitation, we introduce CORE (Contrastive Organization of Relational Evidence), which reframes prediction as a comparison task by organizing evidence into positive and negative outcomes from related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and substantially boosts perturbation-specific prediction in both LLM-based and non-LLM settings: for example, on drug-perturbation data, CORE-Reasoning improves Qwen3.5-9B aggregate metrics by up to 28.6%, while on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance to 0.703 on average across four cell lines. This highlights contrastive evidence organization as essential to reliable LLM-based perturbation reasoning.
Learning Structure, Energy, and Dynamics: A Survey of Artificial Intelligence for Protein Dynamics
Haocheng Tang
Liang Shi
Protein dynamics underlie many biological functions, yet remain difficult to characterize due to the high computational cost of molecular dynamics simulations and the scarcity of dynamic structural data. This survey reviews recent advances in artificial intelligence for protein dynamics from three perspectives: learning from structural ensembles and trajectories, learning from physical energy signals, and learning to accelerate molecular simulations. We summarize representative methods for conformation ensemble generation, trajectory generation, Boltzmann generators, physics-aware adaptation, machine learning potentials, coarse-grained modeling, and collective variable discovery. We further discuss available datasets and key open challenges, such as scalability, thermodynamic consistency, kinetic fidelity, and integration with experimental constraints.
RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception
Jiahao Ma
Qiang Zhang
Peiran Liu
Zeran Su
Pihai Sun
Gang Han
Wen Zhao
Wei Cui
Zhang Zhang
Zhiyuan Xu
Renjing Xu
Miaomiao Liu
Yijie Guo
Surround-view perception is increasingly important for robotic navigation and loco-manipulation, especially in human-in-the-loop settings such as teleoperation, data collection, and emergency takeover. However, current robotic visual interfaces are often limited to narrow forward-facing views, or, when multiple on-board cameras are available, require cumbersome manual switching that interrupts the operator's workflow. Both configurations suffer from motion-induced jitter that causes simulator sickness in head-mounted displays. We introduce a surround-view robotic vision system that combines six cameras with LiDAR to provide full 360
Antibody discovery technology: innovation and outlook from classic to leading edge
PengFei WANG
Atomic Trajectory Modeling with State Space Models for Biomolecular Dynamics
Liang Shi
Junqi Liu
Zhi Yang
Understanding the dynamic behavior of biomolecules is fundamental to elucidating biological function and facilitating drug discovery. While Molecular Dynamics (MD) simulations provide a rigorous physical basis for studying these dynamics, they remain computationally expensive for long timescales. Conversely, recent deep generative models accelerate conformation generation but typically either fail to model temporal relationships or are built only for monomeric proteins. To bridge this gap, we introduce ATMOS, a novel generative framework based on State Space Models (SSM) designed to generate atom-level MD trajectories for biomolecular systems. ATMOS integrates a Pairformer-based state transition mechanism to capture long-range temporal dependencies, with a diffusion-based module to decode trajectory frames in an autoregressive manner. ATMOS is trained on crystal structures from PDB and conformation trajectories from large-scale MD simulation datasets including mdCATH and MISATO. We demonstrate that ATMOS achieves state-of-the-art performance in generating conformation trajectories for both protein monomers and complex protein-ligand systems. By enabling efficient inference of atomic motion trajectories, this work establishes a promising foundation for modeling biomolecular dynamics.
PerturbDiff: Functional Diffusion for Single-Cell Perturbation Modeling
Yashi Zhang
Hongyu Guo
Building Virtual Cells that can accurately simulate cellular responses to perturbations is a long-standing goal in systems biology. A fundamental challenge is that high-throughput single-cell sequencing is destructive: the same cell cannot be observed both before and after a perturbation. Thus, perturbation prediction requires mapping unpaired control and perturbed populations. Existing models address this by learning maps between distributions, but typically assume a single fixed response distribution when conditioned on observed cellular context (e.g., cell type) and the perturbation type. In reality, responses vary systematically due to unobservable latent factors such as microenvironmental fluctuations and complex batch effects, forming a manifold of possible distributions for the same observed conditions. To account for this variability, we introduce PerturbDiff, which shifts modeling from individual cells to entire distributions. By embedding distributions as points in a Hilbert space, we define a diffusion-based generative process operating directly over probability distributions. This allows PerturbDiff to capture population-level response shifts across hidden factors. Benchmarks on established datasets show that PerturbDiff achieves state-of-the-art performance in single-cell response prediction and generalizes substantially better to unseen perturbations. See our project page (https://katarinayuan.github.io/PerturbDiff-ProjectPage/), where code and data will be made publicly available (https://github.com/DeepGraphLearning/PerturbDiff).
GeneZip: Region-Aware Compression for Long Context DNA Modeling
Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at 1M-bp context. All experiments in this paper can be trained on a single A100 80GB GPU.
GENERator: A Long-Context Generative Genomic Foundation Model
Q. Li
Wei Wu
Yong Zhang
Rui Chen
Mingyang Li
Kun Fu
Junyan Qi
Yongzhou Bao
Chao Wang
Yiheng Zhu
Zhiyun Zhang
Fuli Feng
Jieping Ye
Liu Yuwen
Hui Xiong
Zheng Wang
Zhang, Yuanyuan
Chen, Ruipu
Wang, Chao
Tang, Jian
Enhancing link prediction in biomedical knowledge graphs with BioPathNet
Emy Yue Hu
Svitlana Oleshko
Samuele Firmani
Hui Cheng
Maria Ulmer
Matthias Arnold
Maria Colomé-Tatché
Annalisa Marsico
Understanding complex interactions in biomedical networks is crucial for advancements in biomedicine, but traditional link prediction (LP) methods are limited in capturing this complexity. We present BioPathNet, a graph neural network framework based on the neural Bellman–Ford network (NBFNet), addressing limitations of traditional representation-based learning methods through path-based reasoning for LP in biomedical knowledge graphs. Unlike node-embedding frameworks, BioPathNet learns representations between node pairs by considering all relations along paths, enhancing prediction accuracy and interpretability, and allowing visualization of influential paths and biological validation. BioPathNet leverages a background regulatory graph for enhanced message passing and uses stringent negative sampling to improve precision and scalability. BioPathNet outperforms or matches existing methods across diverse tasks including gene function annotation, drug–disease indication, synthetic lethality and lncRNA–target interaction prediction. Our study identifies promising additional drug indications for diseases such as acute lymphoblastic leukaemia and Alzheimer’s disease, validated by medical experts and clinical trials. In addition, we prioritize putative synthetic lethal gene pairs and regulatory lncRNA–target interactions. BioPathNet’s interpretability will enable researchers to trace prediction paths and gain molecular insights.
Fast Proteome-Scale Protein Interaction Retrieval via Residue-Level Factorization
Narendra Chaudhary
Qian Cong
Jian Zhou
Sanchit Misra
Protein-protein interactions (PPIs) are mediated at the residue level. Most sequence-based PPI models consider residue-residue interactions across two proteins, which can yield accurate interaction scores but are too slow to scale. At proteome scale, identifying candidate PPIs requires evaluating nearly *all possible protein pairs*. For