Yue Li

Associate Academic Member
Assistant Professor, McGill University, School of Computer Science
Research Topics
AI in Health
Bayesian Models
Computational Biology
Deep Learning
Genetics
Large Language Models (LLM)
Multimodal Learning
Single-Cell Genomics

Biography

I completed my PhD degree in computer science and computational biology at the University of Toronto in 2014. Prior to joining McGill University, I was a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT (2015–2018).

In general, my research program covers three main research areas that involve applied machine learning in computational genomics and health. More specifically, it focuses on developing interpretable probabilistic learning models and deep learning models to model genetic, epigenetic, electronic health record and single-cell genomic data.

By systematically integrating multimodal and longitudinal data, I aim to have impactful applications in computational medicine, including building intelligent clinical recommender systems, forecasting patient health trajectories, making personalized polygenic risk predictions, characterizing multi-trait functional genetic mutations, and dissecting cell-type-specific regulatory elements that underpin complex traits and diseases in humans.

Current Students

Postdoctorate - McGill University (3)
PhD - McGill University (6)
Master's Research - McGill University (7)

Publications

GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling
Yimin Fan
Single-cell Assay for Transposase-Accessible Chromatin with sequencing (scATAC-seq) enables investigation of open chromatin landscapes at single-cell resolution, but its analysis remains challenging because of sparsity, noise, and dataset-specific peak vocabularies. Genome Foundation Models (GFMs), pre-trained on large DNA sequence corpora, offer a potential source of transferable sequence information for scATAC-seq modeling. We introduce the Genome Foundation Embedded Topic Model (GFETM), an interpretable framework that combines GFMs with the Embedded Topic Model (ETM) for sequence-informed scATAC-seq analysis. By integrating GFM-derived DNA sequence embeddings into a topic-model decoder, GFETM improves clustering quality on standard benchmarks and captures cell-state-specific transcription factor activity through motif scoring and attention-based interpretation.
WebArena-Pro: A Heterogeneous, Multimodal, Reproducible Benchmark for Web Agents
Fatemeh Pesaran zadeh
Weijian Qi
Alexander Miller
Junyi Song
Yunjia Tian
Dongjin Kang
Seyeon Choi
Ewen Gueguen
Zeyi Liao
Mengqi Yuan
Alexandre Lacoste
Huan Sun … (see 2 more)
Gunhee Kim
Web agents powered by large language and vision-language models are increasingly applied to realistic browser work that spans heterogeneous applications, multimodal content, and stateful workflows. However, existing reproducible web-agent benchmarks cover only a small number of web applications drawn from a few software categories, and restrict modality to text and vision. Live benchmarks broaden site coverage but sacrifice reproducibility, since pages and data drift between runs. Moreover, existing benchmarks do not meaningfully evaluate whether agents can understand and use audio and video content embedded within web tasks. To address these gaps, we introduce WebArena-Pro, a benchmark comprising 300 tasks across 20 self-hosted web applications in six domain categories, spanning distinct interface conventions, workflows, and data models. Across the evaluated agents, the best performance is achieved by Gemini 3.1 Pro, which attains 37.0% success under a 50-step budget, while open-source models' performance does not exceed 27.7% success. Among reproducible, human-curated web agent benchmarks, WebArena-Pro provides the broadest application coverage and the most comprehensive multimodal support to date. The benchmark treats audio and video as core observations alongside text and vision, with dedicated actions for extracting information from each. WebArena-Pro runs each task in isolation and supports reproducible, parallel evaluation. Tasks are authored through a dedicated annotator interface, filtered by LLM-assisted triage, and finally validated by humans before release.
Dissecting and steering cell dynamics using spatially-informed RNA velocity with veloAgent
Brent Yoon
Gregory J Fonseca
RNA velocity enables inference of cell state transitions from single-cell transcriptomics by modeling transcriptional dynamics from spliced and unspliced mRNA. However, existing methods overlook spatial context and struggle to scale to large datasets, limiting insights into tissue organization and dynamic processes. We introduce veloAgent, a deep generative and agent-based framework that estimates gene- and cell-specific transcriptional kinetics while integrating spatial information through agent-based simulations of local microenvironments. By leveraging both molecular and spatial cues, veloAgent improves velocity accuracy and achieves sublinear memory scaling, enabling efficient analysis of large and multi-batch spatial datasets. A distinctive feature of veloAgent is its in silico perturbation module, which allows targeted manipulation of spatial velocity vectors to simulate regulatory interventions and predict their impact on cell fate dynamics. These capabilities position veloAgent as a scalable and versatile framework for dissecting spatially resolved cellular dynamics and guiding cell fate manipulation across diverse biological processes.
PheCode-guided multi-modal topic modeling of electronic health records improves disease incidence prediction and GWAS discovery from UK Biobank
Ziqi Yang
Ziyang Song
Phenome-wide association studies rely on disease definitions derived from diagnostic codes, often failing to leverage the full richness of electronic health records (EHR). We present MixEHR-SAGE, a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications to enhance phenotyping from large-scale EHRs. By combining expert-informed priors with probabilistic inference, MixEHR-SAGE identifies over 1000 interpretable phenotype topics from UK Biobank data. Applied to 350 000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predict incident type 2 diabetes (T2D) and leukemia diagnoses. Subsequent genome-wide association studies using these continuous risk scores uncovered novel disease-associated loci, including PPP1R15A for T2D and JMJD6/SRSF2 for leukemia, that were missed by traditional binary case definitions. These results highlight the potential of probabilistic phenotyping from multi-modal EHRs to improve genetic discovery. The MixEHR-SAGE software is publicly available at: https://github.com/li-lab-mcgill/MixEHR-SAGE.
MiRformer: a dual-transformer-encoder framework for predicting microRNA-mRNA interactions from paired sequences
MicroRNAs (miRNAs) are small non-coding RNAs that regulate genes by binding to target messenger RNAs (mRNAs), causing them to degrade or suppressing their translation. Accurate prediction of miRNA–mRNA interactions is crucial for RNA therapeutics. Existing methods rely on handcrafted features, struggle to scale to kilobase-long mRNA sequences, or lack interpretability. We introduce MiRformer, a transformer framework designed to predict not only the binary miRNA–mRNA interaction but also the start and end location of the miRNA binding site in the mRNA sequence. MiRformer employs a dual-transformer encoder architecture to learn interaction patterns directly from raw miRNA–mRNA sequence pairs via cross-attention between the miRNA encoder and the mRNA encoder. To scale to long mRNA sequences, we leverage a sliding-window attention mechanism. MiRformer achieves state-of-the-art performance across diverse miRNA–mRNA tasks, including binding prediction, target-site localization, and cleavage-site identification from Degradome sequencing data. The learned transformer attention is highly interpretable and reveals strongly contrasting signals for the miRNA seed regions in 500-nt long mRNA sequences. We used MiRformer to simultaneously predict novel binding sites and cleavage sites in 13k miRNA–mRNA pairs and observed that the two types of sites tend to be close to each other, supporting a miRNA-mediated degradation mechanism. Our code is available at https://github.com/li-lab-mcgill/miRformer.
TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare
Ziyang Song
Qincheng Lu
Hao Xu
David L. Buckeridge
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success in Natural Language Processing and Computer Vision domains. However, the development of PTMs on healthcare time-series data is lagging behind. This underscores the limitations of the existing transformer-based architectures, particularly their scalability to handle large-scale time series and ability to capture long-term temporal dependencies. In this study, we present Timely Generative Pre-trained Transformer (TimelyGPT). TimelyGPT employs an extrapolatable position (xPos) embedding to encode trend and periodic patterns into time-series representations. It also integrates recurrent attention and temporal convolution modules to effectively capture global-local temporal dependencies. We evaluated TimelyGPT on two large-scale healthcare time series datasets corresponding to continuous biosignals and irregularly-sampled time series, respectively. Our experiments show that during pre-training, TimelyGPT excels in learning time-series representations from continuously monitored biosignals and irregularly-sampled time series data commonly observed in longitudinal electronic health records (EHRs). In forecasting continuous biosignals, TimelyGPT achieves accurate extrapolation up to 6,000 timesteps of body temperature during the sleep stage transition, given a short look-up window (i.e., prompt) containing only 2,000 timesteps. For irregularly-sampled time series, TimelyGPT with a proposed time-specific inference demonstrates high top recall scores in predicting future diagnoses using early diagnostic records, effectively handling irregular intervals between clinical records. Together, we envision TimelyGPT to be useful in a broad spectrum of health domains, including long-term patient health state forecasting and patient risk trajectory prediction.
Source-free cross-modality medical image synthesis with diffusion priors
Jia Chen
Kai Yang
Xinrong Hu
A Unified Solution to Diverse Heterogeneities in One-Shot Federated Learning
Yiliao Song
Atul Sajjanhar
Yong Xiang
Wei Zhou
Xiaohui Tao
Yan Li
One-Shot Federated Learning (OSFL) restricts communication between the server and clients to a single round, significantly reducing communication costs and minimizing privacy leakage risks compared to traditional Federated Learning (FL), which requires multiple rounds of communication. However, existing OSFL frameworks remain vulnerable to distributional heterogeneity, as they primarily focus on model heterogeneity while neglecting data heterogeneity. To bridge this gap, we propose FedHydra, a unified, data-free, OSFL framework designed to effectively address both model and data heterogeneity. Unlike existing OSFL approaches, FedHydra introduces a novel two-stage learning mechanism. Specifically, it incorporates model stratification and heterogeneity-aware stratified aggregation to mitigate the challenges posed by both model and data heterogeneity. By this design, the data and model heterogeneity issues are simultaneously monitored from different aspects during learning. Consequently, FedHydra can effectively mitigate both issues by minimizing their inherent conflicts. We compared FedHydra with five SOTA baselines on four benchmark datasets. Experimental results show that our method outperforms the previous OSFL methods in both homogeneous and heterogeneous settings. The code is available at https://github.com/Jun-B0518/FedHydra.
Harnessing agent-based frameworks in CellAgentChat to unravel cell–cell interactions from single-cell and spatial transcriptomics
Understanding cell–cell interactions (CCIs) is essential yet challenging owing to the inherent intricacy and diversity of cellular dynamics. Existing approaches often analyze global patterns of CCIs using statistical frameworks, missing the nuances of individual cell behavior owing to their focus on aggregate data. This makes them insensitive in complex environments where the detailed dynamics of cell interactions matter. We introduce CellAgentChat, an agent-based model (ABM) designed to decipher CCIs from single-cell RNA sequencing and spatial transcriptomics data. This approach models biological systems as collections of autonomous agents governed by biologically inspired principles and rules. Validated across eight diverse single-cell data sets, CellAgentChat demonstrates its effectiveness in detecting intricate signaling events across different cell populations. Moreover, CellAgentChat offers the ability to generate animated visualizations of single-cell interactions and provides flexibility in modifying agent behavior rules, facilitating thorough exploration of both close and distant cellular communications. Furthermore, CellAgentChat leverages ABM features to enable intuitive in silico perturbations via agent rule modifications, facilitating the development of novel intervention strategies. This ABM method unlocks an in-depth understanding of cellular signaling interactions across various biological contexts, thereby enhancing in silico studies for cellular communication–based therapies.
CellMemory: hierarchical interpretation of out-of-distribution cells using bottlenecked transformer
Qifei Wang
Yiwen Hu
Yanjie Chen
Yuwei Wang
Guochao Li
Yun Li
Jinfeng Chen
Xuegong Zhang
James Zou
Manolis Kellis
Dianbo Liu
Lan Jiang
Identifying the genetic and molecular drivers of phenotypic heterogeneity among individuals is vital for understanding human health and for diagnosing, monitoring, and treating diseases. To this end, international consortia such as the Human Cell Atlas and the Tabula Sapiens are creating comprehensive cellular references. Due to the massive volume of data generated, machine learning methods, especially transformer architectures, have been widely employed in related studies. However, applying machine learning to cellular data presents several challenges. One such challenge is making the methods interpretable with respect to both the input cellular information and its context. Another less explored challenge is the accurate representation of cells outside existing references, referred to as out-of-distribution (OOD) cells. The out-of-distribution status can be attributed to various physiological conditions, such as comparing diseased cells, particularly tumor cells, with healthy reference data, or to significant technical variations, such as using transfer learning from single-cell reference to spatial query data. Inspired by the global workspace theory in cognitive neuroscience, we introduce CellMemory, a bottlenecked Transformer with improved generalization capabilities designed for the hierarchical interpretation of OOD cells unseen during reference building. Even without pre-training, it exceeds the performance of large language models pre-trained with tens of millions of cells. In particular, when deciphering spatially resolved single-cell transcriptomics data, CellMemory demonstrates the ability to interpret data at the granule level accurately. Finally, we harness CellMemory's robust representational capabilities to elucidate malignant cells and their founder cells in different patients, providing reliable characterizations of the cellular changes caused by the disease.
FedWeight: mitigating covariate shift of federated learning on electronic health records data through patients re-weighting
Na Li
Xiaoxiao Li
Dianbo Liu
David L. Buckeridge
Federated learning (FL) enables collaborative analysis of decentralized medical data while preserving patient privacy. However, the covariate shift from demographic and clinical differences can reduce model generalizability. We propose FedWeight, a novel FL framework that mitigates covariate shift by reweighting patient data from the source sites using density estimators, allowing the trained model to better align with the distribution of the target site. To support unsupervised applications, we introduce FedWeight ETM, a federated embedded topic model. We evaluated FedWeight in cross-site FL on the eICU dataset and cross-dataset FL between eICU and MIMIC III. FedWeight consistently outperforms standard FL baselines in predicting ICU mortality, ventilator use, sepsis diagnosis, and length of stay. SHAP-based interpretation and ETM-based topic modeling reveal improved identification of clinically relevant characteristics and disease topics associated with ICU readmission.
ECLARE: multi-teacher contrastive learning via ensemble distillation for diagonal integration of single-cell multi-omic data
Anjali Chawla
Gustavo Turecki
Corina Nagy
Integrating multimodal single-cell data such as scRNA-seq with scATAC-seq is essential for decoding gene regulatory networks, but remains difficult due to feature harmonization and limited paired multiome data. We introduce ECLARE, a framework that uses multi-teacher ensemble knowledge distillation with contrastive learning and optimal-transport alignment to integrate unpaired single-cell multi-omic datasets. Across benchmarks, ECLARE achieves competitive performance for multimodal integration and biological structure preservation. We further demonstrate utility in a major depressive disorder case study using unpaired snRNA-seq and snATAC-seq, identifying transcription factor–target gene programs that are differentially regulated with sex- and cell-type specificity. Finally, ECLARE learns continuous representations that capture longitudinal structure, highlighting altered neurodevelopmental programs associated with depression in female subjects. Altogether, ECLARE expands the practical reach of multimodal single-cell analysis by enabling diagonal integration of unpaired data with strong biological preservation, facilitating integrative regulatory studies across diverse cohorts and conditions.