Yue Li

Postdoctorate - McGill University

Doruk Cakmakci

PhD - McGill University

Postdoctorate - McGill University


Vicky Dong

PhD - McGill University

Neda Esfehani

Master's Research - McGill University

Claris Gu

Master's Research - McGill University

Eric Huang

Master's Research - McGill University

Yixuan Li

PhD - McGill University

Principal supervisor :

Dylan Mann-Krzisnik

PhD - McGill University

Marshall Meng

Master's Research - McGill University

Principal supervisor :

Adrien Osakwe

PhD - McGill University


Vishvak Raghavan

PhD - McGill University

Co-supervisor :

Jack Song

Master's Research - McGill University

Co-supervisor :

Ruilin Wang

Master's Research - McGill University

Bo-Hong Wang

Master's Research - McGill University

Kunpeng Xu

Postdoctorate - McGill University

Co-supervisor :

Publications

GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling

Yimin Fan

Yu Li

Single-cell Assay for Transposase-Accessible Chromatin with sequencing (scATAC-seq) enables investigation of open chromatin landscapes at single-cell resolution, but its analysis remains challenging because of sparsity, noise, and dataset-specific peak vocabularies. Genome Foundation Models (GFMs), pre-trained on large DNA sequence corpora, offer a potential source of transferable sequence information for scATAC-seq modeling. We introduce the Genome Foundation Embedded Topic Model (GFETM), an interpretable framework that combines GFMs with the Embedded Topic Model (ETM) for sequence-informed scATAC-seq analysis. By integrating GFM-derived DNA sequence embeddings into a topic-model decoder, GFETM improves clustering quality on standard benchmarks and captures cell-state-specific transcription factor activity through motif scoring and attention-based interpretation.

2026-05-27

FM4LS @ International Conference on Machine Learning (poster)

openreview.net

WebArena-Pro: A Heterogeneous, Multimodal, Reproducible Benchmark for Web Agents

Imene Kerboua

Fatemeh Pesaran zadeh

Xing Han Lu

Weijian Qi

Alexander Miller

Junyi Song

Yunjia Tian

Dongjin Kang

Seyeon Choi

Marzia Nouri

Ewen Gueguen

Matteo Boglioni

Fengyuan Liu

Zeyi Liao

Mengqi Yuan

Alexandre Lacoste

Alexandre Drouin

Spandana Gella

Huan Sun

Gunhee Kim

Siva Reddy

Web agents powered by large language and vision-language models are increasingly applied to realistic browser work that spans heterogeneous applications, multimodal content, and stateful workflows. However, existing reproducible web-agent benchmarks cover only a small number of web applications drawn from a few software categories, and restrict modality to text and vision. Live benchmarks broaden site coverage but sacrifice reproducibility, since pages and data drift between runs. Moreover, existing benchmarks do not meaningfully evaluate whether agents can understand and use audio and video content embedded within web tasks. To address these gaps, we introduce WebArena-Pro, a benchmark comprising 300 tasks across 20 self-hosted web applications in six domain categories, spanning distinct interface conventions, workflows, and data models. Across the evaluated agents, the best performance is achieved by Gemini 3.1 Pro, which attains 37.0% success under a 50-step budget, while open-source models' performance does not exceed 27.7% success. Among reproducible, human-curated web agent benchmarks, WebArena-Pro provides the broadest application coverage and the most comprehensive multimodal support to date. The benchmark treats audio and video as core observations alongside text and vision, with dedicated actions for extracting information from each. WebArena-Pro runs each task in isolation and supports reproducible, parallel evaluation. Tasks are authored through a dedicated annotator interface, filtered by LLM-assisted triage, and finally validated by humans before release.

2026-05-22

AIWILD @ International Conference on Machine Learning (published)

openreview.net

Dissecting and steering cell dynamics using spatially-informed RNA velocity with veloAgent

Vishvak Raghavan

Brent Yoon

Gregory J Fonseca

RNA velocity enables inference of cell state transitions from single-cell transcriptomics by modeling transcriptional dynamics from spliced and unspliced mRNA. However, existing methods overlook spatial context and struggle to scale to large datasets, limiting insights into tissue organization and dynamic processes. We introduce veloAgent, a deep generative and agent-based framework that estimates gene- and cell-specific transcriptional kinetics while integrating spatial information through agent-based simulations of local microenvironments. By leveraging both molecular and spatial cues, veloAgent improves velocity accuracy and achieves sublinear memory scaling, enabling efficient analysis of large and multi-batch spatial datasets. A distinctive feature of veloAgent is its in silico perturbation module, which allows targeted manipulation of spatial velocity vectors to simulate regulatory interventions and predict their impact on cell fate dynamics. These capabilities position veloAgent as a scalable and versatile framework for dissecting spatially resolved cellular dynamics and guiding cell fate manipulation across diverse biological processes.

2026-05-05

Molecular Systems Biology (published)

PheCode-guided multi-modal topic modeling of electronic health records improves disease incidence prediction and GWAS discovery from UK Biobank

Ziqi Yang

Ziyang Song

Shadi Zabad

Marc-André Legault

Phenome-wide association studies rely on disease definitions derived from diagnostic codes, often failing to leverage the full richness of electronic health records (EHR). We present MixEHR-SAGE, a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications to enhance phenotyping from large-scale EHRs. By combining expert-informed priors with probabilistic inference, MixEHR-SAGE identifies over 1000 interpretable phenotype topics from UK Biobank data. Applied to 350 000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predict incident type 2 diabetes (T2D) and leukemia diagnoses. Subsequent genome-wide association studies using these continuous risk scores uncovered novel disease-associated loci, including PPP1R15A for T2D and JMJD6/SRSF2 for leukemia, that were missed by traditional binary case definitions. These results highlight the potential of probabilistic phenotyping from multi-modal EHRs to improve genetic discovery. The MixEHR-SAGE software is publicly available at: https://github.com/li-lab-mcgill/MixEHR-SAGE.

2025-12-31

Briefings in Bioinformatics (published)

MiRformer: a dual-transformer-encoder framework for predicting microRNA-mRNA interactions from paired sequences

Jiayao Gu

Can (Sam) Chen

MicroRNAs (miRNAs) are small non-coding RNAs that regulate genes by binding to target messenger RNAs (mRNAs), causing them to degrade or suppressing their translation. Accurate prediction of miRNA–mRNA interactions is crucial for RNA therapeutics. Existing methods rely on handcrafted features, struggle to scale to kilobase-long mRNA sequences, or lack interpretability. We introduce MiRformer, a transformer framework designed to predict not only the binary miRNA–mRNA interaction but also the start and end location of the miRNA binding site in the mRNA sequence. MiRformer employs a dual-transformer encoder architecture to learn interaction patterns directly from raw miRNA–mRNA sequence pairs via cross-attention between the miRNA encoder and the mRNA encoder. To scale to long mRNA sequences, we leverage a sliding-window attention mechanism. MiRformer achieves state-of-the-art performance across diverse miRNA–mRNA tasks, including binding prediction, target-site localization, and cleavage-site identification from Degradome sequencing data. The learned transformer attention is highly interpretable and reveals highly contrasting signals for the miRNA seed regions in 500-nt long mRNA sequences. We used MiRformer to simultaneously predict novel binding sites and cleavage sites in 13k miRNA–mRNA pairs and observed that the two types of sites tend to be close to each other, supporting a miRNA-mediated degradation mechanism. Our code is available at https://github.com/li-lab-mcgill/miRformer.

2025-11-23

bioRxiv (preprint)

TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare

Ziyang Song

Qincheng Lu

Hao Xu

He Zhu

David L. Buckeridge

Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success in Natural Language Processing and Computer Vision domains. However, the development of PTMs on healthcare time-series data is lagging behind. This underscores the limitations of the existing transformer-based architectures, particularly their scalability to handle large-scale time series and ability to capture long-term temporal dependencies. In this study, we present Timely Generative Pre-trained Transformer (TimelyGPT). TimelyGPT employs an extrapolatable position (xPos) embedding to encode trend and periodic patterns into time-series representations. It also integrates recurrent attention and temporal convolution modules to effectively capture global-local temporal dependencies. We evaluated TimelyGPT on two large-scale healthcare time series datasets corresponding to continuous biosignals and irregularly-sampled time series, respectively. Our experiments show that during pre-training, TimelyGPT excels in learning time-series representations from continuously monitored biosignals and irregularly-sampled time series data commonly observed in longitudinal electronic health records (EHRs). In forecasting continuous biosignals, TimelyGPT achieves accurate extrapolation up to 6,000 timesteps of body temperature during the sleep stage transition, given a short look-up window (i.e., prompt) containing only 2,000 timesteps. For irregularly-sampled time series, TimelyGPT with a proposed time-specific inference demonstrates high top recall scores in predicting future diagnoses using early diagnostic records, effectively handling irregular intervals between clinical records. Together, we envision TimelyGPT to be useful in a broad spectrum of health domains, including long-term patient health state forecasting and patient risk trajectory prediction.

2025-10-13

Health Information Science and Systems (published)

arxiv.org

Source-free cross-modality medical image synthesis with diffusion priors

Jia Chen

Xin Wang

Kai Yang

Xinrong Hu

2025-09-23

Journal of King Saud University Computer and Information Sciences (published)

A Unified Solution to Diverse Heterogeneities in One-Shot Federated Learning

Yiliao Song

Di Wu

Atul Sajjanhar

Yong Xiang

Wei Zhou

Xiaohui Tao

Yan Li

One-Shot Federated Learning (OSFL) restricts communication between the server and clients to a single round, significantly reducing communication costs and minimizing privacy leakage risks compared to traditional Federated Learning (FL), which requires multiple rounds of communication. However, existing OSFL frameworks remain vulnerable to distributional heterogeneity, as they primarily focus on model heterogeneity while neglecting data heterogeneity. To bridge this gap, we propose FedHydra, a unified, data-free, OSFL framework designed to effectively address both model and data heterogeneity. Unlike existing OSFL approaches, FedHydra introduces a novel two-stage learning mechanism. Specifically, it incorporates model stratification and heterogeneity-aware stratified aggregation to mitigate the challenges posed by both model and data heterogeneity. By this design, the data and model heterogeneity issues are simultaneously monitored from different aspects during learning. Consequently, FedHydra can effectively mitigate both issues by minimizing their inherent conflicts. We compared FedHydra with five SOTA baselines on four benchmark datasets. Experimental results show that our method outperforms the previous OSFL methods in both homogeneous and heterogeneous settings. The code is available at https://github.com/Jun-B0518/FedHydra.

2025-08-02

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (published)

arxiv.org

Harnessing agent-based frameworks in CellAgentChat to unravel cell–cell interactions from single-cell and spatial transcriptomics

Vishvak Raghavan

Yumin Zheng

Understanding cell–cell interactions (CCIs) is essential yet challenging owing to the inherent intricacy and diversity of cellular dynamics. Existing approaches often analyze global patterns of CCIs using statistical frameworks, missing the nuances of individual cell behavior owing to their focus on aggregate data. This makes them insensitive in complex environments where the detailed dynamics of cell interactions matter. We introduce CellAgentChat, an agent-based model (ABM) designed to decipher CCIs from single-cell RNA sequencing and spatial transcriptomics data. This approach models biological systems as collections of autonomous agents governed by biologically inspired principles and rules. Validated across eight diverse single-cell data sets, CellAgentChat demonstrates its effectiveness in detecting intricate signaling events across different cell populations. Moreover, CellAgentChat offers the ability to generate animated visualizations of single-cell interactions and provides flexibility in modifying agent behavior rules, facilitating thorough exploration of both close and distant cellular communications. Furthermore, CellAgentChat leverages ABM features to enable intuitive in silico perturbations via agent rule modifications, facilitating the development of novel intervention strategies. This ABM method unlocks an in-depth understanding of cellular signaling interactions across various biological contexts, thereby enhancing in silico studies for cellular communication–based therapies.

2025-06-30

Genome Research (published)

CellMemory: hierarchical interpretation of out-of-distribution cells using bottlenecked transformer

Qifei Wang

He Zhu

Yiwen Hu

Yanjie Chen

Yuwei Wang

Guochao Li

Yun Li

Jinfeng Chen

Xuegong Zhang

James Zou

Manolis Kellis

Dianbo Liu

Lan Jiang

Identifying the genetic and molecular drivers of phenotypic heterogeneity among individuals is vital for understanding human health and for diagnosing, monitoring, and treating diseases. To this end, international consortia such as the Human Cell Atlas and the Tabula Sapiens are creating comprehensive cellular references. Due to the massive volume of data generated, machine learning methods, especially transformer architectures, have been widely employed in related studies. However, applying machine learning to cellular data presents several challenges. One such challenge is making the methods interpretable with respect to both the input cellular information and its context. Another less explored challenge is the accurate representation of cells outside existing references, referred to as out-of-distribution (OOD) cells. The out-of-distribution could be attributed to various physiological conditions, such as comparing diseased cells, particularly tumor cells, with healthy reference data, or significant technical variations, such as using transfer learning from single-cell reference to spatial query data. Inspired by the global workspace theory in cognitive neuroscience, we introduce CellMemory, a bottlenecked Transformer with improved generalization capabilities designed for the hierarchical interpretation of OOD cells unseen during reference building. Even without pre-training, it exceeds the performance of large language models pre-trained with tens of millions of cells. In particular, when deciphering spatially resolved single-cell transcriptomics data, CellMemory demonstrates the ability to interpret data at the granule level accurately. Finally, we harness CellMemory's robust representational capabilities to elucidate malignant cells and their founder cells in different patients, providing reliable characterizations of the cellular changes caused by the disease.

2025-06-22

Genome Biology (published)

FedWeight: mitigating covariate shift of federated learning on electronic health records data through patients re-weighting

He Zhu

Na Li

Xiaoxiao Li

Dianbo Liu

David L. Buckeridge

Federated learning (FL) enables collaborative analysis of decentralized medical data while preserving patient privacy. However, the covariate shift from demographic and clinical differences can reduce model generalizability. We propose FedWeight, a novel FL framework that mitigates covariate shift by reweighting patient data from the source sites using density estimators, allowing the trained model to better align with the distribution of the target site. To support unsupervised applications, we introduce FedWeight ETM, a federated embedded topic model. We evaluated FedWeight in cross-site FL on the eICU dataset and cross-dataset FL between eICU and MIMIC III. FedWeight consistently outperforms standard FL baselines in predicting ICU mortality, ventilator use, sepsis diagnosis, and length of stay. SHAP-based interpretation and ETM-based topic modeling reveal improved identification of clinically relevant characteristics and disease topics associated with ICU readmission.

2025-05-16

NPJ Digital Medicine (published)

ECLARE: multi-teacher contrastive learning via ensemble distillation for diagonal integration of single-cell multi-omic data

Dylan Mann-Krzisnik

Anjali Chawla

Gustavo Turecki

Corina Nagy

Integrating multimodal single-cell data such as scRNA-seq with scATAC-seq is essential for decoding gene regulatory networks, but remains difficult due to feature harmonization and limited paired multiome data. We introduce ECLARE, a framework that uses multi-teacher ensemble knowledge distillation with contrastive learning and optimal-transport alignment to integrate unpaired single-cell multi-omic datasets. Across benchmarks, ECLARE achieves competitive performance for multimodal integration and biological structure preservation. We further demonstrate utility in a major depressive disorder case study using unpaired snRNA-seq and snATAC-seq, identifying transcription factor–target gene programs that are differentially regulated with sex- and cell-type specificity. Finally, ECLARE learns continuous representations that capture longitudinal structure, highlighting altered neurodevelopmental programs associated with depression in female subjects. Altogether, ECLARE expands the practical reach of multimodal single-cell analysis by enabling diagonal integration of unpaired data with strong biological preservation, facilitating integrative regulatory studies across diverse cohorts and conditions.

2025-04-06

bioRxiv (preprint)