Portrait of Yue Li

Yue Li

Associate Academic Member
Assistant Professor, School of Computer Science, McGill University
Research Topics
Multimodal Learning
Deep Learning
Computational Biology
Genetics
Single-Cell Genomics
Large Language Models (LLM)
AI in Health
Bayesian Models

Biography

I received a PhD in computer science and computational biology from the University of Toronto in 2014. Before joining McGill University, I was a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) of the Massachusetts Institute of Technology (MIT) (2015-2018).

My research focuses on developing interpretable probabilistic learning models and deep learning models for genetic and epigenetic data, electronic health records, and single-cell genomic data.

By systematically integrating multimodal and longitudinal data, I aim to build applications with tangible impact in computational medicine, including intelligent clinical recommender systems, forecasting of patients' health trajectories, personalized polygenic risk prediction, characterization of multi-trait functional genetic mutations, and dissection of the cell-type-specific regulatory elements underlying complex human traits and diseases. My research program spans three main areas involving machine learning applied to computational genomics and health.

Current Students

Postdoctoral Fellow - McGill
PhD - McGill
Master's Research - McGill
Master's Research - McGill
Master's Research - McGill
PhD - McGill
Principal supervisor:
PhD - McGill
Master's Research - McGill
Principal supervisor:
PhD - McGill
PhD - McGill
Co-supervisor:
Master's Research - McGill
Co-supervisor:
Master's Research - McGill
Postdoctoral Fellow - McGill
Co-supervisor:

Publications

PheCode-guided multi-modal topic modeling of electronic health records improves disease incidence prediction and GWAS discovery from UK Biobank
Ziqi Yang
Ziyang Song
Phenome-wide association studies rely on disease definitions derived from diagnostic codes, often failing to leverage the full richness of electronic health records (EHR). We present MixEHR-SAGE, a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications to enhance phenotyping from large-scale EHRs. By combining expert-informed priors with probabilistic inference, MixEHR-SAGE identifies over 1000 interpretable phenotype topics from UK Biobank data. Applied to 350,000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predict incident type 2 diabetes (T2D) and leukemia diagnoses. Subsequent genome-wide association studies using these continuous risk scores uncovered novel disease-associated loci, including PPP1R15A for T2D and JMJD6/SRSF2 for leukemia, that were missed by traditional binary case definitions. These results highlight the potential of probabilistic phenotyping from multi-modal EHRs to improve genetic discovery. The MixEHR-SAGE software is publicly available at: https://github.com/li-lab-mcgill/MixEHR-SAGE.
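The core idea of inferring a patient's mixture over phenotype topics from code counts can be illustrated with a toy mean-field update. This is a minimal sketch under simplified assumptions (single modality, fixed topic-code distributions `phi`, symmetric Dirichlet pseudo-count `alpha`); the function name and shapes are illustrative, not the actual MixEHR-SAGE inference.

```python
import numpy as np

def infer_topic_mixture(counts, phi, alpha=0.1, iters=50):
    """Toy sketch: infer a patient's phenotype-topic mixture theta from
    code counts, given topic-code distributions phi of shape (K, V) that
    could be seeded with PheCode-informed priors.
    counts: (V,) code counts for one patient."""
    K = phi.shape[0]
    theta = np.full(K, 1.0 / K)
    for _ in range(iters):
        # responsibility of each topic for each code, shape (K, V)
        r = theta[:, None] * phi
        r /= r.sum(axis=0, keepdims=True) + 1e-12
        # Dirichlet pseudo-count plus expected topic counts, renormalized
        theta = alpha + (r * counts[None, :]).sum(axis=1)
        theta /= theta.sum()
    return theta
```

The resulting `theta` is a continuous phenotype score per topic, which is the kind of quantity the abstract describes feeding into incidence prediction and GWAS instead of binary case labels.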
SpaTM: Topic Models for Inferring Spatially Informed Transcriptional Programs
Wenqi Dong
Qihuang Zhang
Robert Sladek
Spatial transcriptomics has revolutionized our ability to characterize tissues and diseases by contextualizing gene expression with spatial organization. Available methods require researchers to either train a model using histology-based annotations or use annotation-free clustering approaches to uncover spatial domains. However, few methods provide researchers with a way to jointly analyze spatial data from both annotation-free and annotation-guided perspectives using consistent inductive biases and levels of interpretability. A single framework with consistent inductive biases ensures coherence and transferability across tasks, reducing the risks of conflicting assumptions. To this end, we propose the Spatial Topic Model (SpaTM), a topic-modeling framework capable of annotation-guided and annotation-free analysis of spatial transcriptomics data. SpaTM can be used to learn gene programs that represent histology-based annotations while providing researchers with the ability to infer spatial domains with an annotation-free approach if manual annotations are limited or noisy. We demonstrate SpaTM’s interpretability with its use of topic mixtures to represent cell states and transcriptional programs and how its intuitive framework facilitates the integration of annotation-guided and annotation-free analyses of spatial data with downstream analyses such as cell type deconvolution. Finally, we demonstrate how both approaches can be used to extend the analysis of large-scale snRNA-seq atlases with the inference of cell proximity and spatial annotations in human brains with Major Depressive Disorder.
MiRformer: a dual-transformer-encoder framework for predicting microRNA-mRNA interactions from paired sequences
MicroRNAs (miRNAs) are small non-coding RNAs that regulate genes by binding to target messenger RNAs (mRNAs), causing them to degrade or suppressing their translation. Accurate prediction of miRNA–mRNA interactions is crucial for RNA therapeutics. Existing methods rely on handcrafted features, struggle to scale to kilobase-long mRNA sequences, or lack interpretability. We introduce MiRformer, a transformer framework designed to predict not only the binary miRNA–mRNA interaction but also the start and end location of the miRNA binding site in the mRNA sequence. MiRformer employs a dual-transformer encoder architecture to learn interaction patterns directly from raw miRNA-mRNA sequence pairs via the cross-attention between the miRNA-encoder and mRNA-encoder. To scale to long mRNA sequences, we leverage a sliding-window attention mechanism. MiRformer achieves state-of-the-art performance across diverse miRNA–mRNA tasks, including binding prediction, target-site localization, and cleavage-site identification from Degradome sequencing data. The learned transformer attention maps are highly interpretable and reveal highly contrasting signals for the miRNA seed regions in 500-nt long mRNA sequences. We used MiRformer to simultaneously predict novel binding sites and cleavage sites in 13k miRNA-mRNA pairs and observed that the two types of sites tend to be close to each other, supporting the miRNA-mediated degradation mechanism. Our code is available at https://github.com/li-lab-mcgill/miRformer.
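The dual-encoder cross-attention described above can be sketched with a minimal single-head example: miRNA-encoder states act as queries over mRNA-encoder states. This is an illustrative simplification (no learned projection matrices, no sliding windows, no multiple heads); the function name and shapes are assumptions, not MiRformer's actual implementation.

```python
import numpy as np

def cross_attention(q_mirna, kv_mrna):
    """Single-head cross-attention sketch: each miRNA position attends
    over all mRNA positions to produce a contextualized miRNA state.
    q_mirna: (Tq, d) miRNA-encoder outputs; kv_mrna: (Tk, d) mRNA-encoder outputs."""
    d = q_mirna.shape[1]
    scores = q_mirna @ kv_mrna.T / np.sqrt(d)      # (Tq, Tk) scaled similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over mRNA positions
    return attn @ kv_mrna, attn                    # contextualized states, weights
```

The `attn` matrix is the kind of interpretable attention map the abstract refers to: peaks along the mRNA axis would indicate candidate binding regions such as seed-match sites.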
TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare
Ziyang Song
Qincheng Lu
Hao Xu
Ziqi Yang
Mike He Zhu
Motivation: Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success in Natural Language Processing and Computer Vision domains. However, the development of PTMs on healthcare time-series data is lagging behind. This underscores the limitations of the existing transformer-based architectures, particularly their scalability to handle large-scale time series and ability to capture long-term temporal dependencies. Methods: In this study, we present Timely Generative Pre-trained Transformer (TimelyGPT). TimelyGPT employs an extrapolatable position (xPos) embedding to encode trend and periodic patterns into time-series representations. It also integrates recurrent attention and temporal convolution modules to effectively capture global-local temporal dependencies. Materials: We evaluated TimelyGPT on two large-scale healthcare time series datasets corresponding to continuous biosignals and irregularly-sampled time series, respectively: (1) the Sleep EDF dataset consisting of over 1.2 billion timesteps; (2) the longitudinal healthcare administrative database PopHR, comprising 489,000 patients randomly sampled from the Montreal population. Results: In forecasting continuous biosignals, TimelyGPT achieves accurate extrapolation up to 6,000 timesteps of body temperature during the sleep stage transition, given a short look-up window (i.e., prompt) containing only 2,000 timesteps. For irregularly-sampled time series, TimelyGPT with a proposed time-specific inference demonstrates high top recall scores in predicting future diagnoses using early diagnostic records, effectively handling irregular intervals between clinical records. Together, we envision TimelyGPT to be useful in various health domains, including long-term patient health state forecasting and patient risk trajectory prediction. Availability: The open-sourced code is available on GitHub.
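The extrapolatable position (xPos) embedding mentioned above can be sketched as a rotary-style rotation of feature pairs combined with an exponential scale in position, so attention dot-products decay smoothly with relative distance. This is a hypothetical simplification (a single scalar `gamma` rather than the published per-dimension scales); the function name and shapes are illustrative.

```python
import numpy as np

def xpos_embed(x, positions, gamma=0.95):
    """Sketch of an xPos-style position embedding.
    x: (T, d) features with even d; positions: (T,) timestep indices."""
    T, d = x.shape
    half = d // 2
    # standard rotary frequencies per feature pair
    freqs = 1.0 / (10000 ** (np.arange(half) / half))   # (half,)
    angles = positions[:, None] * freqs[None, :]        # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    # xPos-style exponential scaling with position
    scale = gamma ** positions[:, None]                 # (T, 1)
    return rotated * scale
```

Because the rotation encodes relative phase and the scale decays with distance, representations computed this way remain well-behaved when extrapolating past the training-time sequence length, which is the property the abstract leverages for long-horizon forecasting.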
TrajGPT: Irregular Time-Series Representation Learning of Health Trajectory.
Ziyang Song
Qincheng Lu
Mike He Zhu
In the healthcare domain, time-series data are often irregularly sampled with varying intervals through outpatient visits, posing challenges for existing models designed for equally spaced sequential data. To address this, we propose Trajectory Generative Pre-trained Transformer (TrajGPT) for representation learning on irregularly-sampled healthcare time series. TrajGPT introduces a novel Selective Recurrent Attention (SRA) module that leverages a data-dependent decay to adaptively filter irrelevant past information. As a discretized ordinary differential equation (ODE) framework, TrajGPT captures underlying continuous dynamics and enables a time-specific inference for forecasting arbitrary target timesteps without auto-regressive prediction. Experimental results based on the longitudinal EHR data PopHR from the Montreal health system and eICU from PhysioNet showcase TrajGPT's superior zero-shot performance in disease forecasting, drug usage prediction, and sepsis detection. The inferred trajectories of diabetic and cardiac patients reveal meaningful comorbidity conditions, underscoring TrajGPT as a useful tool for forecasting patient health evolution.
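The data-dependent decay over irregular time gaps can be illustrated with a toy discretized-ODE recurrence: the hidden state fades by `exp(-a_t * Δt)` before absorbing each new observation. This is a minimal sketch in the spirit of the SRA module (the function name, gating via softplus, and shapes are all illustrative assumptions, not TrajGPT's actual architecture).

```python
import numpy as np

def sra_sketch(values, deltas, decay_logits):
    """Toy recurrence with data-dependent decay over irregular intervals.
    values: (T, d) per-visit inputs; deltas: (T,) elapsed time since the
    previous record; decay_logits: (T,) per-step gates controlling how
    fast old state fades. Update: h_t = exp(-a_t * dt_t) * h_{t-1} + v_t."""
    T, d = values.shape
    rates = np.log1p(np.exp(decay_logits))   # softplus -> positive decay rates
    h = np.zeros(d)
    states = []
    for t in range(T):
        h = np.exp(-rates[t] * deltas[t]) * h + values[t]
        states.append(h.copy())
    return np.stack(states)
```

A long gap with a large decay rate drives the retained state toward zero, so stale clinical history contributes little to the next representation, which is the "adaptively filter irrelevant past information" behavior the abstract describes.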
Single-nucleus chromatin accessibility profiling identifies cell types and functional variants contributing to major depression
Anjali Chawla
Laura M. Fiori
Wenmin Zang
Malosree Maitra
Jennie Yang
Dariusz Żurawek
Gabriella Frosi
Reza Rahimian
Haruka Mitsuhashi
Maria Antonietta Davoli
Ryan Denniston
Gary Gang Chen
Volodymyr Yerko
Deborah Mash
Kiran Girdhar
Schahram Akbarian
Naguib Mechawar
Matthew Suderman
Corina Nagy
Gustavo Turecki
Toward whole-genome inference of polygenic scores with fast and memory-efficient algorithms.
Chirayu Anant Haryan
Simon Gravel
Sanchit Misra
Harnessing agent-based frameworks in CellAgentChat to unravel cell-cell interactions from single-cell and spatial transcriptomics