Portrait de Yue Li

Yue Li

Membre académique associé
Professeur adjoint, McGill University, École d'informatique
Sujets de recherche
Apprentissage multimodal
Apprentissage profond
Biologie computationnelle
Génétique
Génomique unicellulaire
Grands modèles de langage (LLM)
IA en santé
Modèles bayésiens

Biographie

J'ai obtenu un doctorat en informatique et biologie computationnelle de l'Université de Toronto en 2014. Avant de me joindre à l’Université McGill, j'ai été associé postdoctoral au Computer Science and Artificial Intelligence Laboratory (CSAIL) du Massachusetts Institute of Technology (MIT) (2015-2018).

Mes recherches portent sur le développement de modèles d'apprentissage probabilistes interprétables et de modèles d'apprentissage profond pour modéliser les données génétiques et épigénétiques, les dossiers de santé électroniques et les données génomiques unicellulaires.

En intégrant systématiquement des données multimodales et longitudinales, je cherche à obtenir des applications qui auront des effets tangibles en médecine computationnelle, y compris la construction de systèmes de recommandation clinique intelligents, la prévision des trajectoires de santé des patients, les prédictions personnalisées de risques polygéniques, la caractérisation des mutations génétiques fonctionnelles multitraits, et la dissection des éléments réglementaires spécifiques au type de cellule qui sont à la base des traits complexes et des maladies chez l'homme. Mon programme de recherche couvre trois domaines principaux impliquant l'apprentissage automatique appliqué à la génomique computationnelle et à la santé.

Étudiants actuels

Postdoctorat - McGill
Postdoctorat - McGill
Doctorat - McGill
Maîtrise recherche - McGill
Maîtrise recherche - McGill
Maîtrise recherche - McGill
Doctorat - McGill
Superviseur⋅e principal⋅e :
Doctorat - McGill
Maîtrise recherche - McGill
Superviseur⋅e principal⋅e :
Doctorat - McGill
Doctorat - McGill
Co-superviseur⋅e :
Maîtrise recherche - McGill
Co-superviseur⋅e :
Maîtrise recherche - McGill
Maîtrise recherche - McGill
Postdoctorat - McGill
Co-superviseur⋅e :

Publications

TrajGPT: Irregular Time-Series Representation Learning of Health Trajectory
Ziyang Song
Qincheng Lu
In the healthcare domain, time-series data are often irregularly sampled with varying intervals through outpatient visits, posing challenges… (voir plus) for existing models designed for equally spaced sequential data. To address this, we propose Trajectory Generative Pre-trained Transformer (TrajGPT) for representation learning on irregularly-sampled healthcare time series. TrajGPT introduces a novel Selective Recurrent Attention (SRA) module that leverages a data-dependent decay to adaptively filter irrelevant past information. As a discretized ordinary differential equation (ODE) framework, TrajGPT captures underlying continuous dynamics and enables a time-specific inference for forecasting arbitrary target timesteps without auto-regressive prediction. Experimental results based on the longitudinal EHR data PopHR from Montreal health system and eICU from PhysioNet showcase TrajGPT's superior zero-shot performance in disease forecasting, drug usage prediction, and sepsis detection. The inferred trajectories of diabetic and cardiac patients reveal meaningful comorbidity conditions, underscoring TrajGPT as a useful tool for forecasting patient health evolution.
Cell ontology guided transcriptome foundation model
Transcriptome foundation models (TFMs) hold great promises of deciphering the transcriptomic language that dictate diverse cell functions by… (voir plus) self-supervised learning on large-scale single-cell gene expression data, and ultimately unraveling the complex mechanisms of human diseases. However, current TFMs treat cells as independent samples and ignore the taxonomic relationships between cell types, which are available in cell ontology graphs. We argue that effectively leveraging this ontology information during the TFM pre-training can improve learning biologically meaningful gene co-expression patterns while preserving TFM as a general purpose foundation model for downstream zero-shot and fine-tuning tasks. To this end, we present **s**ingle **c**ell, **Cell**-**o**ntology guided TFM (scCello). We introduce cell-type coherence loss and ontology alignment loss, which are minimized along with the masked gene expression prediction loss during the pre-training. The novel loss component guide scCello to learn the cell-type-specific representation and the structural relation between cell types from the cell ontology graph, respectively. We pre-trained scCello on 22 million cells from CellxGene database leveraging their cell-type labels mapped to the cell ontology graph from Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over the existing TFMs on biologically important tasks including identifying novel cell types of unseen cells, prediction of cell-type-specific marker genes, and cancer drug responses. Source code and model weights are available at https://github.com/DeepGraphLearning/scCello.
MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling
Ruohan Wang
Ziyang Song
Automatic subphenotyping from electronic health records (EHRs)provides numerous opportunities to understand diseases with unique subgroups a… (voir plus)nd enhance personalized medicine for patients. However, existing machine learning algorithms either focus on specific diseases for better interpretability or produce coarse-grained phenotype topics without considering nuanced disease patterns. In this study, we propose a guided topic model, MixEHR-Nest, to infer sub-phenotype topics from thousands of disease using multi-modal EHR data. Specifically, MixEHR-Nest detects multiple subtopics from each phenotype topic, whose prior is guided by the expert-curated phenotype concepts such as Phenotype Codes (PheCodes) or Clinical Classification Software (CCS) codes. We evaluated MixEHR-Nest on two EHR datasets: (1) the MIMIC-III dataset consisting of over 38 thousand patients from intensive care unit (ICU) from Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; (2) the healthcare administrative database PopHR, comprising 1.3 million patients from Montreal, Canada. Experimental results demonstrate that MixEHR-Nest can identify subphenotypes with distinct patterns within each phenotype, which are predictive for disease progression and severity. Consequently, MixEHR-Nest distinguishes between type 1 and type 2 diabetes by inferring subphenotypes using CCS codes, which do not differentiate these two subtype concepts. Additionally, MixEHR-Nest not only improved the prediction accuracy of short-term mortality of ICU patients and initial insulin treatment in diabetic patients but also revealed the contributions of subphenotypes. For longitudinal analysis, MixEHR-Nest identified subphenotypes of distinct age prevalence under the same phenotypes, such as asthma, leukemia, epilepsy, and depression. The MixEHR-Nest software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-Nest.
scMoE: single-cell mixture of experts for learning hierarchical, cell-type-specific, and interpretable representations from heterogeneous scRNA-seq data
Michael Huang
Advancements in single-cell transcriptomics methods have resulted in a wealth of single-cell RNA sequencing (scRNA-seq) data. Methods to lea… (voir plus)rn cell representation from atlas-level scRNA-seq data across diverse tissues can shed light into cell functions implicated in diseases such as cancer. However, integrating large-scale and heterogeneous scRNA-seq data is challenging due to the disparity of cell-types and batch effects. We present single-cell Mixture of Expert (scMoE), a hierarchical mixture of experts single-cell topic model. Our key contributions are the cell-type specific experts, which explicitly aligns topics with cell-types, and the integration of hierarchical cell-type lineages and domain knowledge. scMoE is both transferable and highly interpretable. We benchmarked our scMoE’s performance on 9 single-cell RNA-seq datasets for clustering and 3 simulated spatial datasets for spatial deconvolution. We additionally show that our model, using single-cell references, yields meaningful biological results by deconvolving 3 cancer bulk RNA-seq datasets and 2 spatial transcriptomics datasets. scMoE is able to identify cell-types of survival importance, find cancer subtype specific deconvolutional patterns, and capture meaningful spatially distinct cell-type distributions.
Protocol to perform integrative analysis of high-dimensional single-cell multimodal data using an interpretable deep learning technique
Manqi Zhou
Hao Zhang
Zilong Bai
Fei Wang
Supervised latent factor modeling isolates cell-type-specific transcriptomic modules that underlie Alzheimer’s disease progression
Yasser Iturria-Medina
Jo Anne Stratton
David A. Bennett
Late onset Alzheimer’s disease (AD) is a progressive neurodegenerative disease, with brain changes beginning years before symptoms surface… (voir plus). AD is characterized by neuronal loss, the classic feature of the disease that underlies brain atrophy. However, GWAS reports and recent single-nucleus RNA sequencing (snRNA-seq) efforts have highlighted that glial cells, particularly microglia, claim a central role in AD pathophysiology. Here, we tailor pattern-learning algorithms to explore distinct gene programs by integrating the entire transcriptome, yielding distributed AD-predictive modules within the brain’s major cell-types. We show that these learned modules are biologically meaningful through the identification of new and relevant enriched signaling cascades. The predictive nature of our modules, especially in microglia, allows us to infer each subject’s progression along a disease pseudo-trajectory, confirmed by post-mortem pathological brain tissue markers. Additionally, we quantify the interplay between pairs of cell-type modules in the AD brain, and localized known AD risk genes to enriched module gene programs. Our collective findings advocate for a transition from cell-type-specificity to gene modules specificity to unlock the potential of unique gene programs, recasting the roles of recently reported genome-wide AD risk loci. Designing a supervised latent factor framework for snRNA-seq human brain, the authors find distinct Alzheimer’s-predictive gene modules across celltypes, suggesting subcelltype disease progression trajectories.
MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records
Ariane Marelli
Archer Y. Yang
Survival models can help medical practitioners to evaluate the prognostic importance of clinical variables to patient outcomes such as morta… (voir plus)lity or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold the promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are three-folds: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient claim records of 1,767 unique ICD codes; the MIMIC-III consisting of 1,458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge.
Multi-ancestry polygenic risk scores using phylogenetic regularization
Accurately predicting phenotype using genotype across diverse ancestry groups remains a significant challenge in human genetics. Many state-… (voir plus)of-the-art polygenic risk score models are known to have difficulty generalizing to genetic ancestries that are not well represented in their training set. To address this issue, we present a novel machine learning method for fitting genetic effect sizes across multiple ancestry groups simultaneously, while leveraging prior knowledge of the evolutionary relationships among them. We introduce DendroPRS, a machine learning model where SNP effect sizes are allowed to evolve along the branches of the phylogenetic tree capturing the relationship among populations. DendroPRS outperforms existing approaches at two important genotype-to-phenotype prediction tasks: expression QTL analysis and polygenic risk scores. We also demonstrate that our method can be useful for multi-ancestry modelling, both by fitting population-specific effect sizes and by more accurately accounting for covariate effects across groups. We additionally find a subset of genes where there is strong evidence that an ancestry-specific approach improves eQTL modelling.
Machine Learning Informed Diagnosis for Congenital Heart Disease in Large Claims Data Source
Ariane J. Marelli
Chao Li
Aihua Liu
Hanh Nguyen
Harry Moroz
James M. Brophy
Liming Guo
David L. Buckeridge
Archer Y. Yang
GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling
Yimin Fan
Shi Han
Single-cell Assay for Transposase-Accessible Chromatin with sequencing (scATAC-seq) has emerged as a powerful technique for investigating op… (voir plus)en chromatin landscapes at single-cell resolution. However, analyzing scATAC-seq data remain challenging due to its sparsity and noise. Genome Foundation Models (GFMs), pre-trained on massive DNA sequences, have proven effective at genome analysis. Given that open chromatin regions (OCRs) harbour salient sequence features, we hypothesize that leveraging GFMs’ sequence embeddings can improve the accuracy and generalizability of scATAC-seq modeling. Here, we introduce the Genome Foundation Embedded Topic Model (GFETM), an interpretable deep learning framework that combines GFMs with the Embedded Topic Model (ETM) for scATAC-seq data analysis. By integrating the DNA sequence embeddings extracted by a GFM from OCRs, GFETM demonstrates superior accuracy and generalizability and captures cell-state specific TF activity both with zero-shot inference and attention mechanism analysis. Finally, the topic mixtures inferred by GFETM reveal biologically meaningful epigenomic signatures of kidney diabetes.
Single-cell multi-omics topic embedding reveals cell-type-specific and COVID-19 severity-related immune signatures
Manqi Zhou
Hao Zhang
Zilong Bai
Fei Wang
Biomedical discovery through the integrative biomedical knowledge hub (iBKH)
Chang Su
Yu Hou
Manqi Zhou
Suraj Rajendran
Jacqueline R.M. A. Maasch
Zehra Abedi
Haotan Zhang
Zilong Bai
Anthony Cuturrufo
Winston Guo
Fayzan F. Chaudhry
Gregory Ghahramani
Feixiong Cheng
Rui Zhang
Steven T. DeKosky
Jiang Bian
Fei Wang
Summary The massive and continuously increasing volume of biomedical knowledge derived from biological experiments or gained from healthcare… (voir plus) practices has become an invaluable treasure for biomedicine. The emerging biomedical knowledge graphs (BKGs) provide an efficient and effective way to manage the abundant knowledge in biomedical and life science. In the present study, we harmonized and integrated data from diverse biomedical resources to curate a comprehensive BKG, named the integrative Biomedical Knowledge Hub (iBKH). To facilitate the usage of iBKH in biomedical research, we developed a web-based, easy-to-use, publicly available graphical portal that allows fast, interactive, and visualized knowledge retrieval in iBKH. Furthermore, an efficient and scalable graph learning pipeline was developed for novel knowledge discovery in iBKH. As a proof of concept, we performed our iBKH-based method for computational in silico drug repurposing for Alzheimer’s disease. The iBKH is publicly available at: http://ibkh.ai/ .