Portrait of Yue Li

Yue Li

Associate Academic Member
Assistant Professor, McGill University, School of Computer Science
Research Topics
AI in Health
Bayesian Models
Computational Biology
Deep Learning
Genetics
Large Language Models (LLM)
Multimodal Learning
Single-Cell Genomics

Biography

I completed my PhD degree in computer science and computational biology at the University of Toronto in 2014. Prior to joining McGill University, I was a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT (2015–2018).

In general, my research program covers three main research areas that involve applied machine learning in computational genomics and health. More specifically, it focuses on developing interpretable probabilistic learning models and deep learning models to model genetic, epigenetic, electronic health record and single-cell genomic data.

By systematically integrating multimodal and longitudinal data, I aim to have impactful applications in computational medicine, including building intelligent clinical recommender systems, forecasting patient health trajectories, making personalized polygenic risk predictions, characterizing multi-trait functional genetic mutations, and dissecting cell-type-specific regulatory elements that underpin complex traits and diseases in humans.

Current Students

Postdoctorate - McGill University
PhD - McGill University
Master's Research - McGill University
Master's Research - McGill University
Master's Research - McGill University
PhD - McGill University
Master's Research - McGill University
Principal supervisor :
PhD - McGill University
Master's Research - McGill University
Co-supervisor :
PhD - McGill University
Master's Research - McGill University
Postdoctorate - McGill University
Co-supervisor :
PhD - McGill University

Publications

Cell ontology guided transcriptome foundation model
Transcriptome foundation models (TFMs) hold great promises of deciphering the transcriptomic language that dictate diverse cell functions by… (see more) self-supervised learning on large-scale single-cell gene expression data, and ultimately unraveling the complex mechanisms of human diseases. However, current TFMs treat cells as independent samples and ignore the taxonomic relationships between cell types, which are available in cell ontology graphs. We argue that effectively leveraging this ontology information during the TFM pre-training can improve learning biologically meaningful gene co-expression patterns while preserving TFM as a general purpose foundation model for downstream zero-shot and fine-tuning tasks. To this end, we present **s**ingle **c**ell, **Cell**-**o**ntology guided TFM (scCello). We introduce cell-type coherence loss and ontology alignment loss, which are minimized along with the masked gene expression prediction loss during the pre-training. The novel loss component guide scCello to learn the cell-type-specific representation and the structural relation between cell types from the cell ontology graph, respectively. We pre-trained scCello on 22 million cells from CellxGene database leveraging their cell-type labels mapped to the cell ontology graph from Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over the existing TFMs on biologically important tasks including identifying novel cell types of unseen cells, prediction of cell-type-specific marker genes, and cancer drug responses. Source code and model weights are available at https://github.com/DeepGraphLearning/scCello.
GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling
Yimin Fan
Shi Han
Yu Li
Supervised latent factor modeling isolates cell-type-specific transcriptomic modules that underlie Alzheimer’s disease progression
Yasser Iturria-Medina
Jo Anne Stratton
David A. Bennett
Protocol to perform integrative analysis of high-dimensional single-cell multimodal data using an interpretable deep learning technique
Manqi Zhou
Hao Zhang
Zilong Bai
Yi Wang
MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records
Yixuan Li
Ariane Marelli
Survival models can help medical practitioners to evaluate the prognostic importance of clinical variables to patient outcomes such as morta… (see more)lity or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold the promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are three-folds: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8211 subjects with 75,187 outpatient claim records of 1767 unique ICD codes; the MIMIC-III consisting of 1458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge. Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
Multi-ancestry polygenic risk scores using phylogenetic regularization
Bidirectional Generative Pre-training for Improving Healthcare Time-series Representation Learning
Ziyang Song
Qincheng Lu
Learning time-series representations for discriminative tasks, such as classification and regression, has been a long-standing challenge in … (see more)the healthcare domain. Current pre-training methods are limited in either unidirectional next-token prediction or randomly masked token prediction. We propose a novel architecture called Bidirectional Timely Generative Pre-trained Transformer (BiTimelyGPT), which pre-trains on biosignals and longitudinal clinical records by both next-token and previous-token prediction in alternating transformer layers. This pre-training task preserves original distribution and data shapes of the time-series. Additionally, the full-rank forward and backward attention matrices exhibit more expressive representation capabilities. Using biosignals and longitudinal clinical records, BiTimelyGPT demonstrates superior performance in predicting neurological functionality, disease diagnosis, and physiological signs. By visualizing the attention heatmap, we observe that the pre-trained BiTimelyGPT can identify discriminative segments from biosignal time-series sequences, even more so after fine-tuning on the task.
Machine Learning Informed Diagnosis for Congenital Heart Disease in Large Claims Data Source
Ariane Marelli
Chao Li
Aihua Liu
Hanh Nguyen
Harry Moroz
James M. Brophy
Liming Guo
Bidirectional Generative Pre-training for Improving Time Series Representation Learning
Ziyang Song
Qincheng Lu
Mike He Zhu
Bidirectional Generative Pre-training for Improving Healthcare Time-series Representation Learning
Ziyang Song
Qincheng Lu
Mike He Zhu
MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records
Yixuan Li
Ariane Marelli
Survival models can help medical practitioners to evaluate the prognostic importance of clinical variables to patient outcomes such as morta… (see more)lity or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold the promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are three-folds: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8211 subjects with 75,187 outpatient claim records of 1767 unique ICD codes; the MIMIC-III consisting of 1458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge. Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare
Ziyang Song
Qincheng Lu
Hao Xu
Mike He Zhu
Motivation: Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success in Natural Language Processing a… (see more)nd Computer Vision domains. However, the development of PTMs on healthcare time-series data is lagging behind. This underscores the limitations of the existing transformer-based architectures, particularly their scalability to handle large-scale time series and ability to capture long-term temporal dependencies. Methods: In this study, we present Timely Generative Pre-trained Transformer (TimelyGPT). TimelyGPT employs an extrapolatable position (xPos) embedding to encode trend and periodic patterns into time-series representations. It also integrates recurrent attention and temporal convolution modules to effectively capture global-local temporal dependencies. Materials: We evaluated TimelyGPT on two large-scale healthcare time series datasets corresponding to continuous biosignals and irregularly-sampled time series, respectively: (1) the Sleep EDF dataset consisting of over 1.2 billion timesteps; (2) the longitudinal healthcare administrative database PopHR, comprising 489,000 patients randomly sampled from the Montreal population. Results: In forecasting continuous biosignals, TimelyGPT achieves accurate extrapolation up to 6,000 timesteps of body temperature during the sleep stage transition, given a short look-up window (i.e., prompt) containing only 2,000 timesteps. For irregularly-sampled time series, TimelyGPT with a proposed time-specific inference demonstrates high top recall scores in predicting future diagnoses using early diagnostic records, effectively handling irregular intervals between clinical records. Together, we envision TimelyGPT to be useful in various health domains, including long-term patient health state forecasting and patient risk trajectory prediction. Availability: The open-sourced code is available at Github.