Yue Li

Associate Academic Member
Assistant Professor, McGill University, School of Computer Science
Research Topics
Computational Biology

Biography

I completed my PhD degree in computer science and computational biology at the University of Toronto in 2014. Prior to joining McGill University, I was a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT (2015–2018).

In general, my research program covers three main research areas that involve applied machine learning in computational genomics and health. More specifically, it focuses on developing interpretable probabilistic learning models and deep learning models to model genetic, epigenetic, electronic health record and single-cell genomic data.

By systematically integrating multimodal and longitudinal data, I aim to have impactful applications in computational medicine, including building intelligent clinical recommender systems, forecasting patient health trajectories, making personalized polygenic risk predictions, characterizing multi-trait functional genetic mutations, and dissecting cell-type-specific regulatory elements that underpin complex traits and diseases in humans.

Current Students

PhD - McGill University
Master's Research - McGill University
Master's Research - McGill University
PhD - McGill University
Principal supervisor :
PhD - McGill University
Master's Research - McGill University
Principal supervisor :
PhD - McGill University
Master's Research - McGill University
Co-supervisor :
PhD - McGill University
Collaborating Alumni - McGill University
Master's Research - McGill University
PhD - McGill University
Master's Research - McGill University
PhD - McGill University

Publications

MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling
Ruohan Wang
Zilong Wang
Ziyang Song
Automatic subphenotyping from electronic health records (EHRs) provides numerous opportunities to understand diseases with unique subgroups and enhance personalized medicine for patients. However, existing machine learning algorithms either focus on specific diseases for better interpretability or produce coarse-grained phenotype topics without considering nuanced disease patterns. In this study, we propose a guided topic model, MixEHR-Nest, to infer sub-phenotype topics from thousands of diseases using multi-modal EHR data. Specifically, MixEHR-Nest detects multiple subtopics from each phenotype topic, whose prior is guided by expert-curated phenotype concepts such as Phenotype Codes (PheCodes) or Clinical Classification Software (CCS) codes. We evaluated MixEHR-Nest on two EHR datasets: (1) the MIMIC-III dataset, consisting of over 38 thousand intensive care unit (ICU) patients from Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; and (2) the healthcare administrative database PopHR, comprising 1.3 million patients from Montreal, Canada. Experimental results demonstrate that MixEHR-Nest can identify subphenotypes with distinct patterns within each phenotype, which are predictive of disease progression and severity. Consequently, MixEHR-Nest distinguishes between type 1 and type 2 diabetes by inferring subphenotypes using CCS codes, which do not differentiate these two subtype concepts. Additionally, MixEHR-Nest not only improved the prediction accuracy of short-term mortality of ICU patients and initial insulin treatment in diabetic patients but also revealed the contributions of subphenotypes. For longitudinal analysis, MixEHR-Nest identified subphenotypes of distinct age prevalence under the same phenotypes, such as asthma, leukemia, epilepsy, and depression. The MixEHR-Nest software is available on GitHub: https://github.com/li-lab-mcgill/MixEHR-Nest.
Cell ontology guided transcriptome foundation model
Xinyu Yuan
Zhihao Zhan
Zuobai Zhang
Manqi Zhou
Jianan Zhao
Boyu Han
Transcriptome foundation models (TFMs) hold great promise for deciphering the transcriptomic language that dictates diverse cell functions by self-supervised learning on large-scale single-cell gene expression data, and ultimately unraveling the complex mechanisms of human diseases. However, current TFMs treat cells as independent samples and ignore the taxonomic relationships between cell types, which are available in cell ontology graphs. We argue that effectively leveraging this ontology information during TFM pre-training can improve the learning of biologically meaningful gene co-expression patterns while preserving the TFM as a general-purpose foundation model for downstream zero-shot and fine-tuning tasks. To this end, we present **s**ingle **c**ell, **Cell-o**ntology guided TFM (scCello). We introduce a cell-type coherence loss and an ontology alignment loss, which are minimized along with the masked gene expression prediction loss during pre-training. The novel loss components guide scCello to learn the cell-type-specific representation and the structural relation between cell types from the cell ontology graph, respectively. We pre-trained scCello on 22 million cells from the CellxGene database, leveraging their cell-type labels mapped to the cell ontology graph from the Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over existing TFMs on biologically important tasks, including identifying novel cell types of unseen cells, predicting cell-type-specific marker genes, and predicting cancer drug responses.
TrajGPT: Healthcare Time-Series Representation Learning for Trajectory Prediction
Ziyang Song
Qincheng Lu
Mike He Zhu
In many domains, such as healthcare, time-series data is irregularly sampled with varying intervals between observations. This creates challenges for classical time-series models that require equally spaced data. To address this, we propose a novel time-series Transformer called **Trajectory Generative Pre-trained Transformer (TrajGPT)**. It introduces a data-dependent decay mechanism that adaptively forgets irrelevant information based on clinical context. By interpreting TrajGPT as ordinary differential equations (ODEs), our approach captures continuous dynamics from sparse and irregular time-series data. Experimental results show that TrajGPT, with its time-specific inference approach, accurately predicts trajectories without requiring task-specific fine-tuning.
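As a rough illustration of what a data-dependent, time-aware decay can look like on irregularly sampled data, the sketch below keeps a scalar running state that is discounted by `gamma ** dt`, where `dt` is the gap since the previous observation and `gamma` is computed from the current input. This is a toy recurrence written for this page under assumed names (`decayed_state`, `w`, `b`); it is not the TrajGPT architecture or its published equations.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def decayed_state(values, times, w=1.0, b=0.0):
    """Summarize an irregularly sampled series with a time-aware decay.

    Hypothetical toy recurrence (an assumption, not TrajGPT itself):
        gamma_i = sigmoid(w * x_i + b)            # data-dependent retention rate
        s_i     = gamma_i ** (t_i - t_{i-1}) * s_{i-1} + x_i
    Longer gaps between observations shrink the carried-over state more.
    """
    state = 0.0
    prev_time = times[0]
    states = []
    for x, t in zip(values, times):
        gamma = sigmoid(w * x + b)      # retention per unit of elapsed time
        dt = t - prev_time              # irregular interval since last visit
        state = (gamma ** dt) * state + x
        states.append(state)
        prev_time = t
    return np.array(states)

# Same measurements, different spacing between the two observations.
short_gap = decayed_state([1.0, 1.0], [0.0, 1.0])
long_gap = decayed_state([1.0, 1.0], [0.0, 10.0])
print(short_gap[-1], long_gap[-1])
```

With identical values, the series with the longer gap ends with a smaller state, because more of the earlier observation has been forgotten by the time the second one arrives.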
TrajGPT: Irregular Time-Series Representation Learning for Health Trajectory Analysis
Ziyang Song
Qincheng Lu
Mike He Zhu
MiRGraph: A hybrid deep learning approach to identify microRNA-target interactions by integrating heterogeneous regulatory network and genomic sequences
Pei Liu
Ying Liu
Jiawei Luo
MicroRNAs (miRNAs) mediate gene expression regulation by targeting specific messenger RNAs (mRNAs) in the cytoplasm. They can function as both tumor suppressors and oncogenes depending on the specific miRNA and its target genes. Detecting miRNA-target interactions (MTIs) is critical for unraveling the complex mechanisms of gene regulation and holds promise for RNA therapy for cancer. There is currently a lack of MTI prediction methods that simultaneously perform feature learning from a heterogeneous gene regulatory network (GRN) and genomic sequences. To improve the prediction performance of MTIs, we present a novel transformer-based multiview feature learning method, MiRGraph, which consists of two main modules for learning the sequence-based and GRN-based feature embeddings. For the former, we utilize the mature miRNA sequences and the complete 3’UTR sequences of the target mRNAs to encode sequence features using a hybrid transformer and convolutional neural network (CNN) (TransCNN) architecture. For the latter, we utilize a heterogeneous graph transformer (HGT) module to extract the relational and structural information from the GRN consisting of miRNA-miRNA, gene-gene and miRNA-target interactions. The TransCNN and HGT modules can be learned end-to-end to predict experimentally validated MTIs from MiRTarBase. MiRGraph outperforms existing methods not only in recapitulating the true MTIs but also in predicting the strength of the MTIs based on in-vitro measurements of miRNA transfections. In a case study on breast cancer, we identified plausible target genes of an oncomir.
MiRGraph: A transformer-based feature learning approach to identify microRNA-target interactions by integrating heterogeneous graph network and sequence information
Pei Liu
Ying Liu
Jiawei Luo
MicroRNAs (miRNAs) play a crucial role in the regulation of gene expression by targeting specific mRNAs. They can function as both tumor suppressors and oncogenes depending on the specific miRNA and its target genes. Detecting miRNA-target interactions (MTIs) is critical for unraveling the complex mechanisms of gene regulation and identifying therapeutic targets and diagnostic markers. There is currently a lack of MTI prediction methods that simultaneously perform feature learning on heterogeneous graph network and sequence information. To improve the prediction performance of MTIs, we present a novel transformer-based multi-view feature learning method, named MiRGraph. It consists of two main modules for learning the sequence and the heterogeneous graph network, respectively. For learning the sequence-based feature embedding, we utilize the mature miRNA sequence and the complete 3’UTR sequence of the target mRNAs to encode sequence features. Specifically, a transformer-based CNN (TransCNN) module is designed for miRNAs and genes respectively to extract their personalized sequence features. For learning the network-based feature embedding, we utilize a heterogeneous graph transformer (HGT) module to extract the relational and structural information in a heterogeneous graph consisting of miRNA-miRNA, gene-gene and miRNA-target interactions. We learn the TransCNN and HGT modules end-to-end by utilizing a feedforward network, which takes the combined embedded features of the miRNA-gene pair to predict MTIs. Comparisons with other existing MTI prediction methods illustrate the superiority of MiRGraph under standard criteria. In a case study on breast cancer, we identified plausible target genes of an oncomir hsa-MiR-122-5p and plausible miRNAs that regulate the oncogene BRCA1.
Cell ontology guided transcriptome foundation model
Xinyu Yuan
Zhihao Zhan
Zuobai Zhang
Manqi Zhou
Jianan Zhao
Boyu Han
Transcriptome foundation models (TFMs) hold great promise for deciphering the transcriptomic language that dictates diverse cell functions by self-supervised learning on large-scale single-cell gene expression data, and ultimately unraveling the complex mechanisms of human diseases. However, current TFMs treat cells as independent samples and ignore the taxonomic relationships between cell types, which are available in cell ontology graphs. We argue that effectively leveraging this ontology information during TFM pre-training can improve the learning of biologically meaningful gene co-expression patterns while preserving the TFM as a general-purpose foundation model for downstream zero-shot and fine-tuning tasks. To this end, we present **s**ingle **c**ell, **Cell**-**o**ntology guided TFM (scCello). We introduce a cell-type coherence loss and an ontology alignment loss, which are minimized along with the masked gene expression prediction loss during pre-training. The novel loss components guide scCello to learn the cell-type-specific representation and the structural relation between cell types from the cell ontology graph, respectively. We pre-trained scCello on 22 million cells from the CellxGene database, leveraging their cell-type labels mapped to the cell ontology graph from the Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over existing TFMs on biologically important tasks, including identifying novel cell types of unseen cells, predicting cell-type-specific marker genes, and predicting cancer drug responses. Source code and model weights are available at https://github.com/DeepGraphLearning/scCello.
GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling
Yimin Fan
Adrien Osakwe
Shi Han
Yu Li
Supervised latent factor modeling isolates cell-type-specific transcriptomic modules that underlie Alzheimer’s disease progression
Liam Hodgson
Yasser Iturria-Medina
Jo Anne Stratton
David A. Bennett
Protocol to perform integrative analysis of high-dimensional single-cell multimodal data using an interpretable deep learning technique
Manqi Zhou
Hao Zhang
Zilong Bai
Dylan Mann-Krzisnik
Yi Wang
MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records
Yixuan Li
Ariane Marelli
Survival models can help medical practitioners to evaluate the prognostic importance of clinical variables to patient outcomes such as mortality or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high-dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are threefold: (1) integrating EHR topic inference with the Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data, consisting of 8211 subjects with 75,187 outpatient claim records of 1767 unique ICD codes; and the MIMIC-III dataset, consisting of 1458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC of 0.89 on the simulated dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization, and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge.
Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
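For readers unfamiliar with the Cox proportional hazards likelihood mentioned in the abstract above, the following minimal sketch computes the standard Cox partial log-likelihood (without tie correction) for a linear risk score. Using per-patient topic proportions `theta` as covariates and the coefficient vector `beta` are assumptions made for illustration; the toy numbers are made up, and this is not the MixEHR-SurG implementation, which is available at the repository linked above.

```python
import numpy as np

def cox_partial_log_likelihood(risk_scores, times, events):
    """Standard Cox partial log-likelihood, assuming no tied event times.

    risk_scores: linear predictor per patient (higher = higher hazard)
    times:       observed follow-up time per patient
    events:      1 if the event (e.g. death) was observed, 0 if censored
    """
    risk_scores = np.asarray(risk_scores, dtype=float)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    total = 0.0
    for i in np.flatnonzero(events):
        # Risk set: everyone still under observation at patient i's event time.
        at_risk = risk_scores[times >= times[i]]
        # Numerically stable log-sum-exp over the risk set.
        m = at_risk.max()
        log_denominator = m + np.log(np.exp(at_risk - m).sum())
        total += risk_scores[i] - log_denominator
    return total

# Hypothetical toy data: 3 patients, 2 topics (values invented for illustration).
theta = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.5, 0.5]])      # assumed per-patient topic proportions
beta = np.array([1.0, -1.0])        # assumed per-topic hazard coefficients
times = np.array([2.0, 5.0, 3.0])
events = np.array([1, 0, 1])        # the second patient is censored

log_lik = cox_partial_log_likelihood(theta @ beta, times, events)
print(log_lik)
```

In a joint model, a term of this form would be maximized together with the topic model's likelihood, so that the inferred topics are shaped by survival outcomes as well as by the EHR data.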