Yue Li

Associate Academic Member
Assistant Professor, McGill University, School of Computer Science
Research Topics
Computational Biology

Biography

I completed my PhD degree in computer science and computational biology at the University of Toronto in 2014. Prior to joining McGill University, I was a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT (2015–2018).

In general, my research program covers three main research areas that involve applied machine learning in computational genomics and health. More specifically, it focuses on developing interpretable probabilistic learning models and deep learning models to model genetic, epigenetic, electronic health record and single-cell genomic data.

By systematically integrating multimodal and longitudinal data, I aim to have impactful applications in computational medicine, including building intelligent clinical recommender systems, forecasting patient health trajectories, making personalized polygenic risk predictions, characterizing multi-trait functional genetic mutations, and dissecting cell-type-specific regulatory elements that underpin complex traits and diseases in humans.

Current Students

PhD - McGill University
Master's Research - McGill University
Master's Research - McGill University
PhD - McGill University
Principal supervisor:
PhD - McGill University
Master's Research - McGill University
Principal supervisor:
PhD - McGill University
Master's Research - McGill University
Co-supervisor:
PhD - McGill University
Collaborating Alumni - McGill University
Master's Research - McGill University
PhD - McGill University
Master's Research - McGill University
PhD - McGill University

Publications

Multi-ancestry polygenic risk scores using phylogenetic regularization
Elliot Layne
Shadi Zabad
Bidirectional Generative Pre-training for Improving Healthcare Time-series Representation Learning
Ziyang Song
Qincheng Lu
He Zhu
Learning time-series representations for discriminative tasks, such as classification and regression, has been a long-standing challenge in the healthcare domain. Current pre-training methods are limited in either unidirectional next-token prediction or randomly masked token prediction. We propose a novel architecture called Bidirectional Timely Generative Pre-trained Transformer (BiTimelyGPT), which pre-trains on biosignals and longitudinal clinical records by both next-token and previous-token prediction in alternating transformer layers. This pre-training task preserves original distribution and data shapes of the time-series. Additionally, the full-rank forward and backward attention matrices exhibit more expressive representation capabilities. Using biosignals and longitudinal clinical records, BiTimelyGPT demonstrates superior performance in predicting neurological functionality, disease diagnosis, and physiological signs. By visualizing the attention heatmap, we observe that the pre-trained BiTimelyGPT can identify discriminative segments from biosignal time-series sequences, even more so after fine-tuning on the task.
Machine Learning Informed Diagnosis for Congenital Heart Disease in Large Claims Data Source
Ariane Marelli
Chao Li
Aihua Liu
Hanh Nguyen
Harry Moroz
James M. Brophy
Liming Guo
Bidirectional Generative Pre-training for Improving Time Series Representation Learning
Ziyang Song
Qincheng Lu
Mike He Zhu
MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records
Yixuan Li
Ariane Marelli
Survival models can help medical practitioners to evaluate the prognostic importance of clinical variables to patient outcomes such as mortality or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold the promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high-dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are threefold: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8211 subjects with 75,187 outpatient claim records of 1767 unique ICD codes, and the MIMIC-III data consisting of 1458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge.
Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
TimelyGPT: Extrapolatable Transformer Pre-training for Long-term Time-Series Forecasting in Healthcare
Ziyang Song
Qincheng Lu
Hao Xu
Mike He Zhu
MDFD: Study of Distributed Non-IID Scenarios and Frechet Distance-Based Evaluation
Wei Wang
Mingwei Zhang
Ziwen Wu
Qianxi Chen
The development of distributed machine learning and federated learning has advanced solutions to the data island problem. People use computer clusters to train machine learning models on data distributed in different regions. In the early stage of research, researchers usually assume that the data sets of each node are independent and identically distributed (IID), but this is a strong assumption, which is challenging to meet in practical applications. Therefore, research on non-IID has become a hot spot in recent years. However, there is no uniform standard for designing and evaluating non-IID scenarios. This paper proposes a Frechet distance-independent non-IID distribution dataset metric MDFD. We conducted experiments on different types of distributed machine-learning methods in different non-IID scenarios to verify the effectiveness of MDFD.
SDWD: Style Diversity Weighted Distance Evaluates the Intra-Class Data Diversity of Distributed GANs
Wei Wang
Ziwen Wu
Mingwei Zhang
Differential Chromatin Architecture and Risk Variants in Deep Layer Excitatory Neurons and Grey Matter Microglia Contribute to Major Depressive Disorder
Anjali Chawla
Doruk Cakmakci
Wenmin Zhang
Malosree Maitra
Reza Rahimian
Haruka Mitsuhashi
MA Davoli
Jenny Yang
Gary Gang Chen
Ryan Denniston
Deborah Mash
Naguib Mechawar
Matthew Suderman
Corina Nagy
Gustavo Turecki
GTM-decon: guided-topic modeling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes
Lakshmipuram Seshadri Swapna
Michael Huang
Guided-topic modelling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes
Lakshmipuram Seshadri Swapna
Michael Huang
Cell-type composition is an important indicator of health. We present Guided Topic Model for deconvolution (GTM-decon) to automatically infer cell-type-specific gene topic distributions from single-cell RNA-seq data for deconvolving bulk transcriptomes. GTM-decon performs competitively on deconvolving simulated and real bulk data compared with the state-of-the-art methods. Moreover, as demonstrated in deconvolving disease transcriptomes, GTM-decon can infer multiple cell-type-specific gene topic distributions per cell type, which captures sub-cell-type variations. GTM-decon can also use phenotype labels from single-cell or bulk data as a guide to infer phenotype-specific gene distributions. In a nested-guided design, GTM-decon identified cell-type-specific differentially expressed genes from bulk breast cancer transcriptomes.
Biomedical discovery through the integrative biomedical knowledge hub (iBKH).
Chang Su
Yufang Hou
Manqi Zhou
Suraj Rajendran
Jacqueline R.M. A. Maasch
Zehra Abedi
Haotan Zhang
Zilong Bai
Anthony Cuturrufo
Winston Guo
Fayzan F. Chaudhry
Gregory Ghahramani
Feixiong Cheng
Rui Zhang
Steven T. DeKosky
Jiang Bian
Yi Wang