Yue Li

Vicky Dong

PhD - McGill University

Claris Gu

Master's Research - McGill University

Eric Huang

Master's Research - McGill University

Yixuan Li

PhD - McGill University

Principal supervisor :

Dylan Mann-Krzisnik

PhD - McGill University

Marshall Meng

Master's Research - McGill University

Principal supervisor :

Adrien Osakwe

PhD - McGill University

Vishvak Raghavan

PhD - McGill University

Co-supervisor : Jun Ding

Jack Song

Master's Research - McGill University

Co-supervisor :

Wilbur Wang

Master's Research - McGill University

Kunpeng Xu

Postdoctorate - McGill University

Co-supervisor :

Publications

SpaTM: Topic Models for Inferring Spatially Informed Transcriptional Programs

Adrien Osakwe

Wenqi Dong

Qihuang Zhang

Robert Sladek

Spatial transcriptomics has revolutionized our ability to characterize tissues and diseases by contextualizing gene expression with spatial organization. Available methods require researchers to either train a model using histology-based annotations or use annotation-free clustering approaches to uncover spatial domains. However, few methods provide researchers with a way to jointly analyze spatial data from both annotation-free and annotation-guided perspectives using consistent inductive biases and levels of interpretability. A single framework with consistent inductive biases ensures coherence and transferability across tasks, reducing the risks of conflicting assumptions. To this end, we propose the Spatial Topic Model (SpaTM), a topic-modeling framework capable of annotation-guided and annotation-free analysis of spatial transcriptomics data. SpaTM can be used to learn gene programs that represent histology-based annotations while providing researchers with the ability to infer spatial domains with an annotation-free approach if manual annotations are limited or noisy. We demonstrate SpaTM’s interpretability with its use of topic mixtures to represent cell states and transcriptional programs and how its intuitive framework facilitates the integration of annotation-guided and annotation-free analyses of spatial data with downstream analyses such as cell type deconvolution. Finally, we demonstrate how both approaches can be used to extend the analysis of large-scale snRNA-seq atlases with the inference of cell proximity and spatial annotations in human brains with Major Depressive Disorder.

2025-01-27

bioRxiv (preprint)

Towards whole-genome inference of polygenic scores with fast and memory-efficient algorithms

Shadi Zabad

Chirayu Anant Haryan

Simon Gravel

Sanchit Misra

2025-01-22

bioRxiv (preprint)

Extrapolatable Transformer Pre-training for Ultra Long Time-Series Forecasting

Ziyang Song

Qincheng Lu

Hao Xu

Mike He Zhu

2024-12-16

Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (published)

arxiv.org

MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling

Ruohan Wang

Zilong Wang

Ziyang Song

Automatic subphenotyping from electronic health records (EHRs) provides numerous opportunities to understand diseases with unique subgroups and enhance personalized medicine for patients. However, existing machine learning algorithms either focus on specific diseases for better interpretability or produce coarse-grained phenotype topics without considering nuanced disease patterns. In this study, we propose a guided topic model, MixEHR-Nest, to infer subphenotype topics from thousands of diseases using multi-modal EHR data. Specifically, MixEHR-Nest detects multiple subtopics from each phenotype topic, whose prior is guided by expert-curated phenotype concepts such as Phenotype Codes (PheCodes) or Clinical Classification Software (CCS) codes. We evaluated MixEHR-Nest on two EHR datasets: (1) the MIMIC-III dataset, consisting of over 38 thousand intensive care unit (ICU) patients from Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; (2) the healthcare administrative database PopHR, comprising 1.3 million patients from Montreal, Canada. Experimental results demonstrate that MixEHR-Nest can identify subphenotypes with distinct patterns within each phenotype, which are predictive of disease progression and severity. Consequently, MixEHR-Nest distinguishes between type 1 and type 2 diabetes by inferring subphenotypes using CCS codes, which do not differentiate these two subtype concepts. Additionally, MixEHR-Nest not only improved the prediction accuracy of short-term mortality of ICU patients and initial insulin treatment in diabetic patients but also revealed the contributions of subphenotypes. For longitudinal analysis, MixEHR-Nest identified subphenotypes of distinct age prevalence under the same phenotypes, such as asthma, leukemia, epilepsy, and depression. The MixEHR-Nest software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-Nest.

2024-12-16

Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (published)

arxiv.org

scMoE: single-cell mixture of experts for learning hierarchical, cell-type-specific, and interpretable representations from heterogeneous scRNA-seq data

Michael Huang

2024-12-16

Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (published)

MiRGraph: A hybrid deep learning approach to identify microRNA-target interactions by integrating heterogeneous regulatory network and genomic sequences

Pei Liu

Yang Liu

Jiawei Luo

2024-12-03

2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (published)

Bidirectional Generative Pre-training for Improving Healthcare Time-series Representation Learning

Ziyang Song

Qincheng Lu

He Zhu

Learning time-series representations for discriminative tasks, such as classification and regression, has been a long-standing challenge in the healthcare domain. Current pre-training methods are limited to either unidirectional next-token prediction or randomly masked token prediction. We propose a novel architecture called Bidirectional Timely Generative Pre-trained Transformer (BiTimelyGPT), which pre-trains on biosignals and longitudinal clinical records by both next-token and previous-token prediction in alternating transformer layers. This pre-training task preserves the original distribution and data shapes of the time-series. Additionally, the full-rank forward and backward attention matrices exhibit more expressive representation capabilities. Using biosignals and longitudinal clinical records, BiTimelyGPT demonstrates superior performance in predicting neurological functionality, disease diagnosis, and physiological signs. By visualizing the attention heatmap, we observe that the pre-trained BiTimelyGPT can identify discriminative segments from biosignal time-series sequences, even more so after fine-tuning on the task.

2024-11-25

Proceedings of the 9th Machine Learning for Healthcare Conference (published)

proceedings.mlr.press

Abstract 4142894: Multimorbidity Trajectories Across the Lifespan in Patients with Congenital Heart Disease

Chao Li

Aihua Liu

Solomon Bendayan

Liming Guo

Judith Therrien

Robyn Tamblyn

Jay Brophy

Ariane Marelli

Background: Benefiting from advances in medical care, patients with congenital heart disease (CHD) now survive to adulthood but face elevated risks of both cardiac and non-cardiac complications. Understanding the trajectories of comorbidity development over a patient's lifespan is a cornerstone of optimizing care to improve long-term health outcomes. Research Aim: This study aims to investigate the temporal sequences and evolution of comorbidities in CHD patients across their lifespan. We hypothesize that multimorbidity trajectories in CHD patients are linked to CHD lesion severity and age at onset of specific comorbidities. Methods: Using the Quebec CHD database, which comprises data on outpatient visits, hospitalization records, and vital status from 1983 to 2017, we designed a longitudinal cohort study evaluating the development of 39 comorbidities coded using ICD-9/10. Temporal sequences were mapped using median age of onset. Associations between disease pairs were quantified by hazard ratios from Cox proportional hazard models adjusting for age, sex, genetic syndrome, and competing risks of death, and taking into account the time-varying nature of the predictor diseases. Results: The cohort included 9,764 individuals with severe and 127,729 with non-severe CHD lesions. In severe CHD patients, most comorbidities developed between ages 25 and 40. Comorbidity progression began with childhood cardiovascular diseases, followed by systemic diseases such as diabetes, liver and kidney diseases, and advanced to heart failure and dementia in middle adulthood. In addition, mental disorders emerged in early adulthood and were associated with subsequent development of kidney diseases and dementia. Different trajectories were observed in non-severe CHD patients, with disease onsets 2-3 decades later and non-differential onsets between cardiovascular and systemic complications (Figure).
Conclusions: Distinct multimorbidity trajectories were observed in CHD patients by CHD lesion severity. In patients with severe CHD lesions, early systemic diseases significantly influenced subsequent complications. These findings highlight the need for well-timed surveillance guidelines and interventions to improve health outcomes.

2024-11-12

Circulation (published)

scMoE: single-cell mixture of experts for learning hierarchical, cell-type-specific, and interpretable representations from heterogeneous scRNA-seq data

Michael Huang

Advancements in single-cell transcriptomics methods have resulted in a wealth of single-cell RNA sequencing (scRNA-seq) data. Methods to learn cell representation from atlas-level scRNA-seq data across diverse tissues can shed light on cell functions implicated in diseases such as cancer. However, integrating large-scale and heterogeneous scRNA-seq data is challenging due to the disparity of cell-types and batch effects. We present single-cell Mixture of Expert (scMoE), a hierarchical mixture of experts single-cell topic model. Our key contributions are the cell-type specific experts, which explicitly align topics with cell-types, and the integration of hierarchical cell-type lineages and domain knowledge. scMoE is both transferable and highly interpretable. We benchmarked scMoE’s performance on 9 single-cell RNA-seq datasets for clustering and 3 simulated spatial datasets for spatial deconvolution. We additionally show that our model, using single-cell references, yields meaningful biological results by deconvolving 3 cancer bulk RNA-seq datasets and 2 spatial transcriptomics datasets. scMoE is able to identify cell-types of survival importance, find cancer subtype specific deconvolutional patterns, and capture meaningful spatially distinct cell-type distributions.

2024-10-25

bioRxiv (preprint)

ConvNTC: Convolutional neural tensor completion for predicting the disease-related miRNA pairs and cell-related drug pairs

Pei Liu

Xiao Liang

Jiawei Luo

2024-10-24

bioRxiv (preprint)

Cell ontology guided transcriptome foundation model

Manqi Zhou

Boyu Han

Transcriptome foundation models (TFMs) hold great promise for deciphering the transcriptomic language that dictates diverse cell functions by self-supervised learning on large-scale single-cell gene expression data, and ultimately unraveling the complex mechanisms of human diseases. However, current TFMs treat cells as independent samples and ignore the taxonomic relationships between cell types, which are available in cell ontology graphs. We argue that effectively leveraging this ontology information during TFM pre-training can improve learning of biologically meaningful gene co-expression patterns while preserving the TFM as a general-purpose foundation model for downstream zero-shot and fine-tuning tasks. To this end, we present **s**ingle **c**ell, **Cell-o**ntology guided TFM (scCello). We introduce cell-type coherence loss and ontology alignment loss, which are minimized along with the masked gene expression prediction loss during pre-training. The novel loss components guide scCello to learn the cell-type-specific representation and the structural relation between cell types from the cell ontology graph, respectively. We pre-trained scCello on 22 million cells from the CellxGene database, leveraging their cell-type labels mapped to the cell ontology graph from the Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over existing TFMs on biologically important tasks, including identifying novel cell types of unseen cells, prediction of cell-type-specific marker genes, and cancer drug responses.

2024-10-11

NeurIPS.cc/2024/Workshop/FM4Science (poster)

openreview.net

TrajGPT: Healthcare Time-Series Representation Learning for Trajectory Prediction

Ziyang Song

Qincheng Lu

Mike He Zhu