Yue Li

2022-10-01

Journal of Biomedical Informatics (published)

Automatic Phenotyping by a Seed-guided Topic Model

Ziyang Song

Yuanyi Hu

Aman Verma

Electronic health records (EHRs) provide rich clinical information and the opportunities to extract epidemiological patterns to understand and predict patient disease risks with suitable machine learning methods such as topic models. However, existing topic models do not generate identifiable topics each predicting a unique phenotype. One promising direction is to use known phenotype concepts to guide topic inference. We present a seed-guided Bayesian topic model called MixEHR-Seed with 3 contributions: (1) for each phenotype, we infer a dual-form of topic distribution: a seed-topic distribution over a small set of key EHR codes and a regular topic distribution over the entire EHR vocabulary; (2) we model age-dependent disease progression as Markovian dynamic topic priors; (3) we infer seed-guided multi-modal topics over distinct EHR data types. For inference, we developed a variational inference algorithm. Using MixEHR-Seed, we inferred 1569 PheCode-guided phenotype topics from an EHR database in Quebec, Canada covering 1.3 million patients for up to 20-year follow-up with 122 million records for 8539 and 1126 unique diagnostic and drug codes, respectively. We observed (1) accurate phenotype prediction by the guided topics, (2) clinically relevant PheCode-guided disease topics, (3) meaningful age-dependent disease prevalence. Source code is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-Seed.

2022-08-14

Knowledge Discovery and Data Mining (published)

Modeling electronic health record data using a knowledge-graph-embedded topic model

Yuesong Zou

Ahmad Pesaranghader

Aman Verma

The rapid growth of electronic health record (EHR) datasets opens up promising opportunities to understand human diseases in a systematic way. However, effective extraction of clinical knowledge from the EHR data has been hindered by its sparsity and noisy information. We present KG-ETM, an end-to-end knowledge graph-based multimodal embedded topic model. KG-ETM distills latent disease topics from EHR data by learning the embedding from the medical knowledge graphs. We applied KG-ETM to a large-scale EHR dataset consisting of over 1 million patients. We evaluated its performance based on EHR reconstruction and drug imputation. KG-ETM demonstrated superior performance over the alternative methods on both tasks. Moreover, our model learned clinically meaningful graph-informed embedding of the EHR codes. In addition, our model is also able to discover interpretable and accurate patient representations for patient stratification and drug recommendations.

2022-06-03

ArXiv (preprint)

arxiv.org

Inferring global-scale temporal latent topics from news reports to predict public health interventions for COVID-19

Zhi Wen

Guido Powell

Imane Chafi

Y. K. Li

2022-02-01

Patterns (published)

Supervised multi-specialist topic model with applications on large-scale electronic health record data

Ziyang Song

Xavier Sumba Toral

Yixin Xu

Aihua Liu

Liming Guo

Guido Powell

Aman Verma

Ariane Marelli

Motivation: Electronic health record (EHR) data provides a new venue to elucidate disease comorbidities and latent phenotypes for precision medicine. To fully exploit its potential, a realistic data generative process of the EHR data needs to be modelled. Materials and Methods: We present MixEHR-S to jointly infer specialist-disease topics from the EHR data. As the key contribution, we model the specialist assignments and ICD-coded diagnoses as the latent topics based on patient's underlying disease topic mixture in a novel unified supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S. Results: We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) Congenital Heart Disease (CHD) diagnostic prediction among 154,775 patients; (2) Chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes as a means to assess the disease exacerbation. In all three applications, MixEHR-S conferred clinically meaningful latent topics among the most predictive latent topics and achieved superior target prediction accuracy compared to the existing methods, providing opportunities for prioritizing high-risk patients for healthcare services. Availability and implementation: MixEHR-S source code and scripts of the experiments are freely available at https://github.com/li-lab-mcgill/mixehrS

2021-08-01

Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (published)

arxiv.org

Global Surveillance of COVID-19 by mining news media using a multi-source dynamic embedded topic model

Pratheeksha Nair

Zhi Wen

Imane Chafi

Anya Okhmatovskaia

Guido Powell

Yannan Shen

2020-11-10

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (published)