Thibaud Godon

Guillaume Bachelot

Claudia Carpentier

Riikka Huusaari

Maxime Déraspe

Juho Rousu

Caroline Quach

Introduction: The complexity of COVID-19 requires approaches that extend beyond symptom-based descriptors. Multi-omic data, combining clinical, proteomic, and metabolomic information, offer a more detailed view of disease mechanisms and biomarker discovery. Methods: As part of a large-scale Quebec initiative, we collected extensive datasets from COVID-19 positive and negative patient samples. Using a multi-view machine learning framework with ensemble methods, we integrated thousands of features across clinical, proteomic, and metabolomic domains to classify COVID-19 status. We further applied a novel feature relevance methodology to identify condensed signatures. Results: Our models achieved a balanced accuracy of 89% ± 5% despite the high-dimensional nature of the data. Feature selection yielded 12- and 50-feature signatures that improved classification accuracy by at least 3% compared to the full feature set. These signatures were both accurate and interpretable. Discussion: This work demonstrates that multi-omic integration, combined with advanced machine learning, enables the extraction of robust COVID-19 signatures from complex datasets. The condensed biomarker sets provide a practical path toward improved diagnosis and precision medicine, representing a significant advancement in COVID-19 biomarker discovery.

2025-09-22

Frontiers in Bioinformatics (published)

Extracting a COVID-19 signature from a multi-omic dataset

Baptiste Bauvin

Guillaume Bachelot

Claudia Carpentier

Riikka Huusaari

Maxime Déraspe

Juho Rousu

Caroline Quach

2025-09-22

Frontiers in Bioinformatics (published)

www.ncbi.nlm.nih.gov

On Selecting Robust Approaches for Learning Predictive Biomarkers in Metabolomics Data Sets.

Pier-Luc Plante

Pascal Germain

Alexandre Drouin

Metabolomics, the study of small molecules within biological systems, offers insights into metabolic processes and, consequently, holds great promise for advancing health outcomes. Biomarker discovery in metabolomics represents a significant challenge, notably due to the high dimensionality of the data. Recent work has addressed this problem by analyzing the most important variables in machine learning models. Unfortunately, this approach relies on prior hypotheses about the structure of the data and may overlook simple patterns. To assess the true usefulness of machine learning methods, we evaluate them on a collection of 835 metabolomics data sets. This effort provides valuable insights for metabolomics researchers regarding where and when to use machine learning. It also establishes a benchmark for the evaluation of future methods. Nonetheless, the results emphasize the high diversity of data sets in metabolomics and the complexity of finding biologically relevant biomarkers. As a result, we propose a novel approach applicable across all data sets, offering guidance for future analyses. This method involves directly comparing univariate and multivariate models. We demonstrate through selected examples how this approach can guide data analysis across diverse data set structures, representative of the observed variability. Code and data are available for research purposes.

2025-06-12

Analytical Chemistry (published)

Invariant Causal Set Covering Machines

Baptiste Bauvin

Pascal Germain

Alexandre Drouin

2023-06-07

ArXiv (preprint)

arxiv.org

RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data

Pier-Luc Plante

Baptiste Bauvin

Élina Francovic-Fontaine

Alexandre Drouin

Background: Understanding the relationship between omics data and the phenotype is a central problem in precision medicine. The high dimensionality of metabolomics data challenges learning algorithms in terms of scalability and generalization. Most learning algorithms do not produce interpretable models. Method: We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules. Results: Applications on metabolomics data show that it produces models that achieve high predictive performance. The interpretability of the models makes them useful for biomarker discovery and pattern discovery in high-dimensional data.

2022-08-11

ArXiv (preprint)