Publications

AInstein: Can AI Rediscover Scientific Concepts from First Principles?
Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models g… (see more)enuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.
Are Large Language Models Good Temporal Graph Learners?
Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. Wh… (see more)ile a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.
AURA: A Multi-modal Medical Agent for Understanding, Reasoning and Annotation
Conditional Adversarial Random Forest for Synthetic Electronic Health Record Generation
CrediBench: Building Web-Scale Network Datasets for Information Integrity
Online misinformation poses an escalating threat, amplified by the Internet's open nature and increasingly capable LLMs that generate persua… (see more)sive yet deceptive content. Existing misinformation detection methods typically focus on either textual content or network structure in isolation, failing to leverage the rich, dynamic interplay between website content and hyperlink relationships that characterizes real-world misinformation ecosystems. We introduce CrediBench: a large-scale data processing pipeline for constructing temporal web graphs that jointly model textual content and hyperlink structure for misinformation detection. Unlike prior work, our approach captures the dynamic evolution of general misinformation domains, including changes in both content and inter-site references over time. Our processed one-month snapshot extracted from the Common Crawl archive in December 2024 contains 45 million nodes and 1 billion edges, representing the largest web graph dataset made publicly available for misinformation research to date. From our experiments on this graph snapshot, we demonstrate the strength of both structural and webpage content signals for learning credibility scores, which measure source reliability. The pipeline and experimentation code are all available here, and the dataset is in this folder.
DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching
Language Identification (LID) is a core task in multilingual NLP, yet current systems often overfit to clean, monolingual data. This work in… (see more)troduces DIVERS-BENCH, a comprehensive evaluation of state-of-the-art LID models across diverse domains, including speech transcripts, web text, social media texts, children's stories, and code-switched text. Our findings reveal that while models achieve high accuracy on curated datasets, performance degrades sharply on noisy and informal inputs. We also introduce DIVERS-CS, a diverse code-switching benchmark dataset spanning 10 language pairs, and show that existing models struggle to detect multiple languages within the same sentence. These results highlight the need for more robust and inclusive LID systems in real-world settings.
Extracting a COVID-19 signature from a multi-omic dataset
Baptiste Bauvin
Guillaume Bachelot
Claudia Carpentier
Riikka Huusaari
Maxime Déraspe
Juho Rousu
Caroline Quach
The complexity of COVID-19 requires approaches that extend beyond symptom-based descriptors. Multi-omic data, combining clinical, proteomic,… (see more) and metabolomic information, offer a more detailed view of disease mechanisms and biomarker discovery.As part of a large-scale Quebec initiative, we collected extensive datasets from COVID-19 positive and negative patient samples. Using a multi-view machine learning framework with ensemble methods, we integrated thousands of features across clinical, proteomic, and metabolomic domains to classify COVID-19 status. We further applied a novel feature relevance methodology to identify condensed signatures.Our models achieved a balanced accuracy of 89% ± 5% despite the high-dimensional nature of the data. Feature selection yielded 12- and 50-feature signatures that improved classification accuracy by at least 3% compared to the full feature set. These signatures were both accurate and interpretable.This work demonstrates that multi-omic integration, combined with advanced machine learning, enables the extraction of robust COVID-19 signatures from complex datasets. The condensed biomarker sets provide a practical path toward improved diagnosis and precision medicine, representing a significant advancement in COVID-19 biomarker discovery.
Extracting a COVID-19 signature from a multi-omic dataset
Baptiste Bauvin
Guillaume Bachelot
Claudia Carpentier
Riikka Huusaari
Maxime Déraspe
Juho Rousu
Caroline Quach
Introduction The complexity of COVID-19 requires approaches that extend beyond symptom-based descriptors. Multi-omic data, combining clinica… (see more)l, proteomic, and metabolomic information, offer a more detailed view of disease mechanisms and biomarker discovery. Methods As part of a large-scale Quebec initiative, we collected extensive datasets from COVID-19 positive and negative patient samples. Using a multi-view machine learning framework with ensemble methods, we integrated thousands of features across clinical, proteomic, and metabolomic domains to classify COVID-19 status. We further applied a novel feature relevance methodology to identify condensed signatures. Results Our models achieved a balanced accuracy of 89% ± 5% despite the high-dimensional nature of the data. Feature selection yielded 12- and 50-feature signatures that improved classification accuracy by at least 3% compared to the full feature set. These signatures were both accurate and interpretable. Discussion This work demonstrates that multi-omic integration, combined with advanced machine learning, enables the extraction of robust COVID-19 signatures from complex datasets. The condensed biomarker sets provide a practical path toward improved diagnosis and precision medicine, representing a significant advancement in COVID-19 biomarker discovery.
Extracting a COVID-19 signature from a multi-omic dataset
Baptiste Bauvin
Guillaume Bachelot
Claudia Carpentier
Riikka Huusaari
Maxime Déraspe
Juho Rousu
Caroline Quach
Introduction The complexity of COVID-19 requires approaches that extend beyond symptom-based descriptors. Multi-omic data, combining clinica… (see more)l, proteomic, and metabolomic information, offer a more detailed view of disease mechanisms and biomarker discovery. Methods As part of a large-scale Quebec initiative, we collected extensive datasets from COVID-19 positive and negative patient samples. Using a multi-view machine learning framework with ensemble methods, we integrated thousands of features across clinical, proteomic, and metabolomic domains to classify COVID-19 status. We further applied a novel feature relevance methodology to identify condensed signatures. Results Our models achieved a balanced accuracy of 89% ± 5% despite the high-dimensional nature of the data. Feature selection yielded 12- and 50-feature signatures that improved classification accuracy by at least 3% compared to the full feature set. These signatures were both accurate and interpretable. Discussion This work demonstrates that multi-omic integration, combined with advanced machine learning, enables the extraction of robust COVID-19 signatures from complex datasets. The condensed biomarker sets provide a practical path toward improved diagnosis and precision medicine, representing a significant advancement in COVID-19 biomarker discovery.
Extracting a COVID-19 signature from a multi-omic dataset
Baptiste Bauvin
Guillaume Bachelot
Claudia Carpentier
Riikka Huusaari
Maxime Déraspe
Juho Rousu
Caroline Quach
Extracting a COVID-19 signature from a multi-omic dataset
Baptiste Bauvin
Guillaume Bachelot
Claudia Carpentier
Riikka Huusaari
Maxime Déraspe
Juho Rousu
Caroline Quach
Graph Dreamer: Temporal Graph World Models for Sample-Efficient and Generalisable Reinforcement Learning