Publications

AInstein: Can AI Rediscover Scientific Concepts from First Principles?

Shambhavi Mishra

Jose Dolz

Christopher Pal

Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models g… (voir plus)enuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.

2025-09-21

NeurIPS.cc/2025/Workshop/WiML (publié)

openreview.net

Are Large Language Models Good Temporal Graph Learners?

Zifeng Ding

Michael M. Bronstein

Reihaneh Rabbany

Guillaume Rabusseau

Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. Wh… (voir plus)ile a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.

2025-09-21

NeurIPS.cc/2025/Workshop/NPGML (poster)

doi.org

openreview.net

Conditional Adversarial Random Forest for Synthetic Electronic Health Record Generation

Cynthia García Ybarra

Christian Gagné

2025-09-21

NeurIPS.cc/2025/Workshop/WiML (publié)

openreview.net

CrediBench: Building Web-Scale Network Datasets for Information Integrity

James Zhou

Jean-François Godbout

Michael M. Bronstein

Reihaneh Rabbany

Shenyang Huang

Online misinformation poses an escalating threat, amplified by the Internet's open nature and increasingly capable LLMs that generate persua… (voir plus)sive yet deceptive content. Existing misinformation detection methods typically focus on either textual content or network structure in isolation, failing to leverage the rich, dynamic interplay between website content and hyperlink relationships that characterizes real-world misinformation ecosystems. We introduce CrediBench: a large-scale data processing pipeline for constructing temporal web graphs that jointly model textual content and hyperlink structure for misinformation detection. Unlike prior work, our approach captures the dynamic evolution of general misinformation domains, including changes in both content and inter-site references over time. Our processed one-month snapshot extracted from the Common Crawl archive in December 2024 contains 45 million nodes and 1 billion edges, representing the largest web graph dataset made publicly available for misinformation research to date. From our experiments on this graph snapshot, we demonstrate the strength of both structural and webpage content signals for learning credibility scores, which measure source reliability. The pipeline and experimentation code are all available here, and the dataset is in this folder.

2025-09-21

NPGML @ Neural Information Processing Systems (poster)

doi.org

openreview.net

Extracting a COVID-19 signature from a multi-omic dataset

Baptiste Bauvin

Thibaud Godon

Guillaume Bachelot

Claudia Carpentier

Riikka Huusaari

Maxime Déraspe

Juho Rousu

Caroline Quach

Jacques Corbeil

The complexity of COVID-19 requires approaches that extend beyond symptom-based descriptors. Multi-omic data, combining clinical, proteomic,… (voir plus) and metabolomic information, offer a more detailed view of disease mechanisms and biomarker discovery. As part of a large-scale Quebec initiative, we collected extensive datasets from COVID-19 positive and negative patient samples. Using a multi-view machine learning framework with ensemble methods, we integrated thousands of features across clinical, proteomic, and metabolomic domains to classify COVID-19 status. We further applied a novel feature relevance methodology to identify condensed signatures. Our models achieved a balanced accuracy of 89% ± 5% despite the high-dimensional nature of the data. Feature selection yielded 12- and 50-feature signatures that improved classification accuracy by at least 3% compared to the full feature set. These signatures were both accurate and interpretable. This work demonstrates that multi-omic integration, combined with advanced machine learning, enables the extraction of robust COVID-19 signatures from complex datasets. The condensed biomarker sets provide a practical path toward improved diagnosis and precision medicine, representing a significant advancement in COVID-19 biomarker discovery.

2025-09-21

Frontiers in Bioinformatics (publié)

doi.org

Graph Dreamer: Temporal Graph World Models for Sample-Efficient and Generalisable Reinforcement Learning

Anaïs Berkes

Donna Vakalis

Yoshua Bengio

David Rolnick

2025-09-21

NeurIPS.cc/2025/Workshop/WiML (publié)

openreview.net

Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models

Mina Arzaghi

Alireza Dehghanpour Farashah

Florian Carichon

Golnoosh Farnadi

Large Language Models (LLMs) are increasingly deployed in sensitive domains such as finance, where intrinsic representational biases can pro… (voir plus)pagate into extrinsic harms in downstream tasks. High-stakes applications such as credit scoring are especially vulnerable, as biased model behavior can reinforce existing inequities and result in harmful disparities across demographic groups \cite{blodgett2020language}. While prior research has questioned whether intrinsic bias truly translates into extrinsic unfairness \cite{goldfarb2020intrinsic}, this connection remains poorly understood. To address this gap, we propose a four-stage evaluation framework that systematically examines the relationship between intrinsic and extrinsic fairness. In Stage 1, we establish a baseline by training models such as logistic regression, LLM embeddings, and fine-tuned classifiers without any mitigation strategy, providing reference points for fairness and accuracy. In Stage 2, we evaluate task-level mitigation through Counterfactual Data Augmentation (CDA) \cite{gallegos2024bias}, which balances gender representation by generating counterfactual training instances, allowing us to assess improvements in extrinsic fairness. In Stage 3, we adapt concept unlearning \cite{dige2024mitigating} as an intrinsic bias mitigation method, encouraging LLMs to forget socioeconomic stereotypes while preserving fluency and predictive utility, and we evaluate how this intervention impacts downstream fairness. Finally, in Stage 4, we combine CDA with unlearning to test whether dual mitigation further enhances fairness. We conduct experiments on three datasets (Adult Census Income, ACS Employment, and German Credit) using instruction-tuned LLMs (LLaMA-3.1, Phi-3, and Gemma-2) in both frozen embedding and fine-tuned classifier settings, evaluating performance with predictive accuracy and group fairness metrics, including Demographic Parity, Accuracy Parity, and Equality of Odds. Our experiments demonstrate that intrinsic bias mitigation through unlearning is highly effective; in Phi-3, for instance, it reduces gender socioeconomic stereotype gaps by 94.9\% while maintaining language fluency. In downstream tasks, unlearning consistently improves group fairness metrics while preserving predictive accuracy, whereas CDA primarily enhances demographic parity but can introduce accuracy trade-offs. For instance, on the ACS Employment dataset, unlearned Gemma-2 improved Accuracy Parity from 0.199 to 0.104 (48\% gain), and combining CDA with unlearning on Llama-3.1 reduced Demographic Parity from 0.080 to 0.014 (82\% gain). On the Adult dataset, all three models maintained accuracy above 0.82 while showing reduced fairness gaps, and on German Credit, unlearning consistently outperformed CDA by improving group fairness metrics without sacrificing predictive performance. Overall, CDA and unlearning exhibit complementary effects, with their combination yielding the strongest fairness improvements across models and datasets. This work contributes to bias mitigation and fairness in LLMs in two ways. First, we adapt concept unlearning to mitigate socioeconomic stereotyping, showing that intrinsic bias reduction improves both representational and downstream fairness. Second, we introduce a unified evaluation framework that links intrinsic and extrinsic fairness, enabling systematic comparison of mitigation strategies. The framework is flexible, applying to both fine-tuned and frozen LLMs, and offers actionable guidance for deploying fairer models in finance and other high-stakes domains.

2025-09-21

NeurIPS.cc/2025/Workshop/WiML (publié)

openreview.net

LLMs can learn self-restraint through iterative self-reflection

Alexandre Piché

Aristides Milios

Dzmitry Bahdanau

Christopher Pal

2025-09-21

Transactions on Machine Learning Research (accepté)

doi.org

openreview.net

Modeling Open World Cognition as On-Demand Synthesis of Probabilistic Models

Lionel Wong

Katherine M. Collins

Lance Ying

Cedegao E. Zhang

Adrian Weller

Tobias Gerstenberg

Timothy J. O'Donnell

Alexander K. Lew

Jacob Andreas

Joshua B. Tenenbaum

Tyler BrookeWilson

When faced with novel situations, people can marshal relevant considerations from a wide range of background knowledge and use these for inf… (voir plus)erence and prediction. How do we draw in globally relevant information and reason over it coherently? We explore the hypothesis that people reason by constructing structured but small, ad-hoc mental models on the fly, tailored to novel situations. We propose a computational implementation of this idea -- a ``Model Synthesis Architecture'' (MSA) -- using language models to parameterize global, relevance-based retrieval of variables, and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA, along with ablations and baselines, as a model of human judgments across a sequence of experiments that requires progressively more open-ended and open-world reasoning about situations described in natural language. Across all experiments, the MSA captures human judgments, and outperforms the base LM alone – suggesting that MSAs offer a path towards capturing coherent human reasoning in open-ended domains.

2025-09-21

NeurIPS.cc/2025/Workshop/LAW (publié)

openreview.net