Publications

Intrinsic Neural Oscillations Predict Verbal Learning Performance and Encoding Strategy Use

Victor Oswald

Mathieu Landry

Hamza Abdelhedi

Sarah Lippé

Philippe Robaey

Karim Jerbi

2025-09-27

BioRxiv (preprint)

doi.org

WebArena Verified: Reliable Evaluation for Web Agents

Amine El hattami

Megh Thakkar

Nicolas Chapados

Christopher Pal

Autonomous web agents increasingly operate in multi-step browser workflows, yet widely used benchmarks can misestimate performance due to underspecified goals and brittle checkers, challenges characteristic of normal benchmark maturation rather than flaws in the paradigm. We present WebArena Verified, a reproducible re-evaluation of WebArena that preserves its containerized environments while strengthening measurement. We audit all 812 tasks, repair misaligned evaluations, and clarify ambiguous instructions; replace substring matching with type- and normalization-aware comparators; verify backend state for state-changing tasks; and adopt a structured JSON schema with explicit status codes for deterministic scoring. We provide improved results reporting with template-level macro averages, 95% confidence intervals, and failure-mode breakdowns. We also introduce WebArena Verified Hard, a 137-task subset that retains difficult cases while reducing evaluation cost by 83%. On the baseline agent we evaluated, it reduces false negatives by approximately 11%. WebArena Verified remains drop-in compatible with minimal change to existing agents, supporting faithful and comparable progress. We release our code, data, and evaluation tools in our public repository.

2025-09-27

NeurIPS.cc/2025/Workshop/SEA (poster)

openreview.net
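
The WebArena Verified abstract above mentions replacing substring matching with type- and normalization-aware comparators. The following is a minimal sketch of that general idea, not the authors' implementation: the function names (`normalize_text`, `parse_number`, `substring_match`, `typed_match`) and the specific normalization rules are assumptions chosen for illustration.

```python
import re
import unicodedata
from decimal import Decimal, InvalidOperation


def normalize_text(value: str) -> str:
    """Apply Unicode NFKC, case-fold, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return " ".join(folded.split())


def parse_number(value: str):
    """Return a Decimal for strings like '$1,200.50', or None if not numeric.

    Stripping '$', ',' and whitespace is an illustrative assumption.
    """
    cleaned = re.sub(r"[,$\s]", "", value)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def substring_match(expected: str, answer: str) -> bool:
    """Baseline checker: passes whenever the expected string occurs anywhere."""
    return expected in answer


def typed_match(expected: str, answer: str) -> bool:
    """Compare as numbers when the expected value is numeric, else as normalized text."""
    expected_number = parse_number(expected)
    if expected_number is not None:
        answer_number = parse_number(answer)
        return answer_number is not None and answer_number == expected_number
    return normalize_text(expected) == normalize_text(answer)
```

Under these assumptions, the substring checker accepts the answer `120` for an expected value of `12` and rejects `$1,200.50` for an expected `1200.5`, while the typed comparator gets both cases right. A real evaluator would need more types (dates, URLs, lists) than this sketch covers.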

Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity

Charles E. Gagnon

Steven H. H. Ding

Philippe Charland

Benjamin C. M. Fung

2025-09-26

ArXiv (preprint)

doi.org

arxiv.org

Planner Aware Path Learning in Diffusion Language Models Training

Fred Zhangzhi Peng

Zachary Bezemek

Jarrid Rector-Brooks

Shuibai Zhang

Anru R. Zhang

Michael M. Bronstein

Avishek Bose

Alexander Tong

2025-09-26

ArXiv (preprint)

doi.org

arxiv.org

Planning with Unified Multimodal Models

Yihao Sun

Zhilong Zhang

Yang Yu

Pierre-Luc Bacon

With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.

2025-09-26

ArXiv (preprint)

doi.org

arxiv.org

Semantic Commit: Helping Users Update Intent Specifications for AI Memory at Scale

Priyan Vaithilingam

Munyeong Kim

Frida-Cecilia Acosta-Parenteau

Daniel Lee

Amine Mhedhbi

Elena L. Glassman

Ian Arawjo

2025-09-26

Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (published)

doi.org

arxiv.org

Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models

Alexander Htet Kyaw

Richa Gupta

Dhruv Shah

Anoop K. Sinha

Kory Mathewson

Stefanie Pender

Sachin Chitta

Yotto Koga

Faez Ahmed

Lawrence Sass

Randall Davis

Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components based on object functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.

2025-09-26

NeurIPS.cc/2025/Creative_AI_Track (published)

doi.org

openreview.net

BluePrint: A Social Media User Dataset for LLM Persona Evaluation and Training

Aurélien Buck-Kaeffer

Je Qin Chooi

Dan Zhao

Maximilian Puelma Touzel

Kellin Pelrine

Jean-François Godbout

Reihaneh Rabbany

Zachary Yang

Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy-respecting framework for constructing behaviorally grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset includes a sizable action set of 12 social media interaction types (likes, replies, reposts, etc.), each instance tied to the posting activity preceding it. This supports the development of agents that use context-dependence, not only in the language, but also in the interaction behaviours of social media to model social media users. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain-specific datasets to study challenges such as misinformation and polarization.

2025-09-26

ArXiv (preprint)

doi.org

arxiv.org

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun

Pierre-Luc St-Charles

Jinkyoo Park

Yoshua Bengio

Minsu Kim

2025-09-25

ArXiv (preprint)

doi.org

arxiv.org

Continual Pre-training of MoEs: How robust is your router?

Benjamin Therien

Charles-Etienne Joseph

Zain Sarwar

Ashwinee Panda

Anirban Das

Shi-Xiong Zhang

Stephen Rawls

Sambit Sahu

Eugene Belilovsky

Irina Rish

2025-09-25

TMLR (accepted)

doi.org

openreview.net

Investigating Faithfulness in Large Audio Language Models

Lovenya Jain

Pooneh Mousavi

Mirco Ravanelli

Yusuf Cem Sübakan

Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After applying these interventions across several datasets and tasks, our experiments suggest that LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.

2025-09-25

ArXiv (preprint)

doi.org

arxiv.org

Acute respiratory distress syndrome in patients with cancer: the YELENNA prospective multinational observational cohort study.

Peter Schellongowski

Michael Darmon

Philipp Eller

Laveena Munshi

Tobias Liebregts

Victoria Metaxa

Luca Montini

Tobias Lahmer

Andry Van de Louw

Martin Balik

Peter Pickkers

Pleun Hemelaar

Hemang Yadav

Andreas Barratt-Due

Thomas Karvunidis

Jordi Riera

Gennaro Martucci

Ignacio Martin-Loeches

Pedro Castro

Nina Buchtele

Virginie Lemiale

Stefan Hatzl

Guillaume Dumas

Thomas Staudinger

Elie Azoulay

Gottfried Heinz

G. Sengölge

Christian Zauner

Elisabeth Lobmeyr

Alexis Maillard

G. De Pascale

G. Panarello

Philippe R. Bauer

M. Flaksa

Brozek

Fabio S. Taccone

I. Crippa

Andreas Barrat-Due

Sandra García-Roche

Cándido Díaz-Lagares

Andrés Pacheco

A. Téllez

I. Loeches

2025-09-24

Intensive Care Medicine (published)

doi.org
