Publications

GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities
The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent v… (see more)ersion updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training
Harnessing agent-based frameworks in CellAgentChat to unravel cell-cell interactions from single-cell and spatial transcriptomics
Health data issues in Africa: time for digitization, standardization and harmonization
Abdoelnaser Degoot
Ismaël Koné
Shakuntala Baichoo
Mercy Ngungu
Nzisa Liku
Judit Kumuthini
Joyce Nakatumba-Nabende
Bubacarr Bah
How Overconfidence in Initial Choices and Underconfidence Under Criticism Modulate Change of Mind in Large Language Models
Dharshan Kumaran
Stephen M Fleming
Larisa Markeeva
Joseph Heyward
Andrea Banino
Mrinal Mathur
Simon Kayode Osindero
Benedetto De Martino
Petar Veličković
Viorica Patraucean
Large language models (LLMs) exhibit strikingly conflicting behaviors: they can appear steadfastly overconfident in their initial answers wh… (see more)ilst at the same time being prone to excessive doubt when challenged. To investigate this apparent paradox, we developed a novel experimental paradigm, exploiting the unique ability to obtain confidence estimates from LLMs without creating memory of their initial judgments -- something impossible in human participants. We show that LLMs -- Gemma 3, GPT4o and o1-preview -- exhibit a pronounced choice-supportive bias that reinforces and boosts their estimate of confidence in their answer, resulting in a marked resistance to change their mind. We further demonstrate that LLMs markedly overweight inconsistent compared to consistent advice, in a fashion that deviates qualitatively from normative Bayesian updating. Finally, we demonstrate that these two mechanisms -- a drive to maintain consistency with prior commitments and hypersensitivity to contradictory feedback -- parsimoniously capture LLM behavior in a different domain. Together, these findings furnish a mechanistic account of LLM confidence that explains both their stubbornness and excessive sensitivity to criticism.
HVAC-GRACE: Transferable Building Control via Heterogeneous Graph Neural Network Policies
Buildings consume 40% of global energy, with HVAC systems responsible for up to half of that demand. As energy use grows, optimizing HVAC ef… (see more)ficiency is critical to meeting climate goals. While reinforcement learning (RL) offers a promising alternative to rule-based control, real-world adoption is limited by poor sample efficiency and generalisation. We introduce HVAC-GRACE, a graph-based RL framework that models buildings as heterogeneous graphs and integrates spatial message passing directly into temporal GRU gates. This enables each zone to learn control actions informed by both its own history and its structural context. Our architecture supports zero-shot transfer by learning topology-agnostic functions—but initial experiments reveal that this benefit depends on sufficient conditioned zone connectivity to maintain gradient flow. These findings highlight both the promise and the architectural requirements of scalable, transferable RL for building control
Integrating equity, diversity, and inclusion throughout the lifecycle of artificial intelligence for healthcare: a scoping review
Elham Emami
Dana Jafarpour
Raymond Tolentino
Genevieve Gore
The lack of Equity, Diversity, and Inclusion (EDI) principles in the lifecycle of Artificial Intelligence (AI) technologies in healthcare is… (see more) a growing concern. Despite its importance, there is still a gap in understanding the initiatives undertaken to address this issue. This review aims to explore what and how EDI principles have been integrated into the design, development, and implementation of AI studies in healthcare. We followed the scoping review framework by Levac et al. and the Joanna Briggs Institute. A comprehensive search was conducted until April 29, 2022, across MEDLINE, Embase, PsycInfo, Scopus, and SCI-EXPANDED. Only research studies in which the integration of EDI in AI was the primary focus were included. Non-research articles were excluded. Two independent reviewers screened the abstracts and full texts, resolving disagreements by consensus or by consulting a third reviewer. To synthesize the findings, we conducted a thematic analysis and used a narrative description. We adhered to the PRISMA-ScR checklist for reporting scoping reviews. The search yielded 10,664 records, with 42 studies included. Most studies were conducted on the American population. Previous research has shown that AI models improve when socio-demographic factors such as gender and race are considered. Despite frameworks for EDI integration, no comprehensive approach systematically applies EDI principles in AI model development. Additionally, the integration of EDI into the AI implementation phase remains under-explored, and the representation of EDI within AI teams has been overlooked. This review reports on what and how EDI principles have been integrated into the design, development, and implementation of AI technologies in healthcare. We used a thorough search strategy and rigorous methodology, though we acknowledge limitations such as language and publication bias. A comprehensive framework is needed to ensure that EDI principles are considered throughout the AI lifecycle. Future research could focus on strategies to reduce algorithmic bias, assess the long-term impact of EDI integration, and explore policy implications to ensure that AI technologies are ethical, responsible, and beneficial for all.
LLMs and Stack Overflow discussions: Reliability, impact, and challenges
Leuson Da Silva
Jordan Samhi
LLMs and Stack Overflow discussions: Reliability, impact, and challenges
Leuson Da Silva
Jordan Samhi
Mixed-integer Second-Order Cone Programming for Multi-period Scheduling of Flexible AC Transmission System Devices
Mohamad Charara
Martin De Montigny
Nivine Abou Daher
Model approximation in MDPs with unbounded per-step cost
Ashutosh Nayyar
Yi Ouyang
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process …
A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment
Jean-Philippe Corbeil
Amin Dada
Jean-Michel Attendu
Asma Ben Abacha
Lucas Caccia
Franccois Beaulieu
Thomas Lin
Jens Kleesiek
Paul Vozila
High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language… (see more) models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.