Publications
Grounding Computer Use Agents on Human Demonstrations
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.
Modeling user preferences across domains remains a key challenge in slate recommendation (i.e. recommending an ordered sequence of items) research. We investigate how Large Language Models (LLMs) can effectively act as world models of user preferences through pairwise reasoning over slates. We conduct an empirical study involving several LLMs on three tasks spanning different datasets. Our results reveal relationships between task performance and properties of the preference function captured by LLMs, hinting at areas for improvement and highlighting the potential of LLMs as world models in recommender systems.
Large Language Models (LLMs) are rapidly being adopted by users across the globe, who interact with them in a diverse range of languages. At the same time, there are well-documented imbalances in the training data and optimisation objectives of this technology, raising doubts as to whether LLMs can represent the cultural diversity of their broad user base. In this study, we look at LLMs and cultural values and examine how prompt language and cultural framing influence model responses and their alignment with human values in different countries. We probe 10 LLMs with 63 items from the Hofstede Values Survey Module and World Values Survey, translated into 11 languages, and formulated as prompts with and without different explicit cultural perspectives. Our study confirms that both prompt language and cultural perspective produce variation in LLM outputs, but with an important caveat: While targeted prompting can, to a certain extent, steer LLM responses in the direction of the predominant values of the corresponding countries, it does not overcome the models' systematic bias toward the values associated with a restricted set of countries in our dataset: the Netherlands, Germany, the US, and Japan. All tested models, regardless of their origin, exhibit remarkably similar patterns: They produce fairly neutral responses on most topics, with selective progressive stances on issues such as social tolerance. Alignment with cultural values of human respondents is improved more with an explicit cultural perspective than with a targeted prompt language. Unexpectedly, combining both approaches is no more effective than cultural framing with an English prompt. These findings reveal that LLMs occupy an uncomfortable middle ground: They are responsive enough to changes in prompts to produce variation, but too firmly anchored to specific cultural defaults to adequately represent cultural diversity.
The Delphi method is a structured forecasting process that engages experts in iterative prediction and reflection. Each round, experts submit forecasts to a mediator, receive an aggregated and synthesized response highlighting key arguments, and update their forecasts based on collective insight. However, Delphi panels are labour intensive, slow and hard to reproduce, requiring diverse knowledgeable participants to engage periodically across weeks or months. To address these constraints, we propose DeLLMphi, a forecasting method that replaces human experts and mediators with LLMs. We show (i) that providing example superforecaster reasoning traces and predictions helps to elicit more accurate forecasts from LLM experts, (ii) that the mediator plays the crucial role of surfacing different lines of reasoning and points of disagreement, and (iii) that multiple rounds and experts lead to better forecasts, showing that multi-turn interaction is key to DeLLMphi.
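The round structure described in this abstract (experts forecast, a mediator aggregates, experts update) can be illustrated with a minimal sketch. This is not the authors' DeLLMphi implementation: the names `delphi`, `mediate` and `make_stub_expert` are hypothetical, the experts are toy functions rather than LLMs, and the mediator only reports a median and range instead of synthesizing arguments.

```python
from statistics import median
from typing import Callable, List

# Hypothetical type: an expert maps (question, mediator feedback) to a probability.
Expert = Callable[[str, str], float]


def mediate(forecasts: List[float]) -> str:
    """Toy mediator: summarise a round as the median and range of forecasts."""
    return (
        f"median={median(forecasts):.2f}, "
        f"range={min(forecasts):.2f}-{max(forecasts):.2f}"
    )


def delphi(question: str, experts: List[Expert], rounds: int = 3) -> float:
    """Run `rounds` (>= 1) Delphi rounds and return the final median forecast."""
    feedback = ""  # no feedback is available before the first round
    forecasts: List[float] = []
    for _ in range(rounds):
        # Each expert forecasts, seeing only the mediator's previous summary.
        forecasts = [expert(question, feedback) for expert in experts]
        feedback = mediate(forecasts)
    return median(forecasts)


def make_stub_expert(prior: float) -> Expert:
    """Toy expert (assumption): starts at `prior`, then moves halfway
    toward the median reported in the mediator's feedback."""
    state = {"p": prior}

    def expert(question: str, feedback: str) -> float:
        if feedback:
            reported = float(feedback.split("median=")[1].split(",")[0])
            state["p"] = (state["p"] + reported) / 2
        return state["p"]

    return expert


panel = [make_stub_expert(p) for p in (0.2, 0.4, 0.9)]
result = delphi("Will event X occur?", panel, rounds=3)
```

In a setting closer to the one the abstract describes, each stub expert and the mediator would be replaced by a model call; the loop itself would stay the same.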
Longitudinal functional connectivity during rest and task is differentially related to Alzheimer's pathology and episodic memory in older adults
Larissa Fischer
Jenna N. Adams
Eóin N. Molloy
Jennifer Tremblay-Mercier
Jordana Remz
Alexa Pichet Binette
M. Natasha Rajah
Sylvia Villeneuve
Anne Maass
PREVENT-AD Research Group
Changes in functional connectivity (FC) strength involving the medial temporal lobe (MTL) and posteromedial cortex (PMC) are related to early Alzheimer’s pathology and alterations in episodic memory performance in cognitively unimpaired older adults, but their dynamics remain unclear. We examined how longitudinal changes in FC involving MTL and PMC during resting-state, episodic memory encoding, and retrieval relate to subsequent amyloid- and tau-PET burden, longitudinal episodic memory performance, and the APOE4 genotype in 152 cognitively unimpaired older adults from the PREVENT-AD cohort. We found APOE4- and fMRI paradigm-dependent associations of change in FC strength with pathology burden and change in episodic memory performance. Decreasing FC over time, or “hypoconnectivity”, within PMC during rest in APOE4 carriers and during retrieval in APOE4 non-carriers was related to more amyloid and tau, respectively. Conversely, increasing FC over time, or “hyperconnectivity”, within MTL during encoding in APOE4 carriers and between MTL and PMC during retrieval independent of APOE4 status was related to more tau. Further, increasing FC between MTL and PMC during rest, unlike during encoding, was beneficial for episodic memory. Our study highlights that pathology-related episodic memory network changes manifest differently during rest and task and have differential implications for episodic memory trajectories.
The online version contains supplementary material available at 10.1038/s41598-025-21596-0.