Publications
Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design
High-quality training datasets are crucial for the development of effective protein design models, but existing synthetic datasets often include unfavorable sequence-structure pairs, impairing generative model performance. We leverage ProteinMPNN, whose sequences are experimentally favorable as well as amenable to folding, together with structure prediction models to align high-quality synthetic structures with recoverable synthetic sequences. In that way, we create a new dataset designed specifically for training expressive, fully atomistic protein generators. By retraining La-Proteína, which models discrete residue type and side chain structure in a continuous latent space, on this dataset, we achieve new state-of-the-art results, with improvements of +54% in structural diversity and +27% in co-designability. To validate the broad utility of our approach, we further introduce Proteína-Atomística, a unified flow-based framework that jointly learns the distribution of protein backbone structure, discrete sequences, and atomistic side chains without latent variables. We again find that training on our new sequence-structure data dramatically boosts benchmark performance, improving Proteína-Atomística’s structural diversity by +73% and co-designability by +5%. Our work highlights the critical importance of aligned sequence-structure data for training high-performance de novo protein design models. All data will be publicly released.
The widespread success of LLMs on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term _class-based (mis)generalization_, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model's internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits (one prioritizing direct query-based reasoning, the other incorporating contextual cues) whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues, a behavior we term _stochastic chameleons_.
Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with four strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context independently from its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty, and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting.
2025-09-22
BERT2S @ Neural Information Processing Systems (poster)
The integration of AI into daily life has generated considerable attention and excitement, while also raising concerns about automating algorithmic harms and re-entrenching existing social inequities. While top-down solutions such as regulatory policies and improved algorithm design are common, the fact that AI trains on social data creates an opportunity for a grassroots approach, Algorithmic Collective Action, where users deliberately modify the data they share to steer a platform's learning process in their favor. This paper considers how these efforts interact with a firm's use of a differentially private model to protect user data, motivated by the growing regulatory focus on privacy and data protection. In particular, we investigate how the use of Differentially Private Stochastic Gradient Descent (DPSGD) affects the collective’s ability to influence the learning process. Our findings show that while differential privacy contributes to the protection of individual data, it introduces challenges for effective algorithmic collective action. We characterize lower bounds on the success of these actions as a function of the collective's size and the firm's privacy parameters, verifying these trends experimentally by training deep neural network classifiers across several datasets.
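For readers unfamiliar with DPSGD, the following is a minimal sketch of one generic update step in the standard formulation: clip each per-example gradient, sum, add Gaussian noise, then average. It is not taken from the paper. The function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DP-SGD update (illustrative sketch, not the paper's code).

    Each per-example gradient is rescaled so its L2 norm is at most clip_norm,
    the clipped gradients are summed, Gaussian noise with standard deviation
    noise_multiplier * clip_norm is added per coordinate, and the result is
    averaged over the batch before a plain gradient step.
    """
    dim = len(weights)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Clipping bounds how much any single example can move the model.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += grad[i] * scale
    sigma = noise_multiplier * clip_norm
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


# Usage with noise switched off, so that only the clipping effect is visible.
new_weights = dp_sgd_step(
    weights=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.6, 0.8]],
    lr=0.5,
    clip_norm=1.0,
    noise_multiplier=0.0,
    rng=random.Random(0),
)
```

In this toy call the first gradient has norm 5 and is scaled down to norm 1, so it contributes no more than the second one. Because clipping caps each example's contribution and the noise masks small shifts, it is plausible that a collective needs to be larger to have the same effect on training.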
FEval-TTC: Fair Evaluation Protocol for Test-Time Compute
Pavel Rumiantsev
Soumyasundar Pal
Yingxue Zhang
Mark J. Coates
The performance of Large Language Models (LLMs) and the associated dollar costs of API calls can fluctuate over time, potentially invalidating conclusions drawn in prior research.
To address this, we propose a _**F**air **Eval**uation protocol for **T**est-**T**ime **C**ompute_ (FEval-TTC), designed to ensure consistent assessment of test-time compute (TTC) methods, regardless of such fluctuations.
FEval-TTC focuses on evaluation of TTC methods that utilize underlying Chains-of-Thought (CoT).
It supports evaluations across multiple LLMs on a diverse set of mathematical and commonsense reasoning datasets.
The few-shot prompting and answer extraction processes are standardized across datasets, reducing both time and monetary overhead for researchers.
Furthermore, we provide a cost modeling procedure that estimates both the token and dollar cost per query, facilitating equitable comparisons of prevalent TTC methods.
We open-source FEval-TTC for public use at [anonymized code link](https://drive.google.com/file/d/1DUeteFA7lnx5MubuR0lh6OPN6XKfpqGC/view?usp=sharing).
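A per-query cost model of the kind described above could look roughly like the sketch below. It is not the FEval-TTC implementation. The `PriceTable` class, the `query_cost` function, and the prices are hypothetical, and the sketch assumes a fixed price per million input and output tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceTable:
    """Hypothetical frozen prices, in dollars per one million tokens."""
    input_per_million: float
    output_per_million: float


def query_cost(prompt_tokens: int, completion_tokens: int, prices: PriceTable):
    """Return (total token count, dollar cost) for a single query."""
    total_tokens = prompt_tokens + completion_tokens
    dollars = (
        prompt_tokens * prices.input_per_million
        + completion_tokens * prices.output_per_million
    ) / 1_000_000
    return total_tokens, dollars


# Made-up prices: $1 per million input tokens, $2 per million output tokens.
prices = PriceTable(input_per_million=1.0, output_per_million=2.0)
tokens, dollars = query_cost(prompt_tokens=1_000, completion_tokens=500, prices=prices)
```

Fixing the price table ahead of time is what makes comparisons stable: two TTC methods evaluated months apart are charged at the same rates even if the provider's prices have changed in between.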
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further underscores the need to develop scalable and automatically renewable evaluation methodologies. In this work, we introduce **CHASE**, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drastically drops by as much as 70% when we scale up the context size to 50k tokens.
Behavior arises from coordinated synaptic changes in recurrent neural populations. Inferring the underlying dynamics from limited, noisy, and high-dimensional neural recordings is a major challenge, as experimental data often provide only partial access to brain states. While data-driven recurrent neural networks (dRNNs) have been effective for modeling such dynamics, they are typically limited to single-task domains and struggle to generalize across behavioral conditions. Here, we propose a hierarchical model that captures neural dynamics across multiple behavioral contexts by learning a shared embedding space over RNN weights. We demonstrate that our model captures diverse neural dynamics with a single, unified model using both simulated datasets of many tasks and neural recordings during monkey reaching. Using the learned task embeddings, we demonstrate accurate classification of dynamical regimes and generalization to unseen samples. Crucially, spectral analysis of the learned weights provides valuable insights into network computations, highlighting the potential of joint embedding–weight learning for scalable inference of brain dynamics.