Portrait de David Scott Krueger n'est pas disponible

David Scott Krueger

Membre académique principal
Professeur adjoint, Université de Montréal, Département d'informatique et de recherche opérationnelle (DIRO)

Biographie

David Krueger est professeur adjoint en IA robuste, raisonnable et responsable au département d'informatique et de recherche opérationnelle (DIRO) et un membre académique principal à Mila - Institut québécois d'intelligence artificielle, au Center for Human-Compatible AI (CHAI) de l'université de Berkeley et au Center for the Study of Existential Risk (CSER). Ses travaux portent sur la réduction du risque d'extinction de l'humanité par l'intelligence artificielle (x-risque IA) par le biais de la recherche technique ainsi que de l'éducation, de la sensibilisation, de la gouvernance et de la défense des droits humains.

Ses recherches couvrent de nombreux domaines de l'apprentissage profond, de l'alignement de l'IA, de la sécurité de l'IA et de l'éthique de l'IA, notamment les modes de défaillance de l'alignement, la manipulation algorithmique, l'interprétabilité, la robustesse et la compréhension de la manière dont les systèmes d'IA apprennent et se généralisent. Il a été présenté dans les médias, notamment dans l'émission Good Morning Britain d'ITV, Inside Story d'Al Jazeera, France 24, New Scientist et l'Associated Press.

David a terminé ses études supérieures à l'Université de Montréal et à Mila - Institut québécois d'intelligence artificielle, où il a travaillé avec Yoshua Bengio, Roland Memisevic et Aaron Courville.

Étudiants actuels

Doctorat - Université de Montréal
Superviseur⋅e principal⋅e :

Publications

Managing AI Risks in an Era of Rapid Progress
Geoffrey Hinton
Andrew Yao
Dawn Song
Pieter Abbeel
Yuval Noah Harari
Trevor Darrell
Ya-Qin Zhang
Lan Xue
Shai Shalev-Shwartz
Gillian K. Hadfield
Jeff Clune
Frank Hutter
Atilim Güneş Baydin
Sheila McIlraith
Qiqi Gao
Ashwin Acharya
Anca Dragan … (voir 5 de plus)
Philip Torr
Stuart Russell
Daniel Kahneman
Jan Brauner
Sören Mindermann
What Mechanisms Does Knowledge Distillation Distill?
Cindy Wu
Ekdeep Singh Lubana
Bruno Mlodozeniec
Robert Kirk
Knowledge distillation is a commonly-used compression method in ML due to the popularity of increasingly large-scale models, but it is uncle… (voir plus)ar if all the information a teacher model contains is distilled into the smaller student model. We aim to formalize the concept of ‘knowledge’ to investigate how knowledge is transferred during distillation, focusing on shared invariant outputs to counterfactual changes of dataset latent variables (we call these latents mechanisms). We define a student model to be a good stand-in model for a teacher if it shares the teacher’s learned mechanisms, and find that Jacobian matching and contrastive representation learning are viable methods by which to train such models. While these methods do not result in perfect transfer of mechanisms, we show they often improve student fidelity or mitigate simplicity bias (as measured by the teacher-to-student KL divergence and accuracy on various out-of-distribution test datasets), especially on datasets with spurious statistical correlations.
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Usman Anwar
Abulhair Saparov
Javier Rando
Daniel Paleka
Miles Turpin
Peter Hase
Ekdeep Singh Lubana
Erik Jenner
Stephen Casper
Oliver Sourbut
Benjamin L. Edelman
Zhaowei Zhang
Mario Gunther
Anton Korinek
Jose Hernandez-Orallo
Lewis Hammond
Eric J Bigelow
Alexander Pan
Lauro Langosco
Tomasz Korbak … (voir 18 de plus)
Heidi Zhang
Ruiqi Zhong
Sean 'o H'eigeartaigh
Gabriel Recchia
Giulio Corsi
Alan Chan
Markus Anderljung
Lilian Edwards
Danqi Chen
Samuel Albanie
Jakob Nicolaus Foerster
Florian Tramèr
He He
Atoosa Kasirzadeh
Yejin Choi
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are o… (voir plus)rganized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose
Safety Cases: How to Justify the Safety of Advanced AI Systems
Joshua Clymer
Nick Gabrieli
Thomas Larsen
As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them… (voir plus). To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and -- if AI systems become much more powerful -- deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.
A Generative Model of Symmetry Transformations
James U. Allingham
Bruno Mlodozeniec
Shreyas Padhy
Javier Antor'an
Richard E. Turner
Eric T. Nalisnick
Jos'e Miguel Hern'andez-Lobato
Correctly capturing the symmetry transformations of data can lead to efficient models with strong generalization capabilities, though method… (voir plus)s incorporating symmetries often require prior knowledge. While recent advancements have been made in learning those symmetries directly from the dataset, most of this work has focused on the discriminative setting. In this paper, we construct a generative model that explicitly aims to capture symmetries in the data, resulting in a model that learns which symmetries are present in an interpretable way. We provide a simple algorithm for efficiently learning our generative model and demonstrate its ability to capture symmetries under affine and color transformations. Combining our symmetry model with existing generative models results in higher marginal test-log-likelihoods and robustness to data sparsification.
Blockwise Self-Supervised Learning at Scale
Shoaib Ahmed Siddiqui
Yann LeCun
Stephane Deny
Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper
Carson Ezell
Charlotte Siegmann
Noam Kolt
Taylor Lynn Curtis
Benjamin Bucknall
Andreas A. Haupt
Kevin Wei
J'er'emy Scheurer
Marius Hobbhahn
Lee Sharkey
Satyapriya Krishna
Marvin von Hagen
Silas Alberti
Alan Chan
Qinyi Sun
Michael Gerovitch
David Bau
Max Tegmark
Dylan Hadfield-Menell
External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depe… (voir plus)nds on the degree of system access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. Meanwhile, outside-the-box access to its training and deployment information (e.g., methodology, code, documentation, hyperparameters, data, deployment details, findings from internal evaluations) allows for auditors to scrutinize the development process and design more targeted evaluations. In this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. We also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. Given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone.
Visibility into AI Agents
Alan Chan
Carson Ezell
Max Kaufmann
Kevin Wei
Lewis Hammond
Herbie Bradley
Emma Bluemke
Nitarshan Rajkumar
Noam Kolt
Lennart Heim
Markus Anderljung
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain
Robert Kirk
Ekdeep Singh Lubana
Robert P. Dick
Hidenori Tanaka
Edward Grefenstette
Tim Rocktäschel
Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning… (voir plus) systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a `wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such ``wrapped capabilities'' are relevant leads to sample-efficient revival of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a, e.g., superficially unrelated, downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
Reward Model Ensembles Help Mitigate Overoptimization
Thomas Coste
Usman Anwar
Robert Kirk
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As pa… (voir plus)rt of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the “true” reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger “gold” reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper
Xander Davies
Claudia Shi
Thomas Krendl Gilbert
Jérémy Scheurer
Javier Rando
Rachel Freedman
Tomasz Korbak
David Lindner
Pedro Freire
Tony Tong Wang
Samuel Marks
Charbel-Raphael Segerie
Micah Carroll
Andi Peng
Phillip Christoffersen
Mehul Damani
Stewart Slocum
Usman Anwar
Anand Siththaranjan … (voir 12 de plus)
Max Nadeau
Eric J Michaud
Jacob Pfau
Dmitrii Krasheninnikov
Xin Chen
Lauro Langosco
Peter Hase
Erdem Biyik
Anca Dragan
Dorsa Sadigh
Dylan Hadfield-Menell
(Out-of-context) Meta-learning in Language Models
Dmitrii Krasheninnikov
Egor Krasheninnikov
Bruno Mlodozeniec
Brown et al. (2020) famously introduced the phenomenon of in-context meta-learning in large language models (LLMs). Our work establishes the… (voir plus) existence of a phenomenon we call out-of-context meta-learning via carefully designed synthetic experiments with large language models. We show that out-of-context meta-learning leads LLMs to more readily “internalize” the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and apply it in appropriate contexts. We further demonstrate internalization in a synthetic computer vision setting, and propose two hypotheses for the emergence of internalization: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based methods may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks.