Portrait de David Scott Krueger

David Scott Krueger

Membre académique principal
Professeur adjoint, Université de Montréal, Département d'informatique et de recherche opérationnelle (DIRO)
Sujets de recherche
Apprentissage de représentations
Apprentissage profond

Biographie

David Krueger est professeur adjoint en IA robuste, raisonnable et responsable au département d'informatique et de recherche opérationnelle (DIRO) et un membre académique principal à Mila - Institut québécois d'intelligence artificielle, au Center for Human-Compatible AI (CHAI) de l'université de Berkeley et au Center for the Study of Existential Risk (CSER). Ses travaux portent sur la réduction du risque d'extinction de l'humanité par l'intelligence artificielle (x-risque IA) par le biais de la recherche technique ainsi que de l'éducation, de la sensibilisation, de la gouvernance et de la défense des droits humains.

Ses recherches couvrent de nombreux domaines de l'apprentissage profond, de l'alignement de l'IA, de la sécurité de l'IA et de l'éthique de l'IA, notamment les modes de défaillance de l'alignement, la manipulation algorithmique, l'interprétabilité, la robustesse et la compréhension de la manière dont les systèmes d'IA apprennent et se généralisent. Il a été présenté dans les médias, notamment dans l'émission Good Morning Britain d'ITV, Inside Story d'Al Jazeera, France 24, New Scientist et l'Associated Press.

David a terminé ses études supérieures à l'Université de Montréal et à Mila - Institut québécois d'intelligence artificielle, où il a travaillé avec Yoshua Bengio, Roland Memisevic et Aaron Courville.

Étudiants actuels

Doctorat - UdeM
Superviseur⋅e principal⋅e :

Publications

Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Usman Anwar
Abulhair Saparov
Javier Rando
Daniel Paleka
Miles Turpin
Peter Hase
Ekdeep Singh Lubana
Erik Jenner
Stephen Casper
Oliver Sourbut
Benjamin L. Edelman
Zhaowei Zhang
Mario Günther
Anton Korinek
Jose Hernandez-Orallo
Lewis Hammond
Eric J Bigelow
Alexander Pan
Lauro Langosco
Tomasz Korbak … (voir 22 de plus)
Heidi Chenyu Zhang
Ruiqi Zhong
Sean O hEigeartaigh
Gabriel Recchia
Giulio Corsi
Alan Chan
Markus Anderljung
Lilian Edwards
Aleksandar Petrov
Christian Schroeder de Witt
Danqi Chen
Sumeet Ramesh Motwani
Jakob Nicolaus Foerster
Philip Torr
Florian Tramèr
Samuel Albanie
Yejin Choi
He He
Atoosa Kasirzadeh
Implicitly Bayesian Prediction Rules in Deep Learning
Bruno Mlodozeniec
Richard Turner
The Bayesian approach leads to coherent updates of predictions under new data, which makes adhering to Bayesian principles appealing in deci… (voir plus)sion-making contexts. Traditionally, integrating Bayesian principles into models like deep neural networks involves setting priors on parameters and approximating posteriors. This is done despite the fact that, typically, priors on parameters reflect any prior beliefs only insofar as they dictate function space behaviour. In this paper, we rethink this approach and consider what properties characterise a prediction rule as being Bayesian. Algorithms meeting such criteria can be deemed implicitly Bayesian — they make the same predictions as some Bayesian model, without explicitly manifesting priors and posteriors. We argue this might be a more fruitful approach towards integrating Bayesian principles into deep learning. In this paper, we propose how to measure how close a general prediction rule is to being implicitly Bayesian, and empirically evaluate multiple prediction strategies using our approach. We also show theoretically that agents relying on non-implicitly Bayesian prediction rules can be easily exploited in adversarial betting settings.
Implicit meta-learning may lead language models to trust more reliable sources
Dmitrii Krasheninnikov
Egor Krasheninnikov
Bruno Mlodozeniec
We demonstrate that large language models (LLMs) may learn indicators of document usefulness and modulate their updates accordingly. We intr… (voir plus)oduce random strings ("tags") as indicators of usefulness in a synthetic fine-tuning dataset. Fine-tuning on this dataset leads to **implicit meta-learning (IML)**: in further fine-tuning, the model updates to make more use of text that is tagged as useful. We perform a thorough empirical investigation of this phenomenon, finding (among other things) that (i) it occurs in both pretrained LLMs and those trained from scratch, as well as on a vision task, and (ii) larger models and smaller batch sizes tend to give more IML. We also use probing to examine how IML changes the way models store knowledge in their parameters. Finally, we reflect on what our results might imply about the capabilities, risks, and controllability of future AI systems.
Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper
Carson Ezell
Charlotte Siegmann
Noam Kolt
Taylor Lynn Curtis
Benjamin Bucknall
Andreas Haupt
Kevin Wei
Jérémy Scheurer
Marius Hobbhahn
Lee Sharkey
Satyapriya Krishna
Marvin Von Hagen
Silas Alberti
Alan Chan
Qinyi Sun
Michael Gerovitch
David Bau
Max Tegmark
Dylan Hadfield-Menell
External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depe… (voir plus)nds on the degree of system access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. Meanwhile, outside-the-box access to its training and deployment information (e.g., methodology, code, documentation, hyperparameters, data, deployment details, findings from internal evaluations) allows for auditors to scrutinize the development process and design more targeted evaluations. In this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. We also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. Given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone.
Visibility into AI Agents
Alan Chan
Carson Ezell
Max Kaufmann
Kevin Wei
Lewis Hammond
Herbie Bradley
Emma Bluemke
Nitarshan Rajkumar
Noam Kolt
Lennart Heim
Markus Anderljung
Implicitly Bayesian Prediction Rules in Deep Learning
Bruno Mlodozeniec
Richard E. Turner
The Bayesian approach leads to coherent updates of predictions under new data, which makes adhering to Bayesian principles appealing in deci… (voir plus)sion-making contexts. Traditionally, integrating Bayesian principles into models like deep neural networks involves setting priors on parameters and approximating posteriors. This is done despite the fact that, typically, priors on parameters reflect any prior beliefs only insofar as they dictate function space behaviour. In this paper, we rethink this approach and consider what properties characterise a prediction rule as being Bayesian. Algorithms meeting such criteria can be deemed implicitly Bayesian — they make the same predictions as some Bayesian model, without explicitly manifesting priors and posteriors. We argue this might be a more fruitful approach towards integrating Bayesian principles into deep learning. In this paper, we propose how to measure how close a general prediction rule is to being implicitly Bayesian, and empirically evaluate multiple prediction strategies using our approach. We also show theoretically that agents relying on non-implicitly Bayesian prediction rules can be easily exploited in adversarial betting settings.
What Mechanisms Does Knowledge Distillation Distill?
Cindy Wu
Ekdeep Singh Lubana
Bruno Mlodozeniec
Robert Kirk
Knowledge distillation is a commonly-used compression method in ML due to the popularity of increasingly large-scale models, but it is uncle… (voir plus)ar if all the information a teacher model contains is distilled into the smaller student model. We aim to formalize the concept of ‘knowledge’ to investigate how knowledge is transferred during distillation, focusing on shared invariant outputs to counterfactual changes of dataset latent variables (we call these latents mechanisms). We define a student model to be a good stand-in model for a teacher if it shares the teacher’s learned mechanisms, and find that Jacobian matching and contrastive representation learning are viable methods by which to train such models. While these methods do not result in perfect transfer of mechanisms, we show they often improve student fidelity or mitigate simplicity bias (as measured by the teacher-to-student KL divergence and accuracy on various out-of-distribution test datasets), especially on datasets with spurious statistical correlations.
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Usman Anwar
Abulhair Saparov
Javier Rando
Daniel Paleka
Miles Turpin
Peter Hase
Ekdeep Singh Lubana
Erik Jenner
Stephen Casper
Oliver Sourbut
Benjamin L. Edelman
Zhaowei Zhang
Mario Günther
Anton Korinek
Jose Hernandez-Orallo
Lewis Hammond
Eric J Bigelow
Alexander Pan
Lauro Langosco
Tomasz Korbak … (voir 18 de plus)
Heidi Zhang
Ruiqi Zhong
Sean 'o H'eigeartaigh
Gabriel Recchia
Giulio Corsi
Alan Chan
Markus Anderljung
Lilian Edwards
Danqi Chen
Samuel Albanie
Jakob Nicolaus Foerster
Florian Tramèr
He He
Atoosa Kasirzadeh
Yejin Choi
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are o… (voir plus)rganized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose
Safety Cases: How to Justify the Safety of Advanced AI Systems
Joshua Clymer
Nick Gabrieli
Thomas Larsen
As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them… (voir plus). To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and -- if AI systems become much more powerful -- deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.
A Generative Model of Symmetry Transformations
James U. Allingham
Bruno Mlodozeniec
Shreyas Padhy
Javier Antor'an
Richard E. Turner
Eric T. Nalisnick
Jos'e Miguel Hern'andez-Lobato
Correctly capturing the symmetry transformations of data can lead to efficient models with strong generalization capabilities, though method… (voir plus)s incorporating symmetries often require prior knowledge. While recent advancements have been made in learning those symmetries directly from the dataset, most of this work has focused on the discriminative setting. In this paper, we construct a generative model that explicitly aims to capture symmetries in the data, resulting in a model that learns which symmetries are present in an interpretable way. We provide a simple algorithm for efficiently learning our generative model and demonstrate its ability to capture symmetries under affine and color transformations. Combining our symmetry model with existing generative models results in higher marginal test-log-likelihoods and robustness to data sparsification.
Blockwise Self-Supervised Learning at Scale
Shoaib Ahmed Siddiqui
Yann LeCun
Stephane Deny
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain
Robert Kirk
Ekdeep Singh Lubana
Robert P. Dick
Hidenori Tanaka
Tim Rocktäschel
Edward Grefenstette
Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning… (voir plus) systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a `wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such ``wrapped capabilities'' are relevant leads to sample-efficient revival of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a, e.g., superficially unrelated, downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.