Publications

Parseval Regularization for Continual Reinforcement Learning
Wesley Chung
Lynn Cherif
Periodic agent-state based Q-learning for POMDPs
Amit Sinha
Matthieu Geist
Predicting Future Actions of Reinforcement Learning Agents
Stephen Chung
Scott Niekum
QGFN: Controllable Greediness with Action Values
Elaine Lau
Stephen Zhewen Lu
Ling Pan
Emmanuel Bengio
Generative Flow Networks (GFlowNets; GFNs) are a family of energy-based generative methods for combinatorial objects, capable of generating diverse and high-utility samples. However, consistently biasing GFNs towards producing high-utility samples is non-trivial. In this work, we leverage connections between GFNs and reinforcement learning (RL) and propose to combine the GFN policy with an action-value estimate,
RGFN: Synthesizable Molecular Generation Using GFlowNets
Michał Koziarski
Andrei Rekesh
Dmytro Shevchuk
Almer M. van der Sloot
Piotr Gaiński
Cheng-Hao Liu
Mike Tyers
Robert A. Batey
Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences
Damien Ferbach
Quentin Bertrand
Joey Bose
The rapid progress in generative models has resulted in impressive leaps in generation quality, blurring the lines between synthetic and real data. Web-scale datasets are now prone to the inevitable contamination by synthetic data, directly impacting the training of future generated models. Already, some theoretical results on self-consuming generative models (a.k.a., iterative retraining) have emerged in the literature, showcasing that either model collapse or stability could be possible depending on the fraction of generated data used at each retraining step. However, in practice, synthetic data is often subject to human feedback and curated by users before being used and uploaded online. For instance, many interfaces of popular text-to-image generative models, such as Stable Diffusion or Midjourney, produce several variations of an image for a given query which can eventually be curated by the users. In this paper, we theoretically study the impact of data curation on iterated retraining of generative models and show that it can be seen as an implicit preference optimization mechanism. However, unlike standard preference optimization, the generative model does not have access to the reward function or negative samples needed for pairwise comparisons. Moreover, our study doesn't require access to the density function, only to samples. We prove that, if the data is curated according to a reward model, then the expected reward of the iterative retraining procedure is maximized. We further provide theoretical results on the stability of the retraining loop when using a positive fraction of real data at each step. Finally, we conduct illustrative experiments on both synthetic datasets and on CIFAR10 showing that such a procedure amplifies biases of the reward model.
Simplifying Constraint Inference with Inverse Reinforcement Learning
Adriana Hugessen
Harley Wiltzer
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Leo Schwinn
David Dobre
Sophie Xhonneux
Stephan Günnemann
Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. However, this approach neglects the steady progression of open-source models. As open-source models advance in capability, ensuring their safety also becomes increasingly imperative. Yet, attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models. Our findings highlight embedding space attacks as an important threat model in open-source LLMs. Trigger Warning: the appendix contains LLM-generated text with violence and harassment.
Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt
Fabien Roger
Dmitrii Krasheninnikov
The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More
Ouail Kitouni
Niklas Nolte
Adina Williams
Diane Bouchacourt
Mark Ibrahim
The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms
Elizabeth Collins-Woodfin
Inbar Seroussi
Begoña García Malaxechebarría
Andrew Mackenzie
Elliot Paquette
On the Scalability of Certified Adversarial Robustness with Generated Data
Thomas Altstidl
David Dobre
Arthur Kosmala
Bjoern Eskofier
Leo Schwinn
Certified defenses against adversarial attacks offer formal guarantees on the robustness of a model, making them more reliable than empirical methods such as adversarial training, whose effectiveness is often later reduced by unseen attacks. Still, the limited certified robustness that is currently achievable has been a bottleneck for their practical adoption. Gowal et al. and Wang et al. have shown that generating additional training data using state-of-the-art diffusion models can considerably improve the robustness of adversarial training. In this work, we demonstrate that a similar approach can substantially improve deterministic certified defenses but also reveal notable differences in the scaling behavior between certified and empirical methods. In addition, we provide a list of recommendations to scale the robustness of certified training approaches. Our approach achieves state-of-the-art deterministic robustness certificates on CIFAR-10 for the